Comments on FictionPublishingExample.
Here is a very preliminary assessment; I may change it after I dig in more. For one, I don't understand the domain very well. What is a "house" there, for example?
Requests for domain clarification have been answered on FictionPublishingExample.
You mention both HTML files and delimited files, but I have no clue what the relationship between these is. A short example of the inputs and resulting outputs (updating just a few tables/entities) would be very helpful for getting the big picture.
Also, I suspect that you can consolidate some tables. For example, have a single "Constants" table to store lists of constants rather than a table for each constant kind. That way, you can write a single Constant editor and have fewer tables floating around. (The "description" would be assumed the same as the abbreviation code if not present.) The workflow tables also seem to have a similar column repetition pattern. (See ConstantTable) -- TopMind
You mean, make the schema something like this (category, code, value) [for Constants]? That would give somewhat more flexibility, but at a cost. The big problem is referential integrity: 'code' is no longer a CandidateKey for the relation, because it may be duplicated across unrelated entities (imagine an era of 'Riddle', a character of 'Riddle', and a genre of 'Riddle'). The full key would have to be (category, code), which means that every time one of these values is referenced, the ForeignKey would need to be (category, code). That's bloating other tables significantly for dubious benefit. -- JonathanTang
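To make the trade-off concrete, here is a minimal sketch of the consolidated design and the composite ForeignKey it forces onto referencing tables. All table and column names are hypothetical (they are not the actual FictionPublishingExample schema), and MySql only enforces the key when the tables are InnoDB.

 <?php
 // Sketch only; hypothetical names and connection details.
 mysql_connect('localhost', 'user', 'pass') or die(mysql_error());
 mysql_select_db('fiction');

 // One consolidated lookup table. 'code' alone is no longer unique
 // (an era 'Riddle' and a genre 'Riddle' may coexist), so the key
 // must be the pair (category, code).
 mysql_query("CREATE TABLE constants (
                  category    VARCHAR(30)  NOT NULL,
                  code        VARCHAR(30)  NOT NULL,
                  description VARCHAR(255),
                  PRIMARY KEY (category, code)
              ) TYPE=InnoDB") or die(mysql_error());

 // Every referencing table must now carry BOTH columns for the
 // foreign key to hold, which is the bloat mentioned above.
 mysql_query("CREATE TABLE book (
                  book_id        INT NOT NULL PRIMARY KEY,
                  title          VARCHAR(255),
                  genre_category VARCHAR(30) NOT NULL,
                  genre_code     VARCHAR(30) NOT NULL,
                  INDEX (genre_category, genre_code),
                  FOREIGN KEY (genre_category, genre_code)
                      REFERENCES constants (category, code)
              ) TYPE=InnoDB") or die(mysql_error());
 ?>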
1. Parse input into work table(s)

2. Index/sort/preprocess work table(s) as needed

3. Traverse work table(s) to update RDBMS

-- top
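(To illustrate, here is a rough sketch of those three steps in PHP against MySql. The work-table columns, input file name, and target table are hypothetical stand-ins, not the actual script's fields.)

 <?php
 // Sketch only; hypothetical names and connection details.
 mysql_connect('localhost', 'user', 'pass') or die(mysql_error());
 mysql_select_db('fiction');

 // Step 1: parse the delimited input into a work table.
 mysql_query("CREATE TEMPORARY TABLE work_data (
                  title VARCHAR(255), house_code VARCHAR(30), year INT)")
     or die(mysql_error());
 $fh = fopen('input.tab', 'r');
 while (($fields = fgetcsv($fh, 4096, "\t")) !== false) {
     $vals = array_map('mysql_real_escape_string', $fields);
     mysql_query("INSERT INTO work_data VALUES ('" . implode("','", $vals) . "')");
 }
 fclose($fh);

 // Step 2: index/sort/preprocess the work table as needed.
 mysql_query("ALTER TABLE work_data ADD INDEX (house_code)");

 // Step 3: traverse the work table to update the real tables.
 mysql_query("INSERT INTO book (title, house_code, year)
              SELECT title, house_code, year FROM work_data")
     or die(mysql_error());
 ?>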
I hadn't thought of using temporary tables. I'm worried about performance, however; as I said, this script has to process about 1.5 gig of data. Traversing the work table with a "SELECT * FROM work_data" will result in a result set that is at least 1.5 gigs. This actually isn't too bad, because most of it will be paged out at any given time (though it may blow the swap partition ;)). But unless MySql does some funky stuff with mmap(), that's a lot of disk activity. Data has to be read from disk and written to the DB in step 1. It needs multiple disk hits in step 2. Then, in step 3, it needs to be read from disk as it's queried, written back to disk as it's paged out, read from disk as it's paged back in, and then finally written back to the DB. I believe this is all done in a single pass currently, as it doesn't take much longer than untarring the files. -- jt
(response moved below)
However, I suppose that PHP does not provide much in the way of NimbleDatabase features. I would be inclined to use FoxPro or, to a lesser extent, MS-Access to get nimble table creation, viewing, and processing capability. Sure, their language is not so hot, but they make up for it with nimble tabling. (I don't know about their regex capability, though.) PHP is perhaps not a very good "batch" language either because of its lack of "regional" scoping. But that is less of an issue if one uses work tables instead of RAM arrays. -- top
We chose PHP for a bunch of reasons, most of them social rather than technical:
There also seems to be a lot of repetition of routines. For example, there are a lot of insertX routines where X is each entity. Perhaps a single generic one could be made which uses ControlTable(s) and/or a DataDictionary to store the mappings between file fields and database columns. The job of mapping then becomes mostly data entry into a table grid instead of typing similar-looking conversion arrays and subroutine calls over and over. -- top
This actually is a pretty good idea, and if I had to do it again (or if I ever have to rewrite it) I might do this. It doesn't actually save any typing when the database's only interface is SQL. In fact, it increases it, since you have to write the SQL insert statements and they now need to include the function name as a table attribute. But when there's a GUI editor for tables, it could be significantly easier. -- jt
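(As a rough illustration of the data-driven mapping idea: the maps below are hypothetical PHP arrays, though they could just as well live in a ControlTable and be edited through a table grid.)

 <?php
 // Sketch only; hypothetical names and connection details.
 mysql_connect('localhost', 'user', 'pass') or die(mysql_error());
 mysql_select_db('fiction');

 // One map per entity: file field => database column.
 $maps = array(
     'book'   => array('Title' => 'title', 'PubYear' => 'year'),
     'author' => array('Name'  => 'name',  'Born'    => 'birth_year'),
 );

 // A single generic routine replaces the many insertX() routines.
 function generic_insert($table, $map, $record) {
     $cols = array();
     $vals = array();
     foreach ($map as $field => $column) {
         $cols[] = $column;
         $vals[] = "'" . mysql_real_escape_string($record[$field]) . "'";
     }
     mysql_query("INSERT INTO $table (" . implode(',', $cols) . ")
                  VALUES (" . implode(',', $vals) . ")")
         or die(mysql_error());
 }

 // Usage: one call per parsed record, driven entirely by the map.
 generic_insert('book', $maps['book'],
                array('Title' => 'The Riddle', 'PubYear' => '1999'));
 ?>

Per-field conversion logic could be handled by naming a conversion function in the map, which is the "function name as a table attribute" cost mentioned above.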
Yip, nimble database tools and table editors have spoiled me and changed the way I view application design. I would have to get a lobotomy to ignore that experience. Some have chided me for trying to turn programming into "data entry", but if that is the road to simplicity and abstraction, then so be it. Hopefully better tools will be available in the future for web languages. Note that ToadSoft? is allegedly porting their table editor to MySql, but it costs mula.
Cursors and Exclusive Batch Jobs
Re: "I hadn't thought of using temporary tables. I'm worried about performance.... Traversing the work table with a "SELECT * FROM work_data" will result in a result set that is at least 1.5 gigs."
FoxPro traditionally uses a cursor-based approach which brings in one record at a time instead of one giant result set. (Actually, it brings in chunks behind the scenes in a read-ahead caching optimization.)
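In the PHP/MySql setting, the closest thing I know of is an unbuffered query, which hands rows to the script one at a time instead of materializing the whole result set. A rough sketch (the work-table and connection names are made up):

 <?php
 mysql_connect('localhost', 'user', 'pass') or die(mysql_error());
 mysql_select_db('fiction');

 // mysql_unbuffered_query() streams rows as the server produces them,
 // so the 1.5-gig work table never has to sit in the client's memory.
 $rs = mysql_unbuffered_query("SELECT * FROM work_data");
 while ($row = mysql_fetch_assoc($rs)) {
     // process one record at a time, cursor-style
 }
 ?>

One caveat: until all the rows of an unbuffered result are fetched, no other query can be issued on that connection, so any inserts/updates done inside the loop would need a second connection.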
However, direct cursors are generally outside of the relational paradigm. A "Codd-approved", result-set-oriented approach is usually structured a bit differently: a larger-scale entity with fewer records is chosen as the "outer loop", and only a chunk of detail records at a time is brought into a RAM-bound result set. For example, one might process tax returns one zip code at a time rather than read every tax return into a single result set. As I understand the requirements of this example better, I may be able to make specific suggestions for choosing the "outer entity" in the future.
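A sketch of that chunking pattern against the hypothetical work table from above (the real "outer entity" would depend on the actual data):

 <?php
 mysql_connect('localhost', 'user', 'pass') or die(mysql_error());
 mysql_select_db('fiction');

 // Outer loop over the coarse entity (relatively few rows)...
 $houses = mysql_query("SELECT DISTINCT house_code FROM work_data")
     or die(mysql_error());
 while ($house = mysql_fetch_assoc($houses)) {
     $code = mysql_real_escape_string($house['house_code']);

     // ...so only one house's detail rows are in RAM at a time.
     $details = mysql_query(
         "SELECT * FROM work_data WHERE house_code = '$code'");
     while ($row = mysql_fetch_assoc($details)) {
         // update the real tables from $row
     }
     mysql_free_result($details);
 }
 ?>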
The cursor-based approach of FoxPro/Xbase is fast and simple for such batch jobs because it was not originally designed for multi-user databases, and thus could toss (or ignore) feature baggage that is helpful for multi-user work but irrelevant under exclusive mode. Multi-user features were later tacked on, but I will agree that true relational operations and result sets are better when dealing with live *shared* data. Later versions added direct SQL support for talking to BigIron RDBMS. But for some jobs that don't need full sharing, the direct approach is still simpler IMO. It is all a matter of the right tool for the right job. And, for the tables that FoxPro has exclusive access to, it is fricken fast.
Even BigIron databases have cursor-oriented features that are helpful for certain tasks. It is a curious issue whether this is a limitation/hindrance of relational theory (with result sets), or simply an implementation issue. It is possible to emulate cursor-oriented approaches with result sets, but I have yet to find a way to make it simple and fast. True relational has no concept of ordering and indexes are supposed to be a hidden implementation detail, but cursors usually "surf" directly on indexes. Perhaps this belongs in another topic. There are some tasks that just seem simpler and faster under cursoring.
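One way to approximate a cursor with nothing but result sets is to walk the table in key order, one chunk at a time. A sketch (it assumes the work table carries an integer key, here called row_id, and whether this stays "simple and fast" is exactly the open question):

 <?php
 mysql_connect('localhost', 'user', 'pass') or die(mysql_error());
 mysql_select_db('fiction');

 // Walk work_data in key order, 1000 rows per result set.
 $last_id = 0;
 do {
     $rs = mysql_query("SELECT * FROM work_data
                        WHERE row_id > $last_id
                        ORDER BY row_id
                        LIMIT 1000") or die(mysql_error());
     $count = 0;
     while ($row = mysql_fetch_assoc($rs)) {
         // process one record
         $last_id = $row['row_id'];
         $count++;
     }
     mysql_free_result($rs);
 } while ($count == 1000);
 ?>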
See also: SeparateIoFromCalculation