Fiction Publishing Example Discussion

Comments on FictionPublishingExample.

Here is a very preliminary assessment of mine. I may change it after I dig more. For one, I don't understand the domain very well. What is a "house" there, for example?

Requests for domain clarification have been answered on FictionPublishingExample

You mention both HTML files and delimited files. I have no clue what the relationship betweens these are. A short example of the inputs and resulting outputs (updating just a few tables/entities) would be veeeeery helpful to get the big picture.

Also, I suspect that you can consolidate some tables. For example, have a single "Constants" table to store lists of constants rather than a table for each constant kind. That way, you can write a single Constant editor and have fewer tables floating around. (The "description" would be assumed the same as the abbreviation code if not present.) The workflow tables also seem to have a similar column repetition pattern. (See ConstantTable) -- TopMind

You mean, make the schema something like this (category, code, value) [for Constants]? That would give somewhat more flexibility, but at a cost. The big problem is referential integrity: 'code' is no longer a CandidateKey for the relation, because it may be duplicated across unrelated entities (imagine an era of 'Riddle', a character of 'Riddle', and a genre of 'Riddle'). The full key would have to be (category, code), which means that every time one of these values is referenced, the ForeignKey would need to be (category, code). That's bloating other tables significantly for dubious benefit. -- JonathanTang

I don't think you need to store the category in the using tables anymore than you need to store the table name under the current approach. One can pick the wrong table just as easily as picking the wrong constant category, it appears to me. You are just using the table name as a pseudo category key. Perhaps it is a personal preference. I just usually find a system easier to grok if there are fewer tables and similarities are factored together.

It also seems you are trying to update the database during the read and "parse" step, and this is partly why you had difficulty managing the order of processing certain items. I would perhaps separate these. Perhaps use an intermediate work table so that we don't have to worry about order. The intermediate table can later be sorted/searched and/or post-processed in the order needed by the load-into-RDBMS stage. That way, you don't need funny "buffer" arrays to deal with sequence management. It separates the "gather information" logic from the "load into RDBMS" logic. The steps may then resemble:

  1. Parse input into work table(s)
  2. Index/sort/preprocess work table(s) as needed
  3. Traverse work table(s) to update RDBMS

-- top

I hadn't thought of using temporary tables. I'm worried about performance, however; as I said, this script has to process about 1.5 gig of data. Traversing the work table with a "SELECT * FROM work_data" will result in a result set that is at least 1.5 gigs. This actually isn't too bad, because most of it will be paged out at any given time (though it may blow the swap partition ;)). But unless MySql does some funky stuff with mmap(), that's a lot of disk activity. Data has to read from disk and written to DB in step 1. It needs multiple disk hits in step 2. Then, in step 3, it needs to be read from disk as it's queried, written back to disk as it's paged out, read from disk as it's paged back in, and then finally written back to the DB. I believe this is all done in a single pass currently, as it doesn't take much longer than untarring the files. -- jt

(response moved below)

However, I suppose that PHP does not provide much in the way of NimbleDatabase features. I would be inclined to use FoxPro or to a lesser extent MS-Access to get nimble table creation, viewing, and processing capability. Sure, their language is not so hot, but they make up for it with nimble tabling. (I don't know about their regex capability though.) PHP is perhaps not a very good "batch" language either because if it's lack of "regional" scoping. But that is less of an issue if one uses work tables instead of RAM arrays. -- top

We chose PHP for a bunch of reasons, most of them social rather than technical:

It's easy to find PHP developers, and the learning curve isn't too high. The HarryPotter fandom is literally 99% female. The software engineering profession is probably 90-95% male. Understandably, there isn't too much overlap between them.
Our forum software is PHP. By making all scripts in PHP, we can avoid loading any other modules into the webserver, keeping it's memory footprint small and speeding up site performance. (We could avoid this by running this script locally and uploading the database, which would let me choose whichever programming language I want. But 1.5 gigs over a DSL line takes a full night to download, half an hour to unzip, and a bit over an hour to untar. When you've got 50,000 registered users, there's a big difference between a couple hours of downtime and a day and a half of downtime.)
We will eventually have to interface with said forum software. This is much easier if we can include their code.

I would much prefer to write it in CommonLisp or to a lesser extent Java, to get a programming language that doesn't suck. But I don't want to be tied to this project for the remainder of my life, which is what implementing it in CommonLisp would entail. So the programming language is essentially non-negotiable. -- jt

There also seems to be a lot of repetition of routines. For example, there are a lot of insertX routines where X is each entity. Perhaps a single generic one can be made which uses a ControlTable(s) and/or DataDictionary to store the mappings between file fields and database columns. The job of mapping then becomes mostly data entry into a table grid instead of typing similar-looking conversion arrays and subroutine calls over and over. -- top

This actually is a pretty good idea, and if I had to do it again (or if I ever have to rewrite it) I might do this. It actually doesn't save any typing when database's only interface is SQL. In fact, it increases it, since you have to write the SQL insert statements and they now need to include the function name as a table attribute. But when there's a GUI editor for tables, it could be significantly easier. -- jt

Yip, nimble database tools and table editors have spoiled me and changed the way I view application design. I would have to get a lobotomy to ignore that experience. Some have chided me for trying to turn programming into "data entry", but if that is the road to simplicity and abstraction, then so be it. Hopefully better tools will be available in the future for web languages. Note that ToadSoft? is allegedly porting their table editor to MySql, but it costs mula.

Cursors and Exclusive Batch Jobs

Re: "I hadn't thought of using temporary tables. I'm worried about performance.... Traversing the work table with a "SELECT * FROM work_data" will result in a result set that is at least 1.5 gigs."

FoxPro traditionally uses a cursor-based approach which brings in one record at a time instead of one giant result set. (Actually, it brings in chunks behind the scenes in a read-ahead caching optimization.)

However, direct cursors are generally outside of the relational paradigm. A "Codd-approved" result-set oriented approach generally is done a bit differently. For example, a larger-scale entity with fewer records may be chosen as the "outer loop" over which only chunks at a time of detail are brought into RAM-bound result sets. For example, one may process tax returns zip-code at a time rather than read every tax return into a result set. As I understand the requirements more in this example, I may be able to make some specific suggestions for choosing the "outer entity" in the future.

The cursor-based approach of FoxPro/Xbase is fast and simple for such batch jobs because it was not originally designed for multi-user databases, and thus could toss (or ignore) feature baggage helpful for multi-user but irrelevant under exclusive mode. Multi-user features were later tacked on, but I will agree that true relational and result sets is better when dealing with live *shared* data. Later versions added direct SQL support for talking to BigIron RDBMS. But for some jobs that don't need full sharing, the direct approach is still simpler IMO. It is all a matter of the right tool for the right job. And, for the tables that FoxPro has exclusive access to, it is fricken fast.

Even BigIron databases have cursor-oriented features that are helpful for certain tasks. It is a curious issue whether this is a limitation/hindrance of relational theory (with result sets), or simply an implementation issue. It is possible to emulate cursor-oriented approaches with result sets, but I have yet to find a way to make it simple and fast. True relational has no concept of ordering and indexes are supposed to be a hidden implementation detail, but cursors usually "surf" directly on indexes. Perhaps this belongs in another topic. There are some tasks that just seem simpler and faster under cursoring.