Dedicated Structures Versus Rdbms

Pros and cons of using an RDBMS (RelationalDatabase) of some sort versus dedicated or custom-made structures such as bags, arrays, stacks, etc.


RDBMS

        * Scales better.

Dedicated

        * Is usually faster to develop without a database.
        * Can be smaller and faster than even the nimblest of databases.
        * More natural description of data. (For instance, a table cannot contain an array of sub-objects.)


Changing the format of a database can be an extreme pain. Hence the above scalability argument is debatable.

To my knowledge there is no standard for embedded procedures, so using them implies vendor lock-in.

NickKeighley


In an application, it is easy to store absolute quantities as data, and calculate changes. This sort of query gives many RDBMSes fits. For example:

The user has data for the grain level in the silo on any given day. It is straightforward to write a program that calculates (on the fly, in response to a query) the changes in the grain level between days. Writing a similar query in SQL is very painful. In Microsoft SQL Server, it is likely to time out if the database is a reasonable size and has significant other data on the same rows. (Specifically, it is likely to run into the 1 MB buffer limit for a SQL Server job.)

A more SQL-oriented way to solve this problem is to store the data in terms of the changes in the grain level between days. The queries then add together various days' changes to determine (on the fly, in response to a query) the grain level in the silo on any given day.
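A sketch of that delta-style query, assuming a hypothetical grainChanges table keyed by dayNumber with a signed delta column (the table and column names are mine, not from the example above):

  select sum(delta) as grainLevel
  from grainChanges
  where dayNumber <= $day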

Storing the absolute data in the database (a la application method) has reliability advantages. An error in the database only affects the total for that day, and calculated deltas versus that day. Also, the value for the day can be double-checked against a physical inventory done that day.

Storing the deltas in the database (a la SQL method), on the other hand, means that an error on one day affects the calculated totals for all future days.

I am not sure what you mean. Let's assume a table like this:

  Table: grain
  ------------
  dayNumber
  grainLevel
To get the difference, do:

  select b.grainLevel - a.grainLevel
  from grain a, grain b
  where a.dayNumber = $day1 and b.dayNumber = $day2
(I used a number instead of a date type to keep the example simple.) I have encountered situations where more complex cross-day comparisons were not very straightforward, but the fault was with SQL, not with relational in general. However, I used a "local" table to adjust it to the view I needed. Local tables are starting to come back into fashion in tools like ColdFusion and DotNet.

Yes, but what if b needs to be the most recent record before a? Not every day has an entry, and some days have more than one entry, so you cannot just subtract 1 from the day number.

Something like:

  select top 2 * from grain order by timeStamp desc
Or something like:

  select top 1 A.grainLevel - B.grainLevel
  from grain A
  cross apply (select top 1 grainLevel, timeStamp
               from grain
               where timeStamp < A.timeStamp
               order by timeStamp desc) B
  order by A.timeStamp desc
(TOP n and CROSS APPLY are SQL Server dialect; other dialects vary. I haven't tested this, so don't put money on it.) I agree that SQL is not the most intuitive relational language for such problems, but it is perfectly doable.
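In dialects that have since added window functions (e.g. lag() in PostgreSQL, Oracle, and SQL Server 2012+), the previous-record comparison is direct. A sketch, untested like the above:

  select timeStamp,
         grainLevel - lag(grainLevel) over (order by timeStamp) as delta
  from grain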

Thank you for the suggestion.


Re: Is usually faster to develop without a database.

I would like more information on this claim. In my NimbleDatabase days, I would whip out applications pretty damned quick, usually faster than with current "in style" tools. One nice thing about most NimbleDatabases is that you can readily enter, query, and dump your structures to test and debug. Dedicated structures generally don't come out-of-the-box with the DatabaseVerbs that allow one to analyze, populate, test, and study the structures. How do you quickly and visually populate a stack for testing purposes without writing population code? I don't see many kits that come with a stack browser/editor, queue browser/editor, bag browser/editor, vector browser/editor, etc. The few that do require you to start the application first.

Plus, you can reuse many custom-made table-based frameworks because tables are tables. A stack-specific tool is not going to work on a queue without a lot of rework, for example. I can reuse most of the DataDictionary code for a different variation of a data dictionary. However, if your version of such a thing is a stack of dictionaries, then converting it to a queue of bags is going to be much more of a rewrite. Tables are just a better catch-all structure in my opinion. To misquote IBM, they are "bendable, flexible, and reimplementable".


Re: Can be smaller and faster than even the nimblest of databases.

This is situational. If you have an implementation for, say, 15 different dedicated structures, then you may have more code than a single implementation of a lite table engine. One is getting more per-application re-use from the table engine. As far as speed? Well, I guess I will generally agree with you on that for a single-purpose structure. However, if you start having to do things like sort or query it in more than one way (different "access paths"), then it starts to lose its speed advantage. Further, you may have to do a lot of code rewriting to move to a structure or structures that handle more access paths if you later need them.

There are 2 situations: one where you don't use all of the features of an RDBMS, and one where you do. If you don't use all of the features of an RDBMS, then a dedicated solution can be faster and smaller. If you use all of the features of an RDBMS, just use an RDBMS.

"All" is perhaps a little extreme. Plus, it may be possible to split a RDBMS engine up so that you only include what you need in a given release/compile. (It is hypothetical because I have never seen it in action, except for possibly swappable ODBC drivers that allow one to do things like query text files in SQL. Clipper also had replaceable drivers, but I never set out to implement a "lite" version.) The key advantage of such approach is that one generally does not have to change the interface. You have to change structure interfaces, such as convert from say a list to a dictionary, to get better/different features, while a RDB interface generally doesn't because it starts with tables and ends up with tables no matter what. Smalltalk APIs come somewhat close to this concept by reusing the same verbs on different structures, but fell short. If they completed the consolidation in a logical way, it would actually be a form of database (although perhaps not relational).

You have a different perspective than I do. It's often cheaper to throw a set of references into a list or a dictionary than to use an RDBMS. You don't need to change structure interfaces. You just use the tool that's appropriate to the task.

What if you already have the RDBMS available? I agree that "big-iron" DBs are not very nimble in this regard, but fortunately they are not the only flavor of RDBMS.

Even the nimblest of RDBMSs requires more overhead than creating a list or a dictionary object.

Well, I use dictionary arrays for some things also. However, if I see a reasonable chance that it will need more than 2 columns, I will consider going to tables, depending on the language and shop practices. To me that is all a dictionary array is: a 2-columned, one-keyed table. That is a somewhat arbitrary limit. It is almost like looking at a human hand and wondering why we don't have 6 fingers or 4 fingers. (Interesting exercise: make a language/API that allows one to extend a dictionary into a table without a syntactic overhaul. I have played with the idea, but have yet to find something comfortable.) Why do I use dictionaries instead of tables for some things? Because they are syntactically simpler in most languages. They keep the code smaller and visually lean if the needs on it don't grow. NimbleDatabase tables even on 386's were often pretty darn fast, though. Tables are not inherently slow. But when choosing, I will consider human effort long before I consider CPU effort. I say "work the damned machines to the bone". They are here to serve us. Make them slave away. Even in India, machine power is still cheaper than human power. However, if by chance resources are thin for whatever reason, I might focus more on speed and machine resources and sacrifice some "future-proofing" of the design. For example, the boss might say, "Please go easy on the database. We don't have the funds to upgrade or migrate our Sun box."
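To illustrate the point, a dictionary rendered as a table is roughly this (a sketch; the names and types are my own assumptions):

  create table dict (
    theKey   varchar(50) primary key,  -- the dictionary key
    theValue varchar(255)              -- the dictionary value
  );

  select theValue from dict where theKey = 'foo';  -- dict['foo']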


Code Overhaul Example

For example, suppose you are using an associative array to store name-value pairs while parsing individual XML tags.

No one with half a brain would ever do this. The XML already is a data structure; you can work directly with it as-is. Putting it into an array is ignorant. You're just trying to set up a strawman so you can knock it down.

<exampletag foo="bar" blah=7 ...>

assoc['tagname'] = 'exampletag' assoc['foo'] = 'bar' assoc['blah'] = 7 etc...
A requirement then comes along that requires you to also store the sequence order in which each pair appeared in an XML tag so that it can be recreated more faithfully. If using an AttributeTable-like schema, the only change to our structure is a new column. Any references to existing columns in the code still work as-is. Instead, now you have to toss your usage of an associative array and replace it with something else. Most code references to the dictionary are going to need to be changed. I suppose you could now nest associative arrays, but any prior references to the value portion of the old array still have to be overhauled. That would likely be roughly 50 percent of all existing array references.
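In SQL terms, the table-side change might be a single statement (a sketch, assuming an attribute table like the Tags schema sketched near the end of this page):

  alter table Tags add sequence integer;

  select attribName, attribValue from Tags order by sequence;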

In my experience most developers don't even bother to overhaul such things, but rather hack something up, such as a parallel array that stores the sequences. (True, bad programmers probably screw up tables also, but not quite as much in my experience.)

This change-tolerance comes about because the DatabaseIsRepresenterOfFacts. Its primary purpose is to describe nouns, and relational is such that it decouples the representation and access from how a given application may happen to use it. Thus, if you change the way you use the information, or need to add more information to describe something, the interface to existing info does not change. The only thing that tends to trigger big overhauls is changes in the quantities of relationships (one-to-one, one-to-many, etc.). But these kinds of changes mess up competitors as well.

{Simple solution, requiring no changes to existing code. Store the sequence as another element in the associative array, e.g.}

  assoc['tagorder'] = { foo, blah ... }

Well, that is kind of similar to the "parallel array" solution described above. I find it a bit hacky. For one, it is a violation of OnceAndOnlyOnce because you are repeating the names in a sub-structure. (Repeating of keys does sometimes happen, but it is not necessary in this case.) And, it is inconsistent because everyone will implement it differently, most likely using different combinations or nestings of maps, lists, and sets.

A normalized relational solution will likely be more consistent from developer to developer (schema examples are given later below). I suppose the biggest open issue for schemas would be how to encode the tag name. But similar (analogous) decisions also arise with maps and lists, in addition to the ones I already mentioned.

And, it is trivial to list the table in the new order ("select * from tag order by sequence"). Finally, that solution does not scale well to N columns. For example, what if we wanted an "errorText" field to insert parsing error messages into, per attribute? Yes, you can have another parallel list, but at this point I don't think most would agree that it is pretty. In the past, when I did not have a database readily available or was not permitted one, I ran into such DiscontinuitySpikes all the time.
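In schema terms, the "errorText" scenario is again a one-line addition (a sketch; the column size is an assumption):

  alter table Tags add errorText varchar(255);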

And a lot lighter weight than RDBMS systems.

If hardware and code-size are more important than developer maintenance cost, then I agree a DBMS is probably overkill.

There's another problem with the question as stated. No sane developer would just create a data structure and expose the guts to the world in any system of moderate complexity. There'd be some code that hides that data structure, meaning that if the guts change, you only change the handling in one place, not in a bunch of references scattered haphazardly throughout the system. That's the real meaning of OnceAndOnlyOnce. Careful consideration might lead to the conclusion that the entire example using this badly-designed associative array implementation is a StrawMan set up to be easily knocked down by a simple table schema. And of course, there's no behavior.

A database is an interface. Why have yet another interface over an interface?

Because a database is a god-awful slow out-of-process interface that forces one to do double or triple the work for no advantage other than generic querying, which is rarely needed. Databases are for storing long-term data, not for working with short-term in-memory data as it's being processed. You may disagree, but you're wrong. Working with a hashtable or array requires virtually no code; working with a database requires lots of code. You have to build the SQL, open a connection, execute the query, get the results back, and then process data you already had in memory to begin with, and that's just ignorant.

Besides, this gets back into that "information hiding" discussion that was deleted. You are welcome to resurrect it. There is nothing evil about separation of data and behavior via databases. Further, note that most of my criticisms here are demonstrated with specific change scenarios. If you can show some specific scenarios where "naked tables" cause problems, that would be helpful. In other words, what is the "non-strawman" version?

[I'd keep an ordered list of references to the elements in addition to the map. That's no different than adding a column to the table.]

Well, to move things along I won't dispute the amount of code change here. However, it is more complexity. You now have 2 "kinds" of structures (a map and a list), whereas I only extended an existing one. It is almost analogous to adding another column to an existing array by extending the subscript range versus adding a second array. (I said "almost" because tables and arrays are not the same thing.) It would generally take longer to understand a 2-structure system than a one-structure system.

Again, relational offers more consistency and predictability. I propose that there would be less variation in design if tables were the primary structure. One only has to see and understand the schema; one does not have to learn how to navigate it, because tables are tables. However, I cannot offer any metrics or surveys to demonstrate this at this point. Consistency is an expensive metric to measure, unfortunately. It would require that a bunch of "blind" developers be given the same requirements, one set told to use tables and the other set dedicated structures. We can't perform this test with our existing resources. But I do think the code would be noticeably simpler if yet more scenarios needed more columns to be added (see the "error text" scenario above).

Because of the difficulty of testing such a claim, we just may have to AgreeToDisagree. I can grok schemas and related code much quicker than nested combinations of maps, bags, lists, stacks, etc. Plus, my experience is that one developer will use a bag of lists and another a stack of maps, etc. The rules of relational normalization reduce the "creativity" possible with tables. Some find that boring, but that is another issue. Perl's mantra of "there's more than one way to do it" sometimes does not sit well with me after bad experiences debugging others' code. We are back to WhenAreStandardsRestrictive debates.

[Would you update the database after each tag was parsed? That seems inefficient, especially for large (10,000+ element) XML documents. Even if I were going to keep them in a database, I'd probably parse the document, cache the results, and then commit the results.]

I don't think that was part of our example requirements. Suppose we are just receiving simple one-liner XML commands over HTTP and then acting on them right after parsing.

[All the more reason not to bother putting the data in a RDBMS, I'd say. If it's just a few elements I'd build the context and respond to it inside the SAX parser.]

Why do you use the word "bother"? Like I already said, RDBMSes don't have to be big bulky things. If your complaint is that most existing RDBMS tools focus on "big iron" RDBMSes rather than small-footprint nimble tables, I would readily agree with you. Also, why couldn't a SAX-like parser "dump" the results to tables instead of linked lists?

[I use the word "bother" because inserting the elements into a database requires more work than keeping them in the program's local memory. The SAX parser will extract the strings. I might as well use those instead of inserting them into a database and then selecting them later, unless there's some specific requirement that depends on that behavior.]

What does the SAX parser do with the parsed results?

[It passes them to my methods as it parses the XML. In those methods I can do whatever I want with them. Put them in a list, a map, a custom business object, a database, etc.]

I would rather the results be in relational tables than some foreign thing that presents a new way to access structures that I have to learn.

[The "foreign thing" is a class. The "new way" to access it is through its public methods. Almost every programmer I've ever worked with was able to do that.]

{Doable and better are two different things. Plus, I would rather load it into a structure first, and then process it. That way if there are problems, I can separate parsing from processing problems more easily.}

[So you're advocating that we insert 10,000 records into a database, then find out the 5th element was bad and delete all 10,000 records. I'm advocating that we validate as we parse, bail out on the 5th element and let the garbage collector remove the results of the first 4 elements.]

{How is that worse than processing 3/4 of it and then discarding it when an error is found? Is this going to turn into an argument about whether wasting temporary RAM and/or CPU is more sinful than wasting temporary disk? Besides, if we are parsing it as we are loading records, then we can optionally stop when an error is found and dump the structure.}

[I must have misunderstood you. It sounded like you wanted to "load it into a structure first" (presumably a relational database) and then "process it". If that isn't what you mean, please clarify.]

Relational tables introduce consistency to "structures". I don't want to learn 200 ways to query 200 different structure "types" in 200 different API's.

[Instead you have to remember 200 ways to query 200 different schemas through one API. I'd rather have 200 APIs that remember the nitty gritty for me.]

{What 200 schemas? There is nothing near that in this example.}

[What 200 structure "types"? What 200 ways to query them? If there are 200 different classes there will probably be 200 different tables.]

{If there are 200 different potential ways to implement something, then you will generally encounter more variation among similar designs. That means more surprises. Similar things will more likely be implemented with similar table designs. One will encounter less variation. It is harder to learn a trade if each shop has vastly different tools and techniques.}

[I thought you said table design was interface, not implementation? Either way, if you can have 200 different structures you can have 200 different tables. You're correct, there's nothing in this example that would lead to 200 schemas. Neither is there anything that would lead to 200 structures.]

{Since tables are a bit more powerful than most dedicated structures, the ratio of tables should be lower. For example, our example required both a hash array and an ordered list to get the sorting features that one table had. As far as whether tables are an "interface" or an "implementation", one could consider an OO class an "implementation" even if we only use its interface. It is an interface implemented in the OO paradigm.}
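{To illustrate, one table covers both access paths that needed two dedicated structures (a sketch against the Tags schema given below; untested):}

  select attribValue from Tags where attribName = 'foo';       -- keyed access, like the map
  select attribName, attribValue from Tags order by sequence;  -- ordered access, like the list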

That is anti-reuse. As far as speed, if you do it just because they are allegedly faster, then so be it. We'll just have to AgreeToDisagree. I tend to focus on human factors before I focus on machine speed. Flexibility in code change and growth is usually more important than machine speed in my experience. See also ItsNotAboutSpeed.

It is about speed if your solution is 100 to 1000 times slower, which it is.

{Again, how about a speed shoot-out between FoxPro and Smalltalk? You keep saying that tables are intolerably slow, so how about a demonstration?}

It's also more work. Relational tables offer nothing of value if I already have the data in memory. You don't process short-term data in a database; you do it in memory. You can put it in the database when you're done. You tend to focus on CodeAvoidance because you're not a good programmer. Don't assume everyone has that handicap.

{Give me my favorite tools and I kick ass against code-heavy approaches because I use the database instead of reinventing it.}

{I believe that most algorithms can be reduced to mostly DatabaseVerbs (except for maybe complex conditionals).}

I believe that statement sums up all of your misconceptions quite nicely.

{Thus, I data-tize most of it and then the code is less. It is not a "handicap", but a technique.}

[Let's look at the 10,000+ element example. Would you execute 10,000+ SQL INSERT statements?]

Do you mean 10,000 XML tags or 10,000 attribute pairs? SQL tends to have a lot of overhead per insert statement in the implementations I am familiar with, so perhaps I would try some kind of bulk load from a comma-delimited file if using SQL. However, in languages with built-in tables, such as FoxPro, the equivalent is quite fast. I used to parse lots of large stuff via tables in FoxPro and Clipper. I am not sure it would win speed contests, but it was usually fine for the task at hand. In FoxPro, a record insert was much faster than a user-defined function call, as I remember it.
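For example, a bulk load in SQL Server might look like this (a sketch; the file path and format are my assumptions, and other products spell it differently, e.g. LOAD DATA INFILE in MySQL):

  -- load attribute pairs from a comma-delimited staging file
  bulk insert Tags
  from 'C:\temp\tags.csv'
  with (fieldterminator = ',', rowterminator = '\n');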

As usual, you're missing the whole point, top. To insert into the database requires code that builds the insert. If I have that, I can just use it directly; no need for a database. Bulk load a file? That's silly; if I have the file, I can just stream it and use it. Databases aren't useful unless I need lots of filtering and sorting, which is not the general case when working with XML. Databases suck at dealing with XML because databases suck at working with hierarchical data.

Perhaps, but XmlSucks to begin with. I am not fond of deep or complex hierarchies. They are artificial and an AntiPattern because the real world is not tree-shaped for the most part. My example assumed rather "flat" XML, I would note.

I'm sure you do, you always seem to assume ignorant things.

I am not sure what you mean. It was an example. XML can be deeply nested or it can be almost "flat". It depends on the XML structure designer. In other words, approach A is possible, and so is approach B. I based my example on approach B. Is that a sin of some sort?


Moved text to AreTablesGeneralPurposeStructures.


On the other hand, would a thoughtful developer even choose to parse XML into an associative array?

Note that it said "individual tags", not entire XML documents. But I bet the table approach scales to entire documents, and to multiple documents, with fewer changes. All I have to do is add a new column for each change. It is only additions to an existing schema:

  Tag Only:

  Table: Tags
  -----------
  attribName
  attribValue

  Tag with sequence:

  Table: Tags
  -----------
  attribName
  attribValue
  sequence

  Whole document:

  Table: Tags
  -----------
  tagSequence
  attribName
  attribValue
  sequence

  Multiple documents:

  Table: Tags
  -----------
  docRef
  tagSequence
  attribName
  attribValue
  sequence

  Table: Documents   // optional
  ----------------
  docID
  docTitle
  docPath
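A minimal DDL rendering of that last schema (a sketch; the column types and sizes are assumptions):

  create table Documents (
    docID    integer primary key,
    docTitle varchar(100),
    docPath  varchar(255)
  );

  create table Tags (
    docRef      integer references Documents(docID),
    tagSequence integer,   -- position of the tag in the document
    attribName  varchar(50),
    attribValue varchar(255),
    sequence    integer    -- position of the attribute within the tag
  );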


Low-Level Structures Versus "High-level"

You guys do realise that tables are implemented using dedicated data structures, right? Which means that this discussion is equivalent to discussing whether bricks or buildings made of bricks are better.

I don't see how that matters any. Of course "lower-level" structures are going to be used. That is a given. That would be like saying that under all OOP is procedural machine code, implying that procedural is better because of that.

Procedural will be better if a single procedure is all you need to write (e.g. a short shell script); similarly, if a low-level structure, e.g. a stack, is all you need, it would be stupid to emulate it in a database unless you want to persist it.

Re: "if a low-level structure, e.g. a stack, is all you need...."

All I need for how long? Plus, using tables is not necessarily harder even if the structure never changes. You seem to think it has a big up-front cost. Well, it does not. Speed is the only possible reason to object, and it usually is not an issue. An indexed table of, say, 1000 items is not noticeably slower than a hash map of 1000 items. And I don't have to worry about filling up RAM if the user somehow does something accidental that loads 1,000,000 items instead of 1000.
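The indexed lookup that stands in for the hash map access is roughly this (a sketch; the table and names are my own):

  create table items (name varchar(50), quantity integer);
  create index idx_items_name on items (name);

  select * from items where name = 'widget';  -- roughly items['widget']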

If I need to be loading large amounts of data then I'll use a database. But once I've loaded that data I need to do stuff with it. I have to express complex algorithms and use sophisticated data-structures. That's why people hire programmers and don't just do everything in the database. That's what programmers are there for.

A database/table-engine is a "sophisticated structure". I have not seen any evidence that dedicated structures handle "complex algorithms" better overall, at least not in my domain.

And please stop suggesting that I want "everything" done in the database. That is a false statement.

[It seems like a true statement given your contributions here. Every time someone mentions using another data structure you say you'd rather use a table. Perhaps you should tell us which dedicated data structures you approve of.]

How about giving us some examples of WhenNotToUseTableOrientedProgramming. If it's just another approach and not the holy grail you should be able to add quite a few items.

[So you are avoiding the question then?]

I don't think that was my reply above. Anyhow, I would say that an unordered list and an unordered dictionary are about as high as I would go, as a rule of thumb. If you need anything more complicated, or think it likely that you will, then go with tables. -- top


See also: SqlFlaws

