Bag Set Impedance Mismatch

After reading BagNeedScenarios and BagAtational, I started wondering... should the name "BagSetImpedanceMismatch" be used as a label for the difficulties when exchanging data between Bags and Sets? Would the ideal data management environment offer both a relational and a bagatational data definition and manipulation language? (2 very tightly integrated languages, or a single language that can operate with both relational and bagatational algebra?) Maybe such language can be built on top of ExtendedSetTheory?

On the other hand, any Bag can be transformed in to a Set adding a column with a surrogate key or a column with a count of the duplicates. Is that (adding such column) all the extent of the BagSetImpedanceMismatch problem? And if so... is it really always trivial to add it to make it possible to uniquely identify the rows?

Restrictions in access, result "shape", and/or size may limit the addition or location of surrogate key.

[Restrictions in access may prevent addition of a local surrogate key. It does not prevent dynamic generation of a surrogate key on import or link. Not sure what you mean by "result 'shape'".]

And what about a column with a count of the duplicates?

I have no problem with that as long as it can be omitted for "output" or being directed to a bag-table. Of course, this will likely trigger another HolyWar over "relational purity".

[Hardly. I am a "relational purist" and have no problem with that either.]

That is why this page was created... maybe the solution is to accept that we need a BagSetMapper? (in the spirit of ObjectRelationalMappers, but to map Sets to Bags and viceversa).

Let's hope they are not the pain in the [bleep] that ObjectRelationalMappers are.

"Bag" Not Quite the Right Term

"Bag" is a semi-misnomer in practice because many if not most RDBMS have a unique internal row identification system, and often expose the row-ID to the "query user" via a special column name or function. The real issue is the "permanency" of such. In some cases it may only be unique for a given query result. If we re-query the same data at a later time, we may get a different ID for the same row (if same-ness can be determined if we had a God-view of the system). In others it's unique across a table, but new records may re-use ID's from prior deleted rows. Thus, referencing such ID's as foreign keys could present some surprises. Further, certain forms of cross-server load distribution may render the internal ID unreliable in various ways. Perhaps the term "key-unstable" can be used instead of "bag".

Adding surrogate keys to a "bag" query result set (such as to force rows to have a unique key) may also produce unreliable keys such that the same record may have a different ID upon next query. For example, a given stored procedure or database view may not have a unique key and/or may not supply a unique key for various reasons (security, efficiency, politics, app vendor rules, bad design, etc.). Thus, "bags" may not come from just key-less tables, but also from transformation services that don't "send" a primary or unique key. In medium and large companies, one often will not have the option of having such redesigned to emit a primary or unique key. One is a data user, not data designer for many existing systems. There is no known generic way to "auto-guess" a primary key that is stable over time for such situations.

Since most result sets are returned in a linear fashion, we can assign unique row numbers by simply using the "row count" as the primary key. However, these numbers may not be unique across the table. For example, if record #2 is deleted overnight, then #2 will be a different record the next day. Similarly, if we change the query criteria, the numbers will fall differently. We could study the data and perhaps come up with a fairly reliable guessing mechanism to determine a unique key in many cases, but still may never have a 100% guarantee. -top

In such cases, a generated key does not need to be stable over time. For queries on data without a guarantee of uniqueness, a generated key only needs to be stable over the lifetime of a given query. Systems that need to work with possibly-duplicate or "keyless" data are designed to be aware of this.

How are they "aware" of it? Please clarify.

Relational systems that need to work with possibly-duplicate or "keyless" data do so by generating a key that does not need to be preserved beyond the lifetime of any given query that uses the data. Of course, a trivial objection is that if the keys are transient, you can't -- in the long term -- unambiguously identify any specific row in the source data. That's true, but in the long term you couldn't have identified any specific row anyway, because the data has duplicate rows.

That's why I suggest it may be better to...

Rest of discussion moved to BagSetImpedanceMismatchDiscussion

Following the reasoning and arguments Top presents for supporting bags, we should also favor first-class mutable references to records to reduce the impedence mismatch between NavigationalDatabases and RelationalDatabases. This would offer all of us an ObjectIonable model, corresponding to Top's BagAtational models. Thoughts?

I'm leaning to believe that "navigational" is really a style of interfacing rather than an absolute set of principles. Both can be seen as views/re-mappings of the other. Yes, it's my EverythingIsRelative thinking popping out again. For example, if the DB engine automatically includes a "rowID" that's stable for the life of that particular row in the DB table (similar to what I propose in DynamicRelational), then we've made it a "pure" relational system on paper. However, that doesn't remove the domain, political, and WetWare problems associated with inter-system key management. In the end, it's still a management or people problem. --top

Is BagSetImpedanceMismatch a generalization of the TrueRelationalToPseudoRelationalImpedanceMismatch ?

Footnotes

[1] One possible way to create a key if not given one is to take a snapshot of the "problem" dataset, give the snapshot an ID, such as "Version_ID", and combine it with the record number of the snapshot. Thus, we have a compound primary key that is Version_ID + record_ID. However, if you have to cross-reference or use such data across and with different versions, it can get very hairy. You may not even have enough info to say that one version's record is a certain match in another version's set. It becomes the problem domain of "probabilistic computing". I've built "guessing engines" before when given messy data to work with, and it wasn't trivial. You often can't be cavalier about artificial keys. -t

JanuaryTwelve