These are five scenarios where a "bag" may be needed or helpful instead of unique rows. (Related to BagAtational.)
Scenario Summary:
- 1 - Limited Access to Large Log Table
- 2 - Ad-Hoc Column Trimming for Fit
- 3 - Contracted Delivery Columns
- 4 - Compact Sales Summary
- 5 - Fouled-up Key (new Jan. 2012)
Counter-argument summary:
Assume we are using a true relational DBMS (see the RelProject, for example) which does not permit duplicate rows in tables (RelVars)...
- 1 - Limited Access to Large Log Table - The "large log table" -- which presumably has duplicates -- must be external, since duplicate rows can't exist in a true relational database. Therefore, the mechanism that provides linkage to (or import of) external data can trivially generate an autonumber key, or coalesce duplicates and generate a "count" attribute. (A sketch of the coalescing approach appears after this list.)
- 2 - Ad-Hoc Column Trimming for Fit - Trivially handled by the output presentation mechanisms of the true relational DBMS or its client applications.
- 3 - Contracted Delivery Columns - Trivially handled by the output presentation mechanisms of the true relational DBMS or its client applications.
- 4 - Compact Sales Summary - Compaction does not necessarily imply forgoing duplicate detection/elimination, and the scenario borders on "you need a bag because I say you need a bag."
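For instance, a minimal SQL sketch of the coalescing approach from point 1. The table and column names (external_log, device_id, event_time, event_code) are hypothetical, not taken from any scenario below:

 -- Collapse duplicate log rows into one row per distinct event,
 -- adding a count of how many times each event occurred.
 SELECT device_id, event_time, event_code, COUNT(*) AS occurrences
 FROM external_log
 GROUP BY device_id, event_time, event_code;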
A summarized outline is in the works under BagNeedScenariosReWork.
It's the same stuff again. Do we really need to prolong this, let alone repeat it?
I'm trying to summarize it. In other words, a compact version of the "same stuff".
Like the stuff that's already above, yeah?
I would put it there, but this page is TooBigToEdit. As mentioned on BagNeedScenariosReWork, when the summary settles, a topic re-factoring may take place.
Scenario 1: Limited Access to Large Log Table
You are given read-only access to an event log table created by a customer/partner without an auto-sequence key or any other (accessible) unique key. The events are just "dumped" into the table with no guarantee of uniqueness. The logging device IDs were supposed to be unique, but due to sloppiness or lack of maintenance (such as a repairman doing a reset and skipping device-ID re-entry, leaving the factory default), they are not in practice. And the time-stamps emitted by the logging devices are rounded to a tenth of a second. The table is too large to practically copy in its entirety to a local work-space. The working assumption here is that you, a data user, don't have the political power and/or rank to fix the loggers "right" or change the table structure.
- [This event-log scenario is about access to remote data resource for processing. The following scenario would be equally valid: "I have here an event-log CSV file that is too big (or changing too rapidly) for importing to SQL or Rel". Problem is, you're blaming the RelationalModel for a scenario where your own favorite tool would do no better, and yet you (intentionally or not) hid this issue by asserting that the event-log happens to be to an SQL table already. For the CSV problem, it would be more obvious that any solution you find for SQL (such as ODBC driver atop CSV) would also work for Rel (such as ODBC driver, with line-numbers, atop CSV).]
- This is not part of the scenario as stated. From our perspective we may not even know the source. I've worked with "tables" that turned out to be in an old IBM IMS database, not an RDBMS. They were just made to look like RDBMS tables. Again, the real world often requires interfacing with imperfect data sources.
- There are four possibilities here:
- 1. You are working in a BagAtational system and are using a native table for the log.
- Works as-is.
- However... A BagAtational system cannot distinguish legitimate duplicates from erroneous double-entries, and may inadvertently generate duplicates from JOINs.
- That may be true, but "fixing" that is out of our control. We are given a bag and must work with the bag because it's our job. The real world often gives us "dirty" data as part of the job.
- 2. You are working in a relational system and are using a native RelVar for the log.
- Such an event log could not have been created in the first place. Every native RelVar must specify an explicit key. The log would have a key, and this scenario would not exist.
- However... It's good.
- The description excludes this as a possibility, if I'm not mistaken. I've tightened the wording a bit.
- 3. You are working in a BagAtational system and are linking to the log table hosted by an external system.
- Works as-is.
- However... A BagAtational system cannot distinguish legitimate duplicates from erroneous double-entries, and may inadvertently generate duplicates from JOINs.
- The domain as it is may not provide usable unique keys even if the RDBMS could accept them. Fixing the loggers, which may be in hard-to-reach areas, may be deemed too expensive to bother with. You'd have to make a case to management to visit or rebuild all the loggers, and they may turn you down, choosing to live with some potential error instead. In their heads, they are weighing the costs/benefits of fixing the loggers versus leaving them as-is.
- 4. You are working in a relational system and are linking to the log table hosted by an external system.
- The linkage itself will provide an auto-sequence key.
- However... It's good.
- Please clarify "provide". What is this mechanism, and how does it work for large tables? Where are the auto-sequences stored? Are they created all at once or in chunks? (MS-Access's link feature has been known to act strangely for large tables.)
Scenario 2: Ad-Hoc Column Trimming
You are doing ad-hoc querying to hunt down a suspect data pattern. Chopping out compound keys makes the listing fit on screen and/or paste better into a spreadsheet or eNotepad.
- This is a presentation problem, not a representation problem.
- Perhaps, but changing the classification doesn't change a need.
- [That need can be met by tools outside the query subsystem. And an honest person would have phrased it "correcting my misclassification".]
- I didn't classify, so there's nothing to correct. And yes, outside tools can compensate for a deficient query language.
- [You are incorrect. You explicitly classified the above under 'BagNeedsScenarios?: where bags may be needed [for representation within a database or query result set]'. This was a misclassification.]
- I meant in the context of presentation versus representation.
- What I meant by this being a presentation problem is that a bag is not required to represent, or store, the data. A bag is not required as any part of this. This is purely an issue of being able to generate a report -- i.e., a presentation -- to the target eNotepad, spreadsheet or whatever that omits particular columns.
- Per below, every possible delivery method qualifies as a "report" by your ever-widening definition. Thus the act of moving or copying is always "presentation" by your definition.
Scenario 3: Contracted Delivery Columns
A marketing service that your company contracts with to study customers is asking for a list of addresses and a purchase count of a given product from your customer database. For security purposes, your legal team has negotiated to omit the customer ID and name from the list, and the columns to be present -- and only those -- have been clearly defined in the contract. It's thus possible some addresses may be duplicates, because different customers may live in the same household, or due to typos such as a missing apartment number. These won't necessarily hurt the analysis, because approximate location, "neighborhood", is sufficient for the marketing analysis.
(For those using the non-keyed data on the receiving end, see BagSetImpedanceMismatchDiscussion)
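As a rough illustration of how such an extract produces duplicates, here is a hedged SQL sketch; the table and column names are hypothetical:

 -- Only the contract-specified columns are emitted. Once the customer
 -- ID and name are projected away, two customers at the same address
 -- yield indistinguishable rows.
 SELECT address, city, postal_code, purchase_count
 FROM customer_purchases;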
- This is a presentation problem, not a representation problem.
- We don't know that. Our job is to supply the company with a data set. They may load it into their own tools/system and what we give them is merely a "representation". They may never even look at its raw form.
- [Are you trying to turn this into a RequirementsAnalysis problem? If they need a data-set, and you don't want to give them duplicates, but you do wish to sanitize your outputs, then simply encrypt or secure-hash the sensitive fields. Security issues should be handled by security techniques.]
- Suppose they ask for CSV format. Where does encryption come into that?
- [Encrypt the protected columns, represent them as base32 strings, and emit these strings into the CSV file. I'm not seeing a problem, but I guess you don't work much with security techniques.]
- You mean send them garbage? Have you ever worked outside of academic-centric projects? The above says, "the columns and order have been clearly defined in the contract". Thus, they are not to receive extra columns, encrypted or otherwise, per contract.
- [You keep contradicting yourself, Topper. You contradicted 'the above' when you said "our job is to supply the company with a data set". Now you are saying, "our job is to deliver a report with a specific format". Report formatting is independent of data access, analysis, and composition issues. The original poster was correct: this is a presentation problem, independent of the semantics and data-model, and representation within the query system and database. (It seems you misinterpreted what 'representation problem' means, though.)]
- No, I did not use the word "report". You just want to see the word report so you can go, "There! see see see: format! format! format!". Your brain craves confirmation of your world view so much it's biasing you. In fact, forget the "order of columns" statement. It's superfluous to the concept and just distracts you. I removed it. [1]
- [Order of columns, and order of rows, and duplication doesn't matter to 'data'... and so, wherever they DO matter, you MUST be dealing with something other than 'data' - one likely possibility being the 'presentation or representation of' data. And you misunderstand the relationship to 'report': I don't say you're asking for a report so I can say "format! format! format!". I see you asking to format for output to an external client, which - by at least the definition of 'report' used in computers vernacular - is a report.]
- As I said, I removed "order" from the description, so it's a non-issue. I will add a side note that data can be data about display preferences, such as preferred sorting order.
- [You insist that "data can be about display preferences". This is true in some trivial sense: for example, I can develop a relation that maintains information about my own, TopMind's, and Organization XIII's display preferences. However, no information is lost if I provide you data in a manner that ignores your preferences. (The information is only lost when you throw it away because you can't make any sense of it.) And, therefore, concerns about information and data are orthogonal to display preferences.]
- My point is that we are generally giving them a representation, NOT a presentation. Your classification is incorrect, or at least subject to interpretation.
- [I'll grant that the other author may have misclassified this one [since recanted, in light of Top's clarification]. But you said "we don't know that" as though this is a RequirementsAnalysis issue - i.e. one where the solution is "talk to the customer". If it isn't a RequirementsAnalysis problem, then you should simply clarify the scenario because there was a miscommunication between you and the first respondent.]
- The contract says "deliver bag X", and that's what we do. We don't need to know what they do with it to satisfy our obligation. This also means we don't have enough info to classify our product. Tools should be based on utility anyhow, not how you classify stuff.
- What I meant by this being a presentation problem is that a bag is not required to represent, or store, the data. A bag is not required as any part of this. This is purely an issue of being able to generate a report -- i.e., a presentation -- to the list target that omits particular columns. You aren't delivering a bag, you're delivering a file, unless they've requested a serialised Java object or some such.
- We have to pick some representation to transfer it. We don't have ESPnet technology yet. Even when stored on hard-drives, there has to be some physical representation. And moving the medium doesn't change anything relevant. If we tore the drive out of the server and gave them that, would it change your classification? Odd stuff.
- Presentation is essentially about copying data from one representation to another. A typical example of a presentation tool is a report generator, which typically copies data from a database representation to a PDF, on-screen, or file representation. However, we typically consider the data source the representation, and the collective process of producing and delivering output the presentation. The fact that the result of the presentation is also a representation is generally of negligible interest. Now: A bag is a data structure, yes? And you will agree that a TXT or CSV file is not a data structure, yes? In this example, we copy data from a non-bag database representation to a non-bag file (CSV or TXT) representation. No bag needed, anywhere.
- This still makes no sense. Every data structure must have some physical "presentation" to exist. On disk it's bit arrays. That's conceptually no different from CSV (except harder for most humans to read). The info has got to exist somewhere, and you seem to be saying that all physical representations are "presentation". If that's so, then the original table on disk as bit arrays is also "presentation". Thus, everything's presentation, or has presentation, according to that. The "primary" representation can even be CSV. ODBC drivers exist to use CSV files as database tables. It's merely a different choice of physical representation from the more common bit arrays. Would it change your argument if the customer instead asked us to copy the specified columns into a new Oracle table on a different disk, and then mail them the disk via FedEx? And, data structures only exist in the heads of humans. Computers only represent or implement them.
- Sorry, not following you here. Are you saying a CSV or TXT file with duplicate records/lines is a bag, hence bags are needed?
- [Data structures must have a representation to exist. Presentation is about formatting of information in order to communicate it, not storage of it for internal use. Consider strings, for example. Inside a computer process, a string might be stored as a list or rope distributed wildly throughout memory and the hard-disk drive. A very big string might even be stored in part on remote computers. The individual characters must, themselves, be represented... you have a lot of choices (ASCII, UTF-8, UTF-16, UTF-32, EBCDIC, BIG5, etc.) and different parts of a single string might even use different representations for characters. All that storage is its 'representation'. However, external clients are generally unable to consume a string formatted in this manner. Thus, you must deal with issues of 'presentation' when you choose to 'present' this string to the external clients. If another machine is an external client, this often means formatting and framing the string within a file, choosing a particular character-set, etc. If a human is the external client, then some severe transforms are needed for presentation... i.e. using glyphs/fonts for characters, paragraph formatting, word-wrap, etc. However, after the human reads the information, some fraction of that information will be 'represented' in the human's head (and almost assuredly not as an organization of raw glyphs). When that human later wishes to tell someone else the information, or input it to a computer, that human will need to work on 'presentation' again: how to organize that information for another's consumption.]
- So you are saying that ANY conversion between systems is "presentation"? ALL interface conventions are "presentation"? And generally "presentation" is associated with human reading, not computer "reading". You are going out on a limb. Further, order can be represented by sequence numbers. We don't need physical proximity to indicate such info. (However, physical proximity must exist by the laws of nature, so sometimes it's more compact to rely on that instead of explicit sequence numbers.)
- The distinction between representation and presentation is perhaps best understood via a simple example: Imagine an IEEE-754 floating point "double" value. Its representation consists of 64 bits, in which the first is the sign bit, the next 11 are the exponent, followed by 52 bits for the mantissa. Its usual presentation (not counting exponential notation, and considerably simplified for illustration) consists of a digit 1-9, followed by a series of digits 0-9, followed by a decimal point, followed by a series of digits 0-9. Obviously, the presentation (which is a string) has a representation too.
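- (A concrete instance of the above, for illustration: the value 1.0 is represented as the sign bit 0, the biased exponent 01111111111 (i.e., 1023), and 52 zero mantissa bits, while its usual presentation is the three-character string "1.0" -- which, in turn, might be represented as the three ASCII bytes 31 2E 30.)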
- So if it's "human-friendly" it's presentation, but if not, it's "representation"? The problem is that "human-friendly" is a continuous concept, and perhaps subjective. And please answer the Oracle-disk-FedEx question to help clarify this. And also the case where the DBMS uses "direct" CSV files to store tables.
- "Human-friendly" is not conceptually relevant, really. Given a data structure, representation is how it is natively stored or modelled in a system for manipulation, querying, interpretation, etc. Presentation is how the data structure is converted to another representation for use by an observer, another system, or another part of the same system. An analogy can be drawn here with the "model" and "view" elements of the ModelViewController concept.
- [Presentation has nothing to do with "human-friendly". It has to do with the act of formatting and stylization of information on behalf of a client ("client-friendly") even when the result is inconvenient for further internal use. Humans are a common client, but far from the only client.]
- So any information we give to the client, regardless of its "encoding", is "presentation"?
- [No. If I give a client information with complete disregard to the format/language/structure/encoding the client can process, that is not "presentation". Presentation needs to be friendly to a client. (What is with your repeated efforts to over-generalize?)]
- I'm just asking questions to try to triangulate your def. I can't read minds. If you don't give me enough info to propose a narrower interpretation, then generalization is all I have to go on. Your definition of "presentation" is so different from mine that this is foreign territory to me.
- So now it seems it's about the "intent" or mind-set of the handing-over of information? That's how it reads to me, per the "complete disregard" phrase. Data doesn't "regard", humans do, so I assume you are talking about humans and their mental processes, and not characteristics of the data itself.
- No, it's not about the "intent" or "mind-set" of the handing-over of information, unless we're talking about the intent or mind-set of the human developer. Your correspondent wrote "[i]f I give a client information ...", not "[i]f it gives a client information ..." Here's another analogy/example that might clarify: A database is about representation, CrystalReports is about presentation. Another: 'int x;' is about representation, 'printf("%i", x);' is about presentation. Does that help?
- But those are not mutually exclusive. Representation can be a means to achieve presentation. Representation must exist (as long as info is outside of our head). Presentation is an option, or a form of representation. What a PRINT statement generates is also a representation. Creating a presentation does not destroy representation, only changes it.
- [Think in terms of contracts (a fancy word for expectations substantiated by incentives). For contracts, 'complete disregard' could mean: I neglect backwards compatibility, I may change formats at any time at my own convenience without alerting clients, I don't provide documentation even when requested (you get to struggle to keep up with my changes). Presentation involves focus on the clients rather than immediate convenience for the developers. Accessibility overall often requires: providing network access within the security constraints of the client, using standard protocols, standardizing the serialization, providing information in the client's own language, using units of measure with which clients are familiar (switching from 'feet' to 'miles' to 'light years' as the numbers get bigger), highlighting the important stuff, lining up columns horizontally, and so on. Presentation is a subset of those accessibility issues: those focusing on just the formatting and expression of information (on the assumption it is already reaching the client). Addressing presentation (and accessibility as a whole) is important because it makes things easier on the client, which is a good business strategy - especially if you expect to serve multiple clients. But presentation and other forms of accessibility are only necessary at points of integration between your service and the client. For intermediate composition and storage, one must address 'representation' concerns - space and time efficiency, system utilization, performance predictability (latencies and memory overhead, especially), safety, scalability, concurrency, persistence, modularity/extensibility, maintainability, and so on. These representation concerns are, of course, influenced by the fact that you must, at some point, address presentation. Often, the representation is closely aligned with the presentation. But, especially as programs scale in size and complexity, it is often useful to use a very different representation from the presentation and perform translation at the edges. For example, we don't usually store numbers as decimal strings for math libraries, because doing so would require a large amount of extra processing for each math operation in a complex expression.]
- Sigh. This has turned into a fractal LaynesLaw and your pet concepts are leaking back into the discussion. Let's back away from classification of data blobs for the moment. I need to deliver a bag and want tools that simplify that process regardless of the vocab it may fall under. If your tool is not bag-friendly, it will put obstacles in my way. If it does that too often, I will switch tools and trash the fucker. That's the bottom line. If you feel those are "safety features" that protect you from horrible dangers, so be it; go with it. That's your choice. -t
- [Your bottom line is a logical fallacy: a FalseDichotomy. RelationalModel queries are not bag-friendly, but they also do not place any obstacles in your way to deal with such presentation issues in another layer. (They are 'bag-neutral' in that sense.) Safety issues were mentioned in passing - one representation concern among very many. And yet, a mere mention of it and you are suddenly blind to everything else and digging in your heels. The strength of your biases exceeds the limits of sanity and reason, TopMind. Your assumptions and biases seems to be the issues standing in your way the most often. I suggest you "switch [mental] tools and trash the fucker". ]
- If it was true that "they do not place any obstacles", then it would be a cinch to simply omit keys from the Select clause. And you didn't sufficiently address the event log issue. MS-Access is buggy for linking.
- Yes, MS-Access is buggy for linking. How is that relevant here?
- [There is a difference between "providing no support" and "placing an obstacle". The argument you are attempting to present now could just as easily be used to blame the '+' operator for numeric addition, because it is incapable of dropping columns from a table. It's a nonsense argument - fallacy.]
- If there is a decent alternative to '+' that can do something the other can't, it's certainly worth considering. If it was just this issue, I might be willing to live with your limits. However, it appears to cause problems elsewhere also, such as the event-log scenario.
- No, I've already explained how it causes no problems with the event-log scenario at all. Rel links to the external log table and dynamically generates row IDs. No problem.
- You haven't explained how it's done and failed to address repeated questions about it.
- Do you read anything we write? I explained elsewhere that row IDs may either be obtained from the external DBMS on those JDBC/ODBC drivers that support it, or via sequential integers, i.e., generated line numbers. Client-side caching may be used where it is appropriate to give the IDs a longer-than-per-retrieval lifetime, though this is rarely needed because the original table didn't have any notion of row identity.
- Please see "Link Specifics" below.
- [Topper, that specification is quite sufficient as RequirementsAnalysis for anyone who works in SystemsSoftware development and who knows a bit about ODBC/JDBC. The problem is your own incompetence.]
- You are just deflecting the blame for being too lazy to be explicit, you insulting bastard! If it depends on existing features of drivers/transports, that's relevant info. And I don't work in SystemsSoftware, and this wiki serves more than just a SystemsSoftware audience. You seem uncomfortable working with people who don't have a background that's a mirror image of yours.
- [Actually, I was in the process of answering the questions below even while you called me 'lazy'. Nonetheless, the fact remains that these answers are quite obvious given knowledge of JDBC/ODBC and actually reading what the author wrote ("The linkage itself will provide an auto-sequence key."). Indeed, he only presented one solution among many possible solutions. If you feel insulted, then take it as incentive to either (1) gain some competence in SystemsSoftware if the subject interests you, or (2) stop asking questions about SystemsSoftware if you aren't interested in the subject. The moment you left high school, you became responsible for obtaining your own education on all the subjects that interest you.]
- Oh, so you are allowed to insult me up, down, and all around, but I can't insult you? If TD/Rel depends on the driver/transport-mechanism to supply temp/working sequence numbers, then just say so. That doesn't require going to ODBC school. You are being a drama queen again.
- [You are allowed to insult me if you wish, but I do ask you attempt to do so on justifiable (which means you can argue the point effectively, not that it's true) and non-hypocritical grounds. And the author did say so. His words were: "The linkage itself will provide an auto-sequence key.". You lacked comprehension; heck, you probably didn't even know this was a good explanation, at least without hindsight.]
- It didn't indicate by who, what, where, when, and how.
- [It provided enough. Who: a participant in the linkage. What: provide an auto-sequence. Where: at one end or the other of the link. When: during linkage. How: based on something already part of the link, such as serialization. Nonetheless, this is one solution among many. The fact is, that unless you think it impossible, it is premature for you to insist on pursuit of a particular solution at this time. (Should we start demanding you provide the low-level details of how SmeQl performs indexing on multiple candidate keys? That's the sort of pursuit you have been making.)]
- Remember that TutorialDee is an illustrative specification. It does not describe I/O of any kind. Rel, as a concrete implementation, does describe such mechanisms. Rel will soon provide facilities to link to external data sources. The external data sources may be CSV/TXT files, Excel spreadsheets, JDBC/ODBC data sources, etc. They will appear as RelVars in the Rel system catalogue. When linking to data sources that do not have an explicit primary key, Rel will generate unique row IDs. I'm exploring a number of strategies for generating these, though my current favourite is simply an ever-increasing global sequence number.
- See UsingInternalRowIdentifiers.
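- (As a rough sketch of the generated-line-number strategy in SQL terms -- window-function support and syntax vary by DBMS, and the log-table names are hypothetical:)

 -- Assign a generated surrogate row ID to each incoming row at link time.
 SELECT ROW_NUMBER() OVER () AS row_id,
        device_id, event_time, event_code
 FROM external_log;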
- Note that the entire SELECT clause appears to be "presentation" by your explanation (except maybe math transformations). Thus, SELECT should be tossed for being "formatting-related" by your...*cough*....logic.
- [Presentation issues certainly tainted SELECT, but SELECT itself is clearly not entirely "presentation", either, since it can be used for filtering or composing information rather than for formatting and framing it; that is, all joins, projections, and 'where' clauses happen via SQL SELECT. Thus, I would advocate splitting the presentation and query issues from the SELECT clause into two or more dedicated clauses. Whether you reuse the name 'SELECT' for one clause of the split is up to you. The authors of TutorialDee chose not to do so, which is probably wise (since reusing the name 'SELECT' could cause confusion when it doesn't work the exact way it did in SQL) but is hardly a technical issue.]
- You are hemming and hawing here. You guys widened the def of "presentation" so much that you even knocked your own sacred and pure relational operators out of the game. I hope you feel silly right now. -t
- Not at all. It is SQL's SELECT that conflates presentation and representation. Rel/TD clearly divides manipulation of representation -- the RelationalAlgebra operators like UNION, WHERE, projection, MINUS -- from presentation statements, operators and iterators like WRITELN, EXPORT, FOR and so forth.
- [Indeed, SELECT includes both data presentation concerns (ORDER BY, column display order, hidden columns) and data semantics concerns (projection, filtering, aggregation). I do not see how acknowledging the conflation in SELECT will 'knock' a system that provides a clean division between RelationalAlgebra and presentation operators. Perhaps you should clarify the logic that leads you to your accusation, because it speaks to me of a continued misunderstanding on your part.]
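- (To make the conflation concrete, a small SQL sketch with hypothetical names, each clause labelled by the concern it addresses:)

 SELECT region, SUM(amount) AS total  -- projection/aggregation: data semantics
 FROM orders
 WHERE status = 'shipped'             -- filtering: data semantics
 GROUP BY region                      -- aggregation grouping: data semantics
 ORDER BY total DESC;                 -- row ordering: presentation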
- You guys are playing word games. Projection (column filtering) can and will be used for BOTH. Do you deny this? I will grant that Rel/TD makes a clear separation between relational and non-relational operators. (I still reject your usage of "presentation", but will accept "non-relational".) But then you have two interfaces and/or steps that make a distinction between "column-chopping" that falls under relational and column-chopping that does NOT fall under relational. In my opinion, they are too similar in utility to bother to separate. You are factoring by "data structure purity" and I tend to prefer to factor by utility: the task at hand. In your view I am polluting the relational world, and in my view you are polluting utility factoring with arcane purity barriers/separation. It may lead to relational operators, and a duplicate/mirror set of bag operators that do generally the same thing (which is better than loop and Print statements in my opinion because they text-ify "cells" prematurely). In my mind that's stupid and anal, but will grant that tool design is largely about mind-fit (WetWare) and will try to be relativistic about it. It may work better for you, but please don't extrapolate. I just don't see "duplication problems" as a major factor in practice, not nearly enough to create weird barriers in the tools. (And the few I do see could be reduced by some minor changes in SQL.) -t
- [TopMind attempts to achieve simplicity by ignoring EssentialComplexity. The so-called "arcane" barriers/separation naturally exist in the service model (input to and output from the database); they are EssentialComplexity. By acknowledging this EssentialComplexity, true RelationalModel DBMS's gain useful features (performance optimizations, idempotence against duplicate entry or delete, etc.) for the relatively minor cost of the distinction (separate syntax for import/export vs. internal queries). This minor cost quickly pays for itself in flexibility, since import/export can then be extended to a wide variety of non-relation-like resources. Thus, the true RelationalModel approach ends up being more flexible and more optimal than the SQL approach. Denial of EssentialComplexity mostly buys crap.]
- As described below, you have not demonstrated this clearly, only talk about it in an authoritative tone.
- Yes, exactly. There is no SELECT clause or anything like it in TutorialDee. There is a collection of RelationalAlgebra operators. Presentation is not defined in TutorialDee, bar a glancing mention of the ARRAY construct (essentially a CURSOR), which is as close as the TutorialDee specification gets. The RelProject implements presentation mechanisms as extensions to TutorialDee.
- It looks like it does to me. It's usually represented as square brackets or curly braces after a binary relational operator: (table op foo) {columnA, columnB, etc.}. RENAME smells presentationy also.
- [TutorialDee certainly has projection. And joins. And filtering (where) clauses. The author of the above statement knows all this, so I suspect he meant something else when asserting there is nothing like SELECT in TutorialDee. Certainly, there is no clause in TutorialDee that combines all the properties of SELECT (i.e. you can't introduce order-by, nor duplicates, during a projection, join, or filtering clause; and it seems the individual query properties from SELECT were split into respective operators). RENAME is important to avoid ambiguity when performing a join on two tables with identical column names, and may be unrelated to any final name for reports.]
- Exactly. There is no single 'SELECT' operator or 'SELECT' statement that combines all the properties of a SQL SELECT. There are individual operators of the RelationalAlgebra that act on relations, plus other operators that may be used to generate formatted output (presentations) from relations.
- It appears this separation causes more and/or uglier code.
- I suppose that's a matter of opinion. TutorialDee is not SQL.
- [You over-generalize: not all "conversions" are about 'presentation'; conversions are often used to optimize internal representations or similar purpose. And programming languages generally aim to minimize the degree to which you deal with 'presentation' concerns internally, as those concerns tend to hinder composition, and features such as laziness and parallelism and optimization. Thus, PLs tend to focus on semantics rather than presentation, which gives a great deal more freedom to the implementation. Anyhow, are you saying the client (or, say, you - when you are looking for patterns in data), would be happy with sequence numbers?]
- There is no evidence they will be looking at the raw data. Only a "packaging standard" was specified. It says nothing about whether they will eye the raw data. They may run a "File/Import" menu in Access or the like and never even look at the original file.
- [That is not relevant to anything in the prior paragraph. Eyeballs are not a prerequisite to being a client. Presentation could just as easily be an audio file, Braille, or CSV for import. What matters is that you are doing it for external consumption.]
- Externalness? Never mind. Let's not go there.
- [The notion of query isn't about 'presentation' of data, but rather about access, analysis, composition of it - the ability to produce new information by massaging the existing information; the ability to access the information immediately relevant to you. Formatting the resulting information for human consumption isn't the job of a query language. It's the job of the command-line client to the database, or of the table browser, etc.]
- We had this debate about "not it's job" in another topic. There's nothing new here and I stand by my argument there.
- [We basically agreed on that subject in another topic that presentation issues can reasonably be separated from the RelationalModel 'query sub-language'. The crux of your argument has only been that such formatting features be conveniently accessible and declarative if feasible, and nobody has been arguing against that, though we do have different ideas on exactly what 'conveniently accessible' means.]
- We also have a disagreement about what "formatting" is. You consider mere inclusion/exclusion to be "formatting". I don't. I find that silly and forced.
- [I do not find "mere inclusion/exclusion to be 'formatting'". Filtering is not formatting. However, coding to any specification regarding HOW information is to be presented to a client IS formatting, whether that covers 'SI vs. English vs. Nautical units' or 'font size' or 'duplicates' or 'column order' or 'color' or 'cell borders' or 'page width' or a specific natural language (English vs. Japanese vs. Swahili).]
- I disagree, but don't want to debate vocabulary any further right now. You are making everything and anything be "formatting" by bending a fuzzy concept to its very limits.
- [I suppose it is worthless to explain something to a man who assumes he already knows. The issues aren't fuzzy, but assumptions such as you possess will give anyone mental cataracts.]
- If the rules are clear, clearly state what they are in a lawyer/geometry-proof hybrid kind of way. Give them clause names, and say, "One or more of the following must be true", etc. There's a right way to document and a wrong way(s). You always do it the wrong way. Show you are smart, don't claim it. Claims of smartness are a dime-a-dozen on the web. Otherwise, stop your pseudo-rigor. (And you also have to show the source of your rules.)
- [I already did. Reference the paragraph on contracts and accessibility. The problem is on your end: you are the sort of irrational fool who requires excessive burden of proof for facts you find inconvenient ("you haven't shown it necessary") and far too little burden of proof for the facts you want to be true ("you haven't shown it impossible").]
- Projection there, bub. And it's worded as disconnected meandering. Word it closer to a legal document or a formal proof. And define any vague words within.
- [I take it you haven't read many legal documents.]
- I'm not saying their format is perfect, but it suggests organization concepts that you completely omit.
- By the way, in my mind "presentation" is a continuous concept, not discrete. Some formats/organizations are more human-friendly than others.
- ["Presentation concerns" cover a large number of more specific concerns (such as SI vs. English units, GUI sliders vs. toggles, column order, sorting within columns, font colors, presenting 'duplicates' in a particular manner (rather than one of the many RelationalModel-compatible alternatives), etc.). The large number of issues can certainly make it appear to be a continuous concept from a high-level view. But the individual concerns are clearly not associated with accessing, analyzing, or composing information. That is, the information-content going into presentation opoerations is usually equal to the information-content coming back out, with the only difference being format and framing. Presentation certainly isn't binary, though; a given tool can cover a little or a lot of presentation concerns. The degree to which a format is 'human-friendly' is unrelated to whether a concern addresses presentation.]
- Please elaborate on "accessing". Smells open-ended. Or is this back to "external consumption" again? Are you sure you want to pivot on that?
- [The "accessing" refers only to the fact that, fundamentally, a query language must answer the query. It is unclear to me how you think this open-ended. How do you imagine "accessing" being used, that it might cause problems?]
- It's not clear whether it's the nature of the data blob, what people do with it, what people think of it, or the org-chart relationship of those involved. You seem to be making up the rules as you go.
- [I don't see how it's related to any of those things. Would it be clearer in your mind if I replaced the word 'accessing' with 'fetching'?]
- If sub-query A references sub-query B, is that also "fetching"? Or can fetching only be done by humans?
- [It is certainly 'access', which is why I used that word. But, indeed, any sub-query must eventually fetch the data in order to process it.]
- When systems communicate with each other, they often have to "convert" the information into a form more usable by the recipient. Computers and humans both have "formats" (interface standards/protocols) that they use for this. They usually don't just scoop up bits and toss them at the other side as-is.
- [Yes. And that conversion is 'presentation'. Fetching is just the 'scoop up bits and toss them at the other side as is' (or hand a pointer, either way). That is, at the point of fetching the actual representation is still abstract (AbstractDataType).]
- Are you making a distinction between biological components of a process(es) and silicone ones?
- I think you mean "silicon", not silicone. Silicon is the semiconductor basis for computer CPUs. Silicone is used to seal bathtubs and make breast implants.
- [When I use the word 'process', I speak of OS-style process-model processes, so your question is a confusing one. I do not distinguish between different sorts of clients - i.e. whether a human is a client, or a separate service.]
- I don't see the use in forcing a hard distinction between "formatting" for human eyes versus different sub-systems of machines. I cannot think of any hard distinctions even if I wanted there to be a hard distinction. It's irrelevant in this case because we are simply delivering what the contract asks for, not knowing or caring what or who will use the data. Knowing wouldn't change the result that we create.
Scenario 4: Compact Sales Summary
A retail company is sending out a CD to managers with a small reporting application that provides interactive QueryByExample reports of prior year's sales from different perspectives. Urban managers can get it off the intranet, but there are still sufficient numbers of rural stores with spotty or no access. (Another take on this is given at PageAnchor "scenario_four_alpha".)
Normally only a product catalog is sent on the CD, but the techies concluded that the reporting app could also fit on the same CD with some adjustments.
For this project, a summary statistical table is produced with this schema:
Table: sales_sum
----------------
store_id // retail store identifier
month // date rounded down to 1st of month
prod_categ // primary product category
repeat_cust // repeat customer flag: Y or N
sales // dollar amount of sales
The first four columns make up the primary key. However, disk space was at a premium on the CD, so the primary key was eliminated, creating essentially a "bag".
Individual indexes on the first 3 columns remain. Slower repeat-customer-related queries were noted, but deemed acceptable considering the tight space.
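A hedged DDL sketch of what shipped (the column types are assumptions; only the names come from the schema above):

 CREATE TABLE sales_sum (
   store_id    INTEGER,
   month       DATE,
   prod_categ  VARCHAR(30),
   repeat_cust CHAR(1),
   sales       DECIMAL(10,2)
   -- No PRIMARY KEY (store_id, month, prod_categ, repeat_cust);
   -- the compound unique index was dropped to save space on the CD.
 );
 CREATE INDEX ss_store ON sales_sum (store_id);
 CREATE INDEX ss_month ON sales_sum (month);
 CREATE INDEX ss_categ ON sales_sum (prod_categ);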
An explicit primary key objectively takes up more disk space, almost as much as the table itself. This is a bit smelly. Purity is not free here.
What would you have done if the catalogue was so large it wouldn't fit on the disk at all? What if there was only room for an index on one of the three columns? You've certainly shown a "need" for a bag in a situation that just happens to be delicately balanced between sufficient space for a relational implementation and not enough space for any implementation at all. That makes it the very epitome of contrived. Very, very dubious, Top.
- I'm not following your complaint. There could be the option to compress catalog images further, or perhaps ship two CDs instead of one (adding shipping and handling costs), or drop the sales summary app altogether. It's a matter of weighing the trade-offs, including the time needed to diddle/tweak bits to clear more space. If you don't need a key index, then you don't need a key index. It's like a toddler's teddy bear for you guys, a security blanket: can't leave home without it or you get irritable.
- [Usually, when you start bumping into space issues, you aren't just barely tipping the edge. Thus, your example seems too contrived (not impossible, but improbable). Running into space issues in general has been a common problem, but it is often far from obvious which utilities get first dibs on high-performance direct-from-disk access, and which should be decompressed or regenerated at the client's machine. It would not be unusual, in a situation like this one, to strip all the indexes, compress the data as tightly as one can (a '.zip' atop exported text is more compact than most database file formats!) then to decompress and regenerate the indexes client-side. Anything non-essential gets reduced, then you are free to expand from there to the extent that running certain features directly from CD is necessary. (Of course, nowadays you could just switch to a DVD, a flash stick, or an SD card...)]
- Sure, there are lots of ways to squeeze more space out of stuff, but that takes effort and tweaking, and perhaps a more complicated installation, risking installation-problem phone calls. I've built installation scripts before, only to find out that the user doesn't have a "C:\temp" folder. I had thought all Windows computers had it as part of the Windows installation, but discovered the hard way that a handful lacked it for unknown reasons. Another time, permissions didn't permit the creation of a new root folder. Who or what tweaked the PC permissions, I don't know. Another machine only had an "E" and not a "C" drive. Thus, if you can run the entire thing from an EXE that's on the CD itself, there are usually fewer things to go wrong. It's not like we always have a Sony-sized budget to build and test a bullet-proof installation utility. I dismiss your complaint that it's an unlikely scenario. (Although rarer in the internet age, perhaps.) The important thing is that if we don't need a primary key set, and the application works just fine without it, then why include it if space is tight? Which is simpler: removing the primary index, or building a bullet-proof hard-drive installation script? -t
- These kinds of projects often start when a manager comes to your boss and says something like, "You know that little app you guys wrote for Bob and Lisa? Would it by chance be possible to put it on the next catalog CD?" Analysis then shows that as-is it's slightly too large, so you go in see what can be tossed or shrunken.
- [Fair enough. I can see how a lot of little additions might have you bumping the edge of a CD's capacity and deciding what stays, what goes. I have no problem eliding an unnecessary index, though I consider doing so to be irrelevant to BagNeedScenarios. And you still make a lot of contrived assumptions regarding the abilities of the RDBMS: it can handle read-only persisted resources, but it can't have a uniqueness constraint without a corresponding index? Is there a specific RDBMS you're thinking about?]
- Why is it irrelevant? Note that if the DB files are on CD, they are read-only by the nature of the medium.
- [For at least two independently sufficient reasons: (1) duplicates are required in a demonstration that bags (multi-sets) are necessary or helpful relative to sets. (2) your "no primary key because I can't have the index" excuse does not imply bags are helpful, only that indexes are sometimes harmful.]
- As far as the abilities of DBs, I'd have to test them. I'd have to test file-based DBs such as MS-Access and SQLite.
- [Therefore the indexing requirements of this hypothetical RDBMS you're depending upon for Scenario 4 are highly contrived.]
- Uniqueness probably requires something similar to an index to work efficiently, and one is certainly free to sacrifice this efficiency for immutable media.
- [Sure, assuming a big enough database. But uniqueness doesn't need to work efficiently (as you say, especially not on immutable media), nor would it need a dedicated index to work efficiently. Sacrificing efficient uniqueness doesn't mean sacrificing uniqueness.]
- MS-SQL-Server has "clustered" indexes which physically group the data by primary key, but it's not file-based, and thus overkill for a CD.
- [Indeed, there will be many RDBMSs that won't be suitable for running from CD or other arbitrarily constrained devices, without tweaks. Same applies to any particular product.]
- The requirement of a unique-a-tron feature adds to the difficulty of selecting and supporting a DB engine.
[TopMind, this entire scenario is you mistakenly conflating indexing (and a particular implementation thereof, no less!) with uniqueness constraints. FAIL!]
[As a note, there are techniques to index on all five columns at once with trivial additional space costs. Do a little research on kd-trees and R-trees if the subject interests you.]
A given RDBMS does not necessarily separate the two nor provide space-friendly alternatives. It's not practical to provide key/constraint implementations optimized for every possible domain-specific optimization scenario. Sure, if there's enough time, one can research different DB engines and perhaps find one that does, and convert any vendor-specific SQL over, but that's an idealistic assumption. Plus, using a different DB increases cross-training costs. Staff has to know how to use a given RDBMS engine/format. Managers often want PlugCompatibleInterchangeableEngineers, and a new "weird" database brought into the mix will go against their wishes. -t
Further, why check "constraints" when it's read-only after release? There are not likely going to be any INSERTs done after shipping. Usually the file(s) are replaced in their entirety for such a product. One can run a simple query to detect any duplicates before shipping. -t
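For instance, a minimal pre-ship duplicate check along those lines, using the sales_sum schema above:

 -- Any row returned here is a duplicate by the logical (dropped) key.
 SELECT store_id, month, prod_categ, repeat_cust, COUNT(*) AS copies
 FROM sales_sum
 GROUP BY store_id, month, prod_categ, repeat_cust
 HAVING COUNT(*) > 1;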
[I was about to make the same point. It was very silly of you to call something a 'bag need scenario' when it lacks duplicates, does not need duplicates, and cannot be later modified to include duplicates.]
Bags don't require duplicates to actually be bags. An empty data-structure is still a data structure. However, I will agree this is a gray area because defining a data structure by what it "does" versus what's in it can be sticky. -t
[Duplicates are required to demonstrate that bags (multi-sets) are necessary or helpful relative to sets. That 'gray area' you've wandered into hardly seems relevant.]
In fact, I've seen similar tables in which last-minute corrections were added by simply "appending" the deltas to make adjustments to existing records, creating what would be a duplicate key. As long as all report queries use GROUP BY, which this type of summary reporting usually requires, it will work.
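A hedged sketch of that delta-append approach, using the sales_sum schema above (the values are made up):

 -- A late $500.00 adjustment to one summary row is appended as a
 -- second row with the same logical-key values:
 INSERT INTO sales_sum VALUES (1042, '2011-06-01', 'GARDEN', 'N', 500.00);

 -- GROUP BY reports still come out right, because SUM folds the
 -- appended delta into its matching row's total:
 SELECT store_id, month, prod_categ, repeat_cust, SUM(sales) AS sales
 FROM sales_sum
 GROUP BY store_id, month, prod_categ, repeat_cust;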
[What a silly way to handle deltas.]
What would you recommend instead? This is the least intrusive approach, and the easiest to find and undo. If you reprocess the whole shebang, you risk some silly mistake or command typo. In my case, it was only about 10 records, such that the size difference was negligible.
[You also subjected your process to command typos by not scripting it? Good work, TopMind. If the RDBMS supports some non-intrusive approaches, use them, otherwise reprocess; either way, deltas require both inserts and deletes. But we are drifting far afield of the contrived and irrelevant scenario at this point.]
All technical issues aside, my objection to this scenario is that you can always come up with a hypothetical constrained situation to justify any "need". Why a CD? Why not data storage on an RFID tag to be scanned by a low-capacity hand-held device? How about a low-capacity promotional USB key that can be plugged into any USB-capable device including phones and car stereos? How about limited storage on a satellite or space-shuttle experiment? And so on. There are innumerable one-off situations that require a special set of circumstances. Whilst these may warrant bags or any number of other constrained strategies, what they do not do is justify providing bag support in relational systems in general. You still haven't presented any argument to support the notion that a relational system benefits from supporting bags in addition to relations, except for being able to import/link duplicate records at the input, and export duplicate records at the output. Can you show any general justification for representing data in bags within the system boundaries? Can you show any justification to have a 'bag' data type in addition to a 'relation' data type on systems capable of supporting relations?
Note that you wrote at the top of this page, "[t]hese are [...] scenarios where a 'bag' may be needed or helpful instead of unique rows." If we're dealing with a space-constrained situation that forces us to eliminate unique rows as an option, then this isn't a case where a bag may be needed or helpful instead of unique rows. "Instead" implies the option to choose. You've deliberately constructed the scenario to eliminate the option to choose. Thus, this is a case where a bag is the only possible option, by deliberate construction. As such, your scenario is almost identical to this one: "After one of Chris Date's lectures, your boss got into a heated argument with Chris over allowing duplicate rows in tables. As a result, your boss refuses to let you use anything that implements the true relational model and requires that every table allow duplicates. Therefore, bags are needed." Well, no. Bags aren't needed, bags are being forced. The distinction is significant.
[I must disagree. That isn't a significant distinction. Something being forced by real circumstances is the very essence of being needed. Top's error here is attempting to pin the requirement on the peculiarities of a particular RDBMS implementation, hypothetical or not. (If he wanted to be a little more obvious, he could produce another scenario: hypothetical RDBMS X forces duplicates on you always; that is, it literally checks for duplicates and refuses to do work, except for inserts, until at least one of its relations contains a duplicate. Therefore, bags are needed or helpful.) I also disagree with your assumption that space-constrained situations can force us to utilize relations with duplicated rows. Something to keep in mind is that, the more space-constrained the RDBMS is, the less you need indexes (indexes are all about upwards scalable performance). At a small scale, an RDBMS could be efficiently implemented by linear searches on arrays. Additionally, it is a mistake to confuse representation (how the RDBMS organizes the bits) with the semantics (what this information means within queries or upon export). It is perfectly legit, if you wish, to use position in a linear array to store information regarding an autonum column; that would not introduce 'duplicate rows', though it is of questionable suitability if rows can be deleted and the autonums can't be reused.]
What I've tried, and perhaps failed, to make clear by my use of the term "forced", is that if you ignore all the technical issues, this is a purely contrived circumstance wherein Top has attempted -- in a somewhat roundabout way -- to state "you need a bag because I say you need a bag."
The bottom line is that removing artifacts that prevent duplicates saves space. If you don't hire a bouncer for your loft party, you can save money; the trade-off being you risk having riff-raff around. Depending on the circumstances, you may decide it's worth it, such as making the party more interesting at the expense of having to buy new furniture afterward. Maybe you decided your furniture is out of style anyhow. Having a choice can be good and allowing bags offers choices. -t
[Bad analogy. The trade-off is that new riff-raff won't be recognized and removed with the same efficiency as a dedicated bouncer might manage. And if you want choice: there are many, many approaches to handling duplicates. Indexes are not even among the best of them (they trade space for time). Feel free to introduce the choice there, at the point of optimizations tuning. No need to allow bags in order to introduce choices.]
Well, let me temper that a bit: having a choice can sometimes help. This does not mean that a gazillion choices is necessarily good. I'm illustrating specific places where it helps. Whether it's worth other downsides in other situations is a larger question than this scenario itself. I'm not addressing every possible variable of the bag issue here. -t
[Just because a choice is moved off into a different operator does not mean the choice is gone. You don't need to mourn the choice. You only need to learn its new address.]
If the "operator" doesn't have to fit on a tight CD, it simplifies life.
Scenario 5: Fouled-up Key
I once worked as a contractor on a state government project where the state had accidentally issued duplicate license plate numbers. Rather than recall the plates, they decided to put out a notice to law enforcement agencies and live with the overlap. However, as contractors we had to work with their data and had to live with "bags". The data set was too big and changed too often to make a local copy with surrogate keys, barring a big expenditure and extra staff. The state may have created their own internal unique key, but as contractors we didn't have access to it.
Note that the issue is from the perspective of the contractor. Perhaps the state should have cleaned up the mess rather than live with junky data, but I'm not assuming king-level control of all systems. In the real world you have to work with stupid people and stupid systems.
These scenarios are based on my actual experiences. Some are simplified to keep the examples explainable without giving a domain lesson. -- top
Link Specifics
Suppose we are querying for the most recent week's worth of logs. What comes back from ODBC? Does it have a unique identifier? If so, is it relative to the original table, or only the current request? Is it usable by TD/Rel? What if a driver only sent CSV to TD/Rel of just the requested columns? Where is the key stored while the client is chomping the data set? Does something have to scan the entire source table to make unique keys? It's generally fairly easy to get a unique working surrogate key for a single given query: just number the row-set starting with "1" and on up. However, that's not usable across different queries, such as if we want to get updates, like any straggler logs that may have arrived late because a station's network was down and the loggers queue. I don't know if one-time-use is sufficient for TD/Rel. I don't have to ask most of these questions of existing RDBMS because they can just stuff it into a bag. (Auto-sequence surrogates in the copy can optionally be added.) It does appear that TD/Rel are finicky compared to the current RDBMS. It's like dogs versus cats: dogs will eat just about anything, but cats are finicky, AKA high-maintenance.
- [I think you're magnifying this into something waaaay more complex and significant than it is, including the degree to which Rel is "finicky." Indeed, I've been praised by users of SQL DBMSes for the degree to which it's not finicky and is predictable, consistent, intuitive, and easy-to-use.]
- The complainers probably just drop it and move on; I have no way to check what filtering shaped the feedback that reached you.
- [That complainers drop it and move on is one hypothesis. That people genuinely like it is another.]
- Such a scientific survey.
- [Huh? I've proposed hypotheses (as did you!), not conducted surveys. You do understand the difference, don't you?]
Based on the earlier statement, all of these questions have already been answered:
- What comes back from ODBC? the normal result-set. No changes here.
- Does it have a unique identifier? If so, is it relative to the original table, or only the current request? Based on sequencing within the ODBC response, unique identifiers are added. The identifiers themselves are specific to the request, but: "Client-side caching may be used where it is appropriate to give the IDs a longer than per-retrieval lifetime, though this is rarely needed because the original table didn't have any notion of row identity."
- Is it usable by TD/Rel? It is a proposed feature of Rel. TD is ignorant of where identifiers come from.
- What if a driver only sent CSV to TD/Rel of just the requested columns? That does not hinder anything.
- What then supplies the temp/working sequence numbers?
- [Rel does, in the layer that interfaces with the ODBC/JDBC/TXT/CSV/Excel driver.]
- That's just a bag in a relational dress.
- [It's a relation. It contains no duplicate tuples.]
- Where is the key stored while the client is chomping the data set? In the cache, or not at all.
- Does something have to scan the entire source table to make unique keys? No. The serialization provided via TCP is sufficient for this task.
- Do you mean that the order it comes over serves as the de-facto sequencing for a given query result?
- [Yes.]
The following was conjecture on your part:
- just number the row-set starting with "1" and on up... Why risk funky collisions? Instead, have a global or per-session 64-bit autonum, and start there each time.
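- As a minimal sketch of that per-session autonum (Python; the driver interface and names are invented for illustration):

 import itertools

 _session_autonum = itertools.count(1)   # session-wide counter, shared across fetches

 def tag_rows(driver_rows):
     # driver_rows: whatever the ODBC/JDBC/CSV driver yields, duplicates and all;
     # each row leaves the linkage layer wearing a session-unique identifier.
     for row in driver_rows:
         yield (next(_session_autonum),) + tuple(row)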
- However, that's not usable across different queries, such as if we want to get updates, like any straggler logs that may have arrived late because a station's network was down and the loggers queue. In the case of a log, stragglers could be fetched by date/time. This would also be the case in SQL. However, SQL and TutorialDee both assume a 'frozen' view of the world while performing operations (thus the ideal of AtomicConsistentIsolatedDurable). There will always be stragglers after some cut-off point. How to deal with these stragglers is really a higher-level architectural issue, and this is the wrong place to solve it.
- That doesn't mean this issue isn't important. FunctionalReactiveProgramming and ReactiveDemandProgramming and (to a lesser degree) PublishSubscribeModel, MultiCaster, EventDrivenProgramming, DataflowProgramming and such are aimed at reducing 'temporal coupling' of the sort you describe. This is a problem that ought to be solved for RDBMS in general... i.e. by supporting 'subscriptions' to ad-hoc queries without need for polling. Well, I should hop off this HobbyHorse of mine: even though you brought it into this discussion, it is mostly orthogonal to BagNeedScenarios (excepting to note that duplicates + order for a query will introduce issues of which duplicate is removed when a subscription emits a 'DELETE', and thus count as one more point against supporting duplicates and order-by from SELECT).
- I don't know if one-time-use is sufficient for TD/Rel. Clearly, zero-time use was sufficient for SQL. Rel only needs to introduce these identifiers for consistency (i.e. across table self-joins). One-time use is not ideal, but will be sufficient for that purpose. That is, by introducing these extra identifiers, Rel can ensure that it at least doesn't multiply duplicates and generally minimize introduction of new garbage.
- I don't have to ask most these questions of existing RDBMS because they can just stuff it into a bag. Indeed. And, thus, you have two problems.
- [Beautiful. That, in a nutshell, is why I started favouring true relational systems over SQL about ten years ago.]
- Too flexible for ya? You had to go and gum it up with your impractical idealism.
- [Huh? Now you're just quibbling. If you're going to take the time to edit, at least try to make a sound point.]
- It does appear that TD/Rel are finicky compared to the current RDBMS. TD/Rel does not introduce any new maintenance issues. However, it does highlight issues that exist in the current RDBMS.
- It's like dogs versus cats: dogs will eat just about anything, but cats are finicky, AKA, high-maintenance. Perhaps an analogy to a Diesel engine would have served you better.
- [Yes, a diesel engine is a very good analogy. Or, even any modern petrol (gas) engine. In the early days of internal combustion engines, they would burn just about any flammable fluid. My father ran a Model T engine on paint. Of course, there was a price for such flexibility: They had to run on relatively imprecise fuel metering so they could vaporise any liquid and employ a low compression ratio so that the fuel wouldn't detonate (ping or knock) and damage the engine. The result was low power and poor economy. When engine design significantly increased power and economy, the fuel needed to be more refined but the increase in power and economy was worth it. Similarly, using systems based on the RelationalModel does create occasional minor speed-bumps at the boundaries of the system, but only when dealing with the rare cases of needing to import data with duplicate records or needing to export data with duplicate records. However, what we gain with the RelationalModel is significantly improved optimisation, conceptual simplicity and guaranteed avoidance of accidental duplicates. So, for a minor, occasional slight inconvenience at input and output, we gain a continuous improvement throughout.]
- It's not about technological progress. Maps have existed for quite a while. It's a matter of whether a tighter base data structure or a looser one is the best fit for modeling domains. It's not a free lunch; it involves intricate and interrelated trade-offs. I believe IBM and Oracle made the right choice out of practical concerns. Even if maps have a very slight advantage, the weight of existing implementations outweighs that. You don't knock an existing de-facto standard off unless it's either replaced by a significant improvement, can be added incrementally without major headaches, or comes on the coattails of some other new technology. I doubt TD/Rel is the coattail maker because it satisfies a purity itch more than an expressiveness itch, and the majority of practitioners will not buy the purity argument without a clear and relevant demonstration. Avoiding "duplication" mistakes is not the highest wish-item on most lists.
- [What lists are you referring to?]
- A figurative "wish list" of areas to improve query languages/tools. Ability to have UDF's that take tables as parameters and improved meta ability are high on my list. I find I need more expressiveness far more than I need e-spanking mechanisms. Errors caused by lack of expressiveness are greater than those caused by "purity" problems in my experience. -t
- [It's good, then, that TutorialDee allows UDF's to define RelVars as parameters and Rel defines a catalogue for meta-querying. I believe you'll find the restriction on duplicates to not be a restriction at all. And note that what is high on your list may not be high on someone else's, and vice versa. It's unreasonable to expect every tool to precisely match your priorities unless you build your own tools.]
- [I find it interesting that you say this debate is about finding "the best fit for modelling domains." I look at relational databases from a somewhat different point of view. My concern is not about the best fit for modelling domains, but about the best fit for recording facts about domains. The difference is subtle but very significant. In my view, the latter much more closely reflects what we're attempting to do when we create database-driven applications than the former.]
- [Yes, IBM and Oracle made a choice out of practical concerns. In the first SQL implementations, there was a concern that removing duplicate tuples -- required explicitly by the projection and union operators, and implemented implicitly by others employing projection and union -- would be too computationally expensive to be feasible. This was a valid consideration given the CPU, disk and memory constraints of the 1970s and early 1980s. (Indeed, there are still domains where table-oriented systems in general are considered infeasible and hierarchical systems like IBM's IMS hold sway.) The issue regarding duplicates in projection and union was significant enough to influence the entire SQL model in its early days, and so we're saddled with a problem -- having to deal with possible duplicates -- that no longer has a practical basis. For a while we've had more than enough CPU, disk, RAM, and suitable approaches to make duplicate-removal insignificant.]
- Note that we also process bigger and more complex datasets now, so processing resource issues are not entirely swept under the rug by improved hardware.
- [True, resource issues are not entirely eliminated, but they are reduced in significance to the point of being negligible. When they do become an issue, other indexing issues become equally a problem.]
- [This whole argument reminds me of the old structured vs unstructured programming debates.]
- Or more like the "anal verbose tight types will protect you from meteors and cancer" debates. Note that some of those "goto" programmers were fricken good at their goto craft. It's not my bag (pun intended), but some of them could read goto-laden code impressively well.
- (Social note: calling the other technology "goto-like" is the equivalent of calling somebody's opinion Hitler-like or "Nazi" in political debates.)
- [Is it? I didn't know that.]
- [Wow. I don't know what "anal verbose tight types" are (though they sound painful) or how they protect from "meteors and cancer," and I don't recall anyone advocating any of that, but thanks for clarifying which side of the fence you'd have (did?) occupy on the structured vs unstructured programming debate.]
- I didn't pick sides.
- [Maybe not then, but you did here.]
- I did? Thanks for letting me know what my opinion is.
- [I think it was pretty clear in your non-sequitur praise of "'goto' programmers" who were apparently "fricken good". Anyway, I can see intelligent debate has mostly petered out here. I'll be looking forward to you railing against Rel and Tutorial D, now that you're no longer getting much mileage out of the anti-OO rhetoric.]
- Some were good. Does that bother you? Rel/TD are not popular enough to bother with the same kind of criticism effort I devoted to OO.
- [That some were good? It doesn't bother me at all. It just doesn't seem relevant. In the context of this page, the equivalent would be, what? That some SQL developers are "fricken good" at SQL? It's true. So what?]
- [Anyway, keep an eye on Rel/TD. 'Tain't big now, true, but it will be. By the way, for something that's not popular enough to bother with the same kind of criticism effort you devoted to OO, you've certainly devoted a lot of effort here.]
- You've yet to see the full fury of the Topster. In a way they are similar, a battle against impractical idealism: all domain nouns "should be" objects and anything more complex than an array "should be" pure relational. Abstractions should assist us, not become a religion.
- Pure RelationalModel does assist us (cleaner w.r.t. duplicates, optimizations, distributed concurrency and idempotence). You might not find this help relevant, but it's still there - and the issues resolved are non-trivial (not solvable by local syntactic tweaks). Where do bags 'assist us' in a significant way, with regards to data management?
- Yes, you claimed that already repeatedly over and over again many times.
- Then, even if you do not believe the claims, at least stop pretending that our argument is about "impractical idealism". That's just an AdHominem attack, and sophistry.
- The most significant [place where bags 'assist us'] is that natively handling them can simplify reuse of existing database engines (such as with new query languages). It would be silly to toss them for that concern alone unless there was a demonstrated significant downside.
- Ah. By that argument, you should be promoting that SQL engines work directly on blobs, in order to better simplify reuse of FileSystems. After all, we have not demonstrated that there is a significant downside to blobs.
- I kind of proposed such in FileSystemAlternatives. I'm not at all against mixing and matching underlying "storage engines" and query languages. -t
- You sidestep the point. Nobody here is against providing adaptors to different data resources. The question is how much the existing systems should influence the development of new systems, especially when the adaptation to the old system is relatively trivial (as it is with SQL storing and returning blobs, or a true-RDBMS storing and exporting bags).
- In any case, I think your subjective view of 'significant' is the greatest barrier to rational discussion. By your own definition you won't find anything significant that you don't want to believe is significant. Must be a convenient basis for an argument. I consider potential for duplicate entry errors (for a table that attaches meaning to duplicates), and lost opportunity for optimizations, to be significant simply because they cannot be fixed 'locally'. The former needs to be fixed at every single application. The latter can only be fixed with a lot of extra usage context, and a super-intelligent optimizer.
- Below I've given you an opportunity to demonstrate "significant" using specific scenarios.
- Below, I've given you opportunity to commit to a rational, objective definition of "significant". You refused.
You've still not shown any need for bags. Let's summarise:
You've shown that there is a need to generate files with duplicate records to meet client requirements.
- However, that's not a bag, nor does it require bags in the program generating it.
- Please clarify
- [Formatted output to a file does not, for example, expose the interfaces associated with a bag/multi-set (union, intersection, difference, etc.). Nor do you ever need a structure with these properties during construction of such a file.]
- This is still not clear, what "interface" are you referring to? The customer is to receive a bag per contract agreement.
- [No, they aren't. The customer is to receive a character stream.]
- It's both; stop trying to create fake vocab roadblocks. ANY data structure requires a representation, whether it's a binary blob or character stream. The choice doesn't affect the issue at hand.
- [Bags do require an underlying representation, but the 'bag interface' is NOT the same as an interface to a binary blob or character stream. That IS the issue at hand. The customer does not receive a bag interface. The customer receives an interface to a character stream.]
- Why is that relevant here? It's only packaging details. Would it change the argument if they requested a RPC that delivered bags?
- [The problem itself is about "packaging details" to meet client requirements, was it not? I think you just proved your whole fucking argument to be sophistry by saying "Why is that relevant here? It's only packaging details?" Anyhow, I think that if clients are expected to request arbitrary new services back in reply to requests, you'll have something very different than a DBMS.]
- The relevant details are that the customer is to only receive a predetermined set of columns, which do not include the key. Whether that is sent via CSV or bar-code tattoos on a monkey's butt is immaterial (other than being impractical). -t
- [If duplicates aren't a relevant detail, just drop the duplicates and send them the requested columns.]
- My apologies; keeping the duplicates is a relevant detail. It's the "bit packaging" stuff that I'm trying to separate.
- It would only change the argument if the scenario was presented as follows: "Our client needs to be given live access to the database to retrieve customer address information, but for legal/policy/whatever reasons, our client must not be given any unique row identifiers. Our client must receive only address information, with all duplicates intact, and may only communicate directly with the DBMS. No intermediary software may be used." That would be an interesting requirement, and no doubt very rare, but let's work with it. Given such a requirement, we do need a bag interface at the boundary of the system. However, this does not mean the underlying representation of the customer data must be a bag, only that the machinery providing access to the data must present a bag interface. Bags are still not required within a relational system; we are merely providing a mechanism to present data using a bag interface. Conceptually, this is no different to whatever machinery we also provide at the boundary of the relational system for importing/linking data with duplicate records. We can trivially convert bag sources to relations at the import/linkage point. Likewise, we can easily convert relations to bag targets at the export/external-access point.
- Per below, we don't absolutely "require" anything, including relational itself. Thus, "require" is a non-issue. But, you are adding more steps, more resources, and more tools in order to deal with the bag-to-relational impedance mismatch. I think we both agree to this in general, the real issue is how significant and common these impedance issues are compared to the benefits of a "pure" relational DBMS. -t
- Note that this is no different from, say, importing WordPerfect files into Microsoft Word, manipulating them in Word, and then exporting them as WordPerfect. By your argument, Microsoft Word should be internally representing files as... WordPerfect... Files? Replace "Word" with "relations" and "WordPerfect" with "bags", and we have precisely your argument.
- I don't see the similarities. There's no "return" step, for instance. -t
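- To make the export-side conversion described above concrete, here is a minimal Python sketch (function and column names invented for illustration). The data stays a keyed relation internally; only the boundary machinery drops the keys, reintroducing duplicates for the bag-expecting recipient:

 import csv

 def export_without_keys(relation, key_cols, out_path):
     # relation: iterable of column->value dicts (internally duplicate-free)
     with open(out_path, "w", newline="") as f:
         writer = csv.writer(f)
         for row in relation:
             # Dropping key columns may yield repeated records -- by design.
             writer.writerow(v for c, v in sorted(row.items()) if c not in key_cols)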
You've shown that there is a need to import or link to files and tables with duplicate records.
- However, that doesn't require bags in the implementation of the RelationalModel linking to it. The linkage mechanism can generate row IDs.
- Sure, one can always add dummy ID's to de-bag anything. In fact, that's what most existing RDBMS do automatically. It's just that the ID's are not usable for cross-referencing. You are doing it twice and wasting resources, bordering on a OnceAndOnlyOnce violation. We are dancing with vocab here.
- [IDs not usable for cross-referencing are not IDs at all. You may believe the 'dummy IDs' are trivial, but they are quite relevant for semantics across joins. And protecting those semantics is relevant for supporting optimizations.]
- There has not been a need for cross-referencing identified yet. Perhaps the result table will be cleaned of bad entries and then summarized (Grouped-by) before we use it further. The internal keys are perfectly fine for that need. A bit of YagNi should be applied.
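- For the record, the Grouped-by style of de-bagging mentioned above can be sketched in a few lines of Python (toy data; duplicates coalesce into one tuple plus a count, yielding a proper set):

 from collections import Counter

 def debag_with_counts(rows):
     # rows: iterable of hashable tuples, possibly with duplicates
     return {row + (count,) for row, count in Counter(rows).items()}

 logs = [("dev42", "09:15:00.1"), ("dev42", "09:15:00.1"), ("dev07", "09:15:00.2")]
 print(debag_with_counts(logs))
 # e.g. {('dev42', '09:15:00.1', 2), ('dev07', '09:15:00.2', 1)} -- set order arbitrary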
You've shown that you'd like to be able to display relations without keys shown.
- However, that's a matter of presentation (how the data is shown -- as a table) rather than representation (a relation).
- I dismiss your excessively creative and ever-changing usage of "presentation". Your approach requires an extra step regardless of the label you slap on it.
- Regardless of your dismissal, the simple fact remains that bags are not required. Data can be represented using relations and displayed in a tabular format, sans keys, without any need for bag data structures.
- High-level languages and query languages are not "required" either, for it could all be coded with machine language. Languages and tools are to make our life easier and/or more productive, and if silly rules get in the way of that, then rid the [bleep]ers. Be careful about playing word games with "required" because I'll happily use your own "rules" against you.
- [Wrong, TopMind. RicesTheorem means you can't say jack about code written in an arbitrary TuringComplete language. The rules of a language allow you to say useful things about subprograms developed in it. That is the basis for everything else: optimizations (partial-evaluation, parallelization), availability, portability, security, termination, consistency and integrity, tolerance or graceful-degradation under partial-failure, resilience, extensibility, modularity, maintainability, and so on. HLLs and their 'silly rules' truly are "required" to meet any non-trivial set of user and developer requirements, especially once concurrency and communication get involved - i.e. when we are "required" to make correct and useful assertions about code, or perform optimizations. Productivity is useful too, of course, but if all we needed was productivity then we wouldn't need anything more than MachineCode and RealMacros.]
- No, you are putting words in my mouth. I meant implementation of an application, as in TuringEquivalency. Machine language can implement anything relational can. Of course tools, using abstraction, help out as far as saving coding time, etc. That I don't dispute. -t
- [I'm not seeing where I made an error about which words I put in your mouth. You continue to ignore NonFunctionalRequirements.]
- No, you do. Being bag-friendly improves some NFR's also.
- [How did you jump from TuringEquivalency and MachineCode to 'being bag-friendly', exactly?]
- How did you jump from "required" to automated code analysis?
- [Via requirements to meet any non-trivial set of user and developer requirements (especially once concurrency and communication get involved). For example, you won't be able to say much about anything if IO to an application accepts machine code. Thus, you must use languages that are restricted in useful ways - i.e. HLLs. Now answer my question.]
- [Extra syntax is needed to separate formatting from projection. You wildly exaggerate how much this extra syntax hinders you, in a disgustingly weak attempt to justify your 'BagNeedScenarios'.]
- How ironic, you "wildly exaggerate" the downsides of bags.
- [I acknowledge them. When I see some upsides of multi-sets, I'll be sure to acknowledge that, too.]
So, where are bags needed?
Nothing is "needed" in the absolute sense. Per above, I should have perhaps used the word "helpful" instead of "needed".
[You did use the word 'helpful'. We have yet to see a case where bags are non-trivially helpful (ooh, boy! I saved five characters relative to a particular syntax - and that's really a feature of the syntax and not bags!). We have many instances where bags are harmful in non-trivial ways (optimizations being the only one a narrow-minded bigot is likely to acknowledge, because duplicate-entry errors, resilience under distribution and concurrency, idempotence, and such can be dismissed with wild HandWaving).]
YOU are HandWaving. You have not demonstrated openly-dissect-able practical significant downsides. You just plain have not. Notions in your head don't count.
[Downsides of bags are the same as ever: potential for duplicate-entry errors, certain optimizations are not possible without a lot of extra context, lack of idempotence on concurrent or disrupted inserts and deletes, duplicates multiply across direct or indirect self-joins. Are you claiming these are insignificant? Or are you denying they are downsides? And Note: remember that this topic is about YOUR burden of proof: showing that bags have openly-dissect-able practical significant upsides.]
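[For anyone who wants to see the self-join multiplication rather than take it on faith, here is a toy Python illustration (data invented):]

 bag = [("dev42", "2012-01-05"), ("dev42", "2012-01-05")]   # duplicate log rows

 # Bag semantics: self-join on device ID yields 2 x 2 = 4 rows.
 bag_join = [(a, b) for a in bag for b in bag if a[0] == b[0]]
 print(len(bag_join))   # 4

 # Set semantics: the duplicate was never admitted, so the join stays at 1 row.
 rel = set(bag)
 rel_join = {(a, b) for a in rel for b in rel if a[0] == b[0]}
 print(len(rel_join))   # 1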
I gave 3 domain scenarios to dissect, your turn now...
[You say that as though they weren't already dissected and found wanting.]
Where are the 3 domain narratives?
[I am waiting for a specific, clear, objective denial from you. I already asked the questions above.]
Denial of what? I don't know if the problems you claim with duplicates are "insignificant". That's not my burden. I'm not the Chicken Little of duplicates. The burden of evidence is on Chicken Little.
[My questions are written above. I placed them in bold for you. The problem is that you have, in fact, been dismissing these issues as idealist nonsense. You cannot pretend now that you are uncommitted. I would like you to make your commitment clear, and objective, such that we have something to target with any scenarios we develop.]
I cannot answer them in terms of how often they are a practical issue, or the practical implications such as how often and in what circumstances they cause significant problems. I have insufficient knowledge to apply occurrence probabilities. I don't question that there are probably worst-case scenarios, but every tool has worst-case scenarios. The devil in computation of "significant" is in the probabilities presented by the domain and human psychology in terms of interacting with a given tool. -t
How often do you expect to have to import external data with no identifiable key, either via the ODBC/JDBC driver or in the data itself, such that it would not be possible to represent it without bags or relations with auto-generated keys?
How often do you expect to have to export or otherwise present data with any and all keys removed?
I thought you meant your scenarios for duplicates causing problems, not my scenarios. Let's make sure we are talking about the same questions before we continue. For the record, I would NOT call the problems/scenarios I listed here "significant". "Relatively minor" is probably the phrase I would use. Although it matters more in some shops/domains than others. -t
[TopMind, beyond the clear hypocrisy regarding attempts at 'worst-case scenarios', you and I define 'significant' differently. To me, a 'significant problem' is a systemic one - it cannot be solved in-the-large by local changes. (This doesn't ignore syntax; I consider inability to abstract, resulting in need for BoilerPlateCode, to be a systemic issue.) I am willing to force developers to (occasionally) change their habits to solve significant problems, and I see all forms of standardization as instances of this. To you, 'significant' seems to mean anything that forces you to change your habits or make local changes - i.e. it is a matter of "psychology" and "familiarity", and when you say 'significant' you seem to mean 'personally significant'. I find all of your scenarios to be insignificant: none demonstrates a systemic problem. Based on your response above, I suspect you would say the same of mine - at least until the implementation fails you, which might not happen if you don't stress the limits of concurrency, availability, scalability, and performance. The problem is, a lot of people aren't you, and databases really are stressed to their limits in practice. I suspect you have rarely ever had to rewrite large projects due to performance, concurrency, or architectural issues, but that is not uncommon in the industry. Scalability and concurrency issues have affected almost every large website, ever, often costing the site millions of dollars in traffic (e.g., eBay and Yahoo). Massive multi-player games must often run continuously, 24/7, which means there is never a moment for batch processing. Your self-centered 'psychology-oriented' viewpoint is, I believe, ultimately insignificant and impractical in the real world. Yet, I know you are prepared to dismiss the scenarios wholesale as 'relatively' insignificant by use of subjective assertions regarding probability and 'WherePsychologyMatters'. I won't do that dance with you again. I ask you to take a position that either (a) clearly and objectively rejects the points we have made (so you can't resort to your usual subjective nonsense arguments to counter objective claims), (b) clearly and humbly requests an education on a specific point (so you can't deny being a student on the subject), or (c) admits honestly that, while you might not fully understand the point, you simply won't care until it becomes personal by bashing you over the head (so we can at least drop the subject).]
First of all, if it's about "scalability" and high-end performance, why can't you create a reasonably-easy-to-describe scenario from such a situation that illustrates the power of "no bags"? You talk in general terms, lacking details and specifics. Enough brochure-talk; give 3 sufficiently-detailed scenarios. If you don't feel like it, fine, just say so. Just don't expect readers to take your claims at face value and then insult them for not reading your favorite library of obscure books. I did my part with my 3 scenarios and I expect you to reciprocate.
[Answer your own questions, seriously. You ask why I cannot offer a "reasonably-easy-to-describe scenario" that "illustrates" an issue such as "scalability". It's an issue of ArgumentByLabToy. Unless you understand the math, you won't understand the significance of the example. And, thus, describing the scenario is insufficient; I'd need to describe the math to you and, frankly, that isn't something I can reduce to your level of competence unless you were heavily motivated to understand the significance of the example. Considering that your agenda would be to deny the significance of the example, I don't consider you worth the effort. Besides, your three examples didn't actually demonstrate anything significant, even by your own admission. What is there to reciprocate?]
- PageAnchor: no_bags_in_this_one - You keep changing the gist of your argument. I don't propose that bags are always or even usually used. Only that it be an option. -t
- [You have yet to demonstrate the option is worth its inherent cost or complexity for the users and database implementors.]
- Yes I have, but this spot is about large-scale performance issues.
- [This page is about BagNeedScenarios, and you've yet to provide one (despite four attempts, thus far). I'll create a new page if I choose to provide any counter-scenarios.]
- Now if every table involved in a given query has a primary key, why would it suffer performance issues that a no-bags-allowed RDBMS won't? -t
- [Many problems would still express at intermediate tables, i.e. those produced by sub-queries, projections, joins, views. And I will not consider appeals to a SufficientlySmartDatabase or SufficientlySmartCompiler (of queries) to be realistic.]
- See PageAnchor "inherently_slow"
- You seem to keep thinking of the worst case; it's stuck in your head. When people are paranoid about X they tend to magnify the likelihood of X in their mind. -t
- [True. I should note that the same applies to you, except that your 'worst case' concerns appealing for BagNeedScenarios are not significant: even to you, they are "Relatively minor"; to us, they are "trivial".]
- You have argued that it would complicate the compiler/interpreter to have two sets of operators: one for bags and one for sets. I'll accept "more complicated interpreter" as a legit downside. But why would the mere existence of two sets of ops cause detectable performance problems by itself? (see example "I-love-options") -t
- [The performance issues are associated with optimizations across intermediate query results. Performance issues arise because one cannot know, except from context, whether the results of an SQL projection should be unique. This is especially the case for named intermediate queries, i.e. when one references a view or a procedure or a result-set whose implementation is otherwise abstracted behind a reference. (For SQL, you can add more performance issues that arise because of ORDER BY statements and the unclear meaning of invoking ORDER BY prior to a join, union, intersect, difference.)]
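- [A minimal sketch of the kind of rewrite at stake (Python; the rule and names are illustrative, not Rel's actual optimizer): a de-duplication pass after projection can be dropped only when the input is known to be a set and the projected columns retain a candidate key - knowledge that bag semantics withhold.]

 def can_drop_distinct(projected_cols, candidate_keys, input_is_set):
     # candidate_keys: list of keys, each a set of column names
     return input_is_set and any(key <= set(projected_cols) for key in candidate_keys)

 print(can_drop_distinct(["id", "name"], [{"id"}], input_is_set=True))    # True: skip de-dup
 print(can_drop_distinct(["id", "name"], [{"id"}], input_is_set=False))   # False: must de-dup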
- [Also, those complexity issues apply to both the interpreter and the users of the DBMS, who must be more aware of which options were used, and where, especially when a large number of users are involved building a large project piecemeal. The complexity will, thus, also affect maintenance... i.e. because to reason about the system users must know which 'views' and 'procedures' and such introduce semantically significant duplicates or ordering somewhere deep in their guts, and because one must get to the underlying data in order to reuse the data for different views (especially if those 'different views' require joins).]
- I suppose the answer could be that internal table formats designed for both bags and sets are less efficient than those designed for just sets. However, if a given table has a primary key, it could be implemented using the set-based layout. The RDBMS could offer a copy or conversion utility to convert the hybrid-oriented internal format to the set-based one. Yes, it's more coding to have the option, but not an impediment to the alleged efficiency of the set-based format. -t
- [The different internal table formats were not a consideration on our part, though they are a complexity concern.]
- And I doubt there is such a choice anyhow because an internal surrogate key could be created by the DB engine to make a bag into a de-facto set. -t
- [This is not true. A hidden key - specifically, one unavailable for joins and that will not survive intermediate operations (projections, unions, differences, other joins) - is not a substitute for uniqueness constraints on a set.]
- It's still a set, just one with limited abilities.
- [Even if you insist that it is a 'set' based on some implementation detail, it certainly is not a Relation or 'set' with regards to the RelationalAlgebra operators, and therefore your argument is ultimately equivocation: you are using the word 'set' in a very different sense or context than is appropriate for this topic. I suggest you let it drop.]
 # Example: "I-love-options"
 # (Hypothetical engine-internal dispatch; names are illustrative.)
 if all(table.is_set for table in query.tables):
     run_set_based_optimizer(query)   # pure-relational fast path
 else:
     run_hybrid_processing(query)     # some inputs are bags
Second, there's a lot more to app development than BigIron apps. There are lots of small- and medium-sized niches out there where machine performance is not the primary cost/resource factor, but rather software creation and maintenance effort. It doesn't make sense for these to adopt practices meant to cater to "the machine" over "the human" if "the human" is the most expensive resource. PickTheRightToolForTheJob.
[Indeed. And, yet, you've offered no reason to believe that the bag-friendly approach is more suitable for these domains. But true RelationalModel promises to prevent duplicate-entry errors that might cause double-purchases and credit-card expenditures and such, and thus reduces maintenance and compensation issues. That promise seems helpful to me, as a human.]
- I gave 3 scenarios and I believe demonstrated my point sufficiently to the reader. If you disagree, so be it. Let's LetTheReaderDecide now rather than repeat the same arguments over and over. It's getting old. -t
- [If you pretend to write only for people who never respond, you might as well go preach your views to a cactus. I am a reader. Don't ever forget that.]
- Produce a specific scenario demonstrating your "double purchases" then. And please heed the warning above about common usage of bags versus having it as an option. -t
- [You're segueing into an assertion that: "if it's a problem there, don't use it there! bags are optional!" Well, I won't deny that is true, but it is also vacuous (i.e. I don't need to give you a scenario before you say it). But I consider it unreasonable to assume foresight of problems. Issues exist whether or not users know about them in advance of using them. Start by assuming imperfect foresight, alongside other imperfect things (i.e. imperfect network connections; EightFallaciesOfDistributedComputing; not-especially-smart optimizers limited to local information; etc.). Once you assume a realistic context, you cannot reasonably appeal that a feature is 'optional'. Instead, you must wonder how likely it is to be used ineffectively or incorrectly in practice, especially in a multi-developer environment with long-term maintenance life-cycles, and weigh those probabilities against the potential benefits of correct and effective use.]
- [In any case, until we see a case demonstrating need for bags, there isn't any real burden to demonstrate their down-sides. (After all, that's like asking us to demonstrate the down-sides of adding a SnuspLanguage interpreter to the RDBMS, when we have yet to find a good reason to include one.) BagNeedScenarios does an awful job of demonstrating any benefits whatsoever for bags, given the straightforward integration solutions. (And, indeed, every one of the BagNeedScenarios is BeggingTheQuestion by assuming a bag is needed as part of the system in order to prove a bag is needed as part of the system!)]
I spend a lot of effort producing and managing "marshalling code", for example. That is the interface and conversion between systems and apps. Tools that simplify that hit one of the bigger time-suckers in my world. -t
[Yeah, I also do a lot of serialization and parsing. I do not believe the RDBMS is the right place to tackle this problem (since it hits a lot more tools than the RDBMS). But, until a better solution is a commodity, it sure would be nice to have an RDBMS with dedicated import/export utilities flexible enough to support a variety of common and ad-hoc formats. Why, it would be a shame to depend upon inflexible vestigial DBMS features that have potential to cause scalability, performance, and maintenance issues. (And what's with all those unfriendly gotchas? Why can't I abstract my SELECT INTO for use within another SELECT, or use it in a VIEW?)]
Once you learn that certain features are only available at the final/end query, then it will no longer confuse you. If that's the kind of stuff that trips you up, then perhaps you should consider another career, like grumpy high-school math teacher.
Would it make you more comfortable to have a differently-named operation/clause for "output selects"? "outSelect", or "format" or something?
[Why yes! What a fantastic idea, TopMind! If you're digging for names, I'll suggest 'EXPORT'. Maybe something like EXPORT myTable ORDER(ASC a) EXCLUDE k FORMAT CSV TO "myfile.csv". But, once I had an EXPORT, it would seem like SELECT was a bunch of LanguageIdiomClutter - two operators attempting to do the same thing. So, it would probably be best that EXPORT can work with tables produced by embedded SELECT statements. Why, something like this would have kept me from ever getting tripped up, keep me from accidentally using the wrong feature at the wrong place.]
- Rather than very different names for two very similar things, why not make it an option switch, such as "-dupallow" or something (in command-line speak)? -t
- Sorry, what "two very similar things" are you referring to here?
And, I don't expect a query language to offer super-fancy rainbows, flowers, and parsley; but I do expect it to provide the basics of common export and output, including "showing" only the columns I ask it to show.
That's what Rel does. Just don't expect the projection operator to be overloaded such that it both emits duplicates and doesn't emit duplicates. Projection, like every other RelationalAlgebra operator, does one and only one thing. Other statements are used for importing/linking and outputting data.
I vote overloading. Unity of activity matters more than unity of "type" (type relational and type bag), in my opinion. I wish to focus on the task/duty at hand, not on noun taxonomies of the underlying parts. If you disagree, so be it.
Top advocates polymorphism! Hell experiences cold snap; freezes over!
That's an implementation detail. The user of the query language won't see whether polymorphism or IF's or whatnot is actually used under the hood. Whether there are even distinct "types" may be an implementation detail (HAS-A-primary-key versus IS-A-bag).
[Polymorphism is not an implementation detail. The implementation detail is whether the polymorphism is implemented with if-conditionals or a VTable or whatnot.]
Please clarify. There are not necessarily multiple "types" here.
[Implementation details are exactly the set of things I cannot distinguish in terms of FunctionalRequirements. Polymorphism, when it exists, is never an implementation detail. (Potential point of confusion: you seem to be assuming that 'polymorphism' means 'subtype polymorphism'. OperatorOverloading is technically a form of polymorphism.) Now, whether or not polymorphism exists in this case is arguable; I just wanted to comment, more generally, that you misused the phrase "implementation detail" by applying it to a feature that is by its own definition part of the interface.]
It can be both, but let's save that squabble for another day if it's not material to this.
[Interface vs. implementation may be layered, true, and thus are relative to position within the stack or protocol. But you still misused the term.]
I disagree, but put it in the haggle-queue for another day if it's not pivotal to this topic.
Style note: I will sign my messages with "-t" rather than rely exclusively on italics etc. I may also mark the prior "Anonymous donor" to separate it. This topic is getting too long for me to track where I use italics and where I don't. It grew out of sync for some reason.
Top wrote:
 # Example: "I-love-options" (internal to DB engine)
 if all(table.is_set for table in query.tables):
     run_set_based_optimizer(query)   # pure-relational fast path
 else:
     run_hybrid_processing(query)     # some inputs are bags
Yeah, I could provide that in the RelProject. What would a user gain from all that additional development effort on my part, including considerable thought to define hybrid optimisation strategies that are provably less performant than strictly using relations? Nothing. What would a user gain from learning the distinction between "duplicate-emitting" projections and "non duplicate-emitting" projections? Nothing. What would a user gain from having to learn not to make mistakes by choosing these incorrectly? Nothing. What would a user gain from avoiding all that complexity? Much. Users always benefit from not being subject to unnecessary complexity.
What does a user lose from having to use relations instead of bags? Nothing, but she might occasionally have to employ an extra keyword when exporting data in order to eliminate keys and retain duplicates.
Top, you have presented no justification for representing data in the database using bags. If you have a good reason to represent data using bags -- as distinct from importing, exporting, or displaying it -- then you may be onto something. Otherwise, you have only defended what appears to be a marginal, infrequent need to import data containing duplicate records, or output (including displaying) data containing duplicate records. In the former case, detecting the need to add an auto-number key can be automatic. User effort: 0. In the latter case, excluding keys from the output (regardless of output mechanism) requires a keyword or two. User effort: negligible.
Therefore: There is no reason to maintain data in the database using bags, even as an option, hence "hybrid processing" is pointless.
I'll return to these questions later. But first, let's finish up addressing your performance claims, the main reason for that pseudo-code sample. -t
See the example near the top of BagAtational, which identifies a significant optimisation transformation available to relations that is not available to bags. There are others.
PageAnchor no_bags_in_this_one has NO BAGS. You are "solving" the wrong problem. -t
Sorry, not following you here. Your pseudo-code is clearly selecting on bags vs sets.
That's the point. Your lovely set-friendly optimization is still available if a given query has no bags in it. Determining such is not part of the tight inner loops of actual query execution, just one-time selection(s) up-front.
So... Let me get this straight: You want me to implement an entire bag-oriented query optimiser or a hybrid bag+relation query optimiser, which will inevitably deliver poorer performance than the pure relational query optimiser, and require that users understand the performance and possible-inadvertent-duplication implications of using bags and learn a set of options to choose bags vs relations, all to avoid having to learn as little as one keyword to eliminate keys and preserve duplicates on those exceptional occasions when a client demands output stripped of unique IDs? That's what you're asking for, because you've certainly not shown any justification to represent data using bags within the database. --AnonymousDonor
I see no reason why there has to be two sets of explicit operators in the query language. The other issues I will address later. -t
There is a fundamental reason for having two sets of explicit operators in the query language. It's because they are completely distinct operations, more different than '+' and '-'. One is relational projection, a well-defined mathematical operator on a relation that returns a relation. The other is an option on an export/output statement that removes specified columns from the emitted data. It isn't even an operator, really. They are as different as chalk and cheese.
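To see the two operations side by side, here is a toy Python illustration (data invented):

 rows = [{"k": 1, "city": "Oslo"}, {"k": 2, "city": "Oslo"}]

 # Relational projection: the result is a relation, so duplicates collapse.
 projection = {(r["city"],) for r in rows}
 print(projection)    # {('Oslo',)} -- one tuple

 # Export column-trim: records are emitted as-is, duplicates preserved.
 trimmed = [[r["city"]] for r in rows]
 print(trimmed)       # [['Oslo'], ['Oslo']] -- two records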
Fancy math under the hood is an implementation detail.
That's like claiming the difference between addition and calculating a square root is an implementation detail. It's a fundamental conceptual distinction, and it's explicit, not "under the hood". -AnonymousDonor
No, it's conceptually very similar, at least when looking at tables through "domain eyes". Even the few times I have made one of those alleged planet-ending "duplication errors", the results were almost indistinguishable from what "proper" key management would have returned. -t
Yet, if you try to apply that "conceptual similarity" in a naive fashion, you quickly discover that a "projection" that returns duplicates causes errors when used with relational operators. In other words, a superficial and almost coincidental commonality, i.e., "pick some columns", disguises a more fundamental conceptual distinction. Claiming that they're "conceptually very similar" is as silly as claiming that '-' and '/' are "conceptually very similar" because both symbols use a single, straight line.
Concepts are in the head. The computer doesn't otherwise care; it just follows rules. And it's not coincidental: there are certain similarities in many if not most data-structure operations among different data structures. And "disguising" is sometimes a good thing; it's part of abstraction, in fact: disguising complex things as simple things in order to give us a UsefulLie can simplify our work and grokkability. The key is finding the balance/threshold where there is too much abstraction. And your "straight line" analogy is poor. Try again. -t
By the way, wouldn't life have been better if you couldn't have made duplication errors at all? Or can you simply not abide typing an extra keyword for those three or four times in an entire career that you have to emit duplicate records sans keys?
Many if not most of them could be reduced by minor adjustments to SQL, such as UNION defaulting to meaning UNION ALL, with something like UNION UNIQUE for the other alternative. This would be more consistent with SELECT's "bag first" behavior. SQL has some stupid defaults. Whatever replaces SQL hopefully has smarter defaults (if bag-able).
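For reference, the two default behaviours in question, sketched in Python:

 a = [1, 2, 2]
 b = [2, 3]
 print(a + b)              # bag union (SQL's UNION ALL): [1, 2, 2, 2, 3]
 print(set(a) | set(b))    # set union (SQL's plain UNION): {1, 2, 3}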
And I will live with occasional "purity errors" if it sufficiently simplifies common code. For me, errors caused by MISreading long-winded code are on the average greater than those caused by "purity" faults. I like purity too, but not at the expense of error-causing lack of expressiveness. It's all about weighing trade-offs.
[SQL is really long-winded. That means, if I'm following TopMind's logic, that bags are bad. I hate it when impurity causes all this verbosity. Really, we should trade-off: more purity, for less verbosity. That's what Rel did.]
No, it pushed conversion/interface issues to other tools and limited its ability to use existing RDBMS engines.
- [Surely you jest. To which tool did Rel push its IMPORT and EXPORT issues? And precisely which existing RDBMS engines does Rel have limited ability to interface with?]
No, it didn't. It was one of your SQL examples that pushed conversion/interface issues onto TOAD, as I recall. Specific examples have shown that Rel incorporates import/link/export capabilities that SQL does not. Rel is unlimited in its ability to link to existing RDBMS engines. If a JDBC driver exists, Rel can connect to it.
By "use" I meant hook into and act as, or even be, a native query language, not merely import or export, or something in-between and goofy like MS-Access's "links". -t
I do not understand. What does it even mean to say one RDBMS engine has limited ability to "use" another in that sense? I would not blame SqLite for being a bad query language for Oracle. Are you confusing Rel with TutorialDee?
--AnonymousDonor
Let me see if I can clarify this. There is the (semi) standard "SQL" and vendor-specific dialects. SQL the semi-standard can more or less be used on all the RDBMS. Any challenger to SQL should strive for something similar: a common base language that could potentially work on all the vendors' RDBMS with hopefully minimal fuss.
People don't want to toss their existing RDBMS and start over if by chance a better query language started displacing SQL. They would want their RDBMS to be able to use the new language.
In fact, as long as the replacement language is not protected by some proprietary restriction, vendors would end up adopting some form of it for their engines anyhow. Thus, if TD/Rel took off, vendors would create a version that works on their existing engines whether you want them to or not. And, this may include adapting it for bags. You cannot stop the baggification if and when TD/Rel becomes successful.
In some ways this can all be compared to RISC machine language: the momentum of existing CISC was just too much such that RISC appears to have lost the race even though it was theoretically cleaner (although ARM is still hanging around nicely in small-device niches). IBM and HP/DEC made a pretty good try at RISC, but have since pulled the plug because they couldn't get sufficient performance boosts to compete with x86 due to the economy of scale of x86.
I'm not talking about the theoretically best technology here; I'm talking about the likely behavior of the market. Thus, if you want to control how TD/Rel works with bags, do it early, or somebody else will do it for you.
--top
[The obvious solution would be simply to import/link tables from SQL data models using the same (to the user) mechanisms it does for remote tables, albeit with a slight tweak for the local URI. If TD/Rel becomes successful, it will be in part because people acknowledge the value of its purity. I agree on general principle that one should plan for success. However, I do not believe this to be a problem scenario.]
"Simply"? Non-native "linking" has not proven efficient and reliable beyond trivial data needs.
return linkNative(L) if uriIsLocalToRDBMS(L) else linkRemote(L);
[There's your code. It is simple. No need for HandWaving about efficiency or reliability. No need for Oracle or MySQL to invent a new syntax for linking, or introduce extra options to the user, should they ever follow up on that (slim, but gratifying) possibility of implementing RelProject's variation of TutorialDee to ride DV's coattails. Well, they might change it anyway, for VendorLockIn, but they do not need to do so.]
I've been working on this sort of thing for years, including developing conventional multi-database driven clients via ODBC/JDBC, heterogeneous database systems, ETL systems, database IDEs, and now support for linking external DBMSes within Rel. Two things based on that experience:
- I could happily provide Rel with no links to external DBMSes, and I doubt it would have any significant effect on its popularity. Note that MySQL, PostgreSQL, DB2, and numerous others provide no direct links to external DBMSes. Linkage is a bonus feature, not a requirement.
- Non-native linking that is highly performant, integrates with optimisation strategies, and is reliable requires a non-trivial implementation in terms of programmer effort, but it doesn't involve any novel techniques. As described above, it is not a problem. Dispensing with bags in favour of relations is by far the least worry. Converting linked bags (i.e., external DBMS tables with no identifiable key) to relations is trivial.
Doing a join across two linked tables can be a bear, because the look-ups cannot be pushed to the original host without some clever parsing. One runs into a lot of problems similar to RemoteIterationIssue.
Please define "can be a bear", i.e., is it difficult? Is it impossible? Does it not work reliably on all DBMSes? Etc. What do you mean by "the look-ups cannot be pushed to the original host without some clever parsing"? Define "clever". While you're at it, "clever parsing" of what? What do you mean by "lot of problems"? Etc. The search phrase you're looking for is "distributed heterogeneous databases." Happy reading, and please come back only when you've done some of it.
Are you saying that linked tables have just as much access to indexes and host-side filtering that native ones do? I'm not in a mood right now to produce a scenario demonstrating the inefficiencies and limits of linked tables. Maybe another day. -t
- [You seem to assume that "linking" must be implemented by a specific driver (such as ODBC or JDBC). That is your error. Under your assumption that "vendors would create a version [of TD/Rel] that works on their existing [SQL] engines", the question is how a TD/Rel data model could efficiently import data from an SQL data model that happens to be stored inside the same RDBMS engine. The answer is: no changes are needed to TD/Rel; the assumed import syntax that works for remote tables will work equally well for local tables. The underlying driver for the linking can be selected to take advantage of the table's locality. Both identifying the locality of the table and developing a specialized driver for linking local tables are entirely tractable problems. Optimizing complex queries involving remote databases can be much more difficult, especially if it involves joins with local tables. That problem is related to automated code distribution, which (IMO) should be solved more generally.]
I'm not sure what point you're trying to make in this context. There appears to be some misunderstanding, but let me attempt to clarify some things:
There are standard ways of linking to DBMSes. ODBC and (to a lesser degree) JDBC are the standard APIs supported by most DBMSes, but there are also vendor-specific APIs like Oracle's OCI and Microsoft ADO.NET. All of these work the same way: Statements in a database language -- typically some SQL dialect -- are sent via the API to the DBMS and result sets or error and/or status messages are returned. All distributed heterogeneous database systems access individual DBMSes via these APIs. As such, no heterogeneous database system has any more or less "access to indexes and host-side filtering" than any other. Note that Rel, by providing links to external DBMSes, is a heterogeneous database system (i.e., it can link to a variety of external DBMSes) in addition to providing native tables aka RelVars. Some other DBMSes provide this facility. Oracle, for example, provides Oracle Generic Connectivity and Oracle Transparent Gateway to (relatively) seamlessly connect to external DBMSes in addition to using its native table format.
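For readers unfamiliar with the pattern, a minimal JDBC round trip looks roughly like this; the URL, credentials, and table are placeholders, and a suitable driver is assumed to be on the classpath:

 import java.sql.*;

 // A database-language statement goes over JDBC; a result set (or error/status,
 // surfaced as SQLException) comes back.
 public class JdbcRoundTrip {
     public static void main(String[] args) throws SQLException {
         try (Connection c = DriverManager.getConnection(
                  "jdbc:postgresql://host/db", "user", "pass");
              Statement s = c.createStatement();
              ResultSet rs = s.executeQuery("SELECT name, address FROM customers")) {
             while (rs.next()) {
                 System.out.println(rs.getString("name") + " | " + rs.getString("address"));
             }
         }
     }
 }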
Now, it is always the case that a local connection to a native database is going to have access to internal structures that a remote connection will not. That much is obvious. However, for the users of the heterogeneous database system this makes no difference. For example, given a Rel system, assume the following two RelVar definitions (simplified for illustration):
VAR myVarNative REAL RELATION {name CHAR, address CHAR} KEY {name};
VAR myVarExternal JDBC postgresqldb TABLE customers MAP {name CHAR(thename), address CHAR(addr)} KEY {name};
A user can query either myVarNative or myVarExternal using TutorialDee syntax, for example:
myVarNative WHERE name = 'Dave'
...or...
myVarExternal WHERE name = 'Dave'
The former will perform the query using Rel's internal query processor, optimiser, and data storage and retrieval mechanisms. The latter will be translated by Rel into "SELECT DISTINCT * FROM customers WHERE thename = 'Dave'" and executed against the appropriate external DBMS. Therefore, the external DBMS's query processor, optimiser, and data storage and retrieval mechanisms will be used. In both cases, the performance of the query will depend on the performance of the relevant query processor, optimiser, and data storage and retrieval mechanisms. In neither case does the existence and/or use of bags or relations (or the lack of either anywhere in any system) have any relevance at all, and all heterogeneous DBMS systems work this way. Rel is no different from any other heterogeneous database system. From a client's point of view, the only difference between Rel and (say) Oracle is that Rel uses Rel's TutorialDee dialect and Oracle uses Oracle's SQL dialect.
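To make the translation step concrete, here is a toy sketch in Java (not Rel's actual translator; all names are invented, and quoting/escaping is ignored) of how a restriction on an externally mapped relvar could become the SELECT DISTINCT shown above, with the DISTINCT being what turns the external bag into a relation:

 import java.util.Map;

 public class RestrictionToSql {
     static String translate(String table, Map<String, String> attrToCol,
                             String attr, String value) {
         return "SELECT DISTINCT * FROM " + table
              + " WHERE " + attrToCol.get(attr) + " = '" + value + "'";
     }

     public static void main(String[] args) {
         Map<String, String> map = Map.of("name", "thename", "address", "addr");
         System.out.println(translate("customers", map, "name", "Dave"));
         // prints: SELECT DISTINCT * FROM customers WHERE thename = 'Dave'
     }
 }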
The issues generally start to show up during JOINs, not single-table operations. Again, I'll provide more details some other day. Think about this: where does the primary processing of joins happen, on the client or the server, and why?
In the worst case -- e.g., "myrelvarx JOIN myrelvary" where myrelvarx and myrelvary reside in different databases and are of similar cardinality -- the entire tables must be retrieved from their respective databases and JOINed within the heterogeneous system. This is an issue common to all distributed heterogeneous database systems. It is certainly not unique to Rel or true relational systems and (again) has nothing to do with bags vs relations. This is now wandering off the topic of this page.
Even if the two tables are on the same host, it still may have to retrieve entire tables if you link instead of doing it natively. This would put a link-to-translate-bags language at a disadvantage to a native language if one or both is a bag-table. (Damn, you suckered me into this sub-debate.)
If you're JOINing tables on different databases, whether on the same host or not, you may have to retrieve entire tables.
If you're JOINing tables on the same database, you can always employ the native JOIN defined by that database's DBMS.
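Expressed as a sketch (invented names, plan rendered as a string purely for illustration), the placement rule just stated: push the JOIN to the source when both tables share a database, otherwise retrieve both and join inside the heterogeneous system.

 public class JoinPlacement {
     static String plan(String dbA, String tableA, String dbB, String tableB) {
         if (dbA.equals(dbB)) {
             // One statement; the source DBMS's optimiser and indexes apply.
             return "pushdown: SELECT * FROM " + tableA + " JOIN " + tableB + " ON ...";
         }
         // Worst case: two full retrievals plus a local (e.g. hash) join.
         return "local: fetch " + tableA + "; fetch " + tableB + "; hash-join here";
     }

     public static void main(String[] args) {
         System.out.println(plan("pg1", "x", "pg1", "y"));  // pushdown
         System.out.println(plan("pg1", "x", "ora2", "y")); // local join
     }
 }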
- And how would this take place over a typical connection? TD/Rel translates itself into SQL, and send the SQL to the database?
- Yes. SQL is always needed to communicate with a SQL DBMS. That's, y'know, why they're called SQL DBMSes. SQL is an adequate low-level language for communicating with legacy DBMSes, though obviously you wouldn't want to use it directly. Similarly, in future Rel releases I intend to support sending and receiving SOAP messages to communicate with WebServices, use the IMAP protocol to communicate with email repositories, and so on.
- I wouldn't call it "low level". With single tables, generated SQL can be easy. However, with joins and INs and filters and grouping expressions etc., it grows increasingly non-trivial. A native query language is the better choice.
- SQL is "low level" compared to other database languages. Generating SQL from TutorialDee is conceptually trivial, regardless of whatever developer effort is required to create the translator. It requires no grappling with difficult mathematics, complex algorithms, squirrely data structures, solving NP hard problems, etc. It's merely a translation issue and a straightforward one at that. You say "a native query language is a better choice." A better choice for what? Do you disagree with the very existence of heterogeneous database systems? The alternative is manually writing SQL (and possibly other database languages) to access a diverse collection of databases.
- That's looking like the better option right now.
- Based on what? How do you define "better"? Do you also feel hand-assembly is "looking like the better option right now" compared to using compilers? For consistency's sake you'll have to answer "yes", because that's precisely what you're claiming.
- SQL is assembler? I don't think so.
- Certainly it is. Is SQL not the lowest level or most primitive language by which you manipulate a typical BagAtational database system? Can you not create higher level languages -- like Rel/TutorialDee and Alphora's Dataphor -- that generate SQL to manipulate BagAtational systems? Therefore, SQL is the "assembly language" of such systems. But this, of course, is stretching the definition of "assembler" and is not the point. More to the point, you haven't answered my questions: What is your basis for deprecating heterogeneous database systems, and how do you define "better" when you claim that coding multi-database systems is "better" done by writing in 'n' different database language dialects instead of just one?
- I am not against heterogeneous database systems, but merely concerned about the inherent drawbacks and limits of mixing. If TD/Rel can magically bridge the gaps, that would be great, but I am skeptical.
- What are the "inherent drawbacks and limits of mixing"? Why are you skeptical?
- I believe there will be too many "impedance mismatches" between SQL and your idealistic view of RDBMS. When stuff goes wrong, a lot of head-scratching will come about to figure out why, as one tries to trace the conversion steps bit by bit, having to invest in learning your conversion techniques. SQL is not a low enough level to serve as a base assembler; it has tons of idiosyncratic little rules. With a gajillion hours of painstaking programming and trial-and-error, it may be possible to reach a usable state of translation, but that's not likely without a better sales pitch than duplication purity.
- Interesting. What is your basis for your conclusions? Do you have references, specific experience in database language translation, etc., that you can relate? Or are you speculating? By the way, providing heterogeneous database features in Rel is pure bonus. It is not a requirement, and I'm only doing it because I've had a number of requests from people who would like both heterogeneous database capability and a true relational language. Note again that what you call "duplication purity" has nothing whatsoever to do with this threadlet. Hence, your claim that a "usable state of translation" is not likely "without a better sales pitch than duplication purity" (Huh? How does one follow from the other?) demonstrates nothing but your lack of understanding.
- You are not providing anything sufficiently better than SQL to attract enough programming effort to make it usable with existing RDBMS, and it won't gain widespread acceptance without the ability to work well with existing RDBMS. A niche product, maybe. If I were going to "take on" SQL, I would make something that can become a native language on most RDBMS engines. Linking is not a sufficient replacement for native unless you have translators that will end up far more complicated than the new language itself. Living with bags and nulls makes it easier to be native.
- A number of database researchers, developers and users believe not having to live with bags and nulls makes it easier to work with databases. Note that TheThirdManifesto is popular enough to have warranted three editions plus ancillary publications. Note that Date's AnIntroductionToDatabaseSystems -- many of its examples are written in TutorialDee, and it advocates avoiding NULLs and duplicates -- is considered the definitive database text. These are not marginal views I'm presenting, but a significant movement, reflective of ideas that many database practitioners have been demanding for years. We've shown here how importing/linking bags into a pure relational environment is trivial. I've shown elsewhere that mapping NULLs to values described by a type definition is trivial. So, your claim that "living with bags and nulls makes it easier to be native" is false. It is entirely possible for a pure relational system to "work well with existing RDBMS." The process for translating TutorialDee to SQL is straightforward. I've been working on it; your speculations about its complexity are baseless. I'll admit it requires some programming effort, but there are certainly no intractable, or even difficult problems. It merely requires coding gruntwork. I can't follow the rest of your comment; it seems to be a confused mix of idle speculation, grumpy guesswork and pointless quibbling. By the way, you have still not shown why bags are of value in that all-important region between linking/importing data with duplicates and outputting it.
- You've only shown it trivial for trivial examples. And, I find your null "work-around" silly. Anyhow, I'll let you build it and let the marketplace test it and we'll see how it goes from there. No need arguing any further.
- Converting bags to relations is the same trivial mechanism regardless of example, and whilst you may find the "null 'work-around' silly", others regard it as a reasonable way to add semantic content and consistency to a mechanism that in SQL lacks both. However, I agree with your point about the marketplace. Ultimately, it will decide regardless of either of our views.
- As a side question, do you assume analysis of the schema, such as primary key definition? Otherwise, how do you know you are dealing with a bag or not? Security limitations, view usage, and bureaucratic name space management (Oracle cough cough) can sometimes make schema info difficult to acquire. -t
- There's a JDBC metadata method to obtain primary key information given a table name. Some JDBC drivers throw an exception to indicate it's not implemented. :-P In that case, all bets are off and the database developer can either manually identify the key via a keyword or rely on Rel to automatically supply an auto-generated surrogate key.
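The metadata method referred to is java.sql.DatabaseMetaData.getPrimaryKeys(). A minimal probe might look like the following sketch; as noted above, some drivers throw SQLFeatureNotSupportedException here instead of returning rows:

 import java.sql.*;
 import java.util.*;

 public class PrimaryKeyProbe {
     static List<String> primaryKey(Connection c, String table) throws SQLException {
         List<String> cols = new ArrayList<>();
         try (ResultSet rs = c.getMetaData().getPrimaryKeys(null, null, table)) {
             while (rs.next()) {
                 cols.add(rs.getString("COLUMN_NAME"));
             }
         }
         return cols; // empty => no declared key; treat the source as a bag
     }
 }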
- If JDBC is unable to indicate the primary key, but you know it exists, is there a straightforward way to "force" it? A DBA may not be so quick to hunt down the problem for you unless it's a proven show-stopper, especially if you're using a "weird" query language. For example, suppose a product catalog has the primary key of vendor_ID + product_ID. You are doing an incremental refresh (...WHERE last_changed >= $target_date...) rather than a complete download, such that a temporary surrogate key won't work. I suppose one could do a GROUP-BY with a duplication count (which should always be one), but that's an extra processing step that a bag DB wouldn't need. And you have to find a way to toss the duplication count if you don't want to keep unnecessary baggage. Does the "keyword" you speak of actually check the claim?
- You "force" the primary key by using the KEY keyword to specify it. E.g., "KEY {vendor_ID, product_ID}". If the key can be determined via the JDBC driver, it will be checked against the KEY specification, if there is one.
- Out of curiosity, what happens if one lies? In other words, the given keys are not really unique.
- Good question. I will delete, unannounced, a random tuple from a randomly-selected RelVar (though the selector will be weighted toward picking RelVars with 'payroll', 'paycheque', or 'payment' in the name) whenever the user lies. That, and occasionally randomly alter an attribute of type Password. That will discourage lying. *Joke!* Actually, I haven't yet decided what to do in that circumstance. It's one of a number of as-yet unanswered questions on my plate.
- Note that if you do manage to make bag-to-set/set-to-bag conversion easy, then people will still create de-facto bags with a surrogate key, and make most of the same "duplication mistakes" they've made before. Making the compiler happy and doing the "right thing" may not be the same. Same with nulls. I remember back in my FORTRAN days, people would invent de-facto nulls by using "magic numbers" such as -1 or -9999 as stand-ins for nulls.
- Careless people will always find a way to injure themselves with a tool, whether it's a top quality tool or cheap rubbish. Is that a reason not to make good tools?
- No, but you can only take automated spanking so far.
- What do you mean? In most jurisdictions, it's still considered illegal to make fatal error messages that are actually... Fatal.
- Just plug it with a useless auto-key
- Huh? Now you're just quibbling.
- No, it illustrates the blowback in your argument: if switching between bags and sets is so easy, then the downsides of bags can "leak" into the set world. I think you'd agree that sticking an auto-key onto a bag does not necessarily remove many of the problems caused by "bagginess". -t
- [It does remove many problems caused by bagginess, though you are right that it leaves a few (such as duplicate entry errors). You could remove the latter by instead generating unique identifiers at the sender's side - such as (cash-register ID, store number, date) - instead of depending upon auto-keys. Indeed, I fully embrace this approach. I am quite convinced that 'autonum' is a mistake of its own, in part because you cannot easily describe multi-table inserts with references to foreign autonum keys. Generating identifiers at the sender - even if those IDs were large random numbers - prevents sends repeated across a network disruption from resulting in duplicated inserts on the Database side. Similar techniques can be used to protect against pressing the 'Submit' button more than once and such.]
- Complex compound keys can be laborious and verbose to work with. I don't miss them.
- [They aren't nearly so much trouble if the query and data manipulation languages support composite structures, such that tuples can be passed around within one column.]
- Most languages and transport mechanisms don't support those well, and such nested-ness could be argued to be anti-relational. And technically it's not much different from a surrogate key anyhow; it just becomes an object reference instead of an integer key: an obfuscated surrogate in some ways. -t
- "And such nested-ness could be argued to be anti-relational..."? By whom? Where do you get this stuff? Or do you just make it up as you go along? As for your "surrogate key" argument, please re-read the previous comment -- you seem to have mis-read or misunderstood something.
- As far as the cash register example, there's no guarantee that a given cash register or its local network won't burp and send duplicate records and/or reuse keys even if a domain key is possible. Many companies and vendors are cheapskates and don't want to spend a lot to keep equipment in good condition or trace down infrequent bugs, meaning more work for IT to clean up crap data. I've been down that road many times. They are a little better if money is involved, but not much. In their mind it's not worth $5,000 to track down a lost $100 if the risk of getting sued is small or somebody else's problem. Related: AutoKeysVersusDomainKeys.
- [Indeed, there are other sources of error. Spelling and typographical errors would be another common one. One should tackle problems based on one's ability to do so (known cause, fixable cause) and relative severity. Many 'bagginess' issues are eliminated by use of surrogate keys. Of the remaining issues, many more are eliminated by favoring sender-generated keys. Note: it doesn't need to be a sender-generated domain key. It could be a sender-generated auto-key, so long as it is unique.]
- Like I said before, in large organizations one often has little control over most of the schema design. Even if you do control one corner, you use non-controllable data from many other corners.
- [I am confused. Please explain the relationship between what you said just now and the prior comment. Delete this comment after you edit.]
- We seem to be thinking of different kinds of scenarios. Why would the supplier/sender bother to generate keys for us if we could do it ourselves? It's not their problem (at least as they see it). They'll just tell you to use SQL "like everybody else".
- [Regarding the client-side keys: separating key-generation from insert is useful for at least two reasons: because networks are faulty (you might end up sending a message twice, but providing your own key guarantees idempotence for duplicate inserts or submissions), and because you often wish to insert identical references to multiple tables (i.e. both as a primary key and a foreign key, or using intermediate databases).]
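A minimal sketch of that idempotence argument, assuming a PostgreSQL-style ON CONFLICT clause and invented table/column names: the key is minted sender-side once per logical submission and reused on any retry, so a resend after a network burp cannot create a second row.

 import java.sql.*;
 import java.util.UUID;

 public class IdempotentInsert {
     // Call once per logical submission (e.g. when the form is rendered),
     // then reuse the same key for every retry of that submission.
     static String mintKey() {
         return UUID.randomUUID().toString();
     }

     // Safe to call repeatedly with the same key: one row either way.
     static void submit(Connection c, String key, String payload) throws SQLException {
         String sql = "INSERT INTO submissions (id, body) VALUES (?, ?) "
                    + "ON CONFLICT (id) DO NOTHING";
         try (PreparedStatement ps = c.prepareStatement(sql)) {
             ps.setString(1, key);
             ps.setString(2, payload);
             ps.executeUpdate();
         }
     }
 }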
- It would not be unrealistic for the conversation to go something like:
- DBA: "Why not just generate keys on your end after you receive the dataset"?
- You: "Because the network might burp and lose or duplicate packets."
- DBA: "Has this happened to you before?"
- You: "I've lost the connection in the middle before, and had to start over."
- DBA: "How will adding a key stop that particular scenario?"
- You: "That particular scenario, no, it won't help. But it guarantees I don't receive duplicates from the transfer process."
- DBA: "To be frank, I don't have time for make-believe or once-in-blue-moon problems. I have plenty of real and immediate problems on my plate. You have access to the daily summary count table and we know your detail query won't be off more than a fraction of a percent from that. That's good enough to detect any major hiccups. I've never seen ODBC duplicate results in all my years here."
- [An example of the problem being solved is that pressing 'Submit' more than once STILL risks duplicate purchases or forum postings in a wide variety of real systems. If the key was generated and available to the client in advance of pressing 'Submit' (which is most easily achieved via generation at the client), then duplicate sends would result in a single database entry ('idempotence'); similarly, a user could go back and edit a submission easily, since he has the key. But I agree that the above conversation wouldn't be unrealistic - especially the part with the DBA completely failing to grok the argument, then acting arrogant and dismissive.]
- [Most of what you're saying indicates misunderstanding. How are you interpreting "sender-generated keys" such that it suggests adding an auto-key on the server side? And schema developers (including the DBA) do heavily influence what an application does, both with regard to the interface between the application and the database and the resulting organization of the application.]
- Are you envisioning something like using Oracle's "RowNum" column as a transfer integrity check? I've never tested it for that purpose, and don't know if it works right with views and stored procedures, for example. Again, if the DBA has to jump through hoops, they'd probably put your request on the back-burner. (Maybe ODBC already does some integrity checks.) As far as DBA influence on apps, it's generally fairly limited in my experience, although it does depend on the org structure.
- [No. Those are RDBMS-side generated values. What I mean by "sender-generated key" is that, when I (a sender of data to the RDBMS) INSERT data, I provide the key. I've provided two examples: a domain key "(cash-register ID, store number, date)", and use of "large random numbers" (e.g. 128-bit numbers). Any number of mechanisms are possible. The point is that the key is decided sender-side prior to INSERT, as opposed to being decided RDBMS-side. This doesn't require a DBA manage an 'auto-key' column, and I have no idea how that idea entered your head. (Indeed, I said earlier: "I am quite convinced that 'autonum' is a mistake", referring to autonum columns or auto-keys.) Starting with your 'DBA conversation', nothing you have said makes any sense in context.]
- But what the logger (or cash register) does is a couple of staffing levels removed from your request. You are saying, "Ideally, the logger should do such-and-such, make it so!" While you may have a technical point, organization-wise it's not likely to happen. You have to live with the reality of what you are given, and unless you can show a clear and common problem, your "suggestion" will likely go nowhere. Perhaps the logger or cash register is supplied by a vendor on contract, and such changes would require a new contract or a different brand. You can't assume a King's view of things. I work with data from 3 "feeder" systems right now, and I have plenty of gripes about each one, including potential integrity problems; but I have little power to change them unless there is a show-stopper or an event that gets the big-wigs' attention. There's no budget for changes anyhow due to the sour economy. -t
- [It isn't unusual to place an intermediate service between a database and a 'logger' if that is necessary. Can you please make a comment with technical merit?]
- Can you please stop asking questions in an asperger way? I didn't do anything wrong. I cannot read your mind. Please give an example of an intermediate service that does more or is easier than adding an auto-key. And still, that may be 2 levels of staffing away from your request. It appears to me that you don't have a feel for real-world maintenance programming. People don't bend over backward for obscure what-if's in big orgs. Whether they "should" or not is moot. They don't in practice. If it did become an issue, they more likely would just give you reports or summary query access for getting check-sums and counts of some sort. That doesn't require changing the client-side (app) of the system. (Assuming the problem you are trying to address is the "ODBC duplication burp".)
- [You should not make RedHerring arguments about 'budgets for changes' when it was never even specified whether the situation involves developing a new service vs. upgrading vs. maintaining an existing one. If you're creating a new one, it is easy to take the King's view. If you're upgrading an existing system, my experience is that adaptors are usually the way to go. If you're maintaining a system, you'll tend to make as few changes as possible for fear of breaking things. Anyhow, an intermediate service that adds its own key before sending store updates onwards to a regional database is different from adding a key at the point the data enters the regional database. I've already described two criteria by which this 'does more'.]
- I am talking about existing systems, not new ones. A new query language working with an existing system is the sub-topic here. If you change the app side to somehow provide a unique key, then both the app side and the DBA have to make changes for a new column. Adding an auto-key would likely only affect the DBA's side. Adding it on the app side would require the approval of two departments, decreasing the chances more than asking just one department. Your chances are already slim because the request is competing against known common and existing problems. Going through a 2nd department is flirting with lottery odds.
- Further, "forcing" a key doesn't mean the data is good, only that we forced a key. They may feel it's putting an extra blame risk on them for making potentially non-unique data look unique (and have to manage the sequence number seeder, which can be a pain or added steps during crash recoveries.). Using scenario 1 (logger), the DBA may not even have any control over the logger devices; their job is only to collect the data that comes from it, not attempt to clean it. Big org's are like this. The options may be different if we need one-time keys, or "persistent" keys, meaning the same record has the same key if we come back (re-query) a later time.
- [There is nothing you can do to guarantee the data is good. Forcing a key for queries allows you to maintain integrity and consistency you possessed across JOINs. Forcing a key to be part of the table allows you to maintain integrity and consistency across UPDATE and DELETE manipulations. This doesn't guarantee you have integrity and consistency, only that you aren't making things worse. You cannot say the same of bag systems - you cannot write SQL queries or manipulations for bags that will maintain integrity and consistency across JOIN, UPDATE, and DELETE - i.e. you'll update/delete too many rows, or you'll multiply duplicates across joins. Bag systems will, in general, make good data bad and bad data even worse.]
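For concreteness, a tiny self-contained demonstration (invented data) of the join-multiplication problem described above: with bags, an accidentally duplicated row on one side multiplies matches from the other side, so any sum over the joined rows silently doubles; with a key, the duplicate row could not have existed in the first place.

 public class BagJoinBlowup {
     public static void main(String[] args) {
         int[][] orders = { {7, 100}, {7, 100} }; // order 7 entered twice
         int[][] lines  = { {7, 1}, {7, 2} };     // its two line items
         int joinedRows = 0;
         for (int[] o : orders)
             for (int[] l : lines)
                 if (o[0] == l[0]) joinedRows++;  // naive bag join on order ID
         System.out.println(joinedRows);          // prints 4, not the correct 2
     }
 }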
- But that's not the DBA's (supplier's) concern.
- [It has always been my impression that data integrity is eminently the DBA's concern. Why would you say otherwise?]
- Often DBAs are overworked and handle immediate known problems over hypothetical what-ifs. It may have worked fine for many years, and managing an auto-key is yet another duty. Where I currently work there are known potential data integrity problems (not related to keys), and pleas to fix them have been ignored.
- [If reducing the DBA's workload is a big concern, then a true RDBMS avoids several known data integrity problems for the DBA - and instead adds a relatively small burden to the much larger correct-schema-usage burden experienced by application developers. This would be a good thing, right? Lazy DBAs with even a small spark of intelligence are clever enough to be a little less lazy today so they can be much lazier for all of next week.]
- As I've said before, in my experience, the alleged problems with duplicates are exaggerated. Those who don't understand the schemas and the data from a domain perspective make much more visible problems than duplicates. I know you disagree, and we'll just have to part ways on that. And again, due to staff turnover, the standing DBA is fairly likely not to be the creator of the data and/or app. Auto-keys are not free: they create new problems, may use more disk space, slow inserts, and add new risks such as seed slippage. And finally, what people "should" do and what they actually do are often not the same.
- [Your summary of my opinion is inaccurate. Erroneous duplicate entry vs. semantically correct duplicate entry is indistinguishable, subtle, difficult to detect... but I agree there are bigger problems to solve. What I do not understand is your belief that duplicates are worth the trouble and complexity and harm to idempotence and optimizations. The disk-space for surrogate keys is not a serious concern. And regarding the "what people 'should' do" comment: I'll never argue a solution is fool-proof (as that'd make me Murphy's bitch), but I also see little reason for developers of a new query language to accommodate questionable 'features' whose primary effect is to give fools an extra way to fail in subtle ways.]
- I believe there would be enough BagSetImpedanceMismatch that support for bags is worth it. A new query language should perhaps make it harder to inadvertently create bags, but not pretend they don't exist and require somewhat odd work-arounds.
- Ideally, yes, you could generate unique identifiers at the sender's side. The only reason for providing auto-generated keys at all is that you do not have that capability. The sender's side could be (for example) an Oracle database for which you have only read-only permissions, via a closed-source JDBC driver provided by Oracle.
This has nothing to do with whether a database language supports bags or not. Note that any auto-generated keys used to convert remote bags to local sets will be applied to the ResultSet -- e.g., the result of employing a remote database's native JOIN -- and not the remote database's source tables.
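A sketch, with invented names, of what "applied to the ResultSet" could look like: a locally generated ordinal is attached to rows as they arrive, so the imported relation has a key, while the remote source tables stay untouched.

 import java.sql.*;
 import java.util.*;

 public class KeyedImport {
     static List<Map.Entry<Integer, String>> importKeyed(ResultSet rs) throws SQLException {
         List<Map.Entry<Integer, String>> rows = new ArrayList<>();
         int surrogate = 0; // generated here, never written back to the source
         while (rs.next()) {
             rows.add(Map.entry(surrogate++, rs.getString(1)));
         }
         return rows;
     }
 }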
Any "disadvantages" incurred by creating any heterogeneous database system are the same for all heterogeneous database systems, regardless of language or data model. The only reason for providing a link-to-translate-bags mechanism in Rel is to be able to use Rel as a heterogeneous database system. It's the same reason Oracle provides Oracle Generic Connectivity and Oracle Transparent Gateway. Using Rel as a heterogeneous database system and using Oracle's OGC/OTG have exactly the same "disadvantages". Both Rel and Oracle OGC/OTG have the same advantage: a single DBMS and database language can be used to seamlessly access a variety of DBMSes and/or a number of databases.
Once again, this has nothing whatsoever to do with the bags vs relations debate.
PageAnchor: inherently_slow
Let me ask for clarification regarding machine performance. Agree or disagree: Human error/judgment issues aside, a DBMS that optionally allows bags in addition to sets does not inherently perform noticeably worse than a set-only DBMS. Your argument about performance issues has to do with bags inadvertently slipping in and "polluting" the optimizer. It's not the existence of the option/choice itself of including bags in the DB that slows the computer down. Is this correct? -t
Arbitrary expressions consisting of operations on bags will be slower on average than equivalent expressions on relations due to the inability to apply certain optimisation strategies. (A standard example: rewriting an EXISTS subquery as a JOIN is always safe on relations, because projecting the result back onto the outer table's attributes removes any duplicates; on bags, the join can multiply rows and change the result.) However, the mere existence of the option/choice itself does not represent a performance hit.
That once again begs the question that still has no answer from you: Why provide bags as an option at all? We've shown that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity (a keyword). However, providing bags as an option within the database definitely adds complexity. So, why provide the option? We might as well provide a BrainFuck implementation in the database too, for all the benefits that have been shown for providing bags in the database.
[To the extent that an optimizer must depend upon locally available information (i.e. when dealing with abstraction or modularity), the mere option/choice can hinder optimization. Under the assumption that the optimizer has global information about all parts of the query, the option/choice will not necessarily hinder performance, but still greatly increases complexity (complexity to handle the extra options, plus complexity to manage the information to make the right decisions). Historically, complexity hinders development, testing, and validation of optimizers... which may indirectly impact performance.]
- I don't dispute that having the choice adds complexity to the interpreter/compiler. But the point stands that there is no direct issue with having the choice that would cause a non-negligible performance difference. (Assuming something "weird" doesn't happen, such as removing the primary key in the middle of a query.)
- [The pure sets approach can be optimized locally, whereas the mixed optional approach requires global information. That isn't an issue only of implementation complexity. The need for global information introduces its own issues. I would agree that, in the case of modern RDBMS's, the concern for 'global' vs. 'local' information on the bag/set issue is somewhat moot since the query programs are, for the most part, relatively small compared to the data (making it more profitable to delay optimization of the query and specialize it to the input tables). However, that should not be assumed to always hold, especially not as you begin to rely more upon named views or procedures, or as you begin to perform meta-level operations (e.g. making use of data dictionaries and table meta-data to describe queries).]
- Your warning is taken. Meta techniques generally should be used where human effort is a bigger constraint than machine effort.
See the RISC/CISC analogy above.
[CISC technologies work on the same principle that leads to AttributesInNameSmell. That is, certain combinations of useful features can be optimized. But I do not see a valid analogy here, in this section about performance.]
[1] Often such documents do not explicitly state a column order or the lack of one. But out of professional courtesy, one usually follows the order in the written specification. Paper has an inherent order whether we want it or not. In fact, so does any digital file, regardless of encoding technique. Thus, it's difficult to get a true bag. Nature sort of forces it into a list. If you want a bag, then generally it has to be done by suppressing or hiding internal physical ordering from the interface. And even that's difficult, because it will generally deliver the info in time order, a sequence. Maybe if the interface were 100% parallel, then there would be no time-of-delivery order.
Why provide bags as an option in a Relational DBMS? We've shown that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity (a keyword). However, providing bags as an option within the database definitely adds complexity. So, why provide the option?
I disagree it will be simple, but I am not going to repeat the entire arguments all over again here. We'll wait until TD/Rel comes out and kick the tires then. Done. -t
Ok, let's assume, for the time being, that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity. Why provide bags as an option in a Relational DBMS?
Are you talking about the query language, or the new RDBMS? A new query language that is not friendly when used with/for *existing* RDBMS will face resistance. I see no real problem with making the DB engine bag-free, but the query language is another matter. Oracle could add an option switch to their existing engines to disallow bags as well, if it becomes a PR issue.
Assume that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity. Assume the database engine is bag-free. Is there any reason to support bags in the database language?
Well, okay, but you know I'm going to say that's a tall "assume" right now.
"Well, okay" what? Do you mean you agree there is no reason to support bags in the database language?
No. I disagree with the stated assumptions.
That's fine. I know you disagree with what the assumptions state. That's why the word "assume" is used. We can assume whatever we like, even if it is purely hypothetical, nonsensical, unreasonable, or simply unrelated. E.g., we can "assume infinite bandwidth" or "assume there exists a perfectly spherical cow of infinite density" or "assume that importing/linking to data with duplicates adds no complexity and exporting data with duplicates adds negligible complexity." The last one is an example of "simply unrelated". I am intentionally separating what goes on inside the DBMS from what goes on at its import/link and output boundaries.
Anyway, let's set that aside and look at my question another way: Imagine that our requirements specify, for (say) security reasons, a database environment that is self-contained. Input will be entirely via data-entry forms written in C# that connect to the DBMS via ODBC; output from the DBMS will be strictly on-screen or printed reports via some CrystalReports-like tool. I.e., this is a fairly typical desktop-oriented database-driven application environment sans import/export. If we choose a pure relational DBMS to meet our requirements, do you see any reason to provide bags in addition to relations?
In the future the requirements may change to integrate better with other RDBMS, or the client may decide to dump the non-SQL-based system and switch to SQL. CrystalReports and similar tools use SQL under the hood for most database interaction. If you only have TutorialDee, how will you interact with such tools? I suppose you could make a simple single-SELECT SQL interface that calls more complex TD under the hood, but that's a little on the silly side and may still keep you from using fancier SQL-based UIs. It's kind of silly to dump a standard used by many tools over minor shortcomings. It's almost like making your own web browser based on something other than HTTP because HTTP has a little something that bothers you.
Did I mention TutorialDee in this section? I did not. I wrote, "a pure relational DBMS." Such a DBMS could even support a dialect of SQL more adherent to the Standard than that provided by most commercial SQL DBMSes. CrystalReports and other tools could trivially support it. TutorialDee is not the issue here. Please, stop being evasive and answer the question.
My apologies. I didn't notice the non-statement. I just noticed you didn't mention Philadelphia either. However, it does shift the question toward how this "special" dialect of SQL deals with various issues. The different ways to go about it could probably fill up another topic, and make us each call each other another 50 unique names or so.
Please, stop being evasive and answer the question. I have highlighted it, above, in case you've forgotten it.
I don't think I can anticipate all the potential downsides to make a good estimate. I've never lived a life without the option of bags.
I see. Is it possible that some of your objections to bagless systems are based on a fear of the unknown?
No, it's based on my experience with trying to forecast problems down the road in complex systems. I can already foresee some minor annoyances working with outside systems and data having bags. There are probably things I cannot foresee, given that it's difficult to mentally compute all possible usages and query kinds. The more complex the system (such as more combinations), the more things are probably skipped in estimations.
PageAnchor: scenario_four_alpha
This is to clarify and illustrate some of the statements in scenario 4 (above).
Boss: The mass-mailing of the thumb drives to potential clients will commence in two weeks. This is an important sales drive. However, our budget is being cut and we'll probably have to purchase the 512k drives instead of the 1gig drives. Thus, we will probably have to dump some of the items we planned to include. Will the Cleveland actual sample data query system you built fit in 512k? I need to know how much of the other materials we need to cut fairly soon to make room for it.
You: Well, I may be able to remove some of the indexes, but it would slow down certain queries.
Boss: We'll look bad if too many queries run too slow, losing potential customers. Which index do you think would have the least impact on speed?
You: Let's see... well, the primary key index can possibly be tossed. It's pretty big because of the way our stats are laid out. (Sample schema already given.) It takes up almost as much space as the data itself.
Boss: What does it do?
You: It mostly just prevents accidental duplicates on our part. It does help a few kinds of queries on the first element of the key run faster, but I can replace it with a single-column index.
Boss: Is there another way to check for duplicates?
You: We could write a query set that would let us know if there are accidental duplicates. We just have to remember to run it before we send out the final cut.
Co-worker: Maybe we can switch to a different database engine that prevents duplicates without such a big index. ABC 3.0 allegedly has this property.
Boss: We already have enough database engines around here to support. When one of you leaves for that big raise, we are left training somebody in all of those. I'm hesitant to add another one. If we did go with ABC 3.0, how much time would it take?
You: It would cut it close to our deadline, I estimate. I don't recommend changing the engine this late in the game. We already did most of the testing, which would have to be repeated.
Co-worker: I'm pretty sure I can burn the midnight oil to get ABC 3.0 going before the deadline. I'd like to gain some knowledge on it anyhow so that we have another tool approach in the future.
Boss: No, I don't want to risk new problems this close to the deadline, and add another DB engine to learn and support. Tossing the primary key and remembering to run the duplication-checking utility sounds like the simplest fix with the least risk. Maybe next year we can look at alternatives if we have space problems again, but at this time we need to focus on getting the current stick ready.
Good heavens. I haven't seen juvenile "story" illustrations since junior high school.
- That sounds like an insult. I don't appreciate that. Anyhow, I try different techniques when one technique is not getting my point across. Sometimes you have to go back to basics when all else fails. -t
- It wasn't intended to be an insult but an observation. Logic is more effective on me than fictional stories.
- Ideally we'd be able to use pure logic and pure math. However, neither side has been able to pull that off, and we are thus stuck with "descriptive logic", which has proven insufficient.
You've given a completely contrived example, along with some unrelated editorialising. Here's a much simpler conceptual analogue:
Me: I can write our new DBMS so that it doesn't require bags.
Boss: Do it and you're fired!
Me: Ah, you're right as always -- our DBMS requires bags.
Anyway, if space is a concern (a) you can use the BTree (or other data structure) itself to store the data, and/or (b) eliminate the duplicates in the source data. Of course, if this is a query system I presume it's read-only, in which case it's probably a good idea to remove duplicates in the source data -- before putting it on the USB stick -- for performance and space reasons. Since it's read-only, we don't need any mechanisms to detect duplicates.
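For example, the de-duplication pass could be as simple as the following sketch; the table names are invented, and any SQL DBMS supporting CREATE TABLE ... AS would do:

 import java.sql.*;

 public class DedupBeforeBurn {
     // Build the cleaned table once, before copying it onto the stick.
     static void dedup(Connection conn) throws SQLException {
         try (Statement s = conn.createStatement()) {
             s.executeUpdate("CREATE TABLE sample_clean AS "
                           + "SELECT DISTINCT * FROM sample_raw");
         }
     }
 }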
Overhauling the DB engine (or data structure) was addressed already. And remember, we want multi-indexing in most query utilities. I'm not sure how much we'd have to reinvent if we did a hand-rolled indexing system. And as I explained elsewhere, I don't know what you mean by "contrived". It's not a technically-precise term.
Re: "Since it's read-only, we don't need any mechanisms to detect duplicates". Yeah, in other words it can be bag! Thanks for agreeing with me.
Huh? How did you get from "we don't need any mechanisms to detect duplicates" to "it can be a bag"? We can construct a database that contains no duplicates. Since it's read-only, we can't add records. If we can't add records and there are no duplicates, there are no bags.
As we already discussed somewhere, we may want to add last-minute adjustments that would make it have duplicates. I've added last-minute adjustments to similar "summary" tables many times. Ideally we'd reconstitute it, but sometimes one has deadlines. Further, a bag that happens to contain no duplicates is still a bag. It fits the definition of a bag far more than a set.
I bet some of your last-minute adjustments have introduced erroneous duplicates.
Perhaps. I and/or my boss made an economic decision of risk versus time versus resources etc. The work environment generally is such that AllRoadsLeadToBeeMinus. I usually try to run some quick check-sums to help ensure there are no glaring errors and that if there are errors, most likely they'd be small and "local". -t
Re: "Boss: Do it and you're fired!"
I'd be willing to bet a paycheck that 95% of all bosses don't give a fudge as long as stuff usually works. I've had a few bosses that do what I'd call "obsess" or "fixate" on certain things, but bags wasn't one of them.
I see. I, however, would bet a paycheck that no matter what anyone writes, you'd find some reason to quibble with it even if quibbling adds absolutely nothing to the discussion at hand.
Projection. Your insistent set-only stance is silly and quibbly from my perspective, I would note. It's like you are saying that your quibbles are better than my quibbles.
I defend a set-only stance because I like maximum optimisation and minimum opportunity for erroneous duplicates either in entered data or query results.
As for quibbling, did you really need to question the motives of my fictional "boss" in something that is clearly intended to be nothing more than an illustration of "story" structure?
I generally based it on what I believe is most likely to happen; what a "typical" boss would do in that circumstance. If you disagree with my probability assessment, be my guest. Just don't be rude about it.
I don't agree or disagree with your probability assessment. I simply wonder what it has to do with anything. It's as if I'd mentioned someone had a coffee in the evening as part of some tale about the evolution of the relational model, and you interjected to point out that drinking coffee in the evening will keep you awake, which is bad because you need sleep to be most effective. True, but who cares? It's irrelevant to the discussion at hand.
I'm trying to illustrate the work environment that would put one in a position similar to the scenario. In other words, what would YOU do in that person's shoes, and why? It also illustrates the limits of one's control over the decision choices/paths, per role, in a typical company. Some of your statements made me question, or at least puzzle over, your familiarity with a typical business work environment as far as what one has control over, what one doesn't, how one would likely communicate options to managers, and the likely responses. I'm trying to put you in a business simulator to see what your responses are likely to trigger as far as manager reaction. We often don't get to solve technical issues alone on an island. It's part of a team and organization whether we want it that way or not. Whatever solution you suggest has to be compatible with the given environment.
- [To quote you from only a little earlier on this page: "I'd be willing to bet a paycheck that 95% of all bosses don't give a fudge as long as stuff usually works and we try to make reasonable decisions/trade-offs. I've had a few bosses that do what I'd call "obsess" or "fixate" on certain things, but bags wasn't one of them." Typical circular arguing. -- DanMuller ]
- In this particular scenario, media space was a key issue because the boss had to coordinate files coming from multiple departments and deal with the budget for flash drive purchasing. And query speed would be his/her concern also because it affects the customers' response to and perception of the sample product. It affects things that he/she is immediately involved with and rewarded/punished on. Bags versus sets does not, UNLESS it affects issues of his/her typical concern. In this case it did. If space were not an issue, then most likely the boss would never care either way about the primary index. I perhaps should have qualified the above more, but I didn't come here to write the algorithm of a boss. Experienced people should know that already.
- [But according to your own estimation, such scenarios would involve at most 5% of bosses, so why waste time and space on it here? Because you like to argue. Endlessly. Circularly. Evasively. I think we are approaching 95% Top wiki content, by volume, although I wouldn't bet a paycheck on it. Yet.]
- No, a non-situational summary command of "no bags" is what I'd estimate at 5% (or less). I assumed that's what your statement referred to, but perhaps you had a different context in mind? In this particular situation, the existence of bags intersected with a specific concern/task of management. I've agreed to leave the index in if there is no space issue. So that's not what the debate is about.
Sorry, you've completely lost me here. Your comment appeared to be non sequitur, and it still appears non sequitur. I can't imagine why you'd see the "boss" as anything but the slightly-comical, vaguely-Dilbertian illustration it was intended to be. I'm even less capable of imagining why you'd think your comment should represent some sort of business lesson. If that was the intent, it was even more out of place.
I'm trying my best to communicate with what is, to me, a strange foreign mind. I did not intend to fail at such communication. There seem to be some underlying assumptions that we are not communicating, and I hoped that putting it into a step-by-step situation in a "field" setting would help tease out those underlying assumptions.
Top's arguments remind me of a mistake I've seen made several times. One instance in particular caused ongoing headaches for years in a product that I worked on. We were developing a database-based shrink-wrapped Windows desktop product. It was to replace a DOS application that used flat files for data storage.
There were many improvements to the schema of the new product's data, which variously reflected lessons learned from the old product, support for features that didn't exist in the old project, and leveraging of the better technology that we were using. An important part of the application was a converter that imported data from the old application. A big stumbling block was that the old data often couldn't satisfy the constraints imposed in the new system without significant additional work, and in some cases might even have needed additional input from a user. An argument developed within the team over whether to relax constraints during the import process.
The question seemed patently absurd to me. You deal with these issues at the system boundary, so that the rest of the system can take advantage of the assurances provided by these constraints. If you don't, then the constraints have little or no value. Inexplicably, the manager (who actually was technical, if a bit out of practice, and usually made better decisions) sided with the more junior staff working on the import feature. We turned all constraint checking off during import.
We paid for this decision many times over in the following years. Even though new data could not be entered that did not satisfy constraints, the system had to be littered with code to handle the old data as a special case. Numerous bugs arose in the field and had to be "fixed" with such code. New customers weren't plagued by these problems, but old customers were, and the erosion in confidence may have prolonged the support cycle for the old product. I can't quantify the costs with any precision, but they were considerable.
If you consider "no duplicates" as a constraint, then some of Top's scenarios are exactly analogous. (In fact, some of the constraints alluded to were, exactly, uniqueness constraints. And he is arguing for making this mistake on a grand scale. He justifies this by belittling the value of the constraint.
E.g. the term "purity" is used as if it were a disadvantage. On the contrary, reliable adherence to any constraint gives guarantees to a programmer about the state of the data, reducing the number of possible states that have to be catered to by code. This has profound beneficial effects. It's at the heart of type systems, in fact. And this topic is closely related to types. Nobody argues against the ability to have bag-like data in a system; the argument is over what should constitute a relation in a relational system. (The argument is also absurd because there's a whole body of knowledge built around relational systems, and nobody's going to redefine that in order to accommodate one person's insistence on changing a fundamental definition.)
Top also underestimates the cost of duplicates. In my experience, duplicate errors are made fairly frequently in SQL systems, and when summary operations are involved, often cause more than cosmetic problems and are hard to diagnose.
-- DanMuller
This reminds me that QualityIsFree.
- I've had bad experiences with CMM. -t
- You would.
Ultimately it's still an economic decision, and your boss made a decision based on his/her best estimate of the trade-off in problems. If your organization had gone with the tighter re-import decision, it perhaps could have angered existing customers enough to sink your business. Since we don't have a parallel universe in which to observe that alternative in action, we can only speculate at this point what the actual costs and drawbacks would have been.
At least the "loose data" decision allowed your company to survive, or else it wouldn't be around long enough to have developers who complain about the annoyance of the legacy features. It thus passed at least one test: it didn't sink the product.
If one applies FutureDiscounting, the path may be the right decision per current finance practices. Under FutureDiscounting, paying more in the long run is often worse than avoiding near-term losses. (Some argue that FutureDiscounting shouldn't apply to IT, or at least apply less, but that's another debate.)
Windows Vista by many accounts sunk itself into suckage lore largely because security was given higher priority over backward compatibility. Microsoft struggles with backward compatibility versus better security all the time (MicrosoftWindowsBeyondRepair). Consider another MS product: the Xbox was rushed to market with known flaws, but by many accounts this rush did allow Microsoft to get a foothold in the game market at the right time, just before competitors released their next generation of consoles.
There are SafetyGoldPlating horror stories and there are also slipshod slop horror stories. Sometimes WorseIsBetter. There is no up-front right spot on the "slop level" slider. Picking the right level on this slider requires skillful timing and domain experience.
Anyhow, these are tough decisions where it's difficult to model the future precisely and test the impact of different decisions, especially from an economic standpoint. I can respect your viewpoint and I can respect your boss's viewpoint regardless of whether I agree or not. Let's agree to respect each other on this difficult issue.
We can disagree without calling each other "sloppy" or "anal" or "detached from the organic business reality", etc. It's okay to be confident you are right, but don't be rudely insistent unless you have pure math or logic behind you. (If I violated these social rules with your content, I apologize.) However, with economic choices, pure math or logic usually is not available. Economics, like psychology, is still a "grey science".
Often in the absence of good, hard scientific tests (outcome analysis), people fill in the gaps with their personality traits and quirks, and this is possibly why such issues become so heated. Some people are naturally fastidious and others naturally loosey-goosey. Each of these groups may project these personal traits into such estimates.
(I believe I'm in the middle of the road, at least in the shop. In this particular wiki, I'd probably be considered a bit toward the loosey-goosey side. But this may be because of my domain/specialty.)
I wasn't there for your DOS-to-Windows re-write, and so didn't see the details that would allow me to make an informed vote, so I am not picking a side in your scenario. At least it sounds like you guys gave the issue a good hard thought and had probing discussions.
My main point is that the bag in scenario 4 provided a viable business option that wouldn't exist without it. (Switching to a new database engine also carries mistake risks, I would note, because one tends to make more mistakes with new tools.) If a technology option provides management with more options that are viable, this is generally a good thing. You may not agree with management, but that may be stepping outside of your actual role. In my book, something that supplies more options to owners and decision makers is a plus (all else being equal), and I will score it as such in my working IT decision-making models. If you wish to argue it's irrational, be my guest, but please make it a new topic. -t
Re: "Top also underestimates the cost of duplicates. In my experience, duplicate errors are made fairly frequently in SQL systems, and when summary operations are involved, often cause more than cosmetic problems and are hard to diagnose."
Well, I just don't see it that often, except as a side-effect of something else. When it does happen, it's usually the result of bad schema design, insufficient input validation, a poorly-skilled query author who shouldn't be writing queries in the first place, unavailable info on primary keys, or a sacrifice made for backward compatibility (similar to above). Perhaps you can present scenarios of actual screw-ups so we can dissect what went wrong and why.
Also, if we could reboot the world such that existing RDBMS had started out with the no-bag rule, then existing tools and data sources would be set-friendly and some of the above issues would go away. But bags and related practices have too much of a foothold already and have become a de-facto standard. QwertySyndrome wins in this case.
In short, the organic-versus-planned decision is still a grey art to measure, and anybody who pretends it's an exact science is kidding themselves.
--top
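{Sure. Here's a minimal, contrived sketch of the sort of screw-up I mean; the table and figures are hypothetical:}

 -- A payments table with no declared key -- i.e., a "bag":
 CREATE TABLE payments (payment_id INTEGER, amount DECIMAL(10,2));
 INSERT INTO payments VALUES (1001, 250.00);
 INSERT INTO payments VALUES (1001, 250.00); -- load job accidentally run twice
 SELECT SUM(amount) FROM payments;           -- reports 500.00, silently double-counted

{Nothing flags the second load; the summary is simply wrong, and in a large system the cause is hard to trace. Had payment_id been declared PRIMARY KEY, the second INSERT would have been rejected and the error caught at load time.}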
{I've been in the same position as DanMuller, having to port an old application driven by a DOS-based proprietary database to one driven by a SQL DBMS that strictly enforced constraints, including no duplicates anywhere. We ran into the same problem -- legacy data sometimes violated constraints in the new database. There was some debate over whether to relax constraints, but we chose not to. What won the argument was the reality that data is a business asset whose value is increased by ensuring its correctness under the new constraints, and that relaxing constraints would risk long-term support and reporting difficulties that would cut into our bottom line and our clients' bottom line. We wrote a conversion program that involved the user in the conversion process to resolve ambiguities, eliminate duplicates, fill in missing data, etc. Some users did kvetch about the time the process took and the effort involved, but none genuinely complained, because we took care to inform users that the process was necessary to ensure the correctness of their valuable data and to help avoid expensive problems in the future. I've never regretted the decision, especially as conversion problems were only ever a one-time issue, whereas relaxed constraints would have meant possible ongoing problems.}
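{For illustration only, the duplicate-detection step of such a conversion can be as simple as the following; the table and column names are hypothetical:}

 -- List legacy rows whose intended key occurs more than once,
 -- so each case can be resolved interactively during conversion:
 SELECT cust_id, COUNT(*) AS copies
 FROM legacy_customers
 GROUP BY cust_id
 HAVING COUNT(*) > 1;

{Finding the offenders is cheap; the real work was the interactive resolution described above.}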
I'd argue that as technicians, we are often not qualified to make such final decisions. We'd like to think we are, but to be frank, those who eat and breathe the business side are better qualified than those who mostly eat and breathe technical issues. Our job is to present the technical options as accurately as we can. --top
{In that role, I was the CEO of the software company.}
You'd still be if you were any good :-)
{On paper, I still am. However, I essentially retired 10 years ago, moved to a foreign country, bought a massive house on the country edge of a quaint village with my girlfriend -- a former actress/model, but now working as a computer geek because it's what she loves -- and have settled down to a life of research, teaching, and amusement. I attribute almost all of my success to faithfully adhering to the pure relational model and avoiding bags.}
- Holy cow. You see what you're missing out on, Top? I don't think any further discussion should be necessary at this point... LOL -- DanMuller
- In other words, become a wind-bag in place of bags. ;-)
- {Dude, step out from behind that mirror. You'll see more.}
What about using the carrot instead of the stick? Keep the old version going for those who don't want to convert. If they really want the new features and new GUI, then they will eventually decide to bite the bullet and clean up their data. (Offer free cleaning tools/reports.) Most probably will, eventually. Some might argue they'd switch vendors entirely, but it's still easier to convert to something that's almost the same than to start over, as long as your product is roughly on par with the competition. Offer them a loyalty discount. I've been on the customer side, and key system upgrades can be very stressful for a company. If you make it easier for them to convert on their terms, they will be grateful. --top
{Completely off-topic. However, our clients were allowed to convert to the new system at their leisure.}
It's nice to have that option.
...Ummm, my scenario for bags is when something meaningful can happen when keys are identical. That is, is there a useful "conflation" that can be taken advantage of to make a smaller table? This could be as simple as a union-merge, but things get really interesting when a more meaningful conflation function can be used. But no one seems to organize their data in ways that make use of such functionality. The only real use is when the relationships between data are so interconnected that they are better represented as a pure graph with no tables at all. Like a forest of sprouts over towers of trees. This means it's useless for institutional data, but perfect for a WikiDb? engine. --MarkJanssen
Can you give an example of something meaningful that can happen when keys are identical? Is it something that cannot be achieved by replacing the duplicates with a {key, quantity_of_duplicates} pair? Not sure what you mean about "conflation function" et al.
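For illustration, any bag can be mechanically folded into a relation that way. A minimal sketch, assuming a hypothetical table some_bag with key column k:

 -- One row per distinct value, with the multiplicity preserved
 -- as an explicit attribute:
 SELECT k, COUNT(*) AS quantity_of_duplicates
 FROM some_bag
 GROUP BY k;

The result is a true set, and the original bag is recoverable by expanding each row quantity_of_duplicates times, so no information is lost.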
This page is TooBigToEdit
Indead! It confuzes my spail chekker.
See Also: ComplexityOfOutputtingDuplicateTuplesInTutorialDee, BagAtational
AprilTen NovemberTen
CategoryDatabase