Database Domains For Numbers

Numbers may be used to represent data in very different contexts. If those contexts are not clarified, their use may become confusing (the same way that None, Null and Zero can be confused). Examples:

Numbers for OrdinalScaleData?: In this scale type, the numbers assigned to objects or events represent the rank order (1st, 2nd, 3rd, etc.) of the entities assessed. An example of ordinal measurement is the result of a horse race, which says only which horse arrived first, second, third, etc. If we use a number field in a database to represent this, what number do we use for a horse that gets injured in the middle of the race and never reaches the end? Note that using Zero may lead to inconsistencies: does Zero mean that the horse arrived before the first one? (If we sort by that column, that might well be the answer.) Or does it mean that it did not arrive? Also note that SqlNull (Unknown) would not be accurate: we do know that the horse did not finish the race, so we are not dealing with an unknown here; what we want to represent is the known fact that the horse did not finish. So maybe "None" is a better match? Or perhaps a special "OrdinalZero?" or "OrdinalNone?"?

Numbers for IntervalScaleData?: An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or 'intervals') is the same but the zero point is arbitrary. Scores on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. The formal mathematical term is an AffineSpace? (in this case an AffineLine?). Variables measured at the interval level are called "interval variables" or sometimes "scaled variables", as they have units of measurement. A highly familiar example of interval scale measurement is temperature on the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water at atmospheric pressure. It makes no sense to multiply 2°C * 8°C. Another example is Gregorian calendar years, and it makes no sense to multiply them either (year 1999 * year 2000 = what would the answer mean?). Yet we generally see this implemented as a plain "Integer" in typical PseudoRelationalDatabase?s.

Numbers for RatioScaleData?: Most measurement in the physical sciences and engineering is done on ratio scales. Mass, length, time, plane angle, energy and electric charge are examples of physical measures that are ratio scales. Informally, the distinguishing feature of a ratio scale is the possession of a non-arbitrary zero value; the Kelvin temperature scale is an example.
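
As a rough illustration of the interval-scale restriction above, here is a minimal sketch (in Python; the Celsius class is purely hypothetical, not something offered by a typical database) of a value type that permits differences and offsets but rejects multiplication:

 # Hypothetical sketch: an interval-scale Celsius value.
 # Differences and offsets make sense; multiplying two temperatures does not.
 class Celsius:
     def __init__(self, degrees):
         self.degrees = float(degrees)
     def __sub__(self, other):
         if isinstance(other, Celsius):
             return self.degrees - other.degrees     # temperature - temperature = plain interval
         return Celsius(self.degrees - other)        # temperature - offset = shifted temperature
     def __add__(self, other):
         if isinstance(other, Celsius):
             raise TypeError("adding two Celsius points is meaningless")
         return Celsius(self.degrees + other)        # temperature + offset = shifted temperature
     def __mul__(self, other):
         raise TypeError("multiplying Celsius values is meaningless")
     def __repr__(self):
         return f"{self.degrees} degrees C"

 print(Celsius(30) - Celsius(22))    # 8.0 -- an interval, not a temperature
 print(Celsius(22) + 8)              # 30.0 degrees C
 # Celsius(2) * Celsius(8)           # would raise TypeError, as the text suggests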


No need here for new inventions

As for the horse race example, the most practical approach would be to use null for non-finishers, but have a separate "status" column that indicates the status of a given horse for a given race, such as: "finished", "injured not-finished", "injured finished", "disqualified", "jockey didn't show", "other, see booth", etc. There's no need to invent custom nulls or custom types.

Depending on display room, it may be a dedicated column or a footnote letter on the result card or reports. But notice that it's the status of the horse's run in the race, not just of the rank. Thus, associating it with just the rank value is perhaps a poor approach.

Further, with a "status" column, it's easier to make/reuse a CRUD editor so that administrators etc. can add or change status information (perhaps in a ConstantTable if we don't need other attributes). Custom complex types would generally require a programmer visit to make changes to the status info if it's part of a "type". It may be good job security for you, but customers don't like that. Your "type" is morphing into a mini database, but without out-of-box database abilities and tools (GreencoddsTenthRuleOfProgramming).

As for the year multiplication example, I have multiplied date elements in the past to create a hash for security or file re-distribution. If we create a custom type that forbids certain operations, then such uses may be made more difficult. I'd instead suggest a "warning system" rather than hard rejection of suspect uses if you want to go the "protection" route. We've learned from elements such as Java's "final" indicator that original authors are often not very good at anticipating future uses.

--top

"Further, with a 'status' column, it's easier to make/reuse a CRUD editor so that administrators etc. can add or change status information ..."

Interesting you mention that, because something I've been experimenting with in the RelProject is the ability for clients to automatically download type definitions from the database and automatically generate client-side user interfaces (UIs) to edit type values. This is possible because every user-defined type definition -- no matter how complex -- describes a directed graph of types where nodes of outdegree = 0 are primitive, built-in types. Thus, it appears to be possible to automatically create a client-side UI editor to manipulate values of any type. To test the basic idea, some time ago I successfully created a NakedObjects-like environment for Java, which used Java reflection to allow a user to instantiate arbitrary Java class instances and manipulate the instances with appropriate textboxes, etc., automatically displayed for every primitive property. I employed it in an experimental general-purpose GraphicalProgrammingLanguage called Tomato. See http://tomatoide.sourceforge.net/

Thus, support for complex types permits a considerably higher degree of automated UI generation than is possible in an environment having only discrete primitive types, because a type inherently defines a grouping of primitive attributes. Having only discrete primitive types means the developer is forced to describe such groups in every UI "form" or dialogue box where they are used. For example, you would need to explicitly code the fact that your 'Status' attribute is related to your 'HorseRaceResult' attribute every time the two are used in a UI form. With a single type definition that includes both attributes, their relationship is implicit in the type definition and can automatically apply to every UI element that references it.
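
Roughly in the spirit of the reflection-based experiment described above, here is a minimal Python sketch (hypothetical; this is not how Rel or Tomato actually work) that walks a type definition as a directed graph, bottoming out at primitive leaves, and derives generic editor fields from it instead of hand-coding each form:

 from dataclasses import dataclass, fields, is_dataclass

 # A complex type built from other types; the leaves (str, int, ...) are the primitives.
 @dataclass
 class Status:
     code: str
     description: str

 @dataclass
 class HorseRaceResult:
     horse: str
     rank: int
     status: Status

 def editor_fields(typ, prefix=""):
     """Flatten a type graph into (label, primitive type) pairs for a generic editor."""
     if not is_dataclass(typ):
         return [(prefix.rstrip("."), typ)]      # outdegree 0: a primitive leaf
     result = []
     for f in fields(typ):
         result.extend(editor_fields(f.type, prefix + f.name + "."))
     return result

 # A UI generator would turn each pair into a textbox, spinner, date picker, etc.
 for label, primitive in editor_fields(HorseRaceResult):
     print(f"{label}: edit as {primitive.__name__}")
 # horse: edit as str
 # rank: edit as int
 # status.code: edit as str
 # status.description: edit as str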

As you probably expect, I'm going to call GreencoddsTenthRuleOfProgramming on that one. One can use a database to group and manage base/primitive types also. It may be more flexible because it's less subject to encapsulation rules typical of types. For example, the original design above tightly couples the placement (race rank) with the "status". If later we want to decouple it and make it associated with the general horse-per-race row rather than just placement, it's merely a matter of foreign keys (and possibly constraints) if done the DB-centric way, and possibly no schema change at all. As I've said before, RDBs generally make better "relativity engines" than types and objects because of the "soft" nesting, or rather, reference-based associations instead of encapsulation. If this trades protection for flexibility, so be it. In domains where I hang out, flexibility is more important. (I'm not sure it does trade, but am preparing for that claim.)

The domains where you hang out are ERP related, are they not? Working on ERP systems can lead you to believe the entire IT world needs only INTEGER and CHAR, and to a lesser degree DATE, TIME, REAL, MONEY and BOOLEAN types. Outside of ERP -- such as numerical computing, geography, games, simulations, engineering, financial management, and medical informatics, to name just a few -- those canonical types are still popular, but the need for rich user-defined type support becomes significant. Yes, you can define COMPLEX, POLYNOMIAL, TEMPERATURE, BINARY_TREE, GEOGRAPHICAL_COORDINATE, INFINITE_PRECISION_REAL, DNA_STRAND, MEDICATION_FREQUENCY, and innumerable other user-defined types via database schemas and user-defined procedures tied to specific schemas, but the work this requires is laborious, tedious and error-prone compared to using user-defined types. The reason for this is simple: a complex type defined via a database schema requires either that (a) type operations be repeated in every query, or (b) they be defined in procedure definitions tied specifically to a given schema or schema structure. With proper support for complex types, type definitions do not depend on any database schema, operators are specified OnceAndOnlyOnce in the type definition, and values of complex types can be manipulated as easily as primitive canonical types like INTEGER and CHAR. Support for providing multiple views of a given value is explicitly available in certain type systems. For example, DateAndDarwensTypeSystem provides POSSREPs; i.e., multiple possible representations for a given type.
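
As a rough sketch of the OnceAndOnlyOnce point above (Python-flavoured, with made-up names; real type support in Rel or in SQL's structured types would look different), the idea is that the operator lives with the type definition rather than being repeated in every query or tied to a particular schema:

 from dataclasses import dataclass
 from math import radians, sin, cos, asin, sqrt

 @dataclass(frozen=True)
 class GeographicalCoordinate:
     lat: float   # degrees
     lon: float   # degrees

     def distance_km(self, other):
         """Great-circle distance (haversine), defined once, here, with the type."""
         lat1, lon1, lat2, lon2 = map(radians, (self.lat, self.lon, other.lat, other.lon))
         h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
         return 2 * 6371.0 * asin(sqrt(h))

 # Any "query" can now use the operator without repeating the formula or caring
 # whether the value is stored as two columns, an encoded string, or a blob.
 london = GeographicalCoordinate(51.5074, -0.1278)
 paris = GeographicalCoordinate(48.8566, 2.3522)
 print(round(london.distance_km(paris)))   # roughly 344 (km)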

Further, while automatic CRUD form generation is a nice idea, in practice one tends to need different groupings and representations for different situations. The EightyTwentyRule reigns supreme in CRUD design. Thus, "soft" groupings are the better route in my experience. Type- and OO-centric groupings tend to be inappropriately "hard" (overly-coupled). --top

That may be true for schemas; it is not true for types. An INTEGER will always be edited as an INTEGER. If it should not be edited as an INTEGER, it is a different type (or at least an INTEGER with multiple POSSREPs; see above). This applies analogously to complex types.

I’m bothered by the idea that the needs of one specific user out of many may dictate a total change in design. But, the devil’s in the details, and I will only comment on a case-by-case basis.

Eh? Did you mean to put your comment here? It doesn't appear related to what's gone before.

Let me see if I can flesh this out a bit. Suppose a system uses an integer ID (key). The system is a bit old and unfortunately uses domain info embedded in the ID, such as the first 3 digits representing the originating office location. If there is a mess-up in the location, then one may have to edit the key in a string-like way. Thus, "An INTEGER will always be edited as an INTEGER" may not always hold.
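
A small sketch of that scenario (hypothetical ID layout; Python): the value is stored as an integer, but correcting the embedded office code is most naturally done with string surgery rather than integer arithmetic.

 # Hypothetical legacy key: first 3 digits = originating office, rest = sequence number.
 def fix_office(record_id: int, new_office: int) -> int:
     digits = str(record_id)
     return int(f"{new_office:03d}" + digits[3:])   # string-style edit of an "integer"

 print(fix_office(412000789, 507))   # 507000789 -- same sequence, corrected office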

In Rel, I simply create a new ID_INTEGER type, inherited from the built-in INTEGER, to represent your specialised INTEGER.

Would it break any existing code that assumed it was Integer?

No, because by virtue of inheritance, it is an INTEGER.

At the app side or database side?

Both.


From a Statistics perspective, we can see that not all statistical operations can be applied to all of these scale types. So, this classification may also be relevant when deciding which statistical operators can (or cannot) be applied to a particular "numerical" database field. See LevelOfMeasurement
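
For example, a rough mapping (a sketch only, not exhaustive; see LevelOfMeasurement for the full treatment) of which descriptive statistics are meaningful at each scale type might look like this in Python:

 # Rough sketch: which descriptive statistics are meaningful per scale type.
 # Each level also permits everything allowed at the levels before it.
 PERMITTED_STATS = {
     "nominal":  ["mode", "frequency counts"],
     "ordinal":  ["median", "percentiles"],
     "interval": ["mean", "standard deviation"],
     "ratio":    ["geometric mean", "coefficient of variation"],
 }

 def meaningful_stats(scale):
     """Return every statistic applicable to a field measured at the given scale."""
     order = ["nominal", "ordinal", "interval", "ratio"]
     allowed = []
     for level in order[: order.index(scale) + 1]:
         allowed += PERMITTED_STATS[level]
     return allowed

 print(meaningful_stats("ordinal"))   # mode, frequency counts, median, percentiles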


RE: Your "type" is morphing into a mini database, but without out-of-box database abilities and tools (GreencoddsTenthRuleOfProgramming)

I can appreciate that concern, TopMind. I'd really like to see out-of-the-box support for "types" as mini-databases. For example: relations as FirstClass column values, useful for describing complex graph values. ExtendedSetTheory seems to aim in this direction, and has repeatedly been a topic of interest to the author of RelProject.

[Indeed. The author of the RelProject is exploring ExtendedSetTheory as part of his PhD work.]

When you talk about "mini-databases", then it begins to sound like hard nesting (database inside a database). Encapsulation of "types" tends to go against the database view: that borders, interfaces, and restrictions are data or meta-data themselves, not hard-wired into the design of parts. I agree that hard boundaries may improve compile-time checking of models and code, but perhaps at the expense of flexibility. It's an age-old trade-off debate. (RelationalBreaksEncapsulation) -t

[Your claim that "encapsulation of 'types' tends to go against the database view ..." appears to be uniquely your own opinion. I've found no equivalent in the literature. Indeed, DateAndDarwensTypeSystem from TheThirdManifesto is in opposition to this, and Codd's writings made it clear that attribute values could be of any type. Use of types in databases is partly about compile-time checking, but mainly about supporting OnceAndOnlyOnce when defining and using typeful behaviour. Appropriate type support makes it possible to manipulate complex types like TEMPERATURE, HORSE_RACE_RESULT, and GEOGRAPHICAL_COORDINATE as easily as INTEGER and CHAR.]

Using "thin" entities as custom types can lead to almost the same thing; it’s just a matter of perspective and syntactic conveniences. However, either you have encapsulation or don’t have encapsulation. I don’t see any middle ground. If I can query the sub-elements with the same language/system that I query traditional entities, then it’s not a true ADT. This conflict has never been resolved. (Didn’t we have this debate regarding x-ray-able "stacks"?) -t

[The internal representation of a type can be anything. There may well be particular type values that benefit from being represented in relations, sets, arrays, and so forth. However, that is not relevant. The important thing is a type's operations. A type is defined in terms of its operations, not its internal representation. The internal representation is for the convenience of the implementor and to achieve certain performance minima vis a vis the operations; it is of no consequence to the type user.]

But you are risking paralysis-via-label here. We may initially call it a "type" and define it by behavior, but later decide we need to do database operations on the "data" aspects of types and start treating some of them like an entity. Usage and needs change over time, at least in my domain. To keep info flexible, it should be easy to make it accessible to the regular query system if and when such a need comes along. Even if we don't end up with a domain need, query-ability makes a handy debugging tool. A behavior-centric view of info inherently conflicts with a value-centric view. Your type approach is creating a behavior-value-impedance-mismatch, such as conversion DiscontinuitySpikes when we decide we want to x-ray or change the guts of "types". And I haven't seen that the value-centric view inherently creates OnceAndOnlyOnce violations, as you appeared to claim. Just store the "operation" info in one spot and reference it instead of copying it.

[As I asked before, your domain is essentially ERP, isn't it? The specific set of canonical database types we're used to owes much to the traditional prevalence of database systems, and the data type values they needed to capture, in the ERP domain. Thus, the push for invariant, complex, user-defined types is probably weaker there than in any other domain. The canonical set appears to be sufficient for the vast majority of ERP purposes because, in fact, it was intended to be sufficient for all ERP purposes.]

[As for OnceAndOnlyOnce violations, we've shown elsewhere that even for a simple "complex" type like a GEOGRAPHICAL_COORDINATE, without appropriate user-defined type support, you're forced either to create procedures that make assumptions about schema structure, or you must repeat GEOGRAPHICAL_COORDINATE operations in every query that manipulates GEOGRAPHICAL_COORDINATEs.]

Making assumptions about the schema is an acceptable trade-off in my opinion to obtain typish-to-DBish swappable viewpoints. As usual, I'll place flexibility over "protection", at least in my domain. I know this rubs against your compiler-centric approach to everything. -t

[If you feel a need for "typeish-to-DBish swappable viewpoints" -- which I take to mean support for multiple user-view presentations of a value's internal representation, something that is already supported in some type systems -- there is nothing that precludes providing POSSREPs, on a type-by-type basis, to support them. However, I'd like to see a case that justifies representing, say, a GEOGRAPHICAL_COORDINATE as a table. I suppose a character string type could represent a string as an ordered table of characters, but outside of being a technical curiosity and (perhaps) an interesting academic exercise, I don't see any practical benefit to providing such a view, nor can I see how using relational operators on such a representation would be superior to the canonical string manipulation operators.]

Geographical info can get quite complex. For example, the "state plane" system is generally considered more accurate than lat/long, partly because it's semi-independent from tectonic drift. However, it requires more info to encode, such as the "plane ID". But this is kind of getting away from the point. Functions or operators can "wrap" the guts of a data implementation regardless of whether it's multiple columns, encoded strings, or sets of tables. The trick is integrating this with the database engine, query language, and app languages. Custom types are generally more difficult to share across such boundaries than base types. The Rel "solution" appears to be a GodLanguage to reduce the need to share, at the expense of tool mix-and-match.

Indeed, representing any given complex value from a description of possible values is precisely what type systems do well, and they do it (in part) by doing exactly what you wrote: Using functions or operators to 'wrap' the guts of implementation.

The intent of Rel is not to reduce the need to share. The intent is to make sharing easier. Rather than making sharing easier by restricting what types may be shared, I seek to make sharing easier by making it easy to share complex types.

Only if the recipient can digest it. If you stick with the "base" types, then it's easier to transfer.

[At the expense of increased complexity everywhere else... Remember that so-called "base" types are only easier to transfer because they're so ubiquitous. They're ubiquitous because of their popularity in the ERP domain, which is still the primary and (almost exclusive, relatively-speaking) domain where relational DBMSes are used. Spatial types are now becoming increasingly common due to their use in other domains. It's only a matter of time before more types, from other domains, become equally common along with mechanisms to transparently transfer complex user-defined types. The need to use and transfer complex types in a variety of non-ERP domains -- without any undue rigamarole to represent them in databases and applications -- will inevitably drive this.]

[I don't know what you mean by "rubs against [my] compiler-centric approach to everything". Was that intended to be an insult? It otherwise conveys no information.]

You prefer that a compiler or compiler-like device checks everything "up-front", such as missing references/links, incompatible formats/types, etc. The more dynamic something is, the more difficult this is beyond a "warning machine".

[I prefer to avoid errors wherever possible and as early as possible, without losing capability. In order to achieve it, I will accept the occasional need for an extra keyword or two.]

When it's a 30% or more "formality tax", it can hinder productivity when working with and reading code.

[Do you have a reference to support the 30% figure or a definition of 'formality tax'? I find no mention of it in the literature.]

That's just a hypothetical example. I don't know the actual figure, for it probably depends on design style. It is quite possible to make "bureaucratic" code. There may be a place for bureaucratic code, but it's not everywhere.


RE: "A behavior-centric view of info inherently conflicts with a value-centric view." -t

That is an artifact or flaw common to mainstream programming systems, not a general truth. Behavior-centric and value-centric views come together very elegantly in TermRewriting systems. But I'll grant there often seems to be a "pick two" issue among behavior-view, value-view, and information hiding in the distributed scenario. I've been pointed to work by Joseph Goguen that indicates otherwise, but have yet to take the opportunity to read and grok it. If we can get all three properties without significant compromise, I would love to see such features brought into mainstream programming.

What about a workable way that doesn't require a complete Armageddon-style language and tool overhaul? Any new database is going to have to work well with existing apps, databases, and transfer conventions to be more than a narrow niche.

There are many strategies for successful long-term adoption that do not require a "complete Armageddon and tool overhaul". A new database or language platform only needs to work competitively well for new apps or services - those that are developed to leverage it. Whether one must work 'well' with "existing apps, databases, and transfer conventions" should not be taken as a given. Sabotaging the quality of certain aspects of adaptor elements is a not-uncommon business strategy to encourage VendorLockIn, or to tweak benchmarks relative to competing transfer conventions... which is to say, working well with "existing apps, databases, and transfer conventions" may actually turn out to be counter-productive to one's long-term evolutionary success. Where necessary, an adaptor element (intermediate service, protocol, or plugin) of generally questionable quality will be GoodEnough.

If you are in a domain where there's a lot of sharing, then writing translators/adapters can become the bottleneck.

I'm not seeing how that's a problem in practice. It is the standardization that is hard. If you are creating a new entry in a domain with a lot of sharing, then one should assume a few standards for sharing already exist. By nature, these standards will necessarily support a 'lowest common denominator' of the previous technologies. It is not difficult to adapt to a 'lowest common denominator' for a few common standards, and it doesn't pay to spend much more effort than that. Perhaps you should think about the perspective of the platform developer: adaptation-layer standards are often your enemy. They, by nature, turn your platform into a 'commodity' (something easily replaced). You must support standards in order to get people to start using your competing product. But there is such a thing as supporting them too well, such that nobody bothers to leverage the features and qualities by which you intended to market your platform. If you wish to compete for 'best implementation of a standard' (e.g. fastest correct HtmlDomJsCss), then go ahead and give it a best shot. But if your goal is to advance a new standard, to get people to try something new, then high-quality adaptor layers may prove counter-productive to your goals. In that case, it is easier - and wiser - to support a subset of the competing standard. If you want people to integrate your system's features and tie themselves to your platform, then perhaps extend the standard a bit.

If you play too nicely with your competitors, they'll bury you.


[I guess they should never have invented C because COBOL was just fine. Etc.]

They both suck in different ways.

Please don't quibble just for the sake of quibbling. Either address the point, or don't bother posting.

I'm just addressing a dumb analogy with a dumb reply. GIGO. I'd rather use COBOL to sort and merge a deck of IBM cards than C, by the way. As a DomainSpecificLanguage, it did its job.

You're still quibbling, and you know it. At least you admit it's a "dumb reply", though you apparently didn't recognise the essence of the analogy and focused purely on its literal substance. If you wish to debate the relative merits of C vs COBOL, please take it to some other page.

I assume the (poor) analogy has something to do with evolutionary change versus revolutionary change, but beyond that, I don't trust my guessing skills.

You've almost got it! I award you a 'B', but you should have more confidence in yourself and try harder.

Further, it doesn't hurt to pursue both and see what the market picks. AMD's approach to 64 bits was more evolutionary than Intel's, and it appears that's what buyers preferred. There is some potential overlap between types and entities.

Pursuing different approaches is fine. I shall pursue one, you may pursue another. There is, however, no overlap between types and entities. That is the FirstGreatBlunder.

One can force an overlap, for good or bad.

Generally bad. See FirstGreatBlunder.

It's short on specific scenarios to explore its alleged evilness and alternatives.

The FirstGreatBlunder, or rather the correct vs incorrect equation it identifies, was derived logically rather than empirically.

But it's based on the assumption of "hard" classifications, such as "x IS-A type". It's possible to have "type-like" things without having to buy into the whole behavioral-only-interface thing, for example.

Sorry, I don't see the relevance. Indeed, I don't see the relevance of this threadlet. It seems to be quibbling for the sake of quibbling.

You are quibbling about my quibbling over quibbling? Anyhow, I was hoping somebody would provide an example/scenario demonstrating the "generally bad" comment above that we could bounce around instead of talk about generalities; but it looks like it ain't gonna happen.

*sigh* A common illustration is simply the result of any query employing JOIN and/or projection. If relations are equivalent to classes, then classes can be formed via JOIN or subdivided via projection. They can't, so a relation/class equivalence is clearly incorrect.

I see no technical limitation to joining collections of object instances to get new objects (or views of objects). However, this may end up LaynesLawing on the barbed wire fence of OO's def.

Joining collections of object instances to get new collections of object instances (which is what I presume you meant) is fine. The notion of joining two classes to create... something, but we know not what... is what the FirstGreatBlunder deprecates.

Classes are not required for OOP. Clone-based OO languages follow a biology-like approach and clone to "inherit". Another approach is to "point" to the parent object to "inherit". Classes as we know them are mostly an artifact of compiler-centric or static languages. -t

What does that have to do with the FirstGreatBlunder?

{Sure, we can reify object classes and table schema and such. (Ruby, Smalltalk, Newspeak, SQL's DML or DataDictionary, etc. all give credence to that.) How is that relevant?}


See NullVersusNone



