Sharing Data Is Important

One of the big stumbling blocks in the ObjectRelationalPsychologicalMismatch has to do with sharing data among diverse things. (The other big stumbling block is TablesAndObjectsAreTooDifferent IMO).

This issue comes up again and again, so this is an attempt to factor or reference topics related to sharing. More links or cut-n-pastes are to come.

In my experience, multiple different applications, languages, tools, and paradigms are going to need to share data. Some in near real-time (live), and some in batch-like modes. It does not make sense to me to ignore or delay this fact until the sharing requirements actually come up, and then hack together piecemeal solutions. If they have a bad cold, give them the Kleenex box instead of individual tissues. Many persistence or query tools seem to be floating around that ignore this important need. Perhaps it is a YagniAndDatabases HolyWar. I don't know.

Possible sharing-challenged tools:

PrevalenceLayer

Disagree... it's really sharing agnostic. Realize that a prevalence layer is another term for a journalized system. It has nothing to do with distribution, querying, etc. Now, the Prevayler, being billed as a database, is definitely sharing challenged, unless you treat it exclusively as a convenient implementation of journalling, in which case it becomes sharing agnostic as well.

Okay, say I want to grab some data out of PrevalenceLayer for use in CrystalReports and also an existing Python script. What are the steps, and are they as easy, familiar, and available as ODBC, CSV, SQL, etc? -t

Not all uses of PrevalenceLayer carry 'data' that would be useful outside a debugger. If you use PrevalenceLayer to implement a FileSystem or DBMS or RepresentationalStateTransfer service, then those services will (by nature) need to provide appropriate standardized APIs for access. But, as noted above, PrevalenceLayer itself is sharing-agnostic - even more so than FileSystem or RepresentationalStateTransfer, which at least provide sharing semantics. You won't find any 'standard' mechanism for sharing because PrevalenceLayer doesn't know about sharing. It doesn't hurt sharing, either, so you can use PrevalenceLayer to help implement data management systems. If you want data in CrystalReports, you'll need to gather it and prepare it for CrystalReports same as you would data from any other arbitrary data source. (You could support ODBC hooks in the debugger, of course, if you really needed arbitrary access to object-graph implementation details for a report.)

Well, I agree that ItDepends. However, it's generally portrayed as something to use with application development, not SystemsSoftware. And you seem to be admitting that one would have to re-invent a lot of the infrastructure to share data that already exists with alternatives. Most developers already know SQL. With PrevalenceLayer one may have to build interface API's and/or interfaces and the other developers have to learn it. You are essentially asking that Java be a DBMS, and a non-standard one at that. -t

PrevalenceLayer is targeted heavily at robust services (thus the emphasis on load-balancing, mirroring, automatic failover, CommandPattern, and (of course) transactions and persistence). A DBMS is one possible service that could be implemented atop PrevalenceLayer, and one could certainly provide a standard API to that service. (This is not fundamentally different than Oracle implementing its own "non-standard" persistence atop flat files and block storage devices, then providing a standard DBMS API.) PrevalenceLayer can certainly be used for persistent GUI applications, but you'll find less support for that than would be ideal - e.g. PrevalenceLayer doesn't integrate with JavaSwing or JavaAwt, and provides nothing special to support opening GUI windows and dialogs back into a consistent state after a crash.

Anyhow, no matter how you spin it, persistence helps sharing if only by virtue that persistent components are more likely to live long enough to become widely referenced and usefully shared. The only reason you complain about "sharing" in PrevalenceLayer is that it makes this persistence easy without quite taking that extra step to making sharing easy. I suspect, therefore, that your complaint is essentially: "if people have easy access to persistence, it becomes too easy to forget to push information obtained during the execution of an application or service out to a 'real' data-management and data-sharing service". To this we can add a related comment in accordance with your preference for TableOrientedProgramming: "a lot of application and service implementation details can be effectively represented as data". It may surprise you, but I would agree that both of these complaints are legitimate.

I'm not sure what you mean by "forget to push" here. Do you mean replication between the app and RDBMS? If we don't assume a GodLanguage for all needs, then we need to draw lines somewhere to form partitions, and ideally we want these partitions to fit common or typical usage patterns.
Even if you assumed a GodLanguage for all needs, you'd want to push information acquired during runtime to a dedicated data management service. The significant differences the GodLanguage would make to the user of the data service API: the 'dedicated database service' would support arbitrary language values/types without an explicit translation layer, and the user of the database could also be partisan to persistence and transactions and distribution and such without any headaches (i.e. if the service crashes and restarts, it could continue incomplete transactions). And I don't mean 'replication', since that would imply explicit mirroring. I simply mean forgetting to push information into a dedicated data management service, whether that service be a database or a publish/subscribe system (to which someone could hook a database, if they wish to and are capable).
Typical services that tend to be used together or are often related, as listed in DatabaseDefinition, usually make the best partition candidates. If you buck the pattern, then you risk reinventing the rest of the list the hard way or do a lot of ugly copying. Putting *only* persistence on the app side goes against typical usage pattern grains.
I've never suggested "putting *only* persistence on the app side". Even PrevalenceLayer supports more than just persistence.
That's because it's morphing into a Java-centric DBMS. It favors usage with Java at the expense of usage with other languages/tools. The big question is whether being Java-friendly outweighs difficulty of sharing outside of Java. For those of us who prefer a "type-lite" style of programming, PrevL usually doesn't give us enough to justify a dedicated fit. However, those who prefer a type-heavy approach, or work in a domain that leans that way, may conclude that the sharing sacrifice is worth it.
I disagree. When PrevalenceLayer begins to provide support for ad-hoc queries and data manipulation, then it will be a DBMS. At the moment, PrevalenceLayer simply helps Java get some useful features it was missing.
I'll bet you they'll start adding such because people will build complicated applications and rather than try to coordinate with and/or copy to a RDBMS, they'll say, "if we just merely add feature X to prevL, then we don't need to install an RDBMS just to get X", where X will end up being almost everything on the DB list when all is said and done. It's an incremental way to get closer to a GodLanguage. Unless an app language does everything, there will always be something that "it's missing". The issue here seems to be coming back to the classic services-versus-GodLanguage debate.

You and I just answer these complaints differently. You favor making persistence inconvenient in order to force programmers to use an external service, like a relational DBMS, which can both persist and share the information. I favor making it more convenient for programmers to precisely create, share, and compose service components (including data, via DBMS or MultiCaster/PublishSubscribeModel) among applications - an approach that would be impractical and fragile without convenient support for persistence and a few other features (critically: distribution, concurrency, and redundancy). You consider my approach to require a GodLanguage and GoldPlating. I consider your approach to be BondageAndDiscipline and counter-progressive. We needn't reiterate the arguments behind those complaints here (you can see 'Definition 2' under TableOrientedProgramming if you want details of my complaints, and I can look at ExternalServiceVersusIntegration or GodLanguage for details of yours) but we should at least recognize that we both acknowledge 'SharingDataIsImportant'.

(Moved from TablesAndObjectsAreTooDifferent)

Another issue is that relational databases (and perhaps databases in general) assume that the NounModel is in the database so that multiple applications and languages can access it without duplicate coding efforts for each language. However, the object approach seems to assume that each "application" needs a different or slightly different NounModel. You thus get class structures which more or less mirror the database schemas. This smacks of a violation of OnceAndOnlyOnce to me and should be cleaned up somehow. Do we have to duplicate so much of the schemas to make say 10 percent changes for a more local view? Further, application-level is still not fine-grained enough in my opinion. Queries allow one to produce the needed view at task-specific level, smaller than application-level granularity. Besides, the boundaries for "application" can be fuzzy in some situations (SystemSizeMetrics).

The page title is myopic. SharingDataIsImportant in some types of software but not others. There's an assumption here that all software is multi-user business apps.

The topic is meant to be a reference topic for where it applies, not global.

So a better title might be "WhenSharingDataIsImportant?"?

There is also an assumption here that a RelationalDatabase is the only reasonable way to share data between apps in a big complex system.

Perhaps, but it appears most common in topics about relational. Besides, there is not a lot of living non-legacy competition to RDBMS for "big complex systems". OODBMS are rarely used in sites where multiple paradigm apps are readily present.

A few popular alternatives to relational databases for sharing data in "big complex systems":

FTP
HTTP
WebDAV
MQSeries
SOAP
NFS
CORBA

I view these mostly for external-to-external transfer. I guess I was not clear on the scope. The "setup" for external transfers is expected to be greater than for internal transfers/sharing for security purposes. But it is interesting to ponder about using a relational query languages for external transfers. One would probably have to limit the columns and tables viewable (AccessControlList) and provide a standard ASCII format, perhaps a derivative of CommaSeparatedValues.

My employer uses all of those technologies for "internal" transfer. They are "living non-legacy competition to RDBMS".

But those are "flat" protocols. You can only do what the developer envisioned more or less. There is a reason that query languages were invented. What if you wanted to take data from app A and cross reference it with app B? You not only have to find/be a specialist in app A and B, but do all the fiddle faddle to join the data. If it is a large amount of data, then you have to deal with perhaps making indexes and/or temp storage areas, and so on. A query language abstracts all that crap away from you. I still have to bend over backward to search my (hierarchical) file system to see what I want to see and find what I want to find. I would rather have the ability to query the stuff. I don't want Bill Gates nor Joe Blow Programmer pre-limiting my view of info that is mine. (See Also CodeAvoidance)

I don't think anyone is arguing against query languages. The page topic is data sharing, not data querying. You claimed there weren't many alternatives to RDBMSs for sharing data, and I listed a few of the most popular.

I did not intend to make it sound like a Boolean issue. If sharing (inside organization) is important, then we need to find ways to facilitate sharing. Query languages are one very useful approach because for one, it hides the "guts" of each application that may act on data. Otherwise, there is no consistency to data access because each app will generally invent its own data-centric operations (add, change, delete, save, find, join, sort, etc.), and probably skip some due to YouArentGonnaNeedIt thinking, meaning that you have to go and program it in each different app's language when you do finally need it. Databases provide more OneStopShopping? for data.

Perhaps another way to describe this is that the scope/granularity/divisions/usages of data tend not to map well to devisions of behavior. Perhaps this leans toward SeparationOfDataAndCode. Perhaps we also need to better analyze how and why things are divided into "applications" to begin with. I find the division perhaps anachronistic. (ApplicationPartitioning) "Application" may be a satisfactory partitioning of behavior, but not of data IMO.

Lets say we use "entity" as our unit here for discussion. In any given "application", I would say that roughly half the entities references are mostly specific to that application, but the other half are fairly heavily shared. Thus, even if it made sense to "wrap" the dedicated entities behind OO-style or RPC-style interfaces, it does not make sense to wrap the shared half. It would be an arbitrary assignment of a given entity to a specific application. It is not even a UsefulLie that I can see.

Again, your bias is showing. There are entire classes of applications with no shared entities. Life is not CRUD business apps. Whether it makes sense to wrap entities behind "OO-style" inside an application has nothing to do with sharing and everything to do with ease of development.

I will agree that what works for one domain may not work for another. Should we change the title to SharingDataIsImportantForSomeDomains?? I do notice that OO and NavigationalDatabase-like proponents tend to be from non-business domains. Same with textbook and GOF-like pattern examples.

I see that differently. I see that RDBMS proponents tend to be interested in a few slices of business apps (such as CRUD front ends, data mining and report generation).

If you have a specific "slice" you would like to explore, we would be happy to take a look.

I was part of a team that developed and maintained a homegrown b-tree (NavigationalDatabase) for a major corporation. This was clearly in the "business domain". About once a year we searched for an off the shelf RDBMS that could replace our b-tree, but we never found one that met the throughput requirements without at least an order of magnitude increase in hardware costs.

If hardware costs are all you care about, then ReinventingTheDatabaseInApplication might make some sense. Then again, so will assembly language.

We used that, too. Every business I know of cares about hardware costs. The business mentioned above provided a high speed transaction processing system built around a NavigationalDatabase. They would not have been profitable using an RDBMS. My point is that there is more to business software than RDBMSs.

How do you know that they would not be profitable? Perhaps they could get more information out of their system and thus make better business decisions if they used a RDBMS. If the owners don't understand the benefits of flexible data manipulation, then they may focus just on hardware costs, because a bill is something they do understand. Put PostGre? on a Linux box if you want an inexpensive RDBMS box. Staff costs are generally more than hardware costs (although cheap 3rd-world labor may change all that soon).

We knew because we did the math. We had hard requirements for transaction times, volume, fault tolerance, etc. When we projected what it would cost to meet those requirements with an off the shelf RDBMS it was clear that the hardware cost would destroy our profit margin, regardless of any savings we could get from easier data analysis. Even the cost of keeping a one day old copy in an RDBMS was significantly more than the cost of paying programmers to write reports against our homegrown DBMS. PostGres? on a Linux box (or 1000 Linux boxes, or Oracle on Sun boxes, etc.) would not provide sub-half-second response times for millions of transactions per day against over a terabyte of data. The company was in competition with huge corporations that used big iron (running NavigationalDatabases, btw) and split the cost over multiple lines of business.

I wasn't there. I cannot verify the calculations. Maybe in some cases RDBMS are not the right tool for the job. That's true of any tool. It is said by some that navigational techniques are quicker for narrow-purpose processes.

[And it's said because it's true. Commercial RDBMS's have a performance edge in exactly one area: unanticipated ad-hoc queries. All this talk of "sharing" sounds a lot like "one app needs to support ad-hoc query's on its data by another app." Here is the news, YouArentGonnaNeedIt.]

[If you know what queries are needed, then a navigational database will always be able to outperform a relational one. For instance, many (if not all) commercial RDBMS products use a navigational database internally (b-trees, red/black trees etc.) because the possibly ad-hoc SQL query gets re-written into a combination of already known queries on the internal data store. Therefore, if you know that your application doesn't need to support arbitrary ad-hoc queries, and many don't, you might as well not put the RDBMS layer in the way. Except that for political reasons you often will have to. Or, you might want to use some sort of DataWarehouse solution to handle your known queries, and DW's are explicitly agnostic about the database technology underneath them, be it relational, multi-dimensional, even OO.]

Yeah, what he said. It sounds like some RelationalWeenies see that section of the business domain that relies on ad hoc queries as the entire business domain. My point is that the business domain is bigger than that. It sounds like they think their credit card approvals are hitting an RDBMS.

I don't think it is practical to write code that is DB-agnostic without stooping to a lowest-common-denominator interface and doing a lot in app code. The rest of my reply about "not needing it up front" is the usual YagniAndDatabases. Businesses need to be nimble to new needs. If you are into mass-production but predictable data processing, then countries with cheaper labor will eat you alive.

I don't understand how that last statement relates to the previous statements.

I read it again, and I don't see a problem. I am saying that betting against change is too risky business-wise. "We only have done X queries in the past, so we will target only X queries in the future." Perhaps you are talking about replicating some of the data to special "devices" tuned for specific queries?

Who's advocating betting against change? Are you arguing that high volume transaction processing ought to use RDBMSs? Where will the money come from to pay for that? It isn't in the market right now.

We have insufficient information here to say "all" or "no" high volume transaction processing should use RDBMS. In your case it might not have been feasible. But, we cannot extrapolate up or down at this point. I sense that this should be moved to a different topic, BTW.

You may have insufficient information, but I worked in that domain for eight years. I'll be you real money that every time you swipe your credit card in a POS device or cash register, the transaction never hits an RDBMS.

Fine. RDBMS suck for credit cards. Now lets move on.

Are you still happy taking a look?

I suppose we would need a 20 million dollar budget to perform adaquate performance tests to settle this. Performance issues like are discussed a bit under NavigationalDatabase

No, we don't need to spend a cent on it. The card acceptance banks and their service providers already have.

Data sharing is important from the point of view of synchronization, beyond genericity actross application types, legacy etc. The third tier is largely covered out of the box with rdB's, and certain OODB's. As one adds the architectural attributes one would like to see, so the solution converges on rdB's plus objects. Thus I contend that OODB's and rdB's will converge.

Perhaps you can describe your plan under MultiParadigmDatabase. I am a bit skeptical of a practical mix.