Distribution Is Abstraction

Some would say that DistributionIsOptimization, and often a premature one.

I claim that distribution is an abstraction in the sense that a relational database is an abstraction. We must accept that a distributed computing support system serves a fundamental role to solve a fundamental problem.

RelationalDataBaseManagementSystems have evolved over time but the defining moment when they really changed the computing world was when everybody who was anybody decided that they weren't going to tackle any nontrivial data-oriented project without one. This could only happen (to a wide degree) after some other people established a standard theory of data with facilities to address the core values of consistency, correctness, completeness, and the ability to transform raw data into real information at a higher level.

Yes, I know that there are bookshelves filled with more technical explanations of the concepts, which do not bear detailed repetition here. In a very real sense, that's the point. I install PostgreSql or an OracleDatabase or the next db which fits my needs and be done with it, assuming I'm halfway competent with StructuredQueryLanguage.

Performance is a characteristic of a system which is not treated by the mathematical formulation of RelationalAlgebra. It is a domain not germane to RDBMSes, but it is certainly germane to distribution systems.

Having all views of an object be consistent is definitely an issue for an RDBMS. On the other hand, for a DistributionFramework?, the point is more likely that all available views (from any given point of view) be consistent, and that all views everywhere be reconcilable. Further, we demand that the framework be tolerant of a certain level of physical failure.

Let's compare the D in ACID: Durability. If your RDBMS transaction commits, then the transaction is durable, for some level of tolerance. If you lose your backups the moment your hard drive rack gets struck by lightning and civil war breaks out then you're pretty screwed no matter what your documentation may say. It's beyond the limits of the system.

Similarly Distribution has to be about buying certain guarantees with certain tradeoffs and certain limitations. Then you can build your system within the framework and founded on the premise that when the DistributionFramework? holds up its end of the bargain, your system requirements will be met.

For scholarly works check out OliverSims; he designed a framework back in the early 90s call Newi (New World Infrastructure) and introduced the concept of UID (user interface domain) and SRD (shared resource domain) - abstraction and distribution. Or see CiteSeer.

It's important to separate necessary or inherent distribution from that performed for optimization by some particular measure (parallelization of processing, reduction of latency, replication, resilience, etc.) and management (separation of concerns, etc.). Only the last can really constitute 'abstraction' of a sorts. Optimization is something of the opposite of an abstraction; it creates many cross-cutting concerns and brings to the front many issues of timings and such; any abstractions that can be maintained in this situation are boons that should be embraced. Finally, necessary distribution is the sort inherent to the problem domain... it is why you're writing the app in the first place (e.g. you have distributed sensors, so you need a distributed system).

Distribution: to divide among many, to spread out so as to cover something, to divide or separate especially into kinds

Abstraction: to remove properties of a complex object so as to attend to others (this is my abstraction of the definition of abstraction)

I'm not clear why saying DistributionIsAbstraction is useful. I do think it is useful to recognize that the process of distribution is similar to the process of abstraction. An example:

Let us start with a example distributed framework. There are several options today, I like J2EE. One way of using this (this is an art, there is no OneRightWay?) is by introducing 3 layers:

data access layer. Session EJBs are built to encapsulate RDBMs access (I�ve not had much success with Entity Beans).
business logic layer. Implemented using Session EJBs which provide facades to the enterprise business logic.
presentation logic layer. Separate, distributed and can leverage a framework like Apache Struts which provides MVC behavior further separating layout (defined using JSPs) from user interface logic and component invocation (implemented in Java classes).

The frame work should support a system being decoupled in to layers. Each layer has a clear purpose and the system architecture communicates this division of responsibilities. The framework enables cohesion by providing run-time communication between the layers.

Now to abstract and distribute using this framework.

Data Access: Components are built to provide CRUD (Create Read Update Delete) interfaces to business developers. All knowledge of RDBMs tables, connection pooling, SQL, column names, joins are contained in this black box. Data access has been abstracted and distributed.

Enterprise Business Logic. Components are built to provide business interfaces which provide information required by the user interface logic. The components maintain the integrity of the enterprise data provide shared services which are made available to multiple client applications. Enterprise Services have been abstracted and distributed.

User interface Logic: Applications (e.g.1 Web Applications using JSPs in the Struts example; e.g.2 Fat client Java Swing applications) are built to provide a single user with useful information and services. All knowledge of how the data is stored, how transactions maintain the integrity of shared data have been abstracted and distributed.

A picture for those who think IllustrationsClarifyText. A moving version can be seen on AnimatedArchitecture.

The question still remains: Is this a useful abstraction?

As pointed out above, distribution and abstraction are two very different activities. Perhaps what is meant is that the distributed view of a system, i.e., an abstraction of the system that omits all details unrelated to the network locations of objects, is an abstraction. Which is obvious. All views of systems are abstractions, since no view is complete.

First question: is it an abstraction at all ? The (in)famous CRUD "abstraction" is taking OO developers back to the level of COBOL, because it is create one object, insert one object, update one object, remove one object whereas you substitue object with record and you got your COBOL structured file. That's the reason why EntityBeans failed miserably in many projects and continue to fail. Then you "hide/abstract" table names and columns names, and instead you got your class names and property names, and you got your CRUD for one object only, and when it comes to S the abstraction disappeared all of a sudden.

After you make the case for that being an abstraction, then it's worth looking if it is an useful abstraction.

Case for the CRUD component being an abstraction.

Abstraction to me is like zooming out on a street map. Looking at a map of Chicago on a single screen, removes the details of block level streets I would see at a detailed level.

Be careful here. Resident tree-haters (LimitsOfHierarchies) will say that abstractions are or can be situational. Abstraction delivers what you want to see but not what you don't need to. Thus, a roadmap of the US highways is still an abstraction even though it does not zoom in. A geological map of the US may omit streets to make the geology stand out even though it is at the same detail level (resolution) as the street map. It is feature-based abstraction of the type set-theory fans would probably promote over the "nested" composition view of abstraction.

If I have a CustomerAccessBean? the read method provides me with a Java data object containing all the information I need to process some business logic. This abstracts (and encapsulates) the RDBMs access.

If I were to zoom in, I may see details of multiple SQL call to multiple tables, details of translations of legacy data, details of caching strategies, details of connection pooling etc. I am not assuming a one to one relationship with the data object and a record (I too see problems with the current EntityBeans model). The CRUD component abstracts the access and supports distributed deployment.

DistribtionIsJustAnotherFacilityAtThisPointInTime? - that is to say, there are competing vendors, and the state of agreed abstraction of how to distribute work isn't even anywhere near (say) SQL. You certainly might (as of late 2006) want to abstract your distribution interface. But that's as far as it goes. Distribution splits into a number of areas: job submission, resource broking/sharing, monitoring, etc - people aren't even sure where the dividing lines are here. There's a lot of research going on. How can you abstract what you can't understand?

Distribution is also an artifact of the variety of software development environments and programming talent that is available. This tends to lead more toward systems that are LooselyCoupled?, where the results from a BusinessRule that has been applied by one process ends up stored in an XMLfile or flag in a database, so another completely different process (written by a programmer in a different country) can apply the next BusinessRule.

Is distributed computing a logical abstraction or a physical one? It seems to me it is mostly a physical one. It is DivideAndConquer along the dimension of location (usually). The problem is that usually there are multiple dimensions to divide something by and which is needed is situational. In an ideal world, one would not have to care where the server is. If I ask for info filtered by location, then it should deliver that; and if I wanted it filtered by Product Category, it should deliver that too. We usually don't split servers by product category because we know it would make other factors more difficult to work with. Why should location be an exception? Perhaps because as things get very large, we have to split by *something* in order to find some DivideAndConquer, and location is usually the least of the available evils to choose from. It is the best desparate choice.

CategoryAbstraction, CategoryDistributed