The distributed shared state concurrency problem is the problem of maintaining the consistency of some shared state in the face of multiple concurrent processes, running on a DistributedSystem.
This is a key open problem in computer science. Some claim there is no general solution for this problem. (Instead, it is often argued that we should abandon all hope of achieving GlobalConsensus on such systems, and instead use other concurrency models, such as MessagePassingConcurrency, and use algorithms that can tolerate a lack of consensus.) That caveat stated, there are many problems whose apparent solutions call for shared state concurrency in a distributed architecture -- or something that looks a lot like it.
Various ways to address the DistributedSharedStateConcurrency problem:
- RelationalDatabases (more specifically, RDBMSes) and architectures which use RDBMSes in a ClientServer or MultiTierArchitecture. The RDBMS becomes the sole repository of shared state (with replication servers implemented if necessary). Scalability issues are addressed by throwing BigIron and fat pipes (and millions of dollars for LarryEllison and a team of DBAs) at the problem.
- Other database architectures. Relational is the dominant technology for OnLineTransactionProcessing? and OnLineAnalyticalProcessing applications. NetworkDatabases? and the like are still used for other purposes. The various software tools providing DirectoryServices? (ActiveDirectory, NDS, YellowPages?) are all essentially network databases.
- NetworkFileSystem?s, including NFS, ServerMessageBlock?, AndrewFileSystem?, CodaFileSystem?, NovellNetware?, et cetera.
- All manner of DistributedObjects architectures, which allow finer-grained entities (objects?) to be replicated across different nodes, and which attempt to keep things in sync if one instance changes. Many even support ObjectMigration?.
- LindaTupleSpaces (and similar) -- a content-associative memory which can be distributed relatively easily. Now starting to show up in production applications and architectures. For example, see JavaSpaces technology.
- DistributedCache?s -- a sort of a distributed hashtable with entries that are transparently and synchronously (maybe, it depends on requirements) replicated across nodes in the cluster. It can be truly a performance boon for a distributed system, especialy in write-behind configuration. Please see TangosolCoherence? for example.
Is this classification all right? For me, it's rather confusing that relational databases are opposed other database architectures altogether, which by some reason also includes directory services. It's as well not entirely clear to me what is so special about NetworkFileSystem? in a way of addressing DistributedSharedStateConcurrency problem that separates it from generic directory services. -- IgorLobanov?
- The anti-network attitude displayed by some RelationalWeenies is a bit off-putting at times, and unfortunate. Non-relational databases certainly have their uses (when speed of a fixed set of operations is more important than efficiently supporting arbitrary queries). OTOH, for most applications involving generic and arbitrary business data, a RDBMS is certainly more important. However, the topology of the database (relational or otherwise) has little to do with the role of databases in this context. In the context of concurrency, the role of the database is of managing (and arbitrating access to) shared state--persistent or transient. In the database model, all such data exists (canonically) in only one place--the database. This simplifies much reasoning about state in distributed systems. Of course, databases can and do become bottlenecks in many applications, this is why RDBMS's are run on big beefy servers with a big fat pipe. This is in constrast to other models of shared state concurrency, wherein state is distributed among many nodes and changes in one node may be transparent to others.
- Completely agree. -- IL
- Isn't this an implementation issue only? Whether a RDBMS runs in one room or a million should ideally be a hardware or systems implimentor's issue and not a software design issue. But it does bring up the question of "why split"? Perhaps this relates to WebStoresDiscussion.
It seems to me that we have two completely different (but not mutually exclusive) approaches to state management: centralized and distributed. It is not a strict dichotomy but more likely a gradient scale. Also it classifies various techniques by their logical structure rather than by implementation details.
In centralized approach there is single facility which is managing all shared state in given distributed system. It acts as single point of entry for any state-affecting operations and therefore potentially introduces performance bottleneck in transaction-intensive distributed systems as every transaction must be backed by centralized facility.
- Again, is this an inherant limit of RDBMS, or just something (some) RDBMS vendor's haven't addressed? If the transactions don't need to referece other tables, then the tables of the transaction could in theory be stored on separate isolated machines. An SQL JOIN clause should not care where the joined tables are. It may speed up transactions at the expense of joins, though. However, a non-RDBMS distributed shop may face the same trade-off: if one data set is in Nebraska and another in Taipei, then cross-referencing the two would be slower than if they are in the same room. An RDBMS's goal should be to offer a central view of the data, not necessarily force a physical location of everything in the same room. Ideally, one should not have to rewrite their applications if physical partioning is changed to speed up certain activities. In short, the trade-off seems universal, not relational-specific. I think RDBMS lean toward centrality in practice more to ensure AcId than for anything else. If you abandon the requirement for AcId, then putting stuff all over the globe is much easier to do. But, you kiss data integrity goodbye.
OTOH such facility provides unmatched querying/analyzing capabilities because it contains all state of a given distributed system. The other advantage of theese systems is simplicity to develop. The most radical example of this camp is RDBMS.
On the other end of out scale there are truly distributed state management technologies. In such systems state usually maintained ``in the air`` via sophisticated caching and replication between system nodes. In such scenario there is no need for centralized coordinating facility because distributed system automaticaly manages it's state. Such systems usually scales up very well almost linearly, but imposes very high requirements on skills of its developers and especially architect. LindaTupleSpaces and DistributedCache? are examples of such technologies.
What about such classification shcheme? Now we can dig down in each approach to consider their subcategories and also discuss unmentioned pros and cons of either approaches. Any thought?
See also AnoteOnDistributedComputing
CategoryConcurrency