Ways in which AtomicConsistentIsolatedDurable may be violated for performance gains.
(Moved from ParallelNeedScenario)
Let's look at an extreme scenario for illustration. Take the fact of the quantity of X at store Y. If 100 CPUs wish to change this quantity at the same time, there is generally going to be a bottleneck at the point in RAM (disk cache) where this one value is stored. Updating this value is generally going to be a sequential process regardless of the number of CPUs involved. They have to "wait in line". This is because we want OnceAndOnlyOnce for the "official" value. (We can cache writes to multiple caches, but that involves many of the same problems described below.)
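For illustration, here is a minimal Python sketch of that write bottleneck, assuming one in-memory value guarded by one lock (the names and numbers are made up): 100 writers all funnel through the same lock, so the updates happen one at a time no matter how many CPUs exist.

 import threading

 quantity_lock = threading.Lock()           # guards the one "official" value
 quantity = {"item_X_at_store_Y": 500}

 def decrement(key, amount):
     # Every writer must take the same lock, so 100 threads updating
     # this single value effectively run sequentially ("wait in line").
     with quantity_lock:
         quantity[key] -= amount

 threads = [threading.Thread(target=decrement, args=("item_X_at_store_Y", 1))
            for _ in range(100)]
 for t in threads: t.start()
 for t in threads: t.join()
 print(quantity["item_X_at_store_Y"])       # 400: correct, but fully serialized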
Now let's look at reading this inventory quantity. If 100 CPUs all want to read this one spot at the same time, they still must wait in line in the basic case. However, it may be possible to cache a copy of this value such that each CPU can have its own copy.
Some kind of mechanism needs to be in place to make sure any updates to the original are propagated to the cached copies. Such a mechanism itself costs processing power and other resources, and thus adds overhead. The further and deeper the copies spread, the further the update tracking and propagation mechanism needs to reach.
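A toy Python sketch of one such propagation mechanism (invalidate-on-write, with made-up class names): every write to the master now also does work proportional to the number of cached copies, which is exactly the added overhead being described.

 import threading

 class MasterStore:
     def __init__(self):
         self._lock = threading.Lock()
         self._data = {}
         self._replicas = []                 # per-CPU caches to notify

     def register(self, replica):
         self._replicas.append(replica)

     def write(self, key, value):
         with self._lock:
             self._data[key] = value
         for replica in self._replicas:      # propagation overhead grows
             replica.invalidate(key)         # with the number of copies

     def read(self, key):
         with self._lock:
             return self._data.get(key)

 class LocalCache:
     def __init__(self, master):
         self._master = master
         self._copy = {}
         master.register(self)

     def read(self, key):
         if key not in self._copy:           # refill from the master on a miss
             self._copy[key] = self._master.read(key)
         return self._copy[key]

     def invalidate(self, key):
         self._copy.pop(key, None)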
Further, each copy takes up space. Generally chunks of RAM are cached, not just individual values, in the hope that related info tends to sit close together. On a larger scale, this requires more total RAM than if fewer CPUs are used. If there is not enough RAM to go around for such caching, then RAM, rather than raw CPU chomping power, becomes the bottleneck.
There are work-arounds to this, but they involve trade-offs. In the database world, there is something called a "dirty read", which means you read from whatever cache is immediately available, knowing that it may not be the latest. (There's probably a more formal name for it.) Ideally, one wants an upper limit on the age of the data. For example, you'd like to specify that a dirty read is fine as long as the data (cache) is no older than N seconds. This can improve performance by sacrificing some amount of timeliness. It is useful in, say, an e-commerce web system with lots of readers and writers. (If traffic is light, hopefully it will not bother to use dirty-read algorithms, and will just wait for any pending transactions to finish.)
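Here is a small Python sketch of the "no older than N seconds" idea (purely illustrative names; a real system would also bound the cache size and handle concurrent refreshes):

 import time

 MAX_AGE_SECONDS = 5                 # the "N seconds" staleness bound
 _cache = {}                         # key -> (value, time_fetched)

 def read_with_max_age(key, fetch_official):
     entry = _cache.get(key)
     now = time.time()
     if entry is not None and now - entry[1] <= MAX_AGE_SECONDS:
         return entry[0]             # possibly stale, but recent enough
     value = fetch_official(key)     # otherwise go to the "official" copy
     _cache[key] = (value, now)
     return value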
High-end database vendors have spent a lot of time thinking about these issues. They probably have a lot of good lessons and scenarios for those working on app-language-centric solutions. Or just use an RDBMS instead of reinventing the wheel.
--top
"In the database world, there is something called a "dirty read", which means you read from whatever cache is immediately available, knowing that it may not be the latest."
Actually, a "dirty read" is where one transaction can select uncommitted updates from another transaction still in progress. In an ideal world, all transactions would be completely isolated (this isolation level is termed "SERIALIZABLE"), which means each transaction perceives the world as if all transactions occur serially rather than concurrently. Unfortunately, achieving this is relatively expensive -- mostly due to the use of locks upon which other transactions must wait, but also because fully-isolated transactions may be forced to abort in order to avoid deadlock. To achieve acceptable performance, the transaction isolation level may be reduced, but this introduces the possibility of encountering anomalies such as dirty reads. In many real-world applications (e.g., those where it can be guaranteed that no transactions will be rolled back, hence uncommitted data is valid), this is deemed acceptable. The transaction isolation level that permits dirty reads is termed "READ UNCOMMITTED".
Thanks for the clarification. I'm curious why using an uncommitted value would be more efficient than using the "official" data (as it stands before the commit). Perhaps this should be moved to its own topic.
I'm not sure I'd consider "out of date" to be "erroneous". It is highly dependent on the domain environment. But anyhow, it seems the ideal system would offer a choice between either kind of "dirty read".
It's not merely "out of date", it is erroneous. Imagine that A adds 1000 to your current account balance, and B concurrently adds 500 to your account balance. Under READ UNCOMMITTED, if B reads A's uncommitted balance and builds on it, the result is consistently correct provided A commits. If B instead takes the initial account balance from before A's transaction starts, A's update is lost and the result is consistently incorrect.
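A worked version of that interleaving in plain Python (no DBMS, just the arithmetic):

 balance = 100

 # Case 1: B reads A's uncommitted write and builds on it (READ UNCOMMITTED).
 a_result = balance + 1000           # A: 1100, not yet committed
 b_result = a_result + 500           # B reads the dirty 1100 -> 1600
 # If A commits, 1600 is exactly the serial (correct) result.

 # Case 2: B takes the balance from before A's transaction started.
 b_result = balance + 500            # B: 600, ignoring A's +1000
 # A's update is lost; the committed balance is consistently wrong.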
In that scenario, yes. But envision a product catalog. If "A" is merely a read but "B" has an error and is aborted, then using the pre-B state is less of a "sin" than using the bad B state, and waiting for B to settle would potentially be slower anyway.
As noted above, the presumption of normal transactional processing is that failure to commit (i.e., "is aborted") is considered rare and unlikely.
It's a business decision. Can you say that any of these 3 options is always the best or always the worst?:
[Your example problem is too weak to illustrate the differences between isolation levels, and certainly too weak to imply performance differences (AcidCompromisedForPerformance) in a read-only transaction. Complicate it up a bit...]
[To be fair, you should say that with all non-isolated levels (all levels lower than SERIALIZABLE) you get RaceConditions. READ COMMITTED and REPEATABLE READ offer progressively better reasoning about what sort of race conditions might occur, but don't prevent them. Not all race conditions are malign (resulting in error), but most of those dealing with updating shared state are malign. READ UNCOMMITTED is about at a level where you don't have any synchronization at all.]
[But top is saying that "an ideal system would offer a choice". I don't believe this to be the case. When attempting to achieve some fraction of AtomicConsistentIsolatedDurable, it is useful to note how these properties feed back into one another. In particular, sacrificing isolation makes consistency far more difficult to achieve, forcing programmers to achieve consistency at individual steps rather than for the transaction as a whole (which is far more expensive in algorithmic costs and constrains programmers in ways that I expect hurt productivity). Sacrificing isolation to the point of READ UNCOMMITTED pretty much kicks 'Atomic' in its rockers (since a transaction that commits can commit data that would otherwise have been uncommitted). While some ability to 'escape' transactions is valuable (e.g. to provide a progress bar or transaction debugger), reducing isolation levels should probably be heavily limited in scope... e.g. to 'read only' transactions. But, and here's the kicker: read-only transactions are relatively cheap to perform in a fully isolated manner, even with optimistic transactions (no locks), so most arguments in favor of escaping isolation tend to evaporate if the scope is limited to 'read only' transactions.]
True, and well-put. I was becoming increasingly terse (to the point of incompleteness) out of frustration. The behaviour of transactions under various isolation levels can be found in any sound reference on DBMSes, and should be well-known by anyone who professes to be a database expert.
[Heh, yeah, you'd expect someone whose name is derived from 'TableOrientedProgramming' to know these things.]
Explains why you have NO handle. Anyhow, the jury is still out.
No, it isn't. Someone has taken the time to explain, in their own words, material that can be found in any popular comprehensive text on databases. You should be appreciative of the effort rather than arrogant and dismissive. Asking you to make a BookStop was never more appropriate than on this page.
[Well, the jury adjourned and went home long enough ago for several members to have died of old age, at least on the terminology and the behavior of transactions under various isolation levels. But if Top was talking about the value of "choice", then he's right that there is no broad consensus. I thought the same thing on first reading: his statement was exceedingly ambiguous in its intended application.]
How about a common scenario where the "old cache" approach would be less efficient or riskier? Remember that the original motivation for the above discussion was scaling to multiple CPUs. The idea is that each CPU may have copies (cache) of some or all of the data. Rather than keep requesting a central copy, they could use their OWN cache as long as it's not older than a given span of time. I envision a kind of repeating "pulsed update" where queued catalog updates (catalog scenario above) are performed against the single master data set while the multiple CPUs are using their caches. Thus, when the second "copy pulse" time comes, most or all of the updates are already applied. This avoids the bottleneck of all processors trying to access a single copy of the pending updates and/or the single "official" copy of the data. --top
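A rough Python sketch of that "pulsed update" idea, with made-up names and values (in practice each CPU or worker would hold its own local_cache and the master would live elsewhere):

 import threading, time, queue

 PULSE_SECONDS = 5
 master = {"widget_price": 10.00}        # the single "official" data set
 pending = queue.Queue()                 # queued catalog updates
 local_cache = dict(master)              # what the readers actually use

 def pulse():
     # Once per pulse: apply all queued updates to the master, then
     # refresh the local copy. Readers never touch the master directly.
     while True:
         time.sleep(PULSE_SECONDS)
         while not pending.empty():
             key, value = pending.get()
             master[key] = value
         local_cache.update(master)

 threading.Thread(target=pulse, daemon=True).start()
 pending.put(("widget_price", 10.50))    # an update waiting for the next pulse
 price = local_cache["widget_price"]     # read from the local copy; at most
                                         # PULSE_SECONDS out of date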
If you aren't planning to validate the data, and simply use cached data as though it were official (so long as it is no more than N seconds old), then you'll have RaceConditions galore. Whether these are malign race conditions depends on the domain. In your particular example, if you used cached data, you could end up with someone buying a product when you're sold out. In airlines, you might end up with two people booking the last seat on the same flight then having to shift one of them to a different plane. Whether these are costly things to you depends on the business.
But caches are not 'anti-ACID' or anything like that. One can easily achieve ACID transactions using old caches; for read-write transactions, one must verify (loosely) that either the data you read was up-to-date or the data you're writing hasn't been read in a read-write transaction. A lower cache update frequency would somewhat increase the risk that a read-write transaction based on reading data from the cache would fail. The update frequency would be tunable to achieve different performance and cost characteristics (bandwidth vs. probability of failed read-write transaction vs. non-ACID devaluation of aged data, etc.).
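A toy sketch of that validation step (optimistic, version-number style; entirely illustrative): a read-write transaction records the versions it read, and the commit is rejected if any of them have moved on, which is exactly the failure whose probability rises as the cache update frequency drops.

 class VersionedStore:
     def __init__(self):
         self.data = {}                        # key -> (value, version)

     def read(self, key):
         return self.data.get(key, (None, 0))  # returns (value, version)

     def commit(self, reads, writes):
         # reads:  key -> version the transaction saw (possibly from a cache)
         # writes: key -> new value
         for key, seen_version in reads.items():
             if self.data.get(key, (None, 0))[1] != seen_version:
                 return False                   # stale read; caller must retry
         for key, value in writes.items():
             _, version = self.data.get(key, (None, 0))
             self.data[key] = (value, version + 1)
         return True

 store = VersionedStore()
 store.commit({}, {"stock": 10})                              # initial load
 value, seen = store.read("stock")
 ok = store.commit({"stock": seen}, {"stock": value - 1})     # True: still current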
Also, note that read-only transactions, such as your catalog example, don't have ACID-related costs for being out-of-date if you already have caches of data for other reasons. It is trivial to serialize read-only transactions that read aged data only - you simply serialize them as having happened after the last update to that old data. Instead of ACID, you need to assign some sort of cost/value function that describes how aged data reduces in value for decision making and other purposes. For a product catalog for most businesses, you probably could get away with it being out-of-date a full month without difficulty. Altitude on an aircraft would be a bit more critical to keep up-to-date, ideally at least 10 Hz (sub-second out-of-date) according to the domain experts I've worked with.
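That cost/value function might be as simple as an exponential decay whose half-life is a domain decision; a throwaway Python illustration (numbers invented):

 def value_of_aged_data(age_seconds, half_life_seconds):
     # Data loses half its decision-making value every half-life.
     return 0.5 ** (age_seconds / half_life_seconds)

 value_of_aged_data(86400, 30 * 86400)   # day-old catalog data: ~0.98
 value_of_aged_data(1.0, 0.05)           # second-old altitude reading: ~1e-6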