Transparency And Uniformity

Many approaches to the complexities of persistency hierarchies, distribution, and other complex structures try to abstract these complexities away. This is called transparency (see e.g. TransparentPersistence). It makes working with these structures easy, but has an important trade-off: control over the inner workings of the structures is lost. Sometimes this control cannot be given up. But there is no need to go back to controlling every element of the structures in isolation. We can still strive for uniformity, which is strictly weaker (less abstract) than transparency, but powerful nonetheless. Uniformity means we program all elements of a structure in the same way, but can still access and control the different levels and the inner workings of the structures.

ProgrammingWithoutRamDiskDichotomy would be great, because it could reduce and simplify quite a lot, but I would not like to give up the control over whether something is always persisted or may be kept only transient (and restored, recomputed, reinitialized after a PowerFailure). I think that ErosOs goes too far by giving up that control.

I would prefer to handle and address persistent and transient storage alike (with the same syntax and structure, i.e. API), but with optional or inherited parameters/annotations that signify whether and how safely the data is stored.

Giving up that control may be comfortable, but quite a lot of large applications need to control how safely which data is stored. If you consider a deeper memory hierarchy that includes backup systems and archiving for important data, it is easy to see that not all data can be moved to the vault. Only data that is marked accordingly should be.

-- GunnarZarncke

But should the syntax or API between "non-volatile" and "volatile" be significantly different? Is this just a historical habit, or for a real reason? What if the volatility was merely an attribute we put on structures? Pick your favorite structures: tables, associative arrays, objects, EssExpressions, etc. and suppose you could declare them one of:

Permanent: Stays until deleted.
Application scope: Stays until the application is finished.
User scope: Stays associated with a particular user. (May be overkill, but threw it in for thoroughness.)
Block scope: Stays until the declaring block is finished (or until all references are gone, if passed out). Example: declared at function scope.

Now that your favorite structure can be any one of these, would it change your programming approach? Is there still a need to have different syntax or access semantics for each scope (as we usually do now)? I am acknowledging your need for different volatility scopes, but am wondering if or why the access semantics need to be tied to them.

-- BlackHat
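
To make the question concrete, here is a minimal sketch of a structure whose lifetime is just an attribute. The names `Scope`, `Table` and `total` are hypothetical and chosen only for illustration. The sketch does not implement any actual persistence; it only shows that client code need not change with the declared scope.

```python
from enum import Enum


class Scope(Enum):
    # Hypothetical lifetime attributes, mirroring the four scopes listed above.
    PERMANENT = "permanent"
    APPLICATION = "application"
    USER = "user"
    BLOCK = "block"


class Table:
    """A key/value structure whose lifetime is an attribute, not a separate API."""

    def __init__(self, scope):
        self.scope = scope   # a real system would pick the backing store from this
        self._rows = {}      # assumption: in-memory stand-in for any backing store

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]

    def values(self):
        return list(self._rows.values())


def total(table):
    # Client code is written once and works for every scope.
    return sum(table.values())
```

Under these assumptions, `total(Table(Scope.PERMANENT))` and `total(Table(Scope.BLOCK))` go through exactly the same calls; only the declaration differs.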

Well, the first category requires one different operator (delete). Other than that, how can it differ from current memory syntax?

"Can" or "should"? In practice, it differs. The earlier ones on the list tend to be via RDBMS.

The annotations for scope proposed above point in the direction I meant. But I also thought about annotations like

persistency
- data may be lost at any time
- data will not be GCed earlier than a given time
- data will not be GCed until memory shortage
- data will stay in memory as long as the system is running
- data will survive a program failure
- data will survive a power failure
- data will survive a given number of hardware failures (storage device losses)
- data will survive a computer loss
- data will survive a datacenter loss
- data will survive one atomic blast
security
- data is unprotected
- data is protected from neighboring programs
- data is protected from access to the storage media
- ...
- data is protected from any access (ahem)
access time
- data may or may not be delivered after any time
- data will be delivered after some time
- data is delivered with tape delay
- data is delivered with disc delay
- data is delivered with ram delay
- data is delivered with cache delay
- data is delivered with register delay

-- GunnarZarncke
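
One way to read the list above is as ordered levels along independent dimensions, so that a requirement can be checked against what a storage tier offers. The following is only a sketch: the names `Persistency`, `Security` and `Guarantees` are hypothetical, only some of the levels are included, and treating each dimension as a strict total order is an assumption that the list itself does not guarantee.

```python
from dataclasses import dataclass
from enum import IntEnum


class Persistency(IntEnum):
    # Assumed ordering: a higher value means the data survives more.
    VOLATILE = 0          # data may be lost at any time
    POWER_ON = 1          # stays as long as the system is running
    PROGRAM_FAILURE = 2   # survives a program failure
    POWER_FAILURE = 3     # survives a power failure
    DEVICE_LOSS = 4       # survives loss of a storage device
    DATACENTER_LOSS = 5   # survives loss of a datacenter


class Security(IntEnum):
    UNPROTECTED = 0
    PROCESS_ISOLATED = 1  # protected from neighboring programs
    MEDIA_PROTECTED = 2   # protected from access to the storage media


@dataclass(frozen=True)
class Guarantees:
    persistency: Persistency
    security: Security
    max_delay_s: float    # upper bound on access time, in seconds

    def satisfies(self, required):
        """True if this offer is at least as strong as `required` in every dimension."""
        return (self.persistency >= required.persistency
                and self.security >= required.security
                and self.max_delay_s <= required.max_delay_s)
```

A requirement and an offer share one type here, so "is this tier good enough?" becomes a single comparison per dimension.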

I don't see the utility of some of these. Persistence of a level less than a (current) local variable seems pointless, and the last three classes of persistence seem very specialized. The security categories seem more useful, but the access times seem to fall into three classes: (possibly never), (long), and (immediate), again with the proviso that the last few may have specialized use. -- PeteHardie

I do agree there may be more dimensions than my original list suggests. But the question is the same: should our syntax/API look significantly different for a given combination of such features?

I do have a nit about using hardware-dependent delays, though. I would suggest instead that an "average delay" (or priority) and "max delay" be requested, and then the system determines which device to use. This is more in the spirit of the title.

-- BlackHat
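
The suggestion to request an average and a maximum delay and let the system pick the device could be sketched as follows. The `Device` record, the `choose_device` function, the tie-break by cost, and all numbers are illustrative assumptions, not measurements or an existing API. Raising an exception when nothing qualifies is just one possible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    avg_delay_s: float   # typical access delay in seconds (illustrative value)
    max_delay_s: float   # worst-case access delay in seconds (illustrative value)
    cost_per_gb: float   # assumed tie-breaker: prefer the cheapest adequate device


class UnsatisfiableRequest(Exception):
    """Raised when no device meets the requested delays."""


def choose_device(devices, avg_delay_s, max_delay_s):
    """Pick the cheapest device that meets both requested delay bounds."""
    candidates = [d for d in devices
                  if d.avg_delay_s <= avg_delay_s and d.max_delay_s <= max_delay_s]
    if not candidates:
        raise UnsatisfiableRequest(
            f"no device offers avg <= {avg_delay_s}s and max <= {max_delay_s}s")
    return min(candidates, key=lambda d: d.cost_per_gb)


# Made-up device table, for illustration only.
DEVICES = [
    Device("ram", 1e-7, 1e-6, 5.0),
    Device("disk", 1e-2, 1e-1, 0.05),
    Device("tape", 60.0, 600.0, 0.01),
]
```

The caller states what it needs (`choose_device(DEVICES, 0.05, 0.5)`) and never names a device; the hardware-dependent part stays inside the table.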

No, the API shouldn't look any different. I think opening and editing data on disk should look like this:

  {
    [accesstime <= delay.typical_ram] Document doc = storagesystem.get("my_document.doc");
    // edit doc
  }

or if you want to keep the open/save/lost-data metaphor, you could do

  {
    [accesstime <= delay.typical_ram && persistency >= persistency.poweron] Document doc = storagesystem.get("my_document.doc").copy();
    // edit doc
    storagesystem.set("my_document.doc", doc);
  }

I agree that the delays should be specified absolutely, but there should be handy constants for typical values (like typical RAM access time).

-- .gz

I am not sure a file system analogy scales. For example, take a big array. One usually does not copy the entire array, change one slot, and then copy the entire array back. One generally wants to talk to one slot at a time. And, what about aggregate operations? For example, in SQL we might have "UPDATE foo SET amt = amt + 7".

[some time later]

There is no 'file system analogy' intended. Storagesystem stands for the universal storage system, including objects held in memory or on tape. The access by name is illustrative only; I could as well have written "storagesystem.user.documents.personal.my_document", which would make the same logic look like object structure traversal.

Array access would be annotated differently of course - but still annotated. Imagine something like

  {
    [accesstime <= delay.typical_disc && persistency >= persistency.poweron ] int[] array = getArray();
    for(int i = low; i < high; i++)
    {
      [accesstime <= delay.typical_ram ] int e = array[i];
      e += p;
      array[i] = e;  // write the changed value back into the slot
    }
  }

This would make the (possibly very large) array available at disc speed and with weak persistency, i.e. a typical application of a memory-mapped file. The actual access is annotated to be done in RAM. This can be hoisted out of the loop to automatically load the relevant blocks from disc (and store them back when the reference is discarded at the closing '}'). And none of this fetching and blocking has to be coded.

-- .gz
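
For comparison, this is roughly what the memory-mapped-file version of the loop above looks like today, using Python's standard `mmap` and `struct` modules. The on-disk layout (little-endian 32-bit integers) and the helper names `add_to_range`, `write_array` and `read_array` are assumptions made for this sketch. The operating system does the block fetching and write-back, but the mapping still has to be set up by hand, which is exactly the part the annotations are meant to remove.

```python
import mmap
import struct

INT = struct.Struct("<i")  # assumed layout: one little-endian 32-bit int per slot


def add_to_range(path, low, high, p):
    """Add p to slots low..high-1 of the int array stored in the file at path."""
    with open(path, "r+b") as f:
        # Map the whole file; pages are loaded from disk on first access.
        with mmap.mmap(f.fileno(), 0) as view:
            for i in range(low, high):
                offset = i * INT.size
                (e,) = INT.unpack_from(view, offset)  # read one slot
                INT.pack_into(view, offset, e + p)    # write it back in place
            view.flush()  # push dirty pages to the file before unmapping


def write_array(path, values):
    """Helper for the example: store a list of ints in the assumed layout."""
    with open(path, "wb") as f:
        for v in values:
            f.write(INT.pack(v))


def read_array(path):
    """Helper for the example: load the whole array back as a list."""
    with open(path, "rb") as f:
        data = f.read()
    return [v[0] for v in INT.iter_unpack(data)]
```

Only the slots in the requested range are touched, so a very large file is never copied as a whole.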

I've given some thought to a similar system, though access time was not on the list. (I might add it, but it seems the one item that is weakest to guarantee. Instead, I have 'local' variables that cannot be shared or persisted and are atomic only if necessary, with which to provide rapid access guarantees.) One could use an 'attach' metaphor for storage (treats individual storage like a service with each cell being a first-class process; allows you to interject any process with the correct interface and treat it like a storage process).

Is the attach metaphor maybe the same as AccessOrientedProgramming?
Not the attach metaphor. 'Attach' is far more associated with prep for communications (setting up communications objects, possibly a message queue and semaphore, providing the initial set of capabilities with which you're making the connection, prepping the handshake, etc. - all before the first message is sent). Closer to AccessOrientedProgramming would be an ability to attach procedure-descriptors to a cell that are to be executed when the cell is updated. Such a service could be made available, of course... similar to a subscription service provided by each cell. If a subscription-service is provided, then AccessOrientedProgramming would be an easy delegate to write.
Personally, I'd far prefer a decent mechanism to add true rule-based programming than a simple reimplementation of AccessOrientedProgramming. But that's far more complex, as it requires a degree of system-wide coordination.

Anyhow, here's the list of orthogonal attributes I arrived at, and some of the conclusions:

persistence - surviving the lifetime of various components. If you have 'persistent' storage, you also need to have a namespace that survives components. There are special conditions wherein you are either unable to attach to existing storage (e.g. your copy is distributed or secured) or no existing storage exists. Suggestion is to allow specification of a default 'initial' value whenever accessing persistent storage (i.e. the value used and persisted if the object is not available).
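
The 'initial value' suggestion can be sketched with a file as the stand-in for persistent storage. The function name `attach`, the use of a file path as the surviving name, and JSON as the stored format are all assumptions of this sketch. It covers only the case where no storage exists yet, not the case where existing storage cannot be reached.

```python
import json
import os


def attach(path, initial):
    """Return the value persisted at `path`.

    If nothing has been persisted there yet, persist `initial` and return it.
    """
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(initial, f)
    return initial
```

The first call under a given name creates the storage with the default; later calls ignore the default and return what was persisted.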

An interesting problem is when to collect persistent storage.

security - protection against unauthorized mutation and/or peeking. Authority should be granted via tickets or certificates (on a Capability basis).

distribution - ability to share. Note that shared storage has a few other relevant orthogonal attributes. E.g. LDAP - data that is write-rarely read-often, and possesses graceful degradation even when out-of-date. If you can state that you don't care that your copy is completely up-to-date, a massive optimization can be made.
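
The optimization for readers that tolerate out-of-date data can be sketched as a local copy with a caller-chosen maximum age. `StaleTolerantReader`, the `fetch` callable and the `max_age_s` parameter are hypothetical names; the remote round trip is represented by whatever function is passed in.

```python
import time


class StaleTolerantReader:
    """Serve a local copy of shared data as long as it is fresh enough."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch        # callable that performs the expensive remote read
        self._clock = clock        # injectable clock, so behavior can be tested
        self._value = None
        self._fetched_at = None    # None means nothing has been fetched yet

    def read(self, max_age_s):
        """Return the value, refetching only if the copy is older than max_age_s."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at > max_age_s:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

A reader that passes a large `max_age_s` almost never goes to the shared store, while a reader that passes `0` refetches whenever any time has passed; both use the same call.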

A really interesting problem is when to collect persistent, distributed storage.

transaction support - support for multi-cell SoftwareTransactionalMemory (possibly including hierarchical sub-transactions). Only cells readily support transactions, though other communication-services could provide a compatible interface.

subscription support - support for subscriptions to changes, possibly with various conditions e.g. max rate of update, or send changes in function over value, or even attach a procedure or continuation to be run with given capabilities.
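
A minimal sketch of a cell with subscription support, assuming a hypothetical `Cell` class in which each subscriber may supply a condition on the (old, new) pair. Rate limiting, capabilities and continuations are left out.

```python
class Cell:
    """A storage cell that notifies subscribers when its value changes."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []  # list of (callback, condition) pairs

    def get(self):
        return self._value

    def subscribe(self, callback, condition=lambda old, new: True):
        """Call callback(old, new) on each change for which condition(old, new) holds."""
        self._subscribers.append((callback, condition))

    def set(self, value):
        old, self._value = self._value, value
        if old == value:
            return  # writing the same value is not reported as a change
        for callback, condition in self._subscribers:
            if condition(old, value):
                callback(old, value)
```

A subscriber interested only in large jumps could pass `condition=lambda old, new: abs(new - old) >= 10`; a cell with no subscribers pays only for an empty loop.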

Subscription and transactions are two services that can easily be optimized away on a per-cell basis if the compiler can determine that they aren't used.

I also have an orthogonal list of features for communication services. There are four basic types: one-to-one (connection), one-to-many (broadcast), many-to-one (port), many-to-many (channel). Orthogonal features are: (a) ordered message transport, (b) reliable message transport, (c) attach protocol, (d) type of data transmitted, (e) encoding for data transmitted, (f) persistence of connection/broadcast/channel, (g) transport medium.

I happened to choose these in a manner that could be resolved regardless of your choices, but, again, I didn't have 'accesstime'. Ability to specify 'accesstime' may be beneficial to those attempting realtime guarantees, but I also see it as something that is going to be nearly impossible to support orthogonally to these other features. With 'accesstime' it becomes really easy to ask for the impossible... and a good question is how one should resolve the impossible (with an exception? a compile-time error? - the power to shoot oneself in the foot shouldn't come without a safety).

EditHint: I'm still bothered deeply by the title of this topic. It seems way too open-ended.