Runtime Upgradeable Core

While considering the idea for a WikiIde, DaveVoorhis expressed an opinion that the WikiIde kernel should be as small as possible, with the goal of moving into a purely Wiki-based editing environment ASAP, yet be fully upgradeable from within the Wiki-based editing environment. Specifically, his words were: "if adequately designed, I believe it should be possible, over time, to evolve the [SpikeSolution] into the system described above entirely via the Wiki interface. [...] make the initial kernel, if it can be called that, as minimal as possible, with the goal of moving into a purely Wiki-based editing environment as quickly as possible."

This is a goal that I've latched onto; it appeals very heavily to my affection for elegant design and simpleness without simplicity. However, I've been banging my brain against it for a while now, and I'm a bit stuck on how to do it in a manner that is both elegant and practical.

First, a few definitions for this discussion:

A kernel isn't really any one entity; it is an interrelated collection of core services.
A critical service is any service that must not be interrupted. I'll assume familiarity with the term 'service', but for clarity: a service (a) is identifiable as fulfilling a particular role, (b) must be capable of being located to be of any use, (c) completes some sort of explicit or implicit 'service contract', and (d) (for computation services) does so in terms of receiving and delivering communications (not necessarily to the same parties).
- Critical services for the WikiIde include web service, the ability to create, read, and update pages, and the ability to alter the behavior of the WikiIde by doing (because if you can't do that much, you cannot fix the WikiIde after it breaks).
A core service is a service without which the system will not provide critical services (will not, not cannot). Obviously, this includes all critical services, but also any services upon which the critical services depend.
- Core services for the WikiIde include: event dispatcher and basic executive support, web server, storage management and versioning support (for both cell-data (like jpg images) and the WikiWord code-pages), security manager (which is critical only due to the potential of being locked out of your own system), process scheduler, database/filesystem indexing support (for finding pages), parser/compiler/linker/loader for pages (necessary for modifying behaviors). Keep in mind for WikiIde that all of these shall (ultimately) need to be designed in a scalable manner to support massively parallel operation: in a larger WikiIde, most of these core-services will need the ability to run on many CPUs, likely distributed worldwide, with storage and versioning management that is similarly distributed in order to provide automatic backups of the complete version-history. This distribution might raise the challenge-bar for RuntimeUpgradeableCore. (There are also many non-core services, such as page caching, and caching pre-parsed or compiled and optimized forms of a 'page' or project-instance, without which a large WikiIde cannot function reasonably - but those aren't 'core services' for a reason.)
Runtime Upgradeable ideally means without a 'reset' or interruption of actively running critical services, while upgrading services (or at least future use of the services). However, I consider that to be stomping over pragmatism, so I'm willing to accept any brief but non-catastrophic interruption of most services (e.g. loss of session, as though one had closed the browser, after upgrading the web service) - in practical terms, one can usually warn users ahead of an upgrade that will interrupt services.
RuntimeUpgradeableCore, when put all together, means that the system provides a set of core services that are Runtime Upgradeable even while being serviced from the current core services. I.e. one is not going 'outside' the core to upgrade the core. Pretend that the core is all you have - e.g. that it is running on bare metal, and that you have no practical way of inserting a rescue disk and re-bootstrapping the system if you fail. I say "pretend" because we all know it isn't true - there is a safety-net if you need it - but if you have to use any sort of rescue operation, the system has failed the goal of providing a (safely) RuntimeUpgradeableCore.

The real goal is to provide a safely RuntimeUpgradeableCore, such that one can literally upgrade core services, like the webserver or the page database/indexer, while in the midst of using the current web-server and page database/indexer, all without that horrible sense of dread or fear of things going horribly wrong and forcing you rescue your machine (or panoply of distributed machines) by hand.

Now, I've so far thought of two fundamental approaches to obtaining safety (and cannot help but think they ought to both be implemented, for sake of completeness): (a) the ability to run a to-be-core-service in parallel for an arbitrary period of time before switching to it, and providing some mechanism of transition. This would be akin to running the 'new' Linux Kernel atop the 'old' Linux Kernel, but having some magic button that says "switch!" and makes it so the old Linux Kernel is now running atop the new one - and implementing this "magic button" will undoubtedly take a great deal of thought itself (especially testing/ensuring it survives the transition, such that the property of being a RuntimeUpgradebleCore? is not lost in the upgrade). The ability to run the proposed service as a regular project on the WikiIde is ideal for a great number of reasons, since it means all projects are treated equally and allows one to run a variety of tests and perform arbitrary debugging from within the WikiIde (which is intended to provide these services) so you can feel comfortable that the changeover is unlikely to break things. (b) Some sort of rescue-service available continuously that runs with a set of core services completely independently of the primary one. E.g. if the webserver runs on port 80, one might have the rescue webserver running on port 2280 (memorable as: "port 80 after catch 22") - one could reserve the whole 22xx block just for rescue services. These would share very few of the core services, but there would need to be some overlap: the primary process scheduler and primary memory manager (which would leave upgrading those riskier than the others & more likely to require a rescue). This is intended to be redundant, as a safe-harbor before resorting to a bootup disk with rescue software. These catch-22 backup services also need to be upgradeable since you might change file-system formats or upgrade the language, but one can do so by choosing a backup service that is known-stable from among the recent builds.

Also, it wouldn't hurt for the WikiIde to be capable of automatically packaging up and providing the core service components as-they-exist as part of a machine bootup disk. (I do intend, ultimately, for WikiIde to run on bare metal; there isn't much reason to deal with interference from another kernel with other agendas if one does not need to do so. But the ability to provide the services even atop, say, a Linux boot disk would also do the job... albeit far less elegantly IMO.) That way, when the going gets really tough, a smart administrator can have a library of old bootup disks to fall back upon to rescue the system that will inherit any changes to the versioning support, filesystem formats, code language, etc. that might be critical for reading and rescuing the system. This packaging of the core will be aided by having the WikiIde be quite reflective and fully bootstrapped (the code for the WikiIde will be on WikiWord pages available through the WikiIde), so I imagine it is readily possible.

Anyhow, the above may provide safety, but still leaves that darned transition hanging around - the "magic button" mentioned above to initiate the transfer a core-service to its upgraded version that is running in the WikiIde, plus maintenance of the RuntimeUpgradeableCore property across upgrades. The latter, I'm afraid, will simply need to be obviously marked so it won't be removed by accident (ultimately, I imagine that part of being extensible and flexible is sometimes the ability to give up the ability to be extensible and flexible) - one still has the 22 service for rolling back if desired. As far as transition goes: forwarding between services doesn't count: that would bog down the system after just a few upgrades, and I'm not keen on one webserver knowing about another (or at least needing to know about another). But if services in the WikiIde are identified by their WikiWord page names (and that's how I plan it), then perhaps making the changeover can be as simple as modifying either one, central file (WikiHeart?, maybe? or WikiCore? if we're feeling a little less Disney-esque) that has a list of core services - the ones to which information incoming on particular ports are directed (e.g. it might have '2280 - RescueWikiWebServer') or perhaps having each of those destinations be a 'redirect' page (so RescueWikiWebServer is a ServiceRedirect:CoreWebService_xxxxx). Thus, the 'core' services are fully defined very simply as those that receive the top-level 'port' inputs, whereas non-core services get inputs directed through the code-language itself (often SharedMemory MessagePassing) and only redirect through the primary ports when doing reflection on the Wiki.

Just ThinkingOutLoud as though I had an audience that cared. (It helps me, if not you.)

On WikiIde, DaveVoorhis suggests use of dual heartbeat monitors (described below). These would also provide a great deal of stability absent human-intervention (for everything but the scheduler and the network stack stuff - which aren't helped by much else, either). They can be used in addition to the above approaches, and rely on both a source-control system and the ability to manipulate the execution to swap out an old service in favor of a newer one.

Primary heartbeat monitor.

This consists of, at minimum:
- A collection of test scripts that verify correct operation of:
  - All other core services
  - Explicitly including the secondary heartbeat monitor. (See below)
The primary heartbeat monitor shall continuously monitor the core and the secondary heartbeat monitor for new deployment of any of their components. When a change is deployed, the collection of test scripts will be executed. If any tests fail, the primary heartbeat monitor will roll back the changes (using the previous version of source control system, if it's what has just been modified) and reinstall the previous known-working version.
The primary heartbeat monitor must depend on either the system scheduler to initiate it at some regular interval (e.g. a subscription to a clock signal), or upon a (far more complex, and thus more likely to break) subscription-service that can identify when core-services are being upgraded. As a consequence, the scheduler is one service it cannot effectively test. Further, it may become ineffectual if the network stack goes down, or if the version-control system is compromised.
- A modified scheduler can be sandboxed and checked for proper short-term behaviour. Obviously, what can't be verified is whether or not it will continue running. -- DaveVoorhis
- Actually, if one separates multi-plexing layers from abstraction layers, one can even run the scheduler for quite a while as part of another service (such that processes within a process are sharing a scheduler). This can provide some vigorous testing. The real 'gotcha' is when attempting to change out, shifting from this 'sandbox' to the real deal and the full workload. If that goes wrong, it's time to pull out the rescue disk (or boot configurations/options). That said, a simple power-cycle could be made to work most of the time if one makes changing processes out into transactions (such that the final acceptance of the change isn't logged for some time after the change has been initiated... long enough for the primary heartbeat monitor to verify it, and possibly ramped up artificially for a scheduler change). The primary hearbeat monitor can verify a scheduler that is working at least briefly even though it cannot verify a scheduler change that doesn't changeover.
Some decisions need to be made with regards to how the Primary Heartbeat Monitor should behave if its test-scripts are updated or addended and, consequently, some of them fail. This can't really be considered a failure of the system being monitored. Some possibilities include backing up to run the last-known-working-versions of the scripts to ensure those pass, allowing failures to be isolated to the 'new' versions. With that setup, once the new versions are working, they become part of the last-known-working-script set.

Secondary heartbeat monitor.

This consists of, at minimum:
- A collection of test scripts that verify correct operation of:
  - The primary heartbeat monitor.
The secondary heartbeat monitor shall continuously monitor the primary heartbeat monitor for new deployment of any of its components. When a change is deployed, the collection of test scripts will be executed. If any tests fail, the secondary heartbeat monitor will roll back the changes and reinstall the previous known-working version.
Subject to the same limitations as the primary heartbeat monitor.

The following constraints must be observed:

When a change of the primary heartbeat monitor is deployed, no changes may be made to any other core service until the secondary heartbeat monitor has validated (and possibly rolled back) the primary heartbeat monitor.
When a change of the secondary heartbeat monitor is deployed, no changes may be made to any subsystem until the primary heartbeat monitor has validated (and possibly rolled back) the secondary heartbeat monitor.
When a change to another core service is deployed, no changes may be made to the heartbeat monitors until the primary heartbeat monitor has validated (and possibly rolled back) the core.
Compilers for all core services must be vigorously tested.
Changes to the scheduler, network-stack, version control system, and core-service language compilers ought to be deferred until sufficient automated testing has been performed in a safe or sandbox environment.

CategoryWiki