Push Doc Query In Sql

Subtitle:"How I made IBM cry once"

If you can store parameterised queries then you can push the results into a cache ahead of time. Alternatively you can go buy an additional caching Web application box for a lot of cash (Oracle 9i, Sybase Enterprise Portal) and it still won't be as fast as this, or you can DIY.

This version was on the db server and was faster than any existing commercial solution. Okay the server had 4GB of memory, but only 4 cpus. It drove 4 webservers (4 cpu each) giving out content for all pages plus arbitrary queries and was only just breaking a sweat. Scalability was made infinite using a write/read separable architecture.

This is one of the larger system/process architectural examples for my book. The cache maintenance logic is the primary focus, though I will discuss some of the non-caching advantages of having parameterised queries.

So far we've got a pretty efficient stemming query based on a small but useful logic. We can store any query since it is a list of <parentkey> terms. I index these queries with a <queryname> key in a separate table <stored query>. This table is handy for drop menus, usually including a <description> too. This table gets more stuff in a minute. The members of any one query can be the original stem terms, or the complete tree. I chose to only store the original terms as there could be a lot of derived terms, and the cache meant the small speed optimisation was unnecessary. This also allowed me to factor out the basic query logic into a single stored procedure used by both the stored queries and the ad-hoc queries. An ad-hoc query simply needed 'saving' to make it available for the further cache (and watchdog) functionality.

Initially this was a convenience for page authors who could embed an ad-hoc query as a JavaScript http GET using a cgi style parameter string (this was actually significantly complicated by the fixed number of parameters in stored procedures, but I digress). As time went on, more and more standard queries were generated and stored by name, thus the term 'stored queries'. Later this functionality was exposed to users so they could store their own libraries of queries and watchdogs.

The result of a query is a list of documents in some order. This list can also be stored in a cache. Thus stored queries can be run at any time and cached. This is the basis for 'push' functionality. Whenever a document was stored in the system, its attributes were stored as <parentkey> terms in the document cross-reference table. These terms were used to derive the affected stored queries which were then run on the next push cycle (every two minutes as it happens).

The cache entries could also age out, so the 'push' cycle involved two parts:

Comb aged/deleted documents out of cache.
Update cache where might include new documents.

The second part was designed to only run queries related to new (or updated) documents, but completely regenerated the cache contents for any particular query. This was much more reliable and simple. Updated document categories would otherwise cause duplicate cache entries, requiring further logic to fix. Specific caches could be manually invalidated where the query itself had changed, or the cache data wasn't there. Thus queries against cached data had to fallback to a full query when necessary. This prevented users from being forced to wait for push-cycle updates if they didn't want to. The flag which indicated cache status was put into the <stored query> table as <invalidflag>. I also added a <nocache> flag for stored queries that never cached results.

The <invalidflag> was set by any situation where a query needed to be run. Queries which were not cached were still flagged on a change. This allowed me to do 'watchdogs'. These were queries entered by users which defined some events they wanted to be notified about. The queries were not run by the cache logic, but an icon on their query page would change to represent the change in state. This icon linked to the actual stored query which could be run to find out what had triggered the watchdog.

We are starting to get into serious functionality now. Particularly this can be the basis for any dynamic personal portal. User preferences and standard pages are derived from the same query logic. This uniformity makes for a happy bunch of developers and happy users who often heavily customised the portal site by defining their own pages of stored queries. Content publishers used the same system for standard pages, and used a slightly modified version of the 'user preferences' dialogs to define those pages and how they should be displayed.

The 'push' logic allowed writes and reads to be fully separable. This scheme was well suited to scaling.

Oh yes, why did IBM cry? Well they quoted 3.5 megabucks to duplicate our feat using Websphere and about eight very large Sun boxes. Suddenly my extortionate rate didn't look so bad after all :). I don't know the status of the project, but it is still a very successful portal, even if they did strip out a lot of the good bits so IBM could do it.

Cheers

-- RichardHenderson.

CategorySqlProgramming