(This is something of a rant... but I'm hoping to open a discussion.)
The Problem:
We've all used them: Lists, Stacks, Queues, Vectors or Arrays, Sets, Maps, Hashes, Tables, Relations.... Each of these is a collection of values (possibly of object or cell references) with variant computational and conceptual access and manipulation characteristics (where 'access' refers to only those procedures that do not manipulate the collection as a whole). E.g. for lists, one can often access or manipulate (including insertion and deletion) to the first and last elements in O(1) time; with arrays, access by ordinal index occurs in O(1) time but deletion and insertion may be expensive (or even impossible). With sets and maps, it is often optimal to gain O(lg(N)) insertion, deletion, and manipulations; maps and hashes allow quick access to any particular item by some sort of key descriptor. Tables and Relations allow access to component values by some sort of partial descriptor (e.g. one or two arbitrary fields in a tuple), and access is optimized by indexing over certain 'key' fields in certain.
(Records and Tuples are not intended for use as arbitrary-length value-collections and, for that reason, aren't mentioned above.)
One can, in general, say that a collection is "indexed" by particular properties if it provides access to components with that property in a manner more rapid than a O(N) search, and a collection is indexed to a purpose (e.g. or deletion or insertion or manipulation of a middle element) if it allows procedural manipulation of the collection at a cost less than O(N) to reconstruct the 'new' collection resultant from said manipulation (be it an object or a value). (collection values, where the collection is, itself, an immutable value, are often more difficult to index for manipulation of a central element than are collection objects.) E.g. a list with rapid access/addition/deletion regarding the first and last elements would qualify as a collection "indexed" over the first and last elements, even if we don't typically use the word "index" in that manner. This is evidenced by the common implementation of such a list containing pointers directly to the first and last elements.
Any general-purpose programming language will provide at least one decent collection format (often the humble list, sometimes the array), and most general-purpose languages will, possibly through common or standard libraries, provide a great number of collection formats. As programmers, we grow comfortable with selecting the right collection for the job. The "right collection" is determined by the access and manipulation characteristics we require of the collection... both conceptually and computationally. E.g. one might select a stack over a list if, conceptually, 'stack' fits the usage better, even if the stack is just implemented via list. Similarly, one might select an array if one requires rapid random access to elements.
Selecting the right tool - the right collection - for the job is hardly a bad thing. It's the sign of a good programmer.
Unfortunately, the precise collection the programmer requires is very often unavailable in whichever language we choose. E.g. one often requires rapid access to items both by ordinal (implying array) or key (implying hash or map or table) AND by various other characteristics.
Tables and relations are, perhaps, the most flexible of the value collections (due to allowance over multiple indices), but they are rarely available in general-purpose languages (being also the most complex for which to maintain and, especially, implicitly utilize indices... and being even more complex if the values so indexed are object-references and thus subject to external manipulation (this would imply a requirement to subscribe to the objects in some manner in order to maintain indices). Even the flexible tables often lack the sort of indices required for rapid access as desired by the program (e.g. if the domain involves some sort of data-clustering (as in Bayesian sorting) or spatial-indexing (e.g. figuring out the set of objects within N meters of a particular point, or the set of objects intersecting a particular volume, or the set of possible values intersecting a conic volume associated with a directional field).
Also, for ordered collections (e.g. strings), this doesn't begin to touch upon such things as patterns crossing multiple values (e.g. regular expressions, grammars, and rapid access into subranges of a collection based on such things). These become remarkably important when dealing with indexing web-pages, parsing languages, interpreting text, etc.
Thus, as programmers, we are forced to build our own indices and accessors and manipulators... again and again and again. Often we'll forgo an index (and its associated maintenance) and perform straightforward searches; premature optimization is, after all, a fantastic and common source of error. However, as we add accessors and manipulators to these collections (and possible indices), the code becomes denser, more specialized, more difficult to update, maintain, refactor, reuse, and replace. CodebaseInertia? grows. Further, maintenances and difficulty of communicating the Indexes associated with the Value Collection adds and adds and adds to DatabaseInertia? -- it becomes expensive to move or replicate the data!
The Solution?
What I desire, what I'm calling for, is the design of a generic IndexedValueCollection, and its implementation within general purpose languages. (Hear me, language designers? This goes into the FutureOfProgrammingLanguages wishlist!)
Here are the features of the (as yet non-existent) generic IndexedValueCollection:
I believe it is feasible to design the IndexedValueCollection, but I wish to hear opinions as to what other programmers think about this. If anyone knows of research already performed in this direction, please post it!
The idea of a universal container -- whose abstract interface is constant but whose concrete implementation changes dynamically -- is not new, and is hinted at or mentioned in (at least) AreTablesGeneralPurposeStructures and TopsQueryResultSet. I've built a few experimental implementations, but I've never been particularly happy with the results. My experience (thus far -- I'm sure I'll tackle it again) has been as follows:
Ideally a compiler or static optimizer could do at least as well as a programmer at statically choosing 'the right tool for the job' even without hints... but engineers well understand the vast gap between 'ideal' and 'real'. Your experiences are all very well taken, and none too surprising (though I hadn't actively thought of them, except the hints). I thank you for offering them.
Taking them into account would imply the following: (a) Concrete structures of a collection shouldn't change during 'uptimes' unless it can be performed iteratively (e.g. in a half-lazy manner where processing is distributed over several manipulation events). (b) Statistics-based decisions should, perhaps, be performed in some audit-manner (e.g. only recording every ~hundredth access or manipulation event, with some Poisson randomness to avoid beat-frequency style patterns) depending on the relative cost of maintaining this record (which may be very high, esp. if it involves any form of logging). Complete statistics aren't needed to create a very good profile. Where possible, some sort of 'initial' (but non-binding) profile hints and statistics can be acquired both from the programmer (by some syntactic mechanism) and from previous runtime experience should enter the static optimizer or compilation decisions (much like statistics gathered on branching enter compilation decisions and profiler statistics often enter static hand-optimization decisions).
The greater challenge will remain the collection-value... where manipulations create 'new' values but leave the old one intact, and where the 'new' values may have different usage characteristics entirely than the one from which it arose. Have you any experience with these? (Beyond a common experience such as iteratively pushing every element of a list into a map so it is indexed by a particular value.)
I'd like the IndexedValueCollection to be good enough (at least in design) for use as tables in databases, filesystems on an OS (or distributed across the network), and even files within a filesystem. Further, they need also to adapt to the low-end and be good enough to handle lists or strings (or thousands of them) for an individual program and avoid excessive resource consumption of the system can't provide it. If a collection is to adapt to its usage, it can't be that all its indexing decisions are made statically, and it shouldn't be that all collections have indices forced unto them (esp. small collections, which at most ordering... which is easily implicit in representation).
Perhaps UniversalContainer? would be a better name for this topic? I see you (Voorhis) used that term (I once built a "universal container" whose internal representation is based on an initial hint...), but I do want to imply the possibility of an indexed collection-value while 'container' implies only a collection-object.
How about UniversalCollection?? Since indexes would be automatically created or dropped from a given collection instance while only altering its performance, and since non-indexed concrete structures would presumably be employed as appropriate, it seems to me that indexes are an implementation detail rather than a defining characteristic. Therefore, "index" need not be part of its name. -- DaveVoorhis
I thought about that one, but 'UniversalCollection?' brings to my mind the set described thusly: {x such that x exists in this universe}. Perhaps 'AccessStructuredCollection?' (indicating how the structure is decided). On the other hand, 'UniversalCollection?' is ostentatious enough to get other readers and more likely to receive accidental linkage, both of which are very positive traits (though more of the marketing variety). I might go with it and move this over there unless we have a better idea. We should probably gather the mentions from other pages we can find, too.
Regarding the 'index' though, I'd say that implicitly (and automatically) structuring and indexing values for optimal access so that the programmer needn't do so explicitly is, indeed, very much a defining characteristic of this overall idea. It's the abstraction and inference of the indexing layer that distinguishes this collection form from various others. While it may be decided that no indexes are needed for a particular collection, a decision to not index said collection is just as relevant as a decision to index it. I'm not opposed to the use of 'index' in the name, so UniversallyIndexedCollection? and AccessIndexedCollection? are also reasonable names.
I see what you mean about the "Universal" in UniversalCollection?. How about AdaptiveCollection? This is more reflective of its behaviour, and is sufficiently general to not imply indexing, non-indexing, or anything else about its internal structure except that it adapts according to usage. Putting "indexed" in the name suggests to me that it always maintains index structures. -- DV
I like it! I'll move this to AdaptiveCollection when I'm ready to go through and fix all the links. Thanks much!
It is always interesting to see that the time is ripe for an idea. I HaveThisPattern too (and a friend of me also). First of all let me say, that I'd stay with IndexedValueCollection. To me indexed doesn't imply an index only an indexing - and thats what you do when you query a collection by key instead of by order. And the value part is important too. A higher order collection of other collections or mutable objects or even itself (implying general graphs) would pose even more complicated problems.
What about when you access the collection purely by insertion and retrieval, as in a Stack or Queue? What about a collection where only insertions are performed, followed by iteration to generate some aggregate result? These do not involve indexes or indexing, yet could well involve adaptation. Therefore, I think AdaptiveCollection is a more general, encompassing, and applicable name. It is the adaptive nature of the collection that makes it significant, not the fact that it involves indexing. An IndexedValueCollection could refer to any indexed data structure whether it is adaptive or not. My understanding of the description above is that the key aspect of the collection is that it is adaptive, hence Adaptive should be in the name. -- DV
As a background: I once implemented a special case of an IndexedValueCollection for bits. The structure adapted to how sparse the bitset was. It moved from single element over enumerated list to bitvector ranges to full bitvectors and used special algorithms for access, intersection, union etc. But it didn't employ hints or profile information.
I also considered general structures, that's why I'd really like your proposal. I fear that the time is not ripe for automatical inference of usage patterns even though compiler research for array access has progressed quite far in the last years (esp. for fortran) it didn't stop with loop unrolling, there is loop fusion, loop fission, loop reordering, loop tiling including incorporating cache access patterns etc. now. But this inevitably focusses on arrays, i.d. structures accessed more-or-less serially by integer indizes. On the other hand there is progress on navigational structures that can detect and optimize recursions traversing trees and and alike structures as are found in actual tree and liste implementation. Integrating these lines of reasearch with profiling data might one day fulfill the sketched dream of a true IndexedValueCollection.
To get back on earth I think in the meantime a IndexedValueCollection baseclass plus a system of hints could free the programmer from the incompletely and arbitrary set of collection classes available, even if this container only delegates to the concrete classes immediately (by using a combination of decorator and factory). I think Daves conclusion that '[a] system of "hints" may reduce that overhead, but the "hints" tend to devolve into little more than a more abstract way of specifying Stack, List, Map, Hash, etc.' is too short-sighted. Of course if you have no smart means in-between and ultimately have to use the containers that are to your disposal now, the hints must neccessarily reduce to these few classes. But if the hints don't read 'list', 'heap', 'map', but instead e.g. read-pattern({(0, 70%), (last, 30%)}) && write-pattern(equally-distributed(0..MY_MAX-1)) && operations(filter, append)'. A simple list that checks special cases and matches these agains existing implementations would suffice.
Regarding the name: I like 'AdaptiveCollection' better, Gunnar, and while I agree with DV's defense of the name above, I possess additional reasons for liking it: accidental linkage, easier to casually include in a sentence, i.e. - marketing. I'm no marketer, but I know how important it is (as should you, with all your research into memes). 'AdaptiveCollection' will receive readers and ideas that 'IndexedValueCollection' would not.
Regarding the rest: I agree that it's still early to infer usage patterns. Indeed, it is impossible to do so statically in many cases! For example, if a collection is held by a service that processes queries (e.g. a filesystem or database), there is no information available to the compiler of that service to indicate the sorts of queries it might receive and had best prepare for.
OTOH, static inference of all possible mechanisms of access combined with a profiler to find usage patterns is well within our grasp. And we can provide hints as to what the profiler might say. I very much like your hints-descriptors, Gunnar, far better than any mention of 'list' or 'stack' or 'priority-queue using <function>' (though those sorts of hints might be legal as shorthand descriptors via syntax extension). The approach you describe above is to provide to the compiler a 'guesstimate' of exactly what a profiler might tell you. Using that sort of hint system (ideally with the exact same summary language a profiler would use) would significantly reduce the complexity of designing and developing adaptive collections. Indeed, it can use your hint to start, and (if determined profitable) revise when a sufficient profile of actual usage has been built. I thank you very much for putting that idea on the table.
I have no problem with multiple aliases for the same concept (after all, remember: I don't think pure concepts exist at all). AdaptiveCollection might be good for some purposes, wheras IndexedValueCollection might fit a precise definition of indexed, value and collection. -- .gz
I would suggest [why suggest? JustDoIt] that the majority of the content of this page be moved to AdaptiveCollection, and leave a description of IndexedValueCollection here. AdaptiveCollection is an appropriate all-encompassing term for the general concept discussed above, whereas IndexedValueCollection is a good term for one category of possible organisations within an AdaptiveCollection. I.e., given appropriate conditions, an AdaptiveCollection may decide to use an IndexedValueCollection, or it may decide not to use an IndexedValueCollection, as its internal structure. -- DaveVoorhis
That's OK with me. Just move forward as you see fit. -- .gz
See also: AreTablesGeneralPurposeStructures, ExtendedSetTheory