Collection Hierarchies

This is a design exercise for Java...

Introduction

It is generally accepted that the collection hierarchy problem has never been definitively solved for object-oriented languages, especially for those object-oriented languages like Java that supports both primitive types and classes.

The solution that probably comes closest to being definitive is the SmallTalk collection hierarchy. SmallTalk provides a wide variety of data structures (both abstract and foundational) along with a consistent idiom for iterating over collection elements. Each collection class has a set of HigherOrderFunctions used to apply blocks to their contents.

However, I am of the opinion that even the design SmallTalk Collection Hierarchy is flawed in several areas. For example, the Collection Hierarchy classifies a Linked Lists as a kind-of integer indexed sequence (i.e. by subclassing SequenceableCollection or the older IndexedCollection). This makes it difficult to create generic queues or stacks that can adapt the SequencableCollection protocol without having to consider the implementation details of the concrete class that is implementing that protocol. There is a rule being broken here somewhere. After all, if I have to consider a concrete-implementation when using an abstract interface, something is wrong in the classification scheme. A client cannot assume that the random-access provided by the Sequencable protocol will behave the same for a list as it does for an array.

Worse yet, even if the SmallTalk solution were perfect, it would still not be an appropriate solution for Java. After all, SmallTalk has a dynamic type system - everything is an object. In Java, we need to consider collections that are keyed from primitive types such as integers as well as first-class objects. This immediately complicates an optimal solution. The getAt, addAt, removeAt, or indexOf methods of a Java array type will most naturally deal in the primitive int type and not the Integer class.

The STL provides a great solution for GenericProgramming. Unfortunately, it is not an OO solution. A GenericProgramming solution has little use for establishing common interfaces using inheritance. Instead, it can use genericity with ParametricPolymorphism to enable common interfaces using different types. This is why an STL approach in Java (such as the JavaGenericLibrary or JavaAlgorithmLibrary) is usually unwieldy, overly complex, un-extensible, and inefficient. Unfortunately, even if Java did support parametric polymorphism, most of the proposals on the table cannot use both primitives and classes as template types.

This exercise need not result in a good Collection Hierarchy design. Maybe the exercise will discover that the object-oriented design possibilities for Java are compromised by its type-system. Hmmm.. I guess we already know this. However, my guess is that it will at least give us some hard evidence or make more concrete the reasons why this sort of straddle the fence between dynamic and strong type-system has weaknesses in an object-oriented language. If nothing else, it should show the complexity that Java's support for both primitive values and objects adds to a data structures foundation.


The Exercise or Your challenge if you choose to accept it...

Design an extensible class hierarchy for a set of foundational data structures and abstract data types that is suitable for languages like Java.

You don't need show all the interface members - just enough to demonstrate how the class tree and its interfaces work. You won't need to implement any of the interface members; just declare them. Furthermore, we don't care about names - this isn't an exercise about class naming. What I am interested most in is the inheritance and aggregation relationships and their impact on interface based programming. For example, what happens if a doubly-linked list is the subclass of an integer indexed sequence? In the Java collection classes, you can use a List interface. However, this spells bottleneck if you then write the following code based only on the interface:

  void some_func( List list, int at, String spec )
  {
      if ( list.get( at ).toString().endsWith( spec ) )
         list.remove( at );
  }

You see, while LinkedList and ArrayList share the common List interface, they actually end up requiring a different programming approach to solve various problems. To me, this is a blatant design flaw. The common interface deceives the programmer into thinking he is dealing with the same abstract when, in fact, a list and an array are completely different abstractions with disparate behavior.

Requirements:

The three requirements are to (1) create an extensible hierarchy with basic as well as abstract data types, (2) allow for intrusive as well as non-intrusive collections, and (3) assume HigherOrderFunctions for iteration rather than decoupled GangOfFour or STL-style iterators.

The first requirement suggests the need to represent both foundational data structures (i.e. lists or arrays) and abstract data types (i.e. sets, stacks, or maps). Some people consider that classes in the second category areadapters for the classes in the first category. This results in a collection hierarchy that uses aggregation as well as inheritance. Others decide this distinction is not important and put the class from both categories into the same hierarchy. Which approach you take is up to you, as long as there are examples of each and new data structures can be added using the existing interfaces.

The second requirement in its most simple form assumes we will have a traditional object-oriented linked-list where the node is an object-wrapper class that is hidden in the implementation of the list as well as an intrusive list that exposes the node interface and requires that elements subclass this type. Here is an example of each:

 class ItemList
   public void add( Object el )
   {
      Snode node = new Snode( el );

node.setNext( m_end.getNext() ); m_end.setNext( node ); }

class SnodeList public void add( Object el ) { Snode node = (Snode)el;

node.setNext( m_end.getNext() ); m_end.setNext( node ); }

These two classes are the minimum one can do to satisfy the second requirement. Of course, it would be nice if the collection hierarchy included (or at least allowed for) intrusive and non-intrusive lists that are singly-linked as well as doubly-linked. Also note that you may want to consider the Item List an abstract data type and the Node List the foundational data structure it is implemented with:

  class ItemList implements LinearCollection
  {
     private static class Link extends Snode
     {
        private Object m_item;

Link( Object item ) { m_item = item; } }

private SnodeList m_rep = new SnodeList();

public void add( Object el ) { m_rep.add( new Link( el ) ); } }

In this scenario, the foundational structure like resizeable array, doubly-linked list, and singly-linked list need not even be part of the primary collection hierarchy.

The third requirement is for InternalIterators. While you do not have to show the implementation of the internal iterators, you can definitely not rely on ExternalIterators to abstract away the differences between the data structures. The rationale is that we want a natural collection hierarchy. We don't want big coupled classes that simply hide the details of each collection. Not only is this non-optimal, it also just isn't as clean as the internal iterator approach taken by SmallTalk (see: InternalizeExternalIterators). Whether it is true or not, this exercise assumes that external iterators are a sign of weakness in object-orientation (See: ftp://ftp.netcom.com/pub/hb/hbaker/Iterator.html).


Why a List is not an Array or Observing the true Nature of a Collection

Just as there are many DesignPatterns, each with different behavior and applications, so are there many data-structures and algorithms. A set is not a bag and an array is not a tree. Many of us will take the time to understand the difference between a BridgePattern and an AdapterPattern but know very little when it comes to the best uses of a list over an array. Object-oriented programmers are particularly weak in this regard and many never go beyond the use of an Array (or OrderedCollection). Many just don't grok the situations where a list is more appropriate than a vector. Why is this? Why do we celebrate the differences between DesignPatterns - and discuss the appropriate usage of each - but yet we are happy to hide away those same differences in data-structures and algorithms? Just as with DesignPatterns, it is in their differences that their strength and power can be found.

Most of us know that arrays provide random-access to its elements in constant-time. We also know other stuff about arrays, such as inserting an element requires that all elements trailing the insertion point be copied. Furthermore, we know that random-access in an linked-list performs only in linear-time. For example, if we want element N of a linked-list, we must iterate from the start of the list to N elements. This is the case for positional insertion, removal, or access. Consider the following:

  public Object at( final int nIndex )
  {
     return detect( new Block() {
        int nItem = 0;
        public boolean is( Object each ) {
           return ( nIndex == nItem++ ); } } );
  }

This method attempts to provide a member that allows its clients to believe that this linked-list has the same random access features as an array. But this is not the case. The truth is that random-access is not in the nature of a linked-list, at least not as it is for an array. Random-access in a list has different behavior than random-access in a linked-list. However, most collection hierarchy designers are happy to have both Lists and Arrays descend from the same interface that provides random-access. This fools a client into believing that they can simply use this interface without regard to what concrete class it is instantiated with.

Because a linked-list is a linear collection, we can access its elements by position as shown in the list implementation of #at. However, this is not that same as indexed-access. We have to make a list look like it is integer-indexed by writing adapter code - we adapt the nature of a list (linear links) to support random-access features. Despite this fact, both SmallTalk and Java subclass their array and list classes from an indexed collection protocol. This means if you are a programmer with only a reference to this indexed protocol, you cannot use it generically and expect the same behavior.

If a list isn't an indexed-collection, what is it? Well, it is a linear sequence. If it is a singly-linked list, it is a Forward Moving Sequence, and if doubly-linked it is a Bidirectionally Accessed Sequence. This allows a list's elements to be accessed positionally, but not indexed. In a list, we are not accessing the item #at n, we are accessing the 'nth element. I hope this difference is clear. Contrast the previous list implementation of #at with this implementation for an array:

  public Object at( int nIndex )
  {
     return m_contents[ nIndex ];
  }

In this example, the integer argument is really an element index. This is a natural, constant-time operation for an array type. There is no way to get around it - lists are linear, they are not random access. It is my opinion that providing a common interface whose concrete classes have wildly different behavior is a design flaw. Generalization does not mean we should trick a client into believing that at(int) on a list is behaviorally the same as at(int) on an array. The common interface for a list and array is not an indexed-collection (such as java.util.List) with members like addAt(int,Object), getAt(int), or even removeAt(int). Instead, it is a Sequence with members like addFront(Object), addBack(Object), getFirst(), getLast(), removeFront(), removeBack(), and various InternalIterators. It is simply an error to index a linked-list and expect it to have the same behavior as a indexing an array. We allow our clients to produce better code when we enable them without violating abstraction and generalization.

We should always keep this tension between natural and artificial members in mind when designing classes. Think about the behavior contract for each data structure - are we doing something that can mislead our clients? Try to create a difference in your head between abstracting or generalizing and obfuscating. When we design, we want to abstract, not obfuscate.

It is interesting to note that in generic programming, libraries like the StandardTemplateLibrary do not even provide a positional helper method for linked-lists. To access an element by position, the programmer has to iterate until the element at N is found. Only vectors, and other truly random-access collections, support the getAt(int) interface member. This results in preventing the programmer from making incorrect choices. The interface members do not betray the programmer.

One Failed Attempted

Now, the simplest way for me to think of all of this was to make Array, ItemList, NodeList, and even Map all subclasses of IndexedCollection - i.e. a kind of collection whose elements are accessed by some type of index. For Array, the index would be an Integer. For the (D|S)NodeList, the index is doubly-linked or singly-linked Node. For ItemList, the index was a node-type cast to an immutable Index. And finally, for Map the index type was the association's key'' value.

There are big problems to this approach for the JavaLanguage. First, object creation is very expensive. To make the Array use Integer objects instead of primitive int values incurs unnecessary overhead. Randomly accessing 1,000 elements would require the client create 1,000 objects!! Instead, we would like to use int indexes for the Array. Unfortunately, this breaks the class tree since Map and List want Object indexes. Fooey! Personally, I was not willing to hoist the totally unnecessary overhead of creating an object onto my clients for each time s/he needed to index an array element.

--RobertDiFalco


Discussion:

No, no, Robert - Tell us what you REALLY think! ;->

Just a random thought: Maybe the "intrusive list support" of an object would involve the object implementing an interface. Perhaps a convenient way to do this would for the "intruded upon" class to have an inner class "node" that inherits from some common base class and supports an interface for fetching next node, previous node, and data item. Hmmm... Offhand, this seems like an interesting mix of intrusive and non-intrusive implementation techniques; a class could participate in several lists at once, by having several inner classes. (I'm not making claims that this is all "right" - I'm just thinking out loud. ;-) -- JeffGrigg

Jeff, I may be missing the point, but to me that is the big difference between an intrusive list (nodelist) and a non-intrusive list (itemlist). The nodelist requires that the client's elements implement the node interface - i.e. that elements are nodes. The itemlist deals with nodes internally and allows items to be of any object type.

However, I probably placed too much importance on this side issue of intrusive versus non-intrusive data structures. The real design issue is how to best organize the fundamental collection types in a language like Java where you have typed methods and overloading, both Object and non-Object types, and no parametric polymorphism. I didn't mention the Eiffel Collection Hierarchy only because it seems the worst of the bunch. In SmallTalk, for example, I could ignore the differences between using a Node for the Index in Linked Lists, an Integer for the Index in growable Arrays, and any hashable object for the Index in Maps/Dictionaries because everything is an Object. I wouldn't make the same mistake that smalltalk did by assuming #at and #indexOf in the Sequenceable Collection protocol be an integer. Then, because List and Array are both Indexed Collections, I could make Stack, Deque, Queue, or Priority Queue simply adapters that take an Indexed Collection as a construction argument. Unfortunately, doing this in Java would kill performance since you would have to create an Object index for Arrays, or use an integer index (an nth lookup) for linked lists.

Let's consider the following. Say I create an Indexed Collection interface that used both Object indexes and integer indexes. Something like the following:

 interface Indexable extends Collection
 {
Object at( int index );
Object at( Index index );

Object remove( int index ); Object remove( Index index );

Object append( int index, Object item ); Object append( Index index, Object item );

int nthValue( Object item ); int nthReference( Object item );

Index indexOfValue( Object item ); Index indexOfReference( Object item );

...and so on... }

Then, say I did something like the following for linked list types:

public Object at( final int index )
{
 int count[] = {0};
 return detectFirst( new Block() {
public boolean is( Object each ) {
 return ( index == count[0]++ ); } } );

}

public Object at( Index index ) { return ((Node)index).getItem(); }

And this for integer indexed types like dynamic arrays:

public Object at( int index )
{
 return m_contents[ index ];
}

public Object at( Index index ) { return at( ((ArrayIndex?)index).intValue() ); }

Besides being VERY complex (two versions of everything for each indexed feature), it does work for the adapters stack, deque, or queue. Why? Well, the adapters would have to make an arbitrary decision - do we implement ourselves based on integer indices or object indices? If the latter, we have to create a lot of unnecessary objects for dynamic arrays. If the former, we end up exponentially decreasing the performance of lists since we have to do a look up for each integer indexed access. See the problem? Now, I know Wayne would tell me that I shouldn't be concerned with performance. However, the reason this wouldn't perform well is because the design is overly complex and requires clients to make arbitrary decisions - i.e. Do I use integers or objects when I have an Indexed Collection but do not know its concrete type?

-- RobertDiFalco


Questions:

  1. What is the semantic difference between a key, a node, and an index? What are their similarities?

  2. What is the semantic difference between a dictionary that contains associations, a map that puts and retrieves a value at a key, and a map that puts and retrieves a value at a value.hashCode? What are their similarities? Are there different classes here or are they all one class?

  3. In an O-O system, is it possible that a List or an Array actually implement a Stack interface instead of the traditional approach where a Stack is created using a List or an Array instance?


Robert - you have done an excellent job of defining the problem.

I have a feeling (but I don't do excellent job's, so I haven't given it enough thought) that there is a danger in mixing functional requirements with implementation approaches. If we consider only the functional requirements of the collections(keyed, ordered, access [fast|other] by key/position/rank, etc.), and leave unspoken any potential implementation strategy) we might get a different cut on the interface requirements and class relationships.

You're right. Do you think I should take out all the implementation strategy suggestions and analysis? I have no trouble doing that. Goodness no!


OK, my attempt at a solution. I would start defining a couple of interfaces that simulate Smalltalk block objects:

 public interface Predicate
 {
boolean isTrue(Object o);
 }

public interface Operation { void performOn(Object o); }

Then we have an interface implemented by unordered collections:

 public interface UnorderedCollection?
 {
void add(Object o);
void remove(Predicate p);

void iterate(Operation op); void iterateSome(Operation op, Predicate p); }

...and for ordered collections:

  public interface OrderedCollection extends UnorderedCollection?
  {
void addAtBegin(Object o);
void addAtEnd(Object o);
void addAfter(Object o, Predicate p);

void iterateForward(Operation o); void iterateBackward(Operation o); }

The class hierarchy would be so structured that all you typically use are the interfaces. This makes the hierarchy flexible. Perhaps the classes can even be hidden behind a Factory object.

-- StephanHouben

So, would you create an array that used objects for indices instead of int values? i.e. addAfter( new Integer(10), "foo" ) instead of addAfter( 10, "foo" )? If you support both, it is confusing to the user. For example - When should I use Integer instead of int. Can I use int indexes in the Linked List class? Why are there so duplicate methods? And so on... The more I think about this problem, the more I wonder whether there even is an elegant O-O solution for a language like Java. I think it raises interesting doubts about its hybrid type-system (i.e. both objects and non-objects). -- RobertDiFalco


It seems to me that the 'collection classes' in Java2 are very well-designed (very few flaws). See <http://java.sun.com/j2se/1.3/docs/api/java/util/package-summary.html>. The addition of Java Generics <http://www-ics.ee.ic.ac.uk/surp99/report/caj97/> should get around the SimpleTypesAreNotObjectsProblem?. -- KeithRay

Well, we will have to agree to disagree. I really, really dislike the Java Collections. While I think the genericity promised for 1.4 is great, it will not fix design-flaws. You will have to read this page to understand why I think the Java collection classes are so poorly designed. However, you need only benchmark them to see the effect of this incorrect design. There are many other issues I can get into besides what I mention here, but this page goes into the major ones. -- RobertDiFalco

Have you looked at TheEiffelLibrary?? It uses a mixture of multiple inheritance and genericity.

Yes, I have. Unfortunately, this wouldn't work for a language like Java that has both objects and non-objects. And, as far as I know, GJ doesn't allow one to mix both.


Most tree taxonomies turn out to be flawed or arbitrary in the long run. There are many orthogonal characteristics to "data structures". Trees tend not to work very well with orthogonal traits. You can make each level represent a trait, but that is stretching things IMO. For one, how do you determine which trait gets to be higher in the hierarchy? The importance of each trait tends to be relative to needs, not absolute.

There's nothing that requires an interface hierarchy to be a tree. In fact, in Java, the ArrayList class implements RandomAccess? to mark random accesses as fast even though its primary interface is List.

Another way to study data structures is perhaps to identify features rather than IS-A type classifications. For example, a relational view would be to consider a DictionaryDataStructure a two-column table with one column being a unique index. IOW, how many columns, how many keys, and the nature of those keys (unique, non-unique). It is more of a smorgy-style way to look at things: "A little of this, a dash of that".

Perhaps some people grok the world better if they can find a way to view everything as a tree. But not me.

Are you stating that you don't like class hierarchies in general? I don't think this exercise assumes an is-a or has-a preference. The different data structures can be modeled with inheritance, aggregation, decorators ala STL, and so on. Of course, I think some set of common interfaces are important if you ever want to return some hing other than a concrete class. Otherwise converting between data structure implementations can have a lot of repercussions throughout a code base.

How about an example from practice.

See DedicatedStructuresVersusRdbms.


What do people think of the Java version of Foundation Framework? The ObjectiveCee version was part of NextStep/OpenStep, and now is part of CocoaFramework. But the Java version is part of WebObjects, and includes collection classes like NSArray, NSDictionary, and so on.


A rather design is available at: as part of the book Data Structures and Algorithms with Object-Oriented Design Patterns.


See also: DesigningBetterJavaCollections


CategoryDesignIssues CategoryJava CategoryHierarchy


EditText of this page (last edited January 2, 2008) or FindPage with title or text search