Designing Better Java Collections

If you were asked to design a better collections API for Java what would you do? Assume your API doesn't have to be backwards compatible but should be extensible. At the very least it should have lists (With OrderedCollection, lists are unnecessary), sets, maps. But the rest is up to you.


How about this as a starting point (I've supplied the classes, the methods are published and easily added):

 Object
Collection
Bag
KeyedCollection?
Dictionary
LookupTable?
 IdentityDictionary?
SequenceableCollection?
AdditiveSequenceableCollection?
OrderedCollection
SortedCollection?
ArrayedCollection?
Array
ByteArray?
String
 Symbol
Interval
Set
IdentitySet?
-- a SmugSmalltalkWeenie


Unfortunately, the class hierarchy above doesn't cut it. In fact, no class hierarchy will fix Java's collections, because the problem isn't at the Java library level (level 3 of the 4 "Java"'s), but the Java language level (level 2).

When you create a list of Strings, you want the function to access the list to give you an instance of type String; if you create a list of InputStreams, you expect an instance of InputStream. However, the Java language has no way to express "List of Foo", so, no matter what you do, you have to litter the code with casts -- you have to tell the compiler what your collection's objects' types are. And to think that strong typing is one of Java's selling points.

So, to answer the question -- I would add generics to the Java language. Then I could use type names like "List(String)" or "Map(String, InputStream)" and the compiler could do the Right Thing.

-- BillTrost

As the SmugSmalltalkWeenie who offered the above starting point, I assume that of course the language will be changed to avoid the problems BillTrost mentions. After all, each object contained in the list knows what it is, so list queries will just answer whatever object they find. I find the idea that a CollectionOfStreams? is a different class (with different behavior!) from a CollectionOfStrings? to be utterly repugnant. I care about the difference between an OrderedCollection and a SortedCollection?. If I care about what goes in the collections, I'll express that in the code that does the insertion.

Well, the rest of the world (the not SmugSmalltalkWeenie part) looks at Collection<Stream> and Collection<String> not as different classes, but as different types. Type is something that we make use of to better specify/enforce/communicate interfaces, contracts, design decisions, etc. It's also cool to use types for programming the difference when programming the commonality doesn't suffice.

Oh, and I forgot to mention that typically for the rest of the world it so happens that people who specify and implement an interface are typically not the same as people who make use of the interface. So if you care what goes into a collection, you don't search all over the place to see what people put in there. You just declare it's proper type. In some cool languages if you really care what goes into a collection, you may choose not to do anything , it's just done for you. Ain't TypefulProgramming beautiful ?


So we can't implement better Java collections till 1.5 comes out. Fine. What design decisions can we make? For instance which items in the list above would be interfaces, abstract classes and concrete classes? What design patterns would you use? I must say that it does seem like something of a copout to simply say "use the smalltalk collections api" as it implies that the software engineering community hasn't learned anything new about collections in the last 20 years. And the API design suggested above seems to be missing Stacks, Queues and Trees. Or would people just compose these from the above classes?--AdewaleOshineye

The API is specified in Redbook Smalltalk -- I figured I'd let someone else type it in, but I'll be happy to if people desire. Stacks and Queues are built with OrderedCollection. Is there a difference at the API level between a Dictionary and Tree? -- TomStambaugh (SmugSmalltalkWeenie)

[The Redbook smalltalk documents aren't available online. Besides they were only a proposed standard. Instead take a look at the actual ANSI standard. That's not available online either (stop me if you sense a pattern) but the final draft version is available from ftp://www.smalltalksystems.com/sts-pub/x3j20/ although only the .doc format works in OpenOffice.]

According to its documentation, IBM Smalltalk conforms to the ANSI/Redbook collection protocol. I'll type it in when I have a chance. -- TomStambaugh

For general dictionaries, you're not guaranteed that the implementations can conveniently respond to range queries. So typically Dictionaries don't provide range queries, and range queries are often times important.


I would strongly question the wisdom of designing the API for a statically typed language using a dynamically typed language as the basis. And saying change the language only gets you so far. Unless we're saying that the only good collections API is the smalltalk collections API. Which would be rather sad.

I'm saying that the collection protocol of Smalltalk, especially in combination with the Stream protocol of Smalltalk (which works on any SequenceableCollection), is a better starting point than the current Java monstrosity. I don't argue that it should be cast in stone, I'm not opposed to modifying it as needed to reflect the Java type system, I'm not opposed to extending the API as needed.

I am saying that it is straightforward to move to Java and that the result yields better Java code and more a more productive environment. A few other things are at least helpful:

-- TomStambaugh


Collections differ at several dimensions (rough cut):

The Smalltalk inheritance hierarchy artificially meshes all these different concerns into a single hierarchy, along with implementation issues. IMHO, Java's approach to separate collection interfaces and implementations is much nicer in principle, but it's not carried far enough.

Ideally, I would like to choose among a large and extensible set of collections, with each collection implementing an interface for each dimension. E.g. some given collection XyzCollection? could implement the interfaces Mutable, Sorted, Tree, Unicates, DynamicSize? and Keyed.

Of course, without generics, currently all is lost for Java collections. ;)

-- FalkBruegmann

The above suggests a feature selection kind of arrangment instead of a tree, something like:

  cf = new collectionFeatures();
  cf.mutable = true;
  cf.ordered = true;
  cf.flat = true;
  cf.duplicates = false;
  cf.fixedSize = false;
  cf.keyType = new collectionFeatures.key;
  cf.etc....
  myCollection = new collection(cf);  

However, personally I think this is trying to reinvent the (lite-duty?) database and I would rather go with a database-like API so that one can scale into a database if wimpy array-like things are outgrown (GreencoddsTenthRuleOfProgramming). In other words, follow ContinuityPrinciple. All non-trivial roads lead to a database IMO. I see no logical reason why lite collections protocols should not gracefully scale into heavy collections (databases) without an API overhaul. (But it then perhaps leads to database paradigm fights and the ObjectRelationalPsychologicalMismatch issue and the head-scratching issue of why OODBMS are failing in the market. Something needs to give one way or another, guys)

Not everything scales, and not everything that does scale scales into databases.

Well, I think it can for the most part. I see no natural barriers. Note that the implementation at the low end does not have to be the same implementation as the high end for this to work. It is the interface that is the issue, not the implementation.


I'd propose replacing flat/tree with flat/recursive to allow us to have graphs as well. OK, changed it.

Then we'd probably want indexed/keyed instead of simple/keyed. OK, added indexed, changed "simple" to "no lookup".

And by the way what's the difference between orderd and sorted? Ordered means that if you put elements into the collection in a certain order they will stay in that order. Unordered means, they may or may not stay in that order, order is not significant. Sorted means you provide a sorting criterion, and whatever order you put the elements into the collection, they will assume the order of that sorting. --fb

Dynamic size should allow for restrictions on the location of modifications. (i.e. you can quickly insert or delete: anywhere in a linked list; at the front and back for a circular deque; just at the back for a resizeable array)


The C#/.Net FCL seems closer to the design suggested above, albeit without generics.


Take a look at http://g.oswego.edu/dl/classes/collections/index.html for DougLea's design and implementaion of a collections API. It reminded me of the API designed by TimothyBudd? in his book ClassicDataStructuresInJava?.


Assuming we use this taxonomy of interfaces:

then we construct common data structures as concrete classes that implement particular combinations of interfaces: Add in a library of predicates and iteration protocols and we're in business. Of course we end up having to forgo performance and deal only with Objects but java 1.5 will provide BoxingConversions so at least it will be simple to use these collections in a generic fashion.


Of course, only some sorted collections preserve order when A==B. And who says a Set can't allow lookups. Of course, then you might call it a Hash. And what about partially sorted collections? --BenAveling

It's those kind of awkward questions (should there be a parallel hierarchy of mutable and immutable collections or are they just state changes?) that have led me to repudiate a lot of the thinking above. It's just not possible to come up with a sensible collection hierarchy through this kind of Platonic abstract thinking. Instead I'm kicking around the idea of building up a set of useful collections (like Stack, Queue) by composing other collections and ignoring the problem of slicing things up into the perfect hierarchy. Take a look at http://citeseer.nj.nec.com/black02applying.html for a similar refactoring of the smalltalk collections API using traits that eliminates large amounts of duplication.

Fundamentally the collections hierarchy problem is not single-rooted and can't be solved through inheritance alone. Some combination of (multiple?) inheritance, dynamic composition (whether through traits, aspects or something else involving the type of a collection changing at runtime) and abstraction from concrete implementations is necessary. Trying to solve it using a base Collections interface and the LSP as your guidelines keeps leading to contradictions (i.e what do a a MutableList? and an ImmutableList have in common? does it make sense to call an ImmutableStack? (where you can only peek()) a Stack?).--AdewaleOshineye

In addition to dynamic size, I think 'insertable' would be another category. For example, LinkedLists?, Maps and Set are naturally insertable, where an Array is not. An array may grow in size, but it is very inefficient to insert a value. -- Mike Austin


See: CollectionHierarchies


CategoryJava


EditText of this page (last edited December 15, 2007) or FindPage with title or text search