Bag Discussion

'Bag' is another word for MultiSet, i.e. a set which can contain more than one instance of a given thing.

In Smalltalk, a Bag is implemented as an unordered collection of keyed objects, likely using some kind of hash. It holds any number of objects, in arbitrary order, but optimised for looking them up according to some key value, and allows multiple objects that share a single key. The typical operations on a bag are "insert this object", "remove this object", and "give me a list of all the objects with this key".

A Java Map/HashMap is not a bag. A Map in java maps one key to one value.

RdfLanguage also uses "bag" to refer to an unordered collection.

It's actually difficult to implement a true bag. Apps usually need a way to reference dynamically-allocated things/objects, which requires a unique identifier. In practice this usually means it's really either a map or a list. Such an API will usually do one of 3 things:

Require the API user to supply the key, in which case it's a map
Give a random or random-looking key back to the API user, in which case it's a map. (For example, memory address pointer.)
Give a sequential number, in which case it's a list.

If it's a write-only data structure, then perhaps it can be a true map (from API user's perspective), but such is rarely useful. The closest I've seen is logging systems that simply record events. But even these usually have some order to them, such as line number in a file, or record ID in a database. Most databases either have an internal record ID (extracted with a special function or dummy column name), and/or preserve order (list). Plus, it's usually useful to create an auto-number index key to serve as a reference number. (Although in a few cases, I simply had to use existing tables and had no DBA privileges to add an auto-number. See BagNeedScenarios.)

--top

Your argument leads to a conclusion that "true bags are not useful for most applications" as opposed to your thesis that "true bags are difficult to implement". I agree. Bags are mostly useless, in practice.

The two issues are related because one usually needs a way to identify something to "retrieve" it. Perhaps I need to reword some of it.

Efficient retrieval of a specific element is not part of the `bag` abstraction. You can always remove elements one at a time until you find the one you're looking for.

"Bag" doesn't define efficiency since it's an interface mechanism and not an implementation. If you mean one may have to do a "sequential" extract of many elements and examine the values to find a given target, then I'd call that "inconvenient" for the interface user; which is another way of saying "usually not practical".
Bags are usually not practical. Even you advocate coupling each element with a unique identifier, such that you REALLY have a SET of (unique-id,element) pairs. Your mistake is calling the result of your tweaks a `bag`.
What I advocate is perhaps more comparable to "lists", actually. "Bags" may be a misnomer in some contexts. "Unique" can be situational, HalfAssKeys.

I think your concept of how "most databases" are implemented is not well tested against actual database implementations.

The one's that I'm familiar with either have an extractable internal "record ID" of some sort, and/or preserve order for non-keyed tables. They behave more like lists.

Which databases are you familiar with? To what extent do you use sharding and related features?

No, I have not tested their inherent ordering characteristics under sharding.

See also the Dictionary of Algorithms and Data Structures: http://www.nist.gov/dads/HTML/bag.html --ChristofferHammarstrom

See OrderedBag

CategoryDiscussion