Classification Is Tough

A lot of problems in my domain (custom biz apps), in programming code classification, and in security classification seem to point to a common theme: classification is tough. Reasons for this include, but are not limited to:

Perspectives - The "proper" classification from one perspective may not work for another. Thus either a middle ground must be reached, or distinct classification systems/attributes need to exist that may have some conceptual overlap. For example, political geographical jurisdictions often don't match postal jurisdictions (such as zip-codes), which makes computing tax difficult.
Grokkability - The more flexible the classification system, the harder it is to navigate and comprehend. Trees are fairly easy to grok, but don't scale well. Sets are more powerful, but are difficult to visualize and navigate (or at least require a lot of training).
Cleaning - How does one keep the classifications clean without breaking things that work? Are there orphan or redundant classifications? How do we know? How do we motivate people and/or police them to keep them clean?
Tools - What kind of tools are available to manage classifications? Are they sufficient? Example tools include visual tree, graph, and set browsers.
Time - How do we coordinate historical classifications with current ones? For example, if a store moves to a different region, do we compare sales based on where the store was or where it is now? If the first, then it won't match historical reports, generating a lot of phone calls. And how do we track this time aspect? It can create a "perspective" problem, as described above, where different time slices may require independent classification systems. Classification systems are convoluted enough without a time dimension tossed in.
Fuzziness - How do we deal with "fuzzy" classifications? Should we rank associations?
Compiling versus interpreting - Can we use a "compiler" to use classifications for validation and integrity checking? Can we do this real-time, or does it require waiting? If the second, does it scale? What if much of the classification is handed off to power-users instead of programmers? Does this require a "language" change-over so that power-users have something more friendly than Java or SQL?
Overlap - Things will often fall into multiple categories, mucking up any attempt at a "clean tree".
Continuous - Often the degree of "fit" to some description or set of rules is not a Boolean answer.
Granularity changes - A granularity level or distinction that was a good fit for one situation may not be for another.
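
The store-moves-region question under "Time" above can be made concrete with an effective-dated lookup. This is only a minimal sketch in Python, not a recommended schema; the store number, the region names, the dates, and the helpers `region_on` and `current_region` are all made up for illustration.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history: each store maps to (effective_from, region) pairs,
# sorted by date. Store 17 moved from "North" to "West" on 2023-07-01.
REGION_HISTORY = {
    17: [(date(2020, 1, 1), "North"), (date(2023, 7, 1), "West")],
}

def region_on(store_id, day):
    """Return the region the store belonged to on `day`, or None if the
    day precedes the first recorded assignment."""
    history = REGION_HISTORY[store_id]
    starts = [start for start, _ in history]
    i = bisect_right(starts, day)
    return history[i - 1][1] if i else None

def current_region(store_id):
    """Return the most recent region assignment."""
    return REGION_HISTORY[store_id][-1][1]

sale_day = date(2022, 3, 15)
region_at_sale = region_on(17, sale_day)  # region at the time of the sale
region_today = current_region(17)         # region the store is in now
```

For a sale made before the move, the two lookups give different regions, so a report has to state which of the two it groups by. Keeping the history at least lets either question be answered.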

--top

I agree. ClassificationIsTough. Not only in your domain(s) but in any suitably complex domain. Kind of an IncompletenessTheorem. There are also results from the social sciences suggesting that you should never let a group determine classifications for items. At best, classification along simple measurable indicators (e.g. time, priority) may be used. Everything else inevitably takes up lots of time and yields no results.

Often the group has to determine classifications because they need them for their work. I generally try to let a power user of a particular group have a way to manage their own classifications for their group's needs. There still may be a master or default classification outside of specific departments, though. Almost any non-trivial business system is going to need a CMS (Classification Management System) of some sort. Trying to be a central classification cop that makes every department or user happy can be a daunting task. Thus there comes a time when it's easier to let them create their own as long as it doesn't clobber others'.

The beauty of set theory is that being a member of one set can be independent of membership in another set. (The sets themselves may need to be members of other sets, such as for indicating which department manages a given set.) The downside is that sets are difficult for most users to grok. A power-user or "business analyst" is hopefully available to help with that so that programmers are not spending time shuffling instances between sets all day. -t

Methinks you misuse the phrase set theory, which focuses far more upon structure, description, and properties of sets than membership. A proper rephrasing would be "The beauty of sets is that being a member [...]".

As far as being 'difficult to grok', well, while I disfavor inventing statistics about what 'most' users have trouble with, I cannot name anyone who has had difficulty distinguishing 'red things' from 'pencils' or grokking that some things are 'red pencils' and fall into both sets.

To clarify what I meant, it's not so much the concept of sets, it's using them in production for non-trivial classification, such as big store catalogs (products). Many users just seem to get confused because sets are not something they are used to, at least not nearly as much as trees. Generally I will only implement them directly if a PowerUser will be managing them. A generic clerk has a fairly high probability of getting confused and making "band-aid sets" instead of keeping the set model clean. I'm not saying "don't do it", but rather make sure the staff is qualified and ready first.

Rather than difficulty grokking sets, I suspect there would be more trouble with using them to drive code. If you're supposed to perform behavior X when you see a red thing, and behavior Y when you see a pencil, what are you going to do when you see a red pencil? In general, X and Y won't even be compatible behaviors... i.e. 'X' might be 'turn left', and 'Y' might be 'turn right'.
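
The red-pencil conflict can be made concrete. In this Python sketch, each rule pairs a membership predicate with an action, and the dispatcher refuses to guess when the matching rules disagree. The "turn left" and "turn right" actions follow the example above; the item layout, `action_for`, and `AmbiguousClassification` are hypothetical names.

```python
# Each rule pairs a set (as a predicate over an item) with an action.
RULES = [
    ("red things", lambda item: item.get("color") == "red", "turn left"),
    ("pencils", lambda item: item.get("kind") == "pencil", "turn right"),
]

class AmbiguousClassification(Exception):
    """Raised when overlapping sets demand different actions."""

def action_for(item):
    """Return the single action for `item`, None if no rule matches,
    or raise if the matching rules disagree."""
    matches = [(name, action) for name, test, action in RULES if test(item)]
    actions = {action for _, action in matches}
    if len(actions) > 1:
        names = ", ".join(name for name, _ in matches)
        raise AmbiguousClassification(f"{names} disagree for {item}")
    return actions.pop() if actions else None
```

A red ball or a blue pencil gets one action; a red pencil raises, because nothing in the rules says which set should win.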

Specialization of behavior based on flexible classifications is also a problem:

Consider that predicates - functions that return true|false for their argument, usually without any SideEffects - are a powerful vehicle to express arbitrary sets (including infinite sets, and even non-enumerable infinite sets - there are many infinities in set theory).

 ; A set is represented by its membership predicate.
 (define (member? X Set) (Set X))
 (define (union A B) (lambda (X) (or (member? X A) (member? X B))))
 (define (setdiff A B) (lambda (X) (and (member? X A) (not (member? X B)))))
 (define (list->set L) (lambda (X) (if (member X L) #t #f)))
 ;...
 (define Naturals (lambda (X) (and (integer? X) (>= X 0))))
 (define MyFavoriteNumbers (list->set '(6 7 12 42 108)))
 ; Asking whether this set contains itself never terminates.
 (define RussellParadox (lambda (X) (not (member? X X))))

By RicesTheorem, we know that we cannot, in general, know whether one set expressed by such a flexible mechanism is a subset of another. That is, it is impossible to write a 'subset?' function that is both correct and guaranteed to terminate for every pair of such sets. While a sufficiently smart human may be able to tell that (subset? MyFavoriteNumbers Naturals) is 'true' by destructuring the above definitions, even that relatively trivial case is a challenge to automate.
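
To illustrate the limitation, here is a sketch in Python rather than Scheme. With sets as opaque predicates, about the best a program can do is try sample values: a counterexample refutes a subset claim, but passing every sample proves nothing about untested values. The helper `refute_subset`, the `evens` set, and the sample range are hypothetical and chosen only for illustration.

```python
def list_to_set(values):
    """Build a predicate-style set from an explicit list."""
    return lambda x: x in values

naturals = lambda x: isinstance(x, int) and x >= 0
evens = lambda x: isinstance(x, int) and x % 2 == 0
my_favorite_numbers = list_to_set([6, 7, 12, 42, 108])

def refute_subset(a, b, samples):
    """Return an element of `samples` that is in `a` but not in `b`,
    or None if no sampled value refutes the claim that a is a subset of b.
    None does NOT prove the subset relation: only the samples were checked."""
    for x in samples:
        if a(x) and not b(x):
            return x
    return None

samples = range(-200, 201)
```

Here `refute_subset(my_favorite_numbers, evens, samples)` finds 7, so that claim is refuted. But `refute_subset(naturals, lambda x: x <= 200, samples)` returns None even though 201 is a natural number above 200: the samples simply never reached it.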

If one cannot automate a 'subset?' function, then one cannot, in general, automatically compose two or more modules (with their new rules and classifications) in a manner that respects dispatch to the 'most-specialized' code. This adds to the problem mentioned earlier where incompatible rules might apply in the event of overlap. In general, a human will need to be involved every time two or more modules are brought together. This modularity issue has undermined the promise of such powerful classification-driven-behavior mechanisms as PredicateDispatching.
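
One workaround is to have a human declare which classification refines which, and to treat any overlap without such a declaration as an error. The following is a Python sketch under assumed names (`Rule`, `refines`, and `dispatch` are hypothetical, not any particular library's API), and it shows where the human ends up in the loop rather than removing the need for one.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rule:
    name: str
    test: Callable[[dict], bool]
    action: str
    # Names of rules this one overrides; declared by hand, never inferred.
    refines: set = field(default_factory=set)

def dispatch(rules, item):
    """Return the action of the one matching rule that no other matching
    rule claims to refine, or None if nothing matches. Raise LookupError
    if the declarations leave zero or several such rules."""
    matching = [r for r in rules if r.test(item)]
    if not matching:
        return None
    overridden = set().union(*(r.refines for r in matching))
    winners = [r for r in matching if r.name not in overridden]
    if len(winners) != 1:
        names = [r.name for r in matching]
        raise LookupError(f"no unique most-specific rule among {names}")
    return winners[0].action
```

Combining a "red things" module with a "pencils" module raises for a red pencil until someone adds a rule that explicitly refines both, which is exactly the manual step described above.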

That said, the flexibility of rules and code driven by domain-data and domain-classifications can be very nice. Perusing some examples of rules-driven programming might be of interest; many may be found at the InformLanguage websites - consider http://inform7.com/learn/eg/bronze/source_2.html or http://inform7.com/learn/eg/bronze/source_28.html . Inform uses MultipleDispatch (restricted up to ternary verbs - the highest used in English) and a SingleInheritance taxonomy for the domain (simulation/game) elements - so it isn't as flexible as TopMind might favor, but remains impressive. But MultipleDispatch is probably the most powerful of practical (modular, high-performance) mechanisms for classification-driven behavior available today.

CategoryClassification, CategoryAbstraction