Collection Oriented Verbs

This is to eventually replace the topic DatabaseVerbs. I realized the behavioral patterns are not just about databases, but how collections are handled. A databases are one kind of system that uses CollectionOrientedProgramming. Collection-oriented programming tends to factor typical and common collection-oriented operations into a standard interface or operations. One perhaps can go outside of these operations and reinvent or redefine their own, but COP tends to encourage one to stick with a standard, both for optimization and consistency reasons. -- top

Sets and relations (relational operators slightly different than pure set to yield tuple closure, see note below)

union
difference
CartesianProduct
dot product
intersection (can be emulated with "A minus (A minus B)"
- See also: SetTheory

Relational/record operators

restrict (select rows; SQL "where X = Y"; Unix "grep" command)
project (select columns; SQL "select a, b, c"; Unix "cut", "sed" commands)
join
divide
- Original Codd
- Date's small divide/great divide
semijoin
semidifference
extend
rename
aggregate operators
- count
- sum
- avg
- min
- max
- all
- any

Ordered sequences

sort forward/reverse by (attribute list)
unsort
permute to arrangement #N in base factorial
analyze permutation yielding #N in base factorial
eliminate duplicate elements/allow only unique (Unix "uniq" command)
eliminate unique elements/allow only duplicates (Unix "uniq -d")
difference (Unix "(cat A; sort A B | uniq -d) | sort | uniq")

AplLanguage array operators and LispLanguage list/tree operators would be usefully listed here as well.

...just to help kickstart things. -- DougMerritt

A particular class of collections that often receives special attention are matrices, and often the more specialized vectors/arrays. Common collection processing on such structures is to look for patterns that exist across fields that are related in position... e.g. looking for 'edges' in a picture, connected regions with a given color, searching for substrings in an array of characters, etc.

The above CollectionOrientedVerbs are relational in nature, but they don't really support the sort of cluster-oriented searches necessary for quickly matching substrings or looking for regions of a common color. I.e. you could represent an 'array' as a relation of 'position' to 'value', but try writing an algorithm to search for arbitrary-length substrings in an array represented in this manner... it is doable, but won't be convenient.

Does anyone have ideas on which 'CollectionOrientedVerbs' would be ideal for such a purpose? (It would be nice to achieve them without sacrificing the relation abstraction.)

Do you mean efficiency or notational difficulties? Relational does not dictate implementation, I would note. Existing products may be slow at X but that does not necessarily mean that relational is inherently slow at X. It may be possible to tune a relational engine performance for a specific domain. This is after all what domain-specific collection engines, such as matrix languages, are doing. Related issues are mentioned in DoesRelationalRequireTypes. -- top

If it was just notation, I'd leave other people out of it... problems of syntax won't be solved here.

I'm wondering which verbs are suitable for clustering, edge searches in matrices, region descriptions, etc. I imagine some would include: rotations, dot products, translations, cross products, largest clique, scales, region updates and operations (e.g. raise the '.altitude' field in this region some function of a distance from the 'center' of the region), etc. A few more operations were mentioned above relevant to 'ordered sequences', but even matrices can be reordered (consider a 2D or 3D spreadsheet, ordering by columns or rows or in a depth field).

Your appeal to 'tune for specific domain' is, in essence, an appeal to simply have a bunch of 'specialized' collection classes with specialized verbs tuned for specialized domains. For context, I'm thinking about GeneralPurposeProgrammingLanguage, and would rather avoid having a ton of different specialized collections to individual domains... and asking programmers to do the tuning themselves effectively requires programmers reinvent the relation in application. So the important questions are: Should a GeneralPurposeProgrammingLanguage have a specialized collection type for matrices (of which sequences are a specific example) with a set of matrix-oriented verbs? or should it abstract matrices as sets containing relational data and find a way to generalize the matrix-oriented verbs for any relation, then be designed to automatically recognize certain optimizations for representing and implementing matrices? I'm not willing to depend on a SufficientlySmartCompiler, so the latter idea - recognizing when optimizations should be performed - should be doable with a minimum of programmer/expert system annotation and a relatively dumb compiler.

So the questions on this page are: what would be an appropriate set of verbs for matrix operations? how would they be defined for general relations? how would a relatively dumb optimizer recognize that a relation/set is being used as a matrix or collection and optimize appropriately (to achieve similar algorithmic performance, at least asymptotically)?

How would a 'rotate' be defined for relations in general? what about distances around a point? or matrix cross products? Are those even necessary? Perhaps some 'higher level' abstract function generalizes over these sorts of operations in a manner that can be extended to both relations and matrices in the same manner that 'fold' generalizes over map, case, and many other functions.

I don't use matrices in my daily work, so I am rusty on them and thus cannot comment on the best way to define matrix-oriented operations/verbs. But in general relational (as we currently know it) shines best when there's a wide variety of activities on collections. If most of the operations are matrix operations or spacial (as in 3D) operations or some special collection domain, then relational probably won't do as well as a dedicated set of domain operations.

I agree about the comments regarding relational. Thing is, one should suspect that something like matrices would also be used for relational operations (if only 5-10% of the time), and typically relational programming may find some use for matrix operations (e.g. to perform clustering and Bayesian filtering, or to aide in the automated learning algorithms of expert systems). So I'm resisting specialization. I'd rather get more verbs and figure out how to define them consistently for both what are 'typically' specialized collections AND for other relations, even if I can only figure out how to optimize them for plain-old-matrices and such. Anyhow, I'm certain you'd be all for this goal. What I really need are ideas on appropriate verbs and how to define (perhaps in terms of higher level verbs) what are traditionally matrix or vector verbs to work on relations... which is why I'm on a page titled 'CollectionOrientedVerbs'.

Maintaining guarantee of termination and support for data-parallel operation are both very useful, to be achieved if possible.

The topic DoesRelationalRequireTypes talks a bit about minimum requirements for qualifying as "relational" such that domain-specific or app-specific transformations can be defined without "busting" relational. Formalizing such may make a nice academic project. Even if some don't agree with the starting frameworks, they can serve as starting points for something tested and improved via cycles of peer review.

In reference to tuple closure above:

This is the minor thing (I don't mean that sarcastically -- it's important, but only a small change) that distinguishes relational operators from ordinary set operators. ChrisDate AnIntroductionToDataBaseSystems? 7th ed. sec. 3.2 pg 60: "the result of each of the three operations is another table (in other words, the operators are indeed "operators that derive tables from tables," as previously stated). This is the closure property of relational systems, and it is very important. Basically, because the output of any operation is the same kind of object as the input -- they are all tables -- so the output from one operation can become input to another. Thus it is possible, e.g., to take a projection of a join, or a join of two restrictions, or a restriction of a projection, etc., etc. In other words, it is possible to write nested expressions -- i.e., expressions in which the operands themselves are represented by general expressions, instead of just by simple table names. This fact in turn has numerous important consequences, as we will see later (both in this chapter and in many subsequent ones)." [italics in original.] -- Doug

I would note that nested expressions and "table names" can be the same thing. The "table" can be a virtual table that is equivalent to an expression, but merely a syntactic convenience to avoid dealing with a Dagwood Sandwich (ThickBreadSmell) on paper. SMEQL (TqlRoadmap) takes this approach. - top

He further clarifies in chapter 6 in discussing the RelationalAlgebra operators. P157, under Union: "although that result [of a union of two different kinds of tuples] is a set, it is not a relation; relations cannot contain a mixture of different kinds of tuples, they must be "tuple-homogenous". And of course, we do want the result to be a relation, because we want to preserve the closure property. Therefore, the union in the relational algebra is not the usual mathematical union; rather, it is a special kind of union, in which we require the two input relations to be of the same type -- meaning, for example, that they both contain supplier tuples, or both part tuples, but not a mixture of the two. If the two relations are of the same (relation type), then we can take their union, and the result will also be a relation of the same (relation) type; in other words, the closure property will be preserved. [Footnote]: Historically, most of the database literature (earlier editions of this book included) used the term union-compatibility to refer to the notion that the two relations must be of the same type. This term is not very apt, however, for a variety of reasons, the most obvious of which is that the notion does not apply just to union." [emphasis in original] -- Doug

I think that sets and relations can be unified. A 'set' is a relation with one (implicit) column, or a 'relation' is a 'set' with just one type that happens to be a specific record type. Even in a language without static typing, one could probably get away with implicit restriction based on failed pattern matches. I guess that achieving reasonable performance is a critical feature. A 'join' in general terms would take two collections and a function of two arguments to either produce or reject a third argument... but ensuring that the function can readily be analyzed to take advantage of indexing might require weakening the language or working with indirect query objects.

Anyhow, tuple closure could be type-enforced where necessary, and ignored otherwise. Implicit restriction in PatternMatching would, ideally, diminish the need for it.

  { 
    point(x:1 y:2)
    point(x:2 y:3)
    red
    blue
    green
  } where \n => n.x > 1
  returns 
  { point(x:2 y:3) }

See CollectionOrientedProgramming CollectionsArentOo

CategoryCollections