# Data Space

Common use: the (physical) space made accessible for or occupied by data. E.g. the local stack, local parameters in an object, the data section of an executable file, the heap.

The word doesn't appear to have a formal meaning, but certainly has gained a practical one.

Back in my day, we used to call it "memory." -- DaveVoorhis

In a 2005 paper by AlonHalevy?, http://www.cs.washington.edu/homes/alon/files/dataspacesDec05.pdf, 'dataspace' is defined as a new data management metaphor to describe the result of linking disparate databases. Similar in sense to the PrivateJargon? term below, though without the concept of 'distance' and with an 'enterprise search' engineering flavour.

AlonHalevy? now works at Google, on 'crawling the DeepWeb'.

I've been using this word as a bit of PrivateJargon? created for a useful (but still immature) concept for which I could not find another word. This concept is explained as follows:

Points in any cartesian product of N dimensions can be viewed in an N-dimensional space. A 'DataSpace' extends this notion from numerical points (floats, integers, etc.) into any sort of data that can be mapped into a space through DataSpace operators (such as relational operators). This can be done for even one dimension. For example, strings can be mapped into lexicographical order (or any other arbitrary order, as desired) and they'll be on a one-dimensional DataSpace.

The 'space' in DataSpace refers to the fact that there is some concept of 'nearness' and 'regions'. It is always possible to define some sort of distance function between points, or at least a relative distance function between points in a particular finite set. DataSpace can be arbitrarily 'warped' in the sense that you aren't stuck with the default distance measurement between points. (E.g. in N-dimensional Cartesian space, one is stuck with the (x1-y1)^2+(x2-y2)^2+...+(xn-yn)^2=D^2 distance. However, there is nothing restricting a user from arbitrarily mapping points in a Cartesian space over a different space... e.g. an N+1-dimensional toroid or sphere. This simply requires providing a different distance functions.)

By the presence of a distance function between points, it is possible to identify clusters of points that sit near each other, or to identify nearest points to another when performing either calculations over aggregates or data extractions. Both of these are useful capabilities in certain domains (learning systems, data mining, etc).

In a sense, the 'set' concept is a binary form of a DataSpace; a set describes a region, for which any given object is either all the way in or all the way out. Intersections and Unions are operators over the space that add and remove regions from consideration. DataSpaces?, however, may be a bit fuzzier... allowing distance functions other than '0' vs. 'infinity (oo)'; every point in a DataSpace diagram may have a conceptual location within the region. This is what allows concepts of 'clustering' and 'nearness' with regards to the entire set of data points returned. Set-based models, such as Relational, aren't capable of expressing this concept, meaning the task of performing clustering and such must be passed back to the application (which would need to receive and process all the points in the current set, rather than just the cluster(s) of interest).

Indexes upon data are fundamentally accessors to regions of a DataSpace based on a statically chosen distance measurement. E.g. an index on the first four letters of a string in a tuple would map to a region in a DataSpace containing all tuples from the dataset that contain strings starting with those same first four letters. They are useful for accessing commonly used regions of DataSpace under a common 'warping' by a commonly used distance measurement, especially when there is more data than may otherwise be readily accessed. (Without indexing, grabbing values of a particular region will take O(N) time for N data points.) Points that 'move around' (due to mutations upon data) must, of course, be re-indexed. In the broadest DataSpace concept, the sorts of indexing performed by a Database would depend on the sorts of requests made of it... the access patterns, so to speak. Fixed or required indexes should still be allowed for engineering purposes, of course (to optimize for cases that are rare but need a lot of speed), but there'd be no restrictions on the other sorts of indexing or access patterns. As an example of removed restrictions, consider that an RDBMS will index primarily by Tablename->Identifier->Tuple(s). In the broader sense, perfectly useful indexes would include Identifier->Tablenames, Identifier->Tuple(s) (across all tables), etc. and these indexes could be specified or developed automatically based on the prior requests made of the Database.

Of course, the Query language and its underlying DataModel are capable of limiting the sorts of requests one may express, both in the sense of making certain requests a pain to express (impedence) and making certain requests impossible to express (infinite impedence). For example, the RelationalModel doesn't allow any requests of Dataspace that don't start with 'Tablename'. That's a pretty fundamental limitation; it essentially breaks what is conceptually a single Dataspace into several disparate, disconnected spaces (one for each table) between which the distance is always infinite. This restricts the sort of 'arbitrary distance functions' one may express because the distance between table regions is fixed. (An explicit 'join' between two spaces is allowed to create a new Dataspace with some properties of each, which is essentially how closed-space DataManipulation occurs in Relational; that's an inference pattern.) Since the MetaData doesn't get a Tablename, it is usually entirely inaccessible... though some RDBMSs are flexible on that point. The RelationalModel requires that you must use a fixed Tablename, restricting you to FirstOrderLogic... preventing, for example, general use of a tablename pulled from a table. This limits distance-functions for any singular query to those that are entirely defined outside the universe of data (rejecting the useful possibility of using distance functions that flex based on the internal data), though that may be remedied at least partially by composition of queries and certain uses of aggregate-calculation queries. Additionally, relational query languages don't allow recursion or fixed-points directly, which prevents certain navigational distance functions (i.e. rather than finding all reachable points in a subgraph, any single query, even the compositional ones, can only find a fixed number of reachable points in a subgraph).

• Aside: One should note that most of these can be remedied by pulling the necessary behavior back into the calling program... which, generally being turing complete, possess far fewer of these restrictions.

By observing these sorts of limitations and restrictions on the queries and manipulations one can perform upon the conceptual 'universal' DataSpace associated with a Database, one can identify the flexibility of its DataModel, query language, and manipulation language. It is possible to gather at least one measure the 'power' inherent to a DML/DDL.
• Aside: At some fundamental level, there must be a so-called 'model' of data that corresponds directly to what is physically and philosophically possible regarding the use of data. This would be a natural definition of data, because it neither imposes any restrictions nor implies more to the nature of data than is actually there. Any other DataModel or definition can only ever restrict what is possible or imply as possible the impossible. There are probably some fundamental limitations on exactly how the DataSpace may be worked, no matter how 'arbitrary' you allow the functions to be, based simply on the fact that there exist restrictions on both the nature of functions and the nature of computation. Even so, one will quickly find that most DataModels severely restrict any direct manipulation of the DataSpace, and even those DMLs and DataModels that are TuringComplete in their operations over DataSpace can make certain things difficult to express or perform efficiently.

If I were to posit a type for a DataSpace, it would be:

```   type dataspace         = tuple of (set of <datum>,<distance_function>)
type distance_function = datum -> datum -> distance  constrained by commutative, triangle_inequality

```
Datum is described under WhatIsData.

An interesting idea, but I'm curious about some things:

• How does the RelationalModel's concept of "distance," which you define as only being 0 and infinity, represent a limitation? The RelationalModel allows us to provably and optimisably deduce new facts from existing facts. See, for example, DatabaseIsRepresenterOfFacts. I'm not clear how distance applies to that. What would be a practical application of overcoming this limitation? Your concepts of "nearness" and "regions" remind me of locality-based optimisations used in some DBMSes, in order to improve the performance of retrievals, but these are internal and have no bearing on the phrasing of queries, only their execution time. (See, for example, http://www.google.co.uk/search?hl=en&q=locality+database+optimisation) It also reminds me a bit of spatial databases or spatial extensions to SQL DBMSes, used for phrasing queries based on distances between points, typically for geographical or topological purposes. (See, for example, http://postgis.refractions.net/)
• Is your notion of 'best N' results distinct from, say, TutorialDee's notion of a quota query or various SqlLanguage implementations of the equivalent, which return a ranked relation of at most a specified cardinality?
• You've said a DataSpace is useful for identifying the flexibility of DMLs. How would you quantify and rank, or otherwise distinguish, one DML from another using the notion of a DataSpace?
-- DaveVoorhis

I readily agree that RelationalModel allows us to deduce new facts from existing facts, as per DatabaseIsRepresenterOfFacts. Long story short, however, is that the RelationalModel makes it difficult to derive certain classes of facts from other facts - in particular those related to clustering mechanisms of any sort (Bayesian clustering, DataMining for expert systems, grabbing the 'nearest' points to some 'concept', etc.) RelationalModel is especially adept at handling boolean queries. If one is working in terms of 'distance', booleans allow for two positions: 0 and 1. Any particular query is located in exactly one of two positions, as is all clustering. Similarly, pieces of information are divided among tables in a rather strict manner - they can be combined into views, but there is no way to easily work with a 'union' of all facts as though they were in a single table - no way to perform a single 'query' that can return results from all tables. In a metaphorical sense, it is as though each relational table is at infinite distance from every other one; even joining them just creates a new such table.

(Description of WhatIsData moved to said page in pursuit of conceptual clarity. Description of DataManipulation extraneous and removed/reorganized on appropriate page.)

Use of a 'DataSpace' would allow arbitrary choice of 'distance' concepts as part of the query. If talking in terms of KillerUserInterface supporting queries, zooming, and progressive disclosure, this essentially allows the definition of a layout function and 'origin' position as part of the query. This allows one to browse facts of all sorts, a bit like a mind-map. It could even be used for geographic layout, were the appropriate coordinate information associated with facts. However, outside of the browser, one also gains clustering for any number of other purposes.

Locality-based optimizations would be one intrinsic application of such a system, as would be temporal optimizations and queries for regions across other dimensions. But the main thing a DataSpace does differently than Relational is to purge the 'relvars' distinction. Individual facts are still represented as tuple-values with some sort of semantic header component, but there is no particular distinction between them. I.e. 'space' definitions could cluster all 'facts' that just happen to contain the word "Dave" regardless of where said word appears. Specifying a join is certainly still possible, but can equally be performed on table-name or any other feature. The queries possible within a DataSpace are a greater subset of those available in Relational. The cost is paid in the greater complexities of optimization, constraint description, and constraint enforcement.

The purpose of the DataSpace concept is more to aide certain forms of less structured DataMining and fact browsing than it is to support high-optimization massive fact processing. If you were looking at a target market, DataSpace would be aimed at ArtificialIntelligence and automated learning systems that need to identify arbitrary patterns in data and later be able to subscribe to changes that manipulate these patterns. It is very, very important to automatic development of KnowLedge that the DataSpace not be artificially partitioned. E.g. the sounds made by the birds and cats in the morning might not seem to us like they have much relation to the weather, and thus be in an entirely separate table if we were in control of dividing it, but that doesn't mean there isn't an empirically observable pattern there to be found by a free-wheeling learning system. Similarly, a 'scholar' agent might be performing arbitrary semantic compression across the DataSpace, continuously increasing the abstraction of the data and 'forgetting' stuff that can be reproduced with some degree of reliability (lossy compression) in order to maintain operations within finite computational space. To do this in any highly automatic fashion, it is important that the system be free to create either what are essentially new tables at whim, or that the system simply have one big 'space' to dump any and all data - the DataSpace opts for the latter.

I have no doubt that the relational model will serve better (more optimally) in cases where data is to remain highly structured and consistent. Implementations of the relational model can make more assumptions and thus make more optimizations. However, they are bound by these assumptions even as they make them. DataSpace aims at freedom from assumptions in order to remove artificial assumptions from learning and clustering systems. That said, there is nothing keeping one from implementing relational model inside an implementation of DataSpace or treating a relational database as a partial data-repository for a DataSpace.

In summary: the lack of a distance concept is not a limitation that causes problems across all domains. However, it does cause difficulty in many of which I am familiar, such as genetic programming, bayesian clustering, and learning systems. In these cases, one ends up using a RelationalDatabase as little more than an inflexible PersistenceLayer that fights most direct queries and searches for pattern-identifications purposes. In such cases, the database really isn't a 'Representer Of Facts', but rather a 'Mere Container Of Facts'.

Regarding the 'best N' results: A DataSpace query would require a distance-function and could return the 'best N results' from any and all 'tables' or even functions over multiple tables (variant table-names). Joins are, reasonably, a bit more difficult to specify - essentially one is dealing with tuples for which one component is 'predID' (a tablename or autogen ID for AI-produced predicates), but most joins would be specified much as they are in DataLog explicit predIDs. A relational query would be limited to a single table or join, a binary selector function, and an ordering function (which could be arbitrarily complex outside the context of SQL).

Regarding the classification of DMLs, it's worth noting that a DataSpace would strictly encompass just about any other DataModel. Any DML that could handle a DataSpace would likely be able to handle every other DataModel as well, but vice versa is not true. For example, SQL or even DataLog cannot readily handle higher-order logic whereby one joins with tables named in another table and where the table isn't even specified - the table is one of the return columns based on joining of other features. These things simply wouldn't be uncommon requirements in any DML designed for a DataSpace. As far as how they would be 'ranked' - they wouldn't be; however, they could be categorized by which features they offer and which they do not, as well as how well they optimize and whether they are declarative vs. imperative, consequent-based vs. mechanism-based.

A good DataSpace would also need to keep all meta-data in the DataSpace, such as semantic meanings of predicates produced automatically by AI, and the attributes and value-types associated with the tuples of any particular predID (DataSpace does not mean type free-for-all; it simply drops all tuples into one space.) This is pretty much necessary for generic DataMining.

A good DataSpace would also need to keep all meta-data in the DataSpace

This I agree with, in principle. I'm not convinced that using variables to hold global relations (or tables) either without allowing, or without easily supporting meta-operations on collections of those variables and their types (these collections are also relations, of course) is the best approach. However, this is the typical approach taken by implementations of the relational model, implementations of systems inspired by the relational model (e.g. SQL), and suggestions for implementations of the relational model. Operations on catalogues and similar meta-data are often, at best, awkward. That said, I'm also not convinced that alternatives can be implemented without unpleasant complexity.

(From WikiWithMoreThanPages)

My main motivation for exploring massive data-parallelism is not raw speed but a sense that a) data representation is just naturally parallel and we should not introduce artificial sequentialism or it will break the simplicity, b) most things we're trying to do in programming are just not that hard and life really should be a lot more pleasant than it seems it actually is right now. GUI programming in particular seems like it hurts far, far more than the simplicity of the underlying ideas needs to, and c) what we want in an age of the Web is massive democratization and massive data remix across multiple data sources and services; that means eliminating the difference between 'personal data storage' and 'personal applications/desktops', eliminating the difference between 'writing a program' and 'installing a program', and building something more like 'personal data dashboards' that receive, mix, blend and apply data and functions drawn equally from all over a globally distributed, locally cached utility Web, and that a reasonably interested 'user/programmer' could safely create without messing up the rest of the world, much like HTML enabled the creation of the first Web.

Of course, the hard part is trying to grasp the right abstractions for the first building blocks, and that's where I'm unstuck at the moment. I know I need some concept of a 'hierarchical namespace' (or dictionary) and a fairly simple 'publish, subscribe, compute' framework and a basic language/syntax, and it should be rather counterintuitively both *hardcore* pure-functional and *not* strictly typed at all (which rules out Haskell), but getting all the bits together to make the simplest possible working model still eludes me.

(It seems like RDF *ought* to be able to do this, but RDF is missing a lot of fundamental pieces: primarily an actual programming *language* with data-parallel semantics that can itself be expressed in RDF.)

If anyone else has any idea of just what this paradigm is, because there must surely be a name for it, I'd love to know - it would save me a lot of effort if someone else has already built it.

-- Nate Cull

EditText of this page (last edited October 22, 2012) or FindPage with title or text search