What Is Data

It seems appropriate to begin with a definition. This is from the American Heritage Dictionary (through answers.com):

da·ta (dā'tə, dăt'ə, dä'tə) pl.n. (used with a sing. or pl. verb)

1. Factual information, especially information organized for analysis or used to reason or make decisions.
2. Computer Science. Numerical or other information represented in a form suitable for processing by computer.
3. Values derived from scientific experiments.
4. Plural of datum (sense 1).

[Latin, pl. of datum. See datum.] And a datum (from Latin) means "something given".

Operant Definitions and Representations for Datum

A typical computer science text will identify "label:value" or "label:(tupleVal1, tupleVal2, ..., tupleValN)" (or some other isomorphic structure) as a conceptual representation for a datum. The latter is heavily utilized within the RelationalModel and is, thus, associated with the most common and popular of databases -- the RelationalDatabase. The former representation is commonly associated with ObjectOriented programming, as the labels correspond to attribute names of the object.

I suspect it pre-dates the RelationalModel, but, alas, I have not been able to find a definitive reference. AnIntroductionToDatabaseSystems turns out not to be the source of the definition, so I'll have to re-visit some texts and examine their references. It might be found in Elmasri & Navathe's "Fundamentals of Database Systems" or FD Rolland's introductory "The Essence of Databases". If it's the latter, I should be able to easily find the original source by simply asking the author -- FD Rolland's office is next door to mine, and unlike many academics, he can usually be found in his office during office hours. BTW, the tuple is more accurately "label_t: (label1: tupleVal1, label2: tupleVal2, ..., labeln: tupleValN)". -- DV
Any isomorphic representation is the same. If you wish to use a mathematical record instead of a mathematical tuple, as you suggest, then that's fine, too.

Underlying this conceptual representation is a physical representation for the same data. In the physical representation, the label is often implicit (based, for example, on location in memory), and the value(s) are also represented in efficient formats for computation and retrieval.

While label:value and its variants are representations for data, they are not, technically, definitions for data. A particular label:value pair may simply be a value, possessing no extrinsic meaning and reflecting neither fact nor information about a world. As values, these representations are commonly used with records and tagged unions. To qualify as a datum, the presence of a particular label:value must also be (at least implicitly) interpreted as representing a fact about a world. E.g. when seen within an object, the label:value is representing a fact about that object -- that the attribute referenced by 'label' is described by 'value'. When seen within a relational database, label:(value1,value2) indicates that a predication concept referenced by 'label' is true when applied over 'value1' and 'value2' (or the entities these values reference) within some implicit world.

If I understand what you're illustrating, it may be more accurate to demonstrate your relationship between value1 and value2 as label_r:(label1:value1, label2:value2). As for a datum having to represent a fact about a world, I question this. Can one conceive of a "label: value" pair that does not represent a fact about a world? If so, by accepted uses of the term "datum," it would still be a datum even though it's a lie or random. If not, then the extension to the definition is arguably redundant. -- DV
Can one conceive of a "label: value" pair that does not represent a fact about a world? Sure. It's very easy. Simply make up a label, then make up a value, then stick 'em together. Examples: "snoobarble:(7,blab)". "axeltrough:(egz,xlagg)". "label:(value)". I can keep going. There are two conceptual problems with calling all "label:value" pairs data. The first is that, by calling all "label:value" pairs data, you have no qualifier for what isn't data. The very concept of a database would be pointless, since anything you can imagine (that fits 'label:value') is a valid datum... a valid piece of information, whether you are holding it in a database or not. The other problem is that, if the label means nothing, or means nothing when attached to the value, then you don't really have information/fact/etc. What does "dgzxves:(txshgfe,19193945)" mean to you? What does it mean about the world? No meaning -> Not Data.
Are you sure you're not conflating the generally accepted (albeit subtle) distinctions (at least within information systems) between the meanings of "data" and "information"? Obviously, "dgzxves:(txshgfe,19193945)" doesn't mean anything. It's not even data, because 'txshgfe' and '19193945' are unlabelled. "dgzxves:(sploog:txshgfe,ghhk:19193945)" at least tells us that 'txshgfe' is a sploog and '19193945' is a ghhk. In examining it closely, I suspect the datum is either encrypted, written in an alien language, or is random gibberish. Of course, because it conveys no information to me -- though maybe it does to someone -- I have no way of telling which of the three is true. Fortunately, it doesn't matter. It's at least of the form 'label:value' so it's data by definition, which means I can work with it. I can construct database schemas to generate databases that will house it, write programs to manipulate it, and so forth. I can generate random data with which to test my programs and schemas, too. Whether the data makes sense -- i.e. delivers information -- to me or anyone else is immaterial. -- DV
- In information systems, the basest form of data is the collection of measurements from very real sources (raw facts -- received:(source, measured_value, time)). People typically distinguish as information that which they possess after processing this raw data in some meaningful manner to produce useful facts. However, one agent's information is another agent's data. It isn't "raw" anymore... you might call it "cooked" data. But it's data. Data and Information are the same sort of thing; they're just at different stages of processing. Properly, if you were to categorize the output of a random label:value generator, the correct approach is to create data of the form: "received:{ value: (<label>,<value>), source: myRandomGenerator.output, time: <time>}". In this manner, your data consists of only facts.
- The generated label:value pairs, themselves, are not data. They are merely values from the domain (Label x Universal). This distinction is a rather important one. They have no more meaning than do the numbers from a random number generator. In particular, as values, each possesses an intrinsic meaning (just as the numbers represented by '7' and 'twenty-one' have intrinsic meanings), but they don't have any extrinsic meaning. Just like '7' and 'twenty-one' say nothing about the world, these label:value pairs say nothing about the world. They are of the same nature. '7' isn't data, and neither is 'whuvvlthump:gibberish'. They represent nothing about the world. They are not facts. They are not information. They are, by definition, not data.
- "of the form 'label:value' so it's data by definition, which means I can work with it" -- All label:value pairs are values. They are coinductive values, from the product of domains (Label x Universal). Being values, you can work with, perform calculations over, store them into collections and later retrieve them, index them for rapid access, sort them, organize them, and otherwise manipulate them. Computation and calculation are all about the manipulation of values. As these values also happen to bear the same structure with which you are used to representing data, you can structure and destructure them in the same manner as you would data. However, all that does not make them data. The "it's data, which means I can work with it" notion is something of a conceptual pitfall; a value doesn't need to be data for you to work with it.
- "Obviously, "dgzxves:(txshgfe,19193945)" doesn't mean anything." -- ;) And adding 'sploog' and 'ghhk' doesn't help. The value on the RHS of 'dgzxves' was once a simple tuple and, after your alteration, is now either a record or a tuple containing two label:value pairs. No meaning was added. Really, even the 'is a' predication relationship you're inflicting on those labels isn't intrinsically there; it only exists in your mind. You could call it a 'labelled with' relationship, though.
The problem here is really whether "random data" is a meaningful description or an oxymoron. I would argue that it's meaningful -- at least to the extent that it's a recognised descriptor for something that would otherwise require us to use a term like "gibberish." Unfortunately, saying that I've tested the new corporate DBMS with ten terabytes of gibberish doesn't inspire the same executive confidence as saying it's been trialed with ten terabytes of randomly-generated data. -- DV
- There is a word for "random data" in information systems. It's called "white noise". It generally informs you only that your measurement instrument is broken or disconnected.
- The use of randomly generated values in the place of real data is perfectly fine for performing tests. However, such things really aren't data. Or, if you do wish to consider it data, then you can call it "real data about a randomly generated world". That'd be perfectly fine, conceptually correct, and your executive might get a kick out of it. Of course, that does imply a certain level of consistency that isn't provided by truly random label:value pairs, but that consistency is likely desired anyway; it will give you far more realistic behavior out of table joins and such in the RelationalModel, and ensure that references to entities actually point somewhere meaningful in an ObjectOriented model.
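The point about consistency can be sketched in a few lines. The following is only an illustration, assuming a hypothetical two-table layout (customers and orders) and hypothetical field names that appear nowhere above; it generates pseudorandom rows in which every reference resolves, i.e. "real data about a randomly generated world".

```python
import random

def make_test_world(n_customers, n_orders, seed=0):
    """Generate pseudorandom rows whose references stay consistent.

    Table and field names are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)  # seeded, so the generated 'world' is reproducible
    customers = [
        {"customer_id": i, "name": f"name{rng.randrange(10**6):06d}"}
        for i in range(n_customers)
    ]
    orders = [
        {
            "order_id": j,
            # Pick an existing customer instead of an arbitrary integer,
            # so every reference points at a row that actually exists.
            "customer_id": rng.choice(customers)["customer_id"],
            "amount": rng.randint(1, 500),
        }
        for j in range(n_orders)
    ]
    return customers, orders

customers, orders = make_test_world(n_customers=5, n_orders=20)

# A join now yields exactly one result row per order.
by_id = {c["customer_id"]: c for c in customers}
joined = [(o["order_id"], by_id[o["customer_id"]]["name"]) for o in orders]
```

Had the customer_id of each order been drawn independently at random, most lookups in the join would fail, which is the unrealistic behavior described above.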

While these operant definitions of data have proven useful, they unfortunately don't cover the full concept of 'Data' or 'Datum' indicated by the English definition above. They are quite limited, for example, in the representation of complex facts or information (such as 'or' facts like "p:a or q:(b,c)").
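A minimal sketch of the relational reading described in this section, assuming a toy database that maps each label to a set of value tuples (the helper name holds and the sample rows are assumptions made for illustration). It shows that a pair is only read as a fact once something holds it, and why an 'or' fact does not fit the representation.

```python
# Conceptual representation: each label maps to a set of value tuples.
database = {
    "StolenCookies": {("David", 7)},
}

def holds(db, label, *values):
    """Relational reading: label(values) is taken as true iff the tuple is stored."""
    return tuple(values) in db.get(label, set())

# Inside the database, the pair is read as a fact about an implicit world.
fact_is_held = holds(database, "StolenCookies", "David", 7)

# The same pair outside any database is only a value; nothing reads it as a fact.
loose_pair = ("StolenCookies", ("David", 7))

# An 'or' fact such as "p:a or q:(b,c)" has no row to live in:
# storing p:a claims too much, storing q:(b,c) claims too much,
# and storing neither loses the fact entirely.
disjunction = ("or", ("p", ("a",)), ("q", ("b", "c")))
p_is_held = holds(database, "p", "a")
q_is_held = holds(database, "q", "b", "c")
```

Here fact_is_held is True while p_is_held and q_is_held are both False, even though the disjunction may well be known to be true.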

There is a word for "random data" in information systems. It's called "white noise".

"White noise" is certainly a familiar term in audio and signal processing, but in a quarter of a century working in IS both in industry and academia in three countries, I've only heard it used to refer to electrically-generated (or via other technical means) white noise used as a source of true random (as opposed to pseudorandom) numbers. The occasions where this is used are rare -- at least compared to the bulk of day-to-day issues in crunching data in ERP, CRM, patient records, inventory, etc., systems -- and then it's almost invariably described as being used to generate random data, mainly for encryption purposes or statistical sample selection. "Random data" in information systems is called "random data" by everyone I know, and by the vast majority of reference materials. Where have you worked in IS that it was called "white noise"?

-- DaveVoorhis

That statement is made with a bit of facetiousness on my part. In answer to your curiosity: white noise arises from truly random raw data sources... those that receive:(port, measured_signal, time). If it isn't completely random, you get something else... like pink noise. If you have white noise, then you have something that is both truly random and truly data. Very little else actually qualifies.

Of course, when a person such as yourself discusses "random data", you are generally talking about something that is neither truly random nor truly data, but rather something that bears a mere semblance of each of those properties. Phrases are, of course, free to hold different definitions than their component words. This is such a case. You say "random data", but you mean "a large set of pseudorandom values that bears a structural resemblance to data".

Actually, that didn't answer my curiosity. I am quite familiar with the distinctions between white noise, pink noise, et al. My curiosity was this:

Where have you worked in IS that it was called "white noise"?

I believe you've indirectly answered my question: I suspect you've never worked in IS, or in computing, except perhaps as a teaching assistant whilst doing your degree.

By the way, when I say "random data," I mean "random data," and everyone in the room knows exactly what I mean, too. Nobody cares whether the values are truly random or pseudo-random, or whether it's really data or something that merely structurally resembles it. These are irrelevancies, unrelated to the essence of the definition as it is generally used. You'll realise that when you get a job in that field.

-- DaveVoorhis

And when you say "foo", you mean "foo", and everyone else who uses "foo" to mean the same thing knows exactly what you mean, too. However, that doesn't mean you understand foo, or that you can explain foo... just that you can point at it. At this time, I'm quite thoroughly convinced that you really don't care about what data is, or what random is, or why random data is neither random nor data. As a person who has worked in IS for a quarter century, you've realized that such things are quite irrelevant to your particular career path.

    "What's in a hot dog?"  
       "I don't know.  I just pop'em in my mouth, chew thoroughly, and process the shits."   
    "WhatIsData?" 
       "I don't know.  I just organize by label, dump it in a table, and process the bits."

In all your quarter century of working, you've never sat down and asked yourself: so what, exactly, is this 'data' I'm working with? what is 'random'? what would qualify as 'random data'? After all, YouAintGonnaNeedIt applies to education, too. And I'm not being sarcastic; it really does apply. WhatIsData is entirely irrelevant to you. All a technician needs is a sufficient operant representation... doesn't even need to know why the representation works, or for what things it doesn't work, just that ItWorks.

WhatIsData is relevant only to those who are studying new data models, those doing original research, etc. I am among those people. I do not work in the technical (and business-oriented) field that has obtained the name "Information Systems", but I do a great deal of work with 'information systems' whereby I mean precisely that -- 'systems' that capture, carry, communicate, and process 'information'. And, within my area of study, there is a great deal of need for fine distinctions between "data" and "value", "random" and "pseudorandom", etc. Certain forms of reasoning and proofs of correctness require it.

Though such distinctions may be irrelevant to you, it's rather presumptuous to state that they are, therefore, irrelevant.

A Formal, multi-disciplinary Definition and Representation for Datum

In deference to those who seek answers first and understanding only if it is forced upon them, I'll present the conclusion first.

Conclusion: A datum is a proposition held to have a truth value in a world. Any datum can be represented by a tuple (proposition, truth value, world). As with label:value, above, the representation is not the definition; to constitute a datum, the tuple must be held, and must be subject to an implicit interpretation: the proposition is held with the given truth value in the given world.

When a datum is held with a collection of other datums, you have data.

Reasoning:

Some of the key phrases from the English definition of data are "fact / information", "given / derived", "represented / organized", "for processing / analysis / reasoning / decision-making". Those can help define the nature of data by noting what does and doesn't constitute data.

What is NOT data?

Values are not data. Values include numbers, names, abstractions, booleans, types, etc. Examples include '1', 'David', and '\x -> x+1'. Complex values can be described inductively or coinductively from less complex values. Values are concepts; they possess an intrinsic meaning, and the property of being immutable by any physical means. However, values cannot be data because values aren't facts or information; they possess no inherent extrinsic meaning. Values project nothing onto the world outside themselves.
- However, values are a fundamental component of data. They simply require a context that somehow projects their intrinsic meaning into an external system. For example, while neither 'David' nor '7' have extrinsic meaning, they gain such a meaning within the phrase 'StolenCookies(David,7)'. But even the value 'StolenCookies(David,7)' has no extrinsic meaning, except as part of yet another system.
- It is in the treatment of a value such as 'StolenCookies(David,7)' that it becomes 'data'. For example, when that phrase is projected into our world, it probably means "it's a fact that David stole 7 cookies" (which probably needs yet another implicit context, such as "from batch X").
- One should note that values that are not propositions cannot be projected onto an external world. No matter how you contort the concept, you can't project '7' onto a world without embedding it into a proposition. This has to do with the nature of propositions (below).
Propositions are not data. Propositions are just values that, themselves, possess one or more non-contradictory truth values when placed in the context of a world. There are an infinite number of legal propositions, and an infinite number of possible worlds. For a proposition to be data, it needs to be a proposition that is held to have a certain truth value within a particular world.
- Note that even true propositions (aka truths about a world) are not data. To be data, a proposition must be held, represented in some manner for processing / analysis / reasoning / decision-making.
- Note: these truth values may come from the broadest possible conception of logic. Some of those in common use across domains are:

      type truth_primitive = true         --(boolean logic)
                           | false        --(boolean logic)
                           | believed with <confidence> to be <truth_primitive> --(epistemic logic)
                           | necessary    --(modal logic)
                           | possible     --(modal logic)
                           | <probability> likely  --(bayesian logic)
                           | <percent> true        --(fuzzy logic)
                           | unknown      --(open-world inherent logic value)
                           | unknowable   --(theoretic logic (provably unknowable or undecidable))

Objects are not data. Objects are not facts.
- Objects are the only things that can physically hold other things, though. Thus objects are, like values, fundamental components to the concept of data. All data must be represented within objects.
Worlds are not data. Worlds are either physical or conceptual systems over which propositions apply.
Tuples are not data. While this follows from tuples being values, and values not being data, it's useful to realize that a TupleSpace is not, inherently, a DataSpace. A TupleSpace inflicts no meaning on the tuples.

However, in combining objects, propositions, and truth values, the different components of a single datum can come together. If a proposition is held to have a truth value within a particular world, then you have data about that world.

   type datum        = tuple of (<proposition>,<truth_primitive>,<world>)
                     -- implicitly projecting: <proposition> is held to have <truth_primitive> in <world>
   type proposition  = world->(true|false|unknown)
   type data         = finite_set of <datum>
   type database     = cell of <data> 
   type dbms         = service with  <accessors (possibly involving a DDL/DML)> 
                          performing <contract> 
                          over       <database>

A slightly broader approach would allow a set of multiple non-contradictory truth values, and a (potentially infinite) set of worlds. Those allow one to represent axioms, assumptions, etc. as data. However, this has been removed from the primary representation for the sake of clarity.
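The type sketch above can be made executable. What follows is one possible rendering, not the definitive one: it assumes a world can be modelled as a named set of known ground facts, restricts truth values to true, false and unknown, and introduces the hypothetical names Truth, World, stolen_cookies and agrees_with_world for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"

@dataclass(frozen=True)
class World:
    """Assumed model of a world: a name plus the ground facts known about it."""
    name: str
    facts: frozenset  # of (atom, bool) pairs; an absent atom is unknown

    def lookup(self, atom) -> Truth:
        for known_atom, value in self.facts:
            if known_atom == atom:
                return Truth.TRUE if value else Truth.FALSE
        return Truth.UNKNOWN

# type proposition = world -> (true | false | unknown)
Proposition = Callable[[World], Truth]

def stolen_cookies(who, count) -> Proposition:
    """Build the proposition StolenCookies(who, count)."""
    def proposition(world: World) -> Truth:
        return world.lookup(("StolenCookies", who, count))
    return proposition

@dataclass(frozen=True)
class Datum:
    """(proposition, truth value, world): the proposition is HELD to have
    that truth value in that world, whether or not the world agrees."""
    proposition: Proposition
    held_truth: Truth
    world: World

    def agrees_with_world(self) -> bool:
        return self.proposition(self.world) == self.held_truth

kitchen = World("kitchen", frozenset({(("StolenCookies", "David", 7), True)}))
held = Datum(stolen_cookies("David", 7), Truth.TRUE, kitchen)
mistaken = Datum(stolen_cookies("Alice", 2), Truth.TRUE, kitchen)

# type data = finite_set of <datum>
data = frozenset({held, mistaken})
```

Note that mistaken is still a datum: it is held to be true even though the kitchen world answers unknown for it. Being held, rather than being true, is what the definition requires.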

'World' is just a tad more complicated. However, I didn't start this page to explain the world. I started this page to explain 'data'. And I did. Hopefully you're feeling enlightened.

I'm not feeling enlightened but I do feel bored. Knowing some of the people on this site, I believe they do too. --CostinCozianu
;) No eureka moments for you, eh? Ah, well; if you don't actually need this definition for anything, then of course you won't feel epiphany. The question is whether you are bored because you read this page or you read this page because you are bored.
Reading this page reminds me of a particular Zappa song "... boring me to pieces". Rather than feeling an epiphany, I feel you're going nowhere really fast, and I'm bored contemplating the idea that somebody may have to bring this page back to some kind of reason while dealing with your opposition. BTW, did you concoct this grandiose scheme yourself or did you pick it up from some place ? A reference would spare the advised reader from having to deal with the deluge of words. --Costin
There is no scheme. The formal definition arises directly from the concept that is referenced from the English words. I don't find it boring at all; to me, the formalization of an informal definition is much like watching a flower slowly blossom. The flower goes nowhere, and it gets there really fast; the only change is that you can now look inside, and learn a great deal that you didn't know before. And, since you watched it open, you can prove it's the same flower you started with. However, I take no offense that you do find this boring. It's like watching a flower slowly blossom; not everyone finds that. . . exciting, and not everyone wants to look inside. Also, a reference never spares readers anything; it just means they need to go elsewhere for their baptism of knowledge.

(see also DataBase, DataManipulation, WhatIsWorld?).

Below is a rather long argument involving different understandings of Formally Correct definitions vs. Operant definitions. The most pertinent points have been refactored above.

(Moved from older discussion under DataSpace)

A value such as '1' has a conceptual meaning, but isn't a datum without a context... e.g. a source or other proposition, such as: StolenCookies(David,1) or ReceivedFrom(<portname>,1,<time>). Every datum, however, must be a value because all propositions are values. In addition to being a proposition, the concept of 'datum' has additional requirements; in particular, a datum isn't just any proposition, but rather one that you happen to have and hold to be true (at some level of truth... MultiValuedLogic fully allowed). Note also that both 'have' and 'hold to be true' are temporal concepts and may change over time. Ultimately, if a closed system doesn't have a proposition, then that proposition isn't a datum. Discussing the data you don't have is self-contradictory... it isn't data until you have it.

Aside1: I explicitly mention 'source' as a proposition here because, under KnowLedge theory, it is the most primitive of all possible propositions. "I received value V from port P at time T" reflects the nature of senses. There is no meaning to this sort of datum that is not inferred through a process of learning and conceptualization.
Aside2: Additionally, 'have' is a space-bound concept, and intrinsically related to computation. You must be able to somehow represent the proposition to have it or compute over it. However, I'd rather not get too involved in the discussion of fundamental computational limitations under this page. The important fallout from this is that, given that all computations must occur in finite space and time, all Databases must inherently be finite in nature because Data must be represented in finite space.

Given that data are propositions you hold to be true, and that a Database (by its most natural definition) is simply a collection of data, "DatabaseIsRepresenterOfFacts" is fundamentally true... though if one were to discuss the power of a particular database system, one could describe it in part by asking what sort of facts it is capable of collecting.

The data management definition I've been using for some years, which I believe is from AnIntroductionToDatabaseSystems (but I'll have to check), is simply this: A datum is a value with a label.

For example, the following are not data:

  07765227291
  Dave
  525i
  WhatIsData
  128 Gloucester Road

The following are data:

  MyPhoneNumber: 07765227291
  FirstName: Dave
  BMWModel: 525i
  WikiPage: WhatIsData
  Address: 128 Gloucester Road

The label may be implicit in some cases, but is always present and gives the value a human-understandable meaning. In this view, propositions or truth/non-truth interpretations of the datum are strictly at a higher level, and are not necessarily even present in (for example) pure record-keeping systems. Thus, I would argue that your definition of a datum -- assuming I'm using your syntax correctly -- should merely be:

   type datum        = tuple of (<label>, <value>)

-- DaveVoorhis

No. To be a datum, it must be a proposition that is held (by something or someone) to be true (or have some other truth value) in some world. Each of those nouns and verbs is very important, and shouldn't be trivialized as you are doing. However, most of these things can easily be implicit in a communication context. In the examples you listed above, the implication is that the propositions are held (by DaveVoorhis) to be (true) in (the real world at this time).

  VoorhisDatabase :: database = DaveVoorhis.brain
  DaveVoorhis.brain = cell containing {
         'PhoneNumber?(self,07765227291)' is 'true' in 'the real world at this time'
         'FirstName?(self,Dave)' is 'true' in 'the real world at this time'
         'FirstName?(thatPickyWithWordsGuy,David)' is 'true' in 'the real world at this time'
         'IsA(BMWModel, 525i)' is 'true' in 'the real world at this time'
         'IsA(WikiPage, WhatIsData)' is 'true' in 'the real world at this time'
         'IsA(WikiPage, WhatIsData)' is 'strongly believed to be false' in 'the real world of 2006 October 18'
         'IsA(Address, 128 Gloucester Road)' is 'true' in 'the real world at this time'        
     }

Technically, neither 'label:value' nor the isomorphic tuple of the same (tuple of (<label>,<value>)) is an actual proposition. However, again this comes down to contextual implications. Labels reference a conceptual predicate (an abstraction or function), while the values are implied to fulfill the predicate. It is incorrect to call these constructs propositions, but it is okay to allow them to stand in for propositions. In the general sense, predicates and propositions are just abstractions; they do not need names any more than lambda functions require them. (OTOH, labels do clean up communication a great deal when all parties in a communication understand the label. Shared labels form the basis of language. It is probably worth moving further discussion on that, however, to WhatIsLanguage?.)

The definition you provided (isomorphic to datum = label:value) doesn't correspond to the fundamental concept of 'data', but it is sufficient to help a new student in the field understand values vs. propositions. New students tend to already understand that "data is something you have", because they're not used to thinking and reasoning in terms of ValueSpace?s. They also tend to already understand the implicit held as true in this world because that's how they're used to thinking about facts. Unfortunately, they also unconsciously take these things for granted, and thus don't understand the importance.

Anyhow, a book like AnIntroductionToDatabaseSystems, or a course on the same, is not usually focused on the underlying philosophy. That's somewhat unfortunate, really, but rather practical. You should take all things you read (including what I write) with both a grain of salt and some level of respect for someone who has studied a subject. If you're looking to argue any of the points I described, a reference to a book as an authority on the subject doesn't do the job -- you need to present an actual argument, such as "I believe the <worlds> and <truth_values> requirement can be removed from your definition of 'datum' because....". However, I agree that it is perfectly fair to ask: "Can you please explain why 'tuple of (<label>,<value>)' is insufficient?"

I hope I have sufficiently answered that question.

To be a datum, it must be a proposition that is held (by something or someone) to be true (or have some other truth value) in some world.

That may be true in some strict, formal, or academic sense, but as the proposition is always true in the context of data, it can be taken as a given (and hence ignored) in any pragmatic sense -- especially established, familiar definitions of the term "datum." Hence my reference to AnIntroductionToDatabaseSystems, which I mention not as an argument from authority, but simply to point out the usual, accepted, understood meaning of the term. It also happens to be a meaning of "data" presented by the Oxford Dictionary of Computing (4th ed.), though the term "field" is used in place of label. These are familiar terms, with well-known meanings. Unless you're working in a field where your extended definition has some relevance (philosophy of computing?) I'm not convinced it adds anything of value.

In particular, I don't see it being helpful in the context of computer science, software engineering, data management, database systems, language design or related subjects, and I'm not convinced any "underlying philosophy" belongs in these. Furthermore, what you claim I've "trivialized" is precisely what should be simplified, in order to aid understanding and limit scope to relevant detail. Similarly, one could, for example, define an internal combustion engine in terms of ultimately deriving power from the sun, as do all non-nuclear earthly sources of motive energy, but it would have no engineering relevance.

If you can show me a computing context where there is relevance or importance in the notion of two data, defined as tuple of (<proposition>, <truth_values>, <worlds>) which are otherwise equal but differ in <proposition> and/or <worlds>, and which is not captured by tuple of (<label>, <value>)[1] then there may be substance to your definition -- assuming it doesn't wind up in some circularity that requires predicates to be defined in terms of data and data in terms of predicates. Otherwise, the definition strikes me as at best being pedantic, at worst potentially circular, and in either case without clear purpose.

That said, I'm always happy to be convinced of the need for more rigorous or complete definitions, but for now, colour me unconvinced.

-- DaveVoorhis

[1] Temporality, perhaps? But then it's (for example) tuple of (<label>, <value>, <time interval>), and the <time interval> may alternatively be represented in the <label>.

"(While) that may be true in some strict, formal, or academic sense, but as the proposition is always true in the context of data (...) I'm not convinced (your definition) adds anything of value." -dv

For a person who likes Relational, which gains much of its value largely from its strict, formal, and academic origins, you're showing an awful lot of resistance to accepting the possibility that a strict, formal, and academic definition of data might provide value. In truth, my goal in writing this page is not to convince you that this definition has any value to someone who is, for example, writing or using a database. I have one goal -- to answer 'WhatIsData' in a manner that is formal, strict, and (very importantly) correct.

An argument against a correct definition for a merely convenient (but conceptually incorrect) one really belongs on another page (possibly UsefulDefinitionsOfData?). That said, the definition I provided will not cause any computational difficulties (e.g. slowdown, indexing problems, etc.) where data is used for exactly the same purpose it is currently used. Implementation and underlying representation don't change the nature of things. The <label>:<value> approach you described, for example, isn't data but is a means of representing data. Its type is still data( <proposition>, <truth_values>, <worlds>), and a correct translation back to such is possible by simply filling in all the necessary values (proposition is from (label,value,context in which label was used), use truth_value = held to be 'true', use world = 'real world at this time'). <label>:<value> is not fully correct because it can't represent every sort of datum, but it works as a representation in some circumstances.
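
The translation described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the class name Datum, the function from_label_value, and the string defaults are hypothetical choices for this sketch, not part of any established library or of the formal definition itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datum:
    """Hypothetical rendering of data(<proposition>, <truth_values>, <worlds>)."""
    proposition: tuple  # here: (label, value, context in which the label was used)
    truth_value: str
    worlds: str


def from_label_value(label, value, context="unspecified context"):
    """Fill in the implicit parts of a <label>:<value> representation.

    Defaults follow the text: held to be 'true', in the real world at this time.
    """
    return Datum(
        proposition=(label, value, context),
        truth_value="true",
        worlds="real world at this time",
    )


# Hypothetical usage: a bare pair becomes a fully qualified datum.
d = from_label_value("age", 42, context="employee record")
```

Going the other way (from Datum back to a bare pair) discards the truth value and the worlds, which is why the pair works as a representation only in some circumstances.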

Anyhow, I'll be glad to answer your questions. I imagine this is a situation where you need to run into walls inflicted by a weak definition of data prior to appreciating a more powerful definition. You've probably never run into such a wall in the applications you've worked upon.

Alright...

In response to: "... but as the proposition is always true in the context of data, it can be taken as a given (and hence ignored) in any pragmatic sense". -- That's both a rather large assumption and a premature optimization. It is true that you can always re-construct the proposition to read with the same truth_value, and use an implicit 'held to be true'. However, the semantics change a bit.

     ActionOutcome(coinflip,heads) is believed with 90% confidence to be 50% likely in all worlds
     Believed(90%,Probability(50%,ActionOutcome(coinflip,heads))) is held to be true in all worlds

In the first instance, the proposition asked of the world is contingent on the performance of a coin flip (though an 'unknown' may be returned in the interim). In the latter, the proposition on the world is about one's own belief. These are different things. (In general, transforming these things will become... conceptually ugly... when dealing with Modal and Epistemic logics. The nature of a proposition, and the manner in which one believes it... or explicitly disbelieves it... should be kept separate.) Also, always true causes difficulty in an open world, as the black-and-white boolean set logic doesn't work in such a situation. (see OpenWorldAssumption, ThreeValuedLogic, MultiValuedLogic).

Also, you really can't both toss this and switch to using label:value for data... or you end up with constructs like:

     believed:(90%,probability:(50%,actionOutcome:(coinflip,heads))) is true in this world at this time.

(How do you plan to represent this in Relational? I suppose: actionOutcome(coinflip,heads,50%) could be connected, and the 90% confidence in it could be separated... that'd be reasonable.)

In response to: "... the usual, accepted, understood meaning of the term. It also happens to be a meaning of 'data' presented by the Oxford Dictionary of Computing (4th ed.), though the term 'field' is used in place of label. These are familiar terms, with well-known meanings." -- regardless, you won't find a better definition of the term than the one I outlined (and gave reasons for) above. 'field:value' can represent data, but isn't data. Attempting to use it for data gives you weaker reasoning powers than you'd otherwise possess.

In response to: "Unless you're working in a field where your extended definition has some relevance..." -- That's easy to answer. This extended definition is of great relevance to the study of KnowLedge. A proper ArtificialIntelligence must be able to gain KnowLedge (including data and information) in a manner that is not heavily prescribed or limited by the model in which that data is stored. A weakened definition of data limits the sorts of things such a machine can learn. To remedy this, it's best to simply find the broadest possible definition of data, and figure out how to query and manipulate that in addition to how to index and utilize it efficiently. My own study of KnowLedge is driven by both a desire to develop ExpertSystems for such things as code optimization and theorem-proof speedups and a desire to create a 'Storyteller' system -- something of a reverse ExpertSystem? (a SynthesisEngine? rather than an AnalysisEngine?) that can aid humans in story-production by filling in the small details where the human doesn't wish to. A SynthesisEngine? must be able to reason across multiple worlds, projecting forward in the process of making decisions. If you wish to know more of my motivations, my name is a link to my home page.

In response to: "Furthermore, what you claim I've 'trivialized' is precisely what should be simplified, in order to aid understanding and limit scope to relevant detail." -- There may be ways of simplifying the definition I provided without reducing its power or correctness, but you haven't provided it. Determining which details of the definition aren't relevant is a domain-specific thing. Sometimes all you care to know about an internal combustion engine is that you feed it gasoline and it runs your car. However, if that is your definition of 'internal combustion engine', then you've trapped your mind. You can't even begin to conceive of an internal combustion engine that runs on, say, ethanol or gunpowder. In your simplification of details, you've severely limited your own conception of what data can be. You can't write a database that reasons about multiple worlds and the commonalities between them. You can't effectively operate with the more complex logics outside of the black-and-white Boolean logics you happen to be familiar with. You can't reason with complex facts, such as 'A or B'. You'd find it difficult to operate with raw propositions (0-ary predicates). Etc. And, worse, you won't even think about trying these. You won't even know you are missing them... not until you start banging your head against self-imposed walls and start thinking sideways.

It's better to start with a more powerful definition and make an engineering decision to trade power for whatever you need (simplicity, computational efficiency, etc.). If you can, you should always start with the most powerful definition possible and limit from there... that way you know you're not losing possibilities before you even get started. E.g. if you don't want to deal with complex propositions (e.g. 'P(a) or Q(b)' is true in this world at this time) then you can limit yourself to single propositions (just 'P(a)', for example) and deal with the fact that you've lost some ability to perform complex reasoning. At that point, if you're also willing to eliminate nameless predicates (including runtime-constructed predicates), then you could reasonably describe propositions as being <label>:<value>. You lose a great deal of ability to represent and produce complex facts in the database, but you've gained some computational convenience and conceptual simplicity.
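
A rough Python sketch of that narrowing step, under assumptions: the types Atom and Or and the function to_label_value are hypothetical names invented for this illustration, and only named single-argument predicates and disjunction are modelled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    """A single proposition built from a named predicate, e.g. P(a)."""
    predicate: str
    argument: object


@dataclass(frozen=True)
class Or:
    """A complex proposition, e.g. 'P(a) or Q(b)'."""
    left: "Prop"
    right: "Prop"


Prop = Union[Atom, Or]


def to_label_value(prop):
    """Return (label, value) for a single named proposition, else None.

    Complex propositions have no <label>:<value> form in this sketch;
    that loss is the trade-off being described.
    """
    if isinstance(prop, Atom):
        return (prop.predicate, prop.argument)
    return None
```

Here to_label_value(Atom("P", "a")) yields the pair ("P", "a"), while the disjunction Or(Atom("P", "a"), Atom("Q", "b")) yields None: the restricted representation simply cannot hold it.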

And if you did this for a domain that didn't need this extra power, it would be cool by me. However, doing it before you even look at the domain is a mistake. Even worse is the effect of teaching a bad definition to scores of young engineers, some of whom will, undoubtedly, run into walls imposed by their education because they need more than their educator did. The argument that the "underlying philosophy" does not belong in computer science and software engineering is a statement born of both ignorance and arrogance (and should be a subject of yet another page).

In response to: "assuming it doesn't wind up in some circularity that requires predicates to be defined in terms of data and data in terms of predicates" -- no risk of that. Propositions don't even need Predicates, but they may certainly utilize them.

With regard to reasoning about worlds -- I'd like to direct you to study ModalLogic?. I don't wish to attempt to explain it to a person new to the concept because I still remember the headaches it gave me. Temporal logics do constitute one sort of world description you may wish to track data about -- it provides historical databases. However, it is hardly the only possibility. Another, very practical possibility would be in reasoning about simulated worlds -- allowing a single database to track facts about multiple such worlds.
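
As a small illustration of the simulated-worlds case, here is a Python sketch of one store holding facts about several worlds. The class MultiWorldStore, its method names, and the world names in the usage note are all hypothetical; this is a toy under those assumptions, not a proposal for how a real database would do it.

```python
class MultiWorldStore:
    """Toy store that tracks which facts are held true in which worlds."""

    def __init__(self):
        self._facts = {}  # world name -> set of facts held true there

    def assert_fact(self, world, fact):
        """Record that `fact` is held to be true in `world`."""
        self._facts.setdefault(world, set()).add(fact)

    def holds(self, world, fact):
        """True if `fact` has been asserted for `world`."""
        return fact in self._facts.get(world, set())

    def common_facts(self, *worlds):
        """Facts asserted in every one of the named worlds."""
        sets = [self._facts.get(w, set()) for w in worlds]
        return set.intersection(*sets) if sets else set()
```

For example, after asserting ("raining", True) in both "sim-1" and "sim-2" and ("windy", True) only in "sim-2", common_facts("sim-1", "sim-2") contains just the raining fact. A temporal database falls out as the special case where each world is a point or interval in time.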

Rather than quibble about minutiae or which definition is correct, I'd like to step back and look at it from a different view:

There are two fundamental kinds of definitions presented here.

The first is a working definition. For "data" (or more accurately, "datum") it is "label: value." Effectively, it's an operant definition. Whenever you see a "label: value" -- regardless of what they mean, where or how they're used, or even if the label is implicit -- you can accurately say you've got a datum. For the majority of computer science and database work, that is sufficient. "Datum" is simply a term applied to a common type of construct, and "data" refers to a collection of them.

The second is a formal definition, which is what you have provided above. It is entirely appropriate and necessary to use such a definition in those areas where we need to perform formal reasoning about "data" -- such as in the fields of formal logic and philosophy, and even some branches of database theory -- as opposed to manipulating data itself (or describing potential manipulations) in concrete or abstract terms, which is what we do in the majority of database theory and computer science.

In work where the formal definition is not required, it is not necessarily enhanced by reference to the rigorous, formal definition. You will note, for example, that Codd's "A Relational Model for Large Shared Data Banks" is a singular work that is undiminished by failing to explicitly define "data", though the familiar working definition obviously applies. In that, and arguably the majority of work in databases and computer science, a formal definition of "data" isn't required. Reference to a comprehensive, multi-disciplinary, formal definition would at best be redundant or extraneous, and at worst confusing. In the context of Codd's work, as with many others, the familiar working definition of "data" is both sufficient and complete.

Likewise, in work where the formal definition is relevant, reference only to the working definition would be, at best, insufficient and incomplete.

In other words, both the working (or operant) definition and the formal definition are correct. They simply serve different purposes, but ideally should both be recognised as components in a complete understanding of the term "data."

-- DaveVoorhis

Those are words with which I very largely agree.

Please pardon just a bit of quibbling on my part. I've heard from different sources that both GodIsInTheDetails? and that there exists a DevilInTheDetails?. I'm not sure whether they're duking it out or making love.

Anyhow, when using label:value as a definition, I'd be careful to add: (a) that you're physically holding them somewhere, (b) that they implicitly represent truths about the world, and (c) that the nature of this truth is determined by a predication relationship over the value that is mutually understood in communications through the use of the label. Without such constraining phrases, a guy like me could (and probably would, knowing me) come along and either create a random generator of label:value pairs and claim it's producing 'data', or start discussing the whole ValueSpace? of label:value pairs, just to show that the definition is leaving out some very important components.

I entirely agree that label:value (bare, naked, without any of the extra phrases described above) is quite suitable for representing data in many circumstances. It doesn't really define data, but it is certainly something with which you can work. It's an operant representation, really. And calling it an operant definition is correct after adding the constraining phrases I described above.

That is a minor discrepancy in the context of your overall point. I fully agree with the notion that both formal and operant definitions have a place.

An operant definition merely supplies a metric or mechanism by which we can quantitatively identify an instance of a concept. For example, in the field of psychology we might design an experiment to correlate mental illness with intelligence -- perhaps to test the hypothesis that there is a fine line between genius and madness. You could start by falling down a rabbit hole of endless definition-quibbling over the meanings of genius, madness, mental health/illness and intelligence and never accomplish anything. Or, you could use an operant definition for madness -- say, the number of days spent as a psychiatric in-patient in the past ten years -- and an operant definition for genius, such as an IQ score. Neither operant definition has anything to do with formal definitions of mental health or intelligence, but both are familiar, recognisable, generally accepted, and (most importantly) unambiguously quantifiable indicators of the presence of these concepts.

Likewise, the operant definition for "data" merely provides a means for identifying examples of data; it provides a clear and unambiguous indicator of the possible presence of instances of the concept. As such, whether the data is physically held somewhere or not, whether or not it implicitly represents truths about the world, or whether or not its truth is determined by some predicate is irrelevant. These are elements of a more formal definition, not the operant definition, which is purely concerned with whether a construct matches the "label: value" pattern or not.

By the way, a random generator of 'label: value' pairs is a category of database utility, used to generate random data for testing or research purposes. (See, for example, http://www.freedownloadscenter.com/Programming/Databases_and_Networks/EMS_Data_Generator_2005_for_SQL_Server.html) Within the operant definition, it is not any less "data" for being random. It may not be information, but it's still data.

-- DaveVoorhis

"An operant definition merely supplies a metric or mechanism by which we can quantitatively identify an instance of a concept." -- Yep. As an example, "the number of days spent as a psychiatric in-patient in the past ten years" is fully qualified. If, instead, you used "a number", then every number is considered a fully correct answer to "how mad is he?". This is directly analogous to your proposed definition -- "a number" and "label:value" are merely representations by which an instance of the operant definition is communicated. To provide an operant definition of data, you need to fully qualify the use of that representation. These qualifications I provided with a few constraining phrases.

"By the way, a random generator of 'label: value' pairs is a category of database utility, used to generate random data for testing or research purposes." -- I can certainly see how such a thing would be useful (even if it doesn't provide anything more than a semblance of data). However, I'm not quite as nice as those guys are. Does it provide an entirely random and generally meaningless label? (foobarble, snarflgoo, yumchumchum, etc.) Does it then connect this label to an entirely random value? ... probably not. This is made to interact with SQL. It is a far more useful tool than the one I was discussing.
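
For concreteness, here is a Python sketch of the less useful kind of generator, together with a check that looks only at shape. Both function names, the label length range, and the value range are assumptions made up for this illustration; the check is one possible reading of "matches the label: value pattern", not an agreed test.

```python
import random
import string


def random_pair(rng):
    """Return a meaningless (label, value) pair drawn from the generator rng."""
    length = rng.randint(3, 10)
    label = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    value = rng.randint(-1000, 1000)
    return (label, value)


def matches_operant_pattern(construct):
    """Shape check only: a 2-tuple whose first element is a non-empty string label."""
    return (
        isinstance(construct, tuple)
        and len(construct) == 2
        and isinstance(construct[0], str)
        and construct[0] != ""
    )


rng = random.Random(0)  # seeded only so that runs are repeatable
pairs = [random_pair(rng) for _ in range(5)]
```

Every generated pair passes the shape check, which is the crux of the disagreement: under the bare pattern these pairs count as data, whereas under the qualified definition they do not, because nothing is being held as true about any world.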

"As for starting to discuss the whole ValueSpace of label:value pairs, just to show that the definition is leaving out some very important components: Yes, you can do that, but the majority of database researchers will quickly shoo you out of the room for being a pedantic pest. Those working on query optimisation, for example, couldn't care less and it couldn't be less relevant." -- I agree. However, that doesn't make them correct in calling a representation a definition. It just means I'm no longer in the room to tell them about it. Really, it's a rather simple concept for them to get into their heads... it only takes two minutes' effort to recognize and memorize the difference between "label:value" and "label:value held as representing a truth that is communicated based on a predication relationship between the label and the value." Those who shoo me out of the room, but don't care to understand or remember even that much, sure as heck shouldn't (and, fortunately, generally wouldn't) get involved in a discussion on WhatIsData.

This all seems to be heading a little deep and academic for mere mortals.

To me, Data is only a concept, and its absence is just as important as its presence. Data is a set of Datum, and a Datum is a piece of information set in a context.

I may be vastly wrong in my interpretation, but it works for me, and I think I see some correlation between what I think and what has been said.

-- BarnySwain

You're not wrong. "piece of information" = proposition held to have a truth value in some world, "set in a context" = database. It is useful to formalize the concept when it comes time to reason about it. The core of your statement is correct. I'd be careful with the "only a concept" comment -- such a position only holds until the moment you start studying concepts.

"A datum is a piece of information set in a context" - I believe a datum Becomes a piece of information when set in a context. I like to think of data as the set of accessible facts about instances of entities. At least I find that useful. Any piece of data can be accessed if you know the entity to which it belongs, and the particular instance of that entity. -- PeterLynch

Data Is

Facts
- Required Facts
  - Required Facts In A Structure
    - Required Facts In A Structure which is Maintainable And Recoverable

-- DonaldNoyes

Depends on context. EverythingIsRelative. Solved. Goodnite. -top

CategoryDataStructure