Double Dipping

Double dipping is when someone dips her nachos in the salsa twice, after licking the chips... then another person goes in and dips his chip without knowing the saliva is all over the salsa! (or pita bread and hummus, respectively!)


In programming, dipping should be done once to keep sterility of code. Code should not be polluted with more potential harmful germs. Double dipping causes bugs and obsolete/orphaned code. Old code (old chips) may be dipped several places and may/will not get fixed if it doubled up.

This may even be when one violates OnceAndOnlyOnce.


Example

Having a customer's schema stored in two or three places instead of one means that code can (will) become orphaned with old obsolete faulty rules, faulty logic, etc. If the "CustomerID" column is changed to just "id" for example, the "search and replace" error prone way of fixing two things instead of one will not hold up in the long run. Humans make mistakes and not all orphaned/obsolete duplicate code will get repaired.

Double dipping can be seen in ObjectRelationalMapping, mapping to a database and declaring the schema twice in a sense. Another problem with double dipping and having two dips into the database (with OOP and with SQL, instead of just one relational language) is that you double your brain mapping. One has to think in double the languages - the OOP language, and the SQL language. One can never abolish or abandon SQL when it comes to needing to query the database anyway - so having all the code dipped twice in OOP and in SQL isn't so effective as just dipping once using something potentially similar to RelProject.

Similarily, we double dip when we define a record or a struct and map our schema to it.

type

  something = record
    field1: itIsAlreadyInTheSchema;
    field2: itIsAlreadyInTheSchema;
    firstname: string;
    lastname: string; // etc. etc.
  end;

This problem is hard to solve, because the database lives on another server. Having the schema right in the language as a type (schema type stored in some module)... just like Arrays, Lists, and integers are built in types, would be nice (DatabaseType).

It seems compile-time resolution + FirstClassTypes would be really necessary to solve the DoubleDipping problem you describe here. Essentially, you need the ability to perform a query at compile-time to fetch the necessary pieces of the schema, and you need to turn those pieces into types that can further be utilized to form the type-descriptor and then by the type-checker.


Even when defining stuff like this in Cee code (which has a global namespace), we can see a DoubleDipping pattern:

   mysql_func(db, 'foobar');
   mysql_other(db, 'fu');
   mysql_redundant_mysql(db, 'fu');
   mysql_again(db, 'fu');
   mysql_yet_again(db, 'fu');
Continually dipping the mysql string in our code over and over again is NeedlessRepetition. The ugly mysql_namespace is double, triple, and infinitely dipping our code. If the functions ever change to mysql5_something or mysql4_something.. then our double dipped mysql_prefix may introduce bugs if when we refactor we forget to replace all mysql_prefixes with mysql4_prefixes. Not only does it cause bugs, but just plain annoyance, verbosity, and NeedlessRepetition.

Below is better:

  use 
    mysql

func(db, 'foobar'); other(db, 'fu');

We dip the mysql string into our code once, as a module in the use or import clause.

If there is a namespace conflict, then optionally access its module by name:

   mysql.func(db, 'foobar');
   mysql.other(db, 'fu');

In other words we only double dip the "mysql." in our code when absolutely necessary. With the Cee language it forces people to make hacks like hardcoded mysql_prefixes, causing ugly double dippings of the same mysql, mysql, mysql, mysql, chips in the salsa.


This is where I think we need truly *global* namespaces, that range from functions/objects, to systems, to LANs, to the entire Net, all using one interlinkable abstraction. You should be able, in a 'program' that references a database, to say something like 'grab the metadata that describes this database, and construct object classes based on the tables in it'. Define the data schema OnceAndOnlyOnce where it belongs, in the database (or in code which constructs that database; preferably in a single block of declarative code which does both).

Doing something like this kind of blows the mind of a C compiler, or even a Haskell compiler, or any statically-typed compile-link-run workflow, since it means the type-checker no longer has absolute control over all entities it creates. But that's life in the modern networked desktop world, where programs must coexist in a rapidly-changing data environment, where entire new data *types*, not just instances, can be created at runtime by other systems or by user activity. A dynamic operating system with user-creatable databases and file and object systems, seen as a whole, doesn't *have* a single 'CompileTime' anymore; at best it has a 'shipped from factory' time, but what with service packs and patches, we can't even depend on that. But our programming languages, by and large, still assume that we can break up our system into static 'applications' that assert all the possible types they can work with, at compile time.

I thought this was what the promise of ObjectOrientation was in the 1980s (and then ComponentProgramming? in the 1990s) - global object repositories, lots of little data types with standard interfaces, loosely linked together, able to be broken apart, reused, and plugged together in new ways - but somehow the opposite occurred. ComponentObjectModel brought us OLE which brought us a monolithic MicrosoftOffice. And I still can't mix and match my own view of the data I own. Why?

Global namespaces are a fine idea, so long as issues of security and name distribution are handled well. I certainly make a case for it in WikiIde - heavily 'hierarchical' namespaces (as seen with folders) need to be severely discouraged if sharing is to be made easy; folders hide names into tiny little niches. If you say a global name is just an absolute URI (not relative to any unspecified location), then (in many ways) we already have a global namespace, but I believe that what matters as much or more than the namespace is ensuring interoperability... i.e. formalizing protocols and message transport in common languages that can be implemented directly from specification. I.e. having a function in a global namespace does you not one whit of good if you can't figure out how to request its representation (e.g. in text or XML) then further interpret it. Nor does it do you much good if you need to write special code to interpret each different function in the namespace. Essentially, you also need a global TypeSystem and considerable standardization to make a global namespace useful.

However, a global namespace doesn't necessarily preclude a static CompileTime. I.e. a system could easily have a CompileTime at which point it takes a snapshot of the relevant function and procedure descriptions from around the world, then promptly type-checks and compiles and doesn't automatically accept updates to function-descriptions as being updates to the compiled object. And, given that security, privacy, and support for offline work in case of disruption are all critical issues in a global namespace, this may even be desirable, though capability to perform runtime interpretation remains implicit.

That's pretty much what I'm thinking of, yes. By 'global namespace' I don't necessarily mean 'flat', just connected (modulo security), and incremental compilation with manual intervention sounds great.

What I am somewhat concerned about is inflexible TypeSystems which can't cope with generating new types as runtime outputs - because as far as I can see, *all* computer time is now runtime, if you have to switch a module off to recompile it, you've experienced a DenialOfService.

Actually, there is very little keeping you from simply creating a new service, running it in parallel, then swapping one out for the other, so DenialOfService isn't necessarily the case. See RuntimeUpgradeableCore for some ideas. Actively connected services, of course, might be interrupted briefly, but that is likely for any intrusive update - i.e. you can't even modify a database schema effectively if you have to deal with continuous updates during the handoff. And entirely static type-systems are a problem since they restrict forms of communications (e.g. if H.264 and Vorbis et. al. were 'types', the inability to add new types at runtime would also be the inability to handle new video formats.) That said, there are many potential paths to add types at runtime. One is to use modular plugin systems. One may be aided by use FirstClassTypes, which allow the creation and commmunication of type-descriptor values. As far as 'flexibility' goes, a global TypeSystem can be as flexible as we need it to be so long as this flexibility is standardized. There is no reason we can't have PredicateTypes, DependentTypes, NominativeAndStructuralTyping (named types by URI is similar to using XML schema; structural typing is type-descriptor-by-value), etc. But it's worth keeping in mind that even having FirstClassTypes doesn't preclude the necessity to initially design the system for runtime upgrades if that is a desired capability.

This suggests to me that the whole notion of 'types' needs to be reformulated as functions or abandoned (sorry, BertrandRussell); ie, an arbitrary function needs to be able to take a type as input and generate a type as output - after all, isn't that what a compiler is? What *is* the type signature of the Haskell compiler? [Type of a compiler? See ThirdFutamuraProjection.] And that's where I find myself getting deeply lost in all the discussions on LambdaTheUltimate about which non- TuringComplete type system is 'best'; as far as I can tell, they're all fundamentally unworkable unless they're available to the runtime system for checking newly created objects and impose no restrictions on the nature of those objects.

Why seek a non-TuringComplete type system? I, personally, like having a TypeSystem that is just as powerful as everything else in the language, minus communications. It's just as easy to complain to the programmer that some types are taking too long to resolve, point those out, and ask for advice; given TuringComplete type systems, programmers will understand this well enough, and 99% of the time it's due to something easily fixed by the programmer. My experience with non-TuringComplete TypeSystems is that, inevitably, along comes some situation where a programmer will want the power of TuringComplete systems and will need to reinvent it, badly, via construction of some sort of third-party tool. Also, even though the HaltingProblem is impossible to decide in the general case, it is very decidable in the vast majority of cases in practice, so you could even have the compiler do a little bit of quick time-and-memory-limited analysis to see if it can determine whether it will halt in the case it is working on.

So I'm not sure that a global TypeSystem, in the way we currently perceive strict typing (non-overlapping hierarchies), is possible in any system that can modify itself at runtime; but if we were to treat types more as some kind of logical declaration of features and constraints, and then have a global repository of such rules, that might work.

[I basically agree that DoubleDipping (which seems to simply mean an extended opposite of OnceAndOnlyOnce) causes lots of problems. And our current 'state-of-the-art' tools are not really suitable to deal with the redundancies in the realities of the internet and such. But I do not fully agree with your schema/sql example. Or maybe I think it is misleading. Surely duplicating the same structure is DoubleDipping and that's where your example is right. But in many cases the structure of e.g. the object model and the database model is intentionally different (if it is intendend to be the same you example holds, but this should be stressed then). Sometimes - or for large applications rather often - the structure of the objects in ram is hugely different from the ones in the db. And this is not because of lack of rigour. The requirements on the structures are quite different and the accesses serve different needs. And this is often good (if you use the db only for persistency thats another case). -- GunnarZarncke]

[I agree that it can be a good thing not to map things directly. Alright but consider if you have a whole bunch of SQL strings in your application that say SELECT vegname FROM vegetables (returning carrots as vegname). If the database ever changes and now the new table called veggies replaces vegetables... our compiler/interpreter should inform us of this error immediately the next time we recompile/run the application. The compiler checks the schema and says but wait! there is no vegetables table. Search and replace or refactoring doesn't always work and some humans forget to replace all instances of vegetables with veggies, etc. etc. If there is a DatabaseType available to the application, it automates and saves us from running around to 5 places trying to synchronize what DB schema is really available. Changing SQL strings in our application is error prone since they are just strings. Parameterized queries are just as much of a problem.]

  connect(server15);
  resultset = query(vegname FROM vegetables WHERE color = clOrange);
  print(resultset.all); 

compile error line 2. There is no vegetables table on server 15. Check your DatabaseType for the schema specification.

So then we repair it to:

  resultset = query(vegname FROM veggies WHERE color = clOrange);
  print(resultset.all);  // prints "carrot" and "orange peppers" on the screen

[Actually I have done exactly this already. The class defining the constants representing database tables and column names is simply generated from the database schema. I implemented different versions for different projects. One qerrying an actual database, one using the create table script and one using hibernate annotations in the classes representing tables. So my compiler does complain when some table is changed. -- .gz]

{The idea of having languages with semi-static type definitions, i.e., that alter only in response to changes in some external resource -- such as a database, or a WebService WSDL document, or an API specification -- is an interesting idea. I can see it promoting a fix-by-error-message approach to application changes, in which you alter the external resource first, then attempt to recompile everything (or watch the dynamic scripts fall over), and finally repair everything that coughs up an error message. It must have been tried.} -- DaveVoorhis

[With an embedded database like SqLite or RelProject couldn't this checking be done much easier since there are not TCP/IP connections.. The issue is having the compiler do a call home to some server. Whereas an embedded localhost DB would make at least prototyping the idea much easier.]

{I'm not sure how the mere absence of a TCP/IP connection would make a difference. If you change a type definition by whatever means, the code dependent on it will have to change.} -- DV

[If the network is not available, the compiler will barf and complain. On localhost, one doesn't need an internet connection. I am considering usability now of the compiler calling home to some server. On localhost, TCP/IP would not be needed (although localhost TCP is still possible, if the person has a network installed on the computer). Consider if the DB lived in some file on the hard drive (embedded) and did not require any TCP/IP. This makes it much easier for the compiler to access and find info about (since it is right there on the hard drive). All that would be required would be that the compiler query the correct file on /home/me/mydb/ or C:\documents and settings\mydb\. Now I'm unfortunately discussing physical specifications of the DB - but generally embedded localhost DB's are much easier to access physically, than a server based DB (to obtain the schema info). The compiler could even query the database using the query language and access its information tables to get the type information.]

[At this point, I'm considering playing around with this idea and want to get started on something that is easier to prototype immediately. A localhost embedded DB, rather than a full blown package like Oracle on some other server, will simply be easier to play with in practice during the beginning stages. I'd also like to hear about what sort of checking is done in RelProject, and how errors work, etc.]

{Integrating database and application functionality into a single language does not preclude distributing components over a network, nor should it. In high capacity environments, we can assume it to be necessary. There's any number of reasons why a database might be inaccessible, of which TCP/IP failures are relatively rare. If the network is not available, then the shared database (or whatever shared component) isn't either. The type system and/or compiler can throw a "<service> not available" exception. Developing database applications against an absent database is going to be nigh unto impossible regardless of the reason. In the RelProject, there's a significant amount of execution-time checking (though more could be done) that goes on to verify that external resources are accessible and valid, and throw exceptions if not. However, this is not appreciably different (except in the implementation details) from similar checking and exception throwing that goes on if the native storage engine goes haywire. From a user's point of view, the only difference between (say) a native relvar becoming corrupt and a TCP/IP connection failing is what error message is reported.} -- DV

The idea of a stand-alone system should be carefully handled once the words 'global namespace' enter the picture in any case. If you have a global namespace, one cannot avoid the issue of network connections potentially being down. Disruptions and delays should be assumed. The best one can do is attempt to handle fault gracefully (GracefulDegradation). In this case, I'd suggest having the OperatingSystem cache such things as remote schema and simply have the compiler print some warning message and move on gracefully when it uses the cached version of the schema rather than the remote schema. Keeping this in mind, CompileTimeResolution ought to use RepresentationalStateTransfer and largely be side-effect free (as much as possible, anyway). E.g. fetching a schema by URI (or some sort of URI stand-in) rather than explicit procedure would allow a more ready identification of which 'state' is to be represented and therefore can be cached.

Some thought ought to go into such things... but, again, this is much more of a 'global namespace' issue than it is a database issue. Similarly, global namespaces introduce issues on how updates to objects shall be handled when disconnected. GracefulDegradation would be to hold onto said updates until such a time as a merge can be performed.

Each database that you access could be accessed as if it was a separate module (namespace). Databases, by nature, are global stores of data. So if you were connecting to the somedb, and if you had anotherdb... you'd import that into your program:

  import
    maindb, anotherdb;

When the compiler sees we are importing maindb and anotherdb, it tries to connect to it, based on the IP/login/password settings specified in a configuration file (or, maybe in the source file itself - but I'd recommend a config file that could be reused and swapped easily).

Usually, an application is only accessing one single database and it isn't a problem to treat the database as a global store of data.. this is by nature how a database is used today (i.e. like a global module, or namespace).

Actually, I dispute that 'an application is usually only accessing one single database'. This (false) assumption lies at the root of my current intense frustration, as a system administrator and as a user, with just about every piece of software I use. They all make the mistake of assuming that I don't want to join information from separate databases, or separate application domains. But I do! In fact, both as an administrator and a user, joining information from separate databases to make new queries or mini-applications is *precisely* what I use a computer for - and it's the thing that the applications and operating system do their best to make impossible! Possibly 'application programmers' can avoid having to deal with this unpleasant reality by restricting the domain of data they deal with to a single, purpose-built database - but the rest of us, who want to get actual work done, which involves cross-referencing and linking data, aren't nearly so fortunate.

For example: let's say in my personal DataSpace, I have a set of Excel spreadsheets, a set of Access databases, a whole filesystem full of Word documents, a bunch of .CSV log files, several corporate SQL databases accessible via ODBC, my groupware/calendar/email application, my Web browser's history cache (either Favorites or Firefox 3's new SqLite thing), a bunch of technical support websites, a Web bookmarking service like del.icio.us, one of those 'virtual sticky yellow notes' tools like TomBoy, a source code versioning system like SubVersion, a bunch of Notepad text files, a proprietary ObjectOriented database, several Web-based frontends to databases that don't provide a SQL API, various management tools that *only* provide a CommonObjectModel API, a management tool written in Java that (presumably) provides some kind of Java API (though it will probably be easier to write an AutoIt script to screen-scrape and capture text the hard way)... and my job requires me to *manually* punch keys and retype information between all these various sources in order to answer queries. Seems like that's what my computer should be doing, no? But the OS and languages I use force me to do this myself.

So let's start by tossing that assumption that an application should only ever connect to one database. Connecting to just one database should be the *exception*, for very narrowly focussed tools of little importance that don't need to interchange information with any other system at all.

Ideally what I want is some kind of personal / enterprise DataFabric system, that lets me import, compute and publish small structured chunks or collections of RaggedData? with as much filtering and type-checking as I need, and keep it automatically up-to-date as data sources change. And structure each element as a function or filter implemented as a process. For example: take one CSV log, non-destructively monitor it for changes at runtime, generate a new record for each new line, parsed into fields. Take those fields, massage things like DNS addresses so they match a corporate standard. Cross-correlate that with a set of different SQL databases maintained by other management tools. Compute new fields and collections to indicate things like 'history of all changes to a workstation, ordered by time. Export this as a new globally-accessible object or stream in the system so the web server can import it. Make it so I can create my own personal kinds of these persisten mini-apps or filters, but can run them on a corporate server unchanged if I need to. Etc. If we can get there, we'll have something very nice.' -- NateCull?

I wish I could offer you a fresh cup of DontModeMeIn, but Microsoft bullied it off the market. I fully sympathize with your plight. I want global transactions and arbitrary join queries to go along with my global namespace and narrow-tabled databases.


AprilZeroEight


EditText of this page (last edited November 12, 2014) or FindPage with title or text search