File Trees To Manage Code Discussion

Because of LimitsOfHierarchies, I believe that the existing hierarchical file systems are insufficient to manage the code of non-trivial applications. I would like to hear other opinions on this. However, most existing tools are hard-wired to work with treed files only. -- A RelationalWeenie

Uh, wrong. In many languages, little assumption is made about what "contains" functions; many languages explicitly decouple the structure of a program from the structure of the storage mechanism that the SourceCode is stored in. Many development environments assume hierarchical filesystems, because that's what most operating systems (which host the development environments) provide. But I can't think of any language that requires that its code be in a file in a hierarchical filesystem (Java?). Your average C compiler will happily take its input from anywhere the OS allows - a file, a terminal, a pipe (which allows the source to come from the output of another program - or a database query), or a monkey typing on the keyboard. This is almost always an OperatingSystem issue, an environment issue - not a language issue.
- I don't think C is a good example because it builds it's own internal structure before making an EXE file(s).
  - The same is true for most other things. The one difficulty is languages that do textual inclusion - #include "foo.h" assumes a filesystem in most C/C++ implementations, though the standard states that the semantics of #include <> and #include "" are ImplementationDefined.
Some languages, like SmalltalkLanguage, avoid the filesystem altogether by putting everything in an image (see ImageBasedLanguage). Many such languages do structure things hierarchically somewhat, but an in-memory image can easily be organized however you like - including along relational lines.
VisualAgeForJava stores all source and bytecode in a (shudder) database.
- Is it a RelationalDatabase? {I assume it is a NavigationalDatabase}
- I used it for a year and it never gave me a clue. Does it matter?
- To me, no. To others, maybe. OTOH, there might be uses for different types of queries on the code.
- Such as?
  - TableOrientedCodeManagementDiscussion discusses some example queries/scenarios. I think most agree that ad-hoc relational query systems are more powerful than their navigational cousins. And SQL (if used) is already known, thus little or no training.
- [It would be useful to know for an extensible IntegratedDevelopmentEnvironment. Not sure whether VisualAgeForJava qualifies. But the Smalltalk videos I've seen on that subject were intriguing.]

Should I assume that we're talking about TwoDimensionalTrees? here? Safe to assume that a BanyanTreeForest? model is not the idea?

  function A(...) {
    ...
    B(...)
    ...
  }

  function B(...) {...}

How does A know where to find B in an interpreted environment?

"Interpreted" language usually only refers to the back-end. Most interpreters parse as much text as they can, and build up the symbol table as they go. A has no problem finding B, as by the time A gets to run, B is already in the symbol table and is well-known. Most symbol tables are somewhat hierarchical (a symbol table can be an element of an enclosing symbol table, this is how nested lexical scopes are implemented), but this has nothing to do with the filesystem. If the language supports eval(), and the argument to eval cannot be pre-parsed; it is parsed when it is run.

Some interpreted languages will find B by searching first in current memory cache, then current application directory, then along the OS executable search $PATH, while others will require you supply a location or list of locations to test for the existence of B at run time.

{Maybe you should have a look at IntentionalProgramming. There all "symbols" are in a 'data base' and can be viewed and edited in different ways. I heard Outlook was built this way. -- GunnarZarncke}

Maybe that is the problem. If the language has its own "symbol table", then we have a OnceAndOnlyOnce sin with regard to the database. At least some of the symbols, such as function names, are already in a table if you use a RDBMS to manage code. If it was truly "device independent", when it could get function info and do name-space searches within the database (along with some caching).

Please go learn what a symbol table is before making suggestions on how they should work, because it's quite clear you don't know.

I suppose I am going to have to build a table-oriented interpreter to shut skeptics like you up. BTW, I never claimed a "symbol table" was a relational table. -- top

A symbol table could be implemented as a relational table, easily enough... and given that the parent attribute of any symbol would be a symbol, you could even do useful joins on such a creature. Most compilers don't do that for performance reasons.

All modern programming languages already have a "dictionary" DataStructure as part of definition or standard library; so no violation of OnceAndOnlyOnce occurs here. Unless, of course, somebody decides to write their own dictionary class, in which case they're just being stupid. :) And of course, the discussion of the symbol table is rapidly diverging from the topic of this page, which is use of files, more specifically HierarchicalFileSystems, to manage codebases.
It is not a matter of having a dictionary library, it is a matter of where the information about the program is. And, the topics are possibly tied, so it is too early to conclude this is off topic. Remember, our sample goal is to store and manage functions in a RDBMS instead of traditional files.
- You can do this in most languages today. Rather than having your MakeFile? do

 gcc -c foo.c

have it do

 <command to issue database query to retrieve source code> | gcc -c /dev/stdin

Similar tricks can be pulled in other languages.

The same tricks may not work for interpreted languages.

Hmmm. Since there are as many interpreted languages as there are file systems to house them (more, actually), what is it we're seeking here?

I want to manage source code in a RDBMS instead of a file system. Ideally, each function would be in a database record, but a consolation prize is having each module in a record. I generally use interpreted languages.

Which language(s) do you use? If it's BourneShell, that's easy - just use a HereDocument. PhpLanguage and JavaScript are similarly easy - the languages are designed to source and interpret code that's generated on the fly (and "generation" can certainly include retrieval from a database). Don't know about perl/python/ruby, unfortunately.
How about a PHP demonstration of the above function A B example?
Are you suggesting that the code should be pulled from the RDBMS at run time and then executed?
Well, the interpreter pulls files from the file system at run-time. Why are file systems more sacred than databases?
Because a file system is (generally) a service of the OS. A RDBMS is not. Yeah is not that great a reason, but, to run something like this actually, it would mean giving root permission to Postgres, have lots of overhead and, come on, you don't need it. Also, most RDBMS run over regular files, so why not just CutTheMiddleman?.
My goal here to improve human management of code. If performance is the only reason not to, then at least lets explore the future when such concerns will seem silly. O/R mappers, for examples, are performance hogs, but people still use them with the assumption that it performs a software engineering benefit. There are many situations where managing complexity is more important than benchmark performance. Note that a file server is usually not part of the desktop that runs an app from the file server.

It seems to me that you are just trying to emulate PolyMorphism. This may be good if you don't want to mess around with an OO model for your data. But, the downside is having some messy SQL code deciding which function is really called, instead of a single and elegant call, and letting the compiler or runtime to decide which one is really called.

Why is it "PolyMorphism"?

Well, I guess that if this happens at compile time, it is just some way of conditional compilation. However, if you do this at runtime, you basically are applying different (but probably related, so the query has some sense) functions depending on the data, without going into a if or switch statement (in C and the like) to determine the function. That I see as PolyMorphism (in a broad sense).

>>My goal here to improve human management of code. swing-developer replies: this is a jolly good idea, and an RDBMS is a natural fit. There's a lot of information which is lost when code is edited, and if the editing of code had as a side effect a transaction within a database, then that act of editing itself, complete with some aspects of the intent of the human who did the edit (let's start there at least) could become a first class citizen. That's very interesting information. Of course, I myself could never stand to give a reason why I had committed something to version control and no programmer is going to accompany every edit with a little story about what they were thinking, but that's not required. Performing refactorings like "rename variable" or "extract method" have built into them, if they're executed as committed macros (which they all are nowadays) what that aspect of intention was.

It may not sound like an interesting thing to know that such and such edit was performed as part of, say, a global rename of some variable, but that information identifies the persistence of a thing, that variable, which otherwise might be missed. Identifying the persistence of things which may mutate forms over time is a first class Hard and Interesting problem that can be leveraged in very many ways.

As an example, in the case of "extract method", it would be hard to know that the appearance of some lines of code that used to appear in Method A but are now missing from that method yet appear in Method B is definitely an example of "extract method" without recording the execution of extract method explicitly.

The utility of the RDBMS in this scheme is it provides a gatekeeper through which all edits must pass thus providing a way to know when any line of code has been edited, which in turn permits arbitrary code to be run in response. As it is now, there really is no centralized mechanism that distributes the fact of the event of an edit. Furthermore, edits can come from any number of processes or actors. To obtain global knowledge of all edits, you could try to write document listeners for each file, and integrate that into the IDE (if you have access to the source), but without the cooperation of not just the IDE but all plugin writers who cause edits to occur also, you can't know what the intentional semantics (a made up term whose meaning should be clear within the context of this post) of an edit were. Writing listeners that listen to every file in a project is No Fun. Trying to interpret the meaning of those edits just from the changes they inflict on a file is Even Less Fun.

Moved from HowCanSomethingBeSuperGreatWithoutProducingExternalEvidence:

Ideally languages could work directly with relational databases instead of just hierarchical file systems. Languages are tuned to hierarchical file systems out of historical habit, but non-tree file systems are a legitamate idea. Until that happens, eval() is one approach. It would make an interesting demonstration if you could show how Lisp could better integrate into table-oriented approaches without having to overhaul the language to get away from tree files. -- top
- Huh? What do filesystems (hierarchical or otherwise) have to do with this? Or are you equating LexicalScoping with hierarchical filesystems (and their bastard cousin, the dreaded NetworkDatabase) just because it's tree-structured? I'm not sure what you are getting at here, top. At any rate, most languages do work with RelationalDatabases, and other than perhaps JavaLanguage (which uses filesystem hierarchies for its package structure - a design decision that is controversial within the Java community), most languages don't care one whit about how persistent storage is structured.
  - If you read the JLS spec, there is nothing that requires filesystem hierarchies, it is just that many IDE designs thought it was advantageous to take advantage of that mechanism to "find" source files from FQNs so that they didn't have to maintain an index of classname to file location. But they now have so many symbol location tables and all kinds of other meta-data that it is rediculous. Eclipse will not accept a source file not located where it package says it should be in the filesystem hierarchy. Netbeans will, but it doesn't use any mappings so if a file is not in the correct directory, it will get compiled over and over because netbeans will look in the same relative directory path for the output class file and since it never gets compiled there, netbeans always thinks it needs to recompile. Filesystem access and directory traversals are expensive because they are "small" amounts of data transferred between kernel and user space using tons of calls. Databases can be much more performant because they transfer large amounts of data over open ended pathways (the network) and you can use multi-threading/multi-tasking to weave the data transfer latency into other operations. - ggw
  - Java doesn't have to use filesystem hierarchies for package structure. See VisualAgeForJava. It stored all source and bytecode in a (shudder) database.
- Top, this has nothing to do with filesystems or databases, you don't understand these features and you're spouting nonsense. Please try and learn some programming before spitting out opinions about stuff you know nothing about.
- Why do you automatically assume the communications fault is on MY end? That is not very diplomatic of you. What are you trying to achieve by using harsh accusations? I will try to clarify what I am getting at. Ideally I believe function code should be managed in database cells. If function A calls function B, the interpreter needs to know where to find function B. The determination of this in most languages assumes files, not databases, as the function containers. -- top
- No it doesn't. You assume files because you know next to nothing about programming. Everything you say is so full of misconceptions it's hard to even know where to start. You need to do your homework and learn about stuff before you involve yourself into conversations that are way over your head.
- Then please tell me, Dear Nobel Winner, how is the interpreter going to know where to find function B in the above example if B is in a database? And some advice: if you are not going supply technical content, but just insults, then please say NOTHING. You are not wanted here and you help nothing nowhere.
- [I'm no Nobel winner, but I have used a compiler that fetched source from a database. It finds it by name. How else would it find it?]
- There appears to be a misunderstanding here. I don't mean application code that does an explicit database lookup; I mean code that calls a function like a normal function call. The interpreter needs to know where to look for that function. There is a normal "hunting algorithm" that an interpreter uses to find routines (search path). Somehow this would have to be overriden or altered if we want it to add the database to its hunt list. I don't know of too many languages that provide for this out-of-the-box.
- [There's no misunderstanding. The compiler (or interpreter) finds the function by name. The application doesn't do an explicit database lookup. The language doesn't usually determine this, the compiler or interpreter does. The Java language spec says that files are just one way to store source code and bytecode, not the only way. Your claim that "[l]anguages are tuned to hierarchical file systems out of historical habit" is incorrect for Java and Smalltalk (and probably C, C++, C#, etc.). Most of them aren't "tuned" to any specific storage system but allow compilers/interpreters to define that. IBM's Visual Age series worked this way in an attempt to facilitate team coordination. The idea was interesting but the implementation was problematic (code and comment format between methods wasn't preserved properly, for instance). Today most teams keep their source in a version control database and check out entire files as needed. IBM has largely abandoned the Visual Age series and its replacement once again uses files for source code.]
- I realize there may be exceptions. I did not say "all". But, I would still like to see a pseudo-code example of how one tells the language engine to transparently search for function code in a RDBMS. What do the SmallTalk or VA "hooks" look like?
- [The code looks the same regardless of where it is stored. Again, this isn't a language issue. It's a compiler/interpreter issue. It sounds like you aren't grasping the difference between the two.]
- So one has to tweak the underlying compiler/interpreter code to make it happen instead of through existing API's? If so, just say so. I asked a how-to question, not a definition question.
  - In Java, you'd just provide a ClassLoader implementation that knows how to use the database system to see if a requested class can be loaded from its "source" of code. The URLClassLoader loads stuff from URLs. The plain Java ClassLoader uses the "classpath" mechanism to enumerate directory structures to search. -ggw
- [You asked "how is the interpreter going to know where to find function B in the above example if B is in a database?" and I answered "by name". It's the same way it finds it in a file system.]
- I give up. This conversation is going in circles. Anybody else want to try an answer?
- [I've given you the answer. The interpreter knows where to find function B in the above example because it knows that "B" is the function's name. It looks up function "B" in the database and loads it. What else is there to say?]
- Yes, but few offer this ability without altering the interpreter's source code. Would you at least agree to that? Note again that I did not say "all".
- [Certainly. Most compilers, interpreters and virtual machines read source and/or byte code from files. I didn't know that was being questioned. Notice that I don't say "languages", but "compilers, interpreters and virtual machines". Most languages don't specify where or how source or executable is stored. You might want to look into the history of compilers and interpreters that load source from databases. They haven't proven to be more useful than those that load from file systems. I think history weighs against your claim at the top of this page.]
- At this point I suspect it is because of performance issues. Databases are generally not tuned for such purposes in practice. But, MooresLaw and the need for a "structure" more powerful than a tree may eventually change the decision equation. (AS/400's file system can allegedly double as both a hierarchical system and relational.) If anyone encounters reasons other than performance, please let us know. I admit I was sloppy in my usage of "language" versus "interpreter". I apologize for that lack of precision and any confusion it may have caused.
- [First, languages don't require a structure more "powerful" than a tree. They don't even require a structure as powerful as a tree. All they need is a map. Many modern languages provide hierarchical namespaces to prevent name collisions, but those namespaces are flattened in implementation to become part of the names. For instance, when a Java VM loads "com.foo.Bar", it doesn't load "com", then load "foo", then load "Bar". It loads a class named "com.foo.Bar". It could just as easly be named "comfooBar" for all the VM cares. The tree is merely a convenient way for humans to organize classes. Second, performance hasn't been the problem. The problem has been lack of benefit. Keeping code units in a database hasn't been demonstrably more convenient than keeping them on a filesystem, and in some ways it's less convenient.]
- I meant for human convenience, not necessarily machine or algorithm convenience. Large code trees seem to suffer many of the problems described in LimitsOfHierarchies. I sometimes have to create a parellel "meta-base" to keep track of file or code snippet attributes. Granted, if I wrote the code in such trees, I would probably have taken steps to reduce problems encountered, but directory trees in practice just grow into messes.
- [So you're saying that a bag of classes will be less messy than a tree of classes? Why would that be?]
- It is easier to keep to OnceAndOnlyOnce outside of trees. Trees don't handle orthogonality very well, as described in LimitsOfHierarchies. One needs to go outside of trees to fix some of that. As far as "messy", I find normalized tables the most convenient, but If we are just comparing trees to other navigational (non-relational) structures, then it is hard to say which is more "messy": the duplication needed in trees, or the spider web of pointers in the other solutions. Sometimes the tree is better even with duplication, but sometimes not. It is a nasty battle of duplication versus tangled pointers. It is like deciding between being forced to go to work naked or in drag.

The RelProject implementation of TutorialDee stores all user-defined code definitions within a relational database. All code resides in the database as source. There is no code repository outside of the database. Upon the first invocation of a given function, its source code is looked up in the database, compiled to an executable form, the executable form is cached, and finally it is executed. Subsequent invocations execute the cached version.

Related: TableOrientedCodeManagement

Perhaps move all this to TableMantraTakenToExtreme?

CategorySourceManagement, CategoryInfoPackaging