Little Language

One spin-off of the UnixDesignPhilosophy was the realization that it is easier to implement a task-specific language optimized for that task than it is to implement a general-purpose language optimized for all possible uses. The Little Language title was given by JonBentley in the Communications of the ACM [Jon Bentley, "Little languages", Communications of the ACM, 29(8):711-21, August 1986.].

What Bell Labs did was to make separate languages for the tasks they found, and optimized them for those tasks. These tasks include:

Pattern matching (and thus RegularExpressions were born)
text line editing (ed/sed)
language grammar (lex/yacc)
Regex record data manipulation (AWK [AwkLanguage])
shell services (sh)
text formatting (troff/nroff); spawned multiple interacting little languages:
- picture drawing (pic, IDEAL)
- graph drawing (grap, dot, GnuPlot)
- typesetting mathematics (eqn)
- typesetting tables (tbl)
- CHEM (chemical structures)
- bib, refer (bibliographic notations)
- dformat -- data structure/record format drawings
- ptx (create permuted KWIC (keyword in context) index)
- macro packages like mm, me, ms
incremental rebuilds (make)
preprocessing (cpp, m4)
arithmetic (dc [DeeCee], bc)
find (non-power users may not have noticed that it supports full boolean expressions over file attributes)

After UNIX left Bell Labs, more other little languages got added to it, including:

shell services (csh, ash, ksh, zsh, bash, tcsh, etc.)
file editing (vi) (vim and several other clones are now big)
sendmail CF configuration language (TuringComplete, I think, but obfuscated)

I know there are more. Could someone fill these in? Thanks.

Little Languages have these characteristics:

may or may not be TuringComplete (see table below)
very efficient to execute (for example, sed scripts don't need to be tokenized; they already are.) both in terms of speed and memory (most cases) in their intended scope.
reasonably simple to express (RegularExpressions are simple enough to be concisely described on a single sheet 8 1/2"x11" paper, single spaced, 10 point font.)
very powerful, within their scope, frequently to the logical extreme of their scope.
pathologically cryptic, requiring users to become experts before using them
Some little languages are not designed with efficiency in mind at all. Instead, it's intended to save the developer time and effort by allowing her to express her intentions at a much higher level of abstraction. This allows the developer's programs to be much shorter than equivalent programs in other languages.

The "little" part primarily refers to the scope of what the language tackles. (For example, TeX is purely focused on typesetting, so it qualifies as a little language no matter how big the implementation). Little Languages are related to:

DomainSpecificLanguages: a little language is a kind of domain specific language in a very narrow domain.
MinimalistLanguage, a language whose interpreter/compiler can be run on systems with very little RAM. Unfortunately, this typically requires user programs to be much longer than equivalent programs in other languages, as the user manually implements stuff the interpreter left out.
LowKeystrokeFormalLanguages ?

Below is a long discussion about the TuringCompleteness of the languages above. Let's summarize here what we know about the TuringCompleteness of the above:

RegularExpressions: not TC, equivalent to FiniteAutomata, but is TC if match and replace and also repeats are allowed
ed/sed: TC (but data may be tricky to represent)
lex/yacc: not TC per se, equivalent to PushDownAutomata? (ContextFreeGrammars?)
AWK: TC
vi/ex: TC
sh: almost TC, only needs some way to manipulate data structures (dd is enough; dd/sh is even used to distribute CLC-INTERCAL)
troff/nroff: ?
TeX: TC
make: not TC, lacks means for representing data. Needs a shell to do anything.
cpp: some implementations are TC (but loops have to be written by #include'ing the current file)
m4: TC
dc/bc: TC (but data may be tricky to represent)
IDEAL: not TC

Other little languages:

"dc is the oldest language on Unix; it was written on the PDP-7 and ported to the PDP-11 before Unix [itself] was ported." (KenThompson)
Tcl "ToolCommandLanguage" ( see "little language" -- http://wiki.tcl.tk/3075 )
"SQL used to be a little language"
the Berkeley Packet Filter (BPF)
control languages for the BPF, such as OpenBSD's pf (http://www.openbsd.org/cgi-bin/man.cgi?query=pf.conf&apropos=0&sektion=5&format=html)
Would Jakarta Struts application configuration (struts-config.xml) be considered a little language? Only if you consider every XML DTD to be a little language.
BitScope? http://massmind.org/techref/language/meta-l/bitscope.htm
RebolLanguage ??? REBOL isn't a little language itself, but it's designed for creating them. --Gregg Irwin
LOGO ??? Logo is a dialect of Lisp, but it's not a little language. The little language inside Logo is the one related to Turtle Graphics. --Gregg Irwin
In general, LowKeystrokeFormalLanguages tend to be LittleLanguages?

Some Little Languages are TuringComplete. For example, I see no reason why somebody couldn't write a Universal TuringMachine interpreter in AWK to prove it. (If that sounds like too much work, you could simulate a URM machine - I would have thought the AWK associative arrays would be great for that.) how about a sed Universal TuringMachine? <http://queen.rett.polimi.it/~paolob/seders/scripts/turing.sed> -- (That's down, try http://sed.sourceforge.net/local/scripts/turing.sed.html)

How about Lex/Yacc, TeX, and AWK?

A related sort of little language is the little language that isn't designed with efficiency in mind at all. Instead, it's intended to save the developer time and effort by allowing her to express her intentions at a much higher level of abstraction.

SQL used to be one of these, I think. Prolog and the various expert system shells also. The idea seems sadly quiescent these days; instead we're all standing around marvelling at bytecode, container libraries, and garbage collection as if these things somehow represented advances in the state of the art.

-- WilliamGrosso

Lex/Yacc, sure. TeX? Maybe at one time. It no longer seems very little to me anymore. I guess it still could fit, due to ancestry, but I personally would never list it. AWK? probably, though it's also growing. Part of my argument against AWK would be that most things people do in perl could also be done in AWK, and I would certainly not call perl little. AWK is admittedly smaller, and it was originally little (the version on the AIX box at work is easily a Little Language. The version on my Linux box at home probably isn't. For the record: on my Linux box, sed is 35k, mawk is 88k, and gawk is 255k.)

The size of the implementation code is not the point, especially given the Moore's Law side effect that most programs keep growing larger over time, whether they need to or not. Someone might implement any of these things (bc, ed, lex, tic tac toe) in a hundred billion line C++ program that requires a petabyte of RAM sometime in the future, but that wouldn't necessarily mean that it's no longer a "little language" (although it would certainly make the term ironic, and one would have to wonder if bc really needed a real-time ray tracer etc to draw the numerals in a pretty way)
The "little" part primarily refers to the scope of what the language tackles. TeX is purely focused on typesetting, so it qualifies as a little language no matter how big the implementation.

First of all, the TuringTest refers to an AI problem rather than computability.

TuringComplete is framed in terms of mathematical operations on the integers {0,1,2,...} rather than in terms of machine words or whatever other data type you use in practice. (So long as your data types are finite, you could code them up as numbers and compute over them with functions.)

The fact that you can't write a device driver in AWK, or interface with a graphics system is irrelevant from the computability theory perspective. The argument would run like this:

Allowing for finite-space limitations, the computational problems solved by the device driver or graphics system are effectively computable. 2. Unix and its device drivers and graphics systems are effectively computable - so it is possible to write a universal TuringMachine program which emulates the Unix system and its device drivers.(I said it is possible, not that it's fun!)
Write a universal TuringMachine simulator in AWK.
Run the TuringMachine simulator in AWK, and load in the Unix emulator as data.

You end up having solved the graphics, operating systems and device driver computations solely using AWK. The fact that it's not practically feasible doesn't matter.

(An alternative strategy would be to produce a universal TuringMachine to AWK compiler rather than write the emulator.) Then write a C compiler to target the UTM.

TuringComplete is about being able to solve any effectively computable mathematical problem in a language - it's not about practical software architecture issues such as interfacing with device drivers. -- some guy

GregMcFarlane? wrote a set of macros to demonstrate that vi is TuringComplete; they're distributed as part of the standard vim distribution at http://www.vim.org . AWK should be no problem; egrep probably isn't TuringComplete; Sed neither. Csh?

First, having been corrected on my misunderstanding of TuringComplete, I believe I've corrected the introductory description.

I'd say shells in general have problems - basically, shells are big EscapeHatches; most of the shell's abilities are to run other programs, and they have very limited abilities on their own.

That being said, I would propose that ksh, bash, and zsh are probably TuringComplete, and Bourne Shell is definitely not TuringComplete.

Why do you think Bourne Shell is not TuringComplete?

Sed? Around 1992 or so, KevinBraunsdorf told me about a 100-line sed script written to do arbitrary-precision integer mathematics. It wasn't pretty, it apparently got sed banned from all future obfuscated code contests, but it apparently worked. Beyond putting fear into me (I'm not sure if it was because someone could do that in sed, or that someone did do that in sed, that scared me more) it made me decide to not doubt sed's computational abilities. -- EdGrimm

Here's a C++ program you can't write in sed:

    #include <iostream>
    main() {
for(;;) { cout << "foo" << endl; }
return 0;
    }

Er, yes it can:

    :loop
    s/.*/foo/p
    b loop

-- JimPerry

I found a proof of sed's turing completeness: http://robertkotcher.com/sed.html

Look it's even shorter in sed! >:)

A Little Language is designed for a specific task, and not for throwing at any computational task that comes along. As such, it usually has no need to be TuringComplete - at least, that is what is expected. But along comes a task that lies just on the wrong side of the border of its abilities, and the temptation to extend it a little bit to handle the edge case can be irresistible. If such a LittleLanguage ends up being TuringComplete then it's pretty much by accident, and the result is virtually guaranteed to be a particularly uncomfortable TuringTarpit.

If a LittleLanguage is designed from the outset to be TuringComplete then those borders don't exist and the pressure from that direction to extend the language is eliminated; it has a better chance of retaining what conceptual purity (coherent semantic model, robustness of implementation, etc.) it started out with, even if it is subsequently extended for other reasons. It may still be a TuringTarpit in the general case, but if you're working anywhere in the vicinity of its bailiwick, at least you won't suddenly run into an impossibility or gotcha.

For an example of an "accidentally TC" LittleLanguage I nominate sendmail; and for a deliberately TC LittleLanguage, PostScript.

It bothered me that a page on Little Languages should not contain a reference to JonBentley or ProgrammingPearls, so here they are. (Actually, Bentley's column on Little Languages appeared in the second collection, MoreProgrammingPearls.)

Also, I have the feeling that the idea of LanguageAsInterface is relevant (or that, if it isn't, it ought to be).

-- CameronSmith

: "sed scripts don't need to be tokenized; they already are"

I've often read this but never known what it means. Would someone explain? -- AnonymousDonor

All sed commands are one byte long, not including arguments. -- EdGrimm

More languages than just sed have this property or a similar one. The qmail security guarantee talks about using this (and null-terminated lines) under the subject of parsing. dc is similar; there are only one and two character commands, some of which have a mandatory one character argument, and no two-character command overshadows a one character command except !.

If a little language is TuringComplete, many properties of program written in that language become undecidable: Such languages suffer from the HaltingProblem.

SQL is another well known little language that has no way to express loops nor recursion, and which is not TuringComplete.

But queries are a form of implicit looping over all records...

In fact, I have written a SQL statement that did run forever reading a table with one row. - Joshua

Yes, there is queries looping over records (you can even make it access multiple tables and do a bunch of other stuff), and you can make it to do things with it by creating a trigger and then inserting into a read-only view that has that trigger attached. They can even be made recursive. You can even do conditionals (with a WHERE clause), and a lot of other stuff. Numeric loops are missing, although a virtual table provides that, so you can use that.

Keeping a little language declarative does have some advantages such as making it possible to use the scripts for multiple purposes (code generation, sanity checks, documentation, ...)

For those interested - see a paper I wrote in 1997 titled Little Languages: Little Maintenance? (it's on my homepage), which also discusses whether a little language should be TuringComplete. -- ArieVanDeursen

Also, the Berkeley Packet Filter (BPF), can be considered as a "little language". It is also decidedly turing-incomplete.

The basic RebolLanguage book describes it as a MinimalistLanguage. Implementations have a very small footprint. Is it appropriate to think of Little Language as a logical categorization of a language while MinimalistLanguage emphasizes the physical size of the implementation? -- DavidNess

OlinShivers? has a nice take on little languages at: http://www.cc.gatech.edu/fac/Olin.Shivers/citations.html#ll. He argues, that instead of little languages, we should use a single extensible language plus a lot of libraries. Why redesign things like variables and functions over and over? Wasn't that the idea behind ToolCommandLanguage?

RegularExpressions are also TuringComplete, if they get applied in turn until none of them matches. Actually, plain text substitutions suffice - they are an instance of SemiThueGrammars?. -- PanuKalliokoski

No...as phrased this isn't right. Plain old regular expressions, as you know, are not TC. Applying them in turn doesn't help, that's just (RE)*, another RE, and strictly speaking, it doesn't even matter if they're NDFA or DFA. What you must have meant to say is that regular expression match and replace is equivalent to a SemiThueGrammar?, since then you can have arbitrary RHS and LHS on each grammar rule. -- DougMerritt

By which point regular expressions have become overkill, since search-and-replace on plain substrings is sufficient to implement a Semi-Thue grammar.

So why are LittleLanguages seen to be a good idea but FourthGenerationLanguages seen to be a bad idea? The 4GLs I have used seemed to be collections of related LittleLanguages.

4GL began as a technical term but was emptied of all meaning by marketroids back in the 1980s, so it currently tends to be considered to be marketing fluff unless proven otherwise. Nor is there a "fifth generation language" by which to contrast, despite the passage of time. Terminology trends change. "fourth generation" and "generation" in general used to be commonly applied to many things, not just languages, and now they are (mostly) not. The last usage I personally saw was Bentley calling awk a 4GL, and meaning that in a positive sense (which he described in detail).

Wasn't Prolog given the fifth generation tag by Japanese marketroids? - ScottElliott

I forgot about that one. But I meant used meaningfully; the Japanese 5th generation label was empty in general, and there was no reason to apply that label to Prolog (except that it was the official language of their 5th Gen. project). "Fourth Generation" was once in a while used in a non-empty way.

One interesting thing about little languages is that they must interface to the world. When they do, it will usually display some awkwardness. Otherwise little languages are really nice.

Take a rather libre view of the classification of little languages, C is good at managing the memory, but some part of the world is better interfaced using OO, there comes the awkwardness of writing GUI applications in C.

Awk and sed are good at parsing flat files. When you want to parse EBNF using awk and sed, you would hate yourself. Yacc is for that.

A few more examples would be good.

When we talked about little languages, there must be mentioned that a good method for these little languages to interface with to the world is just equally important. In UNIX, that is the pipe. With pipe, the little languages can stay little.

The pipe is not perfect though, for it is unable to prevent people from inventing Perl. Or we could say that awk and sed is of no use to yacc and lex. When we want to parsing EBNF with another little language, the pipe can't let us utilize the other previously existing little languages. When we want to use a functional style to parse the EBNF instead of the C used in yacc, again we can not effectively utilize the yacc and lex but have to be forcef to implement our own yet another yacc in our own favorite functional language, ie. haskell etc.. The pipe cannot help us on that.

-- ZhaoWay

For a description of a possible class of LittleLanguages in the FlowBasedProgramming context, see http://www.jpaulmorrison.com/fbp/bdl.htm. FlowBasedProgramming supports a network of "pipes", and so should be a great environment for hooking together different (types of) LittleLanguages. -- PaulMorrison

Does FurryScript count? It is domain specific, and I think for mostly a narrow domain; it can be used with other stuff but not very well. What about, DadaEngine and rmutt, which have some similar purposes (even though none of them was based on the other)?

Is MusicMacroLanguage (MML) a LittleLanguage?

See Minilanguages: Finding a Notation That Sings http://www.faqs.org/docs/artu/minilanguageschapter.html

CategoryLanguage