Sawzall is an infamous language that pretty much no one has heard of - but it is more important than you think. It is Google's current main language that they use to help mine/analyze the internet (and interact with Google File System and Map Reduce).
Sawzall borrows from C, Algol, and Pascal style syntax. It is procedural, but designed it seems for specifically commanding Google's computers to act in a parallel way (at a very high level with short precise programs).
See
for a detailed article about the language.
The "every thing is a file" view needs refreshing, and their misunderstanding of the relational model needs to be pointed out. Once again they seem to be confusing physical with logical.
SawzallLanguage works with GFS (google file system) and MapReduce
? (based on some functional ideas). But note that Sawzall itself isn't purely functional - it is rather imperative and procedural in nature.
Rob Pike (and friends) view from Unix days is that everything is just a file. That is a useful abstraction - but his file purism leads to a FraudulentMindset. By stepping into this first person view of everything being a file, he misses out on the powerful relational model view (GFS versus my rightfully suggested Google Relational Database (GRD)).
- I don't believe that the referenced paper, or the work they've done, is based on or limited by the idea that everything is a file.
- They are not necessarily talking about data over which they have control.
- Even if they do have control over the data, it may be that other uses of it are not well-suited to having it stored in a relational database.
- The techniques they are discussing work equally well on data that someone else has, such as the web as a whole. Saying that they are insufficiently brilliant for not simply using the full relational model seems to miss the point.
- Of course various concerns such as logical relational model and physical realisation (with the associated limitations) should be separated, but I suspect that Pike et al know that.
- I've read the paper again, and they are discussing ways of analysing data. They say explicitly that some data is difficult to house in a relational database, showing that they have considered it. Who can disagree? Do people think everything can appropriately be stored in a relational database? My experience suggests not.
- Housing data in a relational database is not difficult. That's what google relational database would be designed to solve - the physical limitations of today's GreenPlum/Bizgres and similar clustered solutions, just as Google File System solves the physical limitations of say Windows 3.1 or Windows 95 maximum hard drive space available. For example, if Google were to use Windows 3.1 as their operating system of choice, and SqLite as their database of choice - obviously they would have trouble fitting their data into the database and obviously the physical limitations of Windows 3.1 and SqLite would hurt them. These physical limitations are not about the relational database model being difficult, but about limitations of products that are not Google Relational Databases engineered by folks with PhD's who should know by now that GFS is a poor solution compared to GRD (Google Relational Database).
- Sometimes you can't run your database over the best part of 1000 machines. Sometimes the processing needs to have that throughput. In those cases the database solution won't work. They are discussing techniques that do work in that scenario. I've run standard relational databases over several tens of machines and the response times for simple queries have been abysmal. This despite the system being set up and configured by industry accredited experts.
- Sometimes you can't run your file system over the best part of 1000 machines. Oh wait, that's what Google File System was invented for. Sometimes you can't run your database over the best part of 1000 machines. Oh wait, that's what Google Relational Database and GreenPlum/Bizgres would hopefully be invented for, if one actually stepped out of his Unix File Shoes and looked closer.
- A proposed relational "solution" seems to be suggesting almost the same thing, but with less detail.
- Uh, except, for the fact that it is relational, and not based on "less detailed" flat "unix files". In addition, a temporary migration path is suggested, to store tabular data in files in the mean time instead of just flat file data.
- I'd love to see a relational solution written up properly. If it really is better than what Pike et al have done then there won't be a problem getting it published.
- There isn't really a point to proposing a relational solution. Files work fine - so I don't see why one would wish for a relational solution to be written up properly if there's no reason for a massively scalable relational database similar to GreenPlum/Bizgres but tweaked by employees who have PhD's. --DevilsAdvocate
I noted in my article that google employees are full of
brilliance (constructive) - but
also that Rob's view that
everything is a file needs to be criticized. Pike's articles are brilliant, but they also contain misleading statements about relational techniques and models (such as confusing
relational with
physical storage). To remain quiet about such misleading statements would be scientifically dishonest. Those not open to criticism (especially when constructive blurbs and humor are added) show signs of denial,
ClosedMind, etc.
- I don't understand this criticism. They have the data, it's in a file system, it needs to be analysed. Your proposal seems to be to run slave processes on each machine to mine the data, then send the results back to a big database. In some ways that's what they're doing, and they have a language that does it well. I'm no longer sure what your criticism is.
- In fact even flat XML files would do the job splendidly.. if they work, for data retrieval, then it'd be wise to use them. It's GoodEnough and its working, so why fix what's broken, if the broken solution works fine? --DevilsAdvocate
It is important we criticize the industry, and I find Google has some misunderstandings in their Sawzall paper. However I also constructively address their problem by offering a serious solution and complimenting Rob Pike's brilliance, along with adding some humor.
- I do not find any misunderstandings in the paper. I find innovative solutions to difficult problems. Your criticisms seem misdirected, and I would like to see how you think a relational database will help. It's not just the storage that needs to be distributed, it's the processing. The storage is trivial by comparison.
- Indeed, the Google File System storage system is a trivial implementation. Even junior high students could implement one in smalltalk or php, using just objects or associative arrays alone. Likewise, a massive relational database engineered to span across a few ten thousands of commodity PC's is trivial enough that a kindergarten student could code it in a couple of hours. In fact, the storage of thousands of tera bytes of data is trivial enough that even a tar zip file could fit it onto a single 200GB hard drive available at Best Buy. --DevilsAdvocate
- I've yet to see even one example of a tar-zip offering a 1:10k compression ratio (enough to fit just two 'thousands of terabytes' into 200 GB) - not even on contrived data.
- [Considering above italics were 99.8 percent tongue in cheek (ItalicsAreHilarious), the case is rested.]
I leave this quote:
"Sometimes we discover unpleasant truths. Whenever we do so, we are in difficulties: suppressing them is scientifically dishonest, so we must tell them, but telling them, however, will fire back on us. If the truths are sufficiently impalatable, our audience is psychically incapable of accepting them and we will be written off as totally unrealistic, hopelessly idealistic, dangerously revolutionary, foolishly gullible or what have you. (Besides that, telling such truths is a sure way of making oneself unpopular in many circles, and, as such, it is an act that, in general, is not without personal risks. Vide Galileo Galilei.....)" --E.W. Dijkstra
- I would be interested in seeing real evidence of these unpleasant truths you claim to have found. So far I see none.
- Our audience is psychically incapable of accepting the idea that "relational techniques" will work on massive data. The "physical relational products" available from a current vendor are not relational techniques (useful model and methods, which should disregard any physical storage.. a separation of concerns). Pike implies relational techniques (not products or databases available) are not suited for Google's task at hand, which is like saying that Windows 3.1 hard drive limitations don't permit google to use an operating system since operating system techniques obviously do not scale on computers, as proven by Windows 3.1. That doesn't stop google from developing, say Windows For Google Clusters and Google Relational Database. A massively scalable Google Relational Database (GRD) would scale well, just as GreenPlum scales to the needs of its customers, and just as Windows 3.1 did scale for the needs of some customers. The traditional relational techniques and relational model is not in fact a problem, as Pike suggests. In the article, Pike confuses the relational techniques (logical) with the implementation (physical). For more info read some of FabianPascal and do the homework to find out what is meant by this confusion.
See also: NotNiceEnough, MapReduce?, RobPike