Mars Spirit Software Problem

I've noticed, over the last few weeks, that the anomaly suffered by Mars Spirit seems to be focusing on its software, specifically the apparent accumulation of "data files" in its flash memory. I've seen some references to problems related to a size mismatch between its flash memory and its RAM. Earlier, one of the services reported that the team said it had worked fine during the nine days of testing on earth, but that perhaps the problem was related to the 18 (I think that was the number) days of operation on Mars.

In other words, it sounds like a core leak (well, maybe a core/file system leak).

I wonder if anyone shares my interest in questions like the following:

What language (languages) and systems does Mars Spirit (and presumably also Mars Opportunity) run?
What is the software architecture?
What methodologies did the software team follow?
What alternatives are now available to the team to correct the anomaly?
Are there any learnings for us in this experience?

This isn't much more info, but it runs VxWorks and I believe Nasa is rated a CMM 5. More in this article: http://www.space.com/businesstechnology/technology/mer_computer_040128.html.

See this thread: http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&th=f1f58d2c78255ac6

The following excerpts that discussion:

The flight software is all written in C running on vxWorks. The ground software is a mishmash of C, C++, Fortran, Perl and Java (and maybe a few other things.) The other day in the hallway I heard a MER team member who had worked on some of the C++ ground software say, "I wish we were using Lisp." (He was a known Lisper. It definitely was not a case of a C++ programmer having an epiphany.)
Spirit rebooted itself 60 times over a period of a couple of days
Embedded systems like that usually have watchdog hardware/software. When the machine crashes/freezes the watchdog reboots the system. This probably indicates that there is a repeatable fault which has the system continually resetting, the system boots, runs for a while, crashes or detects a problem (maybe no communication with Earth), etc, etc, etc Most of the Stereo Vision and Rover Navigation Software (70k or so lines) is written in C++. There is some C and assembler code for certain types of optimization.
This document's a little old, but relevant:
- http://robotics.jpl.nasa.gov/people/mwm/visnavsw/aero.pdf
Actually, to be strictly correct, the flight code is heavily modified from the research code, and it's written in a restricted subset of C++. I don't know the details of the restrictions, but for example they don't use exceptions or namespaces (which is causing me considerable grief at the moment).
In the January issue of IEEE Computer there is an article "NASA's Mission Reliable" by Patrick Regan and Scott Hamilton on some of the efforts which went into testing the robots' software. Apparently, NASA tried hard to develop automatic testing tools for software written in Java and C++, and apparently they did a good job in catching a few concurrency-related bugs. The article is targeted toward a general audience, not too technical, more magazine/survey-style. Lisp is mentioned there once. The abstract can be found here:
- http://csdl.computer.org/comp/mags/co/2004/01/r1059abs.htm
More recent news:
- http://spaceflightnow.com/mars/mera/040126spirit.html
- http://www.nytimes.com/2004/01/27/science/space/27MARS.html suggests that it was a matter of the flash (EEROM) filesystem metadata (or maybe filesystem cache?) overflowing RAM, due to the length of time that the system had been running and the amount of data captured:
- It is now believed that the rover's flash memory had become so full of files that the craft couldn't manage all of the information stored aboard. Spirit bogged down because it didn't have enough random access memory, or RAM, to handle the current amount of files in the flash - including data recorded during its cruise from Earth to Mars and the 18 days of operations on the red planet's surface.
- And in that we realized that we had this reset problem. Based on just kind of the hunch of our lead software architect, he believed that the problem was probably associated with the mounting of flash and initialization. There is a hardware command that we can send that bypasses the software where we can actually tell the hardware to not allow us to mount flash on initialization. When we the next day actually sent the command to do that, software initialized normally and was behaving like the software that we had always known. It was a fantastic moment.
- Right now, our most likely candidate for the issue has been narrowed down a little bit. It is really an issue with the file system in flash. Essentially, the amount of space required in RAM to manage all of the files we have in flash is apparently more than we initially anticipated.
- Tomorrow, we are might try to access flash and do a little bit of a health check on it. The next day we might try to delete some files to see if our hunch is correct that it's really due to the number of files that we are trying to manage on the flash file system.
You haven't asked about hardware, but I think it's also interesting. From the "Mars Exploration Rover Landings" press kit:
The computer in each Mars Exploration Rover runs with a 32-bit Rad 6000 microprocessor, a radiation-hardened version of the PowerPC chip used in some models of Macintosh computers, operating at a speed of 20 million instructions per second. Onboard memory includes 128 megabytes of random access memory, augmented by 256 megabytes of flash memory and smaller amounts of other non-volatile memory, which allows the system to retain data even without power
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film

...And since that will probably dissuade people from reading the thread, here's the best of the humor that appeared in those 40 articles:

>> > No. If I wanted to dissuade someone from using Lisp there are much more effective tactics, like suggesting they do a google search on "lisp jobs" and compare the results to "c++ jobs" and "perl jobs", or even "python jobs".

>> Googling on ``food service jobs gives you even more hits.''

> Ive never heard of the "food service" programming language. Is it new? Could you point me to a reference?

It is the latest thing. It has some of the features from all these languages: Java, Apple Script, Fromage, HotTea?, Jam, and Korn Shell.

Its best for writing spaghetti code .... if you can stomach it.

I can't believe they didn't think of this problem. There's no methodology that replaces thinking.

Don't underestimate NASA, the difficulty of what they do is enormous. To accuse them of not thinking is rather arrogant.

They ought to send the project lead out into the field to fix it... :)

Careful, there! Some NASA guys would want that, so you'd be actively encouraging them to insert bugs so that they could go on a field trip.

Wouldn't be too bad; after all NASA is reporting that there is a StarBucks? on Mars.

http://www.antimatternews.com/2004/01/us_rover_finds_.html

These guys are amazing: they were able to debug a problem at 120 bits per second using communication windows of half an hour at a time. But I bet even they couldn't have debugged the problem using standard tech-support protocols. ;-) Oh, and I hear they finally got the broadband guy out there to install a high speed connection...

Figures. Of course they'd support Mars long before my neighborhood!

From above: The flight software is all written in C running on vxWorks.

So now I want to ask some really provocative questions:

Could the flight software have been written in Lisp or Smalltalk?
- [Possibly; though given the nature of the above bug (an OS configuration issue), I don't know if that would have fixed the problem. Given what we know of the hardware platform (low-end PowerPc; they didn't say exactly which PowerPc it was in the summary above), a good robust Smalltalk/Lisp implementation might not have been available]
Would a language like Lisp or Smalltalk, with built-in storage management, have avoided this problem? [Does the Lisp/Smalltalk come with flash filesystem drivers - drivers for flash memory devices are far different beasts than storage management systems that run out of RAM or disk.]
Would this problem be easier or harder to correct, remotely, if the flight software had been written in Lisp or Smalltalk?
- [VxWorks has good remote debugging capabilities; though when your debug interface is low-bandwidth (120 bps) and high-latency (30 minutes), debugging kinda sucks. Doubt Smalltalk or Lisp would have helped.]
How heavy would the extra RAM be if they wrote the system in Lisp or Smalltalk?

I hope that this very specific, pragmatic, and very real-world example might help to focus some of the various allegations and claims that have been made over the years on both sides of questions like this. Embedded systems running Smalltalk in real time were part of Tektronix oscilloscopes for a very long time (I don't know if they still are or not).

[They are not; the last TektronixInc scope to have SmalltalkLanguage in it was the 11k series; discontinued a while back. One reason Tek abandoned Smalltalk was that the low-end scope architecture (TDS1000, 2000, 3000 series) transitioned from the 68k architecture to the PowerPc, and a good Smalltalk implementation could not be found. High-end Tek scopes are, of course, PCs running Windows with special hardware inside. (This was a few years ago, before Squeak and others were widely available.) Lots of Smalltalkers still here at Tek though, reluctantly coding in OtherLanguages?...]

In retrospect, was CeeLanguage the best choice for the flight software of this mission?

It doesn't appear that CeeLanguage was the culprit; rather it's that the OS was misconfigured. If you use Smalltalk or Lisp on top of an RTOS (VxWorks or otherwise) you still have to configure the OS to your hardware; if you run Smalltalk or Lisp naked (can be done, though rarely done in EmbeddedSystems) likewise you still have to tailor your design to your hardware.

And the hardware design was severely constrained in many ways, for example:

Need to use rad-hardened ICs. For space missions, this should be obvious.
Weight is a premium.
Power consumption is a premium.
Ability to operate at a wide temperature range (as Mars doesn't have a nice thick atmosphere like Earth, temperatures there vary much more widely than your average air-conditioned office).

These limit the choices you can make for your CPU and other components.

Given that they went with an embedded PowerPC environment, was a good Lisp or Smalltalk implementation available - one that has FLASH filesystem drivers and the like? Probably they could have gotten on; but it would be rather unlikely that such an environment would have significant use in production environments. While CeeLanguage and VxWorks have many shortcomings; there are zillions of deployments of such out there running along quite reliably. If you use vxWorks, you're stuck with C at the bottom layer for the most part - the language is implemented in C, its command shell is a sort-of-C interpreter, and its API is written entirely in C. You can run your app in other languages besides C, of course - most languages in widespread use have a vxWorks port of some pedigree - but it is doubtful that one would (or could, without rewriting vxWorks itself) rewrite the BoardSupportPackage? (the vxWorks version of a HardwareAbstractionLayer?) in anything but C. And the BSP is where the configuration code that was in error lies.

The short answer is - if the thing were done in Smalltalk, it probably wouldn't have made it to launch without tons of extra engineering effort. And that wouldn't have necessarily fixed the problem.

I know that lisp or smalltalk or your favorite language is the best thing ever, but how do you think flash is at all related to the language?

If they don't manage their storage, which I still find very hard to believe, they don't manage it. The language doesn't matter. I wonder if the file system check is taking too long with the extra data and it is watchdogging.

See Rover Navigation 101: Autonomous Rover Navigation: http://marsrovers.jpl.nasa.gov/gallery/video/animation.html