Ytwok Embedded Systems

See, I told you so!

Actually it is all GeorgeBush's fault again: http://www.buzzflash.com/analysis/03/08/15_blackout.html

Those who follow the energy industry even superficially already know of Dubya's culpability, but what does this have to do with YtwokEmbeddedSystems?

Looks like someone's trying to suggest that the cause of the two-day US blackout is a belated rollover issue. How many seconds has it been?

Well, it was due to some fluke in the year two thousand apparently...


Historical Material:

It seems, on YtwokUpdate, that very few folk take the embedded systems threat very seriously. Many suggest that the market problems are more of a worry. Here are some hard facts (i.e. not just consultant blather) that suggest embedded systems problems are indeed a big deal. Refutations only too welcome. -- PeterMerel

Surveys:

From several months ago:

[A Quick Update: the NERC material continues to be profoundly disturbing. Happy faces appear in their summaries and graphs, but even a little analysis reveals a very different complexion. Try http://greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=000TGG for a sample.]

[Another quick update: NERC's latest says everything is just fine now, don't worry, everything has already been fixed.]

Case Studies:


The "hard facts" in the mostly-anecdotal accounts above describe the use of embedded systems, but I sure didn't see any indications of just what it is that's going to fail in those embedded systems in 2000. What specific embedded systems (broadly speaking) are date-sensitive and going to fail when the odometer rolls over? -- JimPerry


Failure Rates:


It seems to me that these still do not address the likely kinds of failures. We seem to be saying "Something could go wrong, therefore something will go badly wrong." -- RonJeffries


BIOS/RTC Problems:

Control Failures:

Plant Failure Scenarios:

DickMills' articles describe electrical generation problems due to business issues, not because of YtwokEmbeddedSystems failures. This is a RedHerring because it has nothing to do with YtwokEmbeddedSystems. I've heard similar prognostications based on the "fact" of a coming currency collapse. -- GeorgeDinwiddie

Um, see the subtitle "Plant Failure Scenarios" just above? Mills talks about what happens to the grid if generation plants fail, and what happens to the grid if large grid users fail. This section was in response to Ron asking above about what sort of things could go wrong. Now if you want specifics on particular plants that might give rise to the sort of things Mills considers, check the links under the Case Studies section above. In general I find Mills to be quite knowledgeable about electrical generation and distribution, but I see nothing in what he writes that has anything to do with any coming currency collapse - where do you see that? -- PeterMerel


Conclusions

Sorry, Peter, start providing real facts and not sky-might-fall anecdotes and maybe there'll be some interest in refuting or perhaps joining the march to tell the king. (No, I haven't read the latest batch yet, but the first several had nothing specific enough to even try to refute). -- JimPerry

I'm hitting more and more communications glitches trying to post here... My main point is that I can't refute (as you originally requested us guys to do) generalities. I'm not real concerned because nothing I've seen so far has made me concerned, including the sorts of raw statistics here. -- JimPerry

It seems to me at the moment that, having some bald conclusions below and a description of how I get to them above, the more holes you can spot in the "generalities and raw statistics" I've posted, the easier time you ought to have refuting the conclusions. Tell me survey X is biased or statistic Y is unrepresentative. Tell me there's some specific hole in the logic that needs filling. As it is now I'm not certain what sort of hard data beyond the above you'd like to see. A hypothetical example of what you might regard as really concerning would help me understand what's missing here. -- PeterMerel

Some specific examples of what to expect to fail. All "general-purpose" computers have a date function, and are subject to date-related failures, but only a relatively small subset of software does date calculations and of those only certain cases fail at Y2K. A lot fewer embedded systems even have date capabilities, and I don't have the sense that many of those have software that actually does date calculations of the sort that would fail on century rollover. OK, the Tiwai Point example shows one date-related program bug, but bugs will always be with us, and it doesn't point to a Y2K bug.

We are already getting very close to the end of the century; there are credit cards, contracts, perishables, and the like with post-00 expirations, but relatively few problems have so far occurred (some, but few, and not terribly serious). Lots of people are spending millions on tracking these down; are there no examples of the form "we found/fixed this bug that, left alone, would have caused the controllers on all Boeing aircraft built since 1972 to divert fuel from the engines to the air-conditioning systems"? -- JimPerry
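To make concrete the class of bug Jim is asking after - and which the expirations above would trigger - here's a minimal sketch (hypothetical code, not drawn from any cited system) of the classic two-digit-year comparison failing across the century boundary, plus the common "windowing" repair:

```python
# Naive expiry check on two-digit years: fails once dates straddle 2000.
def card_expired(expiry_yy, current_yy):
    return current_yy > expiry_yy

# In 1999 (current_yy = 99), a card expiring in 2001 is stored as 01,
# so the naive check wrongly reports it as already expired.
print(card_expired(expiry_yy=1, current_yy=99))   # True -- wrong

# A windowed fix: interpret 00-49 as 20xx and 50-99 as 19xx.
def to_four_digit(yy):
    return 2000 + yy if yy < 50 else 1900 + yy

def card_expired_fixed(expiry_yy, current_yy):
    return to_four_digit(current_yy) > to_four_digit(expiry_yy)

print(card_expired_fixed(expiry_yy=1, current_yy=99))  # False -- correct
```

Windowing is a patch, not a cure: any system receiving genuine pre-1950 or post-2049 dates breaks again, which is why so much of the remediation effort goes into widening stored fields instead.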

Peter, I don't understand how you derive the fifth conclusion below from the first four (with which I agree). The first four, I believe, are pretty much factual. The fifth ... well, where does it come from? -- RonJeffries

Check out http://www.foxboro.com/y2000/index.htm for the goods. Doing a little cut-and-grep, it seems that 30/580 or about 5% of their controls have problems. Of the systems based on these controls, however, 111 of 261 or 42% have problems. This is a little fuzzy because they're still studying some of the systems, but it'll do for now.

This illustrates the Beach/Oleson Pain Index (http://www.cio.com/marketing/releases/y2kchart.html) at work. Simply put, the idea is that the more interconnected a process is, the more prone it is to catastrophe. Now the systems Foxboro produce are generic; they don't represent installed control systems, which will have plenty of site-specific interconnections in addition to these generic ones. Shell says that a typical oil rig has 10,000 control points; if that's the case, then if the Beach/Oleson curves are at all reliable we have to expect the rigs, as they stand, are going down. Think of it by analogy with software; if 10% of your objects were corrupted by some bug scrawling on memory, how likely would it be that your program would happily continue?

Beyond this, apart from quoting manufacturer's black-lists, all I can do to give a sense of how widespread these things are is add case studies and surveys to the first section of this page above. In fact I've just added a couple along the lines you suggest. But I don't know how to assure you that they're representative, and I know you don't regard such anecdotes as hard data.

Still, maybe they'll help. The ITAA study describes bug problems that have already occurred. The first Australian link describes a plant that would have crashed. The Roleigh Martin link has several. The garynorth link is the only trace I still have of a chap from Texaco who reports their experience of plant failure rates in case studies. The second Australian link describes the current situation in Victoria, where a single plant failure is costing hundreds of thousands of jobs throughout several industries in two states. There are many other examples out there, but they're mostly very short on specifics and probably won't satisfy you.

I took a quick glance at the Motorola chip list. I don't have any databooks handy (I'm not doing hardware design anymore), but I recognize a number of those chips as real-time clock/calendar chips. A few are 8-bit microcontrollers that probably have RTC chips built onboard. Yes, some of these chips store dates as two digits (BCD) and may calculate leap years incorrectly in the next century. Are these chips likely to cause widespread problems? I don't think so. How many embedded systems that you know of care what the date really is? I would bet that more than 99% of these chips are being used for time calculations, not date calculations. -- GeorgeDinwiddie

Check the Foxboro, Allen-Bradley and Roleigh Martin links above before you wager. The black-lists only tell you that there are plenty of bum chips out there; the Foxboro, etc. links hint at how many are in crucial spots and demonstrate how the rates of failure are amplified in larger systems. -- PeterMerel

Well, it doesn't tell you how many of these chips are out there, just what their part numbers are.

Um, where did I say that it told you how many there are out there? For that, check the Datamation link quoted above. 70,000,000,000 chips, 90% in embedded systems, and 1-10% will fail.

The Roleigh Martin link just seems to list "Year 2000 Embedded Systems Vendors, Associations and Manufacturers."

Yes, so that you can verify that the Foxboro/AB figures are typical. If you don't want to verify that, then ignore Roleigh Martin.

The Foxboro link seems to be an advertisement for their services to do a Y2K audit of your system.

Sorry, you're right, they've changed the layout on their site. Try http://www.foxboro.com/y2000/y2000sysprod.htm for their systems and http://www.foxboro.com/y2000/y2000controlprod.htm for their controls to verify the totals I gave.

The Allen-Bradley site just lists products and what Y2K exposure they have (mostly none).

Yes, as described, typical control failure rates are 1-10%. So now read the Beach/Oleson link, the Shell link, and see what that means. Check the Foxboro links to see it in action.

In any event, how are Y2K failures going to affect industrial processes? If the failure is exhibited by the wrong date showing on the operator's console, then so what?

Um, what operator's console? Do you think we're talking about StarTrekSystems? You're looking at the guts of 20th century SCADA; these control points get scanned automatically with millisecond tolerances on a 24/7 basis. SCADA flips switches, opens and closes valves, starts and shuts processors down automatically, but without any real smarts - things have to be engineered right in the first place. They don't self-correct at the push of a button like the Starship Enterprise. The smartest ones can failover to redundant systems if they somehow figure out that their data is screwed - but if those systems are built with the same date-dependencies, that's not going to help at all.
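To illustrate why a rollover in such a scan loop is more than a cosmetic wrong date, here's a hypothetical sketch (not drawn from any cited plant) of a watchdog that computes elapsed time from two-digit-year timestamps and sees time run backwards at the stroke of 2000:

```python
# Timestamps as (yy, day_of_year), with naive two-digit-year arithmetic.
def elapsed_days(start, end):
    (y0, d0), (y1, d1) = start, end
    return (y1 - y0) * 365 + (d1 - d0)

last_scan = (99, 365)   # 1999-12-31
this_scan = (0, 1)      # 2000-01-01, stored as year 00

dt = elapsed_days(last_scan, this_scan)
print(dt)               # -36499: "time went backwards"

# Plausible firmware reaction: treat a negative interval as a sensor or
# clock fault and fail safe -- i.e., shut the process down.
print("TRIP" if dt < 0 else "OK")
```

The failsafe behaviour here is an assumption for illustration, but it captures the point: the controller doesn't display a wrong date, it acts on one.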

And by processors I don't mean CPUs - I mean tons of metal on the move, great energies, highly reactive chemicals, and all the other massive bits and pieces that get used in modern automated factories. These things are engineered to require automation because humans are neither fast enough nor reliable enough to keep them working otherwise. For some manufacturing there are prototype lines worked entirely by human operators - but they're scrapped as soon as the real production line is stable.

Still, let's say you're really lucky, and there's a human operator. The best he'll get is an alarm telling him that the line is down and he ought to call an engineer to fix it. When lines go down because of unexpected failure modes, the engineer has to spend a lot of time to figure out what's wrong. How much time? I've seen manufacturing systems where glitches cost more than a million dollars a day, hundreds of domain-expert engineers were called in to fix things, and it still took months to fix.

If these engineers are unlucky, there'll be ancillary damage from all the great big masses and energies and reactive chemicals rushing around. Check the Tiwai Point or the Victoria description (second "The Australian" link in Case Studies) to see cases where the plant failures have required great delays and very high equipment replacement costs.

Sure it's a failure, but the consequences are not very serious. Do you know of ANY industrial processes that are date dependent? What failure modes are you fearing? -- GeorgeDinwiddie

Read the Case Study links above to see typical fearful failure modes. I'd suspect that, at least as often as not, the typical failure modes wouldn't be very serious in themselves - just time-consuming. But such delays are amplified by JIT, cyclic dependency, and hubbing, which are staple strategies of all modern industries. 100,000 of my countrymen were laid off in the last 5 days because of just such amplification of a single plant failure, so don't tell me that the same thing happening all over the civilized world all on the same morning is "not very serious". Well, unless you can explain why the Victorian experience is atypical.

Still, the most disturbing thing for me is that these are all the facts I've been able to find. Most of what's out there is either feel-good legal boilerplate, chicken-little paranoia, or "y2k consultants" trying to turn a buck. Facts are rare as hens' teeth. Still, tell me what kind of stuff you'd regard as "hard" and I'll be happy to see what I can find. Some pointed questions on your part would be really useful too. So far what I take away from the above strongly suggests:

  1. 1-10% of all chips are date-dependent in some fashion
  2. About 10%, or maybe a little more, of control points based on these chips have rollover problems. An unknown number have dilation problems.
  3. The installed base for these things is vast, and it's highly labor-intensive to find and replace them.
  4. The utilities are waaay behind on this work.
  5. If the work isn't completed the results will be severely disruptive to the systems that rely on the failing controls.

Quite apart from the pointed questions I'd like, if you can find any hard facts - hell, any technical material at all - suggesting that any one of these conclusions is mistaken, I'd be very much obliged if you'd present it here. In quite a lot of wrestling with search engines in the last month I've found nothing to balance what's above. -- PeterMerel

CouldIsntWill.


On a related note: I was just speaking with a neighbor who works in IS with a North Carolina utility about the Y2k work they're doing. NC's electricity regulation board did an interesting thing a couple of years ago - they let the utilities know that they would not be allowed to pass any of the cost of Y2k-related shutdowns to their customers. As a result, the utilities suddenly became highly motivated to fix the problems themselves, and it now appears that all of the NC utilities will be ready. This has been one of the few instances I have seen where government intervention in Y2k has had a positive effect.

KyleBrown

Indeed, it seems like NC will make it if anyone can - check http://www.state.nc.us/Pubstaff/psy2k/y2kintre.htm. Accountability is an excellent motivation. But conversely I'm beginning to think that until insurance companies take vulnerabilities like cyclic dependency, JIT, and hubbing into account when they price their premiums, there will be very little motivation for industry to balance efficiency against reliability. The Congressional, NERC and Minnesota PUC studies indicate great sloth and ignorance throughout most PUCs up until very recently - and terribly inadequate investment now that they are becoming aware. -- PeterMerel


CategoryRant (?) EmbeddedSystem


Last edited April 6, 2004