Ytwok Embedded Systems

See, I told you so!

Actually it is all GeorgeBush's fault again: http://www.buzzflash.com/analysis/03/08/15_blackout.html

Those who follow the energy industry even superficially already know of Dubya's culpability, but what does this have to do with YtwokEmbeddedSystems?

Looks like someone's trying to suggest that the cause of the two-day US blackout is a belated rollover issue. How many seconds has it been?

Well, it was due to some fluke in the year two thousand apparently...


Historical Material:

It seems, on YtwokUpdate, that very few folk take the embedded systems threat very seriously. Many suggest that the market problems are more of a worry. Here are some hard facts (i.e. not just consultant blather) that suggest embedded systems problems are indeed a big deal. Refutations only too welcome. -- PeterMerel

Surveys:

From several months ago:

[A Quick Update: the NERC material continues to be profoundly disturbing. Happy faces appear in their summaries and graphs, but even a little analysis reveals a very different complexion. Try http://greenspun.com/bboard/q-and-a-fetch-msg.tcl?msg_id=000TGG for a sample.]

[Another quick update: NERC's latest says everything is just fine now, don't worry, everything has already been fixed.]

Case Studies:


The "hard facts" in the mostly-anecdotal accounts above describe the use of embedded systems, but I sure didn't see any indications of just what it is that's going to fail in those embedded systems in 2000. What specific embedded systems (broadly speaking) are date-sensitive and going to fail when the odometer rolls over? -- JimPerry


Failure Rates:


It seems to me that these still do not address the likely kinds of failures. We seem to be saying "Something could go wrong, therefore something will go badly wrong." -- RonJeffries


BIOS/RTC Problems:

Control Failures:

Plant Failure Scenarios:

DickMills' articles describe electrical generation problems due to business issues, not because of YtwokEmbeddedSystems failures. This is a RedHerring because it has nothing to do with YtwokEmbeddedSystems. I've heard similar prognostications based on the "fact" of a coming currency collapse. -- GeorgeDinwiddie

Um, see the subtitle "Plant Failure Scenarios" just above? Mills talks about what happens to the grid if generation plants fail, and what happens to the grid if large grid users fail. This section was in response to Ron asking above about what sort of things could go wrong. Now if you want specifics on particular plants that might give rise to the sort of things Mills considers, check the links under the Case Studies section above. In general I find Mills to be quite knowledgeable about electrical generation and distribution, but I see nothing in what he writes that has anything to do with any coming currency collapse - where do you see that? -- PeterMerel


Conclusions

Sorry, Peter, start providing real facts and not sky-might-fall anecdotes and maybe there'll be some interest in refuting or perhaps joining the march to tell the king. (No, I haven't read the latest batch yet, but the first several had nothing specific enough to even try to refute). -- JimPerry

I'm hitting more and more communications glitches trying to post here... My main point is that I can't refute (as you originally requested us guys to do) generalities. I'm not real concerned because nothing I've seen so far has made me concerned, including the sorts of raw statistics here. -- JimPerry

It seems to me at the moment that, having some bald conclusions below and a description of how I get to them above, the more holes you can spot in the "generalities and raw statistics" I've posted, the easier time you ought to have refuting the conclusions. Tell me survey X is biased or statistic Y is unrepresentative. Tell me there's some specific hole in the logic that needs filling. As it is now I'm not certain what sort of hard data beyond the above you'd like to see. A hypothetical example of what you might regard as really concerning would help me understand what's missing here. -- PeterMerel

Some specific examples of what to expect to fail. All "general-purpose" computers have a date function, and are subject to date-related failures, but only a relatively small subset of software does date calculations and of those only certain cases fail at Y2K. A lot fewer embedded systems even have date capabilities, and I don't have the sense that many of those have software that actually does date calculations of the sort that would fail on century rollover. OK, the Tiwai Point example shows one date-related program bug, but bugs will always be with us, and it doesn't point to a Y2K bug.

We are already getting very close to the end of the century; there are credit cards, contracts, perishables, and the like with post-00 expirations, but relatively few problems have so far occurred (some, but few, and not terribly serious). Lots of people are spending millions on tracking these down; are there no examples of the form "we found/fixed this bug that, left alone, would have caused the controllers on all Boeing aircraft built since 1972 to divert fuel from the engines to the air-conditioning systems"? -- JimPerry
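To make concrete the class of bug Jim is asking after - and which the expirations above would trigger - here's a minimal sketch (hypothetical code, not drawn from any cited system) of the classic two-digit-year comparison failing across the century boundary, plus the common "windowing" repair:

```python
# Naive expiry check on two-digit years: fails once dates straddle 2000.
def card_expired(expiry_yy, current_yy):
    return current_yy > expiry_yy

# In 1999 (current_yy = 99), a card expiring in 2001 is stored as 01,
# so the naive check wrongly reports it as already expired.
print(card_expired(expiry_yy=1, current_yy=99))   # True -- wrong

# A windowed fix: interpret 00-49 as 20xx and 50-99 as 19xx.
def to_four_digit(yy):
    return 2000 + yy if yy < 50 else 1900 + yy

def card_expired_fixed(expiry_yy, current_yy):
    return to_four_digit(current_yy) > to_four_digit(expiry_yy)

print(card_expired_fixed(expiry_yy=1, current_yy=99))  # False -- correct
```

Windowing is a patch, not a cure: any system receiving genuine pre-1950 or post-2049 dates breaks again, which is why so much of the remediation effort goes into widening stored fields instead.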

Peter, I don't understand how you derive the fifth conclusion below from the first four (with which I agree). The first four, I believe, are pretty much factual. The fifth ... well, where does it come from? -- RonJeffries

Check out http://www.foxboro.com/y2000/index.htm for the goods. Doing a little cut-and-grep, it seems that 30/580 or about 5% of their controls have problems. Of the systems based on these controls, however, 111 of 261 or 42% have problems. This is a little fuzzy because they're still studying some of the systems, but it'll do for now.

This illustrates the Beach/Oleson Pain Index (http://www.cio.com/marketing/releases/y2kchart.html) at work. Simply put, the idea is that the more interconnected a process is, the more prone it is to catastrophe. Now the systems Foxboro produce are generic; they don't represent installed control systems, which will have plenty of site-specific interconnections in addition to these generic ones. Shell says that a typical oil rig has 10,000 control points; if that's the case, then if the Beach/Oleson curves are at all reliable we have to expect the rigs, as they stand, are going down. Think of it by analogy with software; if 10% of your objects were corrupted by some bug scrawling on memory, how likely would it be that your program would happily continue?

Beyond this, apart from quoting manufacturer's black-lists, all I can do to give a sense of how widespread these things are is add case studies and surveys to the first section of this page above. In fact I've just added a couple along the lines you suggest. But I don't know how to assure you that they're representative, and I know you don't regard such anecdotes as hard data.

Still, maybe they'll help. The ITAA study describes bug problems that have already occurred. The first Australian link describes a plant that would have crashed. The Roleigh Martin link has several. The garynorth link is the only trace I still have of a chap from Texaco who reports their experience of plant failure rates in case studies. The second Australian link describes the current situation in Victoria, where a single plant failure is costing hundreds of thousands of jobs throughout several industries in two states. There are many other examples out there, but they're mostly very short on specifics and probably won't satisfy you.

I took a quick glance at the Motorola chip list. I don't have any databooks handy (I'm not doing hardware design anymore), but I recognize a number of those chips as real-time clock/calendar chips. A few are 8-bit microcontrollers that probably have RTC chips built onboard. Yes, some of these chips store dates as two digits (BCD) and may calculate leap years incorrectly in the next century. Are these chips likely to cause widespread problems? I don't think so. How many embedded systems that you know of care what the date really is? I would bet that more than 99% of these chips are being used for time calculations, not date calculations. -- GeorgeDinwiddie

Check the Foxboro, Allen-Bradley and Roleigh Martin links above before you wager. The black-lists only tell you that there are plenty of bum chips out there; the Foxboro, etc. links hint at how many are in crucial spots and demonstrate how the rates of failure are amplified in larger systems. -- PeterMerel

Well, it doesn't tell you how many of these chips are out there, just what their part numbers are.

Um, where did I say that it told you how many there are out there? For that, check the Datamation link quoted above. 70,000,000,000 chips, 90% in embedded systems, and 1-10% will fail.

The Roleigh Martin link just seems to list "Year 2000 Embedded Systems Vendors, Associations and Manufacturers."

Yes, so that you can verify that the Foxboro/AB figures are typical. If you don't want to verify that, then ignore Roleigh Martin.

The Foxboro link seems to be an advertisement for their services to do a Y2K audit of your system.

Sorry, you're right, they've changed the layout on their site. Try http://www.foxboro.com/y2000/y2000sysprod.htm for their systems and http://www.foxboro.com/y2000/y2000controlprod.htm for their controls to verify the totals I gave.

The Allen-Bradley site just lists products and what Y2K exposure they have (mostly none).

Yes, as described, typical control failure rates are 1-10%. So now read the Beach/Oleson link, the Shell link, and see what that means. Check the Foxboro links to see it in action.

In any event, how are Y2K failures going to affect industrial processes? If the failure is exhibited by the wrong date showing on the operator's console, then so what?

Um, what operator's console? Do you think we're talking about StarTrekSystems? You're looking at the guts of 20th century SCADA; these control points get scanned automatically with millisecond tolerances on a 24/7 basis. SCADA flips switches, opens and closes valves, starts and shuts processors down automatically, but without any real smarts - things have to be engineered right in the first place. They don't self-correct at the push of a button like the Starship Enterprise. The smartest ones can failover to redundant systems if they somehow figure out that their data is screwed - but if those systems are built with the same date-dependencies, that's not going to help at all.
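To illustrate why a rollover in such a scan loop is more than a cosmetic wrong date, here's a hypothetical sketch (not drawn from any cited plant) of a watchdog that computes elapsed time from two-digit-year timestamps and sees time run backwards at the stroke of 2000:

```python
# Timestamps as (yy, day_of_year), with naive two-digit-year arithmetic.
def elapsed_days(start, end):
    (y0, d0), (y1, d1) = start, end
    return (y1 - y0) * 365 + (d1 - d0)

last_scan = (99, 365)   # 1999-12-31
this_scan = (0, 1)      # 2000-01-01, stored as year 00

dt = elapsed_days(last_scan, this_scan)
print(dt)               # -36499: "time went backwards"

# Plausible firmware reaction: treat a negative interval as a sensor or
# clock fault and fail safe -- i.e., shut the process down.
print("TRIP" if dt < 0 else "OK")
```

The failsafe behaviour here is an assumption for illustration, but it captures the point: the controller doesn't display a wrong date, it acts on one.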

And by processors I don't mean CPUs - I mean tons of metal on the move, great energies, highly reactive chemicals, and all the other massive bits and pieces that get used in modern automated factories. These things are engineered to require automation because humans are neither fast enough nor reliable enough to keep them working otherwise. For some manufacturing there are prototype lines worked entirely by human operators - but they're scrapped as soon as the real production line is stable.

Still, let's say you're really lucky, and there's a human operator. The best he'll get is an alarm telling him that the line is down and he ought to call an engineer to fix it. When lines go down because of unexpected failure modes, the engineer has to spend a lot of time to figure out what's wrong. How much time? I've seen manufacturing systems where glitches cost more than a million dollars a day, hundreds of domain-expert engineers were called in to fix things, and it still took months to fix.

If these engineers are unlucky, there'll be ancillary damage from all the great big masses and energies and reactive chemicals rushing around. Check the Tiwai Point or the Victoria description (second "The Australian" link in Case Studies) to see cases where the plant failures have required great delays and very high equipment replacement costs.

Sure it's a failure, but the consequences are not very serious. Do you know of ANY industrial processes that are date dependent? What failure modes are you fearing? -- GeorgeDinwiddie

Read the Case Study links above to see typical fearful failure modes. I'd suspect that, at least as often as not, the typical failure modes wouldn't be very serious in themselves - just time-consuming. But such delays are amplified by JIT, cyclic dependency, and hubbing, which are staple strategies of all modern industries. 100,000 of my countrymen were laid off in the last 5 days because of just such amplification of a single plant failure, so don't tell me that the same thing happening all over the civilized world all on the same morning is "not very serious". Well, unless you can explain why the Victorian experience is atypical.

Still, the most disturbing thing for me is that these are all the facts I've been able to find. Most of what's out there is either feel-good legal boilerplate, chicken-little paranoia, or "y2k consultants" trying to turn a buck. Facts are rare as hens' teeth. Still, tell me what kind of stuff you'd regard as "hard" and I'll be happy to see what I can find. Some pointed questions on your part would be really useful too. So far what I take away from the above strongly suggests:

  1. 1-10% of all chips are date-dependent in some fashion
  2. About 10%, or maybe a little more, of control points based on these chips have rollover problems. An unknown number have dilation problems.
  3. The installed base for these things is vast, and it's highly labor-intensive to find and replace them.
  4. The utilities are waaay behind on this work.
  5. If the work isn't completed the results will be severely disruptive to the systems that rely on the failing controls.

Quite apart from the pointed questions I'd like, if you can find any hard facts - hell, any technical material at all - suggesting that any one of these conclusions is mistaken, I'd be very much obliged if you'd present it here. In quite a lot of wrestling with search engines in the last month I've found nothing to balance what's above. -- PeterMerel

CouldIsntWill.


On a related note: I was just speaking with a neighbor who works in IS with a North Carolina utility about the Y2k work they're doing. NC's electricity regulation board did an interesting thing a couple of years ago - they let the utilities know that they would not be allowed to pass any of the cost of Y2k-related shutdowns to their customers. As a result, the utilities suddenly became highly motivated to fix the problems themselves, and it now appears that all of the NC utilities will be ready. This has been one of the few instances I have seen where government intervention in Y2k has had a positive effect.

KyleBrown

Indeed, it seems like NC will make it if anyone can - check http://www.state.nc.us/Pubstaff/psy2k/y2kintre.htm. Accountability is an excellent motivation. But conversely I'm beginning to think that until insurance companies take vulnerabilities like cyclic dependency, JIT, and hubbing into account when they price their premiums, there will be very little motivation for industry to balance efficiency against reliability. The Congressional, NERC and Minnesota PUC studies indicate great sloth and ignorance throughout most PUCs up until very recently - and terribly inadequate investment now that they are becoming aware. -- PeterMerel


CategoryRant (?) EmbeddedSystem


Last edited April 6, 2004