Thursday, January 11, 2007

Software Glitch Loses Another Spacecraft

You remember that story a few years back, how the Mars Climate Orbiter ended up burning up in the Martian atmosphere? Turns out the human error was a failure to convert from English units to metric: the ground software reported thruster impulse in pound-force seconds, while the navigation software expected newton-seconds. Whoops!
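To see how something like that slips through, here's a toy sketch of my own (made-up numbers, nothing to do with the actual flight or ground software) of one routine handing pound-force seconds to another that assumes it's getting newton-seconds:

    # Toy illustration of the Mars Climate Orbiter mix-up (not real code):
    # one module reports thruster impulse in pound-force seconds, another
    # silently assumes the number is already in newton-seconds.

    LBF_S_TO_N_S = 4.448222  # 1 pound-force second = 4.448222 newton-seconds

    def thruster_impulse_lbf_s():
        """Ground software reports an impulse of 10 lbf*s (made-up value)."""
        return 10.0

    def velocity_change(impulse_n_s, mass_kg):
        """Delta-v in m/s, assuming the impulse is in newton-seconds."""
        return impulse_n_s / mass_kg

    mass_kg = 338.0  # roughly the orbiter's dry mass; illustrative only

    # Bug: the raw lbf*s figure is passed straight through...
    wrong = velocity_change(thruster_impulse_lbf_s(), mass_kg)

    # ...so every maneuver is under-counted by a factor of about 4.45.
    right = velocity_change(thruster_impulse_lbf_s() * LBF_S_TO_N_S, mass_kg)

    print(wrong, right)  # about 0.030 vs 0.132 m/s -- tiny per burn, fatal over months

Each individual number looks plausible, which is exactly why nobody noticed until the trajectory was already off.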

So, NASA (or rather, Lockheed Martin) has done it again.

NASA - Mars Global Surveyor

NASA Decides That A Software Error Doomed The Mars Global Surveyor Spacecraft
Panel Will Study Mars Global Surveyor Events

The short version is that an update to the spacecraft's software in June 2006 wrote incorrect parameters into its memory. When a routine solar array adjustment was commanded in November 2006, the spacecraft went into 'safe mode' in an orientation that left one of its batteries in direct sunlight; the battery overheated and failed.
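None of that is exotic to guard against, either. Here's a purely hypothetical sketch (made-up parameter names, addresses, and limits; I have no idea what the real ground system looks like) of the kind of sanity check that could reject a parameter update aimed at the wrong address before it ever left the ground:

    # Hypothetical ground-side guard (nothing like the real MGS software):
    # check an uploaded parameter against a table of known addresses and
    # allowed ranges before committing it to the spacecraft.

    # Pretend parameter table: name -> (address, allowed min, allowed max)
    PARAM_TABLE = {
        "solar_array_soft_stop_deg": (0x0A10, 0.0, 90.0),
        "hga_gimbal_limit_deg":      (0x0A14, 0.0, 180.0),
    }

    def validate_update(name, address, value):
        """Reject updates that target the wrong address or an out-of-range value."""
        expected_addr, lo, hi = PARAM_TABLE[name]
        if address != expected_addr:
            raise ValueError(f"{name}: address {address:#x} does not match table ({expected_addr:#x})")
        if not lo <= value <= hi:
            raise ValueError(f"{name}: value {value} outside [{lo}, {hi}]")
        return True

    # An update aimed at the wrong address gets rejected on the ground,
    # instead of silently overwriting a neighboring parameter in memory.
    try:
        validate_update("solar_array_soft_stop_deg", 0x0A14, 45.0)
    except ValueError as err:
        print("rejected:", err)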

An article I read on NASA's mounting problems in the late '90s said something interesting about NASA contracts: a government contract can be done quickly, correctly, or cheaply, but you only get two of the three on any one project. When quick and cheap trump correctness, you get stupid things like botched unit conversions and solar panels getting stuck.

This should never have happened. And now we've lost a key mission in our space exploration endeavors. Yes, the spacecraft lasted longer than its original mission specified, but it might have lasted a lot longer. QA should have caught this.

I realize that everyone makes mistakes, and that QA can't catch every bug, but one would like to think American engineering and ingenuity could manage not to half-ass something as important as this.

4 comments:

Aerin said...

As a QA person - I have to say - sometimes we do catch the errors! But they go into production/space anyway.

Management (NASA in this case) decides it's worth the risk to not fix the software bug or doesn't understand its impact.

Also, there's always a discussion about where in the process an error was introduced. Some errors/problems should be discovered long before they even reach QA.

Diane Lowe said...

Agreed. I think this is what happened to Space Shuttle Columbia (management running the risk of something bad happening).

And errors should be discovered long before reaching QA, but the reality is that QA is the last bastion of hope for some of them.

don said...

I think this illustrates the importance of modeling quality systems on the "prevention system" and not the "detection system." It has been shown that once a mistake has been made, it is very hard to catch it in inspection, even when it's obvious.

don said...

I have to add that my uncle, now retired, was an electrical engineer for a contractor working on launches. He told me after the shuttle blew up that, as good as the components are, they aren't quite reliable enough to guarantee nothing will go wrong, considering how many systems and parts there are.

I knew another guy who was in the launch business. They had Six Sigma in place, and still rockets blew up.

And I myself have to address corrective action requests: what will you do to ensure this problem never happens again? Inspection is not an acceptable answer.

You can change the process, but not all change is for the better.