Cascade Failures

by Scott Alderucci


Abstract

Many systems rely on each other so heavily that the full extent to which one system's breakdown will affect another is often unknown. This interdependency allows a failure in one system to affect others down the line. Such breakdowns have threatened some of our largest interconnected systems over the years. Failures in a single system or line have shut down entire power grids. Single corrupt bits of data have caused the collapse of programs and systems around the world. The threat of 'Y2K' before the year 2000 was an example of how our interdependent systems were thought to be susceptible to complete shutdown from a single error. Cascade failures are often prevented through routine testing of certain systems to ensure that other systems in parallel will be able to handle the loss.


What is a Cascade Failure?

A sequence of failures in which one failure causes the next is known as a cascade failure. The idea is that a breakdown starts with a single initiating event, such as a leaky pipe or a computer software failure. This event, in turn, creates a series of other events (such as a broken valve or water shorting out a piece of electrical equipment). Each of those events has the potential to create a further series of events. A cascading sequence of events, branching from the single initiating event, continues throughout the interrelated systems. The full branching tree represents the sum of all possible breakdown models stemming from a single initiating event.
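The branching described above can be sketched as a walk over a map of which events trigger which others. This is only an illustration: the event names and the TRIGGERS table are hypothetical, loosely based on the leaky-pipe example, and a real analysis would attach probabilities to each branch.

```python
from collections import deque

# Hypothetical map: each event lists the events it can trigger directly.
TRIGGERS = {
    "leaky pipe": ["broken valve", "water on electrical panel"],
    "broken valve": ["loss of cooling water"],
    "water on electrical panel": ["short circuit"],
    "short circuit": ["pump shutdown"],
    "loss of cooling water": ["pump shutdown"],
    "pump shutdown": [],
}

def cascade(initiating_event, triggers):
    """Return every event reachable from the initiating event, breadth first."""
    seen = [initiating_event]
    queue = deque([initiating_event])
    while queue:
        event = queue.popleft()
        for consequence in triggers.get(event, []):
            if consequence not in seen:  # count each event once
                seen.append(consequence)
                queue.append(consequence)
    return seen

print(cascade("leaky pipe", TRIGGERS))
```

Starting from "leaky pipe", the walk reaches all six events in the table, while starting from "pump shutdown" it reaches nothing further.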


Power System and Telephone Outages

There have been several large-scale breakdowns in our power systems in recent times. These breakdowns have typically been triggered by small events that led to a cascade of failures and eventually brought down large portions of the network. The high complexity and interconnectedness of these networks, designed to improve their performance, have led to an unexpected sensitivity to small disturbances. The high degree of connectivity also makes it possible for small failures to propagate and lead to massive outages. In 1965, cities from Toronto to New York were without power because of a single relay failure that cascaded through all the others in the system [10]. "With one line down the rest of the lines had to pick up the load, another line started to become overloaded and its relay tripped, then another line and another." [Cole]
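The relay-by-relay collapse in that quotation can be imitated with a toy model. The sketch below assumes, purely for illustration, that a tripped line's load is shared equally among the surviving lines and that a relay trips as soon as a line exceeds its capacity; real power flow follows the physics of the network, and the line names and numbers are invented.

```python
def simulate_overload_cascade(loads, capacities, first_failure):
    """Trip one line, share its load equally among survivors, repeat.

    Returns the order in which lines tripped.
    """
    loads = dict(loads)          # work on a copy
    tripped = []
    pending = [first_failure]
    while pending:
        line = pending.pop(0)
        if line not in loads:    # already out of service
            continue
        shed = loads.pop(line)
        tripped.append(line)
        if not loads:            # nothing left to carry the load
            break
        share = shed / len(loads)
        for name in loads:
            loads[name] += share
        # Any surviving line now above its capacity trips next.
        for name in loads:
            if loads[name] > capacities[name] and name not in pending:
                pending.append(name)
    return tripped

loads = {"A": 60, "B": 60, "C": 60, "D": 60}
tight = {"A": 100, "B": 75, "C": 95, "D": 130}
roomy = {"A": 100, "B": 100, "C": 100, "D": 100}

print(simulate_overload_cascade(loads, tight, "A"))
print(simulate_overload_cascade(loads, roomy, "A"))
```

With the tight capacities, losing line A takes down B, then C, then D. With the roomier capacities the same initial failure is absorbed and only A is lost.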

The same cascading effect is seen in the telephone system, where "telephone switching systems must be repaired within 1.5 seconds or the circuit failure errors passing through the network will cause a propagating positive feedback which may deadlock more of the network" [Cohen] [3]. The phone network relies on a complex series of relays and circuits through which problems, when they do arise, can cascade. MCI WorldCom experienced what CBOT president Thomas Donovan called a "meltdown of the network," which cascaded through over 250 workstations across the globe and in turn shut down a global electronic trading system. "The network problem was also having a cascade effect on other providers that interface to the MCI WorldCom network as part of peering agreements." [Gerwig and Semilof] [8]


Coding Errors and Y2K

Some cascade failures result not from a physical break in a system or the failure of a part but from coding anomalies. Programs rely on many lines of code that control how they handle data. An error caused by anything from an input mistake or a system oversight to a directed attack by a virus can crash an entire application and, in some cases, every other application that relies on it to function [1]. For Internet applications, for example, a news story could generate a level of traffic that overloads the server, causing cascade failures and leaving the system vulnerable to attack [4]. This potentially disastrous problem can be prevented using security software that "maps out the ripple effect that one failure has on other applications." [Vriesinga] [2]

In the late 1990s, on the eve of the "Y2K bug," there was a scare that a single error could topple an entire culture. In our interdependent world, the relationships between businesses, industry and government are like a spider's web. For the whole web to remain intact, every individual node must be attached at all times. The fear that the failure of one node under 'Y2K' could break down the whole web brought the potential of cascade failures to public light [6]. People feared a collapse of systems that were unable to cope when their internal two-digit year clocks rolled over to zero. Some foresaw a cascade effect in which nodes of society such as power plants and financial institutions would cease to function, and the failures would ripple through other systems, disrupting air traffic control, water supply stations, traffic lights, and so on. The fear greatly overshadowed the true potential of the threat, but drastic measures were nonetheless taken to ensure that a systemic collapse would not occur from a single oversight [7].
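A small sketch shows why a two-digit year rolling over to zero was worrying. The first function does arithmetic directly on two-digit years. The second expands each year through a pivot window, a commonly used remediation technique that is shown here only as an illustration and is not drawn from the sources above; the function names and the pivot value of 50 are assumptions.

```python
def years_elapsed_two_digit(start_yy, end_yy):
    """Naive interval arithmetic directly on two-digit years."""
    return end_yy - start_yy

def years_elapsed_windowed(start_yy, end_yy, pivot=50):
    """Same arithmetic after expanding each year with a pivot window.

    Years at or above the (hypothetical) pivot are read as 19xx,
    years below it as 20xx.
    """
    def expand(yy):
        return 1900 + yy if yy >= pivot else 2000 + yy
    return expand(end_yy) - expand(start_yy)

# An account opened in '99 and checked in '00:
print(years_elapsed_two_digit(99, 0))   # -99
print(years_elapsed_windowed(99, 0))    # 1
```

Any downstream calculation fed the negative interval, such as interest or an expiry check, inherits the error, which is how a single oversight could ripple outward.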


Conclusion

Cascade failures have caused major damage to some critical systems of our society. Such failures have led to a need for redundancy and backups. Too often we depend on one critical system working, whether it is written code or a single bridge to an island. Cascade failures are often prevented through routine testing of certain systems to ensure that parallel systems will be able to handle the loss. Another response to this sensitivity is to add more sophisticated control strategies. This can decrease the potential for one problem to bring about others; however, it also adds to the complexity of the system and introduces further potential for human error [9]. "Some technologies are so unsafe that 'human error' is actually an inevitable consequence of the system, especially in complex and 'tightly coupled' systems where cascade failures may occur and may be impossible to foresee or to even track." [Perrow]
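The routine test mentioned above, checking that parallel systems can handle the loss of one of their number, can be sketched as a simple capacity check. The server names, loads and capacities below are invented, and the check looks only at total spare capacity; it says nothing about how load would actually be routed.

```python
def single_loss_failures(loads, capacities):
    """Return the units whose loss the remaining units could not absorb.

    An empty list means the group passes the single-loss test.
    """
    total_load = sum(loads.values())
    failures = []
    for lost in loads:
        # Capacity still available once the unit named `lost` is out of service.
        spare = sum(cap for name, cap in capacities.items() if name != lost)
        if spare < total_load:
            failures.append(lost)
    return failures

loads = {"web1": 40, "web2": 40, "web3": 40}
print(single_loss_failures(loads, {"web1": 70, "web2": 70, "web3": 70}))
print(single_loss_failures(loads, {"web1": 55, "web2": 55, "web3": 55}))
```

With 70 units of capacity each, any two servers can carry the full load of 120 and the list is empty. With 55 each, losing any one server leaves only 110 units of capacity, so all three are reported.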


References:

1. Garfinkel, Simson L. 50 Ways to Crash the Net. 18 August 1997.

2. Vriesinga, Roland. Spinnaker Software. www.scguild.com/Resume/1839R.html

3. Cohen, Fred. Protection and Security on the Information Superhighway, On-Line. 1995-7. Chapter 4.

4. Crittenton, Bryan C. Restoring Cyber Security. Vistronix Inc. www.stsc.hill.af.mil/CrossTalk/2000/jan/crittenton.asp

5. Borland, John. Net blackout marks Web's Achilles heel. c/net. June 6, 2001. news.com.com/2100-1033-267943.html?legacy=cnet

6. Ripples: The Cascade Effect for PR Purposes. NEWSWEEK. April 21, 1999. www.garynorth.com/y2k/detail_.cfm/4474

7. Schwartau, Winn. Graceful Degradation: A Fresh Look at Fixin' Y2K. www.infowar.com/chezwinn/articles092899/LookingAtY2KDifferently.shtml

8. Sweeny, Terry and Moozakis, Chuck. MCI Frame Net Melts Down. InternetWeek. August 12, 1999. www.techweb.com/wire/story/TWB19990812S0013

9. Perrow, Charles. Normal accidents: living with high-risk technologies. Princeton University Press, 1984, pp. 62-80.

10. Cole, Michael J. Food For thought Fault Tolerance. Math News, Volume 85, Issue 4. March 2, 2001. www.mathnews.uwaterloo.ca/Issues/mn8504/food4thought.php