Longford – Why efficiency is the enemy of safety
(Source: Meredith Doig)
On 25 September 1998, at the Longford Gas Plant in Victoria, Australia, a gas absorber burst, releasing a large cloud of hydrocarbon vapours. Seconds later, it exploded. Two staff were killed and 4 million people were deprived of gas for about two weeks.
Longford is owned by Exxon’s subsidiary, Esso Australia. Esso placed the entire blame on the control panel operator, Jim Ward, who, it said, had ignored alarms and failed to follow procedure. The official inquiry blamed Esso for failing to ensure that Ward and his supervisors were aware of the hazards and knew the procedures for dealing with them.
The inquiry recommended the adoption of what is known as a “safety case” system, in which all potential hazards are identified, their likelihood of occurring is calculated, and steps for mitigating them are specified along with procedures for dealing with them. A government authority to monitor compliance with the system is to be established.
The safety case system was developed in Britain after the explosion of the Piper Alpha oil platform in the North Sea in 1988. However, the latest thinking about what makes complex workplaces reliable suggests that the approach is incorrect. It also suggests that the official inquiry failed to understand what caused the explosion.
The concept of mindfulness
Professor Karl Weick of the University of Michigan is one of the world’s leading
organisational scholars. He has made a special study of what are termed high
reliability organisations, such as nuclear power stations, aircraft
carriers, air-traffic control rooms and hydrocarbon plants.
“The problem is most people think of reliability as repeatability, and it
does have that component. But there is an additional thing, which is that
you need some flexibility and responsiveness to pick up things that haven’t
happened before - the anomalies emerging.”
He says it is of the essence of complex systems such as hydrocarbon plants
or nuclear power stations that they always retain a capacity to produce
novel or surprising events. The idea that you can develop a procedure for
every conceivable hazardous event is not realistic and even if it were, it
would be dangerous.
“You see this in simulations at nuclear power plants. People often stop and
say, ‘where have I seen this problem before?’ When they take the time to
think and consult the manuals, the events leading towards disaster are
continuing to happen. You lose a critical essence of time.”
Weick says that no matter how much training you do, when a problem occurs,
people will be left wondering which training applied to which problem. He
notes that it is a truism in air traffic control rooms that the first thing
you must do when you start work is to forget all your training.
Weick argues that rather than training, the key to reliability is
developing a culture of “mindfulness”, in which staff are constantly wary of
the dangers, sensitive to new and surprising occurrences and have sufficient
resources to deal with emergencies when they arise.
Efficiency drive
There were many aspects of the work environment at Longford which ensured
that this did not happen. The most important of these was the drive for
efficiency. This had led to the reduction of the number of supervisors at
the gas plant from four to one. It had led to the withdrawal of all the
engineers from the plant back to the head office in Melbourne. It had led to
a practice of the single manning of the control panel. Maintenance had been
slashed, with the result that breakdowns were occurring more often and that
equipment was out of order for longer. The list of maintenance jobs to be
completed had blown out from 500 to 3500 jobs.
It is extraordinary that such a panel should be operated by just one person,
responsible both for paying attention to the detail and taking initial
responsibility for any troubleshooting. Esso noted that unions had been
happy to accept that additional troubleshooting responsibility in the
course of an enterprise agreement.
Weick argues that the principles of efficiency and reliability are
contradictory. Efficiency looks for standardisation and predictable
solutions. It seeks to minimise costs. Reliability requires duplicating
resources and sensitising people to deal with unpredictability.
“There are real limits to any human comprehension. The worry with only one
plant operator is that he might forget to check what he has not done. With
only one operator, it would be hard to check your judgement. The usual
division of labour that makes it work is that some members of the team are
working with detail, others are there to stand back and get the overall
view.”
The workload on the panel operator was extraordinary. Jim Ward
testified that on an average day, he had to attend to an absolute minimum of
300 or 400 alarms being signalled on his control panels. These alarms might
be because pressures, temperatures or flow rates were too high or too low in
different parts of the system, or they might signal malfunctions in
equipment. The highest number of alarms ever recorded on a day was 8800, or
one every six seconds. When an alarm signalled, a light came on and a buzzer
sounded. The operator had to acknowledge the alarm by pressing a button,
which turned the buzzer off. The light remained on until the problem was
fixed. Some problems could be fixed at the control panel, while others had
to be fixed by operators in the field. Some alarms signalled important
abnormalities, while others were petty. However, the panel did not
differentiate. In practice, many alarms were allowed to remain untended for
hours.
In addition to dealing with alarms, the panel operator also had to authorise
maintenance crews, which involved dealing with up to 90 permits a day. The
operator also had to liaise with both supervisors and operators in the field. Ward
said that on the morning of the disaster, he would have fielded between two
and three dozen telephone calls. He noted that whenever the supervisor was
away from his desk, the telephone defaulted to the operator.
In such an environment, there is a limit both to the quantity and,
crucially, the quality of attention that can be given to any single problem.
The buzzers, the lights, the telephone calls, the radio contact with
operators in the field, the maintenance staff coming and going and the noise
of the plant presented an overwhelming assault on the senses.
Weick says that disasters commonly have multiple causes. This was true of
Longford, where there was an oil leak, pumps had broken down and there was a
build-up of condensate in the apparatus that ultimately failed. There was
also a completely unrelated shut-down in part of the plant which required
operator attention throughout the morning. Weick points to a principle of
requisite variety which says you have to match the variety in the things
going on in a system with the human resources to sense what is happening. He
says that in high reliability organisations, there needs to be a
preoccupation with failure and a sensitivity to the new and the unusual. The
reality in the Longford control room was that there were so many things
going wrong all the time that failure became an accepted part of the
routine.
This was so much the case that a significant abnormality in the process,
a piece of equipment that normally operated at high temperatures
falling to more than 20 degrees below zero, was not even mentioned at the
handover meeting between shifts.
Poor communications
Poor communication was characteristic of the organisation. The supervisor
of the plant, Bill Visser, reported that since assuming sole responsibility
for running the plant, he had become completely removed from day-to-day
operations and was mostly concerned with administration. The production
coordinator at the plant, Michael Shepard, who became involved in
managing the crisis as it evolved during the morning, did not explain what he
was doing or why to the panel operator, Ward. “Mike doesn’t generally speak
too much,” Ward commented. The final explosion occurred when, over a
crackling radio, Shepard instructed Ward to open valve TC3. Ward heard him
say PC3.
The isolation of the control room operator was intensified by a culture in
which people were assumed to be capable of handling problems by themselves.
So Ward did not feel compelled to seek the help of his supervisor, and
Shepard reported to management in Melbourne that “everything was under
control”, when it clearly was not.
Weick says that when people are presented with something they do not
understand, their instinct is to create a platform of meaning in order to
make sense of the situation. When people are trained for reliability, the
first tendencies they learn are crucial, because they are likely to reappear
when the pressure increases.
As the crisis gathered at Longford, Ward’s response was one of denial that
it was anything out of the ordinary. Although he was aware that his direct
supervisor, Bill Visser, was showing signs of stress and agitation and he
could see Shepard trying to reassure Melbourne that everything was under
control, he said his belief up to the point of the explosion was that “this
was a normal day”. This was evident when, in the moments after the explosion
and fire, which took place just 20 metres from Ward, he found himself
wondering whether he should hit the emergency shutdown button. “I was, for
some strange reason, worried about the impact on production”.
The sense of unreality was exacerbated by a tendency to confuse the control
panel with the plant itself. Several times during the inquiry, Ward said he
could not answer questions without reference to a drawing of the plant. Esso,
for example, said that it was unrealistic to suggest that Ward would have
reacted differently had he appreciated the danger of explosion. “He knew he
had to react to the alarms that confronted him and to fix the problems such
alarms represented. Had he monitored his process as he was trained to
do ... he would not have to concern himself about any such danger.”
Weick says that just as nurses commit medical errors when they forget that
the chart is not the patient, control panel operators commit mistakes when
they forget that the dial is not the technology. In a paper to be published
later this year in the journal Research in Organizational Behavior, Weick,
together with Kathleen Sutcliffe and David Obstfeld, analyses the processes
that contribute to a state of mindfulness within organisations.
Characteristics of high reliability
Although their discussion relates to high reliability organisations, they
note that the cognitive processes that are important have a relevance to
other organisations as well. Certainly, there are many complex computer
systems which, though they do not have the same life and death quality as a
nuclear power system, share similar characteristics of complexity and the
need for reliability.
They point out that the first characteristic of such organisations is a
preoccupation with failure. High reliability organisations encourage and
reward the self-reporting of errors; the authors cite the
case of a seaman on a nuclear aircraft carrier who reported losing a tool.
All aircraft were forced to find terrestrial bases, until the tool was
found. The next day, the seaman was commended for his actions at a formal
deck ceremony.
A second characteristic is a reluctance to simplify interpretations of
events. Simplification limits the precautions people take and the number of
undesired consequences they envision, and thus increases the likelihood of
eventual surprise. It is important to recognise the complexity of systems
and to provide sufficient back-up and resources to come to terms with it.
There needs to be sensitivity to the operations, so that new and unexpected
developments are recognised. High reliability organisations need to have
duplication of resources, with some people looking at detail and others
trying to keep track of the big picture. Effective organisations are very
sensitive to pressures of overload and ensure that backup is available.
Although it is good to be trained to anticipate as many things that might go
wrong as possible, it is also extremely important that staff have the
resilience to interpret things they have not seen or dealt with before.
Finally, the structure of work should not be over-specified. The moment that
safety becomes a matter of routine, attention is dulled. This is Weick’s
problem with the standards-based response to safety, such as that
recommended by the Longford Royal Commission.
Weick says the problem with a safety case system is that people feel that
once they have completed the documentation and filled in all the forms,
safety is thereby guaranteed. “Doing this monitoring really occurs
independently of whether you’re safe or not. That kind of documentation
doesn’t tend to have any effect on making people attentive. There is always
the danger that believing a safety system is ‘in place’ can contribute to
complacency. Safety systems have to be constantly renegotiated and
re-enacted for the system to remain reliable. Executives can’t mistake the
paper documentation for actual practices and you can never take things for
granted. The trick is to remain continuously mindful of what is going on.”
Rather than mouthing slogans about how important safety is, Weick says it is
more important to build an intolerance of mistakes and to provide the
organisation with the resources to make sure that they do not happen.
References
Originally published in The Manager Online Magazine, June 1999 [now ceased]
Two recently published books provide useful analyses and background to
the Longford tragedy:
Lessons from Longford: The Esso Gas Plant Explosion and
Lessons from Longford: The Trial
