Podcast: Lessons Learned From 1994 Texaco Refinery Explosion
Transcript
Welcome to Process Safety with Trish and Traci, the podcast that aims to share insights from past incidents to help avoid future events. Please subscribe to this free podcast on your favorite platform so you can continue learning with Trish and me in this series. I'm Traci Purdum, Editor-in-Chief of Chemical Processing, and joining me, as always, is Trish Kerin, the director of the IChemE Safety Centre. Hey, Trish, what have you been working on?
Trish: Hey, Traci. Well, I've just come back from a couple of weeks traveling around the UK talking to a whole lot of the Safety Centre member companies and just spreading the word on process safety.
Traci: Always spreading the word. That is your mission and life, isn't it?
Trish: Absolutely, it is.
What Happened July 24, 1994, in Milford Haven?
Traci: Well, it seems that this year there are a lot of momentous anniversaries in terms of process safety incidents. As you always point out, it's beneficial to revisit these events and learn from them. So today, we are going to be talking about the explosion and fires at the Pembroke Cracking Company plant at the Texaco Refinery in Milford Haven, Wales, which took place on July 24th of 1994.
So, we're coming up on the 30th anniversary here. This incident could have been much worse, but the site had been built and operated with high safety standards. Can you give us a little bit of background and overview of what happened that day?
Trish: Certainly, you're right. This incident could have been a lot worse. One of the reasons there were no fatalities, though, was that it was one of those lucky moments where it happened on a weekend, and there weren't that many people around. There was only a skeleton operating crew at the refinery, as opposed to all the normal staff who would have been there. Again, it's one of those many incidents where, had it happened on a different day of the week, we could have had a very different situation. But what actually happened in this instance is another example of what we refer to as a Natech, a natural hazard triggering a technological incident.
What occurred here was that a severe electrical storm developed around the facility, and that led to a series of plant upsets. The storm also produced a lightning strike, and the lightning strike caused a small fire on one of the crude distillation unit towers. That led to the crude unit being shut down, and various other units were shut down as well as a result of the electrical storm, the lightning strike and the small fire, except the fluidized catalytic cracker, the FCC, which was left to continue running during this time.
So, they shut down all the other units of the refinery but kept this one unit running. About five hours later, there was a pipe rupture and an explosion associated with the FCC, the cracking unit, and that then took several hours to burn out, because the rupture occurred in what's called the flare knockout drum, which is where you knock liquids out before vapor goes to the flare system. They had overfilled that drum, and then an elbow on the drum's outlet pipework ruptured, releasing about 20 tons of hydrocarbon liquid and vapor, which then exploded.
But because it was the flare knockout drum, the rupture took the flare system out of service as well, so they couldn't then safely flare. They literally had to leave the fires to burn out the inventory in a controlled manner. And as I said, they were lucky that it happened on a Sunday when there weren't many people around. It actually took until the evening of the Tuesday for all the fires to be fully extinguished.
Control System Failure and Alarm Flood
Traci: The Health and Safety Executive found five key issues that created the perfect storm, no pun intended here. But let's examine each of these issues and try to extract some lessons learned and how facilities can avoid similar incidents. So the first key element they found was control systems. Let's learn a little bit about that.
Trish: When the FCC unit was going through this upset, the debutanizer level control valve failed closed, except the Distributed Control System, the DCS, which is the computer-control system the operators were watching, said that it was open. And so, the operators thought the valve was open. They thought that product could continue to flow through. They didn't realize they were pumping into a closed vessel, which then overpressured.
There was a failure of the control system to tell them accurately what was going on. That was coupled with what we call alarm flood, which is when there are so many things occurring in the plant at once that the alarm screens just keep scrolling through with a multitude of alarms. When that occurs, you can't possibly pick out the really important alarms that you need to focus on and do something with, the ones that are actually going to make a difference, because there are so many pieces of data just coming at you.
The alarm flood was being caused by all the other units that were upset and shut down at the time. So, throughout this entire episode, there was a lot of conflicting and confusing information coming through the control systems, which left the operators struggling to respond in the right ways.
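As an illustration of the trap Trish describes, where a DCS shows a valve as open while the valve has actually failed closed, here is a minimal sketch of a command-versus-feedback discrepancy check. It is not taken from the Milford Haven control system; the function name, tag behavior and 5% tolerance are assumptions for the example.

```python
# Illustrative sketch only: compare the commanded valve position with an
# independent position-feedback signal and flag any discrepancy, so operators
# are not relying on the demand signal alone.
# Names and thresholds are hypothetical, not from the incident's DCS.

def check_valve_discrepancy(commanded_pct: float,
                            feedback_pct: float,
                            tolerance_pct: float = 5.0) -> str:
    """Return a status string for a control valve position check."""
    if abs(commanded_pct - feedback_pct) > tolerance_pct:
        return ("DISCREPANCY: commanded %.0f%% open but measured %.0f%% open"
                % (commanded_pct, feedback_pct))
    return "OK"

# Example: the DCS commands the valve 80% open, but the positioner reports it shut.
print(check_valve_discrepancy(80.0, 0.0))
```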
Traci: Did they rectify the control systems or is that a concern that we should still have today with those both happening at the same time?
Trish: One of the challenges we have with control systems is they're there to tell us when something goes wrong. So when something goes wrong, they've got a lot of information to tell us, and therein lies the problem. We've got so much information that needs to be communicated through to the operators, and that's why it's really important to make sure that we have engineered safety shutdown systems that are independent of the control system. If the plant gets to a certain state, then the plant safely shuts itself down and the operators then get a chance to figure out what's going on and deal with the alarms.
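To make the idea of an engineered shutdown system that acts independently of the control system a little more concrete, here is a minimal sketch. The dedicated level switch, the 90% trip point and the trip actions are all hypothetical and only stand in for the kind of hard-wired protective logic Trish is describing.

```python
# Minimal sketch of an independent protective trip: a dedicated high-level
# switch on a knockout drum closes the inlet and stops the feed pumps,
# regardless of what the DCS level indication or valve status shows.
# All values and actions are illustrative assumptions.

HIGH_LEVEL_TRIP_PCT = 90.0  # hypothetical trip point

def independent_high_level_trip(level_from_dedicated_switch_pct: float) -> dict:
    """Decide trip actions from the dedicated switch, not the DCS-displayed value."""
    tripped = level_from_dedicated_switch_pct >= HIGH_LEVEL_TRIP_PCT
    return {"close_inlet_valve": tripped, "stop_feed_pumps": tripped}

# Example: the switch reads 95%, so the trip acts even if the DCS looks normal.
print(independent_high_level_trip(95.0))
```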
One of the other challenges is that this incident happened at a time when DCSs were still being rolled out across industry. They were still quite new, and we all of a sudden had this realization that we could alarm all these different things that could go on in the plant, and so we did. We put alarms on almost everything, except not everything is important. When everything's important, nothing's important, because you can't see the difference. Since those very early days of DCSs, we've had to go back and run what are called alarm rationalization programs, where we look at all the alarms we've got programmed in and remove the ones we don't need.
An alarm should require a specific operator action, that action should need to be taken within a very short period of time, and it should prevent something very significant from happening. If you're not ticking those three boxes, it shouldn't be an alarm. It should just be an entry in the journal history, so you can go back and look at what happened after the event, but it shouldn't require acknowledgment and silencing and management of an alarm if there's no operator action to be taken.
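The three-question test Trish lays out lends itself to a simple screening exercise during alarm rationalization. The sketch below is only an illustration of that idea; the data structure, tag names and field names are made up for the example.

```python
# Illustrative screening of candidate alarms against the three-question test:
# does it require an operator action, must that action happen quickly, and
# does it prevent a significant consequence?  Anything failing the test is
# routed to the event journal instead of the alarm list.

from dataclasses import dataclass

@dataclass
class CandidateAlarm:
    tag: str
    requires_operator_action: bool
    time_critical: bool
    prevents_significant_consequence: bool

def classify(candidate: CandidateAlarm) -> str:
    if (candidate.requires_operator_action
            and candidate.time_critical
            and candidate.prevents_significant_consequence):
        return "ALARM"
    return "JOURNAL"  # record for later review, but do not demand acknowledgment

candidates = [
    CandidateAlarm("LAHH-101", True, True, True),          # keep as an alarm
    CandidateAlarm("TI-205-drift", False, False, False),   # journal only
]
for c in candidates:
    print(c.tag, classify(c))
```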
Traci: So no cry-wolf alarm should be happening.
Trish: Absolutely, and unfortunately, when we put DCSs in, it was so easy to just put an alarm on everything.
Traci: Right. Just like if you have a hammer, everything looks like a nail.
Trish: Absolutely.
Lack of Maintenance and Lack of Management of Change
Traci: The next key finding was maintenance procedures.
Trish: Yeah. So, this one is actually about how they were maintaining the plant more generally. The fact that they had a valve fail closed while the DCS read differently suggests there were potentially some maintenance issues around how the signals were coming in, but also the elbow on the flare knockout drum outlet ruptured because of corrosion, so that was the weak point. There were also issues around corrosion management programs and preventive maintenance programs to make sure the facility is fit for service and operation at all times.
Traci: What about... The next key finding was plant modification and change procedures. Big thing that we always talk about.
Trish: In this instance, they'd done a modification on the flare knock-out drum, and it was related to the automatic pump-out system, and they had not assessed the potential consequences of what could go wrong if that failed.
So, when they made that physical change, they hadn't gone through an adequate management of change process to understand what hazards had been introduced and what potential consequences could come from those hazards. Effectively, it was a complete failure of management of change, because they just didn't assess it in any adequate way, and as a result they didn't recognize the risk they had introduced.
Control Room Design Issues
Traci: And the next key topic, something we often talk about as well is control room design.
Trish: This one was about the layout of the control room and the human-machine interface that we talk about in control room design, so very much a human-factors-related aspect. Again, when you've got a DCS that is designed to just give you a flood of alarms when part of the plant is upset, you've got no hope of actually managing what's going on in that control room. That was coupled with some poor graphic design in the DCS displays, too. So, it wasn't clear what was actually going on at the time.
It was a lot harder to figure out what was happening in the plant because of how the graphics were laid out. So when we design our control rooms, we need to make sure that we've got the right human factors and HMI, Human-Machine Interface, activities understood and addressed, and then make sure that we do have the right layout within the facility so that different parts of the required systems that people need to access are easily accessible, easily operable, so that we can get optimum human performance in that control room even under a stress situation.
Traci: Are there standards for HMI interfaces and fonts? I know I've talked about font sizes before with Dave Strobhar over on the Distilled podcast, but are there standards for HMI graphics?
Trish: Yes. Yeah, there are. They have been developed, and they've changed over many, many years. There are a whole lot of different standards out there on HMI design for the graphics, but also on how we do alarms: how many alarms an operator should see in a normal operating period and how many an operator should see in a plant upset period.
So, there are actually some quite detailed metrics that you can monitor your system with to see how it's performing from a human factors perspective as well as getting the design right. And we now see a lot better design in DCSs that are being put in place today. We've learned a lot. As I said, this incident was in the early days of DCSs-
Traci: Right.
Trish: ... and I don't think we'd really understood the graphic interface as much as we do now.
Traci: We've come so far on so many things graphically speaking.
Trish: Yep.
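For readers curious about the alarm-performance metrics Trish mentions, standards such as EEMUA 191 and ISA-18.2 publish benchmarks for alarm rates. The sketch below only illustrates how such a metric might be computed; the threshold values are placeholders, not figures taken from any standard.

```python
# Illustrative calculation of a basic alarm-performance metric: average alarms
# presented to an operator per 10-minute interval, with flood intervals flagged.
# The thresholds below are placeholders for this sketch.

from collections import Counter

FLOOD_THRESHOLD_PER_10MIN = 10  # placeholder definition of an alarm flood

def alarm_rate_report(alarm_timestamps_minutes: list) -> dict:
    """Bucket alarm timestamps (in minutes) into 10-minute intervals."""
    buckets = Counter(int(t // 10) for t in alarm_timestamps_minutes)
    total_intervals = max(buckets) + 1 if buckets else 0
    flood_intervals = [b for b, n in buckets.items()
                       if n >= FLOOD_THRESHOLD_PER_10MIN]
    avg = sum(buckets.values()) / total_intervals if total_intervals else 0.0
    return {"average_per_10min": avg, "flood_intervals": flood_intervals}

# Example: a burst of alarms between minutes 50 and 55 shows up as a flood.
print(alarm_rate_report([2, 14, 50, 50.5, 51, 51, 52, 52, 53, 53, 54, 54, 55]))
```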
Traci: The last key point that they found was emergency response and spill control.
Trish: Yeah. So, this one was really about looking at their emergency operating procedures and how people were trained to respond, and the fact that the refinery was in a significant emergency response situation yet kept one of its units operating. So everything else was shut down, but they kept one unit operating, which meant they were trying to manage an emergency and, at the same time, trying to manage a steady-state operation, keeping it in parameters and keeping it safe. Those were effectively two very, very different activities going on at once, and they were not equipped to deal with both.
And in most instances, most people wouldn't be equipped to deal with those things happening at the same time, quite frankly. When you're in an emergency situation, you actually need to focus on the emergency, and that, I think, is how they ended up losing focus on what was happening in the FCC. That then resulted in a more significant issue, because they were too busy responding to the significant emergency they already had at the facility.
Making sure you've got really clear emergency response procedures that set out what you're going to do, what's going to be kept running and what's not, is essential. There may well be some units that you do need to keep running for various reasons. In some instances, it's more hazardous to shut something down than it is to keep it running, but you need some really good emergency and troubleshooting procedures to deal with those scenarios as well. We don't just need operating procedures that tell us what to do when everything's working right.
We also need operating procedures that tell us what to do when we are in an upset situation and need to do some troubleshooting, because that's the unusual time, the time we're not always experienced with. If we draw a parallel to the airline industry and look at how pilots are trained, the moment something unusual happens, there's a checklist for it, and they go through that troubleshooting checklist to determine what action they need to take. Making sure that we clearly understand those emergency response protocols is very important.
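One way to picture the pre-decided, checklist-style guidance Trish describes is a simple table of units with an agreed action for a site-wide emergency. The sketch below is purely hypothetical; the unit names, decisions and reasons are invented for illustration and are not the Milford Haven procedures.

```python
# A minimal sketch of pre-decided upset-response guidance: for each unit,
# state up front whether it should be shut down or kept running in a
# site-wide emergency, and why.  All entries are hypothetical.

UPSET_RESPONSE_PLAN = {
    "crude_distillation": {"action": "shut down",
                           "reason": "unit cannot be supervised during the emergency"},
    "fcc": {"action": "shut down",
            "reason": "steady-state operation cannot be managed alongside the emergency"},
    "sulfur_recovery": {"action": "keep running",
                        "reason": "shutting down creates a larger hazard than continued operation"},
}

def emergency_checklist(plan: dict) -> list:
    """Render the plan as an ordered checklist the shift team can walk through."""
    return [f"{unit}: {entry['action']} ({entry['reason']})"
            for unit, entry in plan.items()]

for step in emergency_checklist(UPSET_RESPONSE_PLAN):
    print(step)
```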
From a spill control perspective, obviously, there was a lot of firewater to manage, and that's always an important aspect. There was not only hydrocarbon, there was firewater, there was foam, and all of those need to be managed from an environmental perspective as well.
What They Got Right
Traci: Let's talk a little bit about what they got right.
Trish: Effectively, they responded very well to the initial incident itself, in terms of being able to shut down multiple units all at once under the electrical disruption from the storm and cope with the initial fire. That initial fire on the crude unit, caused by the lightning strike, I think they managed exceptionally well.
From that perspective, all credit to the operator. Then, as we said earlier, this could have been a much worse situation. The fact that there was no one killed in this incident was fantastic. They also were able to then rectify any damage in the refinery, and I think they were up and running again in about 18 weeks after the incident.
Traci: That's fast.
Trish: Yeah, very fast.
Traci: Is there anything that you want to add that we didn't touch on yet?
Trish: I think really, again, it always comes back to one of the perennial things I talk about all the time, and that is you need to understand your hazards, and you need to know what those consequences could be, and you need to put in place barriers or controls to stop the hazard becoming a consequence. Standard risk management.
But if you don't know what your hazard is, you're not going to be able to figure out what your consequence is, and you certainly won't be able to control it in any way. It always comes back to that hazard identification and looking at what you're going to do with it and how you can manage that adequately. That, I think, is one of the key aspects of what we do every day in process safety. We've got to get that right so we have a chance of managing it properly.
Traci: Well, Trish, thank you for always helping us understand the hazards and consequences. Unfortunate events happen all over the world, and we will be here to discuss and learn from them. Subscribe to this free podcast so you can stay on top of best practices. You can also visit us at chemicalprocessing.com for more tools and resources aimed at helping you run efficient and safe facilities. On behalf of the globe-trotting Trish, I'm Traci, and this is Process Safety with Trish and Traci.
Trish: Stay safe.