How To Avoid Alarm Rationalization Mistakes

In this episode we focus on alarm rationalization mistakes, emphasizing the importance of addressing underlying problems rather than just managing the steady-state alarm rate. The discussion highlights the need for a clear operator action for each alarm, uniqueness in alarms, and the impact of alarm floods on human factors. The conversation explores challenges, such as changes in processing equipment and the significance of data analytics in improving alarm systems.

Transcript

Traci: Welcome to the operator training edition of Chemical Processing's Distilled Podcast. This podcast and its transcript can be found at chemicalprocessing.com. I'm Traci Purdum, Editor-In-Chief of Chemical Processing, and joining me to talk about all things operator training is Dave Strobhar, founder and principal human factors engineer for Beville Engineering.
Dave is also the founder of the Center for Operator Performance.
Hey, Dave! Welcome to December.

Dave: Thanks, Traci. Yeah, getting ready for winter to start hitting us all.

Traci: Yes, it will rear its ugly head soon enough. Although I do like the first snowfall especially if I don't have to drive in it, it's real pretty, but then I get tired of it.

Dave: December snow's great. It's that February and March snow that gets a little tiring.

Traci: Exactly. Yeah, get tired of it after a while.

Well, today we are going to talk about alarm rationalization mistakes. And I guess, out the gate, what are some of the causes of these mistakes?

Dave: Well, probably the biggest cause is a focus that people have on the steady-state alarm rate. So, how many alarms per hour does your console operator get? And both of the current primary alarm standards, the ISA 18.2 Alarm Management for the process industries and the European document on alarm management, have numbers that you should be targeting, about six alarms per hour. However, that's really just an indication of overall alarm system health. But it has become the focus where people are manipulating their alarm system to get their numbers down below that six alarms per hour.
And so, this would be akin to your having a fever and going into your doctor or urgent care or wherever. And they'll say, "Oh, well, we'll give you something to bring that fever down." And you say, "Well, okay, then what are you going to do about the underlying cause?" And they say, "Oh, well, that's not the problem. The problem is you have a fever, and we'll just get that fever down and send you on your way."
So, the big mistake is that focus on that particular number. And as long as that number is below the target level, they think everything is fine. And yet, they've done nothing to address some of the underlying problems that resulted in that being particularly high prior to their intervention. So, the big mistake in alarm management is that people are treating the symptom and not the cause. And then they're surprised later on when, of course, the cause rears its ugly head again, and they find that their situation hasn't significantly improved.

Traci: So, how can you ensure that they are consistently treating the underlying problem rather than just the symptoms? And how can they better identify critical alarms?

Dave: Well, the key issue is fundamental in alarm management, in alarm rationalization: an alarm is to prompt a unique operator action. And so, part of the rationalization process is to essentially apply that very simple criteria. I mean, that's not hard, is it, Traci? Prompt a unique operator action. You evaluate each alarm against that very simple criteria, but they aren't actually enforcing it. That simple statement has two major parts to it.
First, there's an operator action: the operator needs to be able to do something in response to this alarm, or needs to do something. And all too often, you find people that will have alarms. And you'll ask an operator, "What do you do when this alarm goes off?" They say, "I don't necessarily do anything." So, the question is, well, why is it an alarm? And it's because somebody thought it should be an alarm.
And the other is that it be unique. And so, when you look at it, and you see 10 alarms that are all going to generate the same action, then you should say, well, why do we have 10 alarms doing what one alarm could do? If they apply that very simple criteria, that cures just a multitude of problems that you would have. And if you do that and do it well, then the consistency comes across because I can judge whether an action has occurred. I can judge whether it's unique. And therefore, make sure that, yes, that is a case where an alarm is warranted.
And when we're talking about critical alarms, oftentimes critical alarms will be added independent of that criteria. So, somebody will add a critical alarm and they'll say, "Well, that's really because that's a really bad situation." It's like, yeah, but the operator can't do anything about it. You've got this critical alarm, and all it's telling them is, wow, you are really in trouble now, but there's nothing you can do about it, so hang on. And that's where the critical alarms have become disassociated from the basic criteria for why you have alarms.

Traci: Well, you bring up a good question about alarm floods and nuisance alarms and all these things going off at once, and it all just becomes static. It becomes background noise that they can't even do anything about. How should this be dealt with?

Dave: Well, if you look at the criteria, nuisance alarms are there because there's no action. And so, if they've gone through and really applied it, they should go away. Alarm floods often occur because, under that circumstance of an upset or some sort of a problem, you're really in a new state. You're no longer in steady-state. You're in some sort of upset. And alarms that may have been valid under steady-state no longer apply.
And the classic example is, if I have some sort of trip from a safety instrumented system, well, it's going to shut down a lot of things because that's what it's supposed to do. It's going to take a lot of energy out of the process to bring the process to a safer state. Well, all those alarms that were associated with low energy, like low temperatures and low pressures and low flows, well, they're all going to come in because they should be coming in.
And that's one of the areas that has the potential for great improvement in performance. Research from the Center for Operator Performance shows that in some cases you get about a 30% improvement in performance by suppressing those alarms when you would expect them. So, if the unit trips, don't give me the alarms that you would say, well, of course, that alarm's going to come in. I'm going to get low fuel gas pressure because the safety system closed the valves on the fuel gas going to the heater. So, take those out. Don't give me things that under this circumstance, there's nothing I'm going to do. So, again, it gets back to that operator action. There's no action because now it's expected in that particular condition. So, going through and addressing those does wonders for the alarm flood.
The other thing in the alarm flood is it will typically show you cases where really the action is the same. And we've reviewed a wide range of operator actions during upsets. And as you're looking down and you're seeing that they're getting these alarms associated with ... I may have a heater with multiple passes through it, I'm going to have an eight pass heater and I'm getting eight alarms indicating that there's low flow or low temperature through that heater. Well, what's the response to any one of those alarms? Well, it's essentially the same action at this point. And so, I don't need eight alarms to tell me that I have low flow or low temperature. I really only need one. And then that one should probably be suppressed in this particular case.
So, applying that same criteria really helps with both the nuisance alarm problem and the alarm flooding problem. And it is really the alarm flooding problem that has the potential for an operator making a mistake.
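
A minimal Python sketch of the two ideas Dave describes here: suppress alarms that are expected once a trip has occurred, and keep only one alarm per operator action. The tag names, the trip name, and the EXPECTED_AFTER_TRIP mapping are hypothetical and only illustrate the logic. This is not how any particular control system implements suppression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    tag: str     # hypothetical tag name
    action: str  # the operator action this alarm is meant to prompt


# Hypothetical mapping: alarms that are expected once a given trip is active.
EXPECTED_AFTER_TRIP = {
    "HEATER_TRIP": {"FUEL_GAS_PRESSURE_LOW", "PASS_1_FLOW_LOW", "PASS_2_FLOW_LOW"},
}


def alarms_to_show(active_alarms, active_trips):
    """Drop alarms expected in the current state, then keep one per action."""
    expected = set()
    for trip in active_trips:
        expected |= EXPECTED_AFTER_TRIP.get(trip, set())

    shown, seen_actions = [], set()
    for alarm in active_alarms:
        if alarm.tag in expected:
            continue  # expected after the trip, so there is no operator action
        if alarm.action in seen_actions:
            continue  # another alarm already prompts this same action
        seen_actions.add(alarm.action)
        shown.append(alarm)
    return shown


# Two passes of a multi-pass heater share one action; fuel gas has its own.
passes = [Alarm(f"PASS_{n}_FLOW_LOW", "raise heater feed flow") for n in (1, 2)]
fuel = Alarm("FUEL_GAS_PRESSURE_LOW", "check fuel gas supply")

print(alarms_to_show(passes + [fuel], active_trips=[]))
print(alarms_to_show(passes + [fuel], active_trips=["HEATER_TRIP"]))
```

In steady state the two pass alarms collapse into one because they prompt the same action. With the hypothetical heater trip active, all three are expected and none is shown.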

Traci: Bringing up the point of operators and human factors, what sort of human factors should be taken into account with all of this?

Dave: So, the problem is it has to do with one of us being able to detect that something's wrong. One of the first steps when you look at the human information processing is we have to know that something is wrong and what that something is. In order to detect that something is wrong, it needs to stand out from that background noise. So, the human factors comes into that. Can I detect that? Yes, something's going wrong. What's going wrong, and where is it? So, when I have that alarm flood, if you get above 25 alarms in five minutes, so five alarms per minute, you're really above what any human being is going to be able to respond to. And yet, I've seen 60 alarms per minute. So, people are way above what any human being could expect to detect and process that information. If I actually want them to take action on it, it's an even lower number than that. If you get above two to three alarms per minute, you're now into a range where even experts have a hard time dealing with that rate.
So, that's the human factor problem that I'm going to overwhelm my ability to process the information because the human information processing system is a very capacity-limited system. We can only deal with about seven things at a time in our active memory. And if you go above that, either you're going to have to ignore the additional information or you have to get rid of some of the things you were dealing with before. So, this capacity limitation in our information processing system is the real human factor that is coming into play here as far as alarm management. And we have to be low enough to be able to deal with that.
The standards have addressed it by allowing a certain accommodation for alarm floods. You should not be in an alarm flood, which I think they define as more than 10 alarms per minute for more than 10 minutes. Well, that really was an accommodation to what they thought was a practical matter. They didn't see that they could get the alarm rate down in upset situations, so they just provided an accommodation. But it's saying for that 10 minutes, your alarm system is useless. Your operators can't process the alarms at that rate. And so, ask yourself, well, what can go wrong in 10 minutes that could be very, very bad for the plant? And are you okay with that? Because your alarm system isn't going to tell you. And that's where we see problems.
One of the plants we worked on ... Well, in a lot of them, you'll have an independent malfunction. So, a plant was responding to a loss of raw water. So, they didn't have any water coming into the plant. And everybody's shutting down. And the alarms are flooding in. And so, the alarm system is useless. And in the midst of that, an instrument air dryer switches incorrectly. And so, they miss the alarms that the instrument air pressure is dropping down. And they don't notice it until the valves quit responding to the control commands because the air pressure's too low. And so, that happened within a few minutes.
So, giving plants a break for this 10-minute ... okay, well, you're okay for 10 minutes. Well, you're taking a risk with that because within 10 minutes something can go wrong and the operator's not going to detect it. And why are you assuming that even in 10 minutes they can recover? It's a very risky approach that people are taking.
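
The rates Dave quotes can be checked against an alarm log with a short Python sketch like the one below. It assumes the log is simply a list of alarm timestamps in seconds. The two limits are the figures from the conversation (25 alarms in five minutes to detect, two to three per minute to act), not values taken from a standard, and the function names are made up for illustration.

```python
def max_alarms_in_window(times_s, window_s):
    """Largest number of alarms that fall inside any window of window_s seconds."""
    times = sorted(times_s)
    best = start = 0
    for end, t in enumerate(times):
        # Slide the left edge until the window spans less than window_s seconds.
        while t - times[start] >= window_s:
            start += 1
        best = max(best, end - start + 1)
    return best


DETECT_LIMIT_PER_5_MIN = 25  # figure quoted in the conversation
ACT_LIMIT_PER_MIN = 3        # upper end of "two to three alarms per minute"


def overload_report(times_s):
    """Compare peak alarm rates in a log against the two quoted limits."""
    peak_5_min = max_alarms_in_window(times_s, 300)
    peak_1_min = max_alarms_in_window(times_s, 60)
    return {
        "peak_5_min": peak_5_min,
        "peak_1_min": peak_1_min,
        "above_detection_limit": peak_5_min > DETECT_LIMIT_PER_5_MIN,
        "above_action_limit": peak_1_min > ACT_LIMIT_PER_MIN,
    }


steady = [0, 600, 1200, 1800, 2400, 3000]  # six alarms spread over an hour
upset = list(range(60))                    # one alarm per second for a minute

print(overload_report(steady))
print(overload_report(upset))
```

The steady-state log passes both checks, while the 60-alarms-per-minute upset exceeds both. That is the point made above: an hourly average can look healthy while the upset rate is far beyond what an operator can process.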

Traci: And dealing with seven things. I don't know if I can deal with seven things. Are there certain people that are not cut out for this type of work?

Dave: Oh, absolutely. That's why we're seeing more and more plants that are increasing the specification, or specialization, I guess would be the better word, for their console operators because not everybody can handle the stress associated with these alarm floods that are coming in. There is a real need for ... because you have to act quickly that the person sitting at the console be able to make independent decisions and judgements. Some people are not good at deciding to shut down the plant on their own without asking anybody and saying, "No, this is unsafe. I'm going to shut it down." That's sort of an attribute that different people have as far as their ability or willingness to take those kinds of actions.
Surprisingly, one of the other criteria that you want in a console operator is tolerance to boredom, and there's actually a test that can measure it. So, if you're bored, what do you do? Well, some people just tune out. Other people find ways to stay engaged. So, because you have this job that entails long periods that are essentially boring, you don't have a lot going on, but you do have these periods of intense activity that are safety-critical, you want to make sure that during those long periods, they don't tune out. And so, you can actually measure an individual's ability to tolerate boredom so that when the upset occurs, they're ready to jump in and manage it, as opposed to they've tuned out for so long that they're starting from a dead stop to get up to speed on, well, what's happening? What do I need to do? Where do I go?

Traci: Yeah, throw down the comic book and get back into action.

Dave: If only it were a comic book. Now, it's turn off your YouTube on your cell phone. Yeah.

Traci: Right, exactly. What about changes in processing equipment?

Dave: Well, so, a lot of times, people will add equipment in. And it's interesting that oftentimes a lot of the alarm problems come from that. So, you've gone through and you've created a reasonably good alarm system. Now, you're going to add some additional towers or processing equipment to your area. Well, the people that are responsible for that, they justifiably are very proud of what they're doing and want to make sure it's done right. But what they often do, they will then over-alarm that equipment in relation to the rest of the process as a whole. So, whatever is the most critical thing on that equipment becomes their highest priority alarms. Well, the most critical thing on that equipment might be kind of a medium priority in relation to everything else. So, the new equipment comes in and distorts the alarm system.
There was an incident that brought down a major Gulf Coast refinery. A 300,000-barrel-a-day refinery came crashing down in large part due to a situation where at this particular refinery, they started to burn a new fuel in their boilers. And they had a decent alarm system. They're bringing in this new fuel that they're going to burn. And with this came all sorts of alarms. And nobody bothered to question what it was because the project brought them in and they just added it on. So, when there was an upset, all these alarms start ringing. And during the ensuing alarm flood that occurred, the operator failed to notice that they had lost the boiler feed water to all the boilers. And so, eventually, this plant came crashing down with no steam, no instrument air, and the root cause was this addition to the unit. And with that addition came all these alarms that then undid all the work they had done because they were the source of the flood.
And by and large, we'll get back to our original comment, a lot of it, there wasn't much that the operator could do. So, it violated that basic operator action criteria. So, these were all environmental alarms. And so, needless to say, people say, "Oh, it's an environmental alarm. It's important." But for a lot of it, there was no action the operator could take to prevent it. It had to do with the fuel that was being burned and that sort of thing. So, all these alarms get added in. They violate why you should have an alarm. And then when there is an upset unrelated to that, they obscure the underlying cause and prevent the operator from detecting that not only are they losing boiler feed water, but they're losing instrument air at the same time.

Traci: What's the best way to test alarms so that you don't get yourself into that scenario?

Dave: Well, so, again, as you go through alarm rationalization, the idea is to document sort of those requirements. And the test really becomes adherence to some key characteristics when you're documenting it. So, the standards say that when I specify an action, I want to specify some observable action. In other words, I want to be able to say, I want the operator to start or stop something. I want them to open or close something, raise or lower something. Not just, well, the operator needs to be aware, the operator needs to think about it.
And we've seen that a lot where they will try to circumvent the alarm rationalization by, yeah, they'll fill out the form and there'll be something in the, well, what's the operator supposed to do? And it says, well, increased awareness of this parameter. Well, that's not an objective action. I can't see it. I can't see awareness. And so, they let that go through. And that's how the alarm then gets in. Because when you ask, well, what is it they're actually going to do? Well, the answer is they're not going to do anything. And so, it shouldn't be an alarm. So, the key is to adhere to that criteria of, yes, it is going to be an action.
And then another part of the rationalization is you ask, well, what's the consequence if that action is not taken or is not successful? And again, the standards say, well, tell us the next action. What's the next consequence? What's going to occur next? And too often, they go to some sort of ultimate consequence where, well, if they don't do this and then this happens and this happens, and this happens, well then I'm going to have a loss of containment. But the standards are quite clear. No, what's going to happen next? And oftentimes, it's another alarm. But if you go to that ultimate consequence, then everything's going to skew high. You're going to get all these "critical alarms" that really aren't critical because I've got other alarms that are going to come in behind it to tell me that I've got a problem.
So, I may have a low flow alarm. Oh, that's critical because it's going to lead to these bad things. It's like, well, before those bad things happen, if this flow goes low, then this temperature's going to go high and I've got an alarm on that temperature. So, the next thing, the consequence of missing the low flow is not this loss of containment or catastrophe that's going to happen. What's going to happen is I'm going to get an alarm on high temperature. That's the one you need to worry about.
And so, if you adhere to what the standards say and actually apply them ... and that's the problem, people circumvent it because their emotions get in the way. That's probably one of the real human factors we should be talking about. Pride and emotions tend to drive some of these alarms in their selection. If you actually apply the criteria, it'll take care of so many of the problems that we've been talking about.
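
As a rough illustration of the "observable action" test Dave describes, the Python sketch below screens rationalization entries for one of the verbs he lists (start, stop, open, close, raise, lower). The record layout, tag names, and function names are hypothetical. A keyword screen like this only flags entries for a human to challenge; it cannot judge whether an action is actually adequate.

```python
# Verbs Dave gives as examples of actions you can actually observe.
OBSERVABLE_VERBS = {"start", "stop", "open", "close", "raise", "lower"}


def has_observable_action(action_text):
    """True if the documented response contains one of the observable verbs."""
    words = (w.strip(".,;:!?") for w in action_text.lower().split())
    return any(w in OBSERVABLE_VERBS for w in words)


def entries_to_challenge(records):
    """Return tags whose documented action names nothing observable.

    records: dict mapping an alarm tag to its documented operator action.
    """
    return [tag for tag, action in records.items()
            if not has_observable_action(action)]


# Hypothetical rationalization entries.
records = {
    "FI-101 LOW": "Open the bypass valve",
    "TI-205 HIGH": "Increased awareness of this parameter",
}

print(entries_to_challenge(records))
```

The second entry is flagged because "increased awareness" is not something anyone can observe the operator doing.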

Traci: Now, would data analytics help with that pride and emotion type of situation?

Dave: It does. And so, a lot of, as I said, the focus became steady-state alarm rate. How many alarms per hour? And so, people were gathering those metrics and that's great. But the real problem in where the errors are going to occur is during the upsets. And that's where the data analytics can come in and you can look at both the rates at which the alarms are coming in. And you can say, wow, we're allowing you to have, for this 10-minute period, to have over 10 alarms a minute. Well, you were having 60 alarms per minute. This is way above the allowable. Not just you've touched the threshold.
And using an analysis of those and then going in and looking for patterns that you can use to reduce the number of alarms so that, well, what happens is whenever I get this alarm, I also get these five at the exact same time. So, why are we giving them six alarms when one would do? And so, analyzing your upsets really becomes the source of improvement because that's where the error's going to happen. I've got an alarm flood. I miss critical alarms and bad things happen. Well, okay, I've had some upset and I didn't miss anything, but let's go in and let's look at this and analyze that alarm system and find out is it really giving them the information we want. Can we get it down to within that seven chunks of information that they can deal with and not overwhelm their information processing system?
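
The pattern search Dave mentions (whenever I get this alarm, I also get these others at the same time) can be sketched in Python as a simple co-occurrence count. This assumes the alarm journal is a list of (timestamp in seconds, tag) pairs. The function name, the one-second window, and the tag names are illustrative assumptions, not part of any analytics product.

```python
from collections import Counter, defaultdict


def companion_alarms(events, window_s=1.0, min_share=1.0):
    """Find tag pairs (a, b) where b appears within window_s of a.

    events: list of (timestamp_seconds, tag).
    Returns {(a, b): share of a's occurrences that had b nearby},
    keeping only pairs whose share is at least min_share.
    """
    events = sorted(events)
    occurrences = Counter(tag for _, tag in events)
    together = defaultdict(int)

    for i, (t, tag) in enumerate(events):
        nearby = set()
        j = i - 1
        while j >= 0 and t - events[j][0] <= window_s:  # look backwards
            nearby.add(events[j][1])
            j -= 1
        j = i + 1
        while j < len(events) and events[j][0] - t <= window_s:  # look forwards
            nearby.add(events[j][1])
            j += 1
        nearby.discard(tag)
        for other in nearby:
            together[(tag, other)] += 1

    shares = {pair: n / occurrences[pair[0]] for pair, n in together.items()}
    return {pair: s for pair, s in shares.items() if s >= min_share}


# Hypothetical journal: A and B always arrive together, C stands alone.
journal = [(0, "A"), (0.5, "B"), (100, "A"), (100.2, "B"), (200, "C")]
print(companion_alarms(journal))
```

Pairs that always arrive together are candidates for the question Dave raises, namely whether one alarm would do. The analysis only finds candidates; deciding which alarm to keep is still a rationalization decision.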

Traci: Dave, is there anything you want to add on this topic? We've covered so much. You make operator training exciting. Is there anything you want to add?

Dave: Well, I think we have been over a decade since the alarm standards were issued. I am on the ISA 18.2 committee. Hopefully, we can address some of these things, and the guidance was provided with the best of intentions. Now, we have some feedback and experience. Hopefully, we can revisit some of these mistakes that are going on and correct them in future revisions so that people aren't spending tremendous amounts of time and resources to then not accomplish the goal of providing the operator a clear indication that you have a problem. Here's what it is. Here's what you need to do.

Traci: As always, I appreciate your thoughts on this and the feast or famine, the pride and emotions, dealing with so many alarms at once and being able to test for boredom. All these things come into play and it's just amazing how well you talk about it. And I always feel like a fly on the wall with all this information you're throwing at us.

Folks, if you want to stay on top of operator training and performance, subscribe to this free podcast via your favorite podcast platform to learn best practices and keen insight. You can also visit us at chemicalprocessing.com for more tools and resources aimed at helping you achieve success.
On behalf of Dave, I'm Traci. And this is Chemical Processing's Distilled Podcast, operator training edition. Thanks for listening. And thanks again, Dave.

Dave: Thanks, Traci. Always a pleasure.