January 1, 2009
Reliability vs. Availability
Clearing Up Misconceptions
By Philimar Menard, Q&R Consulting
There seems to be a huge misunderstanding of reliability and availability. The cable industry spends tens of billions of dollars yearly to maintain an availability of 99.98 percent or better. But why? The information that follows comes from a year-long study of the subject.
Figure 1 is a comparative plot of the HFC network. The outside plant is the most vital sector for a cable operator. It is the pipe that aggregates the upstream and downstream services and transmits them back and forth between customers and the headend. This is truly the workhorse of the cable industry, the most costly plant segment by far.
FIGURE 1: Comparative plot of the HFC network
Figure 1 illustrates the underlying cost of operation of the HFC plant. In this N+6 configuration analysis servicing between 750 and 1,000 customers, one can see the huge difference between the system's availability and reliability. This simulation was based purely on field data collected on component reliability to validate the performance of the HFC network (amplifiers, line extenders, taps, fiber, nodes, etc.). The blue line depicts the availability performance, and the red line shows the overall system reliability curve. To say the two graphs are diverging is an understatement. Figure 1 shows system reliability around 60 percent the first year and 10 percent in the fifth year, while the availability remains constant around 98.96 percent.
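What does this mean? A reliability of 60 percent means 40 percent of the units are expected to have failed. For an HFC plant with 100 nodes, that means about 40 truck rolls by the end of the first year and about 90 by the end of the fifth year of operation, a 90 percent failure rate within the first five years.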
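The arithmetic behind those truck-roll figures can be sketched as follows. This is a minimal illustration, assuming each failed node triggers exactly one truck roll and using the reliability values quoted from Figure 1; the function name is hypothetical.

```python
def expected_failed_nodes(node_count: int, reliability: float) -> float:
    """Expected number of failed nodes, given the fraction still surviving."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return node_count * (1.0 - reliability)


nodes = 100
# Reliability values as quoted in the text for year 1 and year 5.
for year, reliability in [(1, 0.60), (5, 0.10)]:
    failed = expected_failed_nodes(nodes, reliability)
    print(f"By end of year {year}: about {failed:.0f} of {nodes} nodes failed")
```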
Though this is obviously bad, it is not unusual. The focus was on the availability of the network, not its reliability. High availability was maintained via all the truck rolls and equipment swaps. These truck rolls don't just cost a lot of money; they also tie up key resources in "fire fighting mode." Innovation and continuous improvement suffer.
The bottom line here is that network operators are expending a great deal of money and effort on the wrong things. It pays to know the technical difference between system reliability and availability.
Reliability, the backbone of any good business strategy and essential for growth, is the probability that an item will perform its intended functions without failure for a specified interval under stated conditions.
Several parts of this definition require clarification.
The first is probability. A probability is a ratio; in reliability, it is the number of successes divided by the number of attempts. This means it is a quantifiable number. Consider the difference between 0.60 and "a lot" or "most."
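As a small illustration of that ratio, the sketch below computes an observed reliability from counts. The counts are made up for the example and the function name is hypothetical.

```python
def observed_reliability(successes: int, attempts: int) -> float:
    """Reliability as a ratio: successful outcomes divided by attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return successes / attempts


# Illustrative counts only: 60 successes in 100 attempts.
print(observed_reliability(60, 100))  # 0.6
```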
The second important item is "performs its intended functions." This suggests that the functions an item needs to perform have all been identified. Many times, an item is deemed unreliable even though it performs the functions that have been identified. The problem is that not all the necessary functions have been identified. For example, a part has a certain mass that dampens vibration of an assembly. It is decided to reduce the mass of the item to reduce cost. If the function of dampening vibration is not identified, then the change may go through because the item by itself performs, but the assembly may fail because of increased vibration.
The third important item in the definition of reliability is "without failure." This implies a failure has been defined. In some circumstances, this may be apparent (smoked board, failure to power up, etc.). In others, a certain amount of degraded performance over time may be acceptable. In the case of an amplifier or line extender, the amount of gain may degrade over time, but as long as the consumer does not experience picture degradation, color issues, etc., it may not be a failure. Defining the threshold or the amount of degradation or drift that is acceptable is sometimes difficult, but is very important.
The fourth item is "for a specified interval." This is not simply "a long time." It is a specific number and should be in units of measure relevant to the part. A specified interval of five years does not mean much to a part. The specified interval of five years needs to be translated into hours of operation (43,800 hours), number of cycles, etc., for it to be meaningful.
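The conversion behind the 43,800-hour figure can be sketched as below. It assumes continuous operation and 365-day years; the duty_cycle parameter is an added assumption for parts that are not energized around the clock.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # leap days ignored, which matches the 43,800-hour figure


def years_to_operating_hours(years: float, duty_cycle: float = 1.0) -> float:
    """Translate a calendar interval into hours of operation.

    duty_cycle is the fraction of time the part is actually operating
    (1.0 means it runs continuously).
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return years * DAYS_PER_YEAR * HOURS_PER_DAY * duty_cycle


print(years_to_operating_hours(5))  # 43800.0
```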
The last item is "under stated conditions." This means that the environment in which the item operates must be completely defined. Temperatures, temperature cycles, pressures, pressure cycles, corrosives, contaminants, and maintenance items (for example, household cleansers) must all be defined for an item to be robust to all operating conditions. This particular requirement is misunderstood by 99.99 percent of reliability engineers.
The engineer needs to account for the worst case as the gauge for temperature stress factors, elevation, vibration, dust, and so forth. In my study, I found that 99 percent of the vendors omit the temperature factor from their reliability calculation by setting the pi-T factor to 1, a bad practice. For example, an amplifier designed for Georgia temperature factors will not operate properly in places like Phoenix, Las Vegas, and other extreme places.
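To see why fixing pi-T at 1 matters, the sketch below uses an Arrhenius-type temperature factor, a form commonly found in handbook-style reliability prediction. It is not the model from the study or from any particular vendor. The activation energy (0.7 eV), the 25 degrees C reference, and the site temperatures are all assumptions chosen for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV per kelvin


def temperature_factor(temp_c: float,
                       ref_temp_c: float = 25.0,
                       activation_energy_ev: float = 0.7) -> float:
    """Arrhenius-type multiplier on failure rate relative to a reference temperature.

    Returns 1.0 at the reference temperature and grows as temperature rises.
    The default activation energy is an assumed, illustrative value.
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = -(activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / temp_k - 1.0 / ref_k)
    return math.exp(exponent)


# Hypothetical operating temperatures, not measured site data.
for label, temp_c in [("reference", 25.0), ("mild site", 40.0), ("hot site", 60.0)]:
    print(f"{label:10s} {temp_c:5.1f} C -> pi-T = {temperature_factor(temp_c):.2f}")
```

Under these assumptions the multiplier at the hot site is many times the reference value, which is the effect a calculation with pi-T fixed at 1 hides.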
When developing a technical requirement for an item, all five points in the definition of reliability must be addressed. The reliability paragraph in a specification should:
1. Call out a probability, for example 0.95.
2. Define all functions of the item. One could refer to a different paragraph in the specification where functional requirements may already exist.
3. Define what a failure is and is not, such as failure to operate when commanded to or greater than 20 percent change in resistance.
4. Define the specified interval or mission duration, for example 1,000 hours energized or 900,000 cycles. (Note: One must then adequately define a cycle.)
5. Define the stated conditions, such as 50 degrees C energized and 25 degrees C when not energized.
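The five points above can be captured as a simple record, sketched below. The class and field names are hypothetical, the example values are the ones used in the list, and the function entry is a placeholder rather than a real functional requirement.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReliabilityRequirement:
    probability: float                   # 1. required probability of success
    functions: List[str]                 # 2. every function the item must perform
    is_failure: Callable[[float], bool]  # 3. what counts as a failure
    mission_duration: str                # 4. specified interval, in relevant units
    stated_conditions: List[str]         # 5. operating environment

    def validate(self) -> None:
        """Reject a requirement that leaves any of the five points undefined."""
        if not 0.0 < self.probability < 1.0:
            raise ValueError("probability must be strictly between 0 and 1")
        if not self.functions:
            raise ValueError("at least one function must be defined")
        if not self.mission_duration:
            raise ValueError("a specified interval is required")
        if not self.stated_conditions:
            raise ValueError("stated conditions are required")


def resistance_failure(relative_change: float) -> bool:
    """Failure if resistance changes by more than 20 percent in either direction."""
    return abs(relative_change) > 0.20


requirement = ReliabilityRequirement(
    probability=0.95,
    functions=["operate when commanded"],  # placeholder function list
    is_failure=resistance_failure,
    mission_duration="1,000 hours energized",
    stated_conditions=["50 degrees C energized", "25 degrees C when not energized"],
)
requirement.validate()
print(requirement.is_failure(0.25))  # True: 25 percent drift exceeds the threshold
print(requirement.is_failure(0.10))  # False: 10 percent drift is acceptable
```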
All of these points are necessary for a thorough reliability requirement. To illustrate the process, consider the performance of two well-known Internet protocol (IP) switches.
Figure 2 illustrates the performance of a particular well-designed and reliable IP switch. In layman's terms, this device is so reliable that it requires nearly no repairs over five years. Although this particular device is highly reliable, its sales team is having a hard time selling it to many cable operators. Why? Simple: The switch is more expensive than some others on the market, and the buyers are making their decisions on the basis of initial cost, without taking into account the cost of ownership and cost of operation.
FIGURE 2: Performance of a well-designed and reliable IP switch
Now, let's analyze a similar IP switching product line that offers a better initial price with nearly 1/4 of the cost of ownership and cost of operation of the first product line. (See Figure 3.)
FIGURE 3: Performance of a similar but less expensive IP switch
It is clear that this product line is not comparable to the first product analyzed. Looking at the year 5 performance for both products, it is evident that the survival rate of the second product is about 38 percent, compared to nearly 98 percent for the first product analyzed. However, the less reliable product is the leading IP switching choice. Decision makers are looking to make an immediate impact on their short-term strategy without regard to the business's long-term needs. This type of mentality needs to shift quickly in order to hold original equipment manufacturers (OEMs) accountable for poor performance.
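One way to put the two survival rates on a common footing is to back out the MTBF each would imply. The sketch below assumes a constant failure rate (an exponential survival curve), which is a simplification and not something the figures confirm. The survival values are the approximate year 5 readings quoted above.

```python
import math

FIVE_YEARS_HOURS = 43_800  # five years of continuous operation


def implied_mtbf_hours(survival_probability: float, interval_hours: float) -> float:
    """MTBF implied by a survival probability, assuming R(t) = exp(-t / MTBF)."""
    if not 0.0 < survival_probability < 1.0:
        raise ValueError("survival_probability must be strictly between 0 and 1")
    return -interval_hours / math.log(survival_probability)


for name, survival in [("first switch", 0.98), ("second switch", 0.38)]:
    mtbf = implied_mtbf_hours(survival, FIVE_YEARS_HOURS)
    failed_per_100 = 100 * (1.0 - survival)
    print(f"{name}: implied MTBF about {mtbf:,.0f} hours, "
          f"about {failed_per_100:.0f} of 100 units failed by year 5")
```

Under that assumption the first switch's implied MTBF is in the millions of hours, while the second is close to the five-year interval itself.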
Now consider availability. What is it exactly? Where does the responsibility for it lie?
Availability is the probability that an item/system is good and ready to go when needed.
For example, I expect my car to start whatever the weather and to take me back and forth to my routine destinations, day in and day out. When I take it to the mechanic for routine maintenance (oil and tire changes, fluid checks, etc.), my car is not available, shaving a little bit of the 99.999 percent availability set forth by the manufacturers. Although my car is not available during routine maintenance, this is not a reliability hit because routine maintenance is scheduled downtime. On the other hand, if I have a transmission problem, both availability and reliability will take a hit because the mean time between failures (MTBF) will decrease and my mean time between repairs (MTBR) will increase.
For most complex systems with multiple schemes of redundancy (stand-by, manual, etc.), there's a process called the Markov model that can help simulate overall system availability. In its simplest form, availability is a function of MTBF and MTBR, as illustrated in this simple equation.
A (t) = MTBF/(MTBF + MTBR)
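The equation can be sketched in code as below. The MTBF and MTBR values are hypothetical, and MTBR is treated as the average time to restore service after a failure, which is the role it plays in the equation. The second function iterates a two-state (up/down) Markov chain, the simplest case of the Markov approach, and should settle at the same value as the closed-form equation.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mtbr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTBR)."""
    return mtbf_hours / (mtbf_hours + mtbr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


def two_state_markov_availability(mtbf_hours: float, mtbr_hours: float,
                                  step_hours: float = 1.0, steps: int = 1000) -> float:
    """Iterate an up/down Markov chain until the 'up' probability settles."""
    if step_hours >= min(mtbf_hours, mtbr_hours):
        raise ValueError("step_hours must be smaller than both MTBF and MTBR")
    fail_p = step_hours / mtbf_hours    # chance an up system fails in one step
    repair_p = step_hours / mtbr_hours  # chance a down system is restored in one step
    p_up = 1.0                          # start with the system working
    for _ in range(steps):
        p_up = p_up * (1.0 - fail_p) + (1.0 - p_up) * repair_p
    return p_up


# Hypothetical figures: one failure every 10,000 hours, 4 hours to restore.
a = availability(10_000, 4)
print(f"Closed form: {a:.6f}")
print(f"Markov:      {two_state_markov_availability(10_000, 4):.6f}")
print(f"Downtime:    {annual_downtime_hours(a):.2f} hours per year")
```

The same availability can be reached by raising MTBF or by cutting repair time, which is why a high availability figure alone says little about reliability.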
Availability is a shared metric that needs careful attention from all parties involved: vendors, engineering, network operations, and field installation. In a great organization that values continuous improvement, here's how the process goes: The engineering team works with the vendors to translate system requirements into technical specifications. Upon clear agreement, the vendor develops and delivers robust and reliable products suitable to meet the targeted MTBF. (Use my guideline stated previously to define reliability requirements.) The operations team works with engineering all through the design process and testing and validation phases to identify failure characteristics, troubleshooting guides, and corrective action to minimize the overall system downtime, thus improving the MTBR to support the customers' and contractual targets. The field team coordinates the builds with operations and engineering teams per the vendor's recommendations to reduce "infant mortality" and strenuous operation.
Availability is a subset of reliability that requires careful inputs from all involved parties to clearly define contractual needs, reduce strenuous operation, and build long-term value for both the customer and the business. Without this clear focus, system operation goes through a constant "fire fighting" mode that is costly, restricts key resources, and hurts the company's growth.
At this writing, the big three of the U.S. automobile industry are tottering, and the financial markets are in turmoil because of lack of due diligence and risk assessment. Risk assessment is not easy, but it is necessary. The skills required cannot be obtained over the counter or from a book or some clever tools. It is a very disciplined skill that requires years of crafting by connecting the dots and staying abreast of a fast-changing world. A reliability engineer needs to be an engineer on steroids; the job requires the ability to reverse-engineer designs while finding ways to improve them based on other core principles, such as material properties, design for Six Sigma (DFSS), design for reliability (DFR), LEAN, etc.
Stale ideology, rigid and outdated guidelines, bureaucracy and the glory days of the past will not propel your company forward. What is required is a long-term strategy and focus on core fundamentals. While availability is important, reliability is in many ways more so in the long run. Focusing on quarter-to-quarter strategy with only vague long-term vision is not the path to longevity.
Philimar Menard is founder and CTO of Q&R Consulting. Reach him at firstname.lastname@example.org.