System Reliability Math — Calculating MTBF for Factory Automation Cells

MTBF is derived from component failure rates and system topology. Series systems sum failure rates. Redundant pairs multiply them. Availability is a function of both MTBF and repair time.

Mean Time Between Failures (MTBF) is the expected operating time between failures for a repairable system. It is not a guarantee — it is a statistical expectation derived from failure rate data. For automation cells, MTBF drives maintenance intervals, spare parts inventory, and whether a given uptime target is physically achievable.


Core definitions

Term                          Symbol   Relationship
Failure rate                  λ        Failures per operating hour
Mean time between failures    MTBF     1 / λ
Mean time to repair           MTTR     Average hours from failure to restored operation
Availability                  A        MTBF / (MTBF + MTTR)

A component with MTBF = 80,000h has λ = 0.0000125 failures/hour. Over one year of continuous operation (8,760h), the expected number of failures is 8,760 × 0.0000125 = 0.1095, or approximately one failure every 9.1 years under stated conditions.
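These conversions are easy to check in a few lines. The following is a minimal Python sketch, assuming a constant failure rate and continuous 24/7 operation; the function names are illustrative and not part of any library.

```python
HOURS_PER_YEAR = 8760  # continuous operation: 24 h x 365 days


def failure_rate(mtbf_hours: float) -> float:
    """Constant failure rate (failures per operating hour) for a given MTBF."""
    return 1.0 / mtbf_hours


def expected_failures(mtbf_hours: float, operating_hours: float) -> float:
    """Expected number of failures over an operating period."""
    return failure_rate(mtbf_hours) * operating_hours


lam = failure_rate(80_000)
per_year = expected_failures(80_000, HOURS_PER_YEAR)

print(f"lambda = {lam:.7f} failures/h")                 # 0.0000125
print(f"expected failures per year = {per_year:.4f}")   # 0.1095
print(f"years per failure = {1 / per_year:.1f}")        # 9.1
```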


Series system: no redundancy

In a series configuration, any single component failure stops the cell. The system failure rate is the sum of all component failure rates:

λ_system = λ₁ + λ₂ + λ₃ + ... + λₙ

MTBF_system = 1 / λ_system

Example cell:

Component        MTBF (h)   λ (failures/h)
PLC (S7-1500)     200,000   0.00000500
HMI panel          50,000   0.00002000
Servo drive        80,000   0.00001250
Servo motor        60,000   0.00001670
Safety relay      150,000   0.00000670
System                      0.00006090

MTBF_system = 1 / 0.0000609 ≈ 16,420 hours (~1.87 years)

The HMI and servo motor dominate: together they account for about 60% of λ_system. They are the weakest links and the most cost-effective targets for replacement or redundancy.
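The series calculation can be reproduced with a sketch like the following, using the MTBF column of the example cell as input. The dictionary and variable names are illustrative. Because it works from unrounded failure rates, it yields about 16,438h rather than the 16,420h obtained from the rounded λ column; the difference is rounding only.

```python
# Component MTBF values (hours) from the example cell.
mtbf_hours = {
    "PLC (S7-1500)": 200_000,
    "HMI panel": 50_000,
    "Servo drive": 80_000,
    "Servo motor": 60_000,
    "Safety relay": 150_000,
}

# Series system: component failure rates add.
rates = {name: 1.0 / mtbf for name, mtbf in mtbf_hours.items()}
lambda_system = sum(rates.values())
mtbf_system = 1.0 / lambda_system

print(f"lambda_system = {lambda_system:.8f} failures/h")
print(f"MTBF_system   = {mtbf_system:,.0f} h")

# Rank components by their share of the system failure rate.
for name, lam in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {lam / lambda_system:6.1%}")
```

Sorting by share makes the dominant contributors obvious without reading the λ column by eye.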


Redundant pair: active standby

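For a component with an active standby, both units must be down at the same time for the system to fail: the second unit has to fail while the first is still being repaired. For a repairable pair with independent failures and repair times much shorter than each MTBF, the effective failure rate of the pair is approximately:

λ_pair ≈ λ₁ × λ₂ × (MTTR₁ + MTTR₂)

For two identical servo drives (λ = 0.0000125 each), assuming MTTR = 4 hours per drive:

λ_pair ≈ 0.0000125 × 0.0000125 × (4 + 4)
       = 0.00000000015625 × 8
       = 0.00000000125 failures/h

That is a ~10,000× reduction in the drive's failure rate contribution, provided a failed drive is detected and repaired promptly. Redundancy is only worth its cost when the component is a dominant contributor to λ_system and the standby can take over without manual intervention.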
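A repairable redundant pair is commonly approximated by λ₁ × λ₂ × (MTTR₁ + MTTR₂), which holds when failures are independent and repair times are much shorter than each MTBF. The sketch below applies that approximation to two identical servo drives; the function name and the 4-hour repair time are assumptions made for illustration.

```python
def pair_failure_rate(lam1: float, lam2: float, mttr1: float, mttr2: float) -> float:
    """Approximate failure rate of a repairable redundant pair.

    Assumes independent failures, a constant failure rate for each unit,
    and repair times much shorter than each MTBF. The pair fails only
    when one unit fails while the other is still under repair.
    """
    return lam1 * lam2 * (mttr1 + mttr2)


lam_drive = 1.0 / 80_000   # single servo drive, failures/h
mttr = 4.0                 # assumed repair time per drive, hours

lam_pair = pair_failure_rate(lam_drive, lam_drive, mttr, mttr)
print(f"lambda_pair = {lam_pair:.3e} failures/h")
print(f"reduction   = {lam_drive / lam_pair:,.0f}x")
```

The approximation ignores common-cause failures and switchover faults, so it is best read as an upper bound on the benefit.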


Availability calculation

For the series cell above, with MTTR = 4 hours:

A = MTBF / (MTBF + MTTR)
  = 16,420 / (16,420 + 4)
  ≈ 99.976%

To determine the MTBF required to hit a specific availability target, rearrange:

MTBF = (A × MTTR) / (1 - A)

Target: 99.99% availability at MTTR = 4h:

MTBF = (0.9999 × 4) / (1 - 0.9999)
     = 3.9996 / 0.0001
     = 39,996 hours (~4.6 years)

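The current cell MTBF of 16,420h falls short. To reach 99.99%, either reduce MTTR to about 1.6 hours (faster diagnosis, local spares, trained technicians) or raise MTBF to roughly 40,000h by upgrading or duplicating the two weakest components. The MTBF route leaves little margin: the PLC, servo drive, and safety relay alone sum to λ = 0.0000242 failures/h, an MTBF of about 41,300h.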
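Both directions of the availability formula are easy to script. The sketch below is one way to do it; the function names are illustrative, and required_mttr is a further rearrangement of the same formula, MTTR = MTBF × (1 − A) / A.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability from MTBF and MTTR (both in hours)."""
    return mtbf / (mtbf + mttr)


def required_mtbf(target: float, mttr: float) -> float:
    """MTBF needed to reach a target availability at a given MTTR."""
    return target * mttr / (1.0 - target)


def required_mttr(target: float, mtbf: float) -> float:
    """Longest MTTR that still meets a target availability at a given MTBF."""
    return mtbf * (1.0 - target) / target


print(f"{availability(16_420, 4):.3%}")            # 99.976%
print(f"{required_mtbf(0.9999, 4):,.0f} h")        # 39,996 h
print(f"{required_mttr(0.9999, 16_420):.2f} h")    # 1.64 h
```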


Effect of MTTR on availability

MTBF improvements are expensive — they require better components or redundancy. MTTR improvements are operational — spares on-site, documented procedures, trained staff.

MTBF (h)    MTTR = 8h    MTTR = 4h    MTTR = 1h
16,420      99.951%      99.976%      99.994%
39,996      99.980%      99.990%      99.997%
100,000     99.992%      99.996%      99.999%

Halving MTTR delivers exactly the same availability gain as doubling MTBF, because availability depends only on the ratio MTTR / MTBF, and it usually costs far less. For most automation cells, the fastest path to higher availability is reducing repair time, not purchasing better hardware.
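The table can be regenerated, or extended to other MTBF and MTTR values, with a short script along these lines. The helper name and column layout are illustrative choices.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability from MTBF and MTTR (both in hours)."""
    return mtbf / (mtbf + mttr)


mtbf_values = [16_420, 39_996, 100_000]
mttr_values = [8, 4, 1]

# Header row, then one row per MTBF value.
header = "MTBF (h)".ljust(10) + "".join(f"MTTR = {r}h".rjust(12) for r in mttr_values)
print(header)
for mtbf in mtbf_values:
    row = f"{mtbf:,}".ljust(10)
    row += "".join(f"{availability(mtbf, r):.3%}".rjust(12) for r in mttr_values)
    print(row)
```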


MTBF data sources

Source                     Basis                                                          Use case
Manufacturer datasheets    Vendor testing, controlled conditions                          First estimate
MIL-HDBK-217F              US military component-level models                             Conservative baseline
IEC 61709                  International reference conditions and stress conversion models   Reference standard
CMMS field data            Your actual operating conditions                               Most accurate

Manufacturer MTBF figures typically assume benign conditions, such as 25°C ambient and low vibration. As a rule of thumb, apply a derating factor of 0.5–0.7 for high-temperature enclosures or high-vibration environments. A drive rated for 80,000h in a lab may deliver 40,000–56,000h on a stamping press.
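Derating is a straight multiplication of the rated MTBF, which means the failure rate is divided by the same factor. A minimal sketch, with a hypothetical helper name:

```python
def derated_mtbf(rated_mtbf_hours: float, factor: float) -> float:
    """Scale a datasheet MTBF by an environmental derating factor (0 < factor <= 1)."""
    if not 0 < factor <= 1:
        raise ValueError("derating factor must be in (0, 1]")
    return rated_mtbf_hours * factor


rated = 80_000  # datasheet MTBF in hours
for factor in (0.5, 0.7):
    mtbf = derated_mtbf(rated, factor)
    print(f"factor {factor}: MTBF = {mtbf:,.0f} h, lambda = {1 / mtbf:.8f} failures/h")
```

Feeding the derated values into the series calculation gives a more realistic cell MTBF than the datasheet figures alone.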

Build a CMMS-backed failure log from day one. Field data from your specific environment will diverge from handbook values within two to three years — and that divergence is the most useful reliability signal you have.
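Once failures are being logged, field MTBF can be estimated as cumulative operating hours divided by the number of failures, and field MTTR as the mean repair time across logged events. The sketch below assumes a hypothetical record layout and uses made-up numbers purely for illustration; a real CMMS export will have its own fields.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FailureRecord:
    """One failure event exported from a CMMS (hypothetical layout)."""
    asset: str
    repair_hours: float  # downtime from failure to restored operation


def field_mtbf(operating_hours: float, failures: int) -> float:
    """Observed MTBF: cumulative operating hours divided by failure count."""
    if failures == 0:
        raise ValueError("no failures recorded; MTBF cannot be estimated yet")
    return operating_hours / failures


def field_mttr(records: Sequence[FailureRecord]) -> float:
    """Observed MTTR: mean repair time across logged failures."""
    return sum(r.repair_hours for r in records) / len(records)


# Illustrative data only, not real measurements.
log = [
    FailureRecord("servo-drive-1", 3.5),
    FailureRecord("servo-drive-1", 5.0),
    FailureRecord("servo-drive-1", 2.5),
]
operating_hours = 150_000  # cumulative powered-on hours across the drive fleet

mtbf = field_mtbf(operating_hours, len(log))
mttr = field_mttr(log)
print(f"field MTBF = {mtbf:,.0f} h, field MTTR = {mttr:.2f} h")
```

With only a handful of failures the estimate is noisy, so it is worth tracking the failure count alongside the figure.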