Annex C - Troubleshooting runbook

Annex C: troubleshooting runbook for the cross-boundary interface, covering who calls whom and the escalation paths.

Annex C — Joint Troubleshooting Runbook for Long-Distance Interlocks

Referenced by MSA § 5.5. Gary flagged this as an explicit open item on the 04-21 call: “we never adequately addressed how the operator was going to troubleshoot the long distance interlocks.” This runbook is the interim measure until a tooling solution exists.

When this runbook applies

You are an operator (or maintenance tech, or the on-call Manufacturing Operations (MO) engineer) and one of the following has happened:

A consumer controller has dropped into fail-safe because four expected update intervals elapsed without a message.
A producer reports it has been sending, but the consumer has been missing messages: sequence gaps on the receiver side.
An operator HMI shows an interlock as silent but the local network looks fine.

You can see your end of the wire. You can’t see the IT backbone in the middle. This runbook is how you triage that.

Step 0 — Before you call anybody, look local

Do these three things first. Most “interlock failures” are not actually long-distance failures.

0.1 Is the local controller alive? Check that the producer (if you’re on the producer side) or the consumer (if you’re on the consumer side) is powered, in run mode, and not faulted. If it’s faulted, this is a local PLC problem: go to local PLC procedures.

0.2 Is the local switch alive? Check that the One Big Switch for the affected Manufacturing Operations network is powered and the relevant port LEDs are active. If the switch is down, it’s a local-network problem: go to local network procedures.

0.3 Did you bring something down for maintenance? Check the production-schedule feed and any in-progress maintenance work orders. If this is a scheduled powerdown, no action — IT monitoring should already know (per MSA § 2.6). If it doesn’t know, that’s a § 2.6 compliance issue, not an interlock failure.

If 0.1–0.3 are all clean, the failure is somewhere in the long-distance path. Continue.

Step 1 — Measure your end of the wire

1.1 Ping the boundary firewall from the services server (.2).

Expected: <1 ms, no loss.
If fail: the One Big Switch ↔ boundary firewall link is broken. Local problem, but inside the Manufacturing Operations Firewall. Replace cable, check port, escalate to network steward.

1.2 Ping the IT layer-3 switch downlink port (the IT-side address of the boundary link).

Expected: <5 ms, no loss.
If fail: the boundary firewall ↔ IT layer-3 switch link is broken. This is the demarcation; call IT.
If pass: you can reach IT. Continue.

1.3 Ping the peer Manufacturing Operations services server (the .2 address of the other end of the interlock).

Expected: <10 ms in a well-running plant, may be higher.
If fail: the IT backbone is not delivering your traffic to the other network. Call IT.
If pass: end-to-end reachability is intact. The problem is probably in the interlock priority class, not the underlying network. Continue.

1.4 Traceroute to the peer Manufacturing Operations services server.

Expected: small number of hops, all responsive, no asymmetric path.
If hops are missing or asymmetric: the backbone may be using the backup path (per MSA § 3.3 / Annex B). That’s not necessarily wrong, but flag it to IT for explanation.

Per MSA § 3.5, these tests do not require IT permission. If IT has blocked them, that itself is the incident: escalate under § 7.
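The pings in 1.1 to 1.3 form a ladder: the first rung that fails tells you who owns the problem. A minimal sketch of that decision in Python, assuming each result is passed in as a round-trip time in milliseconds, or None when the ping failed. The function names and return strings are illustrative only and do not come from the MSA.

```python
from typing import List, Optional

# Expected round-trip ceilings from steps 1.1-1.3, in milliseconds.
EXPECTED_MS = {"firewall": 1.0, "it_downlink": 5.0, "peer": 10.0}


def triage_reachability(
    firewall_ms: Optional[float],
    it_downlink_ms: Optional[float],
    peer_ms: Optional[float],
) -> str:
    """Map the step 1.1-1.3 ping results to the runbook's next action.

    None means the ping failed (a convention assumed for this sketch).
    """
    if firewall_ms is None:
        return "local: switch-to-firewall link, escalate to network steward"
    if it_downlink_ms is None:
        return "demarcation: firewall-to-IT link, call IT"
    if peer_ms is None:
        return "backbone: peer unreachable, call IT"
    return "reachable: continue to step 1.4 and step 2"


def slow_hops(firewall_ms: float, it_downlink_ms: float, peer_ms: float) -> List[str]:
    """List hops that answered but did not meet the expected round-trip time.

    Informational only: step 1.3 notes the peer figure may legitimately be higher.
    """
    measured = {"firewall": firewall_ms, "it_downlink": it_downlink_ms, "peer": peer_ms}
    return [name for name, ms in measured.items() if ms >= EXPECTED_MS[name]]
```

For example, triage_reachability(0.4, 2.0, None) points at the backbone, which matches the "call IT" outcome of step 1.3.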

Step 2 — Measure the interlock specifically

2.1 Run the producer-initiated heartbeat (per Annex B test method for the interlock).

The producer sends a known test sequence; the consumer logs round-trip latency and sequence gaps.
Expected: round-trip well under the maximum allowed loss window. Zero sequence gaps over a 30-second sample.
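The pass criterion for 2.1 can be written down as a small check. This is a sketch under assumptions: the consumer log is taken to be a list of (sequence number, round-trip ms) pairs, the loss window is supplied by the caller from Annex B, and the margin parameter is a hypothetical reading of "well under" that the runbook itself does not quantify.

```python
from typing import List, Tuple


def heartbeat_passes(
    samples: List[Tuple[int, float]],
    loss_window_ms: float,
    margin: float = 0.5,
) -> bool:
    """Evaluate one heartbeat run (about 30 seconds of samples).

    samples: (sequence_number, round_trip_ms) in arrival order.
    margin: assumed fraction of the loss window that counts as "well under".
    """
    if not samples:
        return False  # nothing received at all is a failure, not a pass
    seqs = [seq for seq, _ in samples]
    # Consecutive sequence numbers must differ by exactly one: no gaps.
    no_gaps = all(b - a == 1 for a, b in zip(seqs, seqs[1:]))
    worst_rtt = max(rtt for _, rtt in samples)
    return no_gaps and worst_rtt < margin * loss_window_ms
```

The check only sees gaps between the first and last message received, so messages lost at the very start or end of the run need to be caught by comparing against the producer's sent count.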

2.2 Check sequence gap history on the consumer.

The consumer keeps a rolling log of sequence numbers received.
A handful of gaps over hours = network noise, normal.
Bursts of gaps = priority class violation. The IT backbone is shedding your traffic. This is a Major Incident under MSA § 7.
Steady gap rate above ~1 per minute = bandwidth starvation or path flapping. Major Incident.
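The three patterns in 2.2 can be told apart mechanically once the consumer log yields a timestamp for each gap. A rough sketch, assuming gap times in seconds. The burst definition used here (three gaps within one second) is a placeholder chosen for illustration, because the runbook does not quantify a burst; the steady threshold of one gap per minute is taken from the text above.

```python
from typing import List


def classify_gaps(
    gap_times_s: List[float],
    window_s: float,
    burst_size: int = 3,        # assumed: gaps needed to call it a burst
    burst_span_s: float = 1.0,  # assumed: time span those gaps must fall within
    steady_per_min: float = 1.0,
) -> str:
    """Return "burst", "steady", or "noise" for the gaps seen in window_s seconds."""
    times = sorted(gap_times_s)
    # Burst: any run of burst_size consecutive gaps packed inside burst_span_s.
    for i in range(len(times) - burst_size + 1):
        if times[i + burst_size - 1] - times[i] <= burst_span_s:
            return "burst"
    # Steady: average gap rate over the whole window exceeds the threshold.
    rate_per_min = len(times) / (window_s / 60.0)
    if rate_per_min > steady_per_min:
        return "steady"
    return "noise"
```

Both "burst" and "steady" correspond to a Major Incident in the list above; "noise" is the normal case.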

2.3 Check the receive timestamp drift.

If timestamps from the producer are clustered (multiple arriving at once after silence), the IT backbone is buffering instead of forwarding in real time. This is a priority-class violation. Major Incident.
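Clustering shows up in the receive log as a long silence followed by several messages arriving almost together. One possible way to spot it, assuming receive times in integer milliseconds and the interlock's expected update interval; the silence and cluster thresholds are illustrative defaults, not values from the MSA.

```python
from typing import List


def buffered_bursts(
    recv_ms: List[int],
    interval_ms: int,
    silence_factor: float = 2.0,  # assumed: silence is >= 2 intervals with no message
    min_cluster: int = 2,         # assumed: at least 2 messages arriving together
) -> List[int]:
    """Return indices of messages that start a cluster after a silence."""
    deltas = [b - a for a, b in zip(recv_ms, recv_ms[1:])]
    hits = []
    for i, delta in enumerate(deltas):
        if delta < silence_factor * interval_ms:
            continue
        # Message i+1 arrived after a silence. Count the messages that follow
        # it much faster than the nominal interval allows.
        j = i + 1
        while j < len(deltas) and deltas[j] < 0.5 * interval_ms:
            j += 1
        followers = j - (i + 1)
        if followers + 1 >= min_cluster:
            hits.append(i + 1)
    return hits
```

A non-empty result is the signature described in 2.3: the backbone held messages and released them together instead of forwarding in real time.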

Step 3 — Listen for the FSK, metaphorically

Gary’s note from the call: in his SCADA days he learned to diagnose RTU vs IO faults by ear, from the FSK tones on the ops channel. We don’t have audio anymore, but the principle is: most interlock failures have a signature, and the signature points to where the problem is.

Signature	Probable cause	Action
Local controller faulted, no signal sent	Producer-side PLC problem	Local PLC procedures
Producer sending, consumer silent, ping to IT downlink fails	Boundary link down (MO side)	Network steward, MO side
Producer sending, consumer silent, ping to IT downlink works but peer unreachable	IT backbone problem	Call IT — Major Incident if § 5.3 reservation violated
Producer sending, consumer receives but sequence gaps in bursts	Priority class not honored	Call IT — Major Incident
Producer sending, consumer receives but always 200+ ms late	Buffer/queuing on backbone	Call IT — review § 3.2 priority enforcement
Producer sending, consumer fine, then both go silent at the same time	Maintenance window collision or shared upstream failure	Cross-check IT scheduled-maintenance calendar (per § 6.2)
Producer fine, consumer fail-safed, no log of any anomaly	Consumer-side PLC clock drift or NTP failure	Check services server NTP on consumer side

Step 4 — Engage IT

When you call IT, tell them:

Interlock ID (from Annex B).
Which step in this runbook you got to before calling.
Specific measurements: ping result, traceroute result, sequence-gap pattern, timestamp drift.
Whether this is the primary or backup path (from the traceroute in step 1.4).
Whether you are claiming a Major Incident under MSA § 7.

IT’s obligation under MSA § 3.4 is to acknowledge the measurement and produce their own backbone-side measurements (per § 6.2). If they cannot produce backbone-side measurements within the agreed window, that is itself a § 7 Major Incident.
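The call checklist in this step maps naturally onto a small record, which also makes it easy to see what is still missing before picking up the phone. A sketch only: the class name, field names, and measurement keys are hypothetical and are not defined by the MSA or Annex B.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Measurement kinds the runbook asks the caller to have ready.
REQUIRED_MEASUREMENTS = ("ping", "traceroute", "sequence_gaps", "timestamp_drift")


@dataclass
class ITCallReport:
    interlock_id: str                 # from Annex B
    runbook_step_reached: str         # e.g. "2.2"
    measurements: Dict[str, str] = field(default_factory=dict)
    on_backup_path: bool = False      # from the step 1.4 traceroute
    claiming_major_incident: bool = False  # under MSA § 7

    def missing_measurements(self) -> List[str]:
        """Return the measurement kinds not yet recorded."""
        return [k for k in REQUIRED_MEASUREMENTS if k not in self.measurements]


report = ITCallReport(
    interlock_id="mach1→conv-main:part-ready",
    runbook_step_reached="2.2",
    measurements={"ping": "peer reachable, no loss", "sequence_gaps": "bursts"},
)
```

Here report.missing_measurements() lists the traceroute and timestamp-drift results as still outstanding.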

Step 5 — Restore

5.1 Once the cause is identified, the side that controls the cause executes the fix:

MO problem: MO fixes, IT is informed.
IT problem: IT fixes, MO is informed.
Joint problem: both sides coordinate via the joint review process (§ 8.1).

5.2 Do not re-arm the interlock automatically. A fail-safed consumer is fail-safe for a reason. After the cause is fixed:

Verify the producer is sending cleanly (run the step 2.1 heartbeat).
Verify the consumer is receiving cleanly (run the step 2.2 sequence-gap check over a settling period, typically 60 seconds).
Operator manually re-arms the interlock at the HMI.
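If an HMI or script ever assists with this step, the rule above suggests it should only gate the operator's action and never perform it. A minimal sketch of such a gate, assuming "receiving cleanly" means zero sequence gaps during the settling period; that reading, and all names here, are assumptions.

```python
SETTLING_PERIOD_S = 60  # typical settling period from 5.2


def may_offer_rearm(
    cause_fixed: bool,
    heartbeat_clean: bool,
    gaps_during_settling: int,
) -> bool:
    """Return True when the operator may be offered a manual re-arm.

    This function never re-arms anything itself. A True result only means
    the preconditions in 5.2 are met; the operator still acts at the HMI.
    """
    return cause_fixed and heartbeat_clean and gaps_during_settling == 0
```

Family-specific confirmations, such as checking with the machining floor or the utility department, remain a human step outside this check.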

5.3 Log the incident with measurements in the incident register for § 7 / § 8.1 review.

Interlock families (one section per family in Annex B)

Part-Ready Interlock Family

Includes: mach1→conv-main:part-ready, similar producer→conveyor patterns.

Specific notes:

Fail-safe action is “conveyor stops at boundary.” That is a hard line stop, not a soft slowdown. Production loss is immediate.
The consumer fail-safe is intentional — Gary’s framing: “unless you get a positive indication you’re able to go on these tracks, you don’t go on these tracks.”
Re-arm: confirm with machining floor that the part is in fact ready before re-arming. Do not re-arm based on the interlock signal alone.

Part-Handoff Interlock Family

Includes: conv-main→asm1:part-handoff, similar conveyor→consumer patterns.

Specific notes:

Fail-safe is at the assembly entry station — the part may be queued at the entry but the station refuses pickup.
Re-arm: visual confirm at entry station, manual reset.

Utility Availability Interlock Family

Includes: powerhouse-1→all:steam-available, compressed-air availability, similar one-to-many utility broadcasts.

Specific notes:

One-to-many. A failure of the producer affects all consumers; a failure of one consumer’s path does not affect others.
The 500 ms expected update interval (vs 100 ms for production interlocks) reflects that utility state changes slowly. Don’t tune this faster; it just wastes bandwidth.
Re-arm: utility department confirms the resource is in fact available before consumers come back online.
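Combining the four-interval fail-safe rule from "When this runbook applies" with the update intervals quoted in these notes gives the silence each family tolerates before a consumer fails safe: 4 × 100 ms = 400 ms for production interlocks and 4 × 500 ms = 2000 ms for utility interlocks. The same arithmetic as a sketch, with the family keys and function names invented for illustration; per-interlock values in Annex B take precedence if they differ.

```python
FAILSAFE_INTERVALS = 4  # expected update intervals without a message

# Expected update intervals quoted in this runbook, in milliseconds.
EXPECTED_INTERVAL_MS = {"production": 100, "utility": 500}


def failsafe_timeout_ms(family: str) -> int:
    """Silence, in ms, after which a consumer in this family fails safe."""
    return FAILSAFE_INTERVALS * EXPECTED_INTERVAL_MS[family]


def is_failsafe_due(last_rx_ms: int, now_ms: int, family: str) -> bool:
    """True once four expected update intervals have elapsed since the last message."""
    return now_ms - last_rx_ms >= failsafe_timeout_ms(family)
```

This is useful during triage: a utility consumer that has been silent for one second has not yet hit its window, while a production consumer has.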

Limits of this runbook

This is a paper interim until tooling exists. Specifically, what’s missing:

Real-time visibility into the IT backbone middle. Operators cannot see what is happening between the boundary firewalls. They can only measure what arrives at their end.
Automated interlock-health dashboard. Currently the operator has to know to run the steps above. A passive monitor on the consumer side that surfaces sequence gaps, timestamp drift, and fail-safe state automatically would replace most of Step 2.
Shared backbone telemetry feed from IT. MSA § 6.2 commits IT to publishing backbone link-quality metrics. Once that feed is live, Step 4 collapses into “compare your numbers to theirs.”

Until those exist, this runbook is the contract.