A deterministic diagnosis-and-response framework for Post Fiat-style validator operators. The manifest converts observable runtime signals into precise failure classifications, recovery actions, retry logic, and escalation paths. Every rule is structured for direct ingestion by monitoring tools, automation agents, and operator runbooks. All decisions in this framework are fully deterministic and require no operator intuition.
The framework is built from real-world failures encountered while operating a non-Post-Fiat validator node, anonymized and abstracted into reusable rules. Failure scenarios labeled OBSERVED were directly experienced; scenarios labeled DOCUMENTED were derived from public operator discussions and inferred from system behavior.
Operational Capabilities Enabled
This framework enables:
- Deterministic classification of validator failures without operator guesswork
- Direct mapping of observable runtime signals to recovery actions
- Integration into monitoring tools and automation agents via a structured schema
- Reduced time-to-diagnosis and time-to-recovery during validator incidents
1. Signal Vocabulary
Operators inspect the following signals. Every rule in this manifest is expressed as a combination of values drawn from these enums. No rule references signals outside this vocabulary.
1.1 node_state
| Value | Meaning |
| --- | --- |
| Initializing | Node started; pre-handshake. |
| SessionStarted | Cluster session handshake initiated; not yet joined. |
| ReadyToJoin | Cluster session validated; awaiting consensus participation. |
| Ready | Joined cluster; participating in consensus. |
| Stopped | Process not running. |
| Unknown | State unreadable (process alive, status endpoint unresponsive). |
1.2 error_pattern
| Value | Trigger |
| --- | --- |
| none | No error in last poll window. |
| unauthorized_401 | HTTP 401 from cluster endpoint. |
| cluster_session_missing | ClusterSessionDoesNotExist or equivalent session-lookup failure. |
| bootstrap_unreachable | Bootstrap peer connection refused, timeout, or DNS failure. |
| version_mismatch | Cluster rejects node version; protocol or jar incompatibility. |
| config_invalid | Config parse error or required field missing. |
| disk_full | Write failure traceable to filesystem capacity. |
| port_bound | Configured port already in use by another process. |
1.3 peer_status
| Value | Definition |
| --- | --- |
| connected_quorum | Connected peer count ≥ quorum threshold. |
| connected_partial | Connected peer count > 0 but < quorum threshold. |
| isolated | Connected peer count = 0. |
| unknown | Peer endpoint unresponsive. |
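The peer_status definitions reduce to a few comparisons. The following is a minimal sketch, assuming the collector reports a connected-peer count (or `None` when the peer endpoint does not respond) and that the quorum threshold is supplied by configuration; the function name and parameters are hypothetical and not part of the manifest.

```python
from typing import Optional


def classify_peer_status(connected_peers: Optional[int], quorum_threshold: int) -> str:
    """Map a connected-peer count onto the peer_status enum.

    connected_peers is None when the peer endpoint did not respond
    (an assumption about how the collector reports that condition).
    """
    if connected_peers is None:
        return "unknown"
    if connected_peers == 0:
        return "isolated"
    if connected_peers >= quorum_threshold:
        return "connected_quorum"
    return "connected_partial"
```

Checking for zero peers before the quorum comparison keeps a misconfigured threshold of 0 from reporting an isolated node as connected.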
1.4 sync_status
| Value | Definition |
| --- | --- |
| advancing | Local ordinal/height increases across two consecutive poll windows. |
| stalled | Local ordinal/height unchanged across three or more consecutive poll windows. |
| regressing | Local ordinal/height decreases relative to previous window. |
| not_applicable | Node not yet Ready; sync not expected. |
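One possible reading of the sync_status definitions is sketched below. It assumes one ordinal sample per poll window (oldest first), treats "two consecutive poll windows" as the last two samples and "three or more" as the last three, and returns `None` when the definitions leave the state undetermined (for example, unchanged for only two windows). The function name and the `None` convention are assumptions, not part of the manifest.

```python
from typing import Optional, Sequence


def classify_sync_status(node_state: str, ordinals: Sequence[int]) -> Optional[str]:
    """Derive sync_status from per-window ordinal samples, oldest first."""
    if node_state != "Ready":
        return "not_applicable"
    # Compare the latest sample with the previous window.
    if len(ordinals) >= 2 and ordinals[-1] < ordinals[-2]:
        return "regressing"
    if len(ordinals) >= 2 and ordinals[-1] > ordinals[-2]:
        return "advancing"
    # Unchanged across the last three windows.
    if len(ordinals) >= 3 and ordinals[-1] == ordinals[-2] == ordinals[-3]:
        return "stalled"
    # Not enough history to decide; the caller keeps the previous value.
    return None
```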
1.5 restart_behavior
| Value | Definition |
| --- | --- |
| stable | Process uptime ≥ 10 minutes without exit. |
| crash_loop | Process has exited and restarted ≥ 3 times in 10 minutes. |
| manual_only | Auto-restart disabled; process up only when invoked manually. |
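A minimal sketch of deriving restart_behavior follows. The inputs (an auto-restart flag, process uptime, and a list of restart timestamps) are assumed to come from the process supervisor; the check order and the `None` result for a process that is neither stable nor looping yet are assumptions, since the definitions do not cover that case.

```python
from typing import Optional, Sequence

WINDOW_SECONDS = 600  # the 10-minute window used by the stable and crash_loop definitions


def classify_restart_behavior(
    auto_restart_enabled: bool,
    uptime_seconds: float,
    restart_timestamps: Sequence[float],
    now: float,
) -> Optional[str]:
    """Derive restart_behavior; timestamps and now are seconds on the same clock."""
    if not auto_restart_enabled:
        return "manual_only"
    # Count restarts that fall inside the trailing 10-minute window.
    recent = [t for t in restart_timestamps if now - t <= WINDOW_SECONDS]
    if len(recent) >= 3:
        return "crash_loop"
    if uptime_seconds >= WINDOW_SECONDS:
        return "stable"
    # Fewer than 3 recent restarts and under 10 minutes of uptime: undetermined.
    return None
```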
1.6 dwell_time
Time spent in the current node_state without transition. Expressed as one of:
| Value | Range |
| --- | --- |
| short | < 2 minutes |
| expected | 2–10 minutes |
| prolonged | 10–30 minutes |
| excessive | > 30 minutes |
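The ranges can be applied as a simple bucketing function. This sketch assumes dwell time is measured in seconds and resolves the shared boundaries by placing exactly 10 minutes and exactly 30 minutes in prolonged; that boundary choice and the function name are assumptions.

```python
def classify_dwell_time(dwell_seconds: float) -> str:
    """Bucket time spent in the current node_state into the dwell_time enum."""
    minutes = dwell_seconds / 60
    if minutes < 2:
        return "short"
    if minutes < 10:
        return "expected"
    if minutes <= 30:  # assumption: exactly 30 minutes is still prolonged
        return "prolonged"
    return "excessive"
```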
2. Failure Taxonomy
Each failure class has a stable identifier, severity rating, and the set of signal combinations that classify into it. Severity drives escalation timing only; classification drives action.
| Class ID | Name | Severity |
| --- | --- | --- |
| F-AUTH-01 | Cluster Authorization Failure | high |
| F-SESS-01 | Cluster Session Lost | high |
| F-BOOT-01 | Bootstrap Dependency Failure | high |
| F-VERS-01 | Version Incompatibility | critical |
| F-CONF-01 | Configuration Error | critical |
| F-JOIN-01 | Stalled Pre-Join | medium |
| F-SYNC-01 | Post-Join Sync Stall | medium |
| F-PEER-01 | Peer Isolation | medium |
| F-LOOP-01 | Crash Loop | critical |
| F-RES-01 | Resource Exhaustion | high |
Severity definitions: critical triggers escalation on first detection; high triggers escalation after the configured retry budget is exhausted (F-RES-01 has a retry budget of 0, so it escalates immediately in practice); medium triggers escalation only on repeat occurrence within a 24-hour window.
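The severity definitions above can be sketched as a single predicate. This covers only the severity-level timing; the per-class escalation triggers in Section 4 are defined separately. The function name is hypothetical, and counting the current detection toward `occurrences_last_24h` (so that a repeat means a count of 2 or more) is an assumption.

```python
def should_escalate(severity: str, budget_exhausted: bool, occurrences_last_24h: int) -> bool:
    """Apply the severity-level escalation timing.

    occurrences_last_24h includes the current detection (assumption).
    """
    if severity == "critical":
        return True  # escalate on first detection
    if severity == "high":
        return budget_exhausted
    if severity == "medium":
        return occurrences_last_24h >= 2  # repeat within the 24-hour window
    raise ValueError(f"unknown severity: {severity}")
```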
3. Decision Matrix
Signal combinations resolve to exactly one failure class. Rows are evaluated in order; the first match wins. Cells marked * accept any value of that signal.
| # | node_state | error_pattern | peer_status | sync_status | restart | dwell | → Class |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | * | version_mismatch | * | * | * | * | F-VERS-01 |
| 2 | * | config_invalid | * | * | * | * | F-CONF-01 |
| 3 | * | * | * | * | crash_loop | * | F-LOOP-01 |
| 4 | * | disk_full or port_bound | * | * | * | * | F-RES-01 |
| 5 | SessionStarted | unauthorized_401 | * | * | * | * | F-AUTH-01 |
| 6 | * | cluster_session_missing | * | * | * | * | F-SESS-01 |
| 7 | SessionStarted or Initializing | bootstrap_unreachable | * | * | * | * | F-BOOT-01 |
| 8 | SessionStarted or ReadyToJoin | none | * | * | stable | prolonged or excessive | F-JOIN-01 |
| 9 | Ready | * | isolated | * | stable | * | F-PEER-01 |
| 10 | Ready | * | connected_quorum or connected_partial | stalled or regressing | stable | * | F-SYNC-01 |
If no row matches, classification is F-UNKNOWN. F-UNKNOWN is itself an escalation trigger: an unmatched signal pattern indicates either incomplete signal vocabulary or a novel failure mode. The framework treats this as critical.
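First-match evaluation with an F-UNKNOWN fallback can be sketched as an ordered rule list. The rows below transcribe the decision matrix; the `RULES` structure and the `classify` function are illustrative names, not part of the manifest.

```python
from typing import Dict, List, Set, Tuple

# One entry per matrix row, in evaluation order. Only constrained signals
# are listed; a signal absent from a rule corresponds to "*".
RULES: List[Tuple[str, Dict[str, Set[str]]]] = [
    ("F-VERS-01", {"error_pattern": {"version_mismatch"}}),
    ("F-CONF-01", {"error_pattern": {"config_invalid"}}),
    ("F-LOOP-01", {"restart_behavior": {"crash_loop"}}),
    ("F-RES-01", {"error_pattern": {"disk_full", "port_bound"}}),
    ("F-AUTH-01", {"node_state": {"SessionStarted"},
                   "error_pattern": {"unauthorized_401"}}),
    ("F-SESS-01", {"error_pattern": {"cluster_session_missing"}}),
    ("F-BOOT-01", {"node_state": {"SessionStarted", "Initializing"},
                   "error_pattern": {"bootstrap_unreachable"}}),
    ("F-JOIN-01", {"node_state": {"SessionStarted", "ReadyToJoin"},
                   "error_pattern": {"none"},
                   "restart_behavior": {"stable"},
                   "dwell_time": {"prolonged", "excessive"}}),
    ("F-PEER-01", {"node_state": {"Ready"},
                   "peer_status": {"isolated"},
                   "restart_behavior": {"stable"}}),
    ("F-SYNC-01", {"node_state": {"Ready"},
                   "peer_status": {"connected_quorum", "connected_partial"},
                   "sync_status": {"stalled", "regressing"},
                   "restart_behavior": {"stable"}}),
]


def classify(signals: Dict[str, str]) -> str:
    """Return the class of the first matching row, or F-UNKNOWN if none match."""
    for class_id, constraints in RULES:
        if all(signals.get(name) in allowed for name, allowed in constraints.items()):
            return class_id
    return "F-UNKNOWN"
```

Note that a healthy snapshot (Ready, no error, quorum, advancing, stable) matches no row and therefore also falls through to F-UNKNOWN, so this sketch presumes it is invoked only after an alert has fired.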
4. Retry, Cooldown, and Escalation Rules
For each failure class, the system performs a deterministic recovery action. Each action has a fixed retry budget, cooldown window between attempts, success condition that confirms recovery, and escalation trigger that raises the failure to operator attention.
| Class | Action | Retry Budget | Cooldown | Success Condition | Escalation Trigger |
| --- | --- | --- | --- | --- | --- |
| F-AUTH-01 | Refresh credentials from secrets store; restart node. | 2 | 5 min | node_state == Ready within 10 min of restart | Budget exhausted, OR error_pattern == unauthorized_401 persists post-refresh. |
| F-SESS-01 | Stop node, clear local session cache, restart. | 2 | 5 min | node_state advances past SessionStarted within 10 min | Budget exhausted. |
| F-BOOT-01 | Switch to next bootstrap peer in fallback list, restart. | N (= number of fallback peers) | 2 min | node_state advances past SessionStarted within 10 min | All fallback peers exhausted. |
| F-VERS-01 | Halt automated recovery. | 0 | n/a | n/a | Immediate. Operator must perform version migration. |
| F-CONF-01 | Halt automated recovery. | 0 | n/a | n/a | Immediate. Operator must repair config. |
| F-JOIN-01 | Restart node. | 2 | 10 min | node_state == Ready within 15 min of restart | Budget exhausted. |
| F-SYNC-01 | Restart node. | 2 | 15 min | sync_status == advancing within 15 min of restart | Budget exhausted, OR sync_status == regressing post-restart. |
| F-PEER-01 | No restart. Wait one cooldown window, re-evaluate. | 3 | 10 min | peer_status advances to connected_partial or connected_quorum | Budget exhausted (peer isolation persists 30+ min). |
| F-LOOP-01 | Halt automated recovery. | 0 | n/a | n/a | Immediate. Auto-restart is causing the loop; operator must inspect. |
| F-RES-01 | Halt automated recovery. | 0 | n/a | n/a | Immediate. Operator must free disk or release port. |
| F-UNKNOWN | Halt automated recovery. | 0 | n/a | n/a | Immediate. Novel signal pattern requires human classification. |
Retry counts reset only on confirmed success (success condition met) or after a 24-hour quiet period with no recurrence.
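The interaction of retry budget, cooldown, and reset rules can be sketched as a small state tracker. This is an illustration under stated assumptions: the class and method names are hypothetical, timestamps are seconds on a monotonic clock, and `next_step` is expected to be called only after the previous attempt's success window has elapsed without success.

```python
from dataclasses import dataclass
from typing import Optional

QUIET_PERIOD_SECONDS = 86400  # 24-hour quiet period from the reset rule


@dataclass
class RetryState:
    retry_budget: int
    cooldown_seconds: int
    attempts_used: int = 0
    last_attempt_at: Optional[float] = None
    last_occurrence_at: Optional[float] = None

    def record_occurrence(self, now: float) -> None:
        """Note a detection; reset the counter if the quiet period has passed."""
        if (self.last_occurrence_at is not None
                and now - self.last_occurrence_at >= QUIET_PERIOD_SECONDS):
            self.attempts_used = 0
        self.last_occurrence_at = now

    def next_step(self, now: float) -> str:
        """Return 'escalate', 'wait', or 'attempt' for the current moment."""
        if self.attempts_used >= self.retry_budget:
            return "escalate"
        if (self.last_attempt_at is not None
                and now - self.last_attempt_at < self.cooldown_seconds):
            return "wait"
        return "attempt"

    def record_attempt(self, now: float) -> None:
        self.attempts_used += 1
        self.last_attempt_at = now

    def record_success(self) -> None:
        """Confirmed success (success condition met) resets the counter."""
        self.attempts_used = 0
        self.last_attempt_at = None
```

A class with a retry budget of 0 returns `escalate` on the first call, which matches the "Immediate" rows in the table.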
5. Automation-Ready Schema
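The following YAML schema is the canonical machine-readable representation of the classification and response rules above. A monitoring tool ingesting this schema can reproduce the manifest's matching, retry, and escalation behavior without referencing the prose sections. The thresholds that define each signal value (Section 1) and the severity timing (Section 2) are not encoded in the schema, so the signal collector must implement them separately.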
manifest_version: 1.0
poll_interval_seconds: 60
signal_vocabulary:
  node_state: [Initializing, SessionStarted, ReadyToJoin, Ready, Stopped, Unknown]
  error_pattern:
    - none
    - unauthorized_401
    - cluster_session_missing
    - bootstrap_unreachable
    - version_mismatch
    - config_invalid
    - disk_full
    - port_bound
  peer_status: [connected_quorum, connected_partial, isolated, unknown]
  sync_status: [advancing, stalled, regressing, not_applicable]
  restart_behavior: [stable, crash_loop, manual_only]
  dwell_time: [short, expected, prolonged, excessive]
failure_classes:
  - id: F-AUTH-01
    name: Cluster Authorization Failure
    severity: high
    match:
      node_state: [SessionStarted]
      error_pattern: [unauthorized_401]
    action:
      type: refresh_credentials_and_restart
      retry_budget: 2
      cooldown_seconds: 300
    success_condition:
      node_state: Ready
      within_seconds: 600
    escalation:
      on_budget_exhausted: true
      on_persistent_signal: unauthorized_401
    evidence_required: [auth_response_log, credential_refresh_log]
  - id: F-SESS-01
    name: Cluster Session Lost
    severity: high
    match:
      error_pattern: [cluster_session_missing]
    action:
      type: clear_session_cache_and_restart
      retry_budget: 2
      cooldown_seconds: 300
    success_condition:
      node_state_advances_past: SessionStarted
      within_seconds: 600
    escalation:
      on_budget_exhausted: true
    evidence_required: [session_error_log, cache_clear_log]
  - id: F-BOOT-01
    name: Bootstrap Dependency Failure
    severity: high
    match:
      node_state: [Initializing, SessionStarted]
      error_pattern: [bootstrap_unreachable]
    action:
      type: rotate_bootstrap_peer
      retry_budget: dynamic_fallback_list_length
      cooldown_seconds: 120
    success_condition:
      node_state_advances_past: SessionStarted
      within_seconds: 600
    escalation:
      on_budget_exhausted: true
    evidence_required: [bootstrap_attempt_log, peer_list]
  - id: F-VERS-01
    name: Version Incompatibility
    severity: critical
    match:
      error_pattern: [version_mismatch]
    action:
      type: halt
      retry_budget: 0
    escalation:
      immediate: true
    evidence_required: [version_handshake_log, local_version, expected_version]
  - id: F-CONF-01
    name: Configuration Error
    severity: critical
    match:
      error_pattern: [config_invalid]
    action:
      type: halt
      retry_budget: 0
    escalation:
      immediate: true
    evidence_required: [config_parse_error, config_path]
  - id: F-JOIN-01
    name: Stalled Pre-Join
    severity: medium
    match:
      node_state: [SessionStarted, ReadyToJoin]
      error_pattern: [none]
      restart_behavior: [stable]
      dwell_time: [prolonged, excessive]
    action:
      type: restart_node
      retry_budget: 2
      cooldown_seconds: 600
    success_condition:
      node_state: Ready
      within_seconds: 900
    escalation:
      on_budget_exhausted: true
    evidence_required: [state_transition_log, dwell_duration]
  - id: F-SYNC-01
    name: Post-Join Sync Stall
    severity: medium
    match:
      node_state: [Ready]
      peer_status: [connected_quorum, connected_partial]
      sync_status: [stalled, regressing]
      restart_behavior: [stable]
    action:
      type: restart_node
      retry_budget: 2
      cooldown_seconds: 900
    success_condition:
      sync_status: advancing
      within_seconds: 900
    escalation:
      on_budget_exhausted: true
      on_persistent_signal: regressing
    evidence_required: [ordinal_history, peer_status_history]
  - id: F-PEER-01
    name: Peer Isolation
    severity: medium
    match:
      node_state: [Ready]
      peer_status: [isolated]
      restart_behavior: [stable]
    action:
      type: wait_and_observe
      retry_budget: 3
      cooldown_seconds: 600
    success_condition:
      peer_status_in: [connected_partial, connected_quorum]
    escalation:
      on_budget_exhausted: true
    evidence_required: [peer_status_history, network_egress_check]
  - id: F-LOOP-01
    name: Crash Loop
    severity: critical
    match:
      restart_behavior: [crash_loop]
    action:
      type: halt
      retry_budget: 0
    escalation:
      immediate: true
    evidence_required: [process_exit_log, restart_count_window]
  - id: F-RES-01
    name: Resource Exhaustion
    severity: high
    match:
      error_pattern: [disk_full, port_bound]
    action:
      type: halt
      retry_budget: 0
    escalation:
      immediate: true
    evidence_required: [filesystem_usage, port_binding_check]
  - id: F-UNKNOWN
    name: Unmatched Signal Pattern
    severity: critical
    match:
      fallback: true
    action:
      type: halt
      retry_budget: 0
    escalation:
      immediate: true
    evidence_required: [full_signal_snapshot]
evaluation_order:
  - F-VERS-01
  - F-CONF-01
  - F-LOOP-01
  - F-RES-01
  - F-AUTH-01
  - F-SESS-01
  - F-BOOT-01
  - F-JOIN-01
  - F-PEER-01
  - F-SYNC-01
  - F-UNKNOWN
retry_reset_policy:
  on_success: true
  on_quiet_period_seconds: 86400
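The `success_condition` blocks in the schema use three shapes: an exact signal value, `node_state_advances_past`, and a `<signal>_in` list, each optionally bounded by `within_seconds`. The sketch below evaluates all three against an already-parsed condition (a plain dict). The function name is hypothetical, and the lifecycle ordering of node_state is an assumption inferred from the state descriptions in Section 1.1.

```python
from typing import Any, Dict

# Lifecycle order assumed from the node_state descriptions; Stopped and
# Unknown sit outside the progression and never count as "past" a state.
LIFECYCLE = ["Initializing", "SessionStarted", "ReadyToJoin", "Ready"]


def success_condition_met(condition: Dict[str, Any], signals: Dict[str, str],
                          elapsed_seconds: float) -> bool:
    """Check one success_condition block against the current signals."""
    deadline = condition.get("within_seconds")
    if deadline is not None and elapsed_seconds > deadline:
        return False
    for key, expected in condition.items():
        if key == "within_seconds":
            continue
        if key == "node_state_advances_past":
            state = signals.get("node_state")
            if state not in LIFECYCLE:
                return False
            if LIFECYCLE.index(state) <= LIFECYCLE.index(expected):
                return False
        elif key.endswith("_in"):
            # e.g. peer_status_in: strip the suffix to get the signal name.
            if signals.get(key[:-3]) not in expected:
                return False
        elif signals.get(key) != expected:
            return False
    return True
```

For instance, the F-BOOT-01 condition is satisfied by a node that reaches ReadyToJoin 360 seconds after the restart, as in Case 3.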
6. Privacy-Safe Example Cases
Each case maps directly into the schema above. Field names match the schema exactly. No real IPs, hostnames, secrets, or unreleased internal details appear.
Case 1 — OBSERVED
observed_signals:
  node_state: SessionStarted
  error_pattern: none
  peer_status: unknown
  sync_status: not_applicable
  restart_behavior: stable
  dwell_time: excessive
classified_as: F-JOIN-01
action_taken: restart_node
attempts_used: 1
outcome: success
success_condition_met:
  node_state: Ready
  elapsed_seconds: 480
Case 2 — OBSERVED
observed_signals:
  node_state: SessionStarted
  error_pattern: unauthorized_401
  peer_status: unknown
  sync_status: not_applicable
  restart_behavior: stable
  dwell_time: prolonged
classified_as: F-AUTH-01
action_taken: refresh_credentials_and_restart
attempts_used: 1
outcome: success
success_condition_met:
  node_state: Ready
  elapsed_seconds: 540
Case 3 — OBSERVED
observed_signals:
  node_state: Initializing
  error_pattern: bootstrap_unreachable
  peer_status: isolated
  sync_status: not_applicable
  restart_behavior: stable
  dwell_time: prolonged
classified_as: F-BOOT-01
action_taken: rotate_bootstrap_peer
attempts_used: 2
outcome: success
success_condition_met:
  node_state: ReadyToJoin
  elapsed_seconds: 360
note: Two of three bootstrap peers in fallback list were unreachable; third succeeded.
Case 4 — OBSERVED
observed_signals:
  node_state: Stopped
  error_pattern: version_mismatch
  peer_status: unknown
  sync_status: not_applicable
  restart_behavior: manual_only
  dwell_time: excessive
classified_as: F-VERS-01
action_taken: halt
attempts_used: 0
outcome: escalated
operator_resolution: Performed version migration per upgrade procedure (stop, replace binaries, clear data and logs, restart with updated systemd unit).
Case 5 — DOCUMENTED
observed_signals:
  node_state: Ready
  error_pattern: none
  peer_status: connected_partial
  sync_status: stalled
  restart_behavior: stable
  dwell_time: not_applicable
classified_as: F-SYNC-01
action_taken: restart_node
attempts_used: 1
outcome: success
success_condition_met:
  sync_status: advancing
  elapsed_seconds: 420
Case 6 — DOCUMENTED
observed_signals:
  node_state: Unknown
  error_pattern: none
  peer_status: unknown
  sync_status: not_applicable
  restart_behavior: crash_loop
  dwell_time: short
classified_as: F-LOOP-01
action_taken: halt
attempts_used: 0
outcome: escalated
operator_resolution: Disabled systemd auto-restart, inspected exit logs, identified upstream cause before re-enabling restart policy.
7. Operator Checklist
Use this checklist on every alert. Each step has a binary outcome; no step requires interpretation.
- Capture the signal snapshot. Record current values for all six signals (node_state, error_pattern, peer_status, sync_status, restart_behavior, dwell_time). Do not act on partial snapshots.
- Evaluate against the decision matrix in evaluation_order. First match wins. If no match, classification is F-UNKNOWN and escalation is immediate.
- Confirm the action is permitted. Check retry_budget for the class. If budget is exhausted, do not retry; escalate.
- Execute the action exactly as specified. No substitutions. If the schema says restart_node, restart the node; do not also clear caches or rotate peers unless the schema specifies them.
- Wait for the cooldown window. Cooldowns exist to let state settle. Acting before cooldown ends invalidates the success condition.
- Verify the success condition. Use the exact comparison in the schema (e.g., node_state: Ready with within_seconds: 600). Partial recovery is not success.
- Record the case. Log signals, classification, action, attempt count, and outcome in the format used in Section 6. This builds the dataset that future revisions of the manifest depend on.
- Reset the retry counter only on confirmed success or after the 24-hour quiet period. Never reset on partial recovery.
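The first checklist step (no action on partial snapshots) can be enforced mechanically before classification. The sketch below checks a snapshot against the signal vocabulary from Section 1; the `VOCABULARY` constant transcribes those enums, while the function name and the message wording are hypothetical.

```python
from typing import Dict, List

VOCABULARY: Dict[str, List[str]] = {
    "node_state": ["Initializing", "SessionStarted", "ReadyToJoin", "Ready",
                   "Stopped", "Unknown"],
    "error_pattern": ["none", "unauthorized_401", "cluster_session_missing",
                      "bootstrap_unreachable", "version_mismatch",
                      "config_invalid", "disk_full", "port_bound"],
    "peer_status": ["connected_quorum", "connected_partial", "isolated", "unknown"],
    "sync_status": ["advancing", "stalled", "regressing", "not_applicable"],
    "restart_behavior": ["stable", "crash_loop", "manual_only"],
    "dwell_time": ["short", "expected", "prolonged", "excessive"],
}


def snapshot_problems(snapshot: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the snapshot is usable."""
    problems: List[str] = []
    for signal, allowed in VOCABULARY.items():
        if signal not in snapshot:
            problems.append(f"missing signal: {signal}")
        elif snapshot[signal] not in allowed:
            problems.append(f"{signal}: {snapshot[signal]!r} is not in the vocabulary")
    for signal in snapshot:
        if signal not in VOCABULARY:
            problems.append(f"unexpected signal: {signal}")
    return problems
```

Running this before classification surfaces both missing signals and out-of-vocabulary values, so a malformed snapshot is rejected instead of silently falling through to F-UNKNOWN.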
Closing Note
This manifest is reusable network infrastructure, not a personal incident journal. Operators of any Post Fiat-style validator can adopt it as-is, fork the YAML schema into their monitoring stack, and extend the failure taxonomy by adding new entries that conform to the existing field structure. The signal vocabulary is deliberately finite: any failure mode that cannot be expressed in these signals is F-UNKNOWN by definition, which forces either a vocabulary extension (with operator review) or a novel-failure escalation. There are no special cases outside the system.