Step 10 is about humility.
Assume everything above can fail.
After Step 9, the system is:
- reliable
- observable
- power-aware
- outage-tolerant
- updatable
Step 10 adds the final safety net:
Automatic recovery when the software itself gets stuck.
This step assumes: even good code can fail in bad ways.
1. Purpose of Step 10
The watchdog answers one brutal question:
“What if the system stops making progress?”
Examples:
- deadlock
- infinite loop
- stuck driver
- memory corruption
- unforeseen corner case
No amount of retry logic can fix these, because the code that would run the retry is the code that is stuck.
2. Watchdog Philosophy
Senior rule:
“The watchdog is not a feature. It is a judge.”
Key principles:
- Watchdog does not understand logic
- Watchdog only observes liveness
- Reset is not failure; it is recovery
3. What the Watchdog Should Mean
A watchdog reset means:
“The application failed to uphold its contract of progress.”
It does NOT mean:
- the device is broken
- the design is wrong
It means the system protected itself.
4. Where to Feed the Watchdog (Critical)
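Never feed the watchdog in callbacks or drivers.
Timers and interrupts can keep firing while the main loop is hung, so a feed placed there would hide the hang.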
Correct rule:
Feed the watchdog only when the system makes forward progress.
In this architecture, that means:
- after successful state transitions
- after SEND success
- after successful WAIT completion
5. Application Context Update
struct app_ctx {
    enum app_state state;
    enum app_state recovery_state;
    uint32_t retry_count;
    k_timeout_t retry_delay;
    int last_error;
    struct sample_buffer samples;
    bool progress_made;   /* set on forward progress, cleared after each watchdog feed */
    struct mcp9808_ctx sensor;
    struct net_ctx net;
    struct http_ctx http;
};
progress_made is a logical signal, not hardware-specific.
6. Defining “Progress”
Progress means:
- State changes
- Successful SEND
- Completion of WAIT
- Successful SENSOR_INIT
- Successful NET_INIT
Progress does NOT mean:
- looping
- retrying without delay
- logging
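The rules above can be sketched as two small helpers. This is a minimal illustration, not the project's actual code: struct progress_ctx, the state names, app_set_state() and app_mark_result() are all hypothetical, and only the two fields the sketch needs are modelled.

```c
#include <stdbool.h>

/* Hypothetical subset of the application states used in this guide. */
enum app_state {
    STATE_BOOT,
    STATE_SENSOR_INIT,
    STATE_NET_INIT,
    STATE_SEND,
    STATE_WAIT
};

/* Only the fields this sketch needs; the real app_ctx carries more. */
struct progress_ctx {
    enum app_state state;
    bool progress_made;
};

/* A real state change counts as progress.
 * Re-entering the same state (looping, retrying) does not. */
static void app_set_state(struct progress_ctx *ctx, enum app_state next)
{
    if (ctx->state != next) {
        ctx->state = next;
        ctx->progress_made = true;
    }
}

/* Record the outcome of SEND, SENSOR_INIT, NET_INIT and similar.
 * Only success (err == 0) counts as progress. */
static void app_mark_result(struct progress_ctx *ctx, int err)
{
    if (err == 0) {
        ctx->progress_made = true;
    }
}
```

Keeping every write to progress_made inside helpers like these makes it easy to audit what the system treats as progress.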
7. Feeding the Watchdog (Conceptual)
if (ctx->progress_made) {
    watchdog_feed();
    ctx->progress_made = false;
}
This check happens once per main loop iteration.
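A slightly fuller sketch of one loop iteration, under stated assumptions: watchdog_feed() here is a counting stub (on Zephyr it would presumably wrap wdt_feed()), and struct loop_ctx, step_fn and the two step functions are hypothetical stand-ins for the real state machine.

```c
#include <stdbool.h>

/* Only the field this sketch needs from the application context. */
struct loop_ctx {
    bool progress_made;
};

/* Counting stub so the sketch runs on a host machine.
 * Assumption: real firmware would feed the hardware watchdog here. */
static unsigned int feed_count;

static void watchdog_feed(void)
{
    feed_count++;
}

/* Hypothetical state-machine step: returns true if it made progress. */
typedef bool (*step_fn)(void);

/* One main loop iteration: run the step, then feed at most once. */
static void main_loop_iteration(struct loop_ctx *ctx, step_fn step)
{
    if (step()) {
        ctx->progress_made = true;
    }

    if (ctx->progress_made) {
        watchdog_feed();
        ctx->progress_made = false;
    }
}

/* Example steps: one that progresses, one that is stuck. */
static bool step_ok(void)    { return true; }
static bool step_stuck(void) { return false; }
```

An iteration that makes no progress never reaches watchdog_feed(). If that keeps happening, the hardware timeout expires and the device resets.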
8. Timeout Selection
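Watchdog timeout must be:
- Longer than the longest valid gap between two feeds (the longest blocking operation, or chain of operations, that runs without a feed)
- Short enough that a hung device recovers within an acceptable time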
Example:
- HTTP timeout: 30s
- DNS timeout: 10s
- Watchdog: 60–90s
This ensures:
- real hangs trigger reset
- slow networks do not
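The example numbers can be checked with simple arithmetic. The sketch below assumes the worst case is a DNS lookup followed by an HTTP request with no feed in between, and adds a 20s margin. The margin value and the helper name wdt_min_timeout_s() are illustrative assumptions, not part of the design.

```c
#include <stddef.h>
#include <stdint.h>

#define DNS_TIMEOUT_S   10u
#define HTTP_TIMEOUT_S  30u
#define WDT_MARGIN_S    20u   /* assumed safety margin, tune per product */

/* Smallest safe watchdog timeout: the sum of all blocking operations
 * that may run back to back without a feed, plus a margin. */
static uint32_t wdt_min_timeout_s(const uint32_t *ops_s, size_t count,
                                  uint32_t margin_s)
{
    uint32_t total = margin_s;

    for (size_t i = 0; i < count; i++) {
        total += ops_s[i];
    }
    return total;
}
```

With the values above, 10 + 30 + 20 gives 60s, the lower end of the 60–90s range.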
9. Interaction with WAIT and Power Management
Important rule:
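Feed the watchdog immediately before a long sleep, never during it.
Correct behavior:
- Feed watchdog
- Enter WAIT sleep
- Wake up
- Feed again after progress
Feeding just before the sleep gives it a full timeout window, so a WAIT interval shorter than the watchdog timeout is not mistaken for a hang.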
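The sequence above could look roughly like this. It is a sketch under assumptions: the watchdog is assumed to keep counting during sleep, so a WAIT interval that does not fit inside the timeout is rejected. watchdog_feed() and sleep_seconds() are host-side stubs (on Zephyr the sleep would likely be k_sleep()), and wait_state_run() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

struct wait_ctx {
    bool progress_made;
};

/* Stubs so the sketch runs on a host machine. */
static unsigned int feeds;
static uint32_t slept_s;

static void watchdog_feed(void)         { feeds++; }
static void sleep_seconds(uint32_t sec) { slept_s += sec; }

/* Run the WAIT state once.
 * Returns 0 on success, -1 if the sleep cannot fit inside the
 * watchdog window (assumption: the watchdog runs during sleep). */
static int wait_state_run(struct wait_ctx *ctx, uint32_t wait_s,
                          uint32_t wdt_timeout_s)
{
    if (wait_s >= wdt_timeout_s) {
        return -1;
    }

    watchdog_feed();            /* start the sleep with a full window */
    sleep_seconds(wait_s);      /* WAIT sleep */
    ctx->progress_made = true;  /* completing WAIT counts as progress;
                                   the main loop feeds on the next pass */
    return 0;
}
```

If the watchdog is paused during sleep on the target hardware, the interval check is unnecessary. That depends on the SoC and its watchdog configuration.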
10. Handling Reset After Watchdog
On boot:
- MCUboot runs first
- Application starts fresh
- State = BOOT
Optional:
- Persist reset reason
- Log “watchdog reset”
The system treats reset as recovery, not error.
11. Logging Watchdog Events
Log sparingly:
- Watchdog started
- Watchdog fed (DEBUG only)
- Reset detected
Example:
Reset reason: watchdog
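A minimal sketch of turning a reset cause into that log line. The CAUSE_* flags and both function names are hypothetical. On Zephyr the bitmask would presumably come from hwinfo_get_reset_cause(), which has its own flag values such as RESET_WATCHDOG.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reset-cause flags, for illustration only. */
#define CAUSE_POWER_ON  (1u << 0)
#define CAUSE_SOFTWARE  (1u << 1)
#define CAUSE_WATCHDOG  (1u << 2)

/* Map a cause bitmask to a short name.
 * Watchdog is checked first so it wins when several bits are set. */
static const char *reset_reason_name(uint32_t cause)
{
    if (cause & CAUSE_WATCHDOG) {
        return "watchdog";
    }
    if (cause & CAUSE_SOFTWARE) {
        return "software";
    }
    if (cause & CAUSE_POWER_ON) {
        return "power-on";
    }
    return "unknown";
}

/* Build the single boot-time log line. */
static int format_reset_log(char *buf, size_t len, uint32_t cause)
{
    return snprintf(buf, len, "Reset reason: %s", reset_reason_name(cause));
}
```

Logging this once at boot, before the state machine enters BOOT, keeps the event visible without adding noise.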
12. What Step 10 Deliberately Avoids
- Feeding watchdog everywhere
- Multiple watchdogs
- Complex health metrics
The watchdog must remain simple and strict.
13. Success Criteria for Step 10
Step 10 is complete when:
- System resets on true hangs
- Normal slow operations do not trigger resets
- Watchdog logic can be explained in one sentence: feed only on forward progress
- Reset is safe and recoverable
14. Architectural Status After Step 10
After Step 10, the device has:
- graceful failure handling
- controlled retries
- safe self-reset
This is production-hardened firmware.
15. Final Note
The watchdog is the last safety net.
If it ever triggers:
- the system did its job
- the reset is a recovery, not a sign that everything went wrong
That mindset is critical in production systems.