Step 10 is about humility.
Assume everything above can fail.

After Step 9, the system is:

  • reliable
  • observable
  • power-aware
  • outage-tolerant
  • updatable

Step 10 adds the final safety net:

Automatic recovery when the software itself gets stuck.

This step assumes: even good code can fail in bad ways.


1. Purpose of Step 10

The watchdog answers one brutal question:

“What if the system stops making progress?”

Examples:

  • deadlock
  • infinite loop
  • stuck driver
  • memory corruption
  • unforeseen corner case

No amount of retry logic can fix these, because the code that would perform the retry is itself stuck.


2. Watchdog Philosophy

Senior rule:

“The watchdog is not a feature. It is a judge.”

Key principles:

  • Watchdog does not understand logic
  • Watchdog only observes liveness
  • Reset is not failure; it is recovery

3. What the Watchdog Should Mean

A watchdog reset means:

“The application failed to uphold its contract of progress.”

It does NOT mean:

  • the device is broken
  • the design is wrong

It means the system protected itself.


4. Where to Feed the Watchdog (Critical)

Never feed the watchdog in callbacks or drivers. They can keep running while the application itself is stuck.

Correct rule:

Feed the watchdog only when the system makes forward progress.

In this architecture, that means:

  • after successful state transitions
  • after SEND success
  • after successful WAIT completion
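
The rule above can be sketched as follows: state handlers only report progress, and one function owns the feed. This is a minimal host-side sketch, not the project's actual code. The names on_send_result() and watchdog_service() are hypothetical, the context is trimmed to a single field, and watchdog_feed() is a stand-in that only counts calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Trimmed-down context: only the field this sketch needs. */
struct app_ctx {
    bool progress_made;
};

/* Hypothetical stand-in for the hardware feed call; it only counts feeds. */
static unsigned int feed_count;

static void watchdog_feed(void)
{
    feed_count++;
}

/* State handlers report progress; they never touch the watchdog. */
static void on_send_result(struct app_ctx *ctx, int rc)
{
    assert(ctx != NULL);
    if (rc == 0) {
        ctx->progress_made = true;   /* SEND succeeded */
    }
    /* On failure the flag stays false, so no feed follows. */
}

/* The single feed point, called once per main loop iteration. */
static void watchdog_service(struct app_ctx *ctx)
{
    assert(ctx != NULL);
    if (ctx->progress_made) {
        watchdog_feed();
        ctx->progress_made = false;
    }
}
```

Because only watchdog_service() calls the feed, a handler that spins or keeps failing can never keep the watchdog alive by accident.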

5. Application Context Update

struct app_ctx {
    enum app_state state;
    enum app_state recovery_state;

    uint32_t retry_count;
    k_timeout_t retry_delay;
    int last_error;

    struct sample_buffer samples;

    bool progress_made;   /* set on forward progress, cleared after each feed */

    struct mcp9808_ctx sensor;
    struct net_ctx net;
    struct http_ctx http;
};

progress_made is a logical signal set by the application, not a hardware-specific flag.


6. Defining “Progress”

Progress means:

  • State changes
  • Successful SEND
  • Completion of WAIT
  • Successful SENSOR_INIT
  • Successful NET_INIT

Progress does NOT mean:

  • looping
  • retrying without delay
  • logging
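
One way to encode the distinction between progress and mere activity is sketched below. The state names and the helpers app_transition() and app_retry() are assumptions made for illustration; only the field names come from the context struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the state machine's states. */
enum app_state { APP_BOOT, APP_SENSOR_INIT, APP_NET_INIT, APP_SEND, APP_WAIT };

struct app_ctx {
    enum app_state state;
    uint32_t retry_count;
    bool progress_made;
};

/* A real state change counts as progress. */
static void app_transition(struct app_ctx *ctx, enum app_state next)
{
    assert(ctx != NULL);
    if (next != ctx->state) {
        ctx->state = next;
        ctx->progress_made = true;
    }
    /* Re-entering the current state is looping, not progress. */
}

/* A retry stays in the same state, so it is deliberately not progress. */
static void app_retry(struct app_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->retry_count++;
}
```

With this split, a device that retries forever in one state stops feeding the watchdog and is eventually reset.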

7. Feeding the Watchdog (Conceptual)

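/* Single feed point for the whole application. */
if (ctx->progress_made) {
    watchdog_feed();              /* conceptual wrapper around the hardware feed */
    ctx->progress_made = false;   /* progress must be earned again before the next feed */
}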

This check happens once per main loop iteration.


8. Timeout Selection

Watchdog timeout must be:

  • Longer than the longest valid gap between feeds (the worst-case chain of blocking operations)
  • Short enough that a hung device recovers within an acceptable downtime

Example:

  • HTTP timeout: 30s
  • DNS timeout: 10s
  • Watchdog: 60–90s (above the 40s worst case if DNS and HTTP run back to back)

This ensures:

  • real hangs trigger reset
  • slow networks do not
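
The arithmetic behind the example can be checked at build time. The sketch below uses the timeouts from the example and makes two assumptions: DNS and HTTP run back to back with no feed in between, and a 10 s safety margin is wanted. The macro names and wdt_timeout_is_safe() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the example above, in milliseconds. */
#define DNS_TIMEOUT_MS   10000u
#define HTTP_TIMEOUT_MS  30000u
#define WDT_TIMEOUT_MS   60000u

/* Assumed safety margin; pick a value that suits the product. */
#define WDT_MARGIN_MS    10000u

/* Worst case between two feeds: DNS followed directly by HTTP. */
#define WORST_CASE_GAP_MS (DNS_TIMEOUT_MS + HTTP_TIMEOUT_MS)

/* Fail the build if the watchdog could fire during a valid slow operation. */
static_assert(WORST_CASE_GAP_MS + WDT_MARGIN_MS <= WDT_TIMEOUT_MS,
              "watchdog timeout too short for worst-case blocking path");

/* Runtime variant of the same check, for timeouts loaded from configuration. */
static bool wdt_timeout_is_safe(uint32_t worst_gap_ms, uint32_t margin_ms,
                                uint32_t wdt_ms)
{
    return worst_gap_ms <= wdt_ms && margin_ms <= wdt_ms - worst_gap_ms;
}
```

With these numbers the worst-case gap is 40 s, so a 60 s watchdog leaves 20 s of headroom, and raising either network timeout past that headroom breaks the build instead of causing resets in the field.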

9. Interaction with WAIT and Power Management

Important rule:

Do not enter a long sleep without feeding the watchdog first, and keep each sleep shorter than the watchdog timeout.

Correct behavior:

  • Feed watchdog
  • Enter WAIT sleep
  • Wake up
  • Feed again after progress

This ensures sleep itself is not mistaken for a hang.
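
The feed, sleep, wake sequence listed above might look like the sketch below. All names are hypothetical stand-ins: watchdog_feed() and platform_sleep_ms() only record what happened, and wait_state_run() is not part of the original design. The pre-sleep feed is written out explicitly for clarity; in the main loop it would be the feed triggered by the transition into WAIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct app_ctx {
    bool progress_made;
};

/* Hypothetical stand-ins that record what happened, for illustration only. */
static unsigned int feed_count;
static uint32_t slept_ms;

static void watchdog_feed(void) { feed_count++; }
static void platform_sleep_ms(uint32_t ms) { slept_ms += ms; }

/*
 * Run one WAIT period. Returns false without sleeping if the period would
 * outlast the watchdog, since a single feed cannot cover it.
 */
static bool wait_state_run(struct app_ctx *ctx, uint32_t wait_ms,
                           uint32_t wdt_timeout_ms)
{
    assert(ctx != NULL);
    if (wait_ms >= wdt_timeout_ms) {
        return false;
    }
    watchdog_feed();              /* start the sleep with a full timeout */
    platform_sleep_ms(wait_ms);   /* WAIT sleep */
    ctx->progress_made = true;    /* WAIT completed: the main loop feeds next */
    return true;
}
```

Rejecting a WAIT that is longer than the watchdog timeout turns a configuration mistake into a visible error instead of a reset loop.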


10. Handling Reset After Watchdog

On boot:

  • MCUboot runs first
  • Application starts fresh
  • State = BOOT

Optional:

  • Persist reset reason
  • Log “watchdog reset”

The system treats reset as recovery, not error.
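
A boot-time handler along these lines could record and log the reset reason. The flag values, the last_reset_was_watchdog field, and the helper names are hypothetical, chosen to keep the sketch self-contained. On Zephyr, the hwinfo API exposes a comparable bitmask through hwinfo_get_reset_cause() with a RESET_WATCHDOG flag on SoCs that support it, and that value could be passed in instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reset-cause flags; real values come from the platform. */
#define RESET_CAUSE_POWER_ON  (1u << 0)
#define RESET_CAUSE_SOFTWARE  (1u << 1)
#define RESET_CAUSE_WATCHDOG  (1u << 2)

enum app_state { APP_BOOT, APP_SEND };

struct app_ctx {
    enum app_state state;
    bool last_reset_was_watchdog;   /* hypothetical field for diagnostics */
};

/* Map the cause bitmask to a short name, watchdog first. */
static const char *reset_reason_name(uint32_t cause)
{
    if (cause & RESET_CAUSE_WATCHDOG) return "watchdog";
    if (cause & RESET_CAUSE_SOFTWARE) return "software";
    if (cause & RESET_CAUSE_POWER_ON) return "power-on";
    return "unknown";
}

/* Boot always starts from BOOT; the cause is recorded, not acted upon. */
static void app_boot(struct app_ctx *ctx, uint32_t cause)
{
    assert(ctx != NULL);
    ctx->state = APP_BOOT;
    ctx->last_reset_was_watchdog = (cause & RESET_CAUSE_WATCHDOG) != 0;
    printf("Reset reason: %s\n", reset_reason_name(cause));
}
```

The startup path is identical for every reset cause, which keeps a watchdog reset on the normal recovery path.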


11. Logging Watchdog Events

Log sparingly:

  • Watchdog started
  • Watchdog fed (DEBUG only)
  • Reset detected

Example:

Reset reason: watchdog

12. What Step 10 Deliberately Avoids

  • Feeding watchdog everywhere
  • Multiple watchdogs
  • Complex health metrics

The watchdog must remain simple and strict.


13. Success Criteria for Step 10

Step 10 is complete when:

  • System resets on true hangs
  • Normal slow operations do not trigger resets
  • Watchdog logic is explainable
  • Reset is safe and recoverable

14. Architectural Status After Step 10

After Step 10, the device has:

  • graceful failure handling
  • controlled retries
  • safe self-reset

This is production-hardened firmware.


15. Final Note

The watchdog is the last safety net.

If it ever triggers:

  • the system did its job
  • the failure was contained instead of becoming a permanent hang

That mindset is critical in production systems.