Step 6 answers a deceptively simple question: When things fail repeatedly, how should a device behave over time?

Part A – English Version

Why Step 6 Exists

After Step 5, our firmware:

  • can fail safely
  • can explain failures
  • can recover structurally

But it still behaves like an impatient human.

Without an explicit retry policy, systems tend to:

  • retry immediately
  • retry too often
  • retry forever

This hurts:

  • power consumption
  • network infrastructure
  • backend services
  • device credibility

Step 6 turns retries into policy, not accidents.


The Core Insight

Retries are not a technical detail. They are a behavioral decision.

How often you retry says something about:

  • how valuable the data is
  • how patient the device is
  • how respectful it is to the network

This must be explicit.


The Anti-Pattern: Inline Retries

A common mistake:

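/* Anti-pattern: the retry policy lives inside a subsystem call. */
for (int i = 0; i < 5; i++) {      /* retry count hidden in a loop bound */
    if (send() == 0) {
        return 0;
    }
    k_sleep(K_SECONDS(1));         /* fixed delay chosen arbitrarily */
}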

Problems:

  • retry count is hidden
  • delay is arbitrary
  • power impact is unclear
  • logs are noisy

Most importantly: policy is buried in code.


Centralizing Retry Policy

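Retries belong to the application, not to individual subsystems. A subsystem reports the failure; the application decides whether and when to try again.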

We extend the context:

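struct app_ctx {
    enum app_state state;           /* current state */
    enum app_state recovery_state;  /* state to resume after WAIT */

    uint32_t retry_count;           /* failures handled by ERROR so far */
    int last_error;                 /* most recent error code */

    k_timeout_t retry_delay;        /* set by ERROR, consumed by WAIT */
};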

Retry behavior is now visible state.
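
One way to keep that state consistent is to route every update through two small helpers. The sketch below is a host-buildable illustration rather than the code from this series: `app_report_failure()` and `app_report_success()` are hypothetical names, the enum values are placeholders, `retry_delay` is left out so the example does not depend on Zephyr headers, and resetting `retry_count` on success is an assumption that this step does not spell out.

```c
#include <stdint.h>

/* Placeholder states; the real enum comes from earlier steps. */
enum app_state {
    APP_STATE_NET_INIT,
    APP_STATE_SEND,
    APP_STATE_ERROR,
    APP_STATE_WAIT,
};

/* Trimmed copy of the context (retry_delay omitted for host builds). */
struct app_ctx {
    enum app_state state;
    enum app_state recovery_state;
    uint32_t retry_count;
    int last_error;
};

/* Hypothetical helper: a subsystem call failed, so record why and
 * where to resume, then hand control to the ERROR state. */
static void app_report_failure(struct app_ctx *ctx, int err,
                               enum app_state recover_to)
{
    ctx->last_error = err;
    ctx->recovery_state = recover_to;
    ctx->state = APP_STATE_ERROR;
}

/* Hypothetical helper: an operation succeeded. Clearing the retry
 * history here is an assumption; it makes the next failure start
 * again from the shortest delay. */
static void app_report_success(struct app_ctx *ctx, enum app_state next)
{
    ctx->retry_count = 0;
    ctx->last_error = 0;
    ctx->state = next;
}
```

Subsystems never touch `retry_count` themselves; only the ERROR state increments it.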


Designing a Backoff Strategy

We start simple and deterministic:

static k_timeout_t calc_backoff(uint32_t retry)
{
    if (retry < 3) {
        return K_SECONDS(10);
    } else if (retry < 6) {
        return K_MINUTES(1);
    } else {
        return K_MINUTES(5);
    }
}

Why this works:

  • predictable
  • debuggable
  • power-friendly

Randomization can be added later.
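
To see what this schedule adds up to, the same thresholds can be written as a table in plain C and summed. This is a host-side sketch under assumptions: `backoff_ms()` and `total_wait_ms()` are illustrative names, and delays are expressed in milliseconds instead of `k_timeout_t` so the code builds without Zephyr.

```c
#include <stddef.h>
#include <stdint.h>

/* One row per tier: the delay applies while retry < below. */
struct backoff_tier {
    uint32_t below;
    uint32_t delay_ms;
};

static const struct backoff_tier tiers[] = {
    { 3, 10u * 1000u },   /* retries 0..2: 10 s */
    { 6, 60u * 1000u },   /* retries 3..5: 1 min */
};

#define BACKOFF_MAX_MS (5u * 60u * 1000u)   /* retries 6 and up: 5 min */

/* Same thresholds as calc_backoff(), returned as milliseconds. */
static uint32_t backoff_ms(uint32_t retry)
{
    for (size_t i = 0; i < sizeof(tiers) / sizeof(tiers[0]); i++) {
        if (retry < tiers[i].below) {
            return tiers[i].delay_ms;
        }
    }
    return BACKOFF_MAX_MS;
}

/* Total quiet time accumulated over `failures` consecutive failures. */
static uint64_t total_wait_ms(uint32_t failures)
{
    uint64_t total = 0;

    for (uint32_t r = 0; r < failures; r++) {
        total += backoff_ms(r);
    }
    return total;
}
```

With these values, eight consecutive failures keep the device quiet for 3 × 10 s + 3 × 60 s + 2 × 300 s = 810 s in total, a figure that feeds directly into a power budget.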


Applying Backoff in ERROR

case APP_STATE_ERROR:
    ctx->retry_delay = calc_backoff(ctx->retry_count);

    /* k_timeout_t stores kernel ticks, so convert before logging as ms. */
    LOG_ERR("Error %d, retry %u, wait %llu ms, recover to %s",
            ctx->last_error,
            ctx->retry_count,
            (unsigned long long)k_ticks_to_ms_floor64(ctx->retry_delay.ticks),
            state_str(ctx->recovery_state));

    ctx->retry_count++;
    ctx->state = APP_STATE_WAIT;
    break;

The ERROR state now:

  • decides delay
  • logs policy
  • remains centralized

WAIT Becomes Meaningful

WAIT is no longer a dumb sleep. Its duration comes from the policy that ERROR just computed.

case APP_STATE_WAIT:
    k_sleep(ctx->retry_delay);
    ctx->state = ctx->recovery_state;
    break;

All waiting behavior flows through one state.
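
The ERROR → WAIT → recovery cycle can be exercised on a host before it ever runs on hardware. The following is a minimal simulation under stated assumptions: `fake_net_init()`, `simulate()`, `APP_STATE_DONE`, and the `slept_ms` counter are invented for illustration, the delay is a plain millisecond value standing in for `k_timeout_t`, and adding to `slept_ms` stands in for `k_sleep()`.

```c
#include <stdint.h>

enum app_state {
    APP_STATE_NET_INIT,
    APP_STATE_ERROR,
    APP_STATE_WAIT,
    APP_STATE_DONE,      /* hypothetical terminal state for the simulation */
};

struct app_ctx {
    enum app_state state;
    enum app_state recovery_state;
    uint32_t retry_count;
    int last_error;
    uint32_t retry_delay_ms;   /* stand-in for k_timeout_t */
};

static uint64_t slept_ms;      /* total simulated sleep time */
static int failures_left;      /* how many times fake_net_init() still fails */

static uint32_t calc_backoff_ms(uint32_t retry)
{
    if (retry < 3) {
        return 10u * 1000u;
    } else if (retry < 6) {
        return 60u * 1000u;
    }
    return 5u * 60u * 1000u;
}

/* Hypothetical subsystem call: fails a configurable number of times. */
static int fake_net_init(void)
{
    if (failures_left > 0) {
        failures_left--;
        return -110;
    }
    return 0;
}

static void app_step(struct app_ctx *ctx)
{
    switch (ctx->state) {
    case APP_STATE_NET_INIT: {
        int err = fake_net_init();

        if (err != 0) {
            ctx->last_error = err;
            ctx->recovery_state = APP_STATE_NET_INIT;
            ctx->state = APP_STATE_ERROR;
        } else {
            ctx->state = APP_STATE_DONE;
        }
        break;
    }
    case APP_STATE_ERROR:
        ctx->retry_delay_ms = calc_backoff_ms(ctx->retry_count);
        ctx->retry_count++;
        ctx->state = APP_STATE_WAIT;
        break;
    case APP_STATE_WAIT:
        slept_ms += ctx->retry_delay_ms;   /* the only place that "sleeps" */
        ctx->state = ctx->recovery_state;
        break;
    case APP_STATE_DONE:
        break;
    }
}

/* Run the state machine until it succeeds; return total time spent waiting. */
static uint64_t simulate(int failures)
{
    struct app_ctx ctx = { .state = APP_STATE_NET_INIT };

    slept_ms = 0;
    failures_left = failures;
    while (ctx.state != APP_STATE_DONE) {
        app_step(&ctx);
    }
    return slept_ms;
}
```

For example, `simulate(4)` returns 90000: three 10 s waits followed by one 1 min wait before the fifth attempt succeeds.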


Why This Is Predictable

Given logs like:

Error -110, retry 4, wait 60000 ms, recover to NET_INIT

An engineer can immediately answer:

  • how long the device will be quiet
  • what it will try next
  • why it behaved that way

This is professionalism.


Avoiding Retry Storms

Without backoff:

  • thousands of devices retry together
  • servers get overloaded
  • outages cascade

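With explicit backoff:

  • each device retries less often the longer a failure persists
  • total load on a struggling backend falls instead of staying constant
  • systems degrade gracefully

Because this backoff is deterministic, devices that fail at the same moment still retry in step with one another. Only randomization (jitter) spreads them apart in time.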

This matters at scale.
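
Step 6 itself stays deterministic, but it helps to see how small the later jitter extension could be. The sketch below is one possible approach and not part of this step: `add_jitter_ms()` is a hypothetical helper, delays are in milliseconds, and the xorshift32 generator is there only to keep the example self-contained. On a real device the random value would come from the platform's random source (on Zephyr, something like `sys_rand32_get()`).

```c
#include <stdint.h>

/* xorshift32: tiny PRNG used only so the sketch has no dependencies.
 * The state must be seeded with a nonzero value, for example a
 * per-device ID (an assumption for this illustration). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Return base_ms plus a random extra of 0..percent% of base_ms.
 * The delay is only ever lengthened, never shortened. */
static uint32_t add_jitter_ms(uint32_t base_ms, uint32_t percent,
                              uint32_t *rng_state)
{
    uint32_t span = (uint32_t)(((uint64_t)base_ms * percent) / 100u);

    if (span == 0) {
        return base_ms;
    }
    return base_ms + xorshift32(rng_state) % (span + 1u);
}
```

With 20% jitter on the one-minute tier, each device would wait somewhere between 60 s and 72 s, so devices that failed together no longer return together.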


What Step 6 Does NOT Do

Step 6 does not:

  • add randomness
  • distinguish error types
  • persist retry state

Those are product-level decisions.

We start with clarity.


A Reviewer’s Perspective

A reviewer can now see:

  • retry behavior clearly
  • delays explicitly
  • no hidden loops

And can reason about fleet behavior.


Final Thought (English)

A polite device retries thoughtfully. An impatient one becomes noise.

Step 6 teaches patience.


Part B – Vietnamese Version (English translation)

Why Step 6 Is Needed

After Step 5, the system already:

  • handles errors
  • logs clearly

But without a retry policy:

  • the system becomes impatient
  • it drains the battery
  • it hammers the backend

Step 6 turns retries into deliberate behavior.


The Core Insight

A retry is a behavioral decision, not a technical detail.


Anti-pattern: Inline Retries

Retries inside subsystem code are:

  • hard to see
  • hard to review
  • hard to change


Centralizing Retry Policy

Retries belong to the application.

The context is extended explicitly.


Simple but Effective Backoff

Deterministic backoff is:

  • easy to debug
  • easy to explain


ERROR Decides the Policy

ERROR:

  • computes the delay
  • logs
  • transitions to WAIT

Nowhere else does this.


WAIT Has Meaning

WAIT is the only place allowed to sleep.


Avoiding Retry Storms

Controlled retries help:

  • the system stay stable
  • the backend survive


What Step 6 Does NOT Do

  • no randomness
  • no error classification

Clarity comes first.


Final Thought (Vietnamese)

A polite device retries thoughtfully. An impatient device becomes a burden.

Step 6 teaches the device patience.