Problem
What's actually being polled
SP-Legends is a Roblox community hub built around +1 Skill Point Legends.
The visible feature on the home page is the live private-server tracker: a list
of community-shared private servers with current player counts, refreshing in
near-real-time. "Near-real-time" sets a budget. The Roblox API is the upstream;
rate limits set the ceiling.
Cadence
Per-endpoint cadence: 20 s fast / 40 s slow
Picking the polling interval is the most consequential design decision. Users
cannot tell the difference between "5 seconds stale" and "20 seconds stale" on
a list of player counts. They can tell the difference between "live"
and "broken because we got rate-limited", which is what too-fast polling
eventually buys you. The interval is also bounded below by the upstream's own
update cadence: there is no point pulling fresher than the API is willing to
serve.
Two endpoints feed the live page, and they don't change on the same timescale,
so they don't share a tick:
- Fast (20 s). The paginated /private-servers list. This is what users actually watch refresh: per-server populations, who's in which server. Freshness here is the user-visible feature.
- Slow (40 s). The aggregate /games?universeIds=… total-player count. This number shifts on a minute scale and isn't worth a request every 20 seconds; it's polled every other fast tick and the previous value is reused in between.
Twenty seconds is the smallest fast cadence that:
- Looks live in human perception.
- Stays well below any plausible per-IP rate limit on the endpoints I touch.
- Survives a 2x or 3x burst (e.g. a manual refresh on top of the loop) without tripping anything.
Splitting the cadence roughly halved the per-hour request count against the
slow endpoint without affecting freshness on the page.
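The two-cadence schedule can be sketched as a single tick counter, with the slow endpoint riding on every other fast tick. This is a minimal illustration, not the site's actual code: `TickPlan`, `plan_tick`, and `requests_per_hour` are hypothetical names, and the count is per endpoint per tick (a paginated list may need more than one request per refresh).

```rust
use std::time::Duration;

/// Fast cadence from the text: the private-servers list refreshes every tick.
pub const FAST_TICK: Duration = Duration::from_secs(20);
/// The slow endpoint is polled on every Nth fast tick (2 x 20 s = 40 s).
pub const SLOW_EVERY: u64 = 2;

/// Which endpoints a given tick should hit (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TickPlan {
    pub poll_private_servers: bool,
    pub poll_total_players: bool,
}

/// Decide what to poll on tick number `tick` (0-based).
pub fn plan_tick(tick: u64) -> TickPlan {
    TickPlan {
        poll_private_servers: true,
        // Between slow ticks the previous total-player value is reused.
        poll_total_players: tick % SLOW_EVERY == 0,
    }
}

/// Refreshes per hour for one endpoint polled at `interval`.
pub fn requests_per_hour(interval: Duration) -> u64 {
    3600 / interval.as_secs().max(1)
}
```

Under these numbers, polling the aggregate on every fast tick would cost 180 refreshes per hour; polling it on every other tick costs 90.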
Shape
The refresh task
The refresh task is one tokio task spawned at startup. It loops:
- Sleep until the next tick.
- Make the small set of HTTP requests needed to refresh the model.
- Parse responses into typed values (no string-shaped state).
- Write the new snapshot into the in-memory AppState with a single swap, not a series of partial mutations.
- Persist to disk: private_servers.json for the current snapshot, append to ps_history.bin for the rolling history.
- If anything failed, fire an admin alert. Continue the loop.
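One pass through the loop body above can be sketched as a function that takes each step as a closure. This is an assumption-heavy outline: the real task is async and runs on tokio, while this version is synchronous and std-only so it stays self-contained. `run_tick` and `TickOutcome` are hypothetical names.

```rust
/// Result of one pass through the loop body (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum TickOutcome {
    Refreshed,
    Failed(String),
}

/// Run one refresh: fetch and parse, swap the snapshot in, persist it.
/// On any failure the alert hook fires and the caller keeps looping.
pub fn run_tick<S>(
    fetch: impl FnOnce() -> Result<S, String>,
    swap_in: impl FnOnce(&S),
    persist: impl FnOnce(&S) -> Result<(), String>,
    alert: impl FnOnce(&str),
) -> TickOutcome {
    let result = fetch().and_then(|snapshot| {
        // The in-memory state is updated first, then written to disk.
        swap_in(&snapshot);
        persist(&snapshot)
    });
    match result {
        Ok(()) => TickOutcome::Refreshed,
        Err(error) => {
            // The failure is reported, never propagated out of the loop.
            alert(&error);
            TickOutcome::Failed(error)
        }
    }
}
```

The caller would sleep until the next tick and call `run_tick` again regardless of the outcome, which is what keeps the loop alive through an outage.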
Failure
The failure mode I optimized for
The pessimistic case is not "the refresh task fails once". The pessimistic case
is "the refresh task fails for an hour and nobody notices". Three design
choices fall out of that:
- Backoff is paranoid but bounded. A single failure waits the same twenty seconds and tries again. A run of failures backs off — exponentially up to a small ceiling — but never disables polling. Self-healing is more valuable than throughput on a hobby site.
- Alerting is fire-and-forget. The HTTP path that produces the alert never blocks on the alert delivery: tokio::spawn, post the webhook, drop on the floor if Discord is down. The alternative is a polling task that gets stuck on a slow webhook and starves the actual refresh.
- Retry context survives restarts. Each fetch failure (total-players, private-servers, share-code resolve) is recorded against a named "kind" with a consecutive-count, a first-failure timestamp, the last error string, and the last HTTP status. The map is mirrored to a small JSON file on every change, so a process restart in the middle of an outage doesn't reset the timeline. Alerts only fire at consecutive-count milestones (1, 5, 25, 100), and a "recovered" alert closes the incident when the streak ends — between milestones the context is updated silently.
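The backoff and milestone rules above can be sketched together. Everything named here is an assumption for illustration: the 320-second ceiling (the text only says "small"), the doubling factor, and the `RetryContext`, `FailureStreak`, and `Alert` types. Mirroring the map to a JSON file on every change is left out to keep the sketch std-only.

```rust
use std::collections::HashMap;
use std::time::Duration;

const BASE_DELAY: Duration = Duration::from_secs(20);
/// Assumed ceiling; the real value is not stated in the text.
const MAX_DELAY: Duration = Duration::from_secs(320);
/// Consecutive-failure counts at which an alert fires.
const MILESTONES: [u32; 4] = [1, 5, 25, 100];

/// Delay before the next attempt: 20 s after the first failure, then
/// doubling per additional failure, capped at MAX_DELAY. Never "disabled".
pub fn backoff_delay(consecutive_failures: u32) -> Duration {
    let doublings = consecutive_failures.saturating_sub(1).min(16);
    BASE_DELAY.saturating_mul(1u32 << doublings).min(MAX_DELAY)
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FailureStreak {
    pub consecutive: u32,
    pub first_failure_unix: u64,
    pub last_error: String,
    pub last_status: Option<u16>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Alert {
    Milestone(u32),
    Recovered(u32),
}

/// Failure streaks keyed by a stable kind such as "roblox.private_servers".
#[derive(Default)]
pub struct RetryContext {
    streaks: HashMap<String, FailureStreak>,
}

impl RetryContext {
    /// Record a failure; returns an alert only at a milestone count.
    pub fn record_failure(
        &mut self,
        kind: &str,
        now_unix: u64,
        error: &str,
        status: Option<u16>,
    ) -> Option<Alert> {
        let streak = self.streaks.entry(kind.to_string()).or_insert(FailureStreak {
            consecutive: 0,
            first_failure_unix: now_unix,
            last_error: String::new(),
            last_status: None,
        });
        streak.consecutive += 1;
        streak.last_error = error.to_string();
        streak.last_status = status;
        if MILESTONES.contains(&streak.consecutive) {
            Some(Alert::Milestone(streak.consecutive))
        } else {
            None // between milestones the context is updated silently
        }
    }

    /// Record a success; closes the incident if a streak was open.
    pub fn record_success(&mut self, kind: &str) -> Option<Alert> {
        self.streaks.remove(kind).map(|s| Alert::Recovered(s.consecutive))
    }

    pub fn streak(&self, kind: &str) -> Option<&FailureStreak> {
        self.streaks.get(kind)
    }
}
```

With these constants, six consecutive failures produce two alerts (at counts 1 and 5), and the next success produces one "recovered" alert carrying the streak length.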
Persistence
Persistence as a separate concern
The refresh task does not own the persistence format. It hands a snapshot to a
persistence module that knows how to write JSON for the live state and a
compact binary append for history. Two reasons:
- If the persistence layer needs to change format — and it has — the refresh task does not need to know.
- The refresh task can be tested with an in-memory fake persistence; the persistence layer can be tested without a network.
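The boundary described above might look like a small trait with an in-memory fake behind it. The trait name, method names, and `Snapshot` fields are all hypothetical; the text only says that the refresh task hands a snapshot to a persistence module.

```rust
/// Hypothetical snapshot shape; the real model is not shown in the text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snapshot {
    pub total_players: u64,
    pub server_player_counts: Vec<u32>,
}

/// What the refresh task needs from persistence, and nothing more.
pub trait Persistence {
    /// Overwrite the current-state file (JSON in the real module).
    fn save_snapshot(&mut self, snapshot: &Snapshot) -> Result<(), String>;
    /// Append to the rolling history (binary log in the real module).
    fn append_history(&mut self, snapshot: &Snapshot) -> Result<(), String>;
}

/// Test double: keeps everything in memory, touches no disk or network.
#[derive(Default)]
pub struct InMemoryPersistence {
    pub current: Option<Snapshot>,
    pub history: Vec<Snapshot>,
}

impl Persistence for InMemoryPersistence {
    fn save_snapshot(&mut self, snapshot: &Snapshot) -> Result<(), String> {
        self.current = Some(snapshot.clone());
        Ok(())
    }

    fn append_history(&mut self, snapshot: &Snapshot) -> Result<(), String> {
        self.history.push(snapshot.clone());
        Ok(())
    }
}

/// The refresh task only sees the trait, so the on-disk format can change
/// without this call site changing.
pub fn persist(store: &mut dyn Persistence, snapshot: &Snapshot) -> Result<(), String> {
    store.save_snapshot(snapshot)?;
    store.append_history(snapshot)
}
```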
The history file is a binary append-only log because text-shaped histories
grow obnoxiously fast and binary parsing is cheaper than JSON for this
workload. The cost is a tiny amount of extra code; the benefit is six months of
history fitting in a sensible size.
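To make the append-only idea concrete, here is a sketch of a fixed-width record format. The layout is invented for illustration (an 8-byte little-endian Unix timestamp followed by a 4-byte player count); the actual ps_history.bin format is not described in the text and almost certainly carries more than this.

```rust
use std::io::{self, Write};

/// Bytes per record in this hypothetical layout: u64 timestamp + u32 count.
pub const RECORD_LEN: usize = 12;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HistoryRecord {
    pub unix_secs: u64,
    pub total_players: u32,
}

/// Encode one record as fixed-width little-endian bytes.
pub fn encode(record: &HistoryRecord) -> [u8; RECORD_LEN] {
    let mut buf = [0u8; RECORD_LEN];
    buf[..8].copy_from_slice(&record.unix_secs.to_le_bytes());
    buf[8..].copy_from_slice(&record.total_players.to_le_bytes());
    buf
}

/// Append one record to any writer (a file opened in append mode, say).
pub fn append<W: Write>(log: &mut W, record: &HistoryRecord) -> io::Result<()> {
    log.write_all(&encode(record))
}

/// Decode every complete record; a truncated trailing record (for example
/// from a crash mid-write) is ignored rather than treated as an error.
pub fn decode_all(bytes: &[u8]) -> Vec<HistoryRecord> {
    bytes
        .chunks_exact(RECORD_LEN)
        .map(|chunk| {
            let mut ts = [0u8; 8];
            ts.copy_from_slice(&chunk[..8]);
            let mut players = [0u8; 4];
            players.copy_from_slice(&chunk[8..]);
            HistoryRecord {
                unix_secs: u64::from_le_bytes(ts),
                total_players: u32::from_le_bytes(players),
            }
        })
        .collect()
}
```

Fixed-width records need no delimiters or field names, which is where the size advantage over a text history comes from.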
Coupling
State swap, not state mutation
The HTTP handlers that serve the live page read from the same AppState the
refresh task writes to. The discipline is to never
mutate that state in place: the task computes a new snapshot, then swaps it in
atomically. That way a request that lands mid-refresh sees either the old
snapshot or the new snapshot, never a half-formed view. The cost is one extra
allocation per refresh; the benefit is that no handler ever sees inconsistent
state.
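One way to get this swap behavior with only the standard library is an `Arc<Snapshot>` behind an `RwLock`: readers clone the `Arc` and release the lock immediately, and the writer replaces the pointer in one assignment. This is an assumption about the mechanism, not a description of the real AppState; the `Snapshot` fields and method names are hypothetical.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical snapshot shape; the real model is not shown in the text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snapshot {
    pub total_players: u64,
    pub server_player_counts: Vec<u32>,
}

/// Shared state: the lock guards only the pointer, never the data itself.
#[derive(Clone)]
pub struct AppState {
    current: Arc<RwLock<Arc<Snapshot>>>,
}

impl AppState {
    pub fn new(initial: Snapshot) -> Self {
        AppState { current: Arc::new(RwLock::new(Arc::new(initial))) }
    }

    /// Handlers call this: a cheap pointer clone, then the lock is released.
    pub fn load(&self) -> Arc<Snapshot> {
        let guard = self.current.read().unwrap_or_else(|e| e.into_inner());
        Arc::clone(&*guard)
    }

    /// The refresh task calls this with a fully built snapshot.
    pub fn swap(&self, next: Snapshot) {
        let next = Arc::new(next); // the one extra allocation per refresh
        let mut guard = self.current.write().unwrap_or_else(|e| e.into_inner());
        *guard = next;
    }
}
```

A handler that called `load` before the swap keeps its old snapshot until it finishes; the old data is freed when the last such handler drops its `Arc`.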
Hindsight
What was already on the list
Two follow-ups had been sitting in the "fix when it bites" pile for a while.
Both shipped together in the rework above:
- Per-endpoint cadence. Originally everything refreshed on the same 15-second tick. The endpoints have different change rates, so the total-player aggregate is now polled every 40 seconds while the private-servers list runs at 20 seconds. Same freshness on the page, fewer requests against the slower endpoint.
- Structured retry context. Alerts used to tell me that a refresh failed. They now tell me why: a stable kind ("roblox.total_players", "roblox.private_servers", "roblox.share_code_resolve"), the consecutive failure count, when the streak started, the last error message, and the last HTTP status — all persisted to disk so a restart doesn't lose the incident timeline.
Neither was urgent at the time. The refresh task had held up across multiple
Roblox API changes without a rate-limit incident, and the polling story was,
for a long time, the least interesting thing about SP-Legends. Both items took
a quiet afternoon to ship and immediately paid for themselves the next time
upstream wobbled.