March 3, 2026 · 4 min · Carol

uptime monitoring SRE

High uptime is not the same as availability

Your monitoring can report 99.99% while your customer is furious. The difference lies in what you monitor, how often, and what you count as "up".

"The site is up 99.99% of the time" — who hasn't seen that line in a sales pitch? The number is pretty, it's reassuring, it becomes a footer badge. But it hides more than it shows.

Uptime is a measurement, not a guarantee. And like any measurement, it depends on how you define the boundaries.

What "up" means to your monitor

Most monitoring tools check a single thing: a GET on the home page. If it responded 200 OK within the timeout, it's "up". If it didn't respond, it's "down".

This catches:

✅ Server completely down
✅ Broken DNS
✅ Expired SSL certificate
✅ Full timeout

But it doesn't catch:

❌ Home page OK, checkout broken
❌ 200 OK returning a generic error HTML ("Service Unavailable")
❌ Degraded response time (slow but up)
❌ API working, admin panel down
❌ SSL cert expires in 5 days (still valid = "up")

Result: you close the month with 99.99% on the dashboard and a queue of complaint tickets.

The three levels of monitoring

1. Basic availability (HTTP)

GET / and look at the status code. It's the minimum. It detects most catastrophic problems but misses the silent ones.

Use for: a simple corporate site, a landing page.
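
A minimal sketch of this first level, using only the Python standard library. The function names and the 10-second default timeout are assumptions for illustration, not the behavior of any particular monitoring product. The classification rule is kept separate from the network call so it can be reasoned about on its own.

```python
import http.client
import time
import urllib.error
import urllib.request


def classify(status_code, elapsed_s, timeout_s=10.0):
    """Return "up" only for a 200 response that arrived within the timeout."""
    if status_code is None or elapsed_s > timeout_s:
        return "down"
    return "up" if status_code == 200 else "down"


def basic_check(url, timeout_s=10.0):
    """GET the URL and classify the result; any network error counts as down."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers DNS failures, refused connections, TLS errors, timeouts,
        # and 4xx/5xx responses (urlopen raises HTTPError for those).
        status = None
    return classify(status, time.monotonic() - start, timeout_s)
```

Calling `basic_check("https://example.com/")` returns "up" or "down". Note what it never looks at: the response body, and how close to the timeout the response came.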

2. Keyword check

GET / and look for a keyword in the body. If the word "Welcome" disappeared, something is wrong — even if the status is 200.

Use for: detecting a generic error page serving 200, defacement, an unscheduled banner.
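
The rule can be sketched as a pure function over a response that has already been fetched. The name `keyword_check` and the default keyword are illustrative; pick a string that only appears when the page rendered correctly.

```python
def keyword_check(status_code, body, keyword="Welcome"):
    """Return "up" only if the status is 200 AND the keyword is in the body."""
    if status_code != 200:
        return "down"
    return "up" if keyword in body else "down"


# A healthy page passes; an error page served with 200 does not.
print(keyword_check(200, "<h1>Welcome back</h1>"))         # up
print(keyword_check(200, "<h1>Service Unavailable</h1>"))  # down
```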

3. Multi-endpoint

Monitor separately: home, login, payment API, admin dashboard, webhook. Each has its own alert. The status page shows the state of each one individually.

Use for: SaaS, e-commerce, anything where different parts can fail independently.
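
As a rough sketch, multi-endpoint monitoring is a table of named targets plus one result per target. The URLs below are placeholders on example.com, and `check` stands for whatever single-URL check you already have (any callable that returns "up" or "down").

```python
# Hypothetical endpoints; replace with your own critical flows.
ENDPOINTS = {
    "home": "https://example.com/",
    "login": "https://example.com/login",
    "payment_api": "https://api.example.com/payments/health",
    "admin": "https://example.com/admin",
    "webhook": "https://example.com/webhooks/ping",
}


def check_all(endpoints, check):
    """Run check(url) for every endpoint and return {name: "up" | "down"}."""
    return {name: check(url) for name, url in endpoints.items()}


def failing(results):
    """Names of endpoints that should raise their own alert, sorted."""
    return sorted(name for name, status in results.items() if status != "up")
```

Because each endpoint keeps its own result, a broken payment API shows up as `["payment_api"]` instead of disappearing behind a healthy home page.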

Frequency: 1 minute vs 5 minutes

UptimeRobot free checks every 5 minutes. Sentinela and most paid plans check every 1 minute. The difference in practice:

Fails at 2:03:30 PM, 5-min interval → detected at 2:05 → alert around 2:06
Fails at 2:03:30 PM, 1-min interval → detected at 2:04 → alert around 2:05

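Seems small. But this example is a kind one: with a 5-minute interval, a failure that starts right after a check goes unnoticed for almost 5 minutes, versus almost 1 minute with a 1-minute interval. If you run e-commerce at peak time, every undetected minute is a few lost orders. Over a month, that adds up to a real difference.

The trade-off is cost: 1/min = 5x as many checks = more load on your server (small) and more load on the monitor (you pay for it in the plan).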
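
The timeline and the 5x figure can be reproduced with a little arithmetic. This sketch assumes checks fire on a fixed grid aligned to the top of the hour, which is a simplification; real monitors schedule checks in their own way.

```python
import math
from datetime import datetime, timedelta


def detection_time(failure, interval_min):
    """Return the time of the first scheduled check at or after `failure`."""
    hour_start = failure.replace(minute=0, second=0, microsecond=0)
    interval_s = interval_min * 60
    elapsed_s = (failure - hour_start).total_seconds()
    ticks = math.ceil(elapsed_s / interval_s)
    return hour_start + timedelta(seconds=ticks * interval_s)


def checks_per_day(interval_min):
    """Number of checks sent in 24 hours at the given interval."""
    return 24 * 60 // interval_min


failure = datetime(2026, 3, 3, 14, 3, 30)
print(detection_time(failure, 5).time())      # 14:05:00
print(detection_time(failure, 1).time())      # 14:04:00
print(checks_per_day(1) / checks_per_day(5))  # 5.0
```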

"We're up but we're slow"

Degraded latency is the silent killer. The site responds 200 OK, but it took 8 seconds. The user gave up before the response arrived. To the monitor, everything was OK.

How to catch it: monitor not only the status, but the response time. The metrics that matter:

p50 (median): half the requests are faster than this
p95: 95% of requests are faster than this; the slowest 5% take longer
p99: 99% of requests are faster than this; only the slowest 1% take longer

The average deceives, and so does the median on its own. If p50 = 100ms but p99 = 8s, one request in a hundred takes 8 seconds or more: a serious problem that a single summary number masks.
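
To make that concrete, here is a small sketch using the nearest-rank definition of a percentile (one of several definitions in use; monitoring tools may interpolate differently). The sample data is made up: 98 requests at 100 ms and 2 requests at 8 s.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
latencies_ms = [100] * 98 + [8000] * 2

print(sum(latencies_ms) / len(latencies_ms))  # 258.0 (mean looks acceptable)
print(percentile(latencies_ms, 50))           # 100
print(percentile(latencies_ms, 95))           # 100
print(percentile(latencies_ms, 99))           # 8000
```

The mean and even the p95 look healthy here. Only the p99 exposes the requests that took 8 seconds.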

Maintenance windows: honest reporting

99.99% is honest if you count it right. If you take the API down every Wednesday at 3 AM for maintenance, that counts as downtime — unless you declared the window in advance.

Good monitoring lets you:

Create a scheduled maintenance window
Suppress alerts during the window
Not count the window time against uptime%
Show a "scheduled maintenance" banner on the status page

Without that, either you lie in the numbers or you pay in false alarms.
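
A sketch of how the two accounting choices change the number. The figures are assumptions for illustration: a 30-day month (43,200 minutes), four Wednesday windows of 30 minutes each, and 4 minutes of unplanned downtime. The function name and parameters are hypothetical.

```python
def uptime_percent(period_min, unplanned_down_min, maintenance_min=0, declared=True):
    """Uptime percentage for a reporting period.

    Declared maintenance is excluded from the measured period.
    Undeclared maintenance is counted as ordinary downtime.
    """
    if declared:
        counted_period = period_min - maintenance_min
        counted_down = unplanned_down_min
    else:
        counted_period = period_min
        counted_down = unplanned_down_min + maintenance_min
    return 100.0 * (counted_period - counted_down) / counted_period


MONTH_MIN = 30 * 24 * 60  # 43200
MAINTENANCE_MIN = 4 * 30  # four hypothetical 30-minute windows

print(round(uptime_percent(MONTH_MIN, 4, MAINTENANCE_MIN, declared=True), 2))   # 99.99
print(round(uptime_percent(MONTH_MIN, 4, MAINTENANCE_MIN, declared=False), 2))  # 99.71
```

Same month, same outages: the only difference is whether the windows were declared in advance.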

The right question isn't "what's my uptime"

It's: "when something broke in the last 90 days, did I know before the customer?"

An honest metric:

How many incidents opened
Mean time to detection (from breakage → first alert)
Mean time to resolution (from detection → resolved)
In how many cases the customer warned you before the monitor did

That last one is the real test. If you had 3 incidents this month and in 2 of them the customer complained first, your monitoring isn't doing the job — regardless of the aggregate number.
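
These four numbers are easy to compute once each incident is logged with three timestamps and one flag. The record layout and the sample incidents below are hypothetical; they mirror the "3 incidents, 2 reported by the customer first" scenario.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    broke: datetime       # when the breakage started
    alerted: datetime     # first alert from the monitor
    resolved: datetime    # when the incident was resolved
    customer_first: bool  # did a customer report it before the alert?


def minutes(delta):
    return delta.total_seconds() / 60


def summarize(incidents):
    """Incident count, MTTD, MTTR (both in minutes), and customer-first count."""
    return {
        "incidents": len(incidents),
        "mttd_min": mean(minutes(i.alerted - i.broke) for i in incidents),
        "mttr_min": mean(minutes(i.resolved - i.alerted) for i in incidents),
        "customer_first": sum(i.customer_first for i in incidents),
    }


# Made-up sample data for one month.
sample = [
    Incident(datetime(2026, 3, 2, 10, 0), datetime(2026, 3, 2, 10, 4), datetime(2026, 3, 2, 10, 34), True),
    Incident(datetime(2026, 3, 9, 14, 0), datetime(2026, 3, 9, 14, 10), datetime(2026, 3, 9, 14, 40), True),
    Incident(datetime(2026, 3, 20, 9, 0), datetime(2026, 3, 20, 9, 1), datetime(2026, 3, 20, 9, 31), False),
]
print(summarize(sample))
```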

Practical conclusion

Before celebrating 99.99%:

Monitor more than the home page: add checks for the critical flows (checkout, login, main API)
Use a keyword check when the app serves 200 in degraded mode
Look at p95 and p99 — not just status — to catch degradation
Declare maintenance windows for honest uptime
Track MTTD (time to detection) — not just uptime

Uptime is an easy number to publish. Real availability is harder to measure — and more important to deliver.
