| domain | varaneckas.com |
| summary | This blog outlines best practices for incident response and system management, with a focus on centralized databases and distributed systems. Key recommendations include:
* Incident Response: Prioritize resolving the service disruption. Use techniques such as component shutdowns and error boundaries, and guard against cascading failures. Implement robust runtime controls, status endpoints, and thorough incident follow-ups (detection, escalation, recovery, prevention). Embrace “slow thinking” during outages and leverage ChatOps. A minimum of 6 people should be on-call.
* Resource Management: Employ predictive algorithms for resource forecasting (short-, mid-, and long-term), and automate capacity migration. Monitor critical metrics using Prometheus, paying close attention to HDFS cluster balance and disk space.
* Monitoring & Alerting: Use Real User Monitoring and Synthetic Monitoring. Name alerts clearly and proactively manage alert fatigue. Use visualization techniques (e.g., request percentiles) to interpret SLOs effectively.
* Root Cause Analysis: Restore service first, and defer in-depth root cause investigation until after recovery.
The document emphasizes proactive measures like data science-driven forecasts and automated responses to ensure system resilience. |
| title | Blog of Tomas Varaneckas |
| description | Blog of Tomas Varaneckas |
| keywords | have, will, service, incident, people, monitoring, team, time, more, call, services, error, bots, there, outage, fact, page |
| upstreams |
|
| downstreams |
|
| nslookup | A 104.21.13.234, A 172.67.133.108 |
| created | 2025-12-20 |
| updated | 2025-12-20 |
| summarized | 2026-02-02 |
|
|