System Monitoring Metric Observation Standards
Li Wei
Metric Categories
- Core Infrastructure Monitoring (CIM): average CPU utilization, duration of CPU peak usage, average memory usage, bandwidth input/output, etc.
- Application-Level Monitoring (ALM): JVM process memory, number of internal threads, disk I/O, index read/write operations, user logs, request logs, request error counts, etc.
- Service Quality Monitoring (SQM): maximum request latency, average request latency, average request rate per minute, peak daily request rate, order count, query count, etc.
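
As a rough illustration of how a few of the SQM figures above could be derived from raw request logs, the sketch below assumes a hypothetical `RequestRecord` shape (latency plus a success flag) and a fixed observation window. Neither the record layout nor the function name comes from the original standards; they are placeholders for whatever the logging pipeline actually emits.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """Hypothetical log record; the field names are assumptions for illustration."""
    latency_ms: float
    ok: bool


def summarize_sqm(records, window_seconds):
    """Return basic SQM figures for the records observed in one window."""
    if not records or window_seconds <= 0:
        raise ValueError("need at least one record and a positive window")
    latencies = [r.latency_ms for r in records]
    return {
        "max_latency_ms": max(latencies),
        "avg_latency_ms": mean(latencies),
        # Scale the count in this window to a per-minute rate.
        "requests_per_minute": len(records) * 60 / window_seconds,
        "error_count": sum(1 for r in records if not r.ok),
    }


sample = [
    RequestRecord(10.0, True),
    RequestRecord(30.0, True),
    RequestRecord(50.0, False),
]
summary = summarize_sqm(sample, window_seconds=60)
```

In practice these figures would usually come from a metrics agent rather than hand-rolled code; the point here is only to make the definitions concrete.
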
Alarm Metric References
| Category | Metric | Description | Standard |
|---|---|---|---|
| Application | cpu | CPU utilization: share of time spent executing non-idle processes (non-idle CPU time ÷ total CPU time) | 60% |
| Application | memory | Memory usage: used versus available space. Track total, used, and free; free + buffers + cached represents available memory. Too little available memory can trigger full GC (FGC) and slow system response | 60% |
| Application | disk | Disk I/O: how busy the disk is. I/O load reflects system load and can become an application bottleneck | 60% |
| Application | load | load.1minPerCPU, load.5minPerCPU: CPU load per core | |
| Application | oldGC | full_gc_count: number of full GCs | 2 times per day |
| Application | swap | mem.swapused.percent: swap usage percentage | 10% |
| Service Quality | failure_rate | Interface failure count ÷ total interface calls | 0.01% |
| Service Quality | error_count | Number of interface errors | - |
| Service Quality | average_response_time | Total time from when a user sends a request to when the response is fully received | - |
| Service Quality | TP999 | 99.9th-percentile latency: the smallest time within which 99.9% of requests receive a response | - |
| Service Quality | qps | Queries (requests) per second | - |
| Service Quality | business_data | Drop-zero alarm: trigger an alarm when the call volume is zero over a period | 0 |
| Service Quality | range_fluctuation | Set threshold ranges for metrics so that normal fluctuation stays within a defined interval | Context-dependent |
| Service Quality | data_accuracy | Compare data across strongly related systems (e.g., the B-side and C-side data of a group-buying platform) | Context-dependent |
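
The definitions in the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the dictionary keys and function names are hypothetical, the standards are treated as upper bounds that raise an alarm when exceeded (the table does not say whether the comparison is strict), and TP999 is computed with the nearest-rank percentile method, which is one of several common conventions.

```python
# Standards copied from the table. Treating each one as an
# "alarm when exceeded" upper bound is an assumption of this sketch.
THRESHOLDS = {
    "cpu_percent": 60.0,
    "memory_percent": 60.0,
    "disk_percent": 60.0,
    "swap_percent": 10.0,
    "full_gc_per_day": 2,
    "failure_rate_percent": 0.01,
}


def failure_rate_percent(failures, total_calls):
    """Interface failure count divided by total interface calls, as a percentage."""
    if total_calls == 0:
        return 0.0
    return 100.0 * failures / total_calls


def tp999(latencies_ms):
    """Nearest-rank 99.9th percentile: smallest latency covering 99.9% of requests."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (999 * n + 999) // 1000  # ceil(0.999 * n) in integer arithmetic
    return ordered[rank - 1]


def is_drop_zero(call_counts):
    """Drop-zero alarm: every sample in the observed period shows zero calls."""
    return len(call_counts) > 0 and all(count == 0 for count in call_counts)


def breached(sample):
    """Return the names of metrics in `sample` that exceed their standard."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if name in sample and sample[name] > limit
    )


alarms = breached({"cpu_percent": 75.0, "memory_percent": 40.0, "swap_percent": 12.0})
```

With the sample above, `alarms` contains `cpu_percent` and `swap_percent`, since 75% exceeds the 60% CPU standard and 12% exceeds the 10% swap standard, while memory at 40% stays under its limit. The context-dependent rows (`range_fluctuation`, `data_accuracy`) are left out because their thresholds depend on the business in question.
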
Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.