Linux Performance Troubleshooting: The 60-Second Diagnostic
Performance issues strike at the worst times. Here’s your battle-tested methodology to find the root cause quickly.
The First 10 Seconds: Quick System Overview
# One command to rule them all
$ uptime && free -h && df -h / && ss -tlpn | head -20
# Output tells you:
# 1. Load averages (1, 5, 15 minutes)
# 2. Memory usage (used/available)
# 3. Disk space on root filesystem
# 4. Listening ports and their processes
Load average interpretation (a one-line check follows this list):
- < CPU cores: System is comfortable
- ≈ CPU cores: System is busy
- > CPU cores: System is overloaded
- > 2× CPU cores: System is struggling
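If you just want the ratio, one line of awk against /proc/loadavg gives it to you; a minimal sketch assuming only awk and nproc, both standard on mainstream distros:
# 1-minute load, core count, and their ratio in one line
$ awk -v cores="$(nproc)" '{printf "load1=%s cores=%d ratio=%.2f\n", $1, cores, $1 / cores}' /proc/loadavg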
CPU Investigation: What’s Actually Running?
# Snapshot of CPU usage per process (batch mode, single pass)
$ top -b -n 1 | head -20
# Better: htop if available
$ htop --sort-key=PERCENT_CPU
# Find which CPU cores are busy
$ mpstat -P ALL 1 3
# Check for CPU wait (I/O bound processes)
$ vmstat 1 5
Key metrics (a filter to watch them over time follows this list):
- %us: User CPU time (your application)
- %sy: System CPU time (kernel)
- %wa: I/O wait (disk/network bottleneck)
- %id: Idle (good! means CPU has capacity)
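To watch just those four columns scroll by, filter vmstat with awk; a rough sketch that assumes the default procps column order (us, sy, id, wa in fields 13-16):
# Print only the CPU breakdown from five one-second vmstat samples
$ vmstat 1 5 | awk 'NR > 2 { printf "us=%s%% sy=%s%% id=%s%% wa=%s%%\n", $13, $14, $15, $16 }'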
Memory: The Silent Killer
# Understand memory pressure
$ grep -E "MemTotal|MemAvailable|MemFree|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo
# Check for OOM killer activity
$ dmesg -T | grep -i "oom\|kill"
# Find memory hogs
$ ps aux --sort=-%mem | head -10
# Check slab memory (kernel caches)
$ slabtop -o
When to worry about memory (a scripted check follows this list):
- Available memory < 10% of total
- Swap in use and growing (even with SSD-backed swap, sustained swapping signals memory pressure)
- OOM killer has been active (check dmesg)
- Slab memory growing without bound
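The first and third checks script easily; a minimal sketch reading /proc/meminfo and dmesg (dmesg may need root on hardened systems):
# Available memory as a percentage of total (worry below ~10%)
$ awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {printf "available: %.1f%%\n", a * 100 / t}' /proc/meminfo
# Recent OOM-killer activity, if any
$ dmesg -T | grep -i "out of memory" | tail -5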
Disk I/O: The Common Bottleneck
# Check disk latency
$ iostat -dx 1 3
# Find which processes are doing I/O
$ iotop -o
# Find processes with the most open file descriptors
$ lsof | awk '{print $2}' | sort | uniq -c | sort -nr | head -10
# Monitor specific directory I/O
$ inotifywait -m /var/log -e create,modify,delete
I/O red flags:
- await > 10 ms (disk is slow to respond; on SSDs this is very slow)
- %util > 80% (disk is constantly busy)
- High iowait in CPU metrics
- Many processes in D state (uninterruptible sleep; see the check below)
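That last flag is quick to confirm, since D shows up in the ps STAT column:
# List processes stuck in uninterruptible sleep (usually waiting on I/O)
$ ps -eo pid,stat,wchan,cmd | awk '$2 ~ /^D/'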
Network: The Invisible Traffic Jam
# Overall network stats
$ sar -n DEV 1 3
# Connections by state
$ ss -s
# Find network-heavy processes
$ nethogs
# Check for packet loss
$ ping -c 10 -i 0.2 google.com
# DNS resolution times
$ dig google.com | grep "Query time"
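Loss between hosts often shows up first as TCP retransmissions; two ways to read the kernel counters, assuming nstat (iproute2) or netstat (net-tools) is installed:
# Absolute TCP retransmission counter
$ nstat -az TcpRetransSegs
# Classic alternative
$ netstat -s | grep -i retrans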
Application-Specific Debugging
For Java Applications:
# Quick JVM health check
$ jcmd <PID> VM.version
$ jcmd <PID> VM.flags
$ jcmd <PID> GC.heap_info
# If you suspect GC issues
$ jstat -gcutil <PID> 1000 5
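If GC looks healthy but a core is still pegged, the usual next step is matching the hottest OS thread to a Java stack frame. A sketch using standard top and jcmd options; <TID> and <HEX_TID> are placeholders you fill in from the previous command:
# Find the busiest thread inside the JVM process
$ top -H -b -n 1 -p <PID> | head -15
# Convert its TID to hex, then find the matching nid= entry in a thread dump
$ printf '%x\n' <TID>
$ jcmd <PID> Thread.print | grep -A 15 "nid=0x<HEX_TID>"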
For Python Applications:
# Trace memory allocations (inspect with tracemalloc.take_snapshot() inside the code)
$ python -X tracemalloc <your_script.py>
# Profile CPU usage
$ python -m cProfile -o profile.stats <your_script.py>
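For a process you cannot restart, py-spy (a separate pip install, not part of the standard library) samples the interpreter from outside, assuming you are allowed to attach to the target:
# Live top-style view of which Python functions are burning CPU
$ py-spy top --pid <PID>
# One-off dump of every thread's current Python stack
$ py-spy dump --pid <PID>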
For Database Troubles:
# PostgreSQL
$ pg_top
$ psql -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
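When many backends are active, sorting them by runtime usually points at the culprit; a sketch of that query built on standard pg_stat_activity columns:
$ psql -c "SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 10;"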
# MySQL
$ mysqladmin processlist
$ mysql -e "SHOW ENGINE INNODB STATUS\G"
The 60-Second Checklist
Run this script when alerted:
#!/bin/bash
# save as /usr/local/bin/quick-diag
echo "=== $(date) ==="
echo "=== Uptime & Load ==="
uptime
echo
echo "=== Memory ==="
free -h
echo
echo "=== Disk ==="
df -h / | tail -1
echo
echo "=== Top 5 CPU Processes ==="
ps aux --sort=-%cpu | head -6
echo
echo "=== Top 5 Memory Processes ==="
ps aux --sort=-%mem | head -6
echo
echo "=== Network Connections ==="
ss -s | head -5
echo
echo "=== Checking common issues ==="
# Check for zombie processes
# The [d]efunct pattern keeps grep from matching its own command line
if [ "$(ps aux | grep -c '[d]efunct')" -gt 0 ]; then
    echo "⚠️ Zombie processes found!"
    ps aux | grep '[d]efunct'
fi
# Check disk space
ROOT_USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$ROOT_USAGE" -gt 90 ]; then
    echo "⚠️ Root filesystem ${ROOT_USAGE}% full!"
fi
# Check load vs CPUs
CPUS=$(nproc)
LOAD1=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1 | tr -d ' ')
LOAD_RATIO=$(echo "$LOAD1 / $CPUS" | bc -l)
if (( $(echo "$LOAD_RATIO > 2" | bc -l) )); then
echo "⚠️ High load average: $LOAD1 on $CPUS CPUs"
fi
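A short setup sketch so the script is actually in place when an alert fires (the path matches the comment at the top of the script):
$ sudo install -m 755 quick-diag /usr/local/bin/quick-diag
$ quick-diag | tee /tmp/diag-$(date +%s).log
Piping through tee keeps a copy of what you saw, which helps when you write the postmortem.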
Advanced Tools for Deep Dives
When quick checks aren’t enough:
- perf - CPU profiling at the kernel level
  $ perf top -p <PID>
  $ perf record -p <PID> -g -- sleep 30
- strace - System call tracing
  $ strace -p <PID> -c
  $ strace -p <PID> -e open,read,write
- bpftrace/eBPF - Modern observability
  $ bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
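One more illustrative one-liner: count syscalls per process until you press Ctrl-C (needs root and a BPF-capable kernel):
$ bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'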
Conclusion
Remember the troubleshooting hierarchy:
- Check the obvious (disk space, memory, load)
- Identify the resource bottleneck (CPU, memory, disk, network)
- Find the culprit processes
- Understand why they’re misbehaving
- Fix or mitigate the issue
Practice these commands in non-production environments. When production is burning, muscle memory will save you.