Linux Performance Troubleshooting: The 60-Second Diagnostic
Performance issues strike at the worst times. Here’s your battle-tested methodology to find the root cause quickly.
The First 10 Seconds: Quick System Overview
# One command to rule them all
$ uptime && free -h && df -h / && ss -tlpn | head -20
# Output tells you:
# 1. Load averages (1, 5, 15 minutes)
# 2. Memory usage (used/available)
# 3. Disk space on root filesystem
# 4. Listening ports and their processes
Load average interpretation (a one-line check follows this list):
- < CPU cores: System is comfortable
- ≈ CPU cores: System is busy
- > CPU cores: System is overloaded
- > 2× CPU cores: System is struggling
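If you just want the ratio, one line of awk against /proc/loadavg gives it to you; a minimal sketch assuming only awk and nproc, both standard on mainstream distros:
# 1-minute load, core count, and their ratio in one line
$ awk -v cores="$(nproc)" '{printf "load1=%s cores=%d ratio=%.2f\n", $1, cores, $1 / cores}' /proc/loadavg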
CPU Investigation: What’s Actually Running?
# Snapshot of CPU usage per process (batch mode, single pass)
$ top -b -n 1 | head -20
# Better: htop if available
$ htop --sort-key=PERCENT_CPU
# Find which CPU cores are busy
$ mpstat -P ALL 1 3
# Check for CPU wait (I/O bound processes)
$ vmstat 1 5
Key metrics (a filter to watch them over time follows this list):
- %us: User CPU time (your application)
- %sy: System CPU time (kernel)
- %wa: I/O wait (disk/network bottleneck)
- %id: Idle (good! means CPU has capacity)
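To watch just those four columns scroll by, filter vmstat with awk; a rough sketch that assumes the default procps column order (us, sy, id, wa in fields 13-16):
# Print only the CPU breakdown from five one-second vmstat samples
$ vmstat 1 5 | awk 'NR > 2 { printf "us=%s%% sy=%s%% id=%s%% wa=%s%%\n", $13, $14, $15, $16 }'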
Memory: The Silent Killer
# Understand memory pressure
$ grep -E "MemTotal|MemAvailable|MemFree|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo
# Check for OOM killer activity
$ dmesg -T | grep -i "oom\|kill"
# Find memory hogs
$ ps aux --sort=-%mem | head -10
# Check slab memory (kernel caches)
$ slabtop -o
When to worry about memory (a scripted check follows this list):
- Available memory < 10% of total
- Swap in use and growing (even with SSD-backed swap, sustained swapping signals memory pressure)
- OOM killer has been active (check dmesg)
- Slab memory growing without bound
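The first and third checks script easily; a minimal sketch reading /proc/meminfo and dmesg (dmesg may need root on hardened systems):
# Available memory as a percentage of total (worry below ~10%)
$ awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {printf "available: %.1f%%\n", a * 100 / t}' /proc/meminfo
# Recent OOM-killer activity, if any
$ dmesg -T | grep -i "out of memory" | tail -5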
Disk I/O: The Common Bottleneck
# Check disk latency
$ iostat -dx 1 3
# Find which processes are doing I/O
$ iotop -o
# Find processes with the most open file descriptors
$ lsof | awk '{print $2}' | sort | uniq -c | sort -nr | head -10
# Monitor specific directory I/O
$ inotifywait -m /var/log -e create,modify,delete
I/O red flags:
- await > 10 ms (disk is slow to respond; on SSDs this is very slow)
- %util > 80% (disk is constantly busy)
- High iowait in CPU metrics
- Many processes in D state (uninterruptible sleep; see the check below)
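That last flag is quick to confirm, since D shows up in the ps STAT column:
# List processes stuck in uninterruptible sleep (usually waiting on I/O)
$ ps -eo pid,stat,wchan,cmd | awk '$2 ~ /^D/'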
Network: The Invisible Traffic Jam
# Overall network stats
$ sar -n DEV 1 3
# Connections by state
$ ss -s
# Find network-heavy processes
$ nethogs
# Check for packet loss
$ ping -c 10 -i 0.2 google.com
# DNS resolution times
$ dig google.com | grep "Query time"
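Loss between hosts often shows up first as TCP retransmissions; two ways to read the kernel counters, assuming nstat (iproute2) or netstat (net-tools) is installed:
# Absolute TCP retransmission counter
$ nstat -az TcpRetransSegs
# Classic alternative
$ netstat -s | grep -i retrans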
Application-Specific Debugging
For Java Applications:
# Quick JVM health check
$ jcmd <PID> VM.version
$ jcmd <PID> VM.flags
$ jcmd <PID> GC.heap_info
# If you suspect GC issues
$ jstat -gcutil <PID> 1000 5
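If GC looks healthy but a core is still pegged, the usual next step is matching the hottest OS thread to a Java stack frame. A sketch using standard top and jcmd options; <TID> and <HEX_TID> are placeholders you fill in from the previous command:
# Find the busiest thread inside the JVM process
$ top -H -b -n 1 -p <PID> | head -15
# Convert its TID to hex, then find the matching nid= entry in a thread dump
$ printf '%x\n' <TID>
$ jcmd <PID> Thread.print | grep -A 15 "nid=0x<HEX_TID>"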
For Python Applications:
# Trace memory allocations (inspect with tracemalloc.take_snapshot() inside the code)
$ python -X tracemalloc <your_script.py>
# Profile CPU usage
$ python -m cProfile -o profile.stats <your_script.py>
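For a process you cannot restart, py-spy (a separate pip install, not part of the standard library) samples the interpreter from outside, assuming you are allowed to attach to the target:
# Live top-style view of which Python functions are burning CPU
$ py-spy top --pid <PID>
# One-off dump of every thread's current Python stack
$ py-spy dump --pid <PID>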
For Database Troubles:
# PostgreSQL
$ pg_top
$ psql -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
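When many backends are active, sorting them by runtime usually points at the culprit; a sketch of that query built on standard pg_stat_activity columns:
$ psql -c "SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 10;"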
# MySQL
$ mysqladmin processlist
$ mysql -e "SHOW ENGINE INNODB STATUS\G"
The 60-Second Checklist
Run this script when alerted:
#!/bin/bash
# save as /usr/local/bin/quick-diag
echo "=== $(date) ==="
echo "=== Uptime & Load ==="
uptime
echo
echo "=== Memory ==="
free -h
echo
echo "=== Disk ==="
df -h / | tail -1
echo
echo "=== Top 5 CPU Processes ==="
ps aux --sort=-%cpu | head -6
echo
echo "=== Top 5 Memory Processes ==="
ps aux --sort=-%mem | head -6
echo
echo "=== Network Connections ==="
ss -s | head -5
echo
echo "=== Checking common issues ==="
# Check for zombie processes
# The [d]efunct pattern keeps grep from matching its own command line
if [ "$(ps aux | grep -c '[d]efunct')" -gt 0 ]; then
    echo "⚠️ Zombie processes found!"
    ps aux | grep '[d]efunct'
fi
# Check disk space
ROOT_USAGE=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$ROOT_USAGE" -gt 90 ]; then
    echo "⚠️ Root filesystem ${ROOT_USAGE}% full!"
fi
# Check load vs CPUs
CPUS=$(nproc)
LOAD1=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1 | tr -d ' ')
LOAD_RATIO=$(echo "$LOAD1 / $CPUS" | bc -l)
if (( $(echo "$LOAD_RATIO > 2" | bc -l) )); then
echo "⚠️ High load average: $LOAD1 on $CPUS CPUs"
fi
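A short setup sketch so the script is actually in place when an alert fires (the path matches the comment at the top of the script):
$ sudo install -m 755 quick-diag /usr/local/bin/quick-diag
$ quick-diag | tee /tmp/diag-$(date +%s).log
Piping through tee keeps a copy of what you saw, which helps when you write the postmortem.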
Advanced Tools for Deep Dives
When quick checks aren’t enough:
- perf - CPU profiling at the kernel level
  $ perf top -p <PID>
  $ perf record -p <PID> -g -- sleep 30
- strace - System call tracing
  $ strace -p <PID> -c
  $ strace -p <PID> -e open,read,write
- bpftrace/eBPF - Modern observability
  $ bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
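One more illustrative one-liner: count syscalls per process until you press Ctrl-C (needs root and a BPF-capable kernel):
$ bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'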
Conclusion
Remember the troubleshooting hierarchy:
- Check the obvious (disk space, memory, load)
- Identify the resource bottleneck (CPU, memory, disk, network)
- Find the culprit processes
- Understand why they’re misbehaving
- Fix or mitigate the issue
Practice these commands in non-production environments. When production is burning, muscle memory will save you.