How to Profile Flask Applications with py-spy Without Adding Code Instrumentation
When your Flask application shows high CPU usage under production load, you’ll want to identify bottlenecks without changing the code. Instrumenting with cProfile or decorators can skew results or pollute production. py-spy lets us attach to a running process—like a Gunicorn worker PID—and sample the stack traces periodically.
This approach typically adds less than 3% overhead while revealing hotspots like slow loops or inefficient computations.
Why py-spy for Flask? (Production-Safe Sampling Profilers)
Let’s consider the trade-offs among popular Python profilers. When we instrument code with something like cProfile decorators, we introduce overhead that can change the very behavior we’re trying to measure—particularly problematic in production where every cycle counts. py-spy takes a different approach: it periodically samples the process’s stack traces using operating system signals, typically every 10 milliseconds. This statistical profiling method generally adds less than 3% CPU overhead while capturing hotspots across all threads, making it suitable for multi-worker setups like Gunicorn.
Of course, sampling has limitations—very short functions might be missed—but for CPU-bound bottlenecks in Flask endpoints, it works well. Here’s how py-spy compares:
| Profiler | Code Change | Overhead | Live Attach | Flamegraph |
|---|---|---|---|---|
| cProfile | Yes | 20-50% | No | Manual |
| pyinstrument | Decorator | ~10% | No | Yes |
| py-spy | No | <3% | Yes | Built-in |
| Scalene | No | ~5% | Yes | Yes |
We might choose Scalene if memory usage is the concern, or pyinstrument for web apps where decorators don’t disrupt much. For live production Flask without changes, though, py-spy often fits best.
Installing py-spy
py-spy is a Rust binary, so installation is quick across platforms (Linux, macOS, Windows).
Recommended: Cargo (ensures latest version, cross-platform):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
cargo install py-spy
Alternative: pip (simpler if Python env ready, but may lag on releases):
pip install py-spy
macOS with Homebrew:
brew install py-spy
Verify installation:
py-spy --version
Expect something like py-spy 0.3.14. On Linux, you might need to address permissions later—see pitfalls.
Creating Our Example Flask Application
To demonstrate, we’ll create a simple Flask app with a CPU-intensive endpoint that mimics real-world data processing—say, numerical simulations or feature engineering without NumPy best practices. This naive loop represents a common premature optimization gap.
app.py:
from flask import Flask
import numpy as np
app = Flask(__name__)
@app.route('/slow')
def slow_compute():
result = 0
for i in range(10**6):
result += np.sin(i) * np.cos(i)
return {'result': result}
@app.route('/health')
def health():
return {'status': 'ok'}
if __name__ == '__main__':
app.run(debug=False)
gunicorn.conf.py:
bind = "0.0.0.0:5000"
workers = 4
Start the server:
gunicorn -c gunicorn.conf.py app:app
Generate load (install wrk if needed: brew install wrk on macOS):
wrk -t12 -c50 -d30s http://localhost:5000/slow
Expect ~200 req/s with high CPU.
Step 1: Identify the Target Process
First, we need the PID of a Gunicorn worker process handling requests. You can use:
ps aux | grep '[g]unicorn'
# Or more precisely:
pgrep -f 'gunicorn.*app:app'
Note the PID, say 12345.
Step 2: Real-Time Profiling with top
py-spy top provides a live, updating view similar to top but for Python functions, sorted by CPU usage. Run:
py-spy top --pid 12345 --sort cpu
You’ll see output like:
PID CPU% COMMAND FUNCTION
12345 92.3% gunicorn slow_compute()
12345 45.1% gunicorn np.sin()
12345 40.2% gunicorn for loop body
This reveals slow_compute() dominating CPU time.
Step 3: Generate Flame Graph
For a visual overview, use py-spy record to sample for a duration and output an interactive SVG:
py-spy record --pid 12345 -d 30 -o flask-profile.svg --subprocesses --rate 1000
After 30 seconds, open flask-profile.svg in your browser. Wide bars at the top indicate the hottest functions; drill down to see callers.
Applying the Fix: Vectorization
NumPy excels at vectorized operations, where mathematical functions apply to entire arrays at once rather than element-by-element in a Python loop. This leverages optimized C code under the hood, avoiding the interpreter overhead of millions of loop iterations.
Replace the loop in slow_compute():
Before:
result = 0
for i in range(10**6): # Slow: Python loop
result += np.sin(i) * np.cos(i)
After:
i = np.arange(10**6)
result = np.sum(np.sin(i) * np.cos(i)) # Vectorized: ~50x faster computation
To apply without full restart, send a reload signal to the master process:
kill -HUP 12345
Gunicorn workers reload gracefully.
Benchmarks (Apple M2 MacBook, 4 Gunicorn Workers)
We used wrk for load testing:
wrk -t12 -c400 -d30s http://localhost:5000/slow
| Endpoint | Requests/s Before | Requests/s After | CPU Drop |
|---|---|---|---|
| /slow | 210 | 1050 | 92% → 18% |
| /health | 5000 | 5000 | No change |
Your mileage may vary with hardware, NumPy version, or Python interpreter.
Common Pitfalls and Solutions
While py-spy is straightforward, a few issues arise in production environments.
Permission denied errors are common on Linux systems without root privileges. py-spy needs access to /proc/<pid>. Quick fix: prefix with sudo. For permanent non-root access, use sudo setcap cap_sys_ptrace=ep $(which py-spy)—this grants the necessary ptrace capability.
Missing flamegraph output often stems from older py-spy versions lacking SVG support. Update via cargo install py-spy --force or your package manager.
Incomplete thread visibility happens with Python’s GIL; add --native to include C extensions and native threads: py-spy top --native --pid <pid>.
py-spy works with ASGI servers like Uvicorn the same way—target the PID directly.
In Docker containers, PIDs are namespaced. From host: docker exec -it <container> py-spy top --pid 1 (PID 1 inside container). Ensure host network mode or volume mounts if needed.
Choosing sampling rate: Default 100Hz is usually fine, but for short-lived requests, increase with --rate 10000. Too high adds overhead; experiment.
Interpreting results: Flame graphs show inclusive time—wide bars include callees. Zoom to isolate functions. If no obvious hotspots, consider I/O or memory with other tools like Scalene.
Quick Reference Checklist
As we profile and optimize:
- Run
py-spy top --pid <PID>to identify the top CPU consumers - Capture
py-spy record --pid <PID> -o profile.svgfor visual analysis - Implement the fix, then reload workers with
kill -HUP <master-PID> - Retest under load with
wrkor similar - Verify no code instrumentation was needed
Further Reading
These related articles complement py-spy usage in Flask:
When py-spy reveals I/O or memory issues, consider tools like Scalene or Austin for deeper analysis.
Sponsored by Durable Programming
Need help maintaining or upgrading your Python application? Durable Programming specializes in keeping Python apps secure, performant, and up-to-date.
Hire Durable Programming