When Things Break
Failure modes, debugging strategies, and self-healing.
Diagnose Common Failures Step by Step
Every system fails eventually. The difference between a good system and a fragile one is how quickly you can find and fix the problem. Walk through the five most common failure modes in the Forest Analyzer, learning the symptoms, root causes, and fixes for each.
GEE Authentication Failure
This is the most common error. If the key file that GEE_SERVICE_ACCOUNT_KEY_FILE points to is missing, or the credentials have expired, all analysis fails. Nothing else in the pipeline can compensate, and the analyzer raises a "Forest analysis requires GEE" error.
This single point of failure affects forest coverage calculation, GLAD alerts, RADD alerts, NDVI analysis, satellite imagery, and land cover classification. Everything that touches GEE stops working at once.
Debug checklist:
- Check .env: is GEE_SERVICE_ACCOUNT=true set?
- Does the key file at GEE_SERVICE_ACCOUNT_KEY_FILE actually exist on disk?
- Open the key file: is it valid JSON? (Truncated downloads are a common cause.)
- Has the service account been granted access to the GEE project specified in GEE_PROJECT_ID?
- Check startup logs for "Provider initialized successfully" vs "Provider initialization failed".
Database Connection Pool Exhaustion
The system maintains a pool of 15 database connections. When all are in use, new requests fall back to slower direct connections, and a "Connection pool exhausted" message appears in the logs.
This is not an outright crash — it is graceful degradation. But it signals that the system is under more load than expected, or that connections are leaking.
Debug checklist:
- Check DB_POOL_SIZE and DB_MAX_OVERFLOW settings in .env.
- Look for connection leaks: code paths where a connection is acquired but never released back to the pool.
- Check if analyses are running unusually long, holding connections open.
- Monitor concurrent request count. A traffic spike may simply exceed capacity.
Increasing DB_POOL_SIZE or fixing connection leaks resolves the issue.
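A typical leak is a code path that raises or returns before handing its connection back. The sketch below shows one way to rule that out: a context manager that always releases the connection. It uses a toy pool built on queue.Queue as a stand-in, not the Forest Analyzer's actual pool, and ToyPool and borrowed are hypothetical names.

```python
import queue
from contextlib import contextmanager


class ToyPool:
    """Stand-in for a real connection pool; holds placeholder connection objects."""

    def __init__(self, size: int) -> None:
        self._free: queue.Queue = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout: float = 0.1) -> str:
        # Raises queue.Empty when every connection is checked out.
        return self._free.get(timeout=timeout)

    def release(self, conn: str) -> None:
        self._free.put(conn)

    def available(self) -> int:
        return self._free.qsize()


@contextmanager
def borrowed(pool: ToyPool):
    """Check a connection out and always return it, even if the body raises."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)


pool = ToyPool(size=2)
try:
    with borrowed(pool) as conn:
        raise RuntimeError("analysis failed mid-query")
except RuntimeError:
    pass

print(pool.available())  # 2: the connection was returned despite the error
```

When auditing for leaks, look for any acquire call that is not wrapped in an equivalent try/finally or with block.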
Stuck Analysis
An analysis stays in PROCESSING status for more than 30 minutes. This usually means the GEE request timed out or the worker process crashed mid-analysis.
The system has a built-in watchdog: the stuck_analysis_fixer background job runs every 2 minutes. It finds analyses stuck in PROCESSING for longer than 30 minutes and resets them to PENDING for automatic retry. After 3 failed retries, the analysis is marked as ERROR.
Debug checklist:
- Query the queue table: SELECT status, retry_count, error_message FROM analysis_queue WHERE id = '...'
- Check worker logs: did the worker crash or lose its GEE connection?
- Verify GEE is responding (startup log or a manual test request).
- If retry_count has hit 3, the fixer has given up. Manual investigation is needed.
To retry manually, call POST /api/queue/retry/{queue_id}.
ImageCollection vs Image Error
You see a GEE error such as "Image.load: Asset is not an Image" in the logs.
Cause: Two key datasets are ImageCollections, not single Images:
- RADD alerts: projects/radar-wur/raddalert/v1
- ESA WorldCover: ESA/WorldCover/v200
If you try to load either with ee.Image("..."), GEE rejects it because the asset contains multiple images, not one.
Fix: Use ee.ImageCollection("...").first() for WorldCover. For RADD, filter the collection, select the Alert band with .select('Alert'), and composite with .max().
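A minimal sketch of both loaders with the Earth Engine Python API follows. The function names are hypothetical, and passing the initialized ee module as a parameter is only an illustration choice so the snippet does not authenticate on import. Using filterBounds as the RADD filter is an assumption, and selecting the Alert band before .max() follows the fix listed for "Expected homogeneous collection" in the troubleshooting reference.

```python
RADD_ASSET = "projects/radar-wur/raddalert/v1"
WORLDCOVER_ASSET = "ESA/WorldCover/v200"


def load_worldcover(ee):
    """Take the first image of the WorldCover collection instead of ee.Image(...)."""
    return ee.ImageCollection(WORLDCOVER_ASSET).first()


def load_radd_alerts(ee, region):
    """Filter RADD to a region, keep the Alert band, and composite with max()."""
    return (
        ee.ImageCollection(RADD_ASSET)
        .filterBounds(region)  # assumption: spatial filter; a date filter would also fit
        .select("Alert")       # homogeneous bands before compositing
        .max()
    )
```

With the real client, these would be called after the provider has initialized Earth Engine, passing the imported ee module and an ee.Geometry for the plot.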
Email Not Delivered
A user registers but never receives their activation email. Or an analysis completes but the notification email does not arrive. This is often a configuration issue rather than a code bug.
Debug checklist:
- Check the user's spam/junk folder first — automated emails often land there.
- Verify SMTP credentials in .env: SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASSWORD.
- If using Gmail, SMTP_PASSWORD must be an app-specific password, not the account password. Gmail blocks sign-ins from "less secure apps" by default.
- Check the email_service logs in the backend for send errors or connection timeouts.
- Test with POST /api/auth/resend-activation to trigger a fresh email and watch the logs.
Debugging Recipes
Quick, targeted procedures for checking specific aspects of system health.
How to check if the queue worker is running
The queue worker starts automatically in the startup_event() function when the backend server launches. Look for the "Queue worker started" line in the backend logs.
If the server was restarted, the worker restarts with it — there is no separate process to manage. If the log line is missing, the worker failed to start. Check for import errors or configuration issues in analysis_queue_worker.py.
How to manually retry a failed analysis
Send a POST request to the retry endpoint, /api/queue/retry/{queue_id}.
This resets the job status from FAILED or ERROR back to PENDING and increments the retry count. The queue worker will pick it up on its next polling cycle. Use GET /api/queue/status/{queue_id} to monitor progress afterward.
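The retry and status calls can be sketched with the standard library as shown below. The base URL and the queue id are placeholders, the helper names are hypothetical, and any authentication headers your deployment requires are not covered here.

```python
import urllib.request

# Assumption: the backend is reachable at this address; adjust for your deployment.
BASE_URL = "http://localhost:8000"


def build_retry_request(queue_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for the retry endpoint."""
    url = f"{base_url}/api/queue/retry/{queue_id}"
    return urllib.request.Request(url, method="POST")


def build_status_request(queue_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the GET request used to monitor the job afterward."""
    url = f"{base_url}/api/queue/status/{queue_id}"
    return urllib.request.Request(url, method="GET")


req = build_retry_request("example-queue-id")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it makes it easy to inspect the exact URL before touching a live queue.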
How to check GEE connectivity
Look at the backend startup logs for the provider initialization message: "Provider initialized successfully" on success, "Provider initialization failed" otherwise.
If the provider failed to initialize, all GEE-dependent features (forest analysis, alerts, imagery, NDVI, land cover) will be unavailable. Check the GEE_SERVICE_ACCOUNT_KEY_FILE path and credentials.
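The file-level part of that check is easy to script. The sketch below reads the same variable names from the environment and reports what it finds. The function name is hypothetical, and it cannot detect expired credentials or missing project permissions, which only a real GEE request will reveal.

```python
import json
import os
from pathlib import Path


def check_gee_key_file(env=os.environ) -> list:
    """Return a list of problems found; an empty list means the file checks passed."""
    key_path = env.get("GEE_SERVICE_ACCOUNT_KEY_FILE")
    if not key_path:
        return ["GEE_SERVICE_ACCOUNT_KEY_FILE is not set"]
    path = Path(key_path)
    if not path.is_file():
        return [f"key file not found: {path}"]

    problems = []
    try:
        # A truncated download fails here with a JSONDecodeError.
        json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"key file is not valid JSON: {exc}")
    if not env.get("GEE_PROJECT_ID"):
        problems.append("GEE_PROJECT_ID is not set")
    return problems


if __name__ == "__main__":
    for problem in check_gee_key_file():
        print("PROBLEM:", problem)
```

Run it in the same environment as the backend so it sees the same .env values.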
How to find why a specific analysis failed
- Call the status endpoint: GET /api/queue/status/{queue_id}. The response includes an error_message field with the failure reason.
- For more detail, query the database directly: SELECT id, status, error_message, retry_count, updated_at FROM analysis_queue WHERE id = 'xxx'
- Cross-reference the updated_at timestamp with backend logs for the full stack trace.
How to increase database connection limits
Set DB_POOL_SIZE=10 and DB_MAX_OVERFLOW=20 in your .env file.
This gives you 10 pooled connections plus up to 20 overflow connections (30 total). Also make sure PostgreSQL's max_connections on the server side is high enough to accommodate this. The default is typically 100, which should be sufficient for most deployments.
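The arithmetic is worth checking whenever you run more than one backend process. The sketch below assumes each process keeps its own pool; that assumption and the function names are illustrative, not taken from the Forest Analyzer code.

```python
def connection_budget(pool_size: int, max_overflow: int, processes: int = 1) -> int:
    """Worst-case number of connections the backend can open against PostgreSQL."""
    return (pool_size + max_overflow) * processes


def fits_server_limit(total: int, max_connections: int = 100) -> bool:
    """True when the worst case stays within the server's max_connections."""
    return total <= max_connections


single = connection_budget(pool_size=10, max_overflow=20)
print(single, fits_server_limit(single))  # 30 True

four_workers = connection_budget(pool_size=10, max_overflow=20, processes=4)
print(four_workers, fits_server_limit(four_workers))  # 120 False
```

If the second case applies to your deployment, either lower the per-process settings or raise max_connections on the server.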
Resilience Patterns
How the Forest Analyzer handles failure by design, not by accident.
Self-Healing with the Stuck Analysis Fixer
A background thread wakes up every 2 minutes and queries the database for analyses stuck in PROCESSING status for longer than 30 minutes. When it finds one, it resets the status to PENDING so the queue worker retries it automatically. After 3 failed retries, the analysis is marked as ERROR — at that point, the problem is persistent and needs human investigation.
This is the watchdog pattern: an independent monitor that catches problems the main system cannot detect about itself. The queue worker does not know it crashed — but the stuck analysis fixer notices the work was never completed.
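The decision the fixer makes on each pass can be sketched as a pure function. This is an illustration of the rule described above, not the actual stuck_analysis_fixer code: the Job class is hypothetical, and having the fixer increment retry_count itself is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=30)
MAX_RETRIES = 3


@dataclass
class Job:
    id: str
    status: str
    updated_at: datetime
    retry_count: int = 0


def fix_stuck_jobs(jobs: list, now: datetime) -> list:
    """Reset jobs stuck in PROCESSING; mark them ERROR once retries are used up."""
    touched = []
    for job in jobs:
        if job.status != "PROCESSING" or now - job.updated_at <= STUCK_AFTER:
            continue  # healthy, finished, or not stuck long enough yet
        if job.retry_count >= MAX_RETRIES:
            job.status = "ERROR"  # persistent failure: needs a human
        else:
            job.status = "PENDING"  # the queue worker will pick it up again
            job.retry_count += 1
        job.updated_at = now
        touched.append(job.id)
    return touched


now = datetime(2024, 1, 1, 12, 0)
jobs = [
    Job("a", "PROCESSING", now - timedelta(minutes=45)),
    Job("b", "PROCESSING", now - timedelta(minutes=5)),
    Job("c", "PROCESSING", now - timedelta(minutes=45), retry_count=3),
]
fix_stuck_jobs(jobs, now)
print([(j.id, j.status) for j in jobs])
# [('a', 'PENDING'), ('b', 'PROCESSING'), ('c', 'ERROR')]
```

In the running system this logic would sit inside the background thread and operate on database rows rather than in-memory objects.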
Graceful Degradation
Multiple subsystems are designed to fail independently without taking down the whole application:
- Connection pool fallback: When the pool is exhausted, the system creates direct (slower) connections rather than rejecting requests outright.
- Provider empty results: When the GEE provider fails to fetch satellite imagery or alert data, it returns an empty result object instead of raising an exception. The PDF report still generates — it just omits the missing section.
- Optional alert scheduler: If the alert scheduler fails to start, the rest of the system continues normally. Users lose periodic alert checks, but on-demand analysis still works.
- Email delivery: If SMTP is misconfigured, analysis results are still saved to the database. The user can retrieve them from the API even if the notification email never arrives.
Each subsystem failing independently is better than the whole thing crashing. The system prefers partial results over no results.
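The "empty result instead of an exception" idea can be sketched as follows. All names here (SectionResult, fetch_or_empty, build_report) are hypothetical and stand in for whatever the provider and report generator actually use.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SectionResult:
    items: list = field(default_factory=list)
    error: Optional[str] = None  # kept for logging; never raised to the caller

    @property
    def is_empty(self) -> bool:
        return not self.items


def fetch_or_empty(fetch: Callable[[], list]) -> SectionResult:
    """Run a provider call; on any failure return an empty result, not an exception."""
    try:
        return SectionResult(items=fetch())
    except Exception as exc:  # broad on purpose: every provider failure degrades
        return SectionResult(error=str(exc))


def build_report(sections: dict) -> list:
    """List the sections that have data; empty ones are omitted from the report."""
    return [name for name, result in sections.items() if not result.is_empty]


def failing_fetch() -> list:
    raise TimeoutError("GEE request timed out")


sections = {
    "GLAD alerts": fetch_or_empty(lambda: [{"date": "2024-01-01"}]),
    "RADD alerts": fetch_or_empty(failing_fetch),
}
print(build_report(sections))  # ['GLAD alerts']
```

Keeping the error text on the result object preserves the failure reason for the logs while the report still generates.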
The Debugging Mindset: Follow the Data Path
When something breaks, trace the data's path through the system, hop by hop.
The break is almost always at a boundary — where one system talks to another. These boundaries are where credentials expire, networks time out, and data formats mismatch.
The three most common failure categories:
- Credentials expired: GEE service account keys, JWT tokens, SMTP passwords. The fix is always in .env or the key file.
- Network timeouts: GEE requests over slow connections, database queries on large geometries. The fix is usually increasing timeouts or reducing payload size.
- Data format mismatches: ImageCollection vs Image, invalid GeoJSON, unexpected null fields. The fix is in the code that prepares or consumes the data.
Troubleshooting Reference
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Forest analysis requires GEE | GEE credentials missing or invalid | Check GEE_SERVICE_ACCOUNT_KEY_FILE in .env |
| Connection pool exhausted | Too many concurrent requests | Increase DB_POOL_SIZE / DB_MAX_OVERFLOW |
| Image.load: Asset is not an Image | Loading ImageCollection as Image | Use .first() or filter + .max() |
| Expected homogeneous collection | Mixed image types in RADD | .select('Alert') before .max() |
| Invalid email or password | Account not activated | Check activation email / resend |
| Activation link expired | Token older than 24 hours | Resend activation email |
Background Services
| Service | Interval | Purpose |
|---|---|---|
| Alert Scheduler | 24 hours (06:00 UTC) | Check subscribed plots for new deforestation alerts |
| Stuck Analysis Fixer | 2 minutes | Rescue analyses stuck in PROCESSING status |
| Queue Worker | Continuous polling | Process pending analysis jobs |
Health Check Points
| What | How to Check |
|---|---|
| Database | GET /api/docs — if the Swagger UI loads, the database is connected |
| GEE Provider | Check startup logs for "Provider initialized successfully" |
| Queue Worker | Check logs for "Queue worker started" |
| Email (SMTP) | Send test via POST /api/auth/resend-activation |
| Alert Scheduler | Check logs or call get_scheduler_status() |
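The log-based checks in this table can be combined into one pass over the startup log. The sketch below searches for the messages quoted in this guide; how you obtain the log lines (file, journald, container logs) depends on your deployment, and the function name is hypothetical.

```python
SUCCESS_MARKERS = {
    "GEE Provider": "Provider initialized successfully",
    "Queue Worker": "Queue worker started",
}
GEE_FAILURE_MARKER = "Provider initialization failed"


def scan_startup_log(lines) -> dict:
    """Map each component to True/False depending on whether its marker was logged."""
    text = "\n".join(lines)
    status = {name: marker in text for name, marker in SUCCESS_MARKERS.items()}
    # An explicit failure message overrides any earlier success line.
    if GEE_FAILURE_MARKER in text:
        status["GEE Provider"] = False
    return status


sample = [
    "INFO startup: Provider initialization failed",
    "INFO startup: Queue worker started",
]
print(scan_startup_log(sample))
# {'GEE Provider': False, 'Queue Worker': True}
```

A False entry tells you which section of this guide to start with.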