Diagnose Common Failures Step by Step

Every system fails eventually. The difference between a good system and a fragile one is how quickly you can find and fix the problem. This section walks through the five most common failure modes in the Forest Analyzer, with the symptoms, root cause, and fix for each.

Step 1

GEE Authentication Failure

This is the most common error. If the file named by GEE_SERVICE_ACCOUNT_KEY_FILE is missing or the credentials have expired, every analysis fails. Nothing else in the pipeline can compensate, so the analyzer raises:

"Forest analysis requires Google Earth Engine for accurate data."

This single point of failure affects forest coverage calculation, GLAD alerts, RADD alerts, NDVI analysis, satellite imagery, and land cover classification. Everything that touches GEE stops working at once.

Debug checklist:

  • Check .env — is GEE_SERVICE_ACCOUNT=true set?
  • Does the key file at GEE_SERVICE_ACCOUNT_KEY_FILE actually exist on disk?
  • Open the key file — is it valid JSON? (Truncated downloads are a common cause.)
  • Has the service account been granted access to the GEE project specified in GEE_PROJECT_ID?
  • Check startup logs for "Provider initialized successfully" vs "Provider initialization failed".
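The file checks in the list above can be scripted. The following is a minimal sketch, not part of the Forest Analyzer itself: the function name check_gee_key_file is hypothetical, and it assumes GEE_SERVICE_ACCOUNT_KEY_FILE has already been exported into the process environment (values that exist only in .env are not visible to os.environ until something loads them). It verifies structure only and cannot tell whether the credentials have expired or whether the account has project access.

```python
import json
import os
from pathlib import Path


def check_gee_key_file(key_path):
    """Return a list of problems found with a service account key file."""
    if not key_path:
        return ["GEE_SERVICE_ACCOUNT_KEY_FILE is not set"]

    path = Path(key_path)
    if not path.is_file():
        return [f"key file not found: {path}"]

    try:
        key = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        # A truncated download typically fails here.
        return [f"key file is not valid JSON: {exc}"]

    if not isinstance(key, dict):
        return ["key file JSON is not an object"]

    # Fields normally present in a Google service account key.
    problems = []
    for field in ("type", "client_email", "private_key"):
        if field not in key:
            problems.append(f"key file is missing the '{field}' field")
    return problems


if __name__ == "__main__":
    issues = check_gee_key_file(os.environ.get("GEE_SERVICE_ACCOUNT_KEY_FILE"))
    print("\n".join(issues) if issues else "Key file looks structurally valid.")
```

An empty result only rules out the file-level causes; the startup log remains the authoritative signal.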

Fixed when: Backend startup logs show "Provider initialized successfully" and a test analysis completes without the GEE error message.

Step 2

Database Connection Pool Exhaustion

The system maintains a pool of 15 database connections. When all are in use, new requests fall back to slower direct connections. You will see this in the logs:

"Connection pool exhausted, creating direct connection."

This is not an outright crash — it is graceful degradation. But it signals that the system is under more load than expected, or that connections are leaking.

Debug checklist:

  • Check DB_POOL_SIZE and DB_MAX_OVERFLOW settings in .env.
  • Look for connection leaks — code paths where a connection is acquired but never released back to the pool.
  • Check if analyses are running unusually long, holding connections open.
  • Monitor concurrent request count — a traffic spike may simply exceed capacity.
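To make the leak item in the checklist concrete, here is a toy pool written with the standard library only. It is a sketch of the general pattern, not the Forest Analyzer's actual pool: TinyPool and its methods are hypothetical names, and the connect argument stands in for whatever creates a real database connection. The point is the try/finally, which returns the connection even when the caller raises.

```python
import contextlib
import logging
import queue

log = logging.getLogger("db")


class TinyPool:
    """Toy connection pool that falls back to direct connections when empty."""

    def __init__(self, connect, size):
        self._connect = connect
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextlib.contextmanager
    def connection(self):
        try:
            conn = self._idle.get_nowait()
            pooled = True
        except queue.Empty:
            log.warning("Connection pool exhausted, creating direct connection.")
            conn = self._connect()
            pooled = False
        try:
            yield conn
        finally:
            # Runs even if the caller raises, so the connection never leaks.
            if pooled:
                self._idle.put(conn)
            else:
                conn.close()

    def idle_count(self):
        """Number of pooled connections currently available."""
        return self._idle.qsize()
```

Code that acquires a connection without a with block (or an equivalent finally) is the first place to look when the pool drains under light load.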

Fixed when: The "pool exhausted" log messages stop appearing under normal load. Increasing DB_POOL_SIZE or fixing connection leaks resolves the issue.

Step 3

Stuck Analysis

An analysis stays in PROCESSING status for more than 30 minutes. This usually means the GEE request timed out or the worker process crashed mid-analysis.

The system has a built-in watchdog: the stuck_analysis_fixer background job runs every 2 minutes. It finds analyses stuck in PROCESSING for longer than 30 minutes and resets them to PENDING for automatic retry. After 3 failed retries, the analysis is marked as ERROR.

Debug checklist:

  • Query the queue table: SELECT status, retry_count, error_message FROM analysis_queue WHERE id = '...'
  • Check worker logs — did the worker crash or lose its GEE connection?
  • Verify GEE is responding (startup log or a manual test request).
  • If retry_count has hit 3, the fixer has given up. Manual investigation is needed.

Fixed when: The analysis either completes on retry, or you identify and resolve the underlying GEE/network issue and manually retry via POST /api/queue/retry/{queue_id}.

Step 4

ImageCollection vs Image Error

You see this GEE error in the logs:

"Image.load: Asset 'X' is not an Image."

Cause: Two key datasets are ImageCollections, not single Images:

  • RADD alerts: projects/radar-wur/raddalert/v1
  • ESA WorldCover: ESA/WorldCover/v200

If you try to load either with ee.Image("..."), GEE rejects it because the asset contains multiple images, not one.

Fix: Use ee.ImageCollection("...").first() for WorldCover, or filter and composite for RADD:

# RADD: filter by date, select the Alert band, then composite
radd = ee.ImageCollection("projects/radar-wur/raddalert/v1") \
    .filterBounds(geometry) \
    .filterDate(start_date, end_date) \
    .select('Alert').max()

# WorldCover: just grab the first (and usually only) image
worldcover = ee.ImageCollection("ESA/WorldCover/v200").first()

Fixed when: The "Asset is not an Image" error disappears and the analysis returns valid alert/land cover data.

Step 5

Email Not Delivered

A user registers but never receives their activation email. Or an analysis completes but the notification email does not arrive. This is often a configuration issue rather than a code bug.

Debug checklist:

  • Check the user's spam/junk folder first — automated emails often land there.
  • Verify SMTP credentials in .env: SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASSWORD.
  • If using Gmail, SMTP_PASSWORD must be an app-specific password, not the account password. Gmail blocks sign-ins from "less secure apps" by default.
  • Check the email_service logs in the backend for send errors or connection timeouts.
  • Test with POST /api/auth/resend-activation to trigger a fresh email and watch the logs.
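The SMTP items in the checklist can be verified outside the application. The sketch below uses Python's standard smtplib; the function names are hypothetical, and it assumes the server expects STARTTLS (commonly port 587). A server that requires implicit TLS (commonly port 465) would need smtplib.SMTP_SSL instead. It also assumes the four settings are available in the process environment.

```python
import os
import smtplib

REQUIRED = ("SMTP_HOST", "SMTP_PORT", "SMTP_USER", "SMTP_PASSWORD")


def missing_smtp_settings(env):
    """Return the names of required SMTP settings that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


def try_smtp_login(env, timeout=10):
    """Attempt a STARTTLS login; raises smtplib.SMTPException or OSError on failure."""
    with smtplib.SMTP(env["SMTP_HOST"], int(env["SMTP_PORT"]), timeout=timeout) as server:
        server.starttls()
        server.login(env["SMTP_USER"], env["SMTP_PASSWORD"])


if __name__ == "__main__":
    missing = missing_smtp_settings(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        try:
            try_smtp_login(os.environ)
            print("SMTP login succeeded.")
        except (smtplib.SMTPException, OSError) as exc:
            print("SMTP login failed:", exc)
```

A successful login rules out credentials and connectivity, which leaves spam filtering or the email_service code path as the remaining suspects.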

Fixed when: The resend-activation endpoint returns success and the user receives the email within a few minutes.

Debugging Recipes

Quick, targeted procedures for checking specific aspects of system health.

How to check if the queue worker is running

The queue worker starts automatically in the startup_event() function when the backend server launches. Look for this line in the backend logs:

"Queue worker started"

If the server was restarted, the worker restarts with it — there is no separate process to manage. If the log line is missing, the worker failed to start. Check for import errors or configuration issues in analysis_queue_worker.py.

How to manually retry a failed analysis

Send a POST request to the retry endpoint:

POST /api/queue/retry/{queue_id}

This resets the job status from FAILED or ERROR back to PENDING and increments the retry count. The queue worker will pick it up on its next polling cycle. Use GET /api/queue/status/{queue_id} to monitor progress afterward.
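As a rough illustration, the two calls can be issued from Python with the standard library. Only the endpoint paths come from this guide; the helper names, the base URL, and the assumption that responses are JSON are hypothetical, and a real deployment will most likely also require an authentication header that is not shown here.

```python
import json
import urllib.request


def build_retry_request(base_url, queue_id):
    """Build (but do not send) the POST request for the retry endpoint."""
    url = f"{base_url.rstrip('/')}/api/queue/retry/{queue_id}"
    return urllib.request.Request(url, data=b"", method="POST")


def build_status_request(base_url, queue_id):
    """Build (but do not send) the GET request for the status endpoint."""
    url = f"{base_url.rstrip('/')}/api/queue/status/{queue_id}"
    return urllib.request.Request(url, method="GET")


def send(request, timeout=30):
    """Send a request and decode the body, assuming the API answers with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Typical use would be send(build_retry_request(base_url, queue_id)) followed by periodic send(build_status_request(base_url, queue_id)) calls until the status leaves PENDING and PROCESSING.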

How to check GEE connectivity

Look at the backend startup logs for the provider initialization message:

# Success: "Provider initialized successfully"
# Failure: "Provider initialization failed"

If the provider failed to initialize, all GEE-dependent features (forest analysis, alerts, imagery, NDVI, land cover) will be unavailable. Check the GEE_SERVICE_ACCOUNT_KEY_FILE path and credentials.
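Scanning the log for these markers is easy to automate. The sketch below is a hypothetical helper, not part of the backend; it matches the two provider messages quoted above plus the "Queue worker started" line from the earlier recipe, and it assumes the log is available as an iterable of text lines.

```python
MARKERS = {
    "Provider initialized successfully": ("gee", True),
    "Provider initialization failed": ("gee", False),
    "Queue worker started": ("queue_worker", True),
}


def scan_startup_log(lines):
    """Return the last observed state per component; None means no marker was seen."""
    state = {"gee": None, "queue_worker": None}
    for line in lines:
        for marker, (component, ok) in MARKERS.items():
            if marker in line:
                state[component] = ok
    return state
```

A None value deserves as much attention as False: it means the component never logged its startup line at all.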

How to find why a specific analysis failed

  1. Call the status endpoint: GET /api/queue/status/{queue_id}. The response includes an error_message field with the failure reason.
  2. For more detail, query the database directly: SELECT id, status, error_message, retry_count, updated_at FROM analysis_queue WHERE id = 'xxx'
  3. Cross-reference the updated_at timestamp with backend logs for the full stack trace.
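When running the query from step 2 in a script, bind the id as a parameter rather than pasting it into the SQL string. The sketch below uses an in-memory-friendly sqlite3 connection purely as a stand-in so it runs anywhere; the function name is hypothetical, and PostgreSQL drivers such as psycopg2 use %s placeholders instead of ?.

```python
import sqlite3

QUERY = """
    SELECT id, status, error_message, retry_count, updated_at
    FROM analysis_queue
    WHERE id = ?
"""


def fetch_queue_row(conn, queue_id):
    """Return the queue row as a dict, or None if the id is unknown."""
    conn.row_factory = sqlite3.Row
    row = conn.execute(QUERY, (queue_id,)).fetchone()
    return dict(row) if row is not None else None
```

The updated_at value in the returned row is the timestamp to search for in the backend logs.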

How to increase database connection limits

Set these values in your .env file:

DB_POOL_SIZE=10
DB_MAX_OVERFLOW=20

This gives you 10 pooled connections plus up to 20 overflow connections (30 total). Also make sure PostgreSQL's max_connections on the server side is high enough to accommodate this — the default is typically 100, which should be sufficient for most deployments.
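The arithmetic above is worth rechecking whenever the settings change. The helper below is a hypothetical sketch; it assumes each backend process keeps its own pool (so the peak multiplies by the number of processes) and subtracts the 3 connections PostgreSQL reserves for superusers by default.

```python
def connection_budget(pool_size, max_overflow, processes=1,
                      max_connections=100, reserved=3):
    """Return (peak connections, whether that peak fits on the server)."""
    peak = (pool_size + max_overflow) * processes
    available = max_connections - reserved
    return peak, peak <= available


# Settings from this section, single backend process.
print(connection_budget(10, 20))
# Same settings with four backend processes (an assumed deployment).
print(connection_budget(10, 20, processes=4))
```

With the values from this section a single process peaks at 30 connections, which fits; four such processes would peak at 120, which does not fit under a max_connections of 100.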

Resilience Patterns

How the Forest Analyzer handles failure by design, not by accident.

Self-Healing with the Stuck Analysis Fixer

The Watchdog Pattern

A background thread wakes up every 2 minutes and queries the database for analyses stuck in PROCESSING status for longer than 30 minutes. When it finds one, it resets the status to PENDING so the queue worker retries it automatically. After 3 failed retries, the analysis is marked as ERROR — at that point, the problem is persistent and needs human investigation.

This is the watchdog pattern: an independent monitor that catches problems the main system cannot detect about itself. The queue worker does not know it crashed — but the stuck analysis fixer notices the work was never completed.
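The decision logic of such a watchdog fits in a few lines. The sketch below is an in-memory illustration under stated assumptions, not the actual stuck_analysis_fixer: the Analysis dataclass and its field names are hypothetical, the real job works against the database, and whether the fixer itself increments the retry count is assumed here rather than confirmed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=30)
MAX_RETRIES = 3


@dataclass
class Analysis:
    id: str
    status: str
    started_at: datetime
    retry_count: int = 0


def fix_stuck_analyses(analyses, now):
    """Reset or fail analyses stuck in PROCESSING; return the ids that were touched."""
    touched = []
    for analysis in analyses:
        if analysis.status != "PROCESSING":
            continue
        if now - analysis.started_at <= STUCK_AFTER:
            continue  # still within the allowed processing window
        if analysis.retry_count >= MAX_RETRIES:
            analysis.status = "ERROR"  # persistent problem, needs a human
        else:
            analysis.status = "PENDING"  # queue worker will pick it up again
            analysis.retry_count += 1  # assumed: the fixer counts this as a retry
        touched.append(analysis.id)
    return touched
```

Passing now in as an argument, instead of reading the clock inside the function, keeps the logic easy to test with fixed timestamps.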

Graceful Degradation

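Multiple subsystems are designed to fail independently without taking down the whole application. Database pool exhaustion is one example: requests fall back to slower direct connections instead of failing.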

Each subsystem failing independently is better than the whole thing crashing. The system prefers partial results over no results.

The Debugging Mindset: Follow the Data Path

When something breaks, trace the data's path through the system:

Browser → API endpoint → Service → External service → Database → Response

The break is almost always at a boundary — where one system talks to another. These boundaries are where credentials expire, networks time out, and data formats mismatch.

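The three most common failure categories follow from those boundaries:

  • Expired or missing credentials (Step 1: GEE authentication)
  • Network timeouts and dropped connections (Step 3: stuck analyses)
  • Data format mismatches (Step 4: ImageCollection vs Image)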

Check Your Understanding

Question 1
A user reports their analysis has been "processing" for 2 hours. What is the most likely explanation?
Question 2
PDF reports generate successfully but contain no satellite imagery. The analysis data itself is correct. What would you check?
Question 3
You see "Expected a homogeneous image collection" in the GEE logs. What is the fix?

Troubleshooting Reference

Common Errors

Error                             | Cause                              | Fix
----------------------------------+------------------------------------+------------------------------------------
Forest analysis requires GEE      | GEE credentials missing or invalid | Check GEE_SERVICE_ACCOUNT_KEY_FILE in .env
Connection pool exhausted         | Too many concurrent requests       | Increase DB_POOL_SIZE / DB_MAX_OVERFLOW
Image.load: Asset is not an Image | Loading ImageCollection as Image   | Use .first() or filter + .max()
Expected homogeneous collection   | Mixed image types in RADD          | .select('Alert') before .max()
Invalid email or password         | Account not activated              | Check activation email / resend
Activation link expired           | Token older than 24 hours          | Resend activation email

Background Services

Service              | Interval             | Purpose
---------------------+----------------------+----------------------------------------------------
Alert Scheduler      | 24 hours (06:00 UTC) | Check subscribed plots for new deforestation alerts
Stuck Analysis Fixer | 2 minutes            | Rescue analyses stuck in PROCESSING status
Queue Worker         | Continuous polling   | Process pending analysis jobs

Health Check Points

What            | How to Check
----------------+-------------------------------------------------------------------
Database        | GET /api/docs: if the Swagger UI loads, the database is connected
GEE Provider    | Check startup logs for "Provider initialized successfully"
Queue Worker    | Check logs for "Queue worker started"
Email (SMTP)    | Send test via POST /api/auth/resend-activation
Alert Scheduler | Check logs or call get_scheduler_status()