Module 03: Data Flow & Communication - EUDR Forest Analyzer Course

Trace Data Through a Complete Analysis

Below is a simulated conversation between the system's components as a real analysis unfolds. Pay attention to how each component transforms the data before handing it off.

User

Uploading plantation_borneo.geojson to /api/analysis/upload

FileProcessor

Parsed 3 features. One polygon was self-intersecting — auto-repaired. Area: 247.3 ha.

Database (PostGIS)

Stored 3 plots with PostGIS geometries. Upload ID: a7f3b...

ForestAnalyzer

Starting analysis. Coordinates detected in Indonesia (ID) — HIGH risk per Article 29. +20 points.

GLAD Service

Hansen GFC via GEE: 3.2 ha tree cover loss in 2022. Post-cutoff alert!

RADD Service

Sentinel-1 radar confirms change in same area. 2.8 ha. High confidence.

ForestAnalyzer

Risk score: 82/100. Forest loss: 3.1%. Verdict: NON_COMPLIANT

ReportGenerator

PDF built: 8 pages with satellite imagery, NDVI maps, alert table. Emailing user.

Notice the handoffs

Each component receives structured data, adds its own context, and passes an enriched result downstream. The User sends a file; the ReportGenerator receives an AnalysisResult containing everything accumulated along the way.

Concurrent Fetching with asyncio.gather

The GLAD and RADD queries are independent: neither needs the other's result. The analyzer runs both concurrently with asyncio.gather:

Python

glad_task = self._fetch_glad_alerts(geometry, plot_id)
radd_task = self._fetch_radd_alerts(geometry, plot_id)

glad_alerts, radd_data = await asyncio.gather(
    glad_task, radd_task, return_exceptions=True
)

Plain English

Line 1: Create a coroutine that will query the GLAD service (Landsat optical data) for deforestation alerts within this geometry. Calling an async method only builds the coroutine; nothing runs yet.

Line 2: Create a second coroutine that will query the RADD service (Sentinel-1 radar data) for the same area.

Lines 4-6: Schedule both coroutines and run them concurrently. Wait until both finish. If either one raises, return the exception object in that result's slot instead of propagating it and losing the other result (return_exceptions=True). Unpack the two results into glad_alerts and radd_data.

Data Flow Tasks

How to add a new step to the analysis pipeline

Say you want to add a biodiversity check alongside the existing GLAD and RADD queries.

1. Open backend/services/forest_analyzer_with_alerts.py and find the analyze_plot() method.
2. Create a new async method (e.g., _fetch_biodiversity_risk(geometry, plot_id)) that calls your data source and returns a result object.
3. Add your new task to the asyncio.gather call alongside the GLAD and RADD tasks so it runs concurrently.
4. Unpack the new result and add it to the details dict that gets attached to the AnalysisResult.
5. If the new data should affect the risk score, update the scoring logic in the same method.
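The steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the three fetch functions are hypothetical stand-ins that sleep instead of calling a real data source, the result shapes are assumed, and ok_or_none is a helper invented for this example.

```python
import asyncio
from typing import Any


# Hypothetical stand-ins for the analyzer's fetch methods.
async def fetch_glad_alerts(geometry: dict, plot_id: str) -> dict:
    await asyncio.sleep(0.01)  # simulates a network call
    return {"has_alerts": True, "area_ha": 3.2}


async def fetch_radd_alerts(geometry: dict, plot_id: str) -> dict:
    await asyncio.sleep(0.01)
    return {"has_alerts": True, "area_ha": 2.8}


async def fetch_biodiversity_risk(geometry: dict, plot_id: str) -> dict:
    # Assumed result shape; use whatever your data source returns.
    await asyncio.sleep(0.01)
    return {"protected_area_overlap": False}


def ok_or_none(result: Any) -> Any:
    """Return the result, or None if gather captured an exception."""
    return None if isinstance(result, BaseException) else result


async def analyze_plot(geometry: dict, plot_id: str) -> dict:
    # All three queries run concurrently; a failure in one is returned
    # as an exception object instead of aborting the others.
    glad, radd, biodiversity = await asyncio.gather(
        fetch_glad_alerts(geometry, plot_id),
        fetch_radd_alerts(geometry, plot_id),
        fetch_biodiversity_risk(geometry, plot_id),  # the new step
        return_exceptions=True,
    )
    # Build the details dict that would be attached to the result.
    return {
        "glad": ok_or_none(glad),
        "radd": ok_or_none(radd),
        "biodiversity": ok_or_none(biodiversity),
    }
```

Calling asyncio.run(analyze_plot(geometry, "plot-1")) returns the details dict with one entry per source. A failed source shows up as None, so the scoring logic can decide how to treat missing data.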

How to track data through the system

When debugging, follow this chain to trace how a piece of data moves:

1. API endpoint: Find the route in backend/api/. This is where the HTTP request arrives and parameters are validated.
2. Service method: The endpoint calls a method in backend/services/. This is where business logic lives.
3. Data model: The service creates or updates a dataclass from backend/models/.
4. Database table: The model is persisted via SQL in backend/utils/database.py.
5. Response: The result flows back up: database row → model → service → API response JSON. Each layer adds context.

How to add a new data model

1. Create a new dataclass in the appropriate file under backend/models/ (e.g., models/alerts.py for alert-related data).
2. Add a to_dict() method so the model can be serialized to JSON for API responses.
3. Use the model in your service layer: import it and return instances from service methods.
4. Create the corresponding database table in backend/utils/database.py inside the ensure_tables_exist() function, using SQL CREATE TABLE IF NOT EXISTS.
5. Add any necessary indexes, especially spatial indexes if the model includes geometry columns.
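As a rough sketch of these steps, the example below defines a hypothetical BiodiversityFinding model with a to_dict() method, plus the SQL that could go into ensure_tables_exist(). The model name, its fields, the table name, and the column types are all assumptions made for illustration; only the general pattern (dataclass, to_dict, CREATE TABLE IF NOT EXISTS, spatial index) comes from the steps above.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class BiodiversityFinding:  # hypothetical model, not part of the codebase
    plot_id: str
    check_date: date
    species_count: int
    risk_level: str

    def to_dict(self) -> dict:
        """Return a JSON-serializable dict for API responses."""
        data = asdict(self)
        # date objects are not JSON-serializable, so emit an ISO string.
        data["check_date"] = self.check_date.isoformat()
        return data


# Assumed schema. GIST is the PostGIS index type for geometry columns.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS biodiversity_findings (
    id SERIAL PRIMARY KEY,
    plot_id TEXT NOT NULL,
    check_date DATE NOT NULL,
    species_count INTEGER NOT NULL,
    risk_level TEXT NOT NULL,
    geom geometry(MultiPolygon, 4326)
);
CREATE INDEX IF NOT EXISTS idx_biodiversity_findings_geom
    ON biodiversity_findings USING GIST (geom);
"""
```

A service method would then return BiodiversityFinding instances, and the API layer would call to_dict() on them before building the JSON response.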

Communication Patterns

Synchronous vs Asynchronous

The system offers two paths for running an analysis:

Synchronous (real-time): The user uploads a file, the server analyzes it immediately, and the response comes back in the same HTTP connection. Upload → analyze → wait → results. The user holds the connection open the entire time.

Asynchronous (queue-based): The user uploads a file and gets back a queue ID immediately. A background worker picks up the job, runs the analysis, saves results to the database, and sends an email notification. Upload → queue → background worker processes → email notification.

Think of it like ordering food. Synchronous is ordering at a counter — you stand there and wait while they make it, then walk away with your meal. Asynchronous is ordering delivery — you place the order, go do other things, and get notified when it arrives at your door. The food (analysis) is the same either way; the difference is whether you block and wait or free yourself up.

The sync path lives in /api/analysis/analyze/{upload_id}. The async path uses /api/queue/submit, with a background worker in analysis_queue_worker.py polling for pending jobs. Real-time progress is available via WebSocket connections managed by websocket_manager.py.
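The difference between the two paths can be illustrated with a small in-memory sketch. It is not how the real endpoints or analysis_queue_worker.py are written: an asyncio.Queue stands in for the job table the worker polls, run_analysis sleeps instead of analyzing, and every function name here is hypothetical.

```python
import asyncio
import contextlib
import uuid


async def run_analysis(upload_id: str) -> dict:
    """Stand-in for the real analysis; sleeps instead of querying satellites."""
    await asyncio.sleep(0.01)
    return {"upload_id": upload_id, "status": "done"}


async def analyze_sync(upload_id: str) -> dict:
    """Synchronous path: the caller waits for the full result."""
    return await run_analysis(upload_id)


async def submit(queue: asyncio.Queue, upload_id: str) -> str:
    """Asynchronous path: enqueue the job and return a queue ID at once."""
    queue_id = str(uuid.uuid4())
    await queue.put((queue_id, upload_id))
    return queue_id


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Background worker: take jobs one at a time and store their results."""
    while True:
        queue_id, upload_id = await queue.get()
        try:
            # The real worker would also save to the database and send email.
            results[queue_id] = await run_analysis(upload_id)
        finally:
            queue.task_done()


async def demo():
    queue = asyncio.Queue()
    results = {}
    worker_task = asyncio.create_task(worker(queue, results))

    direct = await analyze_sync("upload-1")     # waits until finished
    queue_id = await submit(queue, "upload-2")  # returns before the job runs
    was_pending = queue_id not in results

    await queue.join()                          # wait for the worker to finish
    worker_task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await worker_task
    return direct, queue_id, results, was_pending
```

In the sync call the result is in hand as soon as the await returns. In the queued call the caller holds only an ID, and the result appears in the results store later, which is the point at which the real system would send its notification.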

Why asyncio.gather?

When two operations do not depend on each other, running them one after the other wastes time. The GLAD service queries Landsat optical data from Google Earth Engine. The RADD service queries Sentinel-1 radar data from GEE. Neither needs the other's result to do its work.

asyncio.gather() runs both requests concurrently. If each takes 3 seconds, sequential execution takes about 6 seconds. With gather, the two waits overlap and the total is roughly 3 seconds (the duration of the slower request), which halves the time spent fetching alerts.

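The return_exceptions=True parameter is critical: if RADD fails (say, the plot is outside tropical coverage), the exception is returned in RADD's result slot and the GLAD result is still delivered. Without it, the first exception would propagate out of the await and the caller would lose the GLAD result as well. Because failures arrive as ordinary return values, the caller must check each result with isinstance(result, Exception) before using it.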
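A small timing experiment makes both points concrete. The fetch functions below are hypothetical stand-ins that sleep for a configurable delay instead of calling GEE, and the failure is simulated with a flag.

```python
import asyncio
import time


async def fetch_glad(delay: float) -> dict:
    await asyncio.sleep(delay)  # simulates waiting on a remote service
    return {"source": "GLAD", "area_ha": 3.2}


async def fetch_radd(delay: float, fail: bool = False) -> dict:
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError("plot outside RADD coverage")
    return {"source": "RADD", "area_ha": 2.8}


async def sequential(delay: float) -> float:
    """Run the two fetches one after the other; return elapsed seconds."""
    start = time.perf_counter()
    await fetch_glad(delay)
    await fetch_radd(delay)
    return time.perf_counter() - start


async def concurrent(delay: float) -> float:
    """Run the two fetches with gather; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(fetch_glad(delay), fetch_radd(delay))
    return time.perf_counter() - start


async def with_failure(delay: float):
    """RADD fails, but GLAD's result is still returned."""
    return await asyncio.gather(
        fetch_glad(delay),
        fetch_radd(delay, fail=True),
        return_exceptions=True,
    )
```

Running sequential(0.05) and concurrent(0.05) under asyncio.run shows the concurrent version finishing in about half the time. with_failure returns the GLAD dict alongside a RuntimeError instance, which the caller can detect with isinstance.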

Alert Deduplication

GLAD and RADD may detect the same deforestation event. GLAD sees it through optical satellite imagery (Landsat, 30m resolution). RADD sees it through radar (Sentinel-1, 10m resolution). The affected areas often overlap but rarely match exactly because the two sensors have different resolutions and detection methods.

To avoid double-counting, the system takes max(glad_area, radd_area) as the reported deforestation area rather than summing them. This never counts the same event twice, at the cost of undercounting when the two sources detect separate, non-overlapping events.

When both systems detect alerts in the same area, the system flags this as cross-validated, which means higher confidence. A single-source detection might be a false positive (cloud shadow misread as loss, for instance), but two independent sensors agreeing makes the finding much more reliable.
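The deduplication rule can be sketched in a few lines. The CombinedAlerts type and combine_alerts function are hypothetical names for this example, and the sketch simplifies "same area" to "both sources reported alerts for the same plot"; it does not test whether the alert geometries actually overlap.

```python
from dataclasses import dataclass, field


@dataclass
class CombinedAlerts:  # hypothetical result type
    area_ha: float
    cross_validated: bool
    sources: list = field(default_factory=list)


def combine_alerts(glad_area_ha: float, radd_area_ha: float) -> CombinedAlerts:
    """Merge GLAD and RADD alert areas without double-counting."""
    sources = []
    if glad_area_ha > 0:
        sources.append("GLAD")
    if radd_area_ha > 0:
        sources.append("RADD")
    return CombinedAlerts(
        # Take the larger area instead of the sum.
        area_ha=max(glad_area_ha, radd_area_ha),
        # Two independent sensors agreeing raises confidence.
        cross_validated=len(sources) == 2,
        sources=sources,
    )
```

With the figures from the walkthrough, combine_alerts(3.2, 2.8) reports 3.2 ha and marks the finding as cross-validated, while combine_alerts(3.2, 0.0) reports the same area from a single source.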

Data Flow Reference

Analysis Pipeline Steps

| Step | Component | Input | Output |
| --- | --- | --- | --- |
| 1. File upload | FileProcessor | GeoJSON / KML / Shapefile | `PlotData` (geometry, area, features) |
| 2. Storage | PostGIS | `PlotData` | `plot_id` in database |
| 3. Forest coverage | GEE Provider | geometry | coverage_2020, coverage_current, loss% |
| 4. GLAD alerts | GLAD Service | geometry + date range | has_alerts, count, area_ha, loss_by_year |
| 5. RADD alerts | RADD Service | geometry + date range | has_alerts, count, area_ha |
| 6. Risk scoring | ForestAnalyzer | all above + country code | risk_score (0–100) |
| 7. Compliance | ForestAnalyzer | risk_score + thresholds | `ComplianceStatus` enum |
| 8. Report | ReportGenerator | `AnalysisResult` | PDF file |
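Step 7 maps a numeric score onto the ComplianceStatus enum. The sketch below shows one way such a mapping could look. The threshold values are placeholders invented for this example, not the project's real thresholds, and the enum's string values are also assumed.

```python
from enum import Enum
from typing import Optional


class ComplianceStatus(Enum):
    COMPLIANT = "COMPLIANT"
    NON_COMPLIANT = "NON_COMPLIANT"
    NEEDS_REVIEW = "NEEDS_REVIEW"
    UNKNOWN = "UNKNOWN"


# Placeholder thresholds for illustration only, NOT the project's values.
NON_COMPLIANT_MIN = 70
NEEDS_REVIEW_MIN = 40


def classify(risk_score: Optional[float]) -> ComplianceStatus:
    """Map a 0-100 risk score to a compliance status."""
    if risk_score is None:
        # No score could be computed, e.g. every data source failed.
        return ComplianceStatus.UNKNOWN
    if not 0 <= risk_score <= 100:
        raise ValueError(f"risk_score out of range: {risk_score}")
    if risk_score >= NON_COMPLIANT_MIN:
        return ComplianceStatus.NON_COMPLIANT
    if risk_score >= NEEDS_REVIEW_MIN:
        return ComplianceStatus.NEEDS_REVIEW
    return ComplianceStatus.COMPLIANT
```

Under these placeholder thresholds, the score of 82 from the walkthrough classifies as NON_COMPLIANT, matching the verdict shown there.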

Data Models

| Model | File | Key Fields |
| --- | --- | --- |
| `PlotData` | `models/__init__.py` | geometry, feature_count, total_area_hectares, bounds |
| `AnalysisResult` | `models/__init__.py` | plot_id, forest_coverage_percent, compliance_status, risk_score, details |
| `ComplianceStatus` | `models/__init__.py` | COMPLIANT \| NON_COMPLIANT \| NEEDS_REVIEW \| UNKNOWN |
| `RiskLevel` | `models/__init__.py` | LOW \| MEDIUM \| HIGH \| CRITICAL |
| `DeforestationAlert` | `models/alerts.py` | plot_id, alert_date, alert_type, confidence, affected_area_hectares |