Prediction Methodology

How NYC Water Check estimates bacteria levels after rain

This page documents the current model, the historical backtests behind it, and the rainfall-source validation work used to decide how repaired local gauge-derived fields should drive the prediction pipeline.

The goal here is not marketing copy. This is a working methods note for a public-facing prediction system built on real historical shoreline bacteria and rainfall data.

Dataset snapshot

Historical paired samples: 14,812
Sites in model set: 155
Best test RMSE: 0.7253
Best threshold accuracy: 73.6%

Overview

Current prediction logic

The production estimate starts with each site's own typical bacteria level, then adjusts that baseline using repaired rainfall features learned from the historical record. It is a pooled citywide model rather than a fully custom coefficient set per site, because fully site-specific fits overfit the data.

Outcome being modeled

Enterococcus concentration on the log scale. The site's median historical log level acts as the baseline, and the rain model predicts the residual above or below that baseline.
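The baseline-plus-residual setup can be sketched as follows. This is a minimal illustration, not the production code: the base-10 logarithm, the function names, and the sample values are all assumptions, since the page does not state which log base or data layout the pipeline uses.

```python
import math
from collections import defaultdict
from statistics import median


def site_baselines(samples):
    """Return {site: median log10 concentration} from (site, mpn) pairs.

    Assumes every mpn value is positive; how non-detects are handled
    upstream is not covered here.
    """
    logs = defaultdict(list)
    for site, mpn in samples:
        logs[site].append(math.log10(mpn))
    return {site: median(values) for site, values in logs.items()}


def residual(site, mpn, baselines):
    """Log-scale distance of one sample above (+) or below (-) its site baseline."""
    return math.log10(mpn) - baselines[site]


# Hypothetical readings, for illustration only.
samples = [("A", 10.0), ("A", 100.0), ("A", 1000.0), ("B", 10.0), ("B", 10.0)]
baselines = site_baselines(samples)
```

The rain model is then fit to these residuals, so a prediction is the site baseline plus the rain-driven adjustment.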

Core rain windows

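The model uses rain from the last 48 hours (`recentRain`), from the middle and late parts of the prior week (`midRain`, `lateRain`), and a weighted prior-week memory term (`priorWeeksRainMemory`). We also evaluate longer weekly lag windows (`lagWeek1` through `lagWeek8`).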

Model family

Ridge regression with log-transformed rain features. This reduces instability from large storm days and keeps the coefficients from swinging too far on sparse patterns.
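A small pure-Python sketch of the idea, under stated assumptions: `log1p` as the log transform, no intercept (the target is already a residual around the site baseline), the penalty value `alpha=1.0`, and the toy data are all illustrative choices rather than the documented production settings.

```python
import math


def log_rain(inches):
    """Compress large storm totals; log1p keeps zero rain at exactly zero."""
    return math.log1p(inches)


def solve(a, b):
    """Solve a*w = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular, which holds for ridge when alpha > 0.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_ridge(rows, targets, alpha):
    """Closed-form ridge without intercept: (X'X + alpha*I) w = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (alpha if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical rain totals in inches (three windows per sample) and residuals.
rain_inches = [[0.0, 0.2, 0.0], [1.5, 0.0, 0.3], [0.4, 0.9, 0.0], [3.2, 0.1, 0.0]]
residuals = [-0.3, 0.6, 0.2, 0.9]
features = [[log_rain(v) for v in row] for row in rain_inches]
weights = fit_ridge(features, residuals, alpha=1.0)
```

The `alpha` term on the diagonal is what pulls coefficients toward zero when a rain pattern appears in only a few samples.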

Backtest

Historical model performance

These backtests use chronological train/test splits, so the model is always evaluated on samples that come after its training data rather than on shuffled history. Lower RMSE and MAE are better; both are measured on the log scale. Threshold accuracy is the share of test samples the model places on the correct side (below or above) of the 35 MPN swim threshold.
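The evaluation described above can be sketched like this. The 80/20 split fraction, the `date` key, the base-10 log, and the choice to treat a value exactly at the threshold as "below" are assumptions made for the example.

```python
import math

# 35 MPN expressed on the (assumed) log10 scale used for predictions.
SWIM_THRESHOLD_LOG = math.log10(35.0)


def chronological_split(samples, train_fraction=0.8):
    """Sort by ISO date string and cut once, so the test set is strictly later."""
    ordered = sorted(samples, key=lambda s: s["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def rmse(predicted, actual):
    """Root mean squared error on the log scale."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean absolute error on the log scale."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def threshold_accuracy(predicted, actual, threshold=SWIM_THRESHOLD_LOG):
    """Share of samples where prediction and observation fall on the same side."""
    hits = sum((p > threshold) == (a > threshold) for p, a in zip(predicted, actual))
    return hits / len(actual)
```

Splitting by date rather than at random keeps information from later samples out of the training set.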

| Candidate | Features | Test RMSE (log) | Test MAE (log) | Threshold accuracy |
| --- | --- | --- | --- | --- |
| long-memory | recentRain, midRain, lateRain, lagWeek1, lagWeek2, lagWeek3, lagWeek4, lagWeek5, lagWeek6, lagWeek7, lagWeek8, dryWeeks, seasonSin, seasonCos | 0.7253 | 0.5568 | 73.4% |
| 7d-distributed | recentRain, midRain, lateRain | 0.7272 | 0.5574 | 73.4% |
| extended-memory | recentRain, midRain, lateRain, priorWeeksRainMemory, dryWeeks, seasonSin, seasonCos | 0.7281 | 0.5581 | 73.6% |
| 48h | recentRain | 0.7282 | 0.5594 | 73.5% |
| baseline | site baseline only | 0.8009 | 0.6167 | 69.2% |

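The `seasonSin` and `seasonCos` features are not defined on this page. One common way to build such a pair is a cyclical day-of-year encoding, sketched below; the function name, the zero-based day index, and the 365.25-day period are assumptions, not the documented feature definition.

```python
import math
from datetime import date


def season_features(sample_date):
    """Map a date onto a circle so late December and early January sit close together."""
    day_index = sample_date.timetuple().tm_yday - 1  # January 1 -> 0
    angle = 2.0 * math.pi * day_index / 365.25
    return math.sin(angle), math.cos(angle)


season_sin, season_cos = season_features(date(2024, 1, 1))
```

Using both sine and cosine gives each point in the year a unique pair of values, which a single raw day-of-year number would not do smoothly across the year boundary.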
Best numeric fit

long-memory had the lowest held-out RMSE at 0.7253, which is only a modest gain over the simpler pooled variants (0.7272 to 0.7282). All four rain models clearly beat the baseline-only RMSE of 0.8009.

Best threshold result

extended-memory remained the best rain-only threshold model we tested at 73.6%, narrowly ahead of 48h (73.5%) and the other two rain variants (73.4%).

Main modeling lesson

Fully per-site models looked attractive but performed worse overall. Pooled models with site baselines generalize better on future samples.

Decision Metric

What happens if we optimize for the swim threshold directly?

We also tested simpler strategies that predict only the safe / unsafe threshold decision. None of the basic rain-threshold rules outperformed the current regression-style model on the held-out history.

Current production-style regression: 73.6%

Best rain-only threshold performer we tested across the held-out historical set.

48-hour threshold rule: 71.9%

Per-site rain cutoff using only the most recent 48 hours.

Wet / dry bucket rule: 67.9%

Simple bucketed rule based on a rain index rather than a regression.
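For illustration, a per-site 48-hour cutoff rule could be fit along these lines. How the cutoffs were actually chosen is not described on this page; the grid search over candidate cutoffs, the function name, and the example history are assumptions.

```python
def best_cutoff(history, candidates):
    """Pick the 48-hour rain cutoff (inches) with the highest training accuracy.

    history: list of (rain_48h_inches, exceeded_threshold) pairs for one site.
    The rule predicts "unsafe" whenever rain_48h_inches >= cutoff.
    Ties go to the earliest candidate in the list.
    """
    def accuracy(cutoff):
        hits = sum((rain >= cutoff) == exceeded for rain, exceeded in history)
        return hits / len(history)

    return max(candidates, key=accuracy)


# Hypothetical site history, for illustration only.
history = [(0.0, False), (0.1, False), (0.6, True), (1.2, True)]
cutoff = best_cutoff(history, [0.05, 0.25, 0.5, 1.0])
```

A rule like this has one free parameter per site, which makes it easy to explain but leaves it unable to use anything beyond the most recent rain total.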

Rainfall Inputs

Which rainfall source appears more trustworthy?

We compared the stored site rain fields against the assigned NOAA daily station records and did the same for Open-Meteo historical forecast data. The validation window covers March 23, 2021 onward.

| Source | Compared rows | Correlation to NOAA | MAE | RMSE | Within 0.10 in |
| --- | --- | --- | --- | --- | --- |
| Raw stored gauge-derived rain fields | 30,469 | 0.505 | 0.138 in | 0.418 in | 76.7% |
| Open-Meteo historical forecast | 30,469 | 0.321 | 0.167 in | 0.502 in | 75.0% |

Result: the raw stored gauge-derived fields are materially closer to NOAA station observations overall (correlation 0.505 versus 0.321, MAE 0.138 in versus 0.167 in), so the evidence does not support replacing the entire rainfall source with Open-Meteo. Site by site, the raw stored fields were the closer match at 69 sites, versus 37 sites for Open-Meteo. Note that this comparison uses the raw stored fields, while the model metrics above use the repaired version of those fields.
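The four comparison statistics can be computed as in the sketch below. It assumes paired daily totals in inches, Pearson correlation, and an inclusive 0.10 in tolerance; the page does not specify these details, and the function name is hypothetical.

```python
import math


def compare_to_reference(source, reference, tolerance=0.10):
    """Compare one rainfall source to reference station values, row by row.

    Assumes both lists are the same length and that neither is constant
    (otherwise the correlation is undefined).
    """
    n = len(source)
    errors = [s - r for s, r in zip(source, reference)]
    mean_s = sum(source) / n
    mean_r = sum(reference) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(source, reference))
    var_s = sum((s - mean_s) ** 2 for s in source)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return {
        "correlation": cov / math.sqrt(var_s * var_r),
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "within_tolerance": sum(abs(e) <= tolerance for e in errors) / n,
    }
```

Running the same function over each site's rows separately would give the kind of per-site comparison summarized above.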

Outlier Review

Where the rain fields were repaired or still look suspect

The main risk is not that the entire stored rainfall source is bad; it is a small number of bulk-filled or misaligned sample-date clusters. The current model repairs the clearest of these clusters before building rain features, while less certain cases stay on the review list.

September 2, 2021: Repaired in model input

`precipitationDayOf = 0.10 in` and `precipitationPreviousDay = 7.13 in` were swapped before feature generation.

August 26, 2021: Repaired in model input

`precipitationPreviousSat = 4.45 in` was removed from the prior-week rain window.

June 26, 2025: Repaired in model input

`precipitationPreviousMon = 2.54 in` was removed after NOAA and Open-Meteo both showed a near-dry day.

July 27, 2023: Still under review

Raritan sites shared `precipitationPreviousTue = 0.28 in`; this stays on the watchlist until it can be confirmed site by site.

Operational conclusion: keep the stored gauge-derived rainfall as the primary source, apply targeted repairs for confirmed sample-date + field problems, and leave ambiguous clusters out of the override set until they can be checked against station-level evidence.
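One way to express the targeted repairs is an override table keyed by sample date, applied before feature generation. This is a sketch under assumptions: the `sampleDate` key, the rule format, applying a rule to every sample on that date, and representing a removed value as `None` are all illustrative. The precipitation field names and dates come from the cases listed above, and the July 27, 2023 cluster is deliberately left out because it is still under review.

```python
# Confirmed repairs only; ambiguous clusters stay out of the override set.
REPAIRS = {
    "2021-09-02": ("swap", "precipitationDayOf", "precipitationPreviousDay"),
    "2021-08-26": ("drop", "precipitationPreviousSat"),
    "2025-06-26": ("drop", "precipitationPreviousMon"),
}


def repair_sample(sample):
    """Return a repaired copy of one sample; the input dict is not modified."""
    fixed = dict(sample)
    rule = REPAIRS.get(sample["sampleDate"])
    if rule is None:
        return fixed
    if rule[0] == "swap":
        _, first, second = rule
        fixed[first], fixed[second] = sample[second], sample[first]
    else:
        # "drop": mark the value as excluded from the prior-week rain window.
        fixed[rule[1]] = None
    return fixed
```

Keeping the repairs in one explicit table makes each override easy to audit and easy to remove if later evidence contradicts it.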

Next Steps

What would make the prediction system stronger?

The current rain-only model is useful but not definitive. With the clearest rain outliers repaired, the most promising next gains are broader QA automation and richer predictors, not just more rain windows.

Expand rain QA coverage

Turn the manual outlier checks into a recurring station-level validation pass so future imports can be flagged before they affect predictions.
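As one possible building block for such a pass, the sketch below flags cases where several sites share an identical nonzero value for the same field on the same sample date, the pattern seen in the bulk-filled clusters. The `site` and `sampleDate` keys, the function name, and the `min_sites=3` cutoff are assumptions.

```python
from collections import defaultdict


def flag_shared_values(rows, field, min_sites=3):
    """Return (date, value, sites) for nonzero values shared by many sites.

    A flag is a prompt for review, not proof of an error: nearby sites can
    legitimately share a gauge reading.
    """
    clusters = defaultdict(set)
    for row in rows:
        value = row.get(field)
        if value:  # skip missing values and dry days
            clusters[(row["sampleDate"], value)].add(row["site"])
    return sorted(
        (day, value, sorted(sites))
        for (day, value), sites in clusters.items()
        if len(sites) >= min_sites
    )
```

Flagged clusters could then be checked against station-level records before any value is added to the repair list.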

Train directly for the threshold

If the product goal is a safe / unsafe call, the next model should optimize that decision directly rather than optimizing only numeric bacteria error.

Add more physical context

Tide timing, recent prior bacteria readings, and station-level weather context are the likeliest variables to improve the model beyond the current rain-only ceiling.