Prediction Methodology
This page documents the current model, the historical backtests behind it, and the rainfall-source validation work used to decide how repaired local gauge-derived fields should drive the prediction pipeline.
The goal here is not marketing copy. This is a working methods note for a public-facing prediction system built on real historical shoreline bacteria and rainfall data.
Overview
The production estimate starts with each site's own typical bacteria level, then adjusts that baseline using repaired rainfall features learned from the historical record. It is a pooled citywide model rather than a fully custom coefficient set per site, because fully site-specific fits overfit the data.
The outcome being modeled is Enterococcus concentration on the log scale. The site's median historical log level acts as the baseline, and the rain model predicts the residual above or below that baseline.
The model uses rain during the last 48 hours, the middle of the prior week, the late prior week, and weighted prior-week memory. We also evaluate longer weekly lag windows.
The model is a ridge regression on log-transformed rain features. The log transform reduces instability from large storm days, and the ridge penalty keeps the coefficients from swinging too far on sparse patterns.
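The baseline-plus-residual structure described above can be sketched roughly as follows. This is a minimal illustration, not the production code: the tuple layout, the function names (`site_baselines`, `fit_ridge`, `predict_log`), the base-10 logarithm, the `log1p` rain transform, the `alpha` penalty value, and the omission of an intercept term are all assumptions made for the example.

```python
import math
from collections import defaultdict
from statistics import median


def site_baselines(samples):
    """Median log10 concentration per site.

    `samples` is a list of (site, mpn, rain_features) tuples (assumed layout).
    """
    by_site = defaultdict(list)
    for site, mpn, _ in samples:
        by_site[site].append(math.log10(mpn))
    return {site: median(values) for site, values in by_site.items()}


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(samples, baselines, alpha=1.0):
    """Pooled ridge fit: residual (log level minus site baseline) on log1p(rain)."""
    X = [[math.log1p(r) for r in rain] for _, _, rain in samples]
    y = [math.log10(mpn) - baselines[site] for site, mpn, _ in samples]
    p = len(X[0])
    # Normal equations with the ridge penalty on the diagonal: (X'X + alpha*I) w = X'y
    xtx = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def predict_log(site, rain, baselines, weights):
    """Site baseline plus the rain-driven residual, on the log10 scale."""
    return baselines[site] + sum(w * math.log1p(r) for w, r in zip(weights, rain))
```

Because every site shares one weight vector and differs only in its baseline, a site with few samples still borrows the rain response learned from the whole record, which is the pooling trade-off described in the overview.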
Backtest
These backtests use chronological train/test splits, so the model is always evaluated on samples that come after its training data rather than on shuffled history. Lower RMSE and MAE are better; both are measured on the log scale. Threshold accuracy is the share of test samples the model places on the correct side of the 35 MPN swim threshold.
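As a rough sketch of how such a backtest can be scored, the snippet below splits rows by date and computes the three metrics. The row layout (a dict with an ISO `date` string), the 20% holdout fraction, the function names, and the use of base-10 logs for the threshold are assumptions for illustration; the actual split points and log base are not stated here.

```python
import math

# 35 MPN threshold expressed on an assumed log10 scale.
THRESHOLD_LOG = math.log10(35)


def chronological_split(rows, test_fraction=0.2):
    """Sort by date and hold out the most recent rows as the test set."""
    ordered = sorted(rows, key=lambda r: r["date"])
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


def evaluate(actual_log, predicted_log):
    """RMSE and MAE on the log scale, plus same-side-of-threshold accuracy."""
    n = len(actual_log)
    errors = [p - a for a, p in zip(actual_log, predicted_log)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    same_side = sum(
        (a > THRESHOLD_LOG) == (p > THRESHOLD_LOG)
        for a, p in zip(actual_log, predicted_log)
    )
    return {"rmse": rmse, "mae": mae, "threshold_accuracy": same_side / n}
```

Sorting before splitting is the step that prevents leakage: every training row is dated no later than every test row.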
| Candidate | Features | Test RMSE (log) | Test MAE (log) | Threshold accuracy |
|---|---|---|---|---|
| long-memory | recentRain, midRain, lateRain, lagWeek1, lagWeek2, lagWeek3, lagWeek4, lagWeek5, lagWeek6, lagWeek7, lagWeek8, dryWeeks, seasonSin, seasonCos | 0.7253 | 0.5568 | 73.4% |
| 7d-distributed | recentRain, midRain, lateRain | 0.7272 | 0.5574 | 73.4% |
| extended-memory | recentRain, midRain, lateRain, priorWeeksRainMemory, dryWeeks, seasonSin, seasonCos | 0.7281 | 0.5581 | 73.6% |
| 48h | recentRain | 0.7282 | 0.5594 | 73.5% |
| baseline | site baseline only | 0.8009 | 0.6167 | 69.2% |
long-memory had the lowest held-out RMSE at 0.7253, but the gain over the simpler pooled variants (0.7272 to 0.7282) is modest, roughly 0.002 to 0.003 log units.
extended-memory had the best threshold accuracy of the rain-only models we tested, at 73.6%.
Fully per-site models looked attractive but performed worse overall. Pooled models with site baselines generalize better on future samples.
Decision Metric
We also tested simpler strategies that predict only the safe / unsafe threshold decision. None of the basic rain-threshold rules outperformed the current regression-style model on the held-out history.
Best rain-only threshold performer we tested across the held-out historical set.
Per-site rain cutoff using only the most recent 48 hours.
Simple bucketed rule based on a rain index rather than a regression.
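To make the per-site rain cutoff idea concrete, here is one way such a rule could be fit. It is a sketch under assumptions: the input layout, the function names, the choice to pick the cutoff that maximizes training accuracy, and the "rain at or above the cutoff means unsafe" convention are all illustrative and may differ from the rules actually tested.

```python
from collections import defaultdict


def best_cutoff(pairs):
    """Pick the 48-hour rain cutoff (inches) that maximizes training accuracy.

    `pairs` is a list of (rain_48h, is_unsafe). A sample is predicted unsafe
    when rain_48h >= cutoff. An infinite cutoff means "always predict safe".
    """
    candidates = sorted({rain for rain, _ in pairs} | {float("inf")})

    def accuracy(cutoff):
        return sum((rain >= cutoff) == unsafe for rain, unsafe in pairs) / len(pairs)

    return max(candidates, key=accuracy)


def fit_site_cutoffs(rows):
    """Fit one cutoff per site from (site, rain_48h, is_unsafe) rows."""
    by_site = defaultdict(list)
    for site, rain, unsafe in rows:
        by_site[site].append((rain, unsafe))
    return {site: best_cutoff(pairs) for site, pairs in by_site.items()}
```

A rule like this has one free parameter per site, so sites with few unsafe samples can end up with an unstable cutoff, one plausible reason such rules trail the pooled regression on held-out data.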
Rainfall Inputs
We compared the stored site rain fields against the assigned NOAA daily station records and did the same for Open-Meteo historical forecast data. The validation window covers March 23, 2021 onward.
| Source | Compared rows | Correlation to NOAA | MAE | RMSE | Within 0.10 in |
|---|---|---|---|---|---|
| Raw stored gauge-derived rain fields | 30,469 | 0.505 | 0.138 in | 0.418 in | 76.7% |
| Open-Meteo historical forecast | 30,469 | 0.321 | 0.167 in | 0.502 in | 75.0% |
Result: the raw stored gauge-derived fields are materially closer to NOAA station observations overall, so the evidence does not support replacing the entire rainfall source with Open-Meteo. Site by site, the raw stored fields outperformed Open-Meteo at 69 sites, while Open-Meteo did better at 37. The model metrics above use the repaired version of those stored fields.
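The four comparison columns in the table can be computed along these lines. This is a sketch, assuming two aligned lists of daily rainfall in inches (candidate source and NOAA reference); the function name and the inclusive `<=` reading of "within 0.10 in" are assumptions.

```python
import math


def compare_to_reference(source, reference, tolerance=0.10):
    """Pearson correlation, MAE, RMSE, and share of rows within `tolerance` inches.

    Assumes both series are aligned row by row and neither is constant
    (a constant series has zero variance, so correlation is undefined).
    """
    n = len(source)
    diffs = [s - r for s, r in zip(source, reference)]
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    within = sum(abs(d) <= tolerance for d in diffs) / n

    mean_s = sum(source) / n
    mean_r = sum(reference) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(source, reference))
    var_s = sum((s - mean_s) ** 2 for s in source)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    correlation = cov / math.sqrt(var_s * var_r)

    return {"correlation": correlation, "mae": mae, "rmse": rmse, "within": within}
```

Because most days are dry, the "within 0.10 in" share can stay high even when wet-day agreement is poor, which is why correlation and RMSE are reported alongside it.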
Outlier Review
The main risk is not that the entire stored rainfall source is bad, but that specific clusters of sample dates carry bulk-filled or misaligned values. The current model repairs the clearest of these clusters before building rain features, while less certain cases stay on the review list.
`precipitationDayOf = 0.10 in` and `precipitationPreviousDay = 7.13 in` were swapped before feature generation.
`precipitationPreviousSat = 4.45 in` was removed from the prior-week rain window.
`precipitationPreviousMon = 2.54 in` was removed after NOAA and Open-Meteo both showed a near-dry day.
Raritan sites shared `precipitationPreviousTue = 0.28 in`; this stays on the watchlist until it can be confirmed site by site.
Operational conclusion: keep the stored gauge-derived rainfall as the primary source, apply targeted repairs for confirmed sample-date + field problems, and leave ambiguous clusters out of the override set until they can be checked against station-level evidence.
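A targeted override set of this kind could be represented as a lookup keyed by site and sample date, applied before feature generation. The sketch below is hypothetical: the `site` and `sampleDate` keys, the `swap` and `drop` operation names, and the placeholder site ID and date are invented for illustration. Only the `precipitation...` field names and the example values come from the repairs listed above, and which value sat in which field before the swap is assumed.

```python
def apply_repairs(record, overrides):
    """Return a repaired copy of one sample record.

    Records with no entry in `overrides` pass through unchanged, so ambiguous
    clusters are left alone simply by not listing them.
    """
    repaired = dict(record)
    key = (record["site"], record["sampleDate"])
    for action in overrides.get(key, []):
        if action["op"] == "swap":
            a, b = action["fields"]
            repaired[a], repaired[b] = repaired[b], repaired[a]
        elif action["op"] == "drop":
            # None marks the value as excluded from downstream rain windows.
            repaired[action["field"]] = None
    return repaired


# Placeholder site ID and date; real entries would come from the confirmed list.
OVERRIDES = {
    ("EXAMPLE-SITE", "2000-01-01"): [
        {"op": "swap", "fields": ("precipitationDayOf", "precipitationPreviousDay")},
    ],
}

sample = {
    "site": "EXAMPLE-SITE",
    "sampleDate": "2000-01-01",
    "precipitationDayOf": 0.10,
    "precipitationPreviousDay": 7.13,
}
fixed = apply_repairs(sample, OVERRIDES)
```

Keeping the overrides as data rather than ad hoc code means each repair stays auditable and can be removed if later station-level evidence contradicts it.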
Next Steps
The current rain-only model is useful but not definitive. With the clearest rain outliers repaired, the most promising next gains are broader QA automation and richer predictors, not just more rain windows.
Turn the manual outlier checks into a recurring station-level validation pass so future imports can be flagged before they affect predictions.
If the product goal is a safe / unsafe call, the next model should optimize that decision directly rather than optimizing only numeric bacteria error.
Tide timing, recent prior bacteria readings, and station-level weather context are the likeliest variables to improve the model beyond the current rain-only ceiling.