Forecasting Urban Water Escherichia coli Contamination Using Machine Learning Models
The state of Indiana ranks first in the nation for water recreation impairments due to contaminated waterways. According to the U.S. Environmental Protection Agency, 73% of rivers and streams and 23% of lakes and reservoirs are impaired for recreational uses such as swimming, fishing, and boating. Increased urban population density and agricultural activity are among the key contributors to run-off into urban watersheds. The fecal coliform bacterium Escherichia coli (E. coli) is used as an indicator of bacterial pollution in streams. Local governmental water authorities and non-profit organizations routinely collect urban water samples weekly or biweekly to measure water quality parameters, including E. coli counts. These analytical methods are time-consuming and provide only a retrospective view of E. coli loads. Forecasting E. coli contamination in urban waters is therefore necessary to give the public real-time information about the suitability of these waters for bodily contact, recreation, fishing, boating, and domestic use. Another limitation of the current methods is that they do not integrate local climatic conditions such as changes in temperature and precipitation. In this study, E. coli contamination in urban streams was predicted using 20 years of climatic data (temperature, precipitation) and water sample analysis data. E. coli data for three streams were collected from the Marion County (Indiana) watershed project for the period 2003-2022. Daily temperature and precipitation data for Marion County were obtained from the National Oceanic and Atmospheric Administration site. The two data sources were joined on the date field. An initial exploratory data analysis was performed to understand how these parameters correlate with E. coli levels.
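The date-based join and the correlation check described above can be sketched as follows. This is a minimal illustration in Python with pandas, not the study's code: the column names (`ecoli_cfu_100ml`, `tavg_c`, `precip_mm`), the stream label, and all sample values are hypothetical, and the log transform of the counts is an assumption made here because bacterial counts are typically strongly skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly E. coli samples for one stream (values are invented).
ecoli = pd.DataFrame({
    "date": pd.to_datetime(["2022-06-01", "2022-06-08", "2022-06-15", "2022-06-22"]),
    "stream": ["A", "A", "A", "A"],
    "ecoli_cfu_100ml": [120.0, 980.0, 310.0, 75.0],
})

# Hypothetical daily weather series standing in for the NOAA data.
weather = pd.DataFrame({"date": pd.date_range("2022-06-01", "2022-06-22", freq="D")})
weather["tavg_c"] = [18.0 + (i % 5) for i in range(len(weather))]
weather["precip_mm"] = [12.5 if i % 6 == 0 else 0.0 for i in range(len(weather))]

# Join on the shared date field. A left join keeps every water sample, and
# validate= raises an error if the weather table has duplicate dates.
merged = ecoli.merge(weather, on="date", how="left", validate="many_to_one")

# Exploratory step: correlate log-transformed counts with the climate columns.
merged["log10_ecoli"] = np.log10(merged["ecoli_cfu_100ml"])
corr = merged[["log10_ecoli", "tavg_c", "precip_mm"]].corr()
print(corr.round(2))
```

Because sampling is weekly or biweekly while the weather series is daily, the left join keeps only the weather rows that fall on a sampling date; the remaining daily rows are still useful for computing window-based features before the join.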
Next, additional calculated values, such as cumulative degree days and maximum precipitation over 10-day or 15-day windows, were included as inputs to 6 machine learning models (Logistic Regression, Random Forest Classifier, Extra Trees Classifier, Decision Tree Classifier, Gradient Boosting Classifier, and XGB Classifier). Feature importance and overall accuracy scores were compared across these 6 models to identify the best one. The XGB Classifier consistently achieved an ROC value above 85% for each of the 3 individual streams.
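As a rough sketch of how such derived features and a model comparison might be set up with pandas and scikit-learn, consider the example below. Everything specific in it is an assumption rather than the study's configuration: the data are synthetic, the 10 °C base temperature and yearly reset for degree days are illustrative choices, the 235 CFU/100 mL exceedance label is an assumed classification target, the hyperparameters are arbitrary, and ROC AUC on a random stratified split is used as the comparison score. The XGB Classifier is included only if the `xgboost` package is installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic daily weather (invented values, not NOAA data).
days = pd.date_range("2021-01-01", "2022-12-31", freq="D")
doy = days.dayofyear.to_numpy()
df = pd.DataFrame({
    "date": days,
    "tavg_c": 12 - 14 * np.cos(2 * np.pi * doy / 365) + rng.normal(0, 3, len(days)),
    "precip_mm": rng.gamma(0.4, 8.0, len(days)),
})

# Derived features. Base temperature and yearly reset are assumptions.
BASE_TEMP_C = 10.0
degree_days = (df["tavg_c"] - BASE_TEMP_C).clip(lower=0)
df["cum_degree_days"] = degree_days.groupby(df["date"].dt.year).cumsum()
df["max_precip_10d"] = df["precip_mm"].rolling(10, min_periods=1).max()
df["max_precip_15d"] = df["precip_mm"].rolling(15, min_periods=1).max()

# Synthetic binary label standing in for "E. coli above 235 CFU/100 mL".
logit = 0.08 * df["tavg_c"] + 0.06 * df["max_precip_10d"] - 2.5
df["exceeds"] = (rng.random(len(df)) < 1 / (1 + np.exp(-logit))).astype(int)

features = ["tavg_c", "precip_mm", "cum_degree_days",
            "max_precip_10d", "max_precip_15d"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["exceeds"], test_size=0.25, stratify=df["exceeds"],
    random_state=0)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=200, random_state=0),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
try:  # optional dependency
    from xgboost import XGBClassifier
    models["XGB"] = XGBClassifier(n_estimators=200, max_depth=3, random_state=0)
except ImportError:
    pass

# Fit each model and score it by area under the ROC curve on held-out data.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, auc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} ROC AUC = {auc:.3f}")

# Feature importance from one of the tree ensembles.
importances = pd.Series(models["Random Forest"].feature_importances_, index=features)
print(importances.sort_values(ascending=False).round(3))
```

For a genuine forecasting evaluation, a chronological split (training on earlier years and testing on later ones) may be more appropriate than the random split used here, since window-based features on neighboring days are highly correlated.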