Background

The following is a data analysis of nitrate concentration spikes in water after rainfall events at a location in Illinois. Nitrate data comes from the US Geological Survey. Precipitation data comes from Daymet.

Since there are not nitrate monitoring locations in every county, we focused on doing a case study on the location that has the most data. To find this location, we used the following criteria:

  1. Monitors the parameter “99133,” or nitrates in milligrams per Liter

  2. Location is still active today

  3. Had at least one nitrate spike above the federal threshold of 10mg/L this year

  4. Its data goes as far back as possible

Part 1

First, we loaded the R libraries we will use, including dataRetrieval. This is a package that allows us to request USGS data from R Studio.

library(dataRetrieval)
library(tidyverse)
library(lubridate)
library(dygraphs)
library(sp)
library(rgeos)
library(xts)
library(data.table)
library(DT)

Then, we will run the following code chunk. It uses a dataRetrieval. function to request all the USGS monitoring locations in Illinois and then filters them by keeping the ones that are still active and that have nitrate data. This will give us the site number, coordinates and how long each location has been active.

today <- Sys.Date()

IL_site <- readNWISdata(stateCd= "IL", parameterCd="99133",
                        service="site", seriesCatalogOutput=TRUE) %>% 
  filter(site_tp_cd == "ST") %>%
  filter(end_date == "2022-02-03") %>% # today
  filter(parm_cd == "99133") %>%
  distinct(site_no, .keep_all = TRUE) %>%
  select(site_no, station_nm, lat = dec_lat_va,
         long = dec_long_va, begin_date)

Then we made this function that does the following:

  1. Pulls all the site numbers for the state and requests nitrate data from this year.

  2. It aggregates the data to find the maximum nitrate reading, yearlyPeak.

  3. It creates a new column that detects if the maximum nitrate level in each location was above 10 mg/L or not.

  4. It removes sites that did not have a nitrate peak higher than the federal threshold this year.

  5. It arranges the remainder locations from oldest to newest and selects the top one.

This means the function will chose one location in the state that has data that goes as far back as possible and that had a nitrate spike above federal guidelines.

state_function <- function(state){
  
  site <- state
  
  # nitrate levels from this year
  iv <- readNWISdata(siteNumbers = site$site_no, parameterCd = "99133",
                     startDate = "2021-01-01", endDate = "2021-08-18",
                     service = "iv")
  
  # now aggregate by yearly max 
  iv$year <- year(iv$dateTime)
  
  yearPeak <- iv %>% 
    group_by(site_no, year) %>%
    summarise(yearlyPeak = max(X_99133_00000, na.rm = T)) %>%
    filter(yearlyPeak <= 40) %>% # wonky filter 
    select(-year)
  
  # join these to "site", sort by date, pick oldest one above 10mg/L  
  site <- site %>% 
    right_join(yearPeak) %>%
    arrange(begin_date) %>%
    mutate(illegal = case_when(
      yearlyPeak >= 10 ~ "Yes",
      yearlyPeak < 10 ~ "No")) %>%
    filter(illegal == "Yes") %>%
    select(-illegal) %>%
    head(1) # this is our top location at this state
  
  return(site)
}

# The following site has the data we are looking for:
IL_site <- state_function(IL_site) # 03336890 (IL)

This is the location we will analyze in this case study:

Located in Champaign County, IL:

Part 2

Once we have the coordinates from the nitrate gauge, we can use Streamstats. This USGS application lets you plug the coordinates from the gauge:

And it gives you a shape file from the watershed. Every drop of water that falls within that perimeter will eventually drain to the nitrate gauge:

Initially, we analyzed this location by getting precipitation data from the closest NOAA weather station, which fell outside the basin. Although this can provide a representation of the weather in the watershed because of its proximity, we can increase the accuracy of our results by using the watershed shape file to get its spatial data.

The spatial data source we used, Daymet, has a single pixel selection tool. After plugin coordinates for a point, it returns precipitation data for a pixel of 1km.

We used ArcGIS to overlay a 1km grid on the gauge’s drainage basin:

And then creating pins at the center of each square intersecting the watershed:

Part 3

We then imported the coordinates from all the pins to extract the precipitation data in the watershed.

coordinates <- read_csv("Data/ArcGIS/coordinates.csv") #143 obs. of 4 var

coordinates <- coordinates %>% select(lat = Lat,
                                      long = Long)

To download Daymet data from the single pixel extraction tool, a link is generated with different parameters like the time range and the coordinate. So, to automate the request, we generated the links for all the coordinates here:

coordinates$link <- paste("https://daymet.ornl.gov/single-pixel/api/data?lat=",
                          coordinates$lat, 
                          "&lon=",
                          coordinates$long,
                          "&vars=prcp&start=2014-01-01&end=2020-06-30", # set dates here
                          sep = "")

These were then used in this loop to extract the data:

for (i in 1:143) {
  
  name <- paste("a", i, sep = "")
  
  assign(name, fread(coordinates$link[i]))
  
}

After all the data was compiled, we aggregated the precipitation values by getting a daily average across basin.

watershed_average <- watershed_average %>% 
  group_by(year, yday) %>%
  summarise(mean_precip_mm = round(mean(`prcp (mm/day)`), 2)) 

# convert to inches 
watershed_average$mean_precip_in <- round(watershed_average$mean_precip_mm * 0.03937007874, 2)

Then we cleaned up the date format:

watershed_average$origin <- paste(watershed_average$year, "-01-01", sep = "")
watershed_average$date <- as.Date(watershed_average$yday, origin = watershed_average$origin)

watershed_average <- select(watershed_average, -origin, -yday)
rm(i, name)

Once we had the precipitation data ready, we did a request for nitrate concentration in the gauge over the same period of time. Since this data is continuous (measured every 15 minutes), we aggregated the data to daily maximum nitrate levels to compare it to rainfall on the same granular level.

# request USGS from 2014-04-01 until 2020-06-29
usgs_IL <- readNWISdata(siteNumbers = "03336890", parameterCd = "99133",
                        startDate = "2014-04-01", endDate = "2020-06-29",
                        service = "iv")

# clean columns
usgs_IL <- usgs_IL %>% select(site_no,
                              datetime = dateTime,
                              nitrate_lv = X_99133_00000)

usgs_IL$date <- as.Date(usgs_IL$datetime) # 602,152 obs of 4 var

# aggregate to peak nitrate levels per day 
usgs_IL_peak <- usgs_IL %>% group_by(date) %>%
  summarise(daily_peak = max(nitrate_lv, na.rm = T)) # 1826 obs of 2 var

rm(usgs_IL)

With both variables ready, we converted the tables to XTS format to create an interactive visualization.

# first, simplify watershed_average into champaign_IL
champaign_IL <- as.data.frame(watershed_average)
champaign_IL <- select(champaign_IL, date, total_precip = mean_precip_in)

nitrates_IL <- xts(x = usgs_IL_peak$daily_peak, order.by = usgs_IL_peak$date)
precipitation_IL <- xts(x = champaign_IL$total_precip, order.by = champaign_IL$date)
IL_xts <- merge(nitrates_IL, precipitation_IL, join = 'left')
rm(nitrates_IL, precipitation_IL, usgs_IL_peak, champaign_IL)

# plot
IL_2 <- dygraph(IL_xts, main = "Nitrates and precipitation in IL location, 2014-2020") %>%
  dyAxis("y", label = "Daily Peak Nitrates (mg/L)") %>%
  dyAxis("y2", label = "Total Daily Precipitation (in)", independentTicks = TRUE) %>%
  dySeries("precipitation_IL", axis = 'y2') %>%
  dyRangeSelector(height = 20) %>%
  dyHighlight(highlightCircleSize = 5) %>%
  dyShading(from = 10, axis = "y",to = 50, color = "#FFE6E6") %>%
  dyOptions(colors = c("#a97de3", "#6fc978"), drawGrid = F) %>%
  dyLegend(labelsSeparateLines = T)
IL_2

Stats

In order to quantify the nitrate spikes and how those coincide with precipitation, we had to define the events through code. But first, we downloaded USGS nitrate concentration data for our site in IL. This time, we made a function to filter it by season, only keeping the spring months, defined as April through June.

We decided to analyze the spring data since that is when most farmers apply fertilizer and when climate assessments, such as the IL climate assessment, predict precipitation could keep increasing.

After filtering the spring data, this function aggregates that data into daily nitrate peaks, as earlier in the analysis.

# request IL data for April-June 2014-2020
# IL: 03336890
spring_IL <- readNWISdata(siteNumber = "03336890", parameterCd = "99133",
                          startDate = "2014-04-01", endDate = "2020-06-29",
                          service = "iv") 

spring_function <- function(spring_state){
  
  # filter by month 
  spring_state$month <- month(spring_state$dateTime)
  
  spring_state <- spring_state %>% filter(month %in% c(4:6))
  
  # clean columns
  spring_state <- spring_state %>% select(site_no,
                                          datetime = dateTime,
                                          nitrate_lv = X_99133_00000)
  
  spring_state$date <- as.Date(spring_state$datetime)
  
  # aggregate by daily peak
  spring_state <- spring_state %>% group_by(date) %>%
    summarise(daily_peak = max(nitrate_lv, na.rm = T)) 
  
  return(spring_state)
  
}

# apply function to IL
spring_IL <- spring_function(spring_IL) # 606 obs

This time we also incorporated nutrient load data. While concentration might be really important on a local health case aspect, given the federal limit on drinking water, the amount of nutrient loading is also relevant on a larger scale, such as its impact in the Gulf of Mexico’s hypoxia zone.

In Illinois, USGS stores load data as the parameter “91049.” This data has pounds of nitrate per day measured every 15 minutes. We aggregated these values as the average daily load to use it on the same granular level as our other variables.

Note: load data is not available throughout our entire date range. Some spike events will not include this information.

load_IL <- readNWISdata(siteNumbers = "03336890", parameterCd = "91049",
                        startDate = "2014-04-01", endDate = "2020-06-29",
                        service = "iv")

# make function to clean and aggregate to average load per day 

load_function <- function(load_state){
  
  # select and rename columns
  load_state <- load_state %>% select(dateTime, load = X_91049_00000)
  
  # aggregate to average load per day
  load_state$date <- date(load_state$dateTime)
  
  load_state <- load_state %>% 
    group_by(date) %>%
    summarise(average_daily_load = round(mean(load)))
  
  # filter to same spring months 
  load_state$month <- month(load_state$date)
  
  load_state <- load_state %>% filter(month %in% c(4:6)) %>% select(-month)
  
  return(load_state)
  
}

# run function in IL

load_IL <- load_function(load_IL) # 411 obs

Then, we filtered this data as well as the precipitation data to the spring seasons from 2014 through 2020.

# filter USGS load data to spring months 
load_IL$month <- month(load_IL$date)
load_IL <- load_IL %>% filter(month %in% c(4:6)) %>% select(-month)

# filter NASA Daymet data to same time range as USGS, by month 
spring_champaign_IL <- as.data.frame(watershed_average)
spring_champaign_IL <- select(spring_champaign_IL, date, total_precip = mean_precip_in)

spring_champaign_IL$month <- month(spring_champaign_IL$date)

spring_champaign_IL <- spring_champaign_IL %>%
  filter(month %in% c(4:6)) %>%
  select(date, total_precip)

# combine them
spikes_IL <- left_join(spring_IL, spring_champaign_IL) %>%
  left_join(load_IL) # 606 obs

rm(spring_IL, spring_champaign_IL, load_IL, coordinates, IL_xts)

Once spring data sets (nitrate concentration, loading and precipitation) are merged, we can set parameters to define a nitrate spike event.

This event will be true when a day has a nitrate level above 10mg/L but that the day before the levels were below 10mg/L. We will start by using a lag function to have the previous days values in a new column and then detect a nitrate spike based on the criteria we defined.

spikes_IL <- spikes_IL %>% 
  mutate(nitrates_24b = lag(daily_peak, 1)) %>% 
  mutate(nitrate_spike = case_when(
    (daily_peak >= 10 & nitrates_24b < 10) ~ TRUE
  ))

Now we want to see how much it rained the day of the spike plus the day before, in case a rain event went overnight. We are using a lag function to aggregate those two days of rain into “pre_spike_rain.”

spikes_IL <- spikes_IL %>% 
  mutate(precipitation_24b = lag(total_precip, 1)) %>%
  mutate(pre_spike_rain = total_precip + precipitation_24b)

Then, we did a similar aggregation for the nitrate load. However, instead of adding the load from the day before when it started raining, we added the load from the day after the nitrate spike. We did this because in this watershed, there seems to be a bigger lag from when the rain starts to when the gauge records an increased load.

Several factors can affect this, such as the size of the watershed and the percentage of crop land in the watershed.

spikes_IL <- spikes_IL %>%
  mutate(load_24a = lead(average_daily_load, 1)) %>%
  mutate(post_spike_load = average_daily_load + load_24a)

In the springs 2014-2020, we found 42 spike events.

# filter for TRUE spikes 
spikes_IL <- spikes_IL %>% filter(nitrate_spike == T) %>%
  select(date, nitrate_spike, daily_peak, pre_spike_rain, post_spike_load) # 42 obs

# and now filter for any spikes that had any amount of rain (just not 0)
spikes_IL <- spikes_IL %>% mutate(rained = case_when(
  pre_spike_rain != 0 ~ "Yes",
  pre_spike_rain == 0 ~ "No"
))

spikes_IL$rained <- as.factor(spikes_IL$rained)

rainy_spikes_IL <- spikes_IL %>% filter(rained == "Yes") # 36 obs

We also found that 36 out of those 42 spikes, or around 86%, had any measure of rain in that day or the day before.

Since we will compare this data to other watersheds that might have different precipitation levels, we defined heavy rain events as the top 10% of events by precipitation amount. Now, we can split those 36 events into two categories: heavy or non-heavy.

rainy_spikes_IL <- rainy_spikes_IL %>% arrange(desc(rainy_spikes_IL$pre_spike_rain))

heavy_IL <- rainy_spikes_IL %>% head(round(nrow(rainy_spikes_IL)*0.10)) # 4 obs

non_heavy_IL <- rainy_spikes_IL %>% tail(round(nrow(rainy_spikes_IL)*0.90)) # 32 obs

You can explore these in more detail on the tables below.

Spikes with heavy rain:

Spikes with non-heavy rain:

On average, spikes with heavy rain led to a higher nitrate concentration spike and higher nitrate loading than those with lighter rain.

Here are some stats about both type of events:

summary(heavy_IL)
##       date            nitrate_spike    daily_peak    pre_spike_rain 
##  Min.   :2014-06-25   Mode:logical   Min.   :10.20   Min.   :2.030  
##  1st Qu.:2015-03-07   TRUE:4         1st Qu.:11.18   1st Qu.:2.518  
##  Median :2016-12-04                  Median :14.20   Median :2.795  
##  Mean   :2017-03-10                  Mean   :14.60   Mean   :2.667  
##  3rd Qu.:2018-12-09                  3rd Qu.:17.62   3rd Qu.:2.945  
##  Max.   :2020-06-04                  Max.   :19.80   Max.   :3.050  
##                                                                     
##  post_spike_load rained 
##  Min.   :22169   No :0  
##  1st Qu.:23718   Yes:4  
##  Median :25266          
##  Mean   :38350          
##  3rd Qu.:46440          
##  Max.   :67614          
##  NA's   :1
summary(non_heavy_IL)
##       date            nitrate_spike    daily_peak    pre_spike_rain  
##  Min.   :2014-05-14   Mode:logical   Min.   :10.00   Min.   :0.0100  
##  1st Qu.:2015-06-07   TRUE:32        1st Qu.:10.38   1st Qu.:0.4900  
##  Median :2017-04-17                  Median :10.60   Median :0.8850  
##  Mean   :2017-04-19                  Mean   :11.58   Mean   :0.8784  
##  3rd Qu.:2018-09-20                  3rd Qu.:12.50   3rd Qu.:1.1200  
##  Max.   :2020-06-29                  Max.   :16.90   Max.   :1.7800  
##                                                                      
##  post_spike_load  rained  
##  Min.   :  1419   No : 0  
##  1st Qu.:  4044   Yes:32  
##  Median :  6349           
##  Mean   : 12234           
##  3rd Qu.: 10262           
##  Max.   :110360           
##  NA's   :13

Those 4 heavy rain spikes also contributed nearly a third of the nutrient that was washed off from all the 36 spike events with rain.

percent_IL <- paste((round((sum(heavy_IL$post_spike_load, na.rm = T) / (sum(non_heavy_IL$post_spike_load, na.rm = T)+sum(heavy_IL$post_spike_load, na.rm = T)))*100,1)
), "%", sep = "")

percent_IL
## [1] "33.1%"