driftR implements tidy evaluation, an opinionated approach to capturing and processing expressions using the tidyverse packages dplyr and rlang. Both packages are installed automatically alongside driftR if you do not already have them. This article is meant to be a quick introduction to tidy evaluation for users who have not encountered it before.

The tidyverse

The tidyverse is a group of packages that provide a shared grammar and philosophy for conducting many of the most common tasks in data analysis: cleaning and wrangling data, and visualizing data by creating plots and figures. The tidyverse is made up of a number of packages, including:

  • ggplot2 - a set of tools for implementing the Grammar of Graphics
  • dplyr - a set of common verbs for cleaning and wrangling data
  • tidyr - functions for consistently tidying data so that “(1) each variable is in a column, (2) each observation is in a row, and (3) each value is a cell”
  • tibble - a re-imagining of R’s data frame
  • rlang - a set of programming tools for writing dplyr-like functions

As noted above, driftR is built on dplyr and rlang in both literal and philosophical terms. Like dplyr, driftR is built around the idea of verbs for wrangling data. While dplyr’s verbs are meant to be general, driftR’s verbs are specifically tailored for correcting water quality monitoring data.

To implement this philosophical, verb-driven approach to data cleaning, driftR provides a set of wrappers around two dplyr functions: mutate() for adding new variables and slice() for removing rows. When you call a driftR function, you are actually calling one or more linked instances of mutate() or slice() that perform specific tasks with particular inputs in a standardized order.
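The wrapper idea can be illustrated with a minimal sketch. The two verbs below, add_position() and drop_ends(), are hypothetical names invented for this example and are not driftR functions; they only show how a task-specific verb might be layered on top of mutate() or slice(), assuming dplyr is installed.

```r
library(dplyr)

# small stand-in data frame so the sketch runs on its own
readings <- data.frame(SpCond = c(0.754, 0.750, 0.750, 0.749, 0.749, 0.749))

# hypothetical verb built on mutate(): add a column that runs from 0 to 1
# according to each row's position in the data frame
add_position <- function(.data) {
  mutate(.data, position = (row_number() - 1) / (n() - 1))
}

# hypothetical verb built on slice(): drop rows from the start and the end
drop_ends <- function(.data, head = 0, tail = 0) {
  rows <- nrow(.data)
  stopifnot(head + tail < rows)  # refuse to drop every row
  slice(.data, (head + 1):(rows - tail))
}

# each verb takes a data frame and returns a data frame
withPosition <- add_position(readings)
cleaned <- drop_ends(withPosition, head = 1, tail = 1)
```

Because each verb accepts a data frame and returns one, the output of one call can be handed straight to the next, which is what makes a fixed, repeatable order of steps practical.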

Like the tidyverse, driftR is opinionated. Its verbs are implemented in specific ways meant to ensure that the cleaning process for water quality monitoring data is consistent and highly reproducible. While driftR does enforce a specific workflow on users, the benefit is the package’s consistency and speed (relative to other, non-reproducible methods like cleaning water quality data in a spreadsheet application).

Some Sample Data

The following examples use the data listed below:

driftData <- data.frame(
  Date = c("9/18/2015", "9/18/2015", "9/18/2015", "9/18/2015", "9/18/2015", "9/18/2015"),
  Time = c("12:10:49", "12:15:50", "12:20:51", "12:25:51", "12:30:51", "12:35:51"),
  Temp = c(14.76, 14.64, 14.57, 14.51, 14.50, 14.63),
  SpCond = c(0.754, 0.750, 0.750, 0.749, 0.749, 0.749),
  stringsAsFactors = FALSE
)

Quoting

One advantage of using a tidy evaluation approach for packages is that it offers a more flexible way to accept inputs from end users.

A Quick Example

Many base R functions rely on the $ operator to refer to a variable within a data frame. The mean() function from the base package provides an example of this behavior:

> mean(driftData$Temp)
[1] 14.60167

Packages that are built on top of tidyverse functions often separate the data frame parameter from the variable parameter in their structure. For instance, we could write a “wrapper” around the mean() function that accepts the data frame and variable parameters separately:

tidyMean <- function(data, variable){
  mean(data[[variable]])
}

This has consequences for how users interact with the function, however. If users do not quote the variable parameter, they will receive an error that the object with the name of the input variable cannot be found:

> tidyMean(driftData, Temp)
  Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x,  :
   object 'Temp' not found

If end users do quote the input parameter, the function will evaluate successfully:

> tidyMean(driftData, "Temp")
[1] 14.60167

This, of course, increases the cognitive burden on new users of R who are still learning how it works. There are other common examples of where this occurs in the R ecosystem. For example, the install.packages() function requires that package names be quoted, while the library() function accepts either quoted or unquoted names.
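The library() behavior can be checked with a package that ships with every R installation. The short sketch below uses stats purely as a convenient stand-in; the variable name pkg is arbitrary.

```r
# library() accepts the package name with or without quotes
library(stats)
library("stats")

# when the name is stored in a variable, library() must be told to treat
# the argument as a character string rather than as a literal package name
pkg <- "stats"
library(pkg, character.only = TRUE)
```

Without character.only = TRUE, the last call would look for a package literally named pkg, which is exactly the kind of inconsistency that can confuse new users.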

Tidy Evaluation in driftR

driftR makes use of tidy evaluation to work around this issue. All driftR functions will accept variable names as parameters that are either quoted or unquoted.

library("driftR")

> dr_factor(driftData, corrFactor = corrFac,
            dateVar = Date, timeVar = Time)
#        Date     Time  Temp SpCond   corrFac
# 1 9/18/2015 12:10:49 14.76  0.754 0.0000000
# 2 9/18/2015 12:15:50 14.64  0.750 0.2003995
# 3 9/18/2015 12:20:51 14.57  0.750 0.4007989
# 4 9/18/2015 12:25:51 14.51  0.749 0.6005326
# 5 9/18/2015 12:30:51 14.50  0.749 0.8002663
# 6 9/18/2015 12:35:51 14.63  0.749 1.0000000

> dr_factor(driftData, corrFactor = "corrFac",
            dateVar = "Date", timeVar = "Time")
#        Date     Time  Temp SpCond   corrFac
# 1 9/18/2015 12:10:49 14.76  0.754 0.0000000
# 2 9/18/2015 12:15:50 14.64  0.750 0.2003995
# 3 9/18/2015 12:20:51 14.57  0.750 0.4007989
# 4 9/18/2015 12:25:51 14.51  0.749 0.6005326
# 5 9/18/2015 12:30:51 14.50  0.749 0.8002663
# 6 9/18/2015 12:35:51 14.63  0.749 1.0000000

This mirrors the behavior of tidyverse functions.
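To give a sense of how a function can accept both forms, here is a minimal sketch written with rlang. The function flexMean() is a hypothetical name used only for illustration, and this is one possible approach under stated assumptions, not necessarily how driftR implements it internally.

```r
library(rlang)

# small stand-in data frame so the sketch runs on its own
readings <- data.frame(Temp = c(14.76, 14.64, 14.57, 14.51, 14.50, 14.63))

# hypothetical function that accepts a column as Temp or as "Temp"
flexMean <- function(data, variable) {
  # capture what the caller typed for `variable` without evaluating it
  varExpr <- quo_get_expr(enquo(variable))

  # a quoted name arrives as a string, so convert it to a symbol;
  # an unquoted name is already a symbol
  if (is.character(varExpr)) {
    varExpr <- sym(varExpr)
  }

  # evaluate the symbol with the data frame's columns in scope
  mean(eval_tidy(varExpr, data = data))
}

unquoted <- flexMean(readings, Temp)
quoted <- flexMean(readings, "Temp")
identical(unquoted, quoted)
```

The key step is delaying evaluation: enquo() records the argument as written, so the function can decide how to interpret it before looking anything up.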

Piping

The pipe operator (%>%) is a product of the tidyverse package magrittr. It offers users the ability to make code more readable by chaining a series of related functions together. Pipes are ideal for situations where there is only one input and one output, and where the number of steps being performed is modest. Wickham and Grolemund (http://r4ds.had.co.nz) recommend keeping pipes to roughly ten steps or fewer.

“Reading” Pipes

Pipes are designed to be “read” like lines of literal text. For example:

library("driftR")

driftData <- driftData %>%
  dr_factor(corrFactor = corrFac,
            dateVar = Date,
            timeVar = Time,
            keepDateTime = TRUE) %>%
  dr_correctOne(sourceVar = SpCond,
                cleanVar = SpCond_Corr,
                calVal = 1.07,
                calStd = 1,
                factorVar = corrFac)

One could read the pipe above in the following way:

  • Assigning (<-) to the object driftData the result of taking the existing driftData object, then
  • creating a factor variable named corrFac, then
  • making a one-point drift correction to SpCond that will be stored in a new variable named SpCond_Corr.

Each “then” in the list above corresponds to a pipe operator. We place the pipe operator at the end of each function call that we wish to chain to the next one.
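The same “then” reading applies to any functions, not just driftR’s. As a small illustration using only base R functions and magrittr, the nested call and the piped call below perform the same two steps; the object names are arbitrary.

```r
library(magrittr)

temps <- c(14.76, 14.64, 14.57, 14.51, 14.50, 14.63)

# nested form: read from the inside out
nested <- round(mean(temps), 2)

# piped form: take temps, then compute the mean, then round to two places
piped <- temps %>%
  mean() %>%
  round(2)

identical(nested, piped)
```

The piped form lists the steps in the order they happen, which is what makes longer chains easier to follow.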

.data

One thing you will notice is that we do not make reference to the driftData object in either dr_factor() or dr_correctOne() when they are included in a pipe. Compare that to the code block in the Tidy Evaluation in driftR section above, which includes an explicit reference to driftData in the dr_factor() call. Using the pipe operator simplifies our code in part by reducing the number of characters needed in each function call. Specifically, the pipe operator allows us to drop our references to particular data frames. This is facilitated by the pronoun .data. Each of the driftR functions incorporates .data, which allows the data frame to be passed easily from one function to the next in a pipe.
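A short sketch may make this concrete. The function count_readings() is hypothetical and exists only for this example; its first argument is named .data to mirror the convention described above, and the pipe supplies that argument from its left-hand side.

```r
library(magrittr)

# small stand-in data frame so the sketch runs on its own
readings <- data.frame(Temp = c(14.76, 14.64, 14.57))

# hypothetical function whose first argument is named .data
count_readings <- function(.data, above) {
  sum(.data$Temp > above)
}

# explicit form: the data frame is named in the call
explicit <- count_readings(.data = readings, above = 14.6)

# piped form: the data frame on the left fills the first argument
piped <- readings %>% count_readings(above = 14.6)

identical(explicit, piped)
```

In the piped form only the remaining arguments need to be written out, which is why the driftR calls inside a pipe never mention the data frame by name.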

A Full Session with %>%

Below is an example of what a full correction of a data set should look like if you are piping driftR functions together:

# load needed packages
library(driftR)
library(dplyr)
library(readr)

# import data exported from a Sonde
# example file located in the package
waterTibble <- dr_read(file = system.file("extdata", "rawData.csv", package="driftR"),
                       instruments = "Sonde", defineVar = TRUE, cleanVar = TRUE, case = "snake")

# calculate correction factors and correct variables
waterTibble <- waterTibble %>%
  dr_factor(corrFactor = corrFac, dateVar = Date,
            timeVar = Time, keepDateTime = TRUE) %>%
  dr_correctOne(sourceVar = SpCond, cleanVar = SpCond_Corr,
                calVal = 1.07, calStd = 1, factorVar = corrFac) %>%
  rename(Turbidity = `Turbidity.`) %>%
  dr_correctOne(sourceVar = Turbidity, cleanVar = Turbidity_Corr,
                calVal = 1.3, calStd = 0, factorVar = corrFac) %>%
  dr_correctOne(sourceVar = DO, cleanVar = DO_Corr, calVal = 97.6,
                calStd = 99, factorVar = corrFac) %>%
  dr_correctTwo(sourceVar = pH, cleanVar = pH_Corr,
                calValLow = 7.01, calStdLow = 7, calValHigh = 11.8,
                calStdHigh =  10, factorVar = corrFac) %>%
  dr_correctTwo(sourceVar = Chloride, cleanVar = Chloride_Corr,
                calValLow = 11.6, calStdLow = 10, calValHigh = 1411,
                calStdHigh =  1000, factorVar = corrFac) %>%
  dr_drop(head=6, tail=6) %>%
  dr_replace(sourceVar = Turbidity, overwrite = TRUE, exp = Turbidity < 0) %>%
  select(Date, Time, SpCond, SpCond_Corr, pH, pH_Corr, pHmV,
         Chloride, Chloride_Corr, AmmoniumN, NitrateN,
         Turbidity, Turbidity_Corr, DO, DO_Corr, corrFac)

# export cleaned data
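# the output file is passed by position because the name of this argument
# differs between readr versions (path in older releases, file in newer ones)
write_csv(waterTibble, "waterData.csv", na = "NA")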