vignettes/TidyEval.Rmd
driftR implements tidy evaluation, an opinionated approach to capturing and processing expressions using the tidyverse packages dplyr and rlang. These packages are automatically installed when you install driftR if they are not already installed. This article is meant to be a quick introduction to tidy evaluation for users who have not encountered it before.
The tidyverse
The tidyverse is a group of packages that provide a shared grammar and philosophy for conducting many of the most common tasks in data analysis: cleaning and wrangling data, and visualizing data by creating plots and figures. The tidyverse is made up of a number of packages, including:
- ggplot2: a set of tools for implementing the Grammar of Graphics
- dplyr: a set of common verbs for cleaning and wrangling data
- tidyr: functions for consistently tidying data so that “(1) each variable is in a column, (2) each observation is in a row, and (3) each value is a cell”
- tibble: a re-imagining of R’s data frame
- rlang: a set of programming tools for writing dplyr-like functions
As noted above, driftR is built on dplyr and rlang in both literal and philosophical terms. Like dplyr, driftR is built around the idea of verbs for wrangling data. While dplyr’s verbs are meant to be general, driftR’s verbs are specifically tailored for correcting water quality monitoring data.
To implement this verb-driven approach to data cleaning, driftR provides a set of wrappers around two dplyr functions: mutate() for adding new variables and slice() for removing rows. When you call a driftR function, you are actually calling one or more linked instances of mutate() or slice() that perform specific tasks with particular inputs in a standardized order.
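The wrapper idea can be illustrated with a minimal sketch. The two functions below, add_offset() and drop_ends(), are hypothetical and are not part of driftR; they only show how a single-purpose verb might be layered over mutate() or slice(), assuming dplyr and rlang are installed.

```r
library(dplyr)
library(rlang)

# hypothetical verb built on mutate(): adds a new column equal to an
# existing column plus a fixed offset (not a driftR function)
add_offset <- function(.data, sourceVar, cleanVar, offset) {
  sourceQ <- enquo(sourceVar)              # capture the unquoted source column
  cleanName <- as_name(enquo(cleanVar))    # name of the new column as a string
  mutate(.data, !!cleanName := (!!sourceQ) + offset)
}

# hypothetical verb built on slice(): removes rows from both ends
drop_ends <- function(.data, nHead = 0, nTail = 0) {
  slice(.data, (nHead + 1):(n() - nTail))
}

readings <- data.frame(Temp = c(14.76, 14.64, 14.57, 14.51, 14.50, 14.63))

adjusted <- add_offset(readings, sourceVar = Temp, cleanVar = Temp_Adj, offset = 0.5)
trimmed <- drop_ends(adjusted, nHead = 1, nTail = 2)
```

Each verb does one job with a fixed set of inputs, which is what lets a package enforce a consistent order of operations.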
Like the tidyverse, driftR is opinionated. Its verbs are implemented in specific ways meant to ensure that the cleaning process for water quality monitoring data is consistent and highly reproducible. While driftR does enforce a specific workflow on users, the benefit is the package’s consistency and speed (relative to other, non-reproducible methods like cleaning water quality data in a spreadsheet application).
The following examples use the data listed below:
driftData <- data.frame(
  Date = c("9/18/2015", "9/18/2015", "9/18/2015", "9/18/2015", "9/18/2015", "9/18/2015"),
  Time = c("12:10:49", "12:15:50", "12:20:51", "12:25:51", "12:30:51", "12:35:51"),
  Temp = c(14.76, 14.64, 14.57, 14.51, 14.50, 14.63),
  SpCond = c(0.754, 0.750, 0.750, 0.749, 0.749, 0.749),
  stringsAsFactors = FALSE
)
One advantage of using a tidy evaluation approach for packages is that it offers a more flexible way to accept inputs from end users.
Many base R functions operate on vectors, so users rely on the $ symbol to pick a variable out of a data frame. The mean() function from the base package provides an example of this behavior:
> mean(driftData$Temp)
[1] 14.60167
Packages that are built on top of tidyverse functions often separate the data frame parameter from the variable parameter in their structure. For instance, we could write a “wrapper” around the mean() function that accepts the data frame and variable parameters separately:
tidyMean <- function(data, variable){
  mean(data[[variable]])
}
This has consequences for how users interact with the function, however. If users do not quote the variable parameter, they will receive an error that the object with the name of the input variable cannot be found:
> tidyMean(driftData, Temp)
Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, :
object 'Temp' not found
If end users do quote the input parameter, the function will evaluate successfully:
> tidyMean(driftData, "Temp")
[1] 14.60167
This, of course, increases the cognitive burden on new users of R who are still learning how it works. There are other common examples of this in the R ecosystem. For example, the install.packages() function requires that package names be quoted, while the library() function accepts either quoted or unquoted names.
Tidy Evaluation in driftR
driftR makes use of tidy evaluation to work around this issue. All driftR functions accept variable names as parameters either quoted or unquoted.
library("driftR")
> dr_factor(driftData, corrFactor = corrFac,
            dateVar = Date, timeVar = Time)
# Date Time Temp SpCond corrFac
# 1 9/18/2015 12:10:49 14.76 0.754 0.0000000
# 2 9/18/2015 12:15:50 14.64 0.750 0.2003995
# 3 9/18/2015 12:20:51 14.57 0.750 0.4007989
# 4 9/18/2015 12:25:51 14.51 0.749 0.6005326
# 5 9/18/2015 12:30:51 14.50 0.749 0.8002663
# 6 9/18/2015 12:35:51 14.63 0.749 1.0000000
> dr_factor(driftData, corrFactor = "corrFac",
            dateVar = "Date", timeVar = "Time")
# Date Time Temp SpCond corrFac
# 1 9/18/2015 12:10:49 14.76 0.754 0.0000000
# 2 9/18/2015 12:15:50 14.64 0.750 0.2003995
# 3 9/18/2015 12:20:51 14.57 0.750 0.4007989
# 4 9/18/2015 12:25:51 14.51 0.749 0.6005326
# 5 9/18/2015 12:30:51 14.50 0.749 0.8002663
# 6 9/18/2015 12:35:51 14.63 0.749 1.0000000
This mirrors the behavior of tidyverse functions.
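One way a function can accept both forms is sketched below using rlang. This is an illustration under stated assumptions rather than driftR's actual implementation: flexMean() is a hypothetical name, and the approach relies on enquo() to capture the argument unevaluated and as_name() to turn either a bare symbol or a string into a column name.

```r
library(rlang)

# hypothetical function (not part of driftR) that accepts a column name
# either quoted or unquoted
flexMean <- function(data, variable) {
  varQ <- enquo(variable)     # capture the argument without evaluating it
  varName <- as_name(varQ)    # works for a bare symbol and for a string
  mean(data[[varName]])
}

temps <- data.frame(Temp = c(14.76, 14.64, 14.57))

unquoted <- flexMean(temps, Temp)
quoted <- flexMean(temps, "Temp")
```

Because the argument is captured before R tries to look up an object called Temp, the unquoted call no longer fails with an "object not found" error.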
The pipe operator (%>%) is a product of the tidyverse package magrittr. It offers users the ability to make code more readable by chaining a series of related functions together. Pipes are ideal for situations where there is only one input and one output, and where the number of steps is not large. [Wickham and Grolemund](http://r4ds.had.co.nz) recommend keeping pipes to about ten steps or fewer.
Pipes are designed to be “read” like lines of literal text. For example:
library("driftR")
library("dplyr")   # attaching dplyr makes the %>% operator available
driftData <- driftData %>%
  dr_factor(corrFactor = corrFac,
            dateVar = Date,
            timeVar = Time,
            keepDateTime = TRUE) %>%
  dr_correctOne(sourceVar = SpCond,
                cleanVar = SpCond_Corr,
                calVal = 1.07,
                calStd = 1,
                factorVar = corrFac)
One could read the pipe above in the following way:
1. Assign (<-) to the object driftData the result of taking the existing driftData object, then
2. calculate a correction factor named corrFac, then
3. use it to calculate a corrected version of SpCond that will be stored in a new variable named SpCond_Corr.
Each time "then" appears, a pipe operator has been used. We place the pipe operator at the end of each function call that we wish to chain to the next one.
The .data pronoun
One thing you will notice is that we do not make reference to the driftData object in either dr_factor() or dr_correctOne() when they are included in a pipe. Compare that to the code block in the Tidy Evaluation in driftR section above, which includes an explicit reference to driftData in the dr_factor() call. Using the pipe operator simplifies our code in part by reducing the number of characters needed in each function call. Specifically, the pipe operator allows us to drop our references to particular data frames, because it passes the object on its left-hand side as the first argument of the function on its right. This is facilitated by the pronoun .data. Each of the driftR functions incorporates .data, allowing the data frame to be passed easily from one function to the next in a pipe.
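The mechanics can be seen in a small sketch, assuming magrittr is installed. The helper meanOf() is hypothetical; its first argument is named .data only to mirror the convention described above.

```r
library(magrittr)

# hypothetical helper (not part of driftR) whose first argument is .data
meanOf <- function(.data, varName) {
  mean(.data[[varName]])
}

temps <- data.frame(Temp = c(14.76, 14.64, 14.57))

# explicit call: the data frame is named inside the function call
explicit <- meanOf(temps, "Temp")

# piped call: %>% supplies temps as the first argument
piped <- temps %>% meanOf("Temp")

# the dot placeholder makes the hand-off visible
placeholder <- temps %>% meanOf(.data = ., varName = "Temp")
```

All three calls return the same value; the piped forms simply move the data frame out of the argument list.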
Piping driftR functions with %>%
Below is an example of what a full correction of a data set should look like if you are piping driftR functions together:
# load needed packages
library(driftR)
library(dplyr)
library(readr)

# import data exported from a Sonde
# example file located in the package
waterTibble <- dr_read(file = system.file("extdata", "rawData.csv", package="driftR"),
                       instruments = "Sonde", defineVar = TRUE, cleanVar = TRUE, case = "snake")

# calculate correction factors and correct variables
waterTibble <- waterTibble %>%
  dr_factor(corrFactor = corrFac, dateVar = Date,
            timeVar = Time, keepDateTime = TRUE) %>%
  dr_correctOne(sourceVar = SpCond, cleanVar = SpCond_Corr,
                calVal = 1.07, calStd = 1, factorVar = corrFac) %>%
  rename(Turbidity = `Turbidity.`) %>%
  dr_correctOne(sourceVar = Turbidity, cleanVar = Turbidity_Corr,
                calVal = 1.3, calStd = 0, factorVar = corrFac) %>%
  dr_correctOne(sourceVar = DO, cleanVar = DO_Corr, calVal = 97.6,
                calStd = 99, factorVar = corrFac) %>%
  dr_correctTwo(sourceVar = pH, cleanVar = pH_Corr,
                calValLow = 7.01, calStdLow = 7, calValHigh = 11.8,
                calStdHigh = 10, factorVar = corrFac) %>%
  dr_correctTwo(sourceVar = Chloride, cleanVar = Chloride_Corr,
                calValLow = 11.6, calStdLow = 10, calValHigh = 1411,
                calStdHigh = 1000, factorVar = corrFac) %>%
  dr_drop(head=6, tail=6) %>%
  dr_replace(sourceVar = Turbidity, overwrite = TRUE, exp = Turbidity < 0) %>%
  select(Date, Time, SpCond, SpCond_Corr, pH, pH_Corr, pHmV,
         Chloride, Chloride_Corr, AmmoniumN, NitrateN,
         Turbidity, Turbidity_Corr, DO, DO_Corr, corrFac)

# export cleaned data
# the output file is passed by position because the argument name differs
# across readr versions (path in older releases, file in newer ones)
write_csv(waterTibble, "waterData.csv", na = "NA")