transform

Part III: Transform

The Data Analysis Pipeline

a grammar for transforming data frames

library(dplyr) OR library(tidyverse)

dplyr (pronounced dee-ply-er, a play on words with “data” and “pliers”) is a useful R package we’ll discuss in this section. The various functions we’ll use, like select(), filter(), and mutate(), all belong to the dplyr package.

Just as a reminder, in R, we bring in the functionality of a package by using the library() command. Because dplyr forms part of the tidyverse suite of packages, we can bring in the useful functions of dplyr by using either the library(dplyr) command or the library(tidyverse) command.

The idea behind dplyr is that any data analytic task can be broken down into a small number of basic or atomic tasks, and there should be a consistent way to specify each atomic task - a grammar.

As we will see in this section, each dplyr function takes a data frame, does something with it, and then returns a modified data frame as its output. dplyr functions can be strung together to create powerful data analysis pipelines in just a few lines of code.

Subsetting Data

Subsetting Columns vs Rows

select()

filter()

select()

select(data_frame, ...)

select()

select(covid_testing, mrn, last_name)
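
If you’d like to try this outside the workshop environment, here is a minimal sketch. It assumes a tiny, made-up stand-in for covid_testing (the values are invented for illustration and are not the workshop data), and the name `selected` is just a placeholder.

```r
library(dplyr)

# Hypothetical toy stand-in for the workshop's covid_testing data frame
covid_testing <- tibble(
  mrn = c(5000083, 5000084, 5000085),
  first_name = c("arya", "tyrion", "daenerys"),
  last_name = c("stark", "lannister", "targaryen")
)

# Keep only the mrn and last_name columns; all rows are retained
selected <- select(covid_testing, mrn, last_name)
names(selected)
```

Note that the data frame comes first, followed by the column names we want to keep.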

Your Turn #1

Which of the following will select the first_name column from the covid_testing data frame and capture the result in a data frame named newdata?

A) newdata = select(first_name, covid_testing)

B) newdata <- select(covid_testing, first_name)

C) select(newdata, covid_testing, first_name)

D) newdata <- select(covid_testing, First_Name)

E) Both B and D

Type your response in the chat!

filter()

filter(data_frame, ...)

filter()

filter(covid_testing, mrn == 5000083)
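
Here is a small sketch of the same call that you can run on its own. It assumes a made-up stand-in for covid_testing (invented values, not the workshop data), and `one_patient` is a placeholder name.

```r
library(dplyr)

# Hypothetical toy stand-in for the workshop's covid_testing data frame
covid_testing <- tibble(
  mrn = c(5000083, 5000084, 5000085),
  last_name = c("stark", "lannister", "targaryen")
)

# Keep only the rows where mrn is equal to 5000083; all columns are retained
one_patient <- filter(covid_testing, mrn == 5000083)
one_patient
```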

A Potential Pitfall!

Error: Problem with filter() input ..1. x Input ..1 is named. ℹ This usually means that you’ve used = instead of ==.

OR

Error: unexpected ‘=’

OR

invalid (do_set) left-hand side to assignment
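
To see this pitfall safely, the sketch below wraps the mistaken call in tryCatch() so the error is captured instead of stopping your script. It assumes a made-up stand-in for covid_testing; `message_text` and `right` are placeholder names, and the exact wording of the error depends on your dplyr version.

```r
library(dplyr)

# Hypothetical toy stand-in for the workshop's covid_testing data frame
covid_testing <- tibble(
  mrn = c(5000083, 5000084, 5000085),
  last_name = c("stark", "lannister", "targaryen")
)

# A single = names an argument instead of comparing values, so filter() errors.
# tryCatch() captures the error message rather than halting.
message_text <- tryCatch(
  {
    filter(covid_testing, mrn = 5000083)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
message_text

# A double == performs the comparison we actually want
right <- filter(covid_testing, mrn == 5000083)
```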

Logical Operators

logical expression	means	example
`x < y`	less than	`pan_day < 10`
`x > y`	greater than	`mrn > 5001000`
`x == y`	equal to	`first_name == last_name`
`x <= y`	less than or equal to	`mrn <= 5000000`
`x >= y`	greater than or equal to	`pan_day >= 30`
`x != y`	not equal to	`test_id != "covid"`
`is.na(x)`	a missing value	`is.na(clinic_name)`
`!is.na(x)`	not a missing value	`!is.na(pan_day)`
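
A few of these operators in action, as a minimal sketch. The data frame is a made-up stand-in for covid_testing (invented values), and the result names are placeholders.

```r
library(dplyr)

# Hypothetical toy stand-in for the workshop's covid_testing data frame
covid_testing <- tibble(
  mrn = c(5000083, 5000084, 5000085, 5000086),
  clinic_name = c("picu", NA, "emergency dept", "picu"),
  pan_day = c(4, 12, 35, NA)
)

# Rows where pan_day is less than 10 (rows with a missing pan_day are dropped)
early <- filter(covid_testing, pan_day < 10)

# Rows where clinic_name is not missing
named_clinic <- filter(covid_testing, !is.na(clinic_name))

# Rows where pan_day is missing
missing_day <- filter(covid_testing, is.na(pan_day))
```

Notice that filter() keeps only rows where the condition is TRUE, so a comparison against a missing value never keeps the row. Use is.na() when you want to find missing values.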

Your Turn #2

Write a filter() statement that returns a data frame containing only the rows from covid_testing in which the last_name column is NOT equal to “stark”.

(You don’t have to capture the returned data frame)

Type your response in the chat!

filter(covid_testing, last_name != "stark")

Your Turn #3

Which of these would successfully filter the covid_testing data frame to only tests with positive results?

A) filter(covid_testing, result == positive)

B) filter(covid_testing, result = "positive")

C) filter(covid_testing, result == "positive")

D) filter(covid_testing, positive == "result")

The Pipe Operator %>%

The pipe operator we’ll use is %>%

(You may also see |>, the native pipe available from R 4.1.0 onward)

One of the most powerful concepts in the tidyverse suite of packages is the pipe operator, which is written in two possible ways:

percent, greater than, percent (%>%): this is the original pipe, which comes from the magrittr package and becomes available when you load dplyr or tidyverse
vertical bar, greater than (|>): this is a newer option that is “native”, meaning it is part of base R itself if you’re using R version 4.1.0 or later

We’re going to use the original pipe, for two reasons:

There are a few specific situations, which we won’t run into today, in which the older pipe and the newer pipe behave differently.
We think the older pipe is still what you’ll see most often, at least for another year or so.

Still, we want you to know that a newer version of the pipe exists and you might see it or use it in the future! It works in an almost identical way.

The Pipe Operator

Passes the object on the left as the first argument to the function on the right

covid_testing %>% filter(pan_day <= 10) is equivalent to filter(covid_testing, pan_day <= 10)

OR, if you use the “new” pipe in the future:

covid_testing |> filter(pan_day <= 10) is equivalent to filter(covid_testing, pan_day <= 10)

Both pipe operators pass the object on the left as the first argument to the function on the right.

In this workshop, we’ll use the “original” pipe (that’s the one that has percent greater than percent) in code examples and quiz questions, because we think this is the one you’ll see the most in code that your coworkers share with you or you find in online examples. We’re also running on the latest stable version of R that ships with our server software, which is 3.6. This will gradually change, and when we get 4.1.0 as the default R version, we’ll likely change these materials to reflect that.

That means, and I’m going to read the top line of code in blue, that the statement “covid_testing, pipe, filter such that pan_day is less than or equal to ten” is equivalent to “filter the covid_testing data frame such that pan_day is less than or equal to ten”. Those two lines of code are equivalent.

In both cases we’re taking the covid_testing data frame, passing it as the first argument to the filter() function, and adding a condition that we’re filtering by. In our case that condition is pan_day less than or equal to 10.
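
If you want to convince yourself that the two forms give the same result, here is a minimal sketch. It assumes a made-up stand-in for covid_testing (invented values); `piped` and `nested` are placeholder names.

```r
library(dplyr)

# Hypothetical toy stand-in for the workshop's covid_testing data frame
covid_testing <- tibble(
  mrn = c(5000083, 5000084, 5000085),
  pan_day = c(4, 10, 35)
)

# With the pipe: the data frame on the left becomes filter()'s first argument
piped <- covid_testing %>% filter(pan_day <= 10)

# Without the pipe: the data frame is written as the first argument directly
nested <- filter(covid_testing, pan_day <= 10)

# Compare the two results
all.equal(piped, nested)
```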

We could say the same thing of the second line of blue code on your screen which uses the newer pipe.

This is the last time you’ll see that new pipe today… from here out we’re going to use the old favorite percent greater than percent.

Start with the covid_testing data frame. THEN
Select so that we get only certain columns. THEN
Filter so that we get only certain rows.

Here’s why the pipe (%>% or |>) is so useful.

“Tidy” functions like select(), filter(), and others we’ll see later always take a data frame as their first argument, and they always return a data frame as well. Data frame in, data frame out.

This makes it possible to create a pipeline in which a data frame object is handed from one dplyr function to the next. The data frame result of step 1 becomes the data frame starting point for step 2, then the result of step 2 becomes the starting point for step 3, and so on.

For example, here we start with covid_testing, then select the last_name and result columns, then filter to get rows where result is equal to “positive”.
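
The pipeline just described might look like the sketch below. It assumes a made-up stand-in for covid_testing (invented values, not the workshop data), and `positive_names` is a placeholder name.

```r
library(dplyr)

# Hypothetical toy stand-in for the workshop's covid_testing data frame
covid_testing <- tibble(
  mrn = c(5000083, 5000084, 5000085),
  last_name = c("stark", "lannister", "targaryen"),
  result = c("positive", "negative", "positive")
)

positive_names <- covid_testing %>%    # start with the data frame, THEN
  select(last_name, result) %>%        # keep only certain columns, THEN
  filter(result == "positive")         # keep only certain rows

positive_names
```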

You might wonder why we’ve put each step on its own line. Is this a requirement? No, it’s not. Many R users like to use whitespace (new lines, tabs, spaces, indents) to make their code more human-readable.

By connecting logical steps, you can get a pipeline of data analysis steps that is concise and also fairly human-readable. You can think of the pipe symbol as the word “then…”, describing the steps in order.

This approach to coding is powerful because it makes it much easier for someone who doesn’t know R well to read and understand your code as a series of instructions.

Your Turn #4

Rewrite the following statement with a pipe:

select(mydata, first_name, last_name)

Type the answer in the chat!

Create or Update Columns

mutate()

Create new columns or update existing ones, optionally using calculated values.

mutate()

mutate(covid_testing,
     col_rec_tat_mins = col_rec_tat * 60)

mutate()

mutate(covid_testing,
     ct_value = round(ct_value))
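
Both uses of mutate() can be tried with the minimal sketch below. It assumes a made-up stand-in for covid_testing (invented values); `with_minutes` and `rounded` are placeholder names.

```r
library(dplyr)

# Hypothetical toy stand-in for the workshop's covid_testing data frame
covid_testing <- tibble(
  mrn = c(5000083, 5000084, 5000085),
  col_rec_tat = c(1.5, 2, 0.5),
  ct_value = c(17.3, 24.8, 30.2)
)

# A new column name on the left of = adds a column to the data frame
with_minutes <- mutate(covid_testing,
                       col_rec_tat_mins = col_rec_tat * 60)

# An existing column name on the left of = overwrites that column
rounded <- mutate(covid_testing,
                  ct_value = round(ct_value))
```

Inside mutate(), a single = is correct: we are naming a column, not comparing values.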

Your Turn #5

Open 03 – Transform.qmd and work through the exercises.

Click “yes” when you are finished.

Now let’s do some hands-on work. Please go to your “exercises” folder and open the 03 transform file. You’ll have five minutes to go through the instructions in that file!

…

Everyone ready? I’m going to go through the solutions very quickly. In this first exercise, I’ll start with covid_testing, then add a pipe, and then use my filter, making sure I use the double equal. So clinic_name == “picu”. Finally, I’ll add another pipe and then keep only the columns I care about, using select(rec_ver_tat, pan_day).

Then I’ll use mutate without a pipe and make a new column composed of the sum of two existing columns. I’ll do it like this: mutate covid_testing comma total_tat equals col_rec_tat plus rec_ver_tat.

And finally, I’ll take the data frame name out of that mutate and use it as the start of a pipeline. So I have covid_testing, then, mutate, total_tat equals col_rec_tat plus rec_ver_tat.
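
Written out as code, the three solutions described above might look like this sketch. It assumes a made-up stand-in for covid_testing (invented values, not the workshop data), and the result names are placeholders.

```r
library(dplyr)

# Hypothetical toy stand-in for the workshop's covid_testing data frame
covid_testing <- tibble(
  clinic_name = c("picu", "emergency dept", "picu"),
  col_rec_tat = c(1.5, 2, 0.5),
  rec_ver_tat = c(4, 6.5, 3),
  pan_day = c(4, 12, 35)
)

# Exercise 1: filter rows with ==, then keep only the columns we care about
picu_tat <- covid_testing %>%
  filter(clinic_name == "picu") %>%
  select(rec_ver_tat, pan_day)

# Exercise 2: mutate() without a pipe, summing two existing columns
total_nested <- mutate(covid_testing, total_tat = col_rec_tat + rec_ver_tat)

# Exercise 3: the same mutate(), with the data frame moved to the start of a pipeline
total_piped <- covid_testing %>%
  mutate(total_tat = col_rec_tat + rec_ver_tat)
```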

Recap

select() subsets columns by name

filter() subsets rows by a logical condition

mutate() creates new calculated columns or changes existing columns

Use the pipe operator %>% to combine dplyr functions into a pipeline

What Else?

Cheatsheet (more dplyr functions!)

Next Up: Dashboards

Our next topic is:

Part IV: Dashboards