transform

Part III: Transform

The Data Analysis Pipeline

a grammar for transforming data frames

library(dplyr) OR library(tidyverse)

dplyr (pronounced dee-ply-er, a play on words with “data” and “pliers”) is a useful R package we’ll discuss in this section. The various functions we’ll use, like select, filter, and mutate, all belong to the dplyr package.

Just as a reminder, in R, we bring in the functionality of a package by using the library() command. Because dplyr forms part of the tidyverse suite of packages, we can bring in the useful functions of dplyr by either using the library(dplyr) command or the library(tidyverse) command.

The idea behind dplyr is that any data analytic task can be broken down into a small number of basic or atomic tasks, and there should be a consistent way to specify each atomic task - a grammar.

As we will see in this last section, each dplyr function takes a data frame, does something with it, and then returns a modified data frame as its output. dplyr functions can be strung together to create powerful data analysis pipelines in just a few lines of code.

Subsetting Data

Subsetting Columns vs Rows

select()

filter()

select()

select(data_frame, ...)

select()

select(covid_testing, mrn, last_name)
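
To try the select() call above without the workshop data, here is a minimal sketch. The small covid_testing data frame built here is a hypothetical stand-in with made-up values; only the column names come from the slides.

```r
library(dplyr)

# Hypothetical stand-in for the workshop's covid_testing data frame
# (made-up values, for illustration only).
covid_testing <- data.frame(
  mrn        = c(5000083, 5000876, 5001412),
  first_name = c("arya", "jon", "sansa"),
  last_name  = c("stark", "snow", "stark"),
  result     = c("negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Keep only the mrn and last_name columns; every row is retained.
mrn_and_name <- select(covid_testing, mrn, last_name)
names(mrn_and_name)
```

select() changes which columns you have, not which rows, so the result has the same number of rows as the input.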

Your Turn #1

Which of the following will select the first_name column from the covid_testing data frame and capture the result in a data frame named newdata?

A) newdata = select(first_name, covid_testing)

B) newdata <- select(covid_testing, first_name)

C) select(newdata, covid_testing, first_name)

D) newdata <- select(covid_testing, First_Name)

E) Both B and D

A poll will come up for you to put in your answer in Teams!

filter()

filter(data_frame, ...)

filter()

filter(covid_testing, mrn == 5000083)
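
The same filter() call can be tried on a toy data frame. This is a sketch only: the covid_testing object below is a hypothetical stand-in with made-up values, not the workshop data.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
covid_testing <- data.frame(
  mrn       = c(5000083, 5000876, 5001412),
  last_name = c("stark", "snow", "stark"),
  result    = c("negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Keep only the rows where mrn equals 5000083; all columns are retained.
one_patient <- filter(covid_testing, mrn == 5000083)
nrow(one_patient)
```

filter() changes which rows you have, not which columns.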

A Potential Pitfall!

Error: Problem with filter() input ..1.
x Input ..1 is named.
ℹ This usually means that you’ve used = instead of ==.

OR

Error: unexpected ‘=’

OR

invalid (do_set) left-hand side to assignment
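
A small sketch of the first pitfall, using a hypothetical toy covid_testing data frame. The exact wording of the error depends on your dplyr version, so the code only checks that an error occurs rather than matching its text.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
covid_testing <- data.frame(
  mrn    = c(5000083, 5000876),
  result = c("negative", "positive"),
  stringsAsFactors = FALSE
)

# A single = is read as a named argument, not a comparison,
# so filter() stops with an error. tryCatch() captures it here.
attempt <- tryCatch(
  filter(covid_testing, mrn = 5000083),
  error = function(e) e
)
single_equals_failed <- inherits(attempt, "error")

# A double == compares each row's mrn to the value and works as intended.
fixed <- filter(covid_testing, mrn == 5000083)
```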

Logical Operators

logical expression	means	example
`x < y`	less than	`pan_day < 10`
`x > y`	greater than	`mrn > 5001000`
`x == y`	equal to	`first_name == last_name`
`x <= y`	less than or equal to	`mrn <= 5000000`
`x >= y`	greater than or equal to	`pan_day >= 30`
`x != y`	not equal to	`test_id != "covid"`
`is.na(x)`	a missing value	`is.na(clinic_name)`
`!is.na(x)`	not a missing value	`!is.na(pan_day)`
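
Here is a sketch of a few of these expressions used inside filter(). The toy covid_testing data frame and its values are hypothetical; only the column names are taken from the table.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
# Note the missing clinic_name in the second row.
covid_testing <- data.frame(
  last_name   = c("stark", "snow", "lannister"),
  clinic_name = c("picu", NA, "emergency dept"),
  pan_day     = c(4, 12, 35),
  stringsAsFactors = FALSE
)

early_tests  <- filter(covid_testing, pan_day < 10)         # less than
late_tests   <- filter(covid_testing, pan_day >= 30)        # greater than or equal to
not_stark    <- filter(covid_testing, last_name != "stark") # not equal to
has_clinic   <- filter(covid_testing, !is.na(clinic_name))  # not a missing value
lacks_clinic <- filter(covid_testing, is.na(clinic_name))   # a missing value
```

Each expression is evaluated once per row, and filter() keeps the rows where it is TRUE.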

Your Turn #2

Write a filter() statement that returns a data frame containing only the rows from covid_testing in which the last_name column is NOT equal to “stark”.

(You don’t have to capture the returned data frame)

Type your response in the chat!

filter(covid_testing, last_name != "stark")

Your Turn #3

Which of these would successfully filter the covid_testing data frame to only tests with positive results?

A) filter(covid_testing, result == positive)

B) filter(covid_testing, result = "positive")

C) filter(covid_testing, result == "positive")

D) filter(covid_testing, positive == "result")

The Pipe Operator %>%

The pipe operator we’ll use is %>%

(You can also use |> in R 4.1.0 and later)

One of the most powerful concepts in the tidyverse suite of packages is the pipe operator, which is written in two possible ways:

percent, greater than, percent (%>%) (this is the original pipe, which comes from the magrittr package and becomes available when you load dplyr or tidyverse)
vertical pipe, greater than (|>) (this is a newer option, and is now “native”, meaning it comes from base R, if you’re using R version 4.1.0 or later)

We’re going to use the original pipe, for two reasons:

There are very specific occasions, which we won’t run into today, in which the older pipe and the newer pipe do different things.
We think the older pipe is still going to be what you see most, at least for another year or so.

Still, we want you to know that a newer version of the pipe exists and you might see it or use it in the future! It works in an almost identical way.

The Pipe Operator

Passes the object on the left as the first argument to the function on the right

covid_testing %>% filter(pan_day <= 10) is equivalent to filter(covid_testing, pan_day <= 10)

OR, if you use the “new” pipe in the future:

covid_testing |> filter(pan_day <= 10) is equivalent to filter(covid_testing, pan_day <= 10)

Each pipe operator passes the object on its left as the first argument to the function on its right.

In this workshop, we’ll use the “original” pipe (that’s the one that has percent greater than percent) in code examples and quiz questions, because we think this is the one you’ll see the most in code that your coworkers share with you or you find in online examples. We’re also running on the latest stable version of R that ships with our server software, which is 3.6. This will gradually change, and when we get 4.1.0 as the default R version, we’ll likely change these materials to reflect that.

That means, and I’m going to read the top line of code in blue, that the statement “covid_testing, pipe, filter such that pan_day is less than or equal to ten” is equivalent to “filter the covid_testing data frame such that pan_day is less than or equal to ten”. Those two lines of code are equivalent.

In both cases we’re taking the covid_testing data frame, passing it as the first argument to the filter() function, and adding a condition that we’re filtering by. In our case that condition is pan_day less than or equal to 10.
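
If you would like to check that equivalence yourself, here is a minimal sketch. The covid_testing data frame below is a hypothetical stand-in with made-up values.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
covid_testing <- data.frame(
  mrn     = c(5000083, 5000876, 5001412),
  pan_day = c(4, 12, 10),
  stringsAsFactors = FALSE
)

# With the pipe: covid_testing becomes the first argument of filter().
piped <- covid_testing %>% filter(pan_day <= 10)

# Without the pipe: covid_testing is written as the first argument.
unpiped <- filter(covid_testing, pan_day <= 10)

# Both forms return the same data frame.
same_result <- identical(piped, unpiped)
```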

We could say the same thing of the second line of blue code on your screen which uses the newer pipe.

This is the last time you’ll see that new pipe today… from here out we’re going to use the old favorite percent greater than percent.

Start with the covid_testing data frame. THEN
Select so that we get only certain columns. THEN
Filter so that we get only certain rows.

Here’s why the pipe (%>% or |>) is so useful.

“Tidy” functions like select(), filter(), and others we’ll see later always take a data frame as their first argument, and they always return a data frame as well. Data frame in, data frame out.

This makes it possible to create a pipeline in which a data frame object is handed from one dplyr function to the next. The data frame result of step 1 becomes the data frame starting point for step 2, then the result of step 2 becomes the starting point for step 3, and so on.

For example, here we start with covid_testing, then select the last_name and result columns, then filter to get rows where result is equal to “positive”.
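
The pipeline just described can be sketched as follows. The toy covid_testing data frame is hypothetical (made-up values); the steps follow the description above.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
covid_testing <- data.frame(
  mrn       = c(5000083, 5000876, 5001412),
  last_name = c("stark", "snow", "lannister"),
  result    = c("negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

positive_tests <- covid_testing %>%   # start with the data frame, THEN
  select(last_name, result) %>%       # keep only certain columns, THEN
  filter(result == "positive")        # keep only certain rows
```

The order matters here: filter() can only refer to result because select() kept that column.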

You might wonder why we’ve put each step in its own line. Is this a requirement? No, it’s not. Many R users like to use whitespace (new lines, tabs, spaces, indents) to make their code more human readable.

By connecting logical steps, you can get a pipeline of data analysis steps which are concise and also fairly human readable. You can think of the pipe symbol as the word “then…”, describing the steps in order.

This approach to coding is powerful because it makes it much easier for someone who doesn’t know R well to read and understand your code as a series of instructions.

Your Turn #4

Rewrite the following statement with a pipe:

select(mydata, first_name, last_name)

Type the answer in the chat!

Create or Update Columns

mutate()

Create new or updated, optionally calculated columns.

mutate()

mutate(covid_testing,
     col_rec_tat_mins = col_rec_tat * 60)

mutate()

mutate(covid_testing,
     ct_value = round(ct_value))
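
Both mutate() calls can be tried on a toy data frame. This is a sketch only: the covid_testing values below are made up.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
covid_testing <- data.frame(
  mrn         = c(5000083, 5000876, 5001412),
  col_rec_tat = c(1.5, 0.5, 2),
  ct_value    = c(24.6, 31.2, 18.9),
  stringsAsFactors = FALSE
)

# Create a NEW column: the name on the left of = does not exist yet.
with_minutes <- mutate(covid_testing,
                       col_rec_tat_mins = col_rec_tat * 60)

# UPDATE an existing column: the name on the left of = already exists,
# so its values are replaced by the rounded ones.
with_rounded <- mutate(covid_testing,
                       ct_value = round(ct_value))
```

Note that the = inside mutate() is correct: here we are naming a column, not testing for equality as in filter().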

Your Turn #5

Open 03 – Transform.qmd and work through the exercises for the section that says “Your Turn #5.”

Click “thumbs up” when you are finished.

Now let’s do some hands-on work. Please go to your “exercises” folder and open the 03 transform file. You’ll have five minutes to go through the first part of that file. There’s a place where it says “stop here…”, so please do!

…

Everyone ready? I’m going to go through the solutions. In this first exercise, I’ll start with covid_testing, then add a pipe, and then use my filter, making sure I use the double equal. So clinic_name == "picu". Finally, I’ll add another pipe and then keep only the columns I care about, using select(rec_ver_tat, pan_day).
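
Written out, that first solution might look like the sketch below. The covid_testing data frame here is a hypothetical stand-in with made-up values, so the output will differ from what you see in the exercise file.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
covid_testing <- data.frame(
  clinic_name = c("picu", "emergency dept", "picu"),
  rec_ver_tat = c(5.2, 3.1, 8.4),
  pan_day     = c(4, 12, 35),
  stringsAsFactors = FALSE
)

picu_tests <- covid_testing %>%
  filter(clinic_name == "picu") %>%   # rows: only the picu clinic
  select(rec_ver_tat, pan_day)        # columns: only the two we care about
```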

Then I’ll use mutate without a pipe and make a new column composed of the sum of two existing columns. I’ll do it like this: mutate covid_testing comma total_tat equals col_rec_tat plus rec_ver_tat.

And finally, I’ll take the data frame name out of that mutate and use it as the start of a pipeline. So I have covid_testing, then, mutate, total_tat equals col_rec_tat plus rec_ver_tat.
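
Here is a sketch of both versions of that mutate() step side by side, again on a hypothetical toy covid_testing data frame with made-up values.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
covid_testing <- data.frame(
  mrn         = c(5000083, 5000876, 5001412),
  col_rec_tat = c(1, 2, 3),
  rec_ver_tat = c(4, 5, 6),
  stringsAsFactors = FALSE
)

# Without a pipe: the data frame is the first argument.
without_pipe <- mutate(covid_testing, total_tat = col_rec_tat + rec_ver_tat)

# With a pipe: the data frame moves out front and starts the pipeline.
with_pipe <- covid_testing %>%
  mutate(total_tat = col_rec_tat + rec_ver_tat)

# The two versions give the same result.
same_result <- identical(without_pipe, with_pipe)
```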

Group By and Summarize

A very common use case is to divide your data into groups, and get information about each group.

For this, we’ll use group_by and summarize.

Group by combined with summarize is a way for us to lump cases together and then get a statistic for each group. For example, maybe you want the median blood sugar for girls and the median blood sugar for boys in your study, or the maximum wait time for King of Prussia emergency department patients and the maximum wait time for University City emergency department patients.

When you use group by, you have to tell R how to separate your cases into groups. In the image here, there are three groups, each of which is represented by a different shade of gold. Any categorical variable can be used to group. For example, you can group by sex, or race, or zip code. Maybe these three groups are three states, like New Jersey, Pennsylvania, and Delaware!

Once you have your data in groups, you can then use the summarize command to get summary statistics for each group. The summary for each group is represented in blue in this small image.

Summarizing can take lots of different forms! Sometimes you want to know how big the group is, how many members it has. Sometimes you want to know what the average value of something is per group, or what the maximum value is. You can also summarize and give several different measures for each group, like maximum, minimum, mean, and median. It looks like in this image there are two values given for each group. Maybe we have two values for New Jersey, Pennsylvania and Delaware, like the number of patients we have in each state and the number of patients in each state using Medicaid.

Additional Practice (Time Permitting)

If time permits:

Open 03 – Transform.qmd and work through the exercises for the section that says “Your Turn #6.” We’ll do this together!

Say one of these, depending on your time frame:

This is an optional exercise, and I think we have time to do it!

OR

We don’t have time to do this optional exercise right now, but you have access to your project and these slides, and this might be fun for you to do in your own time later.

Please go to your “exercises” folder and open the 03 transform file. Give me that thumbs up when you’re in that file, and you are looking at the section called “Your Turn 6.” We haven’t had a chance to talk about group by and aggregation yet, so this is a section of code we’ll work on together.

Everyone ready?

Great, let’s begin by reading the instructions, which tell us we want to compare different clinics.

You are interested in understanding the relative utilization of COVID tests and the range of turnaround times across locations, which are captured with the clinic_name variable. Use group_by and summarize to calculate:

a) The number of orders ordered by each clinic/unit, creating a new summary variable num_orders

b) The median total turnaround time for each clinic/unit (using the total_tat variable you created in the previous exercise), using a new summary variable median_tat.

We’re also given a hint in the exercise file. It tells us, “The function to calculate a median is (predictably) median(…).”

Let’s start by adding our starting data frame. What’s the name of our starting data frame again? The one with all the data in it? That’s right, covid_testing. I’m adding that to the first line of this chunk.

And we want to use group_by.

When you use group by, you need to tell R what variable you’re using to separate data into groups. Here, we want to get a statistic for each clinic, so we’ll add group_by(clinic_name) in the second part of this pipeline. That will form one group for each clinic name.

What were we asked to do? Let’s look at letter (a) again. We were asked to find the number of orders ordered by each clinic/unit, creating a new summary variable num_orders.

We’ll want to put some kind of function here that counts the number of members in each group. In R, a very easy way to do this is to use the n function. Here I’m just going to put n, followed by an open and closed parenthesis.

But we were also asked to get a second summarizing variable for each group. What was that? Well, in letter (b) we read that we’re asked to calculate median total turnaround time for each clinic/unit (using the total_tat variable you created in the previous exercise), using a new summary variable median_tat.

We were also told that we could use the function median() to calculate a median value. So how do you think we can calculate the median total TAT for each group? Type into chat if you think you know!

…

That’s right! We’re going to add median(total_tat) as the value of our new variable, median_tat.

Now let’s run this chunk and see if I typed everything correctly! Great! Now we have values by groups, and we set the grouping to be each clinic having its own group.
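
The finished chunk might look something like the sketch below. Two assumptions: the covid_testing data frame here is a hypothetical stand-in with made-up values, and the total_tat column is recreated with mutate() inside the pipeline in case it was not saved in the previous exercise.

```r
library(dplyr)

# Hypothetical stand-in for covid_testing (made-up values).
covid_testing <- data.frame(
  clinic_name = c("picu", "picu", "emergency dept",
                  "emergency dept", "emergency dept"),
  col_rec_tat = c(1, 2, 3, 4, 5),
  rec_ver_tat = c(4, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)

by_clinic <- covid_testing %>%
  mutate(total_tat = col_rec_tat + rec_ver_tat) %>%  # assumed: recreate total_tat
  group_by(clinic_name) %>%                          # one group per clinic
  summarize(
    num_orders = n(),                # (a) number of rows in each group
    median_tat = median(total_tat)   # (b) median total turnaround time per group
  )
```

The result has one row per clinic and one column for each summary variable, plus the grouping column clinic_name.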

Group by and aggregation functions together do a great job of helping you characterize your data and notice groupwise differences, whether that’s by location, sex, race, disease status, or some other category!