Part III: Transform

The Data Analysis Pipeline

dplyr: a grammar for transforming data frames

library(dplyr) OR library(tidyverse)

Subsetting Data

Subsetting Columns vs Rows

select()

filter()

select()

select(data_frame, ...)

select(covid_testing, mrn, last_name)
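
The call above can be tried on a small stand-in data frame. This is a minimal sketch: the toy covid_testing tibble and the name mrn_names are made up for illustration and are not the workshop data.

```r
library(dplyr)

# Toy stand-in for covid_testing; the values are invented for illustration.
covid_testing <- tibble(
  mrn        = c(5000083, 5000912, 5001412),
  first_name = c("jhezane", "penny", "grunt"),
  last_name  = c("westerling", "targaryen", "rivers"),
  pan_day    = c(4, 7, 32)
)

# Keep only the mrn and last_name columns; every row is retained.
mrn_names <- select(covid_testing, mrn, last_name)
names(mrn_names)
```

The data frame goes first, followed by the unquoted column names to keep. The original covid_testing is left unchanged; the result only persists because it is assigned to a new name.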

Your Turn #1

Which of the following will select the first_name column from the covid_testing data frame and capture the result in a data frame named newdata?

A) newdata = select(first_name, covid_testing)

B) newdata <- select(covid_testing, first_name)

C) select(newdata, covid_testing, first_name)

D) newdata <- select(covid_testing, First_Name)

E) Both B and D

Type your response in the chat!

filter()

filter(data_frame, ...)

filter(covid_testing, mrn == 5000083)

A Potential Pitfall!

Error: Problem with filter() input ..1. x Input ..1 is named. ℹ This usually means that you’ve used = instead of ==.

OR

Error: unexpected ‘=’

OR

invalid (do_set) left-hand side to assignment
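
To see the difference between == and = in practice, here is a small sketch. The toy covid_testing tibble and the names one_patient and pitfall are hypothetical, and the exact wording of the error depends on the installed dplyr version.

```r
library(dplyr)

# Toy stand-in for covid_testing; the values are invented for illustration.
covid_testing <- tibble(
  mrn       = c(5000083, 5000912, 5001412),
  last_name = c("westerling", "targaryen", "rivers")
)

# Correct: == tests each row for equality and keeps the matching rows.
one_patient <- filter(covid_testing, mrn == 5000083)

# Pitfall: a single = is read as a named argument, so filter() stops
# with an error. tryCatch() captures it so the script keeps running.
pitfall <- tryCatch(
  filter(covid_testing, mrn = 5000083),
  error = function(e) "filter() raised an error"
)
pitfall
```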

Logical Operators

logical expression   means                       example
x < y                less than                   pan_day < 10
x > y                greater than                mrn > 5001000
x == y               equal to                    first_name == last_name
x <= y               less than or equal to       mrn <= 5000000
x >= y               greater than or equal to    pan_day >= 30
x != y               not equal to                test_id != "covid"
is.na(x)             a missing value             is.na(clinic_name)
!is.na(x)            not a missing value         !is.na(pan_day)
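
A few of these operators are sketched below on a toy data frame that includes missing values. The tibble and the result names (late_tests, has_clinic, no_clinic) are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for covid_testing with some missing values (NA);
# the values are invented for illustration.
covid_testing <- tibble(
  mrn         = c(5000083, 5000912, 5001412, 5002001),
  clinic_name = c("clinic a", NA, "clinic b", "clinic a"),
  pan_day     = c(4, 7, 32, NA)
)

# Comparison: keep rows where pan_day is at least 30.
# The row with a missing pan_day is dropped, because NA >= 30 is NA, not TRUE.
late_tests <- filter(covid_testing, pan_day >= 30)

# Missing-value checks: rows with, and without, a recorded clinic name.
has_clinic <- filter(covid_testing, !is.na(clinic_name))
no_clinic  <- filter(covid_testing, is.na(clinic_name))
```

Note that filter() keeps only rows where the condition is TRUE, so rows where the comparison evaluates to NA are silently excluded.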

Your Turn #2

Write a filter() statement that returns a data frame containing only the rows from covid_testing in which the last_name column is NOT equal to “stark”.

(You don’t have to capture the returned data frame)

Type your response in the chat!

filter(covid_testing, last_name != "stark")

Your Turn #3

Which of these would successfully filter the covid_testing data frame to only tests with positive results?

A) filter(covid_testing, result == positive)

B) filter(covid_testing, result = "positive")

C) filter(covid_testing, result == "positive")

D) filter(covid_testing, positive == "result")

The Pipe Operator %>%

The pipe operator we’ll use is %>%

(You may also see the native pipe |>, available in R 4.1.0 and later.)

The Pipe Operator

Passes the object on the left as the first argument to the function on the right

covid_testing %>% filter(pan_day <= 10) is equivalent to filter(covid_testing, pan_day <= 10)

OR, if you later switch to the native pipe |>:

covid_testing |> filter(pan_day <= 10) is equivalent to filter(covid_testing, pan_day <= 10)

  • Start with the covid_testing data frame. THEN
  • Select so that we get only certain columns. THEN
  • Filter so that we get only certain rows.
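
The three steps above can be sketched as a single pipeline. This is an illustrative example only: the toy covid_testing tibble, the chosen columns, and the name early_tests are assumptions rather than the workshop's own code.

```r
library(dplyr)

# Toy stand-in for covid_testing; the values are invented for illustration.
covid_testing <- tibble(
  mrn        = c(5000083, 5000912, 5001412),
  first_name = c("jhezane", "penny", "grunt"),
  last_name  = c("westerling", "targaryen", "rivers"),
  pan_day    = c(4, 7, 32)
)

early_tests <- covid_testing %>%          # start with the data frame, THEN
  select(mrn, last_name, pan_day) %>%     # keep only certain columns, THEN
  filter(pan_day <= 10)                   # keep only certain rows

early_tests
```

Each step receives the output of the previous one as its first argument, so the data frame is named only once, at the top. The filter() step can only use columns that survived the select() step, which is why pan_day is kept.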

Your Turn #4

Rewrite the following statement with a pipe:

select(mydata, first_name, last_name)

Type the answer in the chat!

Create or Update Columns

mutate()

Create new columns or update existing ones, optionally from a calculation.

mutate(covid_testing,
     col_rec_tat_mins = col_rec_tat * 60)

mutate(covid_testing,
     ct_value = round(ct_value))
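
Both uses of mutate() can be combined in one runnable sketch. The toy covid_testing tibble and the name updated are hypothetical, and the sketch assumes col_rec_tat is recorded in hours, so that multiplying by 60 gives minutes.

```r
library(dplyr)

# Toy stand-in for covid_testing; the values are invented for illustration.
# Assumption: col_rec_tat is measured in hours.
covid_testing <- tibble(
  mrn         = c(5000083, 5000912, 5001412),
  col_rec_tat = c(1.5, 2, 0.5),
  ct_value    = c(24.6, 45, 17.2)
)

updated <- covid_testing %>%
  mutate(
    col_rec_tat_mins = col_rec_tat * 60,  # new name: adds a column
    ct_value = round(ct_value)            # existing name: overwrites the column
  )

updated
```

Whether mutate() adds or replaces a column depends only on the name to the left of the =: a name that is not yet in the data frame creates a new column, and an existing name replaces that column's values.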

Your Turn #5

Open 03 – Transform.qmd and work through the exercises.

Click “yes” when you are finished.

Recap

select() subsets columns by name

filter() subsets rows by a logical condition

mutate() creates new calculated columns or changes existing columns

Use the pipe operator %>% to combine dplyr functions into a pipeline

What Else?

Cheatsheet (more dplyr functions!)

Next Up: Dashboards

Our next topic is:

Part IV: Dashboards