2  readr

Published

March 15, 2023

Modified

January 8, 2024

2.1 readr resources

The primary task of readr is to parse a flat, plain-text file into a data frame in which each column is cast into the correct type. Parsing takes place in three basic stages:

  1. The flat file is parsed into a rectangular matrix of strings.
  2. The type of each column is determined.
  3. Each column of strings is parsed into a vector of a more specific type.
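
The three stages can be imitated by hand with exported readr functions. This is only a rough sketch of the idea, not how readr is implemented internally; the inline data is made up for illustration, and I() for literal data assumes readr 2.0 or later.

```r
library(readr)

# Made-up inline data; I() marks the string as literal data, not a file path
raw_text <- I("id,score,joined\n1,2.5,2023-03-15\n2,3.5,2023-03-16")

# Stage 1: read every column as character, giving a rectangle of strings
strings <- read_csv(raw_text, col_types = cols(.default = col_character()))

# Stage 2: guess a parser for each column of strings
guessed <- vapply(strings, guess_parser, character(1))

# Stage 3: convert each column of strings to its guessed type
typed <- type_convert(strings)
```

Here guessed holds one parser name per column, and typed holds the same data with each column converted accordingly.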

2.2 Overview

The read_* functions first call the respective spec_* function, as described in Section 10.4, which uses guess_parser() on each column. They then cast the character vectors to the guessed types using the parse_* functions described in Section 2.4. Use spec() after a CSV file has been read in to see all of the column types, or spec_csv() to see the column specification before reading in the data. Both are particularly useful with very wide data.

# Read in data
df <- read_csv(readr_example("mini-gapminder-americas.csv"))
#> Rows: 6 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): country
#> dbl (4): year, lifeExp, pop, gdpPercap
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df
#> # A tibble: 6 × 5
#>   country    year lifeExp      pop gdpPercap
#>   <chr>     <dbl>   <dbl>    <dbl>     <dbl>
#> 1 Argentina  1952    62.5 17876956     5911.
#> 2 Bolivia    1952    40.4  2883315     2677.
#> 3 Brazil     1952    50.9 56602560     2109.
#> 4 Canada     1952    68.8 14785584    11367.
#> 5 Chile      1952    54.7  6377619     3940.
#> 6 Colombia   1952    50.6 12350771     2144.

# See full column specification
spec(df)
#> cols(
#>   country = col_character(),
#>   year = col_double(),
#>   lifeExp = col_double(),
#>   pop = col_double(),
#>   gdpPercap = col_double()
#> )

2.2.1 Overriding the defaults

One of the main aspects of working with readr is the ability to override the default column specifications. One way to start the process of overriding the defaults is to copy the code output generated by spec().
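
As a sketch of that workflow, the specification printed by spec(df) above can be pasted into a script, edited, and passed back through col_types. The name gapminder_spec and the choice to change year to an integer are my own, for illustration only.

```r
library(readr)

path <- readr_example("mini-gapminder-americas.csv")

# Specification as printed by spec(), with year edited from col_double()
# to col_integer()
gapminder_spec <- cols(
  country = col_character(),
  year = col_integer(),
  lifeExp = col_double(),
  pop = col_double(),
  gdpPercap = col_double()
)

df <- read_csv(path, col_types = gapminder_spec)
```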

Override the default column types with the col_types argument in one of three ways:

  1. With a compact string representation where each character represents one column: "dcnf"
  2. With a named list using col_* functions
  3. With a named list using character abbreviations of types

Columns not specified will be parsed automatically. You can skip columns to not import them with col_skip(). Alternatively, you can only read in the columns that you specify with cols_only() instead of a named list. You can also set a default type for columns through the .default argument in the named list.
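
A minimal sketch of these three options, using made-up inline data (the column names id, name, and score are hypothetical, and I() for literal data assumes readr 2.0 or later):

```r
library(readr)

# Hypothetical inline data; I() tells readr the string is data, not a path
csv <- I("id,name,score\n1,a,2.5\n2,b,3.5")

# Skip one column; the unspecified columns are still guessed
skipped <- read_csv(csv, col_types = cols(name = col_skip()))

# Read only the listed column and drop everything else
only <- read_csv(csv, col_types = cols_only(id = col_integer()))

# Default every column to character, then override a single column
defaulted <- read_csv(csv, col_types = cols(
  .default = col_character(),
  id = col_integer()
))
```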

2.2.2 Chickens example

Here is a simple example of reading in one column as a factor instead of a character and another as an integer instead of a double. The second and third methods use spec_csv() to show only the column types.

# 1. Compact method
read_csv(readr_example("chickens.csv"), col_types = "cfic")
#> # A tibble: 5 × 4
#>   chicken                 sex     eggs_laid motto                               
#>   <chr>                   <fct>       <int> <chr>                               
#> 1 Foghorn Leghorn         rooster         0 That's a joke, ah say, that's a jok…
#> 2 Chicken Little          hen             3 The sky is falling!                 
#> 3 Ginger                  hen            12 Listen. We'll either die free chick…
#> 4 Camilla the Chicken     hen             7 Bawk, buck, ba-gawk.                
#> 5 Ernie The Giant Chicken rooster         0 Put Captain Solo in the cargo hold.

# 2. col_* functions
spec_csv(readr_example("chickens.csv"), col_types = list(
  sex = col_factor(c("hen", "rooster")),
  eggs_laid = col_integer()
))
#> cols(
#>   chicken = col_character(),
#>   sex = col_factor(levels = c("hen", "rooster"), ordered = FALSE, include_na = FALSE),
#>   eggs_laid = col_integer(),
#>   motto = col_character()
#> )

# 3. list with abbreviations
spec_csv(readr_example("chickens.csv"), col_types = list(
  sex = "f",
  eggs_laid = "i"
))
#> cols(
#>   chicken = col_character(),
#>   sex = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
#>   eggs_laid = col_integer(),
#>   motto = col_character()
#> )

Notice that the col_* functions method is the most explicit and provides the most flexibility to supply additional parameters, such as the levels of a factor or a date format.

2.2.3 Available column specifications

Function                       Abbreviated string
col_logical()                  "l"
col_integer()                  "i"
col_double()                   "d"
col_character()                "c"
col_factor(levels, ordered)    "f"
col_date(format = "")          "D"
col_time(format = "")          "t"
col_datetime(format = "")      "T"
col_number()                   "n"
col_skip()                     "_" or "-"
col_guess()                    "?"
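
To see how the abbreviations combine, here is a small sketch with made-up inline data (columns a, b, and c are hypothetical). The compact string "i_D" reads the first column as an integer, skips the second, and reads the third as a date.

```r
library(readr)

# Made-up data; I() marks the string as literal data (readr 2.0 or later)
csv <- I("a,b,c\n1,x,2020-01-01\n2,y,2020-01-02")

# One character per column: integer, skip, date
compact <- read_csv(csv, col_types = "i_D")

# The same specification written out with col_* functions
explicit <- read_csv(csv, col_types = cols(
  a = col_integer(),
  b = col_skip(),
  c = col_date()
))
```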

2.3 Example data

I have created some example data in Appendix A that provides a good overview of the different data formats that might be encountered and how to deal with them. Consult the data key to see how columns should be parsed.

Start by looking at the data as read in with the default guesses, and then use spec_csv() to see the column types that guess_parser() chooses:

# Data
read_csv("data/readr-example.csv", show_col_types = FALSE)
#> # A tibble: 50 × 13
#>    a     b         c d     e           f g              h i          j     k    
#>    <chr> <chr> <dbl> <lgl> <chr>   <dbl> <chr>      <dbl> <date>     <chr> <chr>
#>  1 p     b      95.2 TRUE  gnarly     48 $42,498.… 1.47e7 1585-05-06 20 J… Feb …
#>  2 x     b      71.1 FALSE none       36 $141,971… 1.58e7 1366-12-10 23 A… Jun …
#>  3 e     d      15.6 FALSE kook       35 $15,852.… 1.55e7 1606-06-05 4 Ju… Sep …
#>  4 o     b      57.6 TRUE  kook       46 $66,555.… 1.46e7 1675-01-14 20 O… Dec …
#>  5 v     d      91.1 TRUE  hello      23 $143,747… 1.35e7 1400-05-12 27 J… Oct …
#>  6 r     d      40.0 TRUE  gnarly     36 $46,013.… 1.34e7 1578-09-14 26 J… Nov …
#>  7 h     <NA>   70.8 TRUE  gnarly     43 $126,686… 1.32e7 1492-08-28 5 No… Aug …
#>  8 p     e      82.0 TRUE  goodbye    43 $85,459.… 1.37e7 1214-10-11 17 F… Feb …
#>  9 u     <NA>   96.9 FALSE hello      37 $143,002… 1.56e7 1310-10-12 27 O… Nov …
#> 10 i     d      12.1 FALSE hello      48 $162,345… 1.61e7 1299-01-20 25 S… Aug …
#> # ℹ 40 more rows
#> # ℹ 2 more variables: l <dbl>, m <dttm>

# Column specifications
spec_csv("data/readr-example.csv")
#> cols(
#>   a = col_character(),
#>   b = col_character(),
#>   c = col_double(),
#>   d = col_logical(),
#>   e = col_character(),
#>   f = col_double(),
#>   g = col_character(),
#>   h = col_double(),
#>   i = col_date(format = ""),
#>   j = col_character(),
#>   k = col_character(),
#>   l = col_double(),
#>   m = col_datetime(format = "")
#> )

The parser generally does well. Columns a, b, c, and d are all completely correct, including identifying NA values in column b, but there are some issues to deal with in columns e, f, and g.

2.3.1 Overriding guesses

To override the guesses from guess_parser() we can use the col_* functions to be more specific, turning character into factor (Section 2.4.4) and double into integer (Section 2.4.1). Use col_number() to properly read in values such as $42,498.74 as numeric (Section 2.4.2). We can also add “none” to the na argument to turn “none” into NA in column e. Using cols_only() reads in only the specified columns.

read_csv("data/readr-example.csv",
  na = c("", "NA", "none"), # Treat "none" as NA (affects column e)
  col_types = cols_only(
    e = col_factor(c("hello", "goodbye", "kook", "gnarly")),
    f = col_integer(),
    g = col_number()
  )
)
#> # A tibble: 50 × 3
#>    e           f       g
#>    <fct>   <int>   <dbl>
#>  1 gnarly     48  42499.
#>  2 <NA>       36 141972.
#>  3 kook       35  15853.
#>  4 kook       46  66555.
#>  5 hello      23 143748.
#>  6 gnarly     36  46013.
#>  7 gnarly     43 126687.
#>  8 goodbye    43  85460.
#>  9 hello      37 143003.
#> 10 hello      48 162346.
#> # ℹ 40 more rows

2.3.2 Dealing with dates

Dealing with dates often requires supplying a format to the col_date(), col_time(), and col_datetime() functions. Columns i and m are correctly guessed as date and datetime because they follow the default formats in locale(). See Section 2.4.3 for details.

read_csv("data/readr-example.csv", col_types = cols_only(
  h = col_date("%Y%m%d"), # date as 8 digit number (20230316)
  i = col_date(), # date in locale (2023-03-16)
  j = col_date("%d %B %Y"), # 16 March 2023
  k = col_date("%b %d %Y"), # Mar 16 2023
  l = col_time("%h"), # time as a duration in hours (may exceed one day)
  m = col_datetime() # datetime in locale (2023-03-16 11:49)
))
#> # A tibble: 50 × 6
#>    h          i          j          k          l         m                  
#>    <date>     <date>     <date>     <date>     <time>    <dttm>             
#>  1 1465-12-20 1585-05-06 1503-01-20 1313-02-04 150248:00 2023-05-16 02:57:00
#>  2 1582-09-13 1366-12-10 1390-04-23 1553-06-17 197379:00 2023-04-28 23:26:00
#>  3 1546-08-15 1606-06-05 1585-07-04 1299-09-18 101224:00 2023-09-22 13:17:00
#>  4 1463-11-27 1675-01-14 1237-10-20 1558-12-01 225648:00 2023-02-27 07:48:00
#>  5 1352-03-26 1400-05-12 1394-07-27 1297-10-21 300957:00 2023-02-11 17:22:00
#>  6 1335-05-25 1578-09-14 1593-07-26 1621-11-23  43411:00 2023-06-02 06:30:00
#>  7 1315-03-20 1492-08-28 1237-11-05 1681-08-09 317706:00 2023-07-26 12:09:00
#>  8 1370-01-09 1214-10-11 1650-02-17 1301-02-28 224552:00 2023-06-26 01:25:00
#>  9 1562-05-23 1310-10-12 1660-10-27 1524-11-08  48277:00 2023-12-25 14:20:00
#> 10 1608-06-21 1299-01-20 1263-09-25 1290-08-23 239288:00 2023-04-14 04:22:00
#> # ℹ 40 more rows

2.4 Vector parsers

Parse character vectors and return specified vector types with parse_* functions.

2.4.1 Atomic vectors

parse_logical(), parse_integer(), parse_double(), and parse_character() are straightforward parsers that produce the corresponding atomic vector.

parse_double(c("1.56", "2.34", "3.56"))
#> [1] 1.56 2.34 3.56

class(parse_integer(c("1", "2", "3")))
#> [1] "integer"

parse_logical(c("true", "false"))
#> [1]  TRUE FALSE
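
When a value cannot be parsed, these functions return NA for that element, issue a warning, and record the failure so that it can be retrieved with problems(). A small sketch with made-up input:

```r
library(readr)

# "2.5" and "abc" are not valid integers, so both become NA (with a warning)
x <- parse_integer(c("1", "2.5", "abc"))

# problems() returns a tibble with one row per parsing failure
failures <- problems(x)
```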

2.4.2 Flexible numeric parser

parse_number() is more flexible than parse_double(); it allows non-numeric prefixes and suffixes, and knows how to deal with grouping marks.

parse_number(c("0%", "10%", "150%"))
#> [1]   0  10 150

parse_number(c("$1,234.5", "$12.45"))
#> [1] 1234.50   12.45
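
Which characters count as the grouping and decimal marks comes from the locale. As a sketch, a value written in the continental European style can be parsed by passing a custom locale(); the input string is made up.

```r
library(readr)

# Treat "." as the grouping mark and "," as the decimal mark
euro <- parse_number(
  "1.234,56 EUR",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)
euro
#> [1] 1234.56
```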

2.4.3 Date-times

readr supports three types of date-times, which all take a format argument to define how the date-time is formatted:

  • dates: number of days since 1970-01-01
    • parse_date() uses the date_format specified by the locale(). The default value is %AD which uses an automatic date parser that recognizes dates of the format Y-m-d or Y/m/d.
  • times: number of seconds since midnight
    • parse_time() uses the time_format specified by locale(). The default value is %AT which uses an automatic time parser that recognizes times of the form H:M optionally followed by seconds and am/pm.
  • datetimes: number of seconds since midnight 1970-01-01
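
The underlying numbers can be checked by converting parsed values with as.numeric(). The inputs below are chosen so that each result is easy to verify by hand.

```r
library(readr)

# One day after the epoch: stored as 1 day
days <- as.numeric(parse_date("1970-01-02"))

# One minute after midnight: stored as 60 seconds
secs_since_midnight <- as.numeric(parse_time("00:01:00"))

# One second after the epoch (UTC): stored as 1 second
secs_since_epoch <- as.numeric(parse_datetime("1970-01-01 00:00:01"))
```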

Parsing dates with default formats:

parse_date("2010-10-01")
#> [1] "2010-10-01"
parse_time("1:00pm")
#> 13:00:00
parse_datetime("2010-10-01 21:45")
#> [1] "2010-10-01 21:45:00 UTC"

See the default formats with locale():

locale()
#> <locale>
#> Numbers:  123,456.78
#> Formats:  %AD / %AT
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
#>         (Thu), Friday (Fri), Saturday (Sat)
#> Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
#>         June (Jun), July (Jul), August (Aug), September (Sep), October
#>         (Oct), November (Nov), December (Dec)
#> AM/PM:  AM/PM

In most cases you will need to supply a format for date and datetime. See date format specifications.

  • Year: "%Y" (4 digits); "%y" (2 digits)
  • Month: "%m" (2 digits), "%b" (abbreviated name), "%B" (full name)
  • Day: "%d" (2 digits), "%e" (optional leading space), "%a" (abbreviated name)
  • Hour: "%H", or "%I" with AM/PM, or "%h" if times represent durations longer than one day.
  • Minutes: "%M"
  • Seconds: "%S" (integer seconds), "%OS" (partial seconds)
  • Time zone: "%Z" (as name, e.g. “America/Chicago”), "%z" (as offset from UTC, e.g. “+0800”)
  • AM/PM indicator: "%p"
  • Shortcuts:
    • "%D" = "%m/%d/%y"
    • "%F" = "%Y-%m-%d"
    • "%R" = "%H:%M"
    • "%T" = "%H:%M:%S"
    • "%x" = "%y/%m/%d"

parse_date("1 January, 2020", "%d %B, %Y")
#> [1] "2020-01-01"
parse_date("20230315", "%Y%m%d")
#> [1] "2023-03-15"
parse_datetime("02/02/23", "%m/%d/%y")
#> [1] "2023-02-02 UTC"
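
The month and day names matched by "%B", "%b", and "%a" come from the locale, so dates written in another language need a matching locale(). A short sketch with a French date:

```r
library(readr)

# "%B" matches full month names from the French locale
french_date <- parse_date("1 janvier 2020", "%d %B %Y", locale = locale("fr"))
french_date
#> [1] "2020-01-01"
```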

2.4.4 Factors

readr never guesses factors; character data stays character unless you ask otherwise. Use parse_factor() with the optional levels argument. If levels is NULL, they are discovered from the unique values in the supplied character vector.

parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))
#> [1] a b a
#> Levels: a b c
parse_factor(c("a", "b", "a"))
#> [1] a b a
#> Levels: a b
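
One detail worth noting when levels is NULL: parse_factor() keeps the levels in the order in which they first appear, whereas base R's factor() sorts them. A small comparison:

```r
library(readr)

x <- c("b", "a", "b")

# readr: levels in order of first appearance
readr_levels <- levels(parse_factor(x))

# base R: levels sorted
base_levels <- levels(factor(x))
```

Here readr_levels is c("b", "a") while base_levels is c("a", "b").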

2.5 Column specification

readr works by guessing which parser to use for each column. You can see which parser readr would choose for a character vector with guess_parser().

guess_parser(c("1", "2", "3"))
#> [1] "double"
guess_parser(c("1,000", "2,000", "3,000"))
#> [1] "number"
guess_parser(c("2001/10/10"))
#> [1] "date"
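
Two related helpers are worth a quick sketch. By default whole numbers are guessed as doubles, but guess_parser() has a guess_integer argument to change that, and parse_guess() applies the guessed parser in a single step.

```r
library(readr)

# Ask for integers to be guessed for whole numbers
int_guess <- guess_parser(c("1", "2", "3"), guess_integer = TRUE)
int_guess
#> [1] "integer"

# Guess the parser and parse in one step
flags <- parse_guess(c("TRUE", "FALSE"))
```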

Use spec_csv() and others to see the specification that readr would generate for columns in a file.

spec_csv(readr_example("challenge.csv"))
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )

Use spec() to see the specifications that readr used when reading in a tibble from a flat file.

df <- read_csv(readr_example("mtcars.csv"))
#> Rows: 32 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

spec(df)
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )

By default, readr guesses column types from the first 1000 rows, which speeds up the reading process. You can change the number of rows used with the guess_max argument. With challenge.csv, raising guess_max to 1001 changes the guess for y from logical to date, because the first 1000 values of y are all NA.

spec_csv(readr_example("challenge.csv"))
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )
spec_csv(readr_example("challenge.csv"), guess_max = 1001)
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
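
Instead of tuning guess_max, another option is to state the types outright so that nothing depends on which rows are inspected. A minimal sketch for challenge.csv, using the column types found above:

```r
library(readr)

# Declare both column types explicitly; no guessing takes place
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
```

With this specification y is read as a date throughout, with NA in the rows where no date is present.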