2 readr
2.1 readr resources
The primary task of readr is to parse a flat, plain-text file into a data frame in which each column is cast into the correct type. Parsing takes place in three basic stages:
- The flat file is parsed into a rectangular matrix of strings.
- The type of each column is determined.
- Each column of strings is parsed into a vector of a more specific type.
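The three stages can be walked through by hand. The following is a minimal sketch, not how readr is implemented internally: it assumes readr 2.0 or later (for `I()` literal data), and the column names `name`, `count`, and `joined` are made up for illustration.

```r
library(readr)

# Literal CSV text stands in for a flat file (hypothetical data).
csv_text <- I("name,count,joined\nada,3,2020-01-15\nbob,7,2021-06-30\n")

# Stage 1: read every column as character to get a rectangle of strings.
strings <- read_csv(csv_text, col_types = cols(.default = col_character()))

# Stage 2: guess a parser for each column of strings.
guesses <- vapply(strings, guess_parser, character(1))
guesses

# Stage 3: parse each column of strings into a more specific type.
typed <- strings
typed$count <- parse_double(strings$count)
typed$joined <- parse_date(strings$joined)
typed
```

In normal use a single call to read_csv() performs all three stages; splitting them out is only useful for seeing where a column goes wrong.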
2.2 Overview
The read_* functions work by first calling the respective spec_* function, as described in Section 10.4, which uses guess_parser() on each column. They then cast the character vectors to the guessed types using the parse_* functions described in Section 2.4. Use spec() after a CSV has been read in to see all of the column types, or spec_csv() to see the column specification before reading in the data. This is particularly useful with very wide data.
# Read in data
df <- read_csv(readr_example("mini-gapminder-americas.csv"))
#> Rows: 6 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): country
#> dbl (4): year, lifeExp, pop, gdpPercap
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df
#> # A tibble: 6 × 5
#> country year lifeExp pop gdpPercap
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Argentina 1952 62.5 17876956 5911.
#> 2 Bolivia 1952 40.4 2883315 2677.
#> 3 Brazil 1952 50.9 56602560 2109.
#> 4 Canada 1952 68.8 14785584 11367.
#> 5 Chile 1952 54.7 6377619 3940.
#> 6 Colombia 1952 50.6 12350771 2144.
# See full column specification
spec(df)
#> cols(
#> country = col_character(),
#> year = col_double(),
#> lifeExp = col_double(),
#> pop = col_double(),
#> gdpPercap = col_double()
#> )
2.2.1 Overriding the defaults
One of the main aspects of working with readr is the ability to override the default column specifications. One way to start is to copy the code printed by spec() and edit it.
Override the default column types with the col_types argument in one of three ways:
- With a compact string representation where each character represents one column: "dcnf"
- With a named list using col_* functions
- With a named list using character abbreviations of types
Columns not specified will be parsed automatically. You can skip columns so that they are not imported with col_skip(). Alternatively, you can read in only the columns that you specify by using cols_only() instead of a named list. You can also set a default type for the remaining columns through the .default argument of cols().
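As a rough sketch of the skipping and default options, the example below uses made-up literal data (column names `id`, `group`, `score`, and `notes` are hypothetical) and assumes readr 2.0 or later for `I()`.

```r
library(readr)

# Hypothetical literal data standing in for a file.
csv_text <- I("id,group,score,notes\n1,a,2.5,first\n2,b,3.5,second\n")

# cols(): listed columns get the given type, .default covers the rest,
# and col_skip() drops a column entirely.
with_default <- read_csv(csv_text, col_types = cols(
  .default = col_character(),
  score = col_double(),
  notes = col_skip()
))

# cols_only(): only the listed columns are read in.
only_two <- read_csv(csv_text, col_types = cols_only(
  id = col_integer(),
  group = col_factor()
))

names(with_default)
names(only_two)
```

cols_only() tends to be the shorter choice for wide files where only a few columns matter, while .default is convenient when most columns share one type.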
2.2.2 Chickens example
Here is a simple example of reading in one column as a factor instead of a character and another as an integer instead of a double. The second and third methods use spec_csv() to show only the column types.
# 1. Compact method
read_csv(readr_example("chickens.csv"), col_types = "cfic")
#> # A tibble: 5 × 4
#> chicken sex eggs_laid motto
#> <chr> <fct> <int> <chr>
#> 1 Foghorn Leghorn rooster 0 That's a joke, ah say, that's a jok…
#> 2 Chicken Little hen 3 The sky is falling!
#> 3 Ginger hen 12 Listen. We'll either die free chick…
#> 4 Camilla the Chicken hen 7 Bawk, buck, ba-gawk.
#> 5 Ernie The Giant Chicken rooster 0 Put Captain Solo in the cargo hold.
# 2. col_* functions
spec_csv(readr_example("chickens.csv"), col_types = list(
sex = col_factor(c("hen", "rooster")),
eggs_laid = col_integer()
))
#> cols(
#> chicken = col_character(),
#> sex = col_factor(levels = c("hen", "rooster"), ordered = FALSE, include_na = FALSE),
#> eggs_laid = col_integer(),
#> motto = col_character()
#> )
# 3. list with abbreviations
spec_csv(readr_example("chickens.csv"), col_types = list(
sex = "f",
eggs_laid = "i"
))
#> cols(
#> chicken = col_character(),
#> sex = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
#> eggs_laid = col_integer(),
#> motto = col_character()
#> )
Notice that the col_* functions method is the most explicit and the most flexible: it accepts additional parameters, such as the levels of a factor or a date format.
2.2.3 Available column specifications
Function | Abbreviated string |
---|---|
col_logical() | "l" |
col_integer() | "i" |
col_double() | "d" |
col_character() | "c" |
col_factor(levels, ordered) | "f" |
col_date(format = "") | "D" |
col_time(format = "") | "t" |
col_datetime(format = "") | "T" |
col_number() | "n" |
col_skip() | "_" or "-" |
col_guess() | "?" |
2.3 Example data
I have created some example data in Appendix A that provides a good overview of the different data formats that might be encountered and how to deal with them. Consult the data key to see how columns should be parsed.
Start by seeing how readr does by default, using spec_csv() to show what guess_parser() picks for each column, and what the data looks like:
# Data
read_csv("data/readr-example.csv", show_col_types = FALSE)
#> # A tibble: 50 × 13
#> a b c d e f g h i j k
#> <chr> <chr> <dbl> <lgl> <chr> <dbl> <chr> <dbl> <date> <chr> <chr>
#> 1 p b 95.2 TRUE gnarly 48 $42,498.… 1.47e7 1585-05-06 20 J… Feb …
#> 2 x b 71.1 FALSE none 36 $141,971… 1.58e7 1366-12-10 23 A… Jun …
#> 3 e d 15.6 FALSE kook 35 $15,852.… 1.55e7 1606-06-05 4 Ju… Sep …
#> 4 o b 57.6 TRUE kook 46 $66,555.… 1.46e7 1675-01-14 20 O… Dec …
#> 5 v d 91.1 TRUE hello 23 $143,747… 1.35e7 1400-05-12 27 J… Oct …
#> 6 r d 40.0 TRUE gnarly 36 $46,013.… 1.34e7 1578-09-14 26 J… Nov …
#> 7 h <NA> 70.8 TRUE gnarly 43 $126,686… 1.32e7 1492-08-28 5 No… Aug …
#> 8 p e 82.0 TRUE goodbye 43 $85,459.… 1.37e7 1214-10-11 17 F… Feb …
#> 9 u <NA> 96.9 FALSE hello 37 $143,002… 1.56e7 1310-10-12 27 O… Nov …
#> 10 i d 12.1 FALSE hello 48 $162,345… 1.61e7 1299-01-20 25 S… Aug …
#> # ℹ 40 more rows
#> # ℹ 2 more variables: l <dbl>, m <dttm>
# Column specifications
spec_csv("data/readr-example.csv")
#> cols(
#> a = col_character(),
#> b = col_character(),
#> c = col_double(),
#> d = col_logical(),
#> e = col_character(),
#> f = col_double(),
#> g = col_character(),
#> h = col_double(),
#> i = col_date(format = ""),
#> j = col_character(),
#> k = col_character(),
#> l = col_double(),
#> m = col_datetime(format = "")
#> )
The parser generally does well. Columns a, b, c, and d are all completely correct, including identifying NA values in column b, but there are some issues to deal with in columns e, f, and g.
2.3.1 Overriding guesses
To override the guesses from guess_parser(), we can use col_* functions to be more specific, turning character() into factor() (Section 2.4.4) and double() into integer() (Section 2.4.1). Use col_number() to properly read in values such as $42,498.74 as numeric (Section 2.4.2). You can also add "none" to the na argument to turn "none" into NA in column e. Using cols_only() reads in only the specified columns.
read_csv("data/readr-example.csv",
na = c("", "NA", "none"), # Treat "none" as NA (affects column e)
col_types = cols_only(
e = col_factor(c("hello", "goodbye", "kook", "gnarly")),
f = col_integer(),
g = col_number()
))
#> # A tibble: 50 × 3
#> e f g
#> <fct> <int> <dbl>
#> 1 gnarly 48 42499.
#> 2 <NA> 36 141972.
#> 3 kook 35 15853.
#> 4 kook 46 66555.
#> 5 hello 23 143748.
#> 6 gnarly 36 46013.
#> 7 gnarly 43 126687.
#> 8 goodbye 43 85460.
#> 9 hello 37 143003.
#> 10 hello 48 162346.
#> # ℹ 40 more rows
2.3.2 Dealing with dates
Dealing with dates often requires supplying a date format to the col_date(), col_time(), and col_datetime() functions. Columns i and m are correctly guessed as date and datetime because they follow the default formats in locale(). See Section 2.4.3 for details.
read_csv("data/readr-example.csv", col_types = cols_only(
h = col_date("%Y%m%d"), # date as 8 digit number (20230316)
i = col_date(), # date in locale (2023-03-16)
j = col_date("%d %B %Y"), # 16 March 2023
k = col_date("%b %d %Y"), # Mar 16 2023
l = col_time("%h"), # time
m = col_datetime() # datetime in locale (2023-03-16 11:49)
))
#> # A tibble: 50 × 6
#> h i j k l m
#> <date> <date> <date> <date> <time> <dttm>
#> 1 1465-12-20 1585-05-06 1503-01-20 1313-02-04 150248:00 2023-05-16 02:57:00
#> 2 1582-09-13 1366-12-10 1390-04-23 1553-06-17 197379:00 2023-04-28 23:26:00
#> 3 1546-08-15 1606-06-05 1585-07-04 1299-09-18 101224:00 2023-09-22 13:17:00
#> 4 1463-11-27 1675-01-14 1237-10-20 1558-12-01 225648:00 2023-02-27 07:48:00
#> 5 1352-03-26 1400-05-12 1394-07-27 1297-10-21 300957:00 2023-02-11 17:22:00
#> 6 1335-05-25 1578-09-14 1593-07-26 1621-11-23 43411:00 2023-06-02 06:30:00
#> 7 1315-03-20 1492-08-28 1237-11-05 1681-08-09 317706:00 2023-07-26 12:09:00
#> 8 1370-01-09 1214-10-11 1650-02-17 1301-02-28 224552:00 2023-06-26 01:25:00
#> 9 1562-05-23 1310-10-12 1660-10-27 1524-11-08 48277:00 2023-12-25 14:20:00
#> 10 1608-06-21 1299-01-20 1263-09-25 1290-08-23 239288:00 2023-04-14 04:22:00
#> # ℹ 40 more rows
2.4 Vector parsers
Parse character vectors and return specified vector types with parse_* functions.
2.4.1 Atomic vectors
parse_logical(), parse_integer(), parse_double(), and parse_character() are straightforward parsers that produce the corresponding atomic vector.
parse_double(c("1.56", "2.34", "3.56"))
#> [1] 1.56 2.34 3.56
class(parse_integer(c("1", "2", "3")))
#> [1] "integer"
parse_logical(c("true", "false"))
#> [1] TRUE FALSE
2.4.2 Flexible numeric parser
parse_number() is more flexible than parse_double(); it allows non-numeric prefixes and suffixes, and knows how to deal with grouping marks.
parse_number(c("0%", "10%", "150%"))
#> [1] 0 10 150
parse_number(c("$1,234.5", "$12.45"))
#> [1] 1234.50 12.45
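Which characters count as the grouping and decimal marks depends on the locale. A short sketch, assuming the input follows a convention other than the default one, is to pass a custom locale() to parse_number():

```r
library(readr)

# Period as grouping mark and comma as decimal mark.
european <- locale(decimal_mark = ",", grouping_mark = ".")
eu_value <- parse_number("1.234,56", locale = european)

# Apostrophe as grouping mark.
swiss_value <- parse_number("1'234.56", locale = locale(grouping_mark = "'"))

eu_value
swiss_value
```

Both calls should give the same number, 1234.56, because only the marks differ between the two inputs.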
2.4.3 Date-times
readr supports three types of date-times, each of which takes a format argument that defines how the input is formatted:
- dates: number of days since 1970-01-01
  - parse_date() uses the date_format specified by the locale(). The default value is %AD, which uses an automatic date parser that recognizes dates of the format Y-m-d or Y/m/d.
- times: number of seconds since midnight
  - parse_time() uses the time_format specified by the locale(). The default value is %AT, which uses an automatic time parser that recognizes times of the form H:M, optionally followed by seconds and am/pm.
- datetimes: number of seconds since midnight 1970-01-01
  - parse_datetime() recognizes ISO8601 datetimes.
Parsing dates with default formats:
parse_date("2010-10-01")
#> [1] "2010-10-01"
parse_time("1:00pm")
#> 13:00:00
parse_datetime("2010-10-01 21:45")
#> [1] "2010-10-01 21:45:00 UTC"
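The underlying numeric representation described above can be checked by converting the parsed values with as.numeric(). This is only an illustrative sketch using values chosen to sit close to the origin:

```r
library(readr)

# Dates: days since 1970-01-01.
days <- as.numeric(parse_date("1970-01-10"))

# Times: seconds since midnight.
secs_of_day <- as.numeric(parse_time("00:01:30"))

# Datetimes: seconds since midnight 1970-01-01 (UTC by default).
secs_since_epoch <- as.numeric(parse_datetime("1970-01-01 00:01:30"))

c(days = days, secs_of_day = secs_of_day, secs_since_epoch = secs_since_epoch)
```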
See the default formats with locale():
locale()
#> <locale>
#> Numbers: 123,456.78
#> Formats: %AD / %AT
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
#> (Thu), Friday (Fri), Saturday (Sat)
#> Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
#> June (Jun), July (Jul), August (Aug), September (Sep), October
#> (Oct), November (Nov), December (Dec)
#> AM/PM: AM/PM
In most cases you will need to supply a format for date and datetime. See date format specifications.
- Year: "%Y" (4 digits); "%y" (2 digits)
- Month: "%m" (2 digits), "%b" (abbreviated name), "%B" (full name)
- Day: "%d" (2 digits), "%e" (optional leading space), "%a" (abbreviated name)
- Hour: "%H", or "%I" with AM/PM, or "%h" if times represent durations longer than one day
- Minutes: "%M"
- Seconds: "%S" (integer seconds), "%OS" (partial seconds)
- Time zone: "%Z" (as name, e.g. "America/Chicago"), "%z" (as offset from UTC, e.g. "+0800")
- AM/PM indicator: "%p"
- Shortcuts:
  - "%D" = "%m/%d/%y"
  - "%F" = "%Y-%m-%d"
  - "%R" = "%H:%M"
  - "%T" = "%H:%M:%S"
  - "%x" = "%y/%m/%d"
parse_date("1 January, 2020", "%d %B, %Y")
#> [1] "2020-01-01"
parse_date("20230315", "%Y%m%d")
#> [1] "2023-03-15"
parse_datetime("02/02/23", "%m/%d/%y")
#> [1] "2023-02-02 UTC"
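Month and day names in a format are matched against the names in the locale, so non-English dates need a matching locale(). The sketch below combines that with the 12-hour "%I" and "%p" specifications; the input strings are made up for illustration.

```r
library(readr)

# French month name: "%B" is matched against the names in locale("fr").
french_date <- parse_date("1 janvier 2020", "%d %B %Y", locale = locale("fr"))

# 12-hour clock: "%I" needs "%p" to distinguish morning from afternoon.
afternoon <- parse_datetime("2023-03-16 2:30 PM", "%Y-%m-%d %I:%M %p")

french_date
afternoon
```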
2.4.4 Factors
readr does not parse characters as factors by default. Use parse_factor() with the optional levels argument. If levels is NULL, they are discovered from the unique values in the supplied character vector.
parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))
#> [1] a b a
#> Levels: a b c
parse_factor(c("a", "b", "a"))
#> [1] a b a
#> Levels: a b
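When levels are supplied, a value outside them is not silently accepted. The following sketch, with made-up values, shows that such a value becomes NA and is recorded as a parsing problem that problems() can retrieve:

```r
library(readr)

# "d" is not one of the declared levels, so it triggers a parsing warning
# (suppressed here) and is turned into NA.
x <- suppressWarnings(parse_factor(c("a", "d", "b"), levels = c("a", "b")))
x

# problems() lists the failures recorded while parsing.
probs <- problems(x)
probs
```

Declaring the levels up front is therefore a cheap way to catch typos or unexpected categories at import time.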
2.5 Column specification
readr works by guessing which parser to use for each column. You can see what it would guess for a given character vector using guess_parser().
guess_parser(c("1", "2", "3"))
#> [1] "double"
guess_parser(c("1,000", "2,000", "3,000"))
#> [1] "number"
guess_parser(c("2001/10/10"))
#> [1] "date"
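A few more inputs give a feel for the guesses. This sketch also uses the guess_integer argument of guess_parser(), since whole numbers are otherwise guessed as double:

```r
library(readr)

guesses <- c(
  logical   = guess_parser(c("TRUE", "FALSE")),
  datetime  = guess_parser("2010-10-10 01:02:03"),
  character = guess_parser(c("a", "b")),
  # Integers are only guessed when explicitly requested.
  integer   = guess_parser(c("1", "2", "3"), guess_integer = TRUE)
)
guesses
```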
Use spec_csv() and others to see the specification that readr would generate for columns in a file.
spec_csv(readr_example("challenge.csv"))
#> cols(
#> x = col_double(),
#> y = col_logical()
#> )
Use spec() to see the specifications that readr used when reading in a tibble from a flat file.
df <- read_csv(readr_example("mtcars.csv"))
#> Rows: 32 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(df)
#> cols(
#> mpg = col_double(),
#> cyl = col_double(),
#> disp = col_double(),
#> hp = col_double(),
#> drat = col_double(),
#> wt = col_double(),
#> qsec = col_double(),
#> vs = col_double(),
#> am = col_double(),
#> gear = col_double(),
#> carb = col_double()
#> )
readr uses the first 1000 rows to guess the column types, which speeds up the reading process. You can change the number of rows used with guess_max. Note the difference with challenge.csv, where y goes from logical to date once guess_max is raised to 1001.
spec_csv(readr_example("challenge.csv"))
#> cols(
#> x = col_double(),
#> y = col_logical()
#> )
spec_csv(readr_example("challenge.csv"), guess_max = 1001)
#> cols(
#> x = col_double(),
#> y = col_date(format = "")
#> )
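Raising guess_max is one remedy; another is to bypass guessing for the affected column by declaring its type. The sketch below uses a tiny made-up dataset (not challenge.csv) in which y is empty at first, and assumes readr 2.0 or later for `I()`.

```r
library(readr)

# Hypothetical data: y is missing in the early rows and holds a date later.
csv_text <- I("x,y\n1,\n2,\n3,2020-01-01\n")

# Declaring the type means the result does not depend on which rows are
# inspected during guessing.
df <- read_csv(csv_text, col_types = cols(
  x = col_double(),
  y = col_date()
))
df
```

An explicit specification also documents the expected shape of the file, so a change in the data surfaces as a parsing problem rather than a silently different column type.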