library(tidyverse)
# Read the data without further specifications
data <- read_csv("data-raw/readr-example.csv")Importing data
We have seen how you can access data from R like penguins and various packages make data available, but you will quickly want to work with data you have on your own computer. The tidyverse has a set of packages to import rectangular data:
- readr: Import plain text data files such as comma-separated values (csv).
- readxl: Import data from Excel. It is downloaded with the tidyverse, but you need to load it explicitly.
- googlesheets4: Import data from Google Sheets. This package needs to be installed separately with
install.packages("googlesheets4").
Importing does two very important things:
- It provides you with access to your data in your R session.
- It protects your data. Any changes you make to the data are on a copy of the data. You can only change the raw data that you read in by explicitly writing out data to the same file. To prevent this, we have separate
data-raw/anddata/folders. This means no more going to close a spreadsheet and being uncertain when the application asks if you want to save changes.
There are three main tasks to perform in importing data in R and possibly a fourth if reading from a spreadsheet.
- Locate the path to the data from the working directory.
- Declare how missing values are identified in the data if different from the default.
- Explicitly declare the column types if you want to override the default parsing.
- If reading in Excel or GoogleSheet data, you may want to identify the sheet and cell range of the data.
Resources
Below you will find a fully worked out example using readr. readxl and googlesheets4 work similarly with the added complexity of spreadsheets and data permissions with GoogleSheets. Here are some important resources if you have questions importing data.
Importing data with readr
- Start with the example below, but also with the Data import chapter of R for Data Science.
- readr documentation
- here package documentation
- Malcolm Barrett - Why should I use the here package when I’m already using projects?
Spreadsheets
- Start with the Spreadsheets chapter of R for Data Science.
- readxl documentation
- googlesheets4 documentation
readr example
Let’s start with an example of reading in data using read_csv() from readr. We will be using this example dataset. Right-click the link to download the file and place it in your data-raw/ folder in your r-notes project.
1. Locate the path to the data from the working directory
The first and main task is to locate where the data is. It is here that the benefits of RStudio Projects becomes clear. You want to use relative paths from your working directory to your data (such as data-raw/readr-example.csv) and not absolute paths (such as /Users/jrsadler/Documents/r-notes/data-raw/readr-example.csv). The former will work for anyone you give your project folder to. The latter only works for me and only with my current folder structure.
The first argument of read_csv() is the path to the data within quotations. Use tab complete to help you fill in the correct path. If you have downloaded the example dataset and placed it in your data-raw/ folder, you should be able to write read_csv("data-") and then hit tab to complete the path, selecting the correct file along the way. Let’s see what we get if we only do this step and assign the imported data to an object named data.
Rows: 20 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): a, b, e, g, j, k
dbl (3): c, f, h
lgl (1): d
dttm (1): l
date (1): i
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We get a tibble that is 20 rows and 12 columns and an overview of the column types that read_csv() guessed based on the data. Yay! We have successfully imported the data.
Here package
An alternative approach that is even more robust is to use the here package, which helps to create paths to your data regardless of the organization of your project. The here package is particularly useful if you are working with Quarto or R Markdown documents, which render from the location of the document and not necessarily the top level of your project.
The here package uses the here() function, which takes the components of your path from the root of the project to the file as characters separated by commas. In this case, it would be: here("data-raw", "readr-example.csv"):
library(here)
# Read the data with the here package
data <- read_csv(here("data-raw", "readr-example.csv"))Parsing
Let’s look at data to see how it did. You may want to go into the Files pane in RStudio and choose View File after clicking on readr-example.csv to see what the raw data looks like.
# Look at the data
data# A tibble: 20 × 12
a b c d e f g h i j k
<chr> <chr> <dbl> <lgl> <chr> <dbl> <chr> <dbl> <date> <chr> <chr>
1 p b 95.2 TRUE Asia 48 $42,498… 1.47e7 1585-05-06 20 J… Feb …
2 x b 71.1 FALSE none 36 $141,97… 1.58e7 1366-12-10 23 A… Jun …
3 e d 15.6 FALSE Africa 35 $15,852… 1.55e7 1606-06-05 4 Ju… Sep …
4 o b 57.6 TRUE Africa 46 $66,555… 1.46e7 1675-01-14 20 O… Dec …
5 v d 91.1 TRUE Americas 23 $143,74… 1.35e7 1400-05-12 27 J… Oct …
6 r d 40.0 TRUE Asia 36 $46,013… 1.34e7 1578-09-14 26 J… Nov …
7 h <NA> 70.8 TRUE Asia 43 $126,68… 1.32e7 1492-08-28 5 No… Aug …
8 p e 82.0 TRUE Europe 43 $85,459… 1.37e7 1214-10-11 17 F… Feb …
9 u <NA> 96.9 FALSE Americas 37 $143,00… 1.56e7 1310-10-12 27 O… Nov …
10 i d 12.1 FALSE Americas 48 $162,34… 1.61e7 1299-01-20 25 S… Aug …
11 f d 36.4 FALSE Americas 39 $15,815… 1.56e7 1251-03-17 7 Ju… Apr …
12 l <NA> 36.6 FALSE none 43 $168,76… 1.62e7 1341-03-12 25 J… Dec …
13 z <NA> 88.5 FALSE Asia 21 $8,913.… 1.21e7 1355-12-18 21 J… May …
14 b <NA> 23.7 TRUE none 49 $20,651… 1.33e7 1228-07-07 26 J… Aug …
15 e c 42.9 FALSE Americas 42 $82,921… 1.51e7 1423-09-03 19 S… Feb …
16 j d 43.1 TRUE Europe 20 $165,14… 1.35e7 1309-05-14 26 F… Oct …
17 u b 46.3 FALSE Asia 40 $14,301… 1.27e7 1422-10-19 19 A… Mar …
18 n b 28.1 TRUE Americas 38 $39,063… 1.36e7 1439-03-14 11 O… Jul …
19 z <NA> 91.4 FALSE Africa 32 $47,703… 1.43e7 1295-04-12 17 S… Oct …
20 t a 67.1 FALSE Asia 44 $75,697… 1.52e7 1524-10-09 15 F… Sep …
# ℹ 1 more variable: l <dttm>
Going column by column:
- a: Looks good.
- b: Looks good.
read_csv()correctly identified bothNAand the one blank cell as missing data. - c: Looks good.
- d: Looks good.
TandFwere correctly turned into a logical vector. - e: Parsed as a character vector, but the value of
noneis missing data. Additionally, we could turn this into a factor. - f: Parsed as a double-precision vector but is all whole numbers. We could denote this to an integer.
- g: Parsed as a character vector because of the presence of dollar signs and commas but is actually numeric data.
- h: Parsed as a numeric vector, but if you look closely, this is actually a date is the format of
yyyymmddwithout dashes. - i: Correctly parsed as a date vector.
- j: Parsed as character vector but contains date data.
- k: Parsed as character vector but contains date data.
- l: Correctly parsed as date-time vector.
We need to make some changes to our data. It would be possible to make these changes after reading in the data with our data wrangling tools, or we can do it while reading in the data.
This is also a good reminder that you need to check your data when you import it.
2. Missing values
We can deal with the extra missing value from the data with the na argument that has defaults for both blank cells and "NA". In this case, we need to add "none". But we also need to be careful. We do not want to use only "none"; we want to add it to the default.
read_csv(here("data-raw", "readr-example.csv"),
na = c("", "NA", "none")) # Add none as NA for column e# A tibble: 20 × 12
a b c d e f g h i j k
<chr> <chr> <dbl> <lgl> <chr> <dbl> <chr> <dbl> <date> <chr> <chr>
1 p b 95.2 TRUE Asia 48 $42,498… 1.47e7 1585-05-06 20 J… Feb …
2 x b 71.1 FALSE <NA> 36 $141,97… 1.58e7 1366-12-10 23 A… Jun …
3 e d 15.6 FALSE Africa 35 $15,852… 1.55e7 1606-06-05 4 Ju… Sep …
4 o b 57.6 TRUE Africa 46 $66,555… 1.46e7 1675-01-14 20 O… Dec …
5 v d 91.1 TRUE Americas 23 $143,74… 1.35e7 1400-05-12 27 J… Oct …
6 r d 40.0 TRUE Asia 36 $46,013… 1.34e7 1578-09-14 26 J… Nov …
7 h <NA> 70.8 TRUE Asia 43 $126,68… 1.32e7 1492-08-28 5 No… Aug …
8 p e 82.0 TRUE Europe 43 $85,459… 1.37e7 1214-10-11 17 F… Feb …
9 u <NA> 96.9 FALSE Americas 37 $143,00… 1.56e7 1310-10-12 27 O… Nov …
10 i d 12.1 FALSE Americas 48 $162,34… 1.61e7 1299-01-20 25 S… Aug …
11 f d 36.4 FALSE Americas 39 $15,815… 1.56e7 1251-03-17 7 Ju… Apr …
12 l <NA> 36.6 FALSE <NA> 43 $168,76… 1.62e7 1341-03-12 25 J… Dec …
13 z <NA> 88.5 FALSE Asia 21 $8,913.… 1.21e7 1355-12-18 21 J… May …
14 b <NA> 23.7 TRUE <NA> 49 $20,651… 1.33e7 1228-07-07 26 J… Aug …
15 e c 42.9 FALSE Americas 42 $82,921… 1.51e7 1423-09-03 19 S… Feb …
16 j d 43.1 TRUE Europe 20 $165,14… 1.35e7 1309-05-14 26 F… Oct …
17 u b 46.3 FALSE Asia 40 $14,301… 1.27e7 1422-10-19 19 A… Mar …
18 n b 28.1 TRUE Americas 38 $39,063… 1.36e7 1439-03-14 11 O… Jul …
19 z <NA> 91.4 FALSE Africa 32 $47,703… 1.43e7 1295-04-12 17 S… Oct …
20 t a 67.1 FALSE Asia 44 $75,697… 1.52e7 1524-10-09 15 F… Sep …
# ℹ 1 more variable: l <dttm>
3. Column types
read_csv() allows you to view exactly how it parsed the data and how you can change its guesses with spec().
spec(data)cols(
a = col_character(),
b = col_character(),
c = col_double(),
d = col_logical(),
e = col_character(),
f = col_double(),
g = col_character(),
h = col_double(),
i = col_date(format = ""),
j = col_character(),
k = col_character(),
l = col_datetime(format = "")
)
This shows the functions used to parse each column. We can change the column type by changing the col_*() function used. The structure is very similar to the output of spec(). You can list the columns you want to change through the col_types argument either within the cols() function to keep all columns or cols_only() to only include the columns you explicitly specify. Let’s start by making column e a factor, f an integer, and g numeric. We will discuss dates more below.
read_csv(here("data-raw", "readr-example.csv"),
na = c("", "NA", "none"), # Add none as NA for column e
col_types = cols(
e = col_factor(),
f = col_integer(),
g = col_number()
))# A tibble: 20 × 12
a b c d e f g h i j k
<chr> <chr> <dbl> <lgl> <fct> <int> <dbl> <dbl> <date> <chr> <chr>
1 p b 95.2 TRUE Asia 48 42499. 1.47e7 1585-05-06 20 J… Feb …
2 x b 71.1 FALSE <NA> 36 141972. 1.58e7 1366-12-10 23 A… Jun …
3 e d 15.6 FALSE Africa 35 15853. 1.55e7 1606-06-05 4 Ju… Sep …
4 o b 57.6 TRUE Africa 46 66555. 1.46e7 1675-01-14 20 O… Dec …
5 v d 91.1 TRUE Americas 23 143748. 1.35e7 1400-05-12 27 J… Oct …
6 r d 40.0 TRUE Asia 36 46013. 1.34e7 1578-09-14 26 J… Nov …
7 h <NA> 70.8 TRUE Asia 43 126687. 1.32e7 1492-08-28 5 No… Aug …
8 p e 82.0 TRUE Europe 43 85460. 1.37e7 1214-10-11 17 F… Feb …
9 u <NA> 96.9 FALSE Americas 37 143003. 1.56e7 1310-10-12 27 O… Nov …
10 i d 12.1 FALSE Americas 48 162346. 1.61e7 1299-01-20 25 S… Aug …
11 f d 36.4 FALSE Americas 39 15816. 1.56e7 1251-03-17 7 Ju… Apr …
12 l <NA> 36.6 FALSE <NA> 43 168766. 1.62e7 1341-03-12 25 J… Dec …
13 z <NA> 88.5 FALSE Asia 21 8913. 1.21e7 1355-12-18 21 J… May …
14 b <NA> 23.7 TRUE <NA> 49 20651. 1.33e7 1228-07-07 26 J… Aug …
15 e c 42.9 FALSE Americas 42 82922. 1.51e7 1423-09-03 19 S… Feb …
16 j d 43.1 TRUE Europe 20 165148. 1.35e7 1309-05-14 26 F… Oct …
17 u b 46.3 FALSE Asia 40 14301 1.27e7 1422-10-19 19 A… Mar …
18 n b 28.1 TRUE Americas 38 39064. 1.36e7 1439-03-14 11 O… Jul …
19 z <NA> 91.4 FALSE Africa 32 47704. 1.43e7 1295-04-12 17 S… Oct …
20 t a 67.1 FALSE Asia 44 75697. 1.52e7 1524-10-09 15 F… Sep …
# ℹ 1 more variable: l <dttm>
3. Column types: Parsing dates
Dates can be difficult to parse because there are so many different ways to write dates. Therefore, we may need to do a bit more work and tell read_csv() not only that a column is a date, but also what format it is in. You declare the format using abbreviations defined by the POSIX standard. These are:
- Year:
"%Y"(4 digits);"%y"(2 digits) - Month:
"%m"(2 digits),"%b"(abbreviated name),"%B"(full name) - Day:
"%d"(2 digits),"%e"(optional leading space),"%a"(abbreviated name) - Shortcuts:
"%D"="%m/%d/%y"or 02/23/26"%F"="%Y-%m-%d"or 2026-02-23"%x"="%y/%m/%d"or 26/02/23
See date format specifications in the readr documentation for more choices and details.
In the case of the example data we can use:
read_csv(here("data-raw", "readr-example.csv"),
na = c("", "NA", "none"), # Add none as NA for column e
col_types = cols(
e = col_factor(),
f = col_integer(),
g = col_number(),
h = col_date("%Y%m%d"), # date as 8 digit number (20260223)
# i is correctly parsed as 2026-02-23
j = col_date("%d %B %Y"), # 23 February 2026
k = col_date("%b %d %Y") # Feb 23 2026
# l is correctly parsed as datetime (2026-02-23 14:00)
))# A tibble: 20 × 12
a b c d e f g h i j
<chr> <chr> <dbl> <lgl> <fct> <int> <dbl> <date> <date> <date>
1 p b 95.2 TRUE Asia 48 4.25e4 1465-12-20 1585-05-06 1503-01-20
2 x b 71.1 FALSE <NA> 36 1.42e5 1582-09-13 1366-12-10 1390-04-23
3 e d 15.6 FALSE Africa 35 1.59e4 1546-08-15 1606-06-05 1585-07-04
4 o b 57.6 TRUE Africa 46 6.66e4 1463-11-27 1675-01-14 1237-10-20
5 v d 91.1 TRUE Americ… 23 1.44e5 1352-03-26 1400-05-12 1394-07-27
6 r d 40.0 TRUE Asia 36 4.60e4 1335-05-25 1578-09-14 1593-07-26
7 h <NA> 70.8 TRUE Asia 43 1.27e5 1315-03-20 1492-08-28 1237-11-05
8 p e 82.0 TRUE Europe 43 8.55e4 1370-01-09 1214-10-11 1650-02-17
9 u <NA> 96.9 FALSE Americ… 37 1.43e5 1562-05-23 1310-10-12 1660-10-27
10 i d 12.1 FALSE Americ… 48 1.62e5 1608-06-21 1299-01-20 1263-09-25
11 f d 36.4 FALSE Americ… 39 1.58e4 1559-05-08 1251-03-17 1235-06-07
12 l <NA> 36.6 FALSE <NA> 43 1.69e5 1620-05-18 1341-03-12 1455-07-25
13 z <NA> 88.5 FALSE Asia 21 8.91e3 1214-05-25 1355-12-18 1242-06-21
14 b <NA> 23.7 TRUE <NA> 49 2.07e4 1333-08-10 1228-07-07 1606-01-26
15 e c 42.9 FALSE Americ… 42 8.29e4 1506-05-03 1423-09-03 1411-09-19
16 j d 43.1 TRUE Europe 20 1.65e5 1350-09-21 1309-05-14 1480-02-26
17 u b 46.3 FALSE Asia 40 1.43e4 1268-03-06 1422-10-19 1512-08-19
18 n b 28.1 TRUE Americ… 38 3.91e4 1356-02-05 1439-03-14 1378-10-11
19 z <NA> 91.4 FALSE Africa 32 4.77e4 1426-05-17 1295-04-12 1252-09-17
20 t a 67.1 FALSE Asia 44 7.57e4 1521-10-04 1524-10-09 1412-02-15
# ℹ 2 more variables: k <date>, l <dttm>
Hopefully this example shows the flexibility of read_csv(), the importance of checking your data, and the complexity of dates. It may not look as nice, but I prefer the format of column h, making a date as an 8 digit number in the form of a four-digit year, a two-digit month, and a two-digit day. This way spreadsheets will not interpret your data as dates.