Importing data

We have seen how you can access data from R like penguins and various packages make data available, but you will quickly want to work with data you have on your own computer. The tidyverse has a set of packages to import rectangular data:

Importing does two very important things:

  1. It provides you with access to your data in your R session.
  2. It protects your data. Any changes you make to the data are on a copy of the data. You can only change the raw data that you read in by explicitly writing out data to the same file. To prevent this, we have separate data-raw/ and data/ folders. This means no more going to close a spreadsheet and being uncertain when the application asks if you want to save changes.

There are three main tasks to perform in importing data in R and possibly a fourth if reading from a spreadsheet.

  1. Locate the path to the data from the working directory.
  2. Declare how missing values are identified in the data if different from the default.
  3. Explicitly declare the column types if you want to override the default parsing.
  4. If reading in Excel or GoogleSheet data, you may want to identify the sheet and cell range of the data.

Resources

Below you will find a fully worked out example using readr. readxl and googlesheets4 work similarly with the added complexity of spreadsheets and data permissions with GoogleSheets. Here are some important resources if you have questions importing data.

Importing data with readr

Spreadsheets

readr example

Let’s start with an example of reading in data using read_csv() from readr. We will be using this example dataset. Right-click the link to download the file and place it in your data-raw/ folder in your r-notes project.

1. Locate the path to the data from the working directory

The first and main task is to locate where the data is. It is here that the benefits of RStudio Projects becomes clear. You want to use relative paths from your working directory to your data (such as data-raw/readr-example.csv) and not absolute paths (such as /Users/jrsadler/Documents/r-notes/data-raw/readr-example.csv). The former will work for anyone you give your project folder to. The latter only works for me and only with my current folder structure.

The first argument of read_csv() is the path to the data within quotations. Use tab complete to help you fill in the correct path. If you have downloaded the example dataset and placed it in your data-raw/ folder, you should be able to write read_csv("data-") and then hit tab to complete the path, selecting the correct file along the way. Let’s see what we get if we only do this step and assign the imported data to an object named data.

library(tidyverse)
# Read the data without further specifications
data <- read_csv("data-raw/readr-example.csv")
Rows: 20 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): a, b, e, g, j, k
dbl  (3): c, f, h
lgl  (1): d
dttm (1): l
date (1): i

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We get a tibble that is 20 rows and 12 columns and an overview of the column types that read_csv() guessed based on the data. Yay! We have successfully imported the data.

Here package

An alternative approach that is even more robust is to use the here package, which helps to create paths to your data regardless of the organization of your project. The here package is particularly useful if you are working with Quarto or R Markdown documents, which render from the location of the document and not necessarily the top level of your project.

The here package uses the here() function, which takes the components of your path from the root of the project to the file as characters separated by commas. In this case, it would be: here("data-raw", "readr-example.csv"):

library(here)
# Read the data with the here package
data <- read_csv(here("data-raw", "readr-example.csv"))

Parsing

Let’s look at data to see how it did. You may want to go into the Files pane in RStudio and choose View File after clicking on readr-example.csv to see what the raw data looks like.

# Look at the data
data
# A tibble: 20 × 12
   a     b         c d     e            f g             h i          j     k    
   <chr> <chr> <dbl> <lgl> <chr>    <dbl> <chr>     <dbl> <date>     <chr> <chr>
 1 p     b      95.2 TRUE  Asia        48 $42,498… 1.47e7 1585-05-06 20 J… Feb …
 2 x     b      71.1 FALSE none        36 $141,97… 1.58e7 1366-12-10 23 A… Jun …
 3 e     d      15.6 FALSE Africa      35 $15,852… 1.55e7 1606-06-05 4 Ju… Sep …
 4 o     b      57.6 TRUE  Africa      46 $66,555… 1.46e7 1675-01-14 20 O… Dec …
 5 v     d      91.1 TRUE  Americas    23 $143,74… 1.35e7 1400-05-12 27 J… Oct …
 6 r     d      40.0 TRUE  Asia        36 $46,013… 1.34e7 1578-09-14 26 J… Nov …
 7 h     <NA>   70.8 TRUE  Asia        43 $126,68… 1.32e7 1492-08-28 5 No… Aug …
 8 p     e      82.0 TRUE  Europe      43 $85,459… 1.37e7 1214-10-11 17 F… Feb …
 9 u     <NA>   96.9 FALSE Americas    37 $143,00… 1.56e7 1310-10-12 27 O… Nov …
10 i     d      12.1 FALSE Americas    48 $162,34… 1.61e7 1299-01-20 25 S… Aug …
11 f     d      36.4 FALSE Americas    39 $15,815… 1.56e7 1251-03-17 7 Ju… Apr …
12 l     <NA>   36.6 FALSE none        43 $168,76… 1.62e7 1341-03-12 25 J… Dec …
13 z     <NA>   88.5 FALSE Asia        21 $8,913.… 1.21e7 1355-12-18 21 J… May …
14 b     <NA>   23.7 TRUE  none        49 $20,651… 1.33e7 1228-07-07 26 J… Aug …
15 e     c      42.9 FALSE Americas    42 $82,921… 1.51e7 1423-09-03 19 S… Feb …
16 j     d      43.1 TRUE  Europe      20 $165,14… 1.35e7 1309-05-14 26 F… Oct …
17 u     b      46.3 FALSE Asia        40 $14,301… 1.27e7 1422-10-19 19 A… Mar …
18 n     b      28.1 TRUE  Americas    38 $39,063… 1.36e7 1439-03-14 11 O… Jul …
19 z     <NA>   91.4 FALSE Africa      32 $47,703… 1.43e7 1295-04-12 17 S… Oct …
20 t     a      67.1 FALSE Asia        44 $75,697… 1.52e7 1524-10-09 15 F… Sep …
# ℹ 1 more variable: l <dttm>

Going column by column:

  • a: Looks good.
  • b: Looks good. read_csv() correctly identified both NA and the one blank cell as missing data.
  • c: Looks good.
  • d: Looks good. T and F were correctly turned into a logical vector.
  • e: Parsed as a character vector, but the value of none is missing data. Additionally, we could turn this into a factor.
  • f: Parsed as a double-precision vector but is all whole numbers. We could denote this to an integer.
  • g: Parsed as a character vector because of the presence of dollar signs and commas but is actually numeric data.
  • h: Parsed as a numeric vector, but if you look closely, this is actually a date is the format of yyyymmdd without dashes.
  • i: Correctly parsed as a date vector.
  • j: Parsed as character vector but contains date data.
  • k: Parsed as character vector but contains date data.
  • l: Correctly parsed as date-time vector.

We need to make some changes to our data. It would be possible to make these changes after reading in the data with our data wrangling tools, or we can do it while reading in the data.

This is also a good reminder that you need to check your data when you import it.

2. Missing values

We can deal with the extra missing value from the data with the na argument that has defaults for both blank cells and "NA". In this case, we need to add "none". But we also need to be careful. We do not want to use only "none"; we want to add it to the default.

read_csv(here("data-raw", "readr-example.csv"),
         na = c("", "NA", "none")) # Add none as NA for column e
# A tibble: 20 × 12
   a     b         c d     e            f g             h i          j     k    
   <chr> <chr> <dbl> <lgl> <chr>    <dbl> <chr>     <dbl> <date>     <chr> <chr>
 1 p     b      95.2 TRUE  Asia        48 $42,498… 1.47e7 1585-05-06 20 J… Feb …
 2 x     b      71.1 FALSE <NA>        36 $141,97… 1.58e7 1366-12-10 23 A… Jun …
 3 e     d      15.6 FALSE Africa      35 $15,852… 1.55e7 1606-06-05 4 Ju… Sep …
 4 o     b      57.6 TRUE  Africa      46 $66,555… 1.46e7 1675-01-14 20 O… Dec …
 5 v     d      91.1 TRUE  Americas    23 $143,74… 1.35e7 1400-05-12 27 J… Oct …
 6 r     d      40.0 TRUE  Asia        36 $46,013… 1.34e7 1578-09-14 26 J… Nov …
 7 h     <NA>   70.8 TRUE  Asia        43 $126,68… 1.32e7 1492-08-28 5 No… Aug …
 8 p     e      82.0 TRUE  Europe      43 $85,459… 1.37e7 1214-10-11 17 F… Feb …
 9 u     <NA>   96.9 FALSE Americas    37 $143,00… 1.56e7 1310-10-12 27 O… Nov …
10 i     d      12.1 FALSE Americas    48 $162,34… 1.61e7 1299-01-20 25 S… Aug …
11 f     d      36.4 FALSE Americas    39 $15,815… 1.56e7 1251-03-17 7 Ju… Apr …
12 l     <NA>   36.6 FALSE <NA>        43 $168,76… 1.62e7 1341-03-12 25 J… Dec …
13 z     <NA>   88.5 FALSE Asia        21 $8,913.… 1.21e7 1355-12-18 21 J… May …
14 b     <NA>   23.7 TRUE  <NA>        49 $20,651… 1.33e7 1228-07-07 26 J… Aug …
15 e     c      42.9 FALSE Americas    42 $82,921… 1.51e7 1423-09-03 19 S… Feb …
16 j     d      43.1 TRUE  Europe      20 $165,14… 1.35e7 1309-05-14 26 F… Oct …
17 u     b      46.3 FALSE Asia        40 $14,301… 1.27e7 1422-10-19 19 A… Mar …
18 n     b      28.1 TRUE  Americas    38 $39,063… 1.36e7 1439-03-14 11 O… Jul …
19 z     <NA>   91.4 FALSE Africa      32 $47,703… 1.43e7 1295-04-12 17 S… Oct …
20 t     a      67.1 FALSE Asia        44 $75,697… 1.52e7 1524-10-09 15 F… Sep …
# ℹ 1 more variable: l <dttm>

3. Column types

read_csv() allows you to view exactly how it parsed the data and how you can change its guesses with spec().

spec(data)
cols(
  a = col_character(),
  b = col_character(),
  c = col_double(),
  d = col_logical(),
  e = col_character(),
  f = col_double(),
  g = col_character(),
  h = col_double(),
  i = col_date(format = ""),
  j = col_character(),
  k = col_character(),
  l = col_datetime(format = "")
)

This shows the functions used to parse each column. We can change the column type by changing the col_*() function used. The structure is very similar to the output of spec(). You can list the columns you want to change through the col_types argument either within the cols() function to keep all columns or cols_only() to only include the columns you explicitly specify. Let’s start by making column e a factor, f an integer, and g numeric. We will discuss dates more below.

read_csv(here("data-raw", "readr-example.csv"),
         na = c("", "NA", "none"), # Add none as NA for column e
         col_types = cols(
           e = col_factor(),
           f = col_integer(),
           g = col_number()
         ))
# A tibble: 20 × 12
   a     b         c d     e            f       g       h i          j     k    
   <chr> <chr> <dbl> <lgl> <fct>    <int>   <dbl>   <dbl> <date>     <chr> <chr>
 1 p     b      95.2 TRUE  Asia        48  42499.  1.47e7 1585-05-06 20 J… Feb …
 2 x     b      71.1 FALSE <NA>        36 141972.  1.58e7 1366-12-10 23 A… Jun …
 3 e     d      15.6 FALSE Africa      35  15853.  1.55e7 1606-06-05 4 Ju… Sep …
 4 o     b      57.6 TRUE  Africa      46  66555.  1.46e7 1675-01-14 20 O… Dec …
 5 v     d      91.1 TRUE  Americas    23 143748.  1.35e7 1400-05-12 27 J… Oct …
 6 r     d      40.0 TRUE  Asia        36  46013.  1.34e7 1578-09-14 26 J… Nov …
 7 h     <NA>   70.8 TRUE  Asia        43 126687.  1.32e7 1492-08-28 5 No… Aug …
 8 p     e      82.0 TRUE  Europe      43  85460.  1.37e7 1214-10-11 17 F… Feb …
 9 u     <NA>   96.9 FALSE Americas    37 143003.  1.56e7 1310-10-12 27 O… Nov …
10 i     d      12.1 FALSE Americas    48 162346.  1.61e7 1299-01-20 25 S… Aug …
11 f     d      36.4 FALSE Americas    39  15816.  1.56e7 1251-03-17 7 Ju… Apr …
12 l     <NA>   36.6 FALSE <NA>        43 168766.  1.62e7 1341-03-12 25 J… Dec …
13 z     <NA>   88.5 FALSE Asia        21   8913.  1.21e7 1355-12-18 21 J… May …
14 b     <NA>   23.7 TRUE  <NA>        49  20651.  1.33e7 1228-07-07 26 J… Aug …
15 e     c      42.9 FALSE Americas    42  82922.  1.51e7 1423-09-03 19 S… Feb …
16 j     d      43.1 TRUE  Europe      20 165148.  1.35e7 1309-05-14 26 F… Oct …
17 u     b      46.3 FALSE Asia        40  14301   1.27e7 1422-10-19 19 A… Mar …
18 n     b      28.1 TRUE  Americas    38  39064.  1.36e7 1439-03-14 11 O… Jul …
19 z     <NA>   91.4 FALSE Africa      32  47704.  1.43e7 1295-04-12 17 S… Oct …
20 t     a      67.1 FALSE Asia        44  75697.  1.52e7 1524-10-09 15 F… Sep …
# ℹ 1 more variable: l <dttm>

3. Column types: Parsing dates

Dates can be difficult to parse because there are so many different ways to write dates. Therefore, we may need to do a bit more work and tell read_csv() not only that a column is a date, but also what format it is in. You declare the format using abbreviations defined by the POSIX standard. These are:

  • Year: "%Y" (4 digits); "%y" (2 digits)
  • Month: "%m" (2 digits), "%b" (abbreviated name), "%B" (full name)
  • Day: "%d" (2 digits), "%e" (optional leading space), "%a" (abbreviated name)
  • Shortcuts:
    • "%D" = "%m/%d/%y" or 02/23/26
    • "%F" = "%Y-%m-%d" or 2026-02-23
    • "%x" = "%y/%m/%d" or 26/02/23

See date format specifications in the readr documentation for more choices and details.

In the case of the example data we can use:

read_csv(here("data-raw", "readr-example.csv"),
         na = c("", "NA", "none"), # Add none as NA for column e
         col_types = cols(
           e = col_factor(),
           f = col_integer(),
           g = col_number(),
           h = col_date("%Y%m%d"), # date as 8 digit number (20260223)
           # i is correctly parsed as 2026-02-23
           j = col_date("%d %B %Y"), # 23 February 2026
           k = col_date("%b %d %Y") # Feb 23 2026
           # l is correctly parsed as datetime (2026-02-23 14:00)
         ))
# A tibble: 20 × 12
   a     b         c d     e           f      g h          i          j         
   <chr> <chr> <dbl> <lgl> <fct>   <int>  <dbl> <date>     <date>     <date>    
 1 p     b      95.2 TRUE  Asia       48 4.25e4 1465-12-20 1585-05-06 1503-01-20
 2 x     b      71.1 FALSE <NA>       36 1.42e5 1582-09-13 1366-12-10 1390-04-23
 3 e     d      15.6 FALSE Africa     35 1.59e4 1546-08-15 1606-06-05 1585-07-04
 4 o     b      57.6 TRUE  Africa     46 6.66e4 1463-11-27 1675-01-14 1237-10-20
 5 v     d      91.1 TRUE  Americ…    23 1.44e5 1352-03-26 1400-05-12 1394-07-27
 6 r     d      40.0 TRUE  Asia       36 4.60e4 1335-05-25 1578-09-14 1593-07-26
 7 h     <NA>   70.8 TRUE  Asia       43 1.27e5 1315-03-20 1492-08-28 1237-11-05
 8 p     e      82.0 TRUE  Europe     43 8.55e4 1370-01-09 1214-10-11 1650-02-17
 9 u     <NA>   96.9 FALSE Americ…    37 1.43e5 1562-05-23 1310-10-12 1660-10-27
10 i     d      12.1 FALSE Americ…    48 1.62e5 1608-06-21 1299-01-20 1263-09-25
11 f     d      36.4 FALSE Americ…    39 1.58e4 1559-05-08 1251-03-17 1235-06-07
12 l     <NA>   36.6 FALSE <NA>       43 1.69e5 1620-05-18 1341-03-12 1455-07-25
13 z     <NA>   88.5 FALSE Asia       21 8.91e3 1214-05-25 1355-12-18 1242-06-21
14 b     <NA>   23.7 TRUE  <NA>       49 2.07e4 1333-08-10 1228-07-07 1606-01-26
15 e     c      42.9 FALSE Americ…    42 8.29e4 1506-05-03 1423-09-03 1411-09-19
16 j     d      43.1 TRUE  Europe     20 1.65e5 1350-09-21 1309-05-14 1480-02-26
17 u     b      46.3 FALSE Asia       40 1.43e4 1268-03-06 1422-10-19 1512-08-19
18 n     b      28.1 TRUE  Americ…    38 3.91e4 1356-02-05 1439-03-14 1378-10-11
19 z     <NA>   91.4 FALSE Africa     32 4.77e4 1426-05-17 1295-04-12 1252-09-17
20 t     a      67.1 FALSE Asia       44 7.57e4 1521-10-04 1524-10-09 1412-02-15
# ℹ 2 more variables: k <date>, l <dttm>

Hopefully this example shows the flexibility of read_csv(), the importance of checking your data, and the complexity of dates. It may not look as nice, but I prefer the format of column h, making a date as an 8 digit number in the form of a four-digit year, a two-digit month, and a two-digit day. This way spreadsheets will not interpret your data as dates.