class: center, middle, inverse, title-slide

# Coding for the Humanities
## An Introduction to R for Digital Humanities
### Jesse Sadler

---

<style type="text/css">
pre {
  max-width: 100%;
  overflow-x: scroll;
}
</style>

# Agenda

## 1. Why use code for Digital Humanities
## 2. Introduction to R
## 3. Using R for Digital Humanities

---
class: inverse, center, middle

# 1. Why use code for Digital Humanities

---
class: center, inverse

## The power of code

<img src="https://media.giphy.com/media/BemKqR9RDK4V2/giphy.gif" style="display: block; margin: auto;" />

---
class: center, inverse

## How do you harness the power?

<img src="https://media.giphy.com/media/3oKIPnAiaMCws8nOsE/giphy.gif" style="display: block; margin: auto;" />

---

## The spreadsheet: The first killer app

<img src="images/excel.png" width="819" style="display: block; margin: auto;" />

---

# Spreadsheets are nice

- Nice way to enter and look at data
- Programmatic logic in cells and creation of visualizations
- What you see is what you get (WYSIWYG)

---

# Spreadsheets are nice, but...

- Nice way to enter and look at data
- Programmatic logic in cells and creation of visualizations
- What you see is what you get (WYSIWYG)
- Mixing of data entry and analysis
- Mouse clicks or mistaken drags can lead to errors
- No way to track changes: Do you want to save changes?

---

# Other options for Digital Humanities

- GUI (Graphical User Interface) applications
    - Tableau
    - Gephi
    - QGIS

--

- Web based
    - Google Maps
    - Palladio

--

- Programming languages (Command line interface)
    - JavaScript
    - Python
    - R

---

# Programming languages...

- Have a steep learning curve

--

- Powerful and endlessly customizable 💪💪

--

- Transparent: Every step can be clearly documented

--

- Transferable: To new data and to other people

--

- Reproducible

--

- Build open-source communities of shared problems and answers

---

# How do I get started?
## Find a project

--

.center[
<img src="images/draw-owl.jpeg" width="75%" style="display: block; margin: auto;" />
]

---

# Projects for Digital Humanities

--

- Do you have something that you can count or that involves numbers?
- Can you put various values into a spreadsheet?
- Do you want to analyze the data?
- Do you want to visualize the data?

--

### Then you have a project

--

You can use other kinds of data with R, but this is a good starting point.

---

# A warning: Writing code is frustrating

.center[
<img src="https://media.giphy.com/media/nkLB4Gp8H6hFe/giphy.gif" style="display: block; margin: auto;" />
]

.footnote[
[David Robinson](http://varianceexplained.org/r/teach-tidyverse/): "all programming syntax is confusing for non-programmers."
]

---

## But frustration is normal and consequences are small

<img src="images/this-is-fine.png" width="100%" style="display: block; margin: auto;" />

---

# Why R

- Open Source: It's free

--

- Built for data analysis

--

- Powerful visualization tools

--

- Popular inside and outside academia

--

- Strong community of teachers, learners, and programmers
    - [#Rstats on Twitter](https://twitter.com/hashtag/rstats)
    - [R-Ladies](https://rladies.org)
    - [RStudio Community](https://community.rstudio.com)

.center[
<img src="images/Rlogo.png" width="25%" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# 2. Introduction to R

---

# Download and Install R

1. [Go to the R Project for Statistical Computing website](https://www.r-project.org)
2. [Click on the download link](https://cran.r-project.org/mirrors.html)
3. Choose a server from which to download R
    - [RStudio servers](https://cloud.r-project.org/)
    - Or a server near you
4. Select your operating system
5. Mac
    - Click on the pkg file of the latest release
    - Follow installation instructions
6. Windows
    - Click on Base: Install R for the first time
    - Click on the download link
    - Follow installation instructions

---

# Download and Install RStudio

1.
[Go to the RStudio website](https://www.rstudio.com)
2. [Click on the Download RStudio link](https://www.rstudio.com/products/rstudio/download/)
3. [Click on the RStudio Desktop: Open Source License link](https://www.rstudio.com/products/rstudio/download/#download)
4. Select your operating system
5. Follow the installation instructions
6. Open the application and make sure everything works

---

## What is RStudio and how is it different from R?

.pull-left[
## R: Archives

<img src="images/archive.jpg" width="75%" style="display: block; margin: auto;" />
]

.pull-right[
## RStudio: Catalogue

<img src="images/dvdm-inventory.png" width="488" style="display: block; margin: auto;" />
]

???

- R is the actual programming language that performs the commands
- RStudio is an IDE (Integrated development environment) or GUI that helps you write and organize code
- R is necessary to use RStudio, but R is more complex and confusing without RStudio

---

# RStudio

.center[
<img src="images/RStudio.png" width="819" style="display: block; margin: auto;" />
]

---

# RStudio

.center[
<img src="images/RStudio-script.png" width="819" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Let's run some code

See script: 01-running-code.R

---

# Running code

Run code in the console

```r
5 + 10
```

```
[1] 15
```

--

Use the assignment operator to create named variables or objects

```r
x <- 15 * 3
```

--

Print contents of variable with the print function or name of the object

```r
# These two commands do the same thing
print(x)
```

```
[1] 45
```

```r
x
```

```
[1] 45
```

---

# Functions in R

Functions take in objects, perform some set of operations, and return another object.

Functions consist of...

--

- The name of the function

--

- Opening and closing parentheses

--

- Arguments separated by commas within the parentheses

```r
function_name(argument1 = value1, argument2 = value2, ...)
```

---

# Functions in R

The `sum()` function takes in any number of numeric objects and adds them together.

```r
sum(x, 8, 24)
```

```
[1] 77
```

--

Which can be saved as:

```r
y <- sum(x, 8, 24)
```

--

If the values change, the code can be rerun with the new values overwriting the old.

```r
x <- 15 * 5
y <- sum(x, 8, 24)
```

---

# Functions with named arguments

The sequence function creates a vector of numbers

```r
seq(from = 0, to = 20, length.out = 8)
```

```
[1]  0.000000  2.857143  5.714286  8.571429 11.428571 14.285714 17.142857
[8] 20.000000
```

--

Or use the `by` argument instead of `length.out`

```r
seq(from = 0, to = 20, by = 3)
```

```
[1]  0  3  6  9 12 15 18
```

--

Arguments do not have to be named if they are placed in the correct order, though this may make the code difficult to read.

```r
seq(0, 20, 3)
```

```
[1]  0  3  6  9 12 15 18
```

---

# Getting help

### If you need help, use `?function_name()` and look at the examples

```r
?seq()
```

--

### Make sure to read error messages

Search the error messages if you do not know what they mean.

---
class: inverse, center, middle

# Expanding R and the tidyverse

See script: 02-install-pkgs.R

---

# Expanding R and the tidyverse

- Base R: Packages that come with R

--

- Expand base R with packages

--

- [Comprehensive R Archive Network: CRAN](https://cran.r-project.org)

--

- [The tidyverse](https://tidyverse.org)

???
Could have Shiny app of CRAN downloads on another tab

---

# What is the tidyverse

.pull-left[
## Hadley Wickham

<img src="images/wickham.jpg" width="75%" style="display: block; margin: auto;" />
]

.pull-right[
## The tidyverse

<img src="images/tidyverse.jpeg" width="135" style="display: block; margin: auto;" />
]

---

## Downloading packages

To download packages use `install.packages()`: make sure to include the quotation marks

```r
install.packages("tidyverse")
```

--

## Loading packages

To use a package it has to be loaded

```r
library(tidyverse)
```

---

# Check that everything works

```r
ggplot(data = mpg, aes(x = hwy, y = displ)) +
  geom_point() +
  geom_smooth()
```

<img src="index_files/figure-html/mpg-1.png" width="50%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Setting up the environment for coding in RStudio

---

# What is real vs what is temporary

.center[
<img src="images/script-vs-console.png" width="100%" style="display: block; margin: auto;" />
]

---

# Save your scripts, not your environment

Tools > Global Options

.center[
<img src="images/RStudio-preferences.png" width="60%" style="display: block; margin: auto;" />
]

---

# Organizing your scripts and data

.pull-left[
## RStudio Projects

<img src="images/RStudio-project.png" width="339" style="display: block; margin: auto;" />
]

--

.pull-right[
## Organizing projects

```r
project-name
|--data
    |--cleaned-data.csv
|--data-raw
    |--raw-data.csv
    |--clean-data.R
|--plots
    |--bar-plot.png
    |--scatter-plot.png
|--project-name.Rproj
|--scripts
    |--exploratory-analysis.R
    |--create-plots.R
```
]

---
class: middle

# 1. Create a project in a new directory

# 2. Create a script and save it

.footnote[
Put the project somewhere you can find it and collect your R projects. Maybe a folder called R or rstats.
]

---
class: middle

.center[
# Download the tutorial

## https://github.com/jessesadler/hope-intro2r
]

.footnote[
Put the downloaded project in the same parent directory (R or rstats) as the R project you just created.
]

---
class: inverse, center, middle

# 3. Using R for Digital Humanities

---

# Agenda

### 1. Import data with `readr`

### 2. Explore data with `dplyr`

### 3. Relational data with `dplyr`

### 4. Visualization with `ggplot2`

### 5. Tidying data with `tidyr`

---
class: inverse, center, middle

# Import data with `readr`

See script: 03-loading-data.R

---

# Import data

**Remember to load the tidyverse**

```r
library(tidyverse)
```

--

Import the raw data with `read_csv()`

```r
letters <- read_csv("data-raw/dvdm-correspondence-1591.csv")
```

---

# Inspect the letters data frame

Letters sent to Daniel van der Meulen from 1578 to 1592.

```r
letters
```

```
# A tibble: 424 x 5
   writer                  source    destination  year     date
   <chr>                   <chr>     <chr>       <dbl>    <dbl>
 1 Languet, Hubert         Cologne   Antwerp      1578 15781009
 2 Languet, Hubert         Ghent     Antwerp      1578 15781224
 3 Banos, Theophile de     Frankfurt Antwerp      1578       NA
 4 Oudenforte, R.          Lier      Antwerp      1580 15800402
 5 Burmania, Dominiques de Duisburg  Cologne      1580 15801010
 6 Albada, Aggaeus de      Cologne   Antwerp      1582 15820318
 7 Albada, Aggaeus de      Cologne   Antwerp      1582 15820729
 8 Goyvaerts, Hendrick     Frankfurt Antwerp      1582 15820922
 9 Burmania, Dominiques de Haarlem   Antwerp      1582 15820923
10 Albada, Aggaeus de      Cologne   Antwerp      1582 15821028
# ... with 414 more rows
```

---

# Parse the date variable

Parse date variable as a date instead of a numeric value.

```r
letters <- read_csv("data-raw/dvdm-correspondence-1591.csv",
                    col_types = cols(date = col_date(format = "%Y%m%d")))
glimpse(letters)
```

```
Observations: 424
Variables: 5
$ writer      <chr> "Languet, Hubert", "Languet, Hubert", "Banos, Theo...
$ source      <chr> "Cologne", "Ghent", "Frankfurt", "Lier", "Duisburg...
$ destination <chr> "Antwerp", "Antwerp", "Antwerp", "Antwerp", "Colog...
$ year        <dbl> 1578, 1578, 1578, 1580, 1580, 1582, 1582, 1582, 15...
$ date        <date> 1578-10-09, 1578-12-24, NA, 1580-04-02, 1580-10-1...
```

.footnote[
For more information on parsing data, see [Chapter 11 of R for Data Science](https://r4ds.had.co.nz/data-import.html).
]

---

# Saving the data frame

The converse of `read_csv()` is `write_csv()`.

Write the data frame to the "data" folder with the date variable parsed correctly.

```r
write_csv(letters, "data/dvdm-correspondence.csv")
```

"data/" places the file in the data folder.

"dvdm-correspondence.csv" names the file. Make sure to put the csv extension on the end.

--

.footnote[
**Tip**: Minimize the objects that you save. The code that created the objects is more important.
]

---
class: middle

.center[
# First workflow completed

You can now import data, parse it, and save it.

### Before moving on restart your R session: Session > Restart R
]

By restarting R for each new script, you ensure that every script can work on its own.

---
class: inverse, center, middle

# Explore data with `dplyr`

<img src="images/dplyr.png" width="50%" style="display: block; margin: auto;" />

See script: 04-explore-data.R

---

# Explore data with `dplyr`

## Six main verbs

- select: pick variables or columns to keep or discard
- arrange: reorder rows by values in columns
- filter: pick observations or rows by their values
- mutate: create new variables from information from existing variables
- summarise: collapse observations to a summary value such as count, mean, or median
- group_by: group the observations in order to make a summary of them

---

## `select()`

Pick variables or columns to keep or discard

```r
select(letters, writer, date)
```

--

Get rid of variables you do not want to keep

```r
select(letters, -writer, -date)
```

--

Rearrange columns

```r
select(letters, writer, date, destination, source)
```

--

Rename columns within selection: `new_name = old_name`

```r
select(letters, writer, from = source, to = destination)
```

--

To
keep all variables but rename one or more, use `rename()`

```r
rename(letters, correspondent = writer)
```

---

# `arrange()`

`arrange()` does not change the data. It only alters the presentation of the data.

Arrange rows by values in a variable

```r
arrange(letters, writer)
```

--

Arrange rows by multiple variables

```r
arrange(letters, source, destination)
```

--

Arrange rows in descending order with `desc()`

```r
arrange(letters, desc(date))
```

---

# `filter()`

## Comparison in R: what to keep vs what to discard

- Greater than and less than: `>, >=, <, <=`
- Equal and not equal: `==` and `!=`
    - *note the two equal signs to show equality*
- And, or, not: `&, |, !`

---

# `filter()`

Pick letters sent to Antwerp

```r
filter(letters, destination == "Antwerp")
```

--

Why doesn't this work?

```r
filter(letters, destination == Antwerp)
```

--

Pick letters sent during certain years

```r
filter(letters, year >= 1584 & year <= 1586)
```

--

Remove rows that do not have a known source or destination: `NA`

```r
filter(letters, !is.na(source), !is.na(destination))
```

---

# `mutate()`

Create new variables from information from existing variables

```r
mutate(data, variable_name = function_call)
```

--

Create a new variable with Julian calendar dates for letters sent after 1582.
```r
# New data frame with letters after 1582
gregorian_letters <- filter(letters, year > 1582)
```

--

Subtract 10 days from date column of new data frame

```r
# New variable with Julian calendar dates
mutate(gregorian_letters, julian = date - 10)
```

---

# `summarise()`

English spelling comes from Hadley Wickham's birthplace of New Zealand 🇳🇿

You can also use `summarize()` if you prefer the zed

--

`summarise()` has a similar formula to `mutate()`

```r
summarise(data, variable_name = function_call)
```

--

Let's try it with `n()` to count the observations

```r
summarise(letters, count = n())
```

```
# A tibble: 1 x 1
  count
  <int>
1   424
```

🤷🤷🤷🤷🤷🤷🤷🤷🤷

---

## `summarise()` with `group_by()`

`summarise()` is not very useful without `group_by()`.

--

Group the data frame by a variable and then use `summarise()` and `n()`

```r
# create grouped data frame
letters_writer <- group_by(letters, writer)
```

How many letters were written by each correspondent?

```r
summarise(letters_writer, count = n())
```

```
# A tibble: 65 x 2
   writer               count
   <chr>                <int>
 1 Achelen, Aelken van      1
 2 Albada, Aggaeus de       5
 3 Anraet, Thomas           1
 4 Backer, Andre            1
 5 Banos, Theophile de      4
 6 Bastingius, Jeremias     1
 7 Beke, Jan van der        3
 8 Bellasi, Agostino        7
 9 Berrewijns, Hans         1
10 Bongars, Jacques         7
# ... with 55 more rows
```

---
class: inverse, center, middle

# The pipe: %>%

<img src="images/magrittr.png" width="50%" style="display: block; margin: auto;" />

See script: 05-the-pipe.R

---

# What if you want to perform multiple operations?

How many letters were sent from each location while Daniel lived in Bremen?
--

```r
# letters to Bremen
letters_bremen <- filter(letters, destination == "Bremen")
# group data frame by source
bremen_grouped <- group_by(letters_bremen, source)
# number of letters per source
bremen_summarised <- summarise(bremen_grouped, count = n())
# arrange with most at top
finally_done <- arrange(bremen_summarised, desc(count))
```

---
background-image: url("https://media.giphy.com/media/tw1zMQrM2IhC8/giphy.gif")
background-size: cover

---

.center[
# The pipe to the rescue
]

.pull-left[
<img src="images/magrittr.png" width="236" style="display: block; margin: auto;" />
]

.pull-right[
<img src="https://media.giphy.com/media/D49L3FpxqtQ3u/giphy.gif" style="display: block; margin: auto;" />
]

---

# The pipe: %>%

### Read the pipe as "and then"

Pipe the output of one function directly into the next one

--

```r
data %>%
  do_this() %>%
  do_something_else() %>%
  do_one_more_thing()
```

--

### Keyboard shortcuts

macOS: Cmd+Shift+M

Windows: Ctrl+Shift+M

---

## The pipe: %>%

Let's try to find letters sent to Bremen again with the pipe.

```r
letters %>%
  filter(destination == "Bremen") %>% # letters to Bremen
  group_by(source) %>% # group by source
  summarise(count = n()) %>% # letters per source
  arrange(desc(count)) # most letters at the top
```

```
# A tibble: 28 x 2
   source    count
   <chr>     <int>
 1 Haarlem      58
 2 Antwerp      46
 3 Frankfurt    11
 4 Venice       11
 5 Verona        6
 6 Amsterdam     5
 7 Dordrecht     5
 8 London        5
 9 Vicenza       5
10 Delft         3
# ... with 18 more rows
```

---
class: center, middle

# Second workflow completed

You can now explore data and make various calculations about your data.

### Before moving on restart your R session: Session > Restart R

---
class: inverse, center, middle

# Relational data with `dplyr`

<img src="images/dplyr.png" width="50%" style="display: block; margin: auto;" />

See script: 06-relational-data.R

---

## Relational data: working with multiple data frames

.pull-left[
See `?left_join()` for details.
**inner join**

<img src="images/join-inner.png" width="216" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="images/join-outer.png" width="75%" style="display: block; margin: auto;" />
]

.bottom[
Images from [Wickham and Grolemund, R for Data Science](https://r4ds.had.co.nz/relational-data.html)
]

---

## Join number of letters per writer with kinship and gender data about correspondents

```r
left_join(df1, df2, by = "key_variable")
```

1: Import correspondents data

--

```r
correspondents <- read_csv("data/correspondents.csv")
```

2: Set up letters data

--

```r
per_writer <- group_by(letters, writer) %>%
  summarise(count = n())
```

3: Join the two data frames

--

```r
left_join(per_writer, correspondents, by = "writer")
```

Any of the joins could be used here because all the correspondents are in both data frames exactly once.

---

## Join by variables with different names

```r
left_join(df1, df2, by = c("variable1" = "variable2"))
```

--

Rename one of the writer variables and join the data frames

```r
rename(per_writer, schrijver = writer) %>%
  left_join(correspondents, by = c("schrijver" = "writer"))
```

```
# A tibble: 65 x 4
   schrijver            count kinship gender
   <chr>                <int> <chr>   <chr>
 1 Achelen, Aelken van      1 non     male
 2 Albada, Aggaeus de       5 non     male
 3 Anraet, Thomas           1 non     male
 4 Backer, Andre            1 non     male
 5 Banos, Theophile de      4 non     male
 6 Bastingius, Jeremias     1 non     male
 7 Beke, Jan van der        3 df      male
 8 Bellasi, Agostino        7 non     male
 9 Berrewijns, Hans         1 df      male
10 Bongars, Jacques         7 non     male
# ... with 55 more rows
```

---
class: center, middle

# 🤯 Third workflow completed 🤯

You can now join multiple data frames together.
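The earlier remark that any join would work depends on every correspondent appearing exactly once in both data frames. A minimal sketch of what changes when that is not the case, using two small made-up data frames (the names `counts` and `people` and all of their values are invented for illustration and are not part of the tutorial data):

```r
library(dplyr)

# Hypothetical stand-ins for per_writer and correspondents
counts <- tibble(writer = c("Writer A", "Writer B", "Writer C"),
                 count = c(5, 1, 7))
people <- tibble(writer = c("Writer A", "Writer B"),
                 gender = c("male", "female"))

# left_join() keeps every row of counts; Writer C gets NA for gender
kept_all <- left_join(counts, people, by = "writer")

# inner_join() keeps only the writers found in both data frames
kept_matches <- inner_join(counts, people, by = "writer")

nrow(kept_all)
nrow(kept_matches)
```

Comparing the row counts of the two results is a quick way to check whether any keys failed to match.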
### Before moving on restart your R session: Session > Restart R

---
class: inverse, center, middle

# Visualization with `ggplot2`

<img src="images/ggplot2.png" width="50%" style="display: block; margin: auto;" />

See script: 07-ggplot2-scatterplots.R

---

# Making a plot with prices of goods in Holland

.center[
<img src="index_files/figure-html/dutch-prices-01-1.png" height="500" style="display: block; margin: auto;" />
]

---
class: inverse

# Let's look at the code

```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders, color = commodity)) +
  geom_point()
```

<img src="index_files/figure-html/dutch-prices-02-1.png" height="300" style="display: block; margin: auto;" />

---

## Grammar of graphics

- Data
- Geometric objects (geoms) that provide the type of objects drawn
- Aesthetic mapping of the data such as location, size, or color
- Scales of the plot axes
- Statistical transformation
- Coordinates

--

.pull-left[
```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders, color = commodity)) +
  geom_point()
```
]

.pull-right[
![](index_files/figure-html/dutch-prices-03-1.png)
]

---

# Layers: Data

```r
ggplot(data = dutch_prices)
```

<img src="index_files/figure-html/layer-data-1.png" height="400" style="display: block; margin: auto;" />

---

# Layers: aesthetics

```r
ggplot(data = dutch_prices,
*      aes(x = year, y = guilders))
```

<img src="index_files/figure-html/layer-aes-1.png" height="400" style="display: block; margin: auto;" />

---

# Layers: geoms

```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders)) +
* geom_point()
```

<img src="index_files/figure-html/layer-point-1.png" height="400" style="display: block; margin: auto;" />

---

# Layers: multiple geoms

```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders)) +
  geom_point() +
* geom_line(aes(group = commodity))
```

<img src="index_files/figure-html/layer-line-1.png" height="400" style="display: block; margin: auto;" />

---

# Layers: map more aesthetics to variables

.pull-left[
```r
ggplot(data
= dutch_prices,
       aes(x = year, y = guilders,
*          color = commodity)) +
  geom_point()
```

<img src="index_files/figure-html/layer-color-1.png" height="300" style="display: block; margin: auto;" />
]

.pull-right[
```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders,
*          shape = commodity)) +
  geom_point()
```

<img src="index_files/figure-html/layer-shape-1.png" height="300" style="display: block; margin: auto;" />
]

---

# Layers: change geoms

```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders, color = commodity)) +
* geom_line()
```

<img src="index_files/figure-html/geom-line-1.png" height="400" style="display: block; margin: auto;" />

---

# Layers: non-mapped aesthetic changes

```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders)) +
* geom_point(color = "orange", size = 3, alpha = 0.5)
```

<img src="index_files/figure-html/layer-non-aes-1.png" height="400" style="display: block; margin: auto;" />

---

# Layers: Facet wrap

```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders)) +
  geom_point() +
* facet_wrap(~ commodity)
```

<img src="index_files/figure-html/layer-facet-1.png" height="400" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Visualization with `ggplot2`: Statistical transformations

<img src="images/ggplot2.png" width="33%" style="display: block; margin: auto;" />

See script: 08-ggplot2-stats.R

---

# Bar plots: statistical transformations

.pull-left[
Why does this work?
```r
ggplot(letters, aes(year)) +
  geom_bar()
```
]

.pull-right[
<img src="index_files/figure-html/stat-transform-1.png" height="400" style="display: block; margin: auto;" />
]

---

# Stat identity

.pull-left[
```r
letters %>%
  group_by(year) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = year, y = count)) +
* geom_bar(stat = "identity")
```

*Notice the use of the pipe (`%>%`) into `ggplot()`.*
]

.pull-right[
<img src="index_files/figure-html/stat-identity-1.png" height="400" style="display: block; margin: auto;" />
]

---

# Add another variable with fill

```r
ggplot(letters,
*      aes(year, fill = destination)) +
  geom_bar()
```

<img src="index_files/figure-html/aes-fill-1.png" height="400" style="display: block; margin: auto;" />

---

# Coordinate flip

.pull-left[
```r
ggplot(letters, aes(destination)) +
  geom_bar()
```

<img src="index_files/figure-html/regular-coord-1.png" height="350" style="display: block; margin: auto;" />
]

--

.pull-right[
```r
ggplot(letters, aes(destination)) +
  geom_bar() +
* coord_flip()
```

<img src="index_files/figure-html/coord-flip-1.png" height="350" style="display: block; margin: auto;" />
]

---

# Histogram with date variable

```r
ggplot(letters, aes(date)) +
  geom_histogram(binwidth = 60)
```

<img src="index_files/figure-html/histogram-1.png" height="400" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Visualization with `ggplot2`: Labels and themes

<img src="images/ggplot2.png" width="33%" style="display: block; margin: auto;" />

See script: 09-ggplot2-labels.R

---

## Labels with `ggplot2`

```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders, color = commodity)) +
  geom_point() +
  labs(title = "Prices of Goods in Holland",
       x = "Date",
       y = "Price in Guilders",
       color = "Commodities")
```

<img src="index_files/figure-html/ggplot-labs-1.png" height="300" style="display: block; margin: auto;" />

---

## Themes with `ggplot2`

See `?theme()` for more ways to tweak themes.
```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders, color = commodity)) +
  geom_point() +
* theme_bw()
```

<img src="index_files/figure-html/ggplot-themes-1.png" height="350" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Visualization with `ggplot2`: Saving plots

<img src="images/ggplot2.png" width="33%" style="display: block; margin: auto;" />

See script: 10-ggplot2-save.R

---

# Saving a plot with `ggsave()`

```r
ggplot(data = dutch_prices,
       aes(x = year, y = guilders, color = commodity)) +
  geom_point() +
  labs(title = "Prices of Goods in Holland",
       x = "Date",
       y = "Price in Guilders",
       color = "Commodities")
```

To save the last plot use `ggsave()`

```r
ggsave("plots/my-first-plot.png")
```

See `?ggsave()` for options on changing width and height of plot.

---
class: center, middle

# 💪 Fourth workflow completed 💪

You can now make various kinds of plots with `ggplot2` and save them.

### Before moving on restart your R session: Session > Restart R

---
class: inverse, center, middle

# Tidying data with `tidyr`

<img src="images/tidyr.png" width="50%" style="display: block; margin: auto;" />

See script: 11-tidying-data.R

---

# Tidy data

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

<img src="images/tidy-data.png" width="614" style="display: block; margin: auto;" />

[Wickham and Grolemund, R for Data Science](https://r4ds.had.co.nz/tidy-data.html)

---

# What is problematic about this data?
```r
read_csv("data-raw/barley.csv")
```

```
# A tibble: 150 x 13
    Year September October November December January February March April
   <dbl> <chr>     <chr>   <chr>    <chr>    <chr>   <chr>    <chr> <chr>
 1  1549 1.25      1.38    1.5      1.31     1.44    1.4      1.38  1.38
 2  1550 1.38      1.63    1.63     1.88     1.9     1.81     1.85  2
 3  1551 1.56      1.69    1.56     1.63     1.75    1.75     1.75  1.75
 4  1552 1.75      2.31    2.25     2.5      2.75    3        3.25  3
 5  1553 1.38      1.69    1.63     1.5      1.63    1.63     1.63  1.63
 6  1554 1.63      1.63    1.81     1.88     1.75    1.75     1.75  1.56
 7  1555 1.5       1.5     1.63     1.5      1.5     1.38     1.63  1.5
 8  1556 1.88      1.75    1.75     2.13     2.13    -        -     -
 9  1557 -         -       -        -        -       -        -     -
10  1558 1.38      1.5     1.5      1.38     1.38    1.25     1.13  1
# ... with 140 more rows, and 4 more variables: May <chr>, June <chr>,
#   July <chr>, August <chr>
```

.footnote[
Data is from [Nicholas Poynder, Monthly Grain Prices at Les Halles, Paris, 1549-1698](http://www.iisg.nl/hpw/poynder-france.php)
]

---

# Dealing with `NA`

### What character is being used instead of `NA`?

```r
read_csv("data-raw/barley.csv", na = "-")
```

```
# A tibble: 150 x 13
    Year September October November December January February March April
   <dbl>     <dbl>   <dbl>    <dbl>    <dbl>   <dbl>    <dbl> <dbl> <dbl>
 1  1549      1.25    1.38      1.5     1.31    1.44      1.4  1.38  1.38
 2  1550      1.38    1.63     1.63     1.88     1.9     1.81  1.85     2
 3  1551      1.56    1.69     1.56     1.63    1.75     1.75  1.75  1.75
 4  1552      1.75    2.31     2.25      2.5    2.75        3  3.25     3
 5  1553      1.38    1.69     1.63      1.5    1.63     1.63  1.63  1.63
 6  1554      1.63    1.63     1.81     1.88    1.75     1.75  1.75  1.56
 7  1555       1.5     1.5     1.63      1.5     1.5     1.38  1.63   1.5
 8  1556      1.88    1.75     1.75     2.13    2.13       NA    NA    NA
 9  1557        NA      NA       NA       NA      NA       NA    NA    NA
10  1558      1.38     1.5      1.5     1.38    1.38     1.25  1.13     1
# ... with 140 more rows, and 4 more variables: May <dbl>, June <dbl>,
#   July <dbl>, August <dbl>
```

---

<img src="images/wide2long.png" width="749" height="50%" style="display: block; margin: auto;" />

---

# Gather a data frame

```r
gather(data, key = "key", value = "value", variables_to_gather)
```

**key:** the name for the variable whose values are currently variable names.
**value:** the name of the variable whose values are contained within the columns that are to be gathered.

---

# Gather barley data frame

.pull-left[
```r
gather(barley, key = month, value = price, -Year)
```

```
# A tibble: 1,800 x 3
    Year month     price
   <dbl> <chr>     <dbl>
 1  1549 September  1.25
 2  1550 September  1.38
 3  1551 September  1.56
 4  1552 September  1.75
 5  1553 September  1.38
 6  1554 September  1.63
 7  1555 September   1.5
 8  1556 September  1.88
 9  1557 September    NA
10  1558 September  1.38
# ... with 1,790 more rows
```
]

--

.pull-right[
```r
gather(barley, key = month, value = price, September:August)
```

```
# A tibble: 1,800 x 3
    Year month     price
   <dbl> <chr>     <dbl>
 1  1549 September  1.25
 2  1550 September  1.38
 3  1551 September  1.56
 4  1552 September  1.75
 5  1553 September  1.38
 6  1554 September  1.63
 7  1555 September   1.5
 8  1556 September  1.88
 9  1557 September    NA
10  1558 September  1.38
# ... with 1,790 more rows
```
]

---

# Now what do we need to do?

--

1. Create a full date with day, month, and year

--

2. Unite the day, month, and year data into a single variable

--

3. Make the variable a date class

--

4. Label the type of grain

--

To help with number 3, we can use the **lubridate** package.

```r
install.packages("lubridate")
```

And load the package.

```r
library(lubridate)
```

---

## 1. Create a full date with day, month, and year

--

```r
barley_long %>%
* mutate(day = 1)
```

```
# A tibble: 1,800 x 4
    Year month     price   day
   <dbl> <chr>     <dbl> <dbl>
 1  1549 September  1.25     1
 2  1550 September  1.38     1
 3  1551 September  1.56     1
 4  1552 September  1.75     1
 5  1553 September  1.38     1
 6  1554 September  1.63     1
 7  1555 September   1.5     1
 8  1556 September  1.88     1
 9  1557 September    NA     1
10  1558 September  1.38     1
# ... with 1,790 more rows
```

---

## 2.
Unite the day, month, and year data into a single variable

--

```r
barley_long %>%
  mutate(day = 1) %>%
* unite(col = date, "Year", "month", "day", sep = " ")
```

```
# A tibble: 1,800 x 2
   date             price
   <chr>            <dbl>
 1 1549 September 1  1.25
 2 1550 September 1  1.38
 3 1551 September 1  1.56
 4 1552 September 1  1.75
 5 1553 September 1  1.38
 6 1554 September 1  1.63
 7 1555 September 1   1.5
 8 1556 September 1  1.88
 9 1557 September 1    NA
10 1558 September 1  1.38
# ... with 1,790 more rows
```

---

## 3. Make the variable a date class with lubridate

--

```r
barley_long %>%
  mutate(day = 1) %>%
  unite(col = date, "Year", "month", "day", sep = " ") %>%
* mutate(date = ymd(date))
```

```
# A tibble: 1,800 x 2
   date       price
   <date>     <dbl>
 1 1549-09-01  1.25
 2 1550-09-01  1.38
 3 1551-09-01  1.56
 4 1552-09-01  1.75
 5 1553-09-01  1.38
 6 1554-09-01  1.63
 7 1555-09-01   1.5
 8 1556-09-01  1.88
 9 1557-09-01    NA
10 1558-09-01  1.38
# ... with 1,790 more rows
```

---

## 4. Label the type of grain

--

```r
barley_long %>%
  mutate(day = 1) %>%
  unite(col = date, "Year", "month", "day", sep = " ") %>%
  mutate(date = ymd(date)) %>%
* mutate(grain = "barley")
```

```
# A tibble: 1,800 x 3
   date       price grain
   <date>     <dbl> <chr>
 1 1549-09-01  1.25 barley
 2 1550-09-01  1.38 barley
 3 1551-09-01  1.56 barley
 4 1552-09-01  1.75 barley
 5 1553-09-01  1.38 barley
 6 1554-09-01  1.63 barley
 7 1555-09-01   1.5 barley
 8 1556-09-01  1.88 barley
 9 1557-09-01    NA barley
10 1558-09-01  1.38 barley
# ... with 1,790 more rows
```

---

# Repeat with oats and wheat data

Repeat steps with "oats.csv" and "wheat.csv" to create `barley_tidied`, `oats_tidied`, and `wheat_tidied`. Make sure to label each data frame with the correct type of grain.
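Because the same steps are repeated for each grain, one option is to wrap them in a function. This is only a sketch and not part of the tutorial scripts: `tidy_grain()` is a hypothetical helper, and `rye_wide` is a tiny made-up table standing in for a file read with `read_csv()`. It assumes an English locale so that `ymd()` can read the month names.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical helper: gather a wide grain table, build a date, label the grain
tidy_grain <- function(wide_data, grain_name) {
  wide_data %>%
    gather(key = month, value = price, -Year) %>%
    mutate(day = 1) %>%
    unite(col = date, "Year", "month", "day", sep = " ") %>%
    mutate(date = ymd(date),
           grain = grain_name)
}

# Made-up values: two years and two months, with one missing price
rye_wide <- tibble(Year = c(1549, 1550),
                   September = c(1.25, 1.38),
                   October = c(1.38, NA))

rye_tidied <- tidy_grain(rye_wide, "rye")
rye_tidied
```

Writing the steps once reduces the risk of labelling a data frame with the wrong grain when the code is copied and pasted.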
--

```r
oats_tidied <- read_csv("data-raw/oats.csv", na = "-") %>%
  gather(key = month, value = price, -Year) %>%
  mutate(day = 1) %>%
  unite(col = date, "Year", "month", "day", sep = " ") %>%
  mutate(date = ymd(date),
         grain = "oats")

wheat_tidied <- read_csv("data-raw/wheat.csv", na = "-") %>%
  gather(key = month, value = price, -Year) %>%
  mutate(day = 1) %>%
  unite(col = date, "Year", "month", "day", sep = " ") %>%
  mutate(date = ymd(date),
         grain = "wheat")
```

---

# Bind barley, oats, and wheat data

Bind the three data frames together with `bind_rows()` to create a `grain_prices` data frame.

```r
bind_rows(barley_tidied, oats_tidied, wheat_tidied)
```

```
# A tibble: 5,400 x 3
   date       price grain
   <date>     <dbl> <chr>
 1 1549-09-01  1.25 barley
 2 1550-09-01  1.38 barley
 3 1551-09-01  1.56 barley
 4 1552-09-01  1.75 barley
 5 1553-09-01  1.38 barley
 6 1554-09-01  1.63 barley
 7 1555-09-01   1.5 barley
 8 1556-09-01  1.88 barley
 9 1557-09-01    NA barley
10 1558-09-01  1.38 barley
# ... with 5,390 more rows
```

---

# Visualize the data

.pull-left[
```r
ggplot(data = grain_prices,
       aes(x = date, y = price, color = grain)) +
  geom_line() +
  labs(x = "Date",
       y = "Tournois pounds per setier",
       color = "Type of Grain")
```
]

.pull-right[
<img src="index_files/figure-html/grain-prices-1.png" height="500" style="display: block; margin: auto;" />
]

---
class: center, middle

# Fifth workflow completed

You can now tidy messy data.

---

# Overview of Digital Humanities Workflow with R

1. Find or create data that can go into a spreadsheet.

--

2. Create an R Project with RStudio and set up structure of documents.

--

3. Import data and explore the data. Create a script when you get something you want to save.

--

4. Create and save visuals.
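The workflow above can be sketched end to end in one short script. This is a minimal sketch rather than the tutorial's own code: the inline `letters_sample` data frame and its values are made up so the example runs without any files. In a real project the data would come from a `read_csv()` call and the plot would be written out with `ggsave()`.

```r
library(dplyr)
library(ggplot2)

# 1. Data that could live in a spreadsheet (made-up sample values)
letters_sample <- tibble(
  source = c("Haarlem", "Haarlem", "Antwerp", "Venice", "Antwerp", "Haarlem"),
  year = c(1585, 1585, 1585, 1586, 1586, 1586)
)

# 3. Explore: count letters per source, most frequent first
per_source <- letters_sample %>%
  group_by(source) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# 4. Create a plot from the summary
source_plot <- ggplot(per_source, aes(x = source, y = count)) +
  geom_bar(stat = "identity")

per_source
```

Keeping each step as a named object makes it easy to inspect intermediate results before saving anything.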
---

# Resources for learning more about R

- General
    - [Garrett Grolemund and Hadley Wickham, R for Data Science](https://r4ds.had.co.nz)
    - [List of R Manuals](http://colinfay.me/r-manuals/)
    - [Kieran Healy, Data Visualization for Social Science](http://socviz.co)
- GIS: Maps
    - [Introduction to GIS with R from my blog](https://jessesadler.com/post/gis-with-r-intro/)
    - [Lovelace, Nowosad, Muenchow, Geocomputation with R](https://geocompr.robinlovelace.net)
- Network Analysis
    - [Introduction to Network Analysis with R](https://jessesadler.com/post/network-analysis-with-r/)
    - [Katya Ognyanova, Static and dynamic network visualization with R](http://kateto.net/network-visualization)
- Text Analysis
    - [Silge and Robinson, Text Mining with R: A Tidy Approach](https://www.tidytextmining.com)

---
background-image: url("https://media.giphy.com/media/6tHy8UAbv3zgs/giphy.gif")
background-size: cover