Importing and Merging spreadsheets

27 octobre 2017

Two potential work scenarios

Scenario 1.

All collaborators work on the same online collaborative spreadsheet (Google sheets). We want to upload the google spreadsheet into R to perform the analysis.

Scenario 2.

Each collaborator works on different spreadsheet with the same format (template) and conventions (units). We want to merge all csv file into one final csv file to perform the analysis on the entire dataset.

Scenario 1.
All collaborators work on google spreads

Scenario 1. All collaborators work on google spreads

Login and list available worksheets

# load package
library(googlesheets)

# List available google sheets
# Ask you to authenticate in a browser
gs_ls()

R>>  # A tibble: 27 x 10
R>>                  sheet_title        author  perm version
R>>                        <chr>         <chr> <chr>   <chr>
R>>   1                   estran antoinetardif    rw     new
R>>   2     Meta_analyses_google    s.vissault    rw     new
R>>   3 Liste de diffusion écol… ecosherbrooke    rw     new
R>>   4 Payment Form_Workshop P… dalal.e.hanna    rw     new
R>>   5                   AGENDA ecosherbrooke    rw     new
R>>   6 R Workshop Facilitator … dalal.e.hanna    rw     new
R>>   7      R workshop feedback  zofia.taranu    rw     new
R>>   8 Sondage sur l'animation…     legagneux    rw     new
R>>   9 Déjeuner-café-rencontre… audrey.bourr…    rw     new
R>>  10           Notes - BIO109    s.vissault    rw     new
R>>  # ... with 17 more rows, and 6 more variables: updated <dttm>,
R>>  #   sheet_key <chr>, ws_feed <chr>, alternate <chr>, self <chr>,
R>>  #   alt_key <chr>

Scenario 1. All collaborators work on google spreads

Get your worksheet

worksheet <- gs_title("Meta_analyses_google")

R>>  Sheet successfully identified: "Meta_analyses_google"

List the sheets inside the worksheet

gs_ws_ls(worksheet)

R>>  [1] "Sheet1"

Scenario 1. All collaborators work on google spreads

Read the sheet needed to perform the analysis

data <- gs_read(ss=worksheet, ws = "Sheet1", verbose=FALSE)

R>>  Parsed with column specification:
R>>  cols(
R>>    genus = col_character(),
R>>    species = col_character(),
R>>    country = col_character(),
R>>    province = col_character(),
R>>    n = col_integer(),
R>>    plot_size = col_integer(),
R>>    doi = col_character()
R>>  )

Scenario 1. All collaborators work on google spreads

Transform the `dplyr` object as a conventional dataset object

(data <- as.data.frame(data))

R>>         genus species country province  n plot_size                     doi
R>>  1 Helianthus  annuus  canada   quebec 13       400 10.1111/1365-2745.12326
R>>  2 Helianthus  annuus  canada  ontario 20       300 10.1111/1365-2745.12325

And that's it, you're all set!

Scenario 1. All collaborators work on google spreads

Advantages:

Secure (dataset stays on the private domain)
Everyone can work on the same worksheet and avoid conflicts
Individual sheets are easily accessible

Scenario 2. Each collaborator works on a different spreadsheet

Note: For this scenario, I assume that headers are consistent among files.

Import a flat file (csv format for instance)

data <- read.csv('file.csv')

Import `.xlsx` files (requires java `sudo R CMD javareconf`):

install.packages('xlsx')
data <- read.xlsx('file.csv')

Scenario 2. Each collaborator works on a different spreadsheet

With 3 to 4 collaborators…

data1 <- read.csv('file1.csv')
data2 <- read.csv('file2.csv')
data3 <- read.csv('file3.csv')
data4 <- read.csv('file4.csv')

But… if the team is 15 members (15 files)… you cannot

Solution: You have to loop over the files

Multiple spreadsheets

Use the same headers, put all of them in a folder then…

# create a vector containing the paths to all files
files <- list.files('path/to/folder', pattern='^F.*csv$', full.names=TRUE)

# Create a empty list
ls_csv <- list()

# The loop populates each level of the list with the content of each file
for (i in 1:length(files)){
  print(files[i])
  ls_csv[[i]] <- read.csv(files[i])
}

# Finally, we apply the function rbind at each level of the list
data <- do.call('rbind', ls_csv)

Exercise

Create several fake files (up to 3 CSV files) with common headers
Save them in a same folder
Open a R script
Write down a function merging them (use the previous slide to help)

Note: Don't hesitate.. if you have any questions, we're here to help you!

Two potential work scenarios

Scenario 1. All collaborators work on google spreads

Scenario 1. All collaborators work on google spreads

Login and list available worksheets

Scenario 1. All collaborators work on google spreads

Get your worksheet

List the sheets inside the worksheet

Scenario 1. All collaborators work on google spreads

Read the sheet needed to perform the analysis

Scenario 1. All collaborators work on google spreads

Transform the dplyr object as a conventional dataset object

And that's it, you're all set!

Scenario 1. All collaborators work on google spreads

Advantages:

Scenario 2. Each collaborator works on a different spreadsheet

Scenario 2. Each collaborator works on a different spreadsheet

Note: For this scenario, I assume that headers are consistent among files.

Import a flat file (csv format for instance)

Import .xlsx files (requires java sudo R CMD javareconf):

Scenario 2. Each collaborator works on a different spreadsheet

With 3 to 4 collaborators…

But… if the team is 15 members (15 files)… you cannot

Solution: You have to loop over the files

Multiple spreadsheets

Use the same headers, put all of them in a folder then…

Exercise

Scenario 1.
All collaborators work on google spreads

Transform the `dplyr` object as a conventional dataset object

Import `.xlsx` files (requires java `sudo R CMD javareconf`):