inst/handouts/fundamentals/tablewrangling.Rmd

---
title: "Wrangling Tables"
author: "Steph Locke"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: journal
---
## Requirements 

 - necessary package: data.table (>=v1.9.5)
 - suggested package: optiRum
 
## SQL
### data.table is "SQL-like"
```
DT[i, j, by]
DT[WHERE | JOIN | ORDER, SELECT | UPDATE, GROUP]
```

### data.table behaves like a database
A data.table acts like an in-memory RDBMS:

- The result of a query is a table
- Can set clustered primary keys over single or multiple columns
- Can set non-clustered secondary keys over single or multiple columns
- Can perform a wide range of relational algebra
- Can do CRUD operations
- Think in columns / sets not rows!

### data.table differences to SQL Server
There are some differences that need to be mentioned:

- A primary key can have repeat values
- No constrained foreign keys
- Joins are done by *position in key* so tblA with key A, and tblB with key C,A would join on tblA[A] and tblB[C]
- It's inherently dynamic so vectors of column names etc. can be provided to it
- DELETE doesn't exist explicitly
- It's case-sensitive

## data.table cookbook 
### Single table basics
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
CREATE | data.table(...) setDT() | `data.table(a=1:3 , b=LETTERS[1:3])` `data.table(iris)`
PRIMARY KEY | data.table(...,key) setkey() | `data.table(a=1:3 , b=LETTERS[1:3], key="b")` `setkey(data.table(iris),Species)`
SELECT basic | DT[ , .( cols )] | `irisDT[ , .(Species, Sepal.Length)]`
SELECT alias | DT[ , .( a=col )] | `irisDT[ , .(Species, Length=Sepal.Length)]`
SELECT COUNT | DT[ , .N]  | `irisDT[ ,.N]`
SELECT COUNT DISTINCT | DT[ , uniqueN(cols)] |  `irisDT[ ,uniqueN(.SD)]`
SELECT aggregation | DT[ , .( sum(col) , .N )] | `irisDT[ , .(Count=.N, Length=mean(Sepal.Length))]`
WHERE exact on primary key| DT[value] DT[value, ] | `irisDT["setosa"]` `irisDT["setosa", .(Count=.N)]`
WHERE |  DT[condition] DT[condition, j, by] | `irisDT[Species=="setosa"]` `irisDT[Species=="setosa", .(Count=.N)]`
WHERE BETWEEN | DT[between(col, min, max)] DT[ col %between% c(min,max) ]| `irisDT[between(Sepal.Length, 1, 5)]` `irisDT[Sepal.Length %between% c(1,5)]`
WHERE LIKE | DT[like(col,pattern)] DT[ col %like% pattern ] | `irisDT[like(Species,"set")]` `irisDT[Species %like% "set"]`
ORDER asc. | DT[order(cols)] DT[order(cols), j, by] | `irisDT[order(Species)]`
ORDER desc. | DT[order(-cols)] DT[order(-cols), j, by] | `irisDT[order(-Species)]`
ORDER multiple | DT[order(cols)] DT[order(cols), j, by] | `irisDT[order(-Species, Petal.Width)]`
GROUP BY single | DT[i, j, by] | `irisDT[ ,.N, by=Species]`
GROUP BY multiple | DT[i, j, by] | `irisDT[ ,.N, by=.(Species,Width=Petal.Width)]`
TOP | head(DT, n) | `head(irisDT)`
HAVING | DT[i, j, by][condition] | `irisDT[ , .(Count=.N), by=Species][Count>25]` 
Sub-queries | DT[...][...][...] | `irisDT[ , .(Sepal.Length=mean(Sepal.Length)), by=Species][Sepal.Length>6, .(Species)]`

\* Uses `irisDT <- data.table(iris)`

### CRUD
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
INSERT | DT <- rbindlist(DT, newDT) | `irisDT<-rbindlist( irisDT, irisDT[1] )`
READ aka SELECT (see above) | DT[ , .( cols )] | `irisDT[ , .(Species, Sepal.Length)]`
UPDATE / ADD column | DT[ , a := b ] | `irisDT[ , Sepal.Area := Sepal.Width * Sepal.Length]`
UPDATE / ADD multiple columns | DT[ , \`:=\`(a = b, c = d) ] | ```r 'irisDT[ , \x60:=\x60(CreatedDate = Sys.Date(), User = "Steph")]'` ``
UPDATE / ADD multiple columns by reference | DT[ , (newcols):=vals ] | `irisDT[ , c("a","b"):=.(1,2)]`
DELETE | DT <- DT[!condition] | `irisDT <- irisDT[!(Species=="setosa" & Petal.Length>=1.5)]`
DROP table | DT <- NULL | `irisDT<-NULL`
DROP column | DT[,col:=NULL] | `iristDT[,Species:=NULL]`

\* Uses `irisDT <- data.table(iris)`

### Metadata
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
Structure | str(DT) | `str(irisDT)`
Column Names | colnames(DT) | `colnames(irisDT)`
Summary stats | summary(DT) | `summary(irisDT)`
Retrieve primary key info | key(DT) | `key(irisDT)`
List all data.tables | tables() | `tables()`

\* Uses `irisDT <- data.table(iris)`

### Relationships
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
INNER JOIN | Y[X, nomatch=0] | `lookupDT[irisDT,nomatch=0]`
LEFT JOIN | Y[X] | `lookupDT[irisDT]`
FULL JOIN| merge(X, Y, all=TRUE) | `merge(irisDT, lookupDT, all=TRUE)`
CROSS JOIN | optiRum::CJ.dt(X,Y) | `CJ.dt(irisDT, lookupDT)`
UNION ALL | rbindlist( list(X,Y), fill=TRUE ) | `rbindlist( list(irisDT, lookupDT), fill=TRUE )`
UNION | unique( rbindlist( list(X,Y), fill=TRUE ) ) | `unique( rbindlist( list(irisDT, lookupDT), fill=TRUE ) )`
JOIN and AGGREGATE | Y[X, cols, by] | `lookupDT[irisDT,.(count=.N),by=Band]`

\* Uses:
```
irisDT   <- data.table(iris, key="Species")
lookupDT <- data.table(Species=c("setosa", "virginica", "Blah"), Band=c("A", "B", "A"), key="Species")
``` 

### Intermediate tasks
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
SELECT dynamically | DT[ , colnames , with=FALSE] , DT[ , .SD , .SDcols=colnames| `cols<-colnames(irisDT); irisDT[ , cols, with=FALSE]` `cols<-colnames(irisDT); irisDT[ , .SD, .SDcols=colnames]`
GROUP BY dynamically | DT[ , ..., by=colnames] | `irisDT[,.N,by=c("Species")]` 
GROUP BY, ORDER BY group | DT[ , ... , keyby] |`irisDT[,.N,keyby=Species]`  
UPDATE / ADD column of summary stat | DT[ , a := b ] | `irisDT[ , All.SL.Mean:=mean(Sepal.Length)]`
UPDATE / ADD column by group | DT[ , a := b, by] | `irisDT[ , Species.SL.Mean:=mean(Sepal.Length), by=Species]`
TOP by group | DT[ , head(.SD), by] | `irisDT[ , head(.SD,2) , by=Species]`
Largest record | DT[ which.max(col) ] | `irisDT[ which.max(Sepal.Length) ]`
Largest record by group | DT[ , .SD[ which.max(col) ], by] | `irisDT[ , .SD[ which.max(Sepal.Length) ], by=Species]`
Cumulative total | DT[ , cumsum(col) ] | `irisDT[ , cumsum(Sepal.Width)]`
NEGATIVE SELECT | DT[ , .SD, .SDcols=-"colname"] | `irisDT[ , .SD, .SDcols=-"Species"]`
RANK | DT[ , frank(col) ] | `irisDT[ , frank(Sepal.Length,ties.method="first")]`
AGGREGATE multiple columns | DT[ , lapply(.SD, sum)] | `irisDT[ , lapply(.SD,sum), .SDcols=-"Species"]`
AGGREGATE multiple columns by group | DT[ , lapply(.SD, sum), by] | `irisDT[ , lapply(.SD,sum), by=Species]`
COUNT DISTINCT multiple columns by group | DT[ , lapply(.SD, uniqueN), by] | `irisDT[ , lapply(.SD,uniqueN), by=Species]`
COUNT NULL multiple columns by group | DT[ , lapply(.SD, function(x) sum(is.na(x))), by] | `irisDT[ , lapply(.SD,function(x) sum(is.na(x))), by=Species]`
PIVOT data - to single value column | melt(DT,...) | `melt(irisDT)`
PIVOT data -  to aggregate | dcast(DT, a~b, function) | `dcast(melt(irisDT), Species ~ variable, sum)`
Convert a large data.frame or list | setDT() |  `iris<-iris; setDT(iris)`
ROW_NUMBER | DT[ , .I] | `irisDT[ , .I]`
GROUP number | DT[, .GRP ,by] | `irisDT[ , .GRP, by=Species]`  

\* Uses `irisDT <- data.table(iris)`

### Advanced tasks
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
GROUP BY each new incidence of group | DT[ , cols , by=(col, rleid(col))] | `irisDT[order(Sepal.Length), .N, by=.(Species, rleid(Species))]`
Calculate using (previous/next) N row | DT[ ,  shift( cols, n)] | `irisDT[ , prev.Sepal.Length:=shift(Sepal.Length), by=Species ]`
ORDER underlying table | setorder() | `setorder(irisDT,Species)`
JOIN & GROUP by keys | X[Y, .N, by=.EACHI] | `irisDT[lookupDT, .N, by=.EACHI]`
TRANSPOSE data.table | transpose(DT) | `transpose(irisDT)`
Split string to columns | DT[, tstrsplit(charCol, pattern) ] | `irisDT[ , tstrsplit(as.character(Species),"e")]`


\* Uses:
```
irisDT   <- data.table(iris, key="Species")
lookupDT <- data.table(Species=c("setosa", "virginica", "Blah"), Band=c("A", "B", "A"), key="Species")
``` 

## More resources
 - [data.table wiki](https://github.com/Rdatatable/data.table/wiki/Getting-started)