This repository was archived by the owner on Feb 8, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 27
/
Copy pathtablewrangling.Rmd
153 lines (133 loc) · 7.8 KB
/
tablewrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
---
title: "Wrangling Tables"
author: "Steph Locke"
date: "`r Sys.Date()`"
output:
html_document:
theme: journal
---
## Requirements
- necessary package: data.table (>=v1.9.5)
- suggested package: optiRum
## SQL
### data.table is "SQL-like"
```
DT[i, j, by]
DT[WHERE | JOIN | ORDER, SELECT | UPDATE, GROUP]
```
### data.table behaves like a database
A data.table acts like an in-memory RDBMS:
- The result of a query is a table
- Can set clustered primary keys over single or multiple columns
- Can set non-clustered secondary keys over single or multiple columns
- Can perform a wide range of relational algebra
- Can do CRUD operations
- Think in columns / sets not rows!
### data.table differences to SQL Server
There are some differences that need to be mentioned:
- A primary key can have repeat values
- No constrained foreign keys
- Joins are done by *position in key* so tblA with key A, and tblB with key C,A would join on tblA[A] and tblB[C]
- It's inherently dynamic so vectors of column names etc. can be provided to it
- DELETE doesn't exist explicitly
- It's case-sensitive
## data.table cookbook
### Single table basics
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
CREATE | data.table(...) setDT() | `data.table(a=1:3 , b=LETTERS[1:3])` `data.table(iris)`
PRIMARY KEY | data.table(...,key) setkey() | `data.table(a=1:3 , b=LETTERS[1:3], key="b")` `setkey(data.table(iris),Species)`
SELECT basic | DT[ , .( cols )] | `irisDT[ , .(Species, Sepal.Length)]`
SELECT alias | DT[ , .( a=col )] | `irisDT[ , .(Species, Length=Sepal.Length)]`
SELECT COUNT | DT[ , .N] | `irisDT[ ,.N]`
SELECT COUNT DISTINCT | DT[ , uniqueN(cols)] | `irisDT[ ,uniqueN(.SD)]`
SELECT aggregation | DT[ , .( sum(col) , .N )] | `irisDT[ , .(Count=.N, Length=mean(Sepal.Length))]`
WHERE exact on primary key| DT[value] DT[value, ] | `irisDT["setosa"]` `irisDT["setosa", .(Count=.N)]`
WHERE | DT[condition] DT[condition, j, by] | `irisDT[Species=="setosa"]` `irisDT[Species=="setosa", .(Count=.N)]`
WHERE BETWEEN | DT[between(col, min, max)] DT[ col %between% c(min,max) ]| `irisDT[between(Sepal.Length, 1, 5)]` `irisDT[Sepal.Length %between% c(1,5)]`
WHERE LIKE | DT[like(col,pattern)] DT[ col %like% pattern ] | `irisDT[like(Species,"set")]` `irisDT[Species %like% "set"]`
ORDER asc. | DT[order(cols)] DT[order(cols), j, by] | `irisDT[order(Species)]`
ORDER desc. | DT[order(-cols)] DT[order(-cols), j, by] | `irisDT[order(-Species)]`
ORDER multiple | DT[order(cols)] DT[order(cols), j, by] | `irisDT[order(-Species, Petal.Width)]`
GROUP BY single | DT[i, j, by] | `irisDT[ ,.N, by=Species]`
GROUP BY multiple | DT[i, j, by] | `irisDT[ ,.N, by=.(Species,Width=Petal.Width)]`
TOP | head(DT, n) | `head(irisDT)`
HAVING | DT[i, j, by][condition] | `irisDT[ , .(Count=.N), by=Species][Count>25]`
Sub-queries | DT[...][...][...] | `irisDT[ , .(Sepal.Length=mean(Sepal.Length)), by=Species][Sepal.Length>6, .(Species)]`
\* Uses `irisDT <- data.table(iris)`
### CRUD
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
INSERT | DT <- rbindlist(DT, newDT) | `irisDT<-rbindlist( irisDT, irisDT[1] )`
READ aka SELECT (see above) | DT[ , .( cols )] | `irisDT[ , .(Species, Sepal.Length)]`
UPDATE / ADD column | DT[ , a := b ] | `irisDT[ , Sepal.Area := Sepal.Width * Sepal.Length]`
UPDATE / ADD multiple columns | DT[ , \`:=\`(a = b, c = d) ] | ```r 'irisDT[ , \x60:=\x60(CreatedDate = Sys.Date(), User = "Steph")]'` ``
UPDATE / ADD multiple columns by reference | DT[ , (newcols):=vals ] | `irisDT[ , c("a","b"):=.(1,2)]`
DELETE | DT <- DT[!condition] | `irisDT <- irisDT[!(Species=="setosa" & Petal.Length>=1.5)]`
DROP table | DT <- NULL | `irisDT<-NULL`
DROP column | DT[,col:=NULL] | `iristDT[,Species:=NULL]`
\* Uses `irisDT <- data.table(iris)`
### Metadata
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
Structure | str(DT) | `str(irisDT)`
Column Names | colnames(DT) | `colnames(irisDT)`
Summary stats | summary(DT) | `summary(irisDT)`
Retrieve primary key info | key(DT) | `key(irisDT)`
List all data.tables | tables() | `tables()`
\* Uses `irisDT <- data.table(iris)`
### Relationships
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
INNER JOIN | Y[X, nomatch=0] | `lookupDT[irisDT,nomatch=0]`
LEFT JOIN | Y[X] | `lookupDT[irisDT]`
FULL JOIN| merge(X, Y, all=TRUE) | `merge(irisDT, lookupDT, all=TRUE)`
CROSS JOIN | optiRum::CJ.dt(X,Y) | `CJ.dt(irisDT, lookupDT)`
UNION ALL | rbindlist( list(X,Y), fill=TRUE ) | `rbindlist( list(irisDT, lookupDT), fill=TRUE )`
UNION | unique( rbindlist( list(X,Y), fill=TRUE ) ) | `unique( rbindlist( list(irisDT, lookupDT), fill=TRUE ) )`
JOIN and AGGREGATE | Y[X, cols, by] | `lookupDT[irisDT,.(count=.N),by=Band]`
\* Uses:
```
irisDT <- data.table(iris, key="Species")
lookupDT <- data.table(Species=c("setosa", "virginica", "Blah"), Band=c("A", "B", "A"), key="Species")
```
### Intermediate tasks
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
SELECT dynamically | DT[ , colnames , with=FALSE] , DT[ , .SD , .SDcols=colnames| `cols<-colnames(irisDT); irisDT[ , cols, with=FALSE]` `cols<-colnames(irisDT); irisDT[ , .SD, .SDcols=colnames]`
GROUP BY dynamically | DT[ , ..., by=colnames] | `irisDT[,.N,by=c("Species")]`
GROUP BY, ORDER BY group | DT[ , ... , keyby] |`irisDT[,.N,keyby=Species]`
UPDATE / ADD column of summary stat | DT[ , a := b ] | `irisDT[ , All.SL.Mean:=mean(Sepal.Length)]`
UPDATE / ADD column by group | DT[ , a := b, by] | `irisDT[ , Species.SL.Mean:=mean(Sepal.Length), by=Species]`
TOP by group | DT[ , head(.SD), by] | `irisDT[ , head(.SD,2) , by=Species]`
Largest record | DT[ which.max(col) ] | `irisDT[ which.max(Sepal.Length) ]`
Largest record by group | DT[ , .SD[ which.max(col) ], by] | `irisDT[ , .SD[ which.max(Sepal.Length) ], by=Species]`
Cumulative total | DT[ , cumsum(col) ] | `irisDT[ , cumsum(Sepal.Width)]`
NEGATIVE SELECT | DT[ , .SD, .SDcols=-"colname"] | `irisDT[ , .SD, .SDcols=-"Species"]`
RANK | DT[ , frank(col) ] | `irisDT[ , frank(Sepal.Length,ties.method="first")]`
AGGREGATE multiple columns | DT[ , lapply(.SD, sum)] | `irisDT[ , lapply(.SD,sum), .SDcols=-"Species"]`
AGGREGATE multiple columns by group | DT[ , lapply(.SD, sum), by] | `irisDT[ , lapply(.SD,sum), by=Species]`
COUNT DISTINCT multiple columns by group | DT[ , lapply(.SD, uniqueN), by] | `irisDT[ , lapply(.SD,uniqueN), by=Species]`
COUNT NULL multiple columns by group | DT[ , lapply(.SD, function(x) sum(is.na(x))), by] | `irisDT[ , lapply(.SD,function(x) sum(is.na(x))), by=Species]`
PIVOT data - to single value column | melt(DT,...) | `melt(irisDT)`
PIVOT data - to aggregate | dcast(DT, a~b, function) | `dcast(melt(irisDT), Species ~ variable, sum)`
Convert a large data.frame or list | setDT() | `iris<-iris; setDT(iris)`
ROW_NUMBER | DT[ , .I] | `irisDT[ , .I]`
GROUP number | DT[, .GRP ,by] | `irisDT[ , .GRP, by=Species]`
\* Uses `irisDT <- data.table(iris)`
### Advanced tasks
Task | Generic syntax | Example(s)\*
------------- | ------------- | -------------
GROUP BY each new incidence of group | DT[ , cols , by=(col, rleid(col))] | `irisDT[order(Sepal.Length), .N, by=.(Species, rleid(Species))]`
Calculate using (previous/next) N row | DT[ , shift( cols, n)] | `irisDT[ , prev.Sepal.Length:=shift(Sepal.Length), by=Species ]`
ORDER underlying table | setorder() | `setorder(irisDT,Species)`
JOIN & GROUP by keys | X[Y, .N, by=.EACHI] | `irisDT[lookupDT, .N, by=.EACHI]`
TRANSPOSE data.table | transpose(DT) | `transpose(irisDT)`
Split string to columns | DT[, tstrsplit(charCol, pattern) ] | `irisDT[ , tstrsplit(as.character(Species),"e")]`
\* Uses:
```
irisDT <- data.table(iris, key="Species")
lookupDT <- data.table(Species=c("setosa", "virginica", "Blah"), Band=c("A", "B", "A"), key="Species")
```
## More resources
- [data.table wiki](https://github.com/Rdatatable/data.table/wiki/Getting-started)