-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy path060-strings.Rmd
188 lines (161 loc) · 9.45 KB
/
060-strings.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
# Strings
## Regular Expressions
### Intro
Resources:
More detailed phone number example: https://stackoverflow.com/questions/16699007/regular-expression-to-match-standard-10-digit-phone-number
Excellent cheat sheet: https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
More than you ever wanted: https://www.regular-expressions.info/tutorial.html
A great package for regex in r is Hadley Wickham's 'stringr' package. We'll be using this often.
```{r}
library(stringr)
```
The main idea behind regular expressions (or regex) is to find blocks of text that match a certain pattern. This pattern can be as simple as a single letter or number, or as complicated as an entire address. Some simple examples.
```{r}
allchars = '1234567890abcdefghijklmnopqrstuvwxyz'
str_extract_all(allchars, '1')
str_extract_all(allchars, '12')
str_extract_all(allchars, '13')
str_extract_all(allchars, 'a')
```
A character class is a set of characters, any one of which will match.
```{r}
str_extract_all(allchars, '[12345]')
str_extract_all(allchars, '[abcde]')
str_extract_all(allchars, '[[:digit:]]')
```
What if you wanted all numbers? There are a number of built in classes to save on typing:
'\\d' is all digits
'[[:lower:]]' is all lower case letters
'\\s' is all white space
'.' is anything at all and is known as the wildcard
```{r}
str_extract_all(allchars, '[[:digit:]]')
str_extract_all(allchars, '[[:lower:]]')
str_extract_all(allchars, '\\s')
```
We'll use a motivating example to introduce some other thing you can do with regular expressions. Let's say you wanted to extract all the phone numbers in a body of text.
```{r}
exampleText = "Some phone numbers are 7608675309, perhaps also 403-596-4038. Some people use periods in their numbers, so 402.367.5039. We can also have things like 304 385 1029 or (760)-581-3957. How can we extract all of these?"
```
We'll start simple and try to build expressions that are general enough to get all the numbers in the above paragraph. The first number is the easiest, how can we match on any 10 numbers?
```{r}
str_extract_all(exampleText, '\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d')
```
More succinctly we can use a quantifier to match a given number of times: {n} will match what it preceedes it n times.
```{r}
str_extract_all(exampleText, '\\d{10}')
```
Other useful quantifiers are \* to match at least 0 times, + to match at least once, and {n,m} to match between n and m times. We can use \* to get the first two numbers.
```{r}
str_extract_all(exampleText, '\\d{3}-*\\d{3}-*\\d{4}')
```
Sometimes the number groups are separated by -, sometimes by '.'. What happens if we try to get the number separated by periods?
```{r}
str_extract_all(exampleText, '\\d{3}.*\\d{3}.*\\d{4}')
```
What happened? Since the period is the wildcard it matched all kinds of things. If we want to treat the period literally, instead of as a special character, we need to escape it with double backslashes:
```{r}
str_extract_all(exampleText, '\\d{3}\\.*\\d{3}\\.*\\d{4}')
```
Now it's treating period as a literal character. We can combine the expression matching period to the one matching - using character classes:
```{r}
str_extract_all(exampleText, '\\d{3}[-\\.]*\\d{3}[-\\.]*\\d{4}')
```
The pattern '[-\\.]*' will match either - or period at least 0 times. If you epect phone numbers with some other delimiter, you can toss it in that character class just the same. We can use the same idea to capture the space separated one.
```{r}
str_extract_all(exampleText, '\\d{3}[-\\.\\s]*\\d{3}[-\\.\\s]*\\d{4}')
```
The last challenge is getting the number with the parentheses. We can use the same idea. Note that since parenthese are a special character we need to escape them.
```{r}
str_extract_all(exampleText, '\\(*\\d{3}\\)*[-\\.\\s]*\\d{3}[-\\.\\s]*\\d{4}')
```
### Basic Web Scraping Example
Regular expressions are very useful in the context of web scraping. This example is from a small scraping job we did on the Pulaski Board of Supervisors website to extract their meeting minutes. First, we should have a look at their website to get a lay of the land. It's at: http://www.pulaskicounty.org/Board-of-Supervisors.html
```{r}
library(XML)
rooturl = "http://www.pulaskicounty.org/Board-of-Supervisors.html"
bosMainPage = htmlParse(rooturl)
bosMainPage
links = xpathSApply(bosMainPage, "//a/@href")
links
```
The above code downloads the html for the Board's website and saves it in bosMainPage. The next line extracts all the links leading away from that page. It looks like a promising place to look is in the links such as 'Board-of-Supervisors-Minutes-1992.html'. We can get just those links with the following:
```{r}
minutesLinks = grep("Minutes", links, value = T)
fullMinutesUrl = paste0("http://www.pulaskicounty.org/", minutesLinks)
fullMinutesUrl
```
The function grep is a base r function that returns strings which contain a pattern. We used it above to take all the urls on the main page and extract only the ones which contain 'Minutes', since those are the ones that contain links to the meeting minutes.
Now we have all the urls to the pages where the minutes are kept. To work with all of them we could use a loop or an apply function, but we'll just do one for illustration. We want to go to the page, find all the pdf's with the minutes, and download them. A natural idea is to find everything with '.pdf', but observe:
```{r}
minutePage2016 = htmlParse(fullMinutesUrl[2])
pdfLinks = grep(".pdf", xpathSApply(minutePage2016, "//a/@href"), value = T)
pdfLinks
```
It's always wise to make sure your expressions are getting you the things you want. Here, we have all the meeting pdfs, but also some extra page images. In this case, we can match like so:
```{r}
pdfLinks = grep("Minutes and Agendas", xpathSApply(minutePage2016, "//a/@href"), value = T)
fullPdfUrl = paste0("http://www.pulaskicounty.org/", pdfLinks)
fullPdfUrl
```
By looping through all the years above and combining the results, we can get a full list of all the urls for the Pulaski Board of Supervisors minutes pdfs. Then, it's a simple application of a loop to go in and download them all for later work.
### Lookaround Examples
```{r}
naicsAndCompanies = c("Name: Herbalife Ltd; NAICS: 325411, 325412, 424490",
"Name: Sanofi-Aventis SA; NAICS: 325412",
"Name: Abbott Laboratories; NAICS: 325411, 325412, 325413, 325620, 334516, 339112",
"Name: Alexion Pharmaceuticals Inc; NAICS: 325412",
"Name: Purdue Pharma LP; NAICS: 325412; Name: Transcept Pharmaceuticals Inc; NAICS: 325412",
"Name: Sanofi-Aventis SA; NAICS: 325412")
naicsAndCompanies
```
We want a table with company names in one column and NAICS codes in another column. How can we extract the components we want?
Let's focus on the first entry for now. Notice that a semicolon separates the name from the NAICS codes. We can use that to split the string using lookarounds.
(?=X) is the lookahead operator. It looks AHEAD for the pattern X; that is, it matches things that come BEFORE it
(?<=X) is the lookbehind operator. It looks BEHIND for the pattern X; that is, it matches things that come AFTER it
(X) contains the pattern X that comes before the look ahead or after the lookbehind
'.\*' means match the wildcard '.' any number of times '*'.
```{r}
firstRegexCase = naicsAndCompanies[[1]]
firstRegexCase
company = str_extract(firstRegexCase, "(.*?)(?=;)")
naicsCodes = str_extract(firstRegexCase, "(?<=NAICS: )(.*)")
company
naicsCodes
```
We're getting there, but not quite. We don't want that annoying 'Name: ' in front of the company. We can achieve that with the following. The lookbehind '(?<=Name: )' will find a place in the string that matches the pattern 'Name: ' and then match what comes next, which in this case is anything (.*).
```{r}
cleanerCompany = str_extract(company, "(?<=Name: )(.*)")
cleanerCompany
```
Note that we could have removed the 'Name: ' directly using a function called gsub, but the lookbehind will be useful later. gsub replaces the pattern in the first argument with the pattern in the second wherever it finds it in the third argument.
```{r}
gsub("Name: ", "", company)
```
Now we can tackle the whole vector. We'll use the same idea of separating names from codes using semicolons, but with two new challenges.
```{r}
naicsAndCompanies[[5]]
```
This entry has multiple companies in the same line. To get at both of them, we need to use str_extract_all() instead of just str_extract().
```{r}
str_extract_all(naicsAndCompanies, "(?<=Name: )(.*)(?=;)")
```
Whoops, looks like we didn't split the 5th string at the right place! This is because the regular expression by default tries to match the longest string possible. This is called greedy matching. We can disable it by adding '?' after the wild card
```{r}
companies = str_extract_all(naicsAndCompanies, "(?<=Name: )(.*?)(?=;)")
companies
```
Now let's do the same for the NAICS codes.
```{r}
str_extract_all(paste0(naicsAndCompanies), "(?<=NAICS: )(.*?)(?=;)")
```
What's going on here? This code looks for stuff between 'NAICS: ' and ';', but except for the 5th line there's never a ';' after 'NAICS: '. To get around this, we can use a logical OR operator, along with a special symbol meaning 'end of the string'. Here, the expression '(;|$)' means 'match either ; OR the end of the line;.
```{r}
allNaics = str_extract_all(paste0(naicsAndCompanies), "(?<=NAICS: )(.*?)(?=(;|$))")
allNaics
```
Finally, we can put it all together and we have a clean company by naics table:
```{r}
companyNaicsList = data.frame(company = unlist(companies), naicsCodes = unlist(allNaics))
companyNaicsList
```