-
Notifications
You must be signed in to change notification settings - Fork 47
/
Copy path03-strings.Rmd
399 lines (258 loc) · 13.9 KB
/
03-strings.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
---
title: "Working with text using the stringr package"
author: ""
date: ""
output: html_document
keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Working with text can be difficult. Often data collected as text is unstructured, of variable length, and prone to differences in how it is collected. You can therefore expect to do rather more cleaning of text data before it is in a shape that is useful for analysis.
This notebook is a minimal introduction to working with text using the **stringr** package, which has a lot of useful functions for working with text. We also look at "regular expressions", a language for matching patterns in text that forms the basis for many of the functions in **stringr**.
In this lesson we'll:
1. Learn about writing strings, "escaping" certain characters with the backspace, and some basic functions like case conversion and string concatenation.
2. Learn the basics of regular expressions, a powerful tool for matching patterns in strings.
3. See how regular expressions can be used together with various **stringr** functions to do a number of useful things:
+ detect which strings match a pattern
+ extract the part of a string that matches a pattern
+ replace the matched part of a string with a new string
+ split up strings by some criterion
For further reading, I strongly recommend [this chapter](http://r4ds.had.co.nz/strings.html) in R4DS, which covers the **stringr** package in quite a bit more detail than I do here. If you are new to regular expressions, there are a number of excellent tutorials online. Start with [this one](https://regexone.com/), and further information is [here](http://www.python-course.eu/re.php) and [here](http://www.regular-expressions.info/tutorial.html). For a tutorial that is slightly more R focussed, there is a Swirl course on Regular Expressions [here](https://github.com/jonmcalder/Regular_Expressions). There are many good "cheatsheets" to help you as you go, for example those by [Pete Freitag](http://www.petefreitag.com/cheatsheets/regex/) or [Ian Kopacka](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf). The bible of regular expressions is apparently [this book](https://www.amazon.com/dp/0596528124/), if you really want to get into it!
First we load the packages we'll need for the notebook.
```{r}
library(tidyverse)
library(stringr)
```
## String basics
Below we enter a string and take a look at some basic information about the stored object.
```{r}
string <- "This is a string"
string
str(string)
length(string) # note length is 1
nchar(string) # "length" of the string, base R
str_length(string) # "length" of the string, stringr
```
Strings are created using single or double quotes. If we want to include literal quotes within the string, we need to use `\"`. This is a common way of including symbols that have a special meaning in the context of strings (we'll see much more of this with regular expressions). Try pasting the following lines into R/RStudio - you'll see that `string` on its own prints out the string itself, not the printed representation of it (for which we need `writeLines`).
```{r}
# "Escaping" the " character
string <- "I told you \"This is a string\""
string
writeLines(string)
```
If we escape certain characters that have a special meaning by putting a `\` in front, how do we get a backslash character into a string? We have to escape the escape character!
```{r}
writeLines("This is a \ string")
writeLines("This is a \\ string")
```
There are a few other "special (meta)characters" you need to remember:
* `\t` for "tab"
* `\r` for "carriage return"
* `\n` for "new line"
```{r}
# tab
string <- "This \t is a \nstring"
writeLines(string)
# carriage return and new line
string <- "This is a string\r\nThis is another"
writeLines(string)
```
We can combine strings with `str_c()`.
```{r}
str_c("x", "y", sep = ",")
str_c("x", "y", sep = " and ")
```
To extract parts of strings using their location in the string, we use `str_sub()`. The numbers passed as arguments to `str_sub()` indicate the start and end position of the part we want to extract. Negative numbers are counted backwards from the end of the word.
```{r}
web <- "http://ewn.co.za/"
# first four characters
str_sub(web, 1, 4)
# last six characters
str_sub(web, -6, -1)
# all but first and last characters
str_sub(web, 2, -2)
```
There are a few other useful things we can do with **stringr**. For example, we can change case. You need to be a bit careful here, because this conversion depends on your language settings (your [locale](http://r4ds.had.co.nz/strings.html#locales)). You can set this within the functions below.
```{r}
# convert to upper case
str_to_upper("This is a string")
```
```{r}
# convert to lower case
str_to_lower("This Is A StrinG")
```
We can also remove whitespace at the beginning and end of a string.
```{r}
str_trim(" This is a string \t\r\n")
```
Note that with all these functions, if we have a vector of strings, the functions are applied to each element of the vector.
```{r}
str_to_upper(c("String number 1 ", "String number 2", "String 3"))
```
## Matching patterns with regular expressions
A regular expression is a formal language that is very useful in describing patterns in strings.
```{r}
x <- c("apple", "banana", "pear", "28.50", "Probability 0")
```
> Note! **stringr** has very useful functions `str_view()` and `str_view_all()` that allow you to view a HTML rendering of a pattern matched by a regular expression. This would show the whole string, with a grey block indicating the match. Unfortunately, these functions don't work in jupyter notebook, so in the notebook I've used `str_extract()` and `str_extract_all()` below. These functions don't do the same thing as `str_view()`: they just *extract* the matched part of the string. This is much less illustrative of what's going on, so I suggest you rather do this next section in R, and replace `str_extract()` with `str_view()` (I've already done this in the .Rmd file)
The basic idea is to match the pattern given in the regular expression. The simplest case is where we want to match an exact string. Note that the default behaviour is to find only the first match. We can get all matches with `str_extract_all()`, and many **stringr** functions have `_all()` variants. We'll pick this point up again later, once we're more familiar with regular expressions.
```{r}
str_view(x, "a")
str_view(x, "ap")
# note 'a' is short for 'regex("a")'
str_view(x, regex("a"))
```
### Special metacharacters: `\d`, `\w`, `\s`
There are a number of "special (meta)characters" that can be used in regular expressions to match a certain type of character. Common ones include:
* `\d`: any digit
* `\w`: any alphanumeric character
* `\s`: any whitespace
These can all be negated by capitalizing them:
* `\D`: any NON-digit
* `\W`: any NON-alphanumeric character
* `\S`: any NON-whitespace
Note that below we need to add a second backslash. The reason you need two is because both the string parser used by **stringr** and the regular expression itself support escapes. Therefore, `\\d` (for example) is parsed to `\d` by the string parser and then interpreted by the regular expression parser. This is just a feature of **stringr**: you need to add a second backslash to anything that you would ordinarily use one for in a regular expression.
```{r}
str_view(x, "\\d")
str_view(x, "\\w")
str_view(x, "\\s")
```
```{r}
str_view(x, "\\D")
str_view(x, "\\W")
str_view(x, "\\S")
```
### "Or" matches with `[]`, matching anything with `.`, and negation with `^`
If we want to match a particular string, say "ap", we use that as the regular expression.
```{r}
str_view(x, "ap")
```
If we want to match any a *or* p, we use `[ap]`. The square brackets indicate an *or* kind of match. Note that the below reads "match a single letter that is either *a* or *p*.
```{r}
str_view(x, "[ap]")
```
We can use the square brackets notation to build up a more complex pattern. For example below we look for two letters, the first either *a* or *p*, the second either *p* or *e*.
```{r}
str_view(x, "[ap][pe]")
```
A useful special character here is `.`, which indicates "anything". Note that below we match a two-letter pattern "*a* followed by anything".
```{r}
str_view(x, "a.")
```
Another useful special character is `^`, which denotes negation when used within square brackets (it can also mean the beginning of a string, when not within brackets, see below!). Below we match a two-letter sequence, the first letter anything but *b*, the second letter an *a*.
```{r}
str_view(x, "[^b]a")
```
Finally, some other useful tools are `[A-Z]` (any capital letter), `[a-z]` (any lowercase letter), `[0-9]` any digit. These can also be combined.
```{r}
str_view(x, "[A-Z]")
str_view(x, "[a-z]")
str_view(x, "[0-9]")
str_view(x, "[A-Za-z0-9]")
```
### Matching repetitions
We've looked in fair detail at various ways of matching one or two characters. We can extend these methods to longer patterns, but this will quickly become unwieldy if we need to write out all the possible options that each character can take on. Some important tools allow us to specify pattern **repetitions**, allowing us to express lengthy patterns concisely.
Suppose we wanted to match a sequence of 3 alphanumeric characters. We could say (the long way),
```{r}
str_view(x, "\\w\\w\\w")
```
Or we can use curly braces to specify the number of matches we want:
```{r}
str_view(x, "\\w{3}")
```
This gives us considerable flexibility. For example:
```{r}
str_view(x, "\\w{2,3}") # any alphanum between 2 and 3 times
str_view(x, "\\w{2,}") # any alphanum at least 2
str_view(x, "[an]{2}") # two characters, each of which can be an a or n
```
Three other tools assist with controlling how many times a pattern matches. These are:
* `+`: one or more matches
* `*`: zero or more matches
* `?`: zero or one match
These special characters are always applied *after* the pattern they refer to.
The `+` denotes one or more matches. For example `a+` matches one or more `a`'s, `[abc]+` matches one or more of any `a`, `b`, or `c`.
```{r}
str_view(x, "a[na]+")
```
A `*` denotes *zero* or more matches.
```{r}
str_view(x, "a[na]*")
```
Finally, the `?` metacharacter matches either zero or one of the preceding character or group, and indicates an "optional" character that is particularly useful in distinguising between different word spelling. For example, the pattern `ab?c` indicates that the `b` is optional and will thus match both `abc` and `ac`.
```{r}
str_view("apple", "ap?p")
str_view("apple", "an?p")
```
### Specifying the start and end of a string
The start of a string can be matched with `^`, the end with `$`:
```{r}
str_view(x, "^a")
str_view(x, "a$")
str_view(x, "^apple$")
```
### Specifying groupings
Parentheses can be used in a similar way to mathematical operations, to specify natural groupings and indicate precedence for certain operations over others.
```{r}
str_view("I love cats", "I love (cats|dogs)")
str_view("I love birds", "I love (cats|dogs)")
str_view("I love dogs", "I love cats|dogs")
```
## Other useful **stringr** functions
The previous section gives you a number of ways of using regular expressions to match patterns. These use the language of regular expressions, and thus aren't really R-dependent. This section is about using the patterns that you extract to do further useful things. The functions that we discuss in this section are all **stringr** functions.
### Determine which strings match a pattern: `str_detect()`
The `str_detect()` function can be used to determine whether an input string contains a specified pattern. Where the input is a vector, `str_detect()` checks whether each element in turn contains the pattern i.e. it returns a logical vector of the same length as the input.
```{r}
str_detect(x, "\\d")
```
A useful way to use `str_detect()` is to extract those elements of an input vector that do (or don't) match some pattern.
```{r}
# elements of x containing a digit
x[str_detect(x, "\\d")]
# elements of x not containing a digit
x[!str_detect(x, "\\d")]
```
Rather than the above, **stringr** provides a useful wrapper function `str_subset()` that extracts the elements of *x* that match a given pattern.
```{r}
# gives same result as `x[str_detect(x, "\\d")]`
str_subset(x, "\\d")
# extracts indices of elements matching a pattern
str_which(x, "\\d")
```
Another useful helper function is `str_count()`, which counts up the number of times a pattern is matched in a string or vector of strings.
```{r}
# number of digits in each element of x
str_count(x, "\\d")
```
### Extract content of matches: `str_extract()` and `str_extract_all()`
The `str_extract()` function extracts the actual text of a match.
```{r}
str_extract(x, "a[nb]")
```
Note that this just returns the *first* match. If we want to return *all* matches, we need to use `str_extract_all()`.
```{r}
str_extract_all(x, "a[nb]")
```
### Replacing matches: `str_replace()` and `str_replace_all()`
One thing we often want to do is replace a matched pattern with another string. We do this using `str_replace()` and `str_replace_all()`.
```{r}
str_replace(x,"[aeiou]","-")
str_replace_all(x,"[aeiou]","-")
```
### Split a string based on a match: `str_split()`
Finally, we can split a string up by specifying some separating pattern. The examples below use, respectively, a single space, the word "a", and a new line with carriage return to specify breakpoints.
```{r}
str_split("This is a string", " ")
str_split("This is a string", "\\sa\\s")
str_split("This is a string\r\nThis is another", "\r\n")
```
## Exercises
1. Do the regular expression tutorial at https://regexone.com/
2. All of the exercises in [Chapter 14](http://r4ds.had.co.nz/strings.html) of R4DS are good, and doing them all would be a great way of learning all about **stringr**. Depending on how much time you have, a good selection of exercises is:
+ 14.3.2.1
+ 14.3.3.1
+ 14.3.4.1
+ 14.3.5.1
+ 14.4.2
+ 14.4.5.1