正则表达式

参考：https://r4ds.hadley.nz/regexps

正则表达式是一种简洁而强大的语言，用于描述字符串中的模式。正则表达式（regular expressions）有时被缩写为 “regex” 或 “regexp”。

library(tidyverse)

1 轮流匹配（alternas）

用于模糊匹配。

1.1 `"ab|d"`

"ab|d" 匹配“ab”或“d”：

alternas <- c("abc", "abcde", "acbde")

str_view(alternas, "ab|d")

[1] │ <ab>c
[2] │ <ab>c<d>e
[3] │ acb<d>e

1.2 `"[abd]"`

"[abd]" 匹配“a”、“b”和“c”中的任意一个：

str_view(alternas, "[abc]")

[1] │ <a><b><c>
[2] │ <a><b><c>de
[3] │ <a><c><b>de

1.3 `"[a-c]"`

"[a-c]" 匹配包含“a”到“c”及其之间字母的字符，即匹配“a”、“b”或“c”：

str_view(alternas, "[a-c]")

[1] │ <a><b><c>
[2] │ <a><b><c>de
[3] │ <a><c><b>de

1.4 `"[^abc]"`

"[^abc]" 匹配不包含“a”、“b”及“c”的字符：

str_view(alternas, "[^abc]")

[2] │ abc<d><e>
[3] │ acb<d><e>

区分大小写

正则表达式是需要区分大小写的，例如：

str_view(alternas, "[ABc]")

[1] │ ab<c>
[2] │ ab<c>de
[3] │ a<c>bde

如果我们不需要区分大小写，有以下三种方法可以使用：

将大小写字母同时列出：

str_view(alternas, "[ABCabc]")

[1] │ <a><b><c>
[2] │ <a><b><c>de
[3] │ <a><c><b>de

告诉正则表达式忽略大小写。在 stringr 中，可以通过将正则表达式封装到 regex() 中，从而调用一些参数来控制正则表达式的行为。例如通过添加 ignore_case = TRUE，就可以实现忽略大小写。其他编程语言中，这些参数通常被称为“flag”。
```
str_view(alternas, regex("[ABC]", ignore_case = TRUE))
```
```
[1] │ <a><b><c>
[2] │ <a><b><c>de
[3] │ <a><c><b>de
```

使用 str_to_lower() 将待匹配字符全部转换为小写：

alternas %>% 
  str_to_lower() %>% 
  str_view("[abc]")

[1] │ <a><b><c>
[2] │ <a><b><c>de
[3] │ <a><c><b>de

在具体应用中可根据实际情况选择其中一种方法。

2 定量匹配（quantifiers）

2.1 `"a."`

"a." 匹配包含 “a” 和另一个任意字符的字符串：

str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")

[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>

# 匹配开头为“a”，最后为”e”，并且中间包含任意三个字符的字符串：
str_view(fruit, "a...e")

 [1] │ <apple>
 [7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry

# 如果只写一个点"."则只要有任意字符（包括空格）的对象都会被匹配到，而空对象不会被匹配到：
str_view(c("", "a ", "a b", "ae", "bd", "ea", "eab", "%"), ".")

[2] │ <a>< >
[3] │ <a>< ><b>
[4] │ <a><e>
[5] │ <b><d>
[6] │ <e><a>
[7] │ <e><a><b>
[8] │ <%>

2.2 `"ab?"`

"ab?" 匹配 “a”或“ab”：

str_view(c("a", "ac", "ab", "abb", "abc", "abcd"), "ab?")

[1] │ <a>
[2] │ <a>c
[3] │ <ab>
[4] │ <ab>b
[5] │ <ab>c
[6] │ <ab>cd

2.3 `"ab+"`

"ab+" 匹配“ab”、“abb”、“abbb”……，即“a”后至少一个“b”：

str_view(c("a", "ac", "ab", "abb", "abc", "abcd"), "ab+")

[3] │ <ab>
[4] │ <abb>
[5] │ <ab>c
[6] │ <ab>cd

2.4 `"ab*"`

“ab*” 匹配“a”、“ab”、“abb”、“abbb”……，即“a”或“a”后加任意数量的“b”：

str_view(c("ab", "ac", "ab", "abb", "abc", "abcd"), "ab*")

[1] │ <ab>
[2] │ <a>c
[3] │ <ab>
[4] │ <abb>
[5] │ <ab>c
[6] │ <ab>cd

Tip

匹配“ab”或以“a”开头“b”结尾的字符：

str_view(c("ab", "acb", "a b", "acdb"), "a.*b")

[1] │ <ab>
[2] │ <acb>
[3] │ <a b>
[4] │ <acdb>

2.5 `"a{n}"`

除了上面任意数量字符的匹配，我们还可以使用 {} 精确指定字符的匹配数量：

chr <- c("a", "aab", "aba", "aaa.b", "aaaabc")

# 匹配“aa”
str_view(chr, "a{2}")

[2] │ <aa>b
[4] │ <aa>a.b
[5] │ <aa><aa>bc

# 等价于
str_view(chr, "aa")

[2] │ <aa>b
[4] │ <aa>a.b
[5] │ <aa><aa>bc

2.6 `"a{n,}"`

"a{2,}" 匹配“aa”、“aaa”、“aaaa”……，即匹配连续≥2次“a”的字符：

str_view(chr, "a{2,}")

[2] │ <aa>b
[4] │ <aaa>.b
[5] │ <aaaa>bc

2.7 `"a{n,m}"`

"a{n,m}" 匹配连续n-m个“a”：

# 匹配“aaa”和“aaaa”
str_view(chr, "a{3,4}")

[4] │ <aaa>.b
[5] │ <aaaa>bc

3 锚点匹配（anchors）

3.1 `"^a"` 和 `"a$"`

给匹配字符加上了位置锚点，只匹配开头（"^a"）或结尾（“a$”）字符：

str_view(fruit, "^a")

[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado

str_view(fruit, "a$")

 [4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>

# 强制完全匹配
str_view(fruit, "^apple$")

[1] │ <apple>

3.2 `"\\b"`

"\\b" 是字符边界标志，匹配单词之间的边界（即单词的开始或结束）。例如，我们要匹配以“The”开头的对象，如果直接写 "^The"，则还会匹配到“These”、“There”、“Their”等单词，所以这个时候我们可以在“The”的后面加上边界标志 "\\b" ，表示这是一个独立的单词：

chr <- sentences[1:10]
chr

 [1] "The birch canoe slid on the smooth planks." 
 [2] "Glue the sheet to the dark blue background."
 [3] "It's easy to tell the depth of a well."     
 [4] "These days a chicken leg is a rare dish."   
 [5] "Rice is often served in round bowls."       
 [6] "The juice of lemons makes fine punch."      
 [7] "The box was thrown beside the parked truck."
 [8] "The hogs were fed chopped corn and garbage."
 [9] "Four hours of steady work faced us."        
[10] "A large size in stockings is hard to sell."

str_view(chr, "^The")

[1] │ <The> birch canoe slid on the smooth planks.
[4] │ <The>se days a chicken leg is a rare dish.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.

str_view(chr, "^The\\b")

[1] │ <The> birch canoe slid on the smooth planks.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.

# 或者限定“The”后面必须有至少一个空格
str_view(chr, "^The\\s+")

[1] │ <The >birch canoe slid on the smooth planks.
[6] │ <The >juice of lemons makes fine punch.
[7] │ <The >box was thrown beside the parked truck.
[8] │ <The >hogs were fed chopped corn and garbage.

4 转译符（escape）

以上的这些符号称为元字符（metacharacters），它们在正则表达式中起到类似函数名的作用，不参与字面匹配。所有的元字符包括：.^$\|*+?{}[]()。如果我们想匹配元字符本身时应该怎么做呢？这时候就需要引入转译符（escape）。有两种转译符："\\" 和 "[]"：

chr <- c("abc", "a.bc", "a^bc", "a|bc", "a*bc", "a?bc", "a\\bc", "a\bc")
str_view(chr, "a\\.b")

[2] │ <a.b>c

str_view(chr, "a\\^b")

[3] │ <a^b>c

str_view(chr, "a\\|b")

[4] │ <a|b>c

str_view(chr, "a\\*b")

[5] │ <a*b>c

str_view(chr, "a\\?b")

[6] │ <a?b>c

str_view(chr, "\\\\")

[7] │ a<\>bc

"[]" 的的效果和上面一样，但是它不能用于匹配转译符"//"：

str_view(chr, "a[.]b")
str_view(chr, "a[*]b")
str_view(chr, "a[?]b")

5 字符类别（character class/character set）匹配

匹配某种类型的字符，如 [a-z] 匹配任何小写字母， [0-9] 匹配任何数字。比较特殊的有：

5.1 `"\\d+"`

匹配任意数字字符：

chr <- "abcd ABCD  12345 -!@#%."
str_view(chr, "\\d+")

[1] │ abcd ABCD  <12345> -!@#%.

# 等价
str_view(chr, "[0-9]+")

[1] │ abcd ABCD  <12345> -!@#%.

5.2 `"\\D+"`

匹配任何非数字字符：

str_view(chr, "\\D+")

[1] │ <abcd ABCD  >12345< -!@#%.>

5.3 `"\\s+"`

匹配空格（whitespaces），包括制表和换行符：

str_view(chr, "\\s+")

[1] │ abcd< >ABCD<  >12345< >-!@#%.

# 或者直接输入空格
str_view(chr, " +")

[1] │ abcd< >ABCD<  >12345< >-!@#%.

5.4 `"\\S+"`

匹配任何非空格字符：

str_view(chr, "\\S+")

[1] │ <abcd> <ABCD>  <12345> <-!@#%.>

区分 "\\S+" 和 ".+"

".+" ：非空匹配，即只要有任意字符及空格的对象都会被匹配到，而空对象不会被匹配到：

chr2 <- c("", "  ", "a ", "a b", "ae", "bd", "ea", "eab", " %")
str_view(chr2, ".+")

[2] │ <  >
[3] │ <a >
[4] │ <a b>
[5] │ <ae>
[6] │ <bd>
[7] │ <ea>
[8] │ <eab>
[9] │ < %>

str_view(chr2, "\\S+")

[3] │ <a> 
[4] │ <a> <b>
[5] │ <ae>
[6] │ <bd>
[7] │ <ea>
[8] │ <eab>
[9] │  <%>

注意第二个对象 " " 为只包含了空格的对象，".+" 会匹配到该对象，而 "\S+" 则不会。

5.5 `"\\w+"`

匹配任何单词字符，即字母和数字：

str_view(chr, "\\w+")

[1] │ <abcd> <ABCD>  <12345> -!@#%.

5.6 `"\\W+"`

匹配任何“非单词”字符，即一些特殊字符：

str_view(chr, "\\W+")

[1] │ abcd< >ABCD<  >12345< -!@#%.>

实践

《R for Data Science》的 ”Regular expressions“ 一章的最后部分有几个案例帮助进一步熟练掌握正则表达式的应用，可以参考。

Session Info

R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.0   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.4      jsonlite_1.8.8    compiler_4.3.3    tidyselect_1.2.0 
 [5] scales_1.3.0      yaml_2.3.8        fastmap_1.1.1     R6_2.5.1         
 [9] generics_0.1.3    knitr_1.45        htmlwidgets_1.6.4 munsell_0.5.0    
[13] pillar_1.9.0      tzdb_0.4.0        rlang_1.1.3       utf8_1.2.4       
[17] stringi_1.8.3     xfun_0.42         timechange_0.3.0  cli_3.6.2        
[21] withr_3.0.0       magrittr_2.0.3    digest_0.6.34     grid_4.3.3       
[25] rstudioapi_0.15.0 hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5      
[29] evaluate_0.23     glue_1.7.0        codetools_0.2-19  fansi_1.0.6      
[33] colorspace_2.1-0  rmarkdown_2.25    tools_4.3.3       pkgconfig_2.0.3  
[37] htmltools_0.5.7

1 轮流匹配（alternas）

1.1 "ab|d"

1.2 "[abd]"

1.3 "[a-c]"

1.4 "[^abc]"

2 定量匹配（quantifiers）

2.1 "a."

2.2 "ab?"

2.3 "ab+"

2.4 "ab*"

2.5 "a{n}"

2.6 "a{n,}"

2.7 "a{n,m}"

3 锚点匹配（anchors）

3.1 "^a" 和 "a$"

3.2 "\\b"

4 转译符（escape）

5 字符类别（character class/character set）匹配

5.1 "\\d+"

5.2 "\\D+"

5.3 "\\s+"

5.4 "\\S+"

5.5 "\\w+"

5.6 "\\W+"

1.1 `"ab|d"`

1.2 `"[abd]"`

1.3 `"[a-c]"`

1.4 `"[^abc]"`

2.1 `"a."`

2.2 `"ab?"`

2.3 `"ab+"`

2.4 `"ab*"`

2.5 `"a{n}"`

2.6 `"a{n,}"`

2.7 `"a{n,m}"`

3.1 `"^a"` 和 `"a$"`

3.2 `"\\b"`

5.1 `"\\d+"`

5.2 `"\\D+"`

5.3 `"\\s+"`

5.4 `"\\S+"`

5.5 `"\\w+"`

5.6 `"\\W+"`