正则表达式

参考:https://r4ds.hadley.nz/regexps

正则表达式是一种简洁而强大的语言,用于描述字符串中的模式。正则表达式(regular expressions)有时被缩写为 “regex” 或 “regexp”。

1 轮流匹配(alternas)

用于模糊匹配。

1.1 "ab|d"

"ab|d" 匹配“ab”或“d”:

alternas <- c("abc", "abcde", "acbde")

str_view(alternas, "ab|d")
[1] │ <ab>c
[2] │ <ab>c<d>e
[3] │ acb<d>e

1.2 "[abd]"

"[abd]" 匹配“a”、“b”和“c”中的任意一个:

str_view(alternas, "[abc]")
[1] │ <a><b><c>
[2] │ <a><b><c>de
[3] │ <a><c><b>de

1.3 "[a-c]"

"[a-c]" 匹配包含“a”到“c”及其之间字母的字符,即匹配“a”、“b”或“c”:

str_view(alternas, "[a-c]")
[1] │ <a><b><c>
[2] │ <a><b><c>de
[3] │ <a><c><b>de

1.4 "[^abc]"

"[^abc]" 匹配不包含“a”、“b”及“c”的字符:

str_view(alternas, "[^abc]")
[2] │ abc<d><e>
[3] │ acb<d><e>

正则表达式是需要区分大小写的,例如:

str_view(alternas, "[ABc]")
[1] │ ab<c>
[2] │ ab<c>de
[3] │ a<c>bde

如果我们不需要区分大小写,有以下三种方法可以使用:

  1. 将大小写字母同时列出:

    str_view(alternas, "[ABCabc]")
    [1] │ <a><b><c>
    [2] │ <a><b><c>de
    [3] │ <a><c><b>de
  2. 告诉正则表达式忽略大小写。在 stringr 中,可以通过将正则表达式封装到 regex() 中,从而调用一些参数来控制正则表达式的行为。例如通过添加 ignore_case = TRUE,就可以实现忽略大小写。其他编程语言中,这些参数通常被称为“flag”。

    str_view(alternas, regex("[ABC]", ignore_case = TRUE))
    [1] │ <a><b><c>
    [2] │ <a><b><c>de
    [3] │ <a><c><b>de
  3. 使用 str_to_lower() 将待匹配字符全部转换为小写:

    alternas %>% 
      str_to_lower() %>% 
      str_view("[abc]")
    [1] │ <a><b><c>
    [2] │ <a><b><c>de
    [3] │ <a><c><b>de

在具体应用中可根据实际情况选择其中一种方法。

2 定量匹配(quantifiers)

2.1 "a."

"a." 匹配包含 “a” 和另一个任意字符的字符串:

str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>
# 匹配开头为“a”,最后为”e”,并且中间包含任意三个字符的字符串:
str_view(fruit, "a...e")
 [1] │ <apple>
 [7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry
# 如果只写一个点"."则只要有任意字符(包括空格)的对象都会被匹配到,而空对象不会被匹配到:
str_view(c("", "a ", "a b", "ae", "bd", "ea", "eab", "%"), ".")
[2] │ <a>< >
[3] │ <a>< ><b>
[4] │ <a><e>
[5] │ <b><d>
[6] │ <e><a>
[7] │ <e><a><b>
[8] │ <%>

2.2 "ab?"

"ab?" 匹配 “a”或“ab”:

str_view(c("a", "ac", "ab", "abb", "abc", "abcd"), "ab?")
[1] │ <a>
[2] │ <a>c
[3] │ <ab>
[4] │ <ab>b
[5] │ <ab>c
[6] │ <ab>cd

2.3 "ab+"

"ab+" 匹配“ab”、“abb”、“abbb”……,即“a”后至少一个“b”:

str_view(c("a", "ac", "ab", "abb", "abc", "abcd"), "ab+")
[3] │ <ab>
[4] │ <abb>
[5] │ <ab>c
[6] │ <ab>cd

2.4 "ab*"

“ab*” 匹配“a”、“ab”、“abb”、“abbb”……,即“a”或“a”后加任意数量的“b”:

str_view(c("ab", "ac", "ab", "abb", "abc", "abcd"), "ab*")
[1] │ <ab>
[2] │ <a>c
[3] │ <ab>
[4] │ <abb>
[5] │ <ab>c
[6] │ <ab>cd
Tip

匹配“ab”或以“a”开头“b”结尾的字符:

str_view(c("ab", "acb", "a b", "acdb"), "a.*b")
[1] │ <ab>
[2] │ <acb>
[3] │ <a b>
[4] │ <acdb>

2.5 "a{n}"

除了上面任意数量字符的匹配,我们还可以使用 {} 精确指定字符的匹配数量:

chr <- c("a", "aab", "aba", "aaa.b", "aaaabc")

# 匹配“aa”
str_view(chr, "a{2}")
[2] │ <aa>b
[4] │ <aa>a.b
[5] │ <aa><aa>bc
# 等价于
str_view(chr, "aa")
[2] │ <aa>b
[4] │ <aa>a.b
[5] │ <aa><aa>bc

2.6 "a{n,}"

"a{2,}" 匹配“aa”、“aaa”、“aaaa”……,即匹配连续≥2次“a”的字符:

str_view(chr, "a{2,}")
[2] │ <aa>b
[4] │ <aaa>.b
[5] │ <aaaa>bc

2.7 "a{n,m}"

"a{n,m}" 匹配连续n-m个“a”:

# 匹配“aaa”和“aaaa”
str_view(chr, "a{3,4}")
[4] │ <aaa>.b
[5] │ <aaaa>bc

3 锚点匹配(anchors)

3.1 "^a""a$"

给匹配字符加上了位置锚点,只匹配开头("^a")或结尾(“a$”)字符:

str_view(fruit, "^a")
[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado
str_view(fruit, "a$")
 [4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>
# 强制完全匹配
str_view(fruit, "^apple$")
[1] │ <apple>

3.2 "\\b"

"\\b"字符边界标志,匹配单词之间的边界(即单词的开始或结束)。例如,我们要匹配以“The”开头的对象,如果直接写 "^The",则还会匹配到“These”、“There”、“Their”等单词,所以这个时候我们可以在“The”的后面加上边界标志 "\\b" ,表示这是一个独立的单词:

chr <- sentences[1:10]
chr
 [1] "The birch canoe slid on the smooth planks." 
 [2] "Glue the sheet to the dark blue background."
 [3] "It's easy to tell the depth of a well."     
 [4] "These days a chicken leg is a rare dish."   
 [5] "Rice is often served in round bowls."       
 [6] "The juice of lemons makes fine punch."      
 [7] "The box was thrown beside the parked truck."
 [8] "The hogs were fed chopped corn and garbage."
 [9] "Four hours of steady work faced us."        
[10] "A large size in stockings is hard to sell." 
str_view(chr, "^The")
[1] │ <The> birch canoe slid on the smooth planks.
[4] │ <The>se days a chicken leg is a rare dish.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.
str_view(chr, "^The\\b")
[1] │ <The> birch canoe slid on the smooth planks.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.
# 或者限定“The”后面必须有至少一个空格
str_view(chr, "^The\\s+")
[1] │ <The >birch canoe slid on the smooth planks.
[6] │ <The >juice of lemons makes fine punch.
[7] │ <The >box was thrown beside the parked truck.
[8] │ <The >hogs were fed chopped corn and garbage.

4 转译符(escape)

以上的这些符号称为元字符(metacharacters),它们在正则表达式中起到类似函数名的作用,不参与字面匹配。所有的元字符包括:.^$\|*+?{}[]()。如果我们想匹配元字符本身时应该怎么做呢?这时候就需要引入转译符(escape)有两种转译符:"\\""[]"

chr <- c("abc", "a.bc", "a^bc", "a|bc", "a*bc", "a?bc", "a\\bc", "a\bc")
str_view(chr, "a\\.b")
[2] │ <a.b>c
str_view(chr, "a\\^b")
[3] │ <a^b>c
str_view(chr, "a\\|b")
[4] │ <a|b>c
str_view(chr, "a\\*b")
[5] │ <a*b>c
str_view(chr, "a\\?b")
[6] │ <a?b>c
str_view(chr, "\\\\") 
[7] │ a<\>bc

"[]" 的的效果和上面一样,但是它不能用于匹配转译符"//"

str_view(chr, "a[.]b")
str_view(chr, "a[*]b")
str_view(chr, "a[?]b")

5 字符类别(character class/character set)匹配

匹配某种类型的字符,如 [a-z] 匹配任何小写字母, [0-9] 匹配任何数字。比较特殊的有:

5.1 "\\d+"

匹配任意数字字符:

chr <- "abcd ABCD  12345 -!@#%."
str_view(chr, "\\d+")
[1] │ abcd ABCD  <12345> -!@#%.
# 等价
str_view(chr, "[0-9]+")
[1] │ abcd ABCD  <12345> -!@#%.

5.2 "\\D+"

匹配任何非数字字符:

str_view(chr, "\\D+")
[1] │ <abcd ABCD  >12345< -!@#%.>

5.3 "\\s+"

匹配空格(whitespaces),包括制表和换行符:

str_view(chr, "\\s+")
[1] │ abcd< >ABCD<  >12345< >-!@#%.
# 或者直接输入空格
str_view(chr, " +")
[1] │ abcd< >ABCD<  >12345< >-!@#%.

5.4 "\\S+"

匹配任何非空格字符:

str_view(chr, "\\S+")
[1] │ <abcd> <ABCD>  <12345> <-!@#%.>

".+" :非空匹配,即只要有任意字符及空格的对象都会被匹配到,而空对象不会被匹配到:

chr2 <- c("", "  ", "a ", "a b", "ae", "bd", "ea", "eab", " %")
str_view(chr2, ".+")
[2] │ <  >
[3] │ <a >
[4] │ <a b>
[5] │ <ae>
[6] │ <bd>
[7] │ <ea>
[8] │ <eab>
[9] │ < %>
str_view(chr2, "\\S+")
[3] │ <a> 
[4] │ <a> <b>
[5] │ <ae>
[6] │ <bd>
[7] │ <ea>
[8] │ <eab>
[9] │  <%>

注意第二个对象 " " 为只包含了空格的对象,".+" 会匹配到该对象,而 "\S+" 则不会。

5.5 "\\w+"

匹配任何单词字符,即字母和数字:

str_view(chr, "\\w+")
[1] │ <abcd> <ABCD>  <12345> -!@#%.

5.6 "\\W+"

匹配任何“非单词”字符,即一些特殊字符:

str_view(chr, "\\W+")
[1] │ abcd< >ABCD<  >12345< -!@#%.>

实践

《R for Data Science》的 ”Regular expressions“ 一章的最后部分有几个案例帮助进一步熟练掌握正则表达式的应用,可以参考。


R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.0   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.4      jsonlite_1.8.8    compiler_4.3.3    tidyselect_1.2.0 
 [5] scales_1.3.0      yaml_2.3.8        fastmap_1.1.1     R6_2.5.1         
 [9] generics_0.1.3    knitr_1.45        htmlwidgets_1.6.4 munsell_0.5.0    
[13] pillar_1.9.0      tzdb_0.4.0        rlang_1.1.3       utf8_1.2.4       
[17] stringi_1.8.3     xfun_0.42         timechange_0.3.0  cli_3.6.2        
[21] withr_3.0.0       magrittr_2.0.3    digest_0.6.34     grid_4.3.3       
[25] rstudioapi_0.15.0 hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5      
[29] evaluate_0.23     glue_1.7.0        codetools_0.2-19  fansi_1.0.6      
[33] colorspace_2.1-0  rmarkdown_2.25    tools_4.3.3       pkgconfig_2.0.3  
[37] htmltools_0.5.7