Why is stringr::str_detect not able to detect my string?

Question

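I am extracting lines from a PDF and trying to detect a specific string using dplyr::filter(stringr::str_detect(my_column, 'my string')).

The string does not appear to have a detectable encoding.

Here is a link to the PDF file: https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf

The string is the em-dash in the table (left column) on page 42.

I have tried detecting several representations of the em-dash, but none of them is found in this document.

How can I determine the encoding of this em-dash so that I can use it to filter my tibble?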

pdftools::pdf_text("SubreadUsersGuide.pdf") %>% 
  stringr::str_split(pattern = '\r') %>% 
  tibble::tibble(
    line = .
  ) %>% 
  tidyr::unnest(cols = line) %>% 
  dplyr::filter(
    stringr::str_detect(line, pattern = '^EM_DASH')
  )

Answer 1

The character you want to match is not a dash. It is a MINUS SIGN, which belongs to the Symbol, Math (Sm) Unicode category and has the code point U+2212.
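One way to check which character a line actually starts with is to convert it to its integer code point with base R. The following is a minimal sketch; the `line` value is a hypothetical sample standing in for a line extracted from the PDF, not the real text.

```r
# Hypothetical sample line (an assumption, not taken from the PDF):
# it starts with U+2212 MINUS SIGN, which looks like a dash on screen.
line <- "\u2212\u2212help"

# Take the first character and look up its integer code point
first_char <- substr(line, 1, 1)
code_point <- utf8ToInt(first_char)

# Format it in the usual U+XXXX notation
label <- sprintf("U+%04X", code_point)
print(label)
# [1] "U+2212"
```

An em-dash would print U+2014 here, so the output shows directly which character a pattern has to target.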

To match one or more Unicode dashes or minus signs at the start of a string, you can use

pattern = "^[\\p{Pd}\\xAD\\u2212]+"

Here,

^ - start of the string
[ - start of a character class:
- \p{Pd} - any Punctuation, Dash character
- \xAD - a soft hyphen
- \u2212 - a minus sign.
]+ - end of the character class, repeated one or more times.

See the regex demo.
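As a quick check of the pattern described above, the sketch below runs it against a few sample strings. The `lines` vector is hypothetical test data, not text from the PDF, and the backslashes are doubled because the pattern is written inside an R string literal.

```r
library(stringr)

# Hypothetical sample lines (assumptions for illustration only)
lines <- c(
  "\u2212\u2212minus option",  # starts with U+2212 MINUS SIGN
  "\u2014 em dash item",       # starts with U+2014 EM DASH (category Pd)
  "-h hyphen option",          # starts with ASCII hyphen-minus (category Pd)
  "plain text line"            # no leading dash-like character
)

# Dash punctuation, soft hyphen, or minus sign at the start of the string
pattern <- "^[\\p{Pd}\\xAD\\u2212]+"

detected <- str_detect(lines, pattern)
print(detected)
# [1]  TRUE  TRUE  TRUE FALSE

# Keep only the lines that start with a dash-like character
matched <- lines[detected]
```

The same pattern can be passed to stringr::str_detect inside dplyr::filter, as in the question.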

regex

unicode

r

stringr