How to keep special symbols like "(", "," and "#" in tokens in R?

Question

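I am working with a text file of job advertisements that contains terms such as "c#", "c++" and ".net". When I convert it to tokens, the "#", the "++" and the dot are removed. How can I keep them in the resulting tokens? Here is my code: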

unnest_tokens(word, REQUIREMENTS, token = "words", to_lower = TRUE)

Answer 1

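The problem is the argument token = "words", which splits on non-word characters (probably using something like the regex \W+). The function discards the delimiters, so to keep these characters you will have to use something other than "words". You probably want token = "regex" with your own splitting regex, something like this: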

unnest_tokens(word,
              REQUIREMENTS,
              token = "regex",
              to_lower = TRUE,
              pattern = "\\s+") # split on whitespace rather than non-word characters; the backslash must be doubled in an R string

This way, you can define whatever regex you need to customize how the text is tokenized.
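As a minimal sketch of the regex approach, the example below runs it on a made-up one-row data frame. The `ads` object and its sample sentence are assumptions for illustration, not the asker's data. It also shows a side effect of splitting on whitespace alone: punctuation attached to a word, such as a trailing comma, stays in the token. Adding those characters to the split pattern is one possible way to drop them.

```r
library(tidytext)

# Hypothetical sample data; only the column name REQUIREMENTS comes from the question.
ads <- data.frame(
  REQUIREMENTS = "Experience with C#, C++ and .NET required",
  stringsAsFactors = FALSE
)

# Split on runs of whitespace only: "#", "++" and "." are kept,
# but so is the comma attached to "c#".
tokens_ws <- unnest_tokens(ads, word, REQUIREMENTS,
                           token = "regex", pattern = "\\s+",
                           to_lower = TRUE)
print(tokens_ws$word)
# "experience" "with" "c#," "c++" "and" ".net" "required"

# Split on whitespace, commas and semicolons: the trailing comma is dropped.
tokens_clean <- unnest_tokens(ads, word, REQUIREMENTS,
                              token = "regex", pattern = "[\\s,;]+",
                              to_lower = TRUE)
print(tokens_clean$word)
# "experience" "with" "c#" "c++" "and" ".net" "required"
```

Which characters belong in the split pattern depends on the data. A dot, for example, should stay out of it if tokens like ".net" need to survive.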

r

data-mining

tokenize