R pattern matching multiple combinations of rows in columns for replacement?

Question

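I'm trying to work out how to identify any instance of a string from one column of a data frame within another column of the same data frame, so that I can replace it. In this case I have forum posts that I scraped, in which people refer to other users by name. I want to remove those names for analysis, because otherwise they show up as heavily used words. Here is the dput output for this data frame: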

structure(list(uber_name = structure(c(9L, 2L, 1L, 2L, 3L, 10L, 
3L, 9L, 11L), .Label = c("aluber1968", "bigdreamslittlemoney", 
"FuberNYC", "JamesM", "jonnyplastic", "JustDre", "KING D", "klimarov", 
"NycGirl705", "shumacker", "spike69", "theitalian", "Uberman8263", 
"Ez2dj", "Manhmptn", "NYCDriver", "staytune", "UBS", "Ubured", 
"Jme10", "Lennyyellowcab", "Mir", "eagle88", "Ibuys4730", "NoUsername", 
"BathoTrask", "Douglas", "LGC", "Jakeinny098", "Rustyshackelford", 
"shabbyroch", "ubershiza", "drbrkln", "elys123", "bossdriver", 
"HerbyHerb", "Jim1985", "Malik38", "STIDRIVER", "vxlon7", "Waqar", 
"tohunt4me", "DogPound", "SuliB", "AlBrklyn", "John Cunningham", 
"MReeves", "PinkFoot", "alextheboss", "luisannalui", "censoredbytheFCC", 
"KONY", "cieru", "Jorlev", "Smooth954", "marcusguber", "nyc321", 
"Tony from New Jersey", "Vanstaal", "Bkrah", "brunoamat2", "gebbels6", 
"Kevin7889", "uanic", "Uber OG", "UberKilledMyMarriage", "ya mon its me", 
"HunkAWestchester", "Mr Affinito", "ninja warrior", "NoNonsense", 
"notacabdriver", "Notauberhater", "TwoFiddyMile", "bilyvh", "cybertec69", 
"JohnnyBlanco", "SOBE", "ubernyc"), class = "factor"), uber_write = c("I see people post about getting a w", 
"you have 2 choices either you drive", "More than a year ago I didnt drive ", 
"yeah i stopped driving for them for", "Ive been getting some promotions la", 
"FuberNYC saidIve been getting some ", "shumacker saidAnd You feel importan", 
"FuberNYC saidIve been getting some ", "They start coming after few months "
), uber_date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("Jan 19, 2017", "Mar 30, 2017", "Jan 23, 2017", 
"Jan 12, 2017", "Jan 9, 2017", "Jan 1, 2017", "Dec 31, 2016", 
"Nov 26, 2016", "Nov 3, 2016", "Dec 22, 2016", "Dec 13, 2016", 
"Dec 2, 2016", "Nov 15, 2016", "Oct 31, 2016", "Oct 20, 2016", 
"Mar 14, 2017", "Sep 1, 2016", "Jul 26, 2016", "Mar 1, 2017", 
"Feb 25, 2017", "Sep 8, 2016", "Sep 9, 2016", "Apr 21, 2015"), class = "factor")), .Names = c("uber_name", 
"uber_write", "uber_date"), class = c("data.table", "data.frame"
), row.names = c(NA, -9L), .internal.selfref = <pointer: 0x0000000000220788>)

I have used gsub before, but I don't know how to apply it in this case. I want to take every name in the "uber_name" column and remove those user names from every post in "uber_write".

Answer 1

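You can create a vector uber_names of all the user names in the data.table (dt), then build a regular expression of the form (name1|name2|name3) and replace every matching user name with "", like this: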

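library(data.table)
# Collect each distinct user name once (as.character() drops the factor wrapper)
uber_names <- unique(as.character(dt$uber_name))
# Build one alternation pattern, (name1|name2|...), and delete every match
dt[, uber_write_filtered := gsub(
    pattern = paste0("(", paste(uber_names, collapse = "|"), ")"),
    replacement = "", uber_write)]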
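
One caveat with a single alternation pattern is that each user name is interpreted as a regular expression, so a name containing a character such as . or ( could match more than intended. Below is a minimal sketch of an alternative, assuming literal matching is acceptable: each name is removed with fixed = TRUE, longest names first. remove_names is a hypothetical helper, and the toy posts and users are made up for illustration.

```r
# Hypothetical helper: strip every user name from a character vector of posts,
# treating each name as a literal string rather than a regular expression.
remove_names <- function(text, names) {
  names <- unique(as.character(names))   # factor -> character, no duplicates
  names <- names[nzchar(names)]          # skip empty names
  # Longest names first, so a short name cannot split a longer one
  names <- names[order(nchar(names), decreasing = TRUE)]
  for (nm in names) {
    text <- gsub(nm, "", text, fixed = TRUE)
  }
  text
}

# Toy data (assumed for illustration, not taken from the question)
posts <- c("FuberNYC saidIve been getting some ", "a.b saidHello", "aXb stays")
users <- factor(c("FuberNYC", "a.b"))

cleaned <- remove_names(posts, users)
cleaned
# " saidIve been getting some "  " saidHello"  "aXb stays"
```

With the regex version, the name "a.b" would also have removed "aXb" from the third post, because . matches any character. The result could be assigned back with something like dt[, uber_write_filtered := remove_names(uber_write, uber_name)].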

Answer 2

I couldn't recreate your data frame, but here is a close approximation:

data <- 
structure(list(uber_name = c("aluber1968", "bigdreamslittlemoney", 
"FuberNYC", "JamesM", "jonnyplastic", "JustDre", "KING D", "klimarov", 
"NycGirl705", "shumacker", "spike69", "theitalian", "Uberman8263", 
"Ez2dj", "Manhmptn", "NYCDriver", "staytune", "UBS", "Ubured", 
"Jme10", "Lennyyellowcab", "Mir", "eagle88", "Ibuys4730", "NoUsername", 
"BathoTrask", "Douglas", "LGC", "Jakeinny098", "Rustyshackelford", 
"shabbyroch", "ubershiza", "drbrkln", "elys123", "bossdriver", 
"HerbyHerb", "Jim1985", "Malik38", "STIDRIVER", "vxlon7", "Waqar", 
"tohunt4me", "DogPound", "SuliB", "AlBrklyn", "John Cunningham", 
"MReeves", "PinkFoot", "alextheboss", "luisannalui", "censoredbytheFCC", 
"KONY", "cieru", "Jorlev", "Smooth954", "marcusguber", "nyc321", 
"Tony from New Jersey", "Vanstaal", "Bkrah", "brunoamat2", "gebbels6", 
"Kevin7889", "uanic", "Uber OG", "UberKilledMyMarriage", "ya mon its me", 
"HunkAWestchester", "Mr Affinito", "ninja warrior", "NoNonsense", 
"notacabdriver", "Notauberhater", "TwoFiddyMile", "bilyvh", "cybertec69", 
"JohnnyBlanco", "SOBE", "ubernyc"), uber_write = c("I see people post about getting a w", 
"you have 2 choices either you drive", "More than a year ago I didnt drive ", 
"yeah i stopped driving for them for", "Ive been getting some promotions la", 
"FuberNYC saidIve been getting some ", "shumacker saidAnd You feel importan", 
"FuberNYC saidIve been getting some ", "They start coming after few months ", 
"I see people post about getting a w", "you have 2 choices either you drive", 
"More than a year ago I didnt drive ", "yeah i stopped driving for them for", 
"Ive been getting some promotions la", "FuberNYC saidIve been getting some ", 
"shumacker saidAnd You feel importan", "FuberNYC saidIve been getting some ", 
"They start coming after few months ", "I see people post about getting a w", 
"you have 2 choices either you drive", "More than a year ago I didnt drive ", 
"yeah i stopped driving for them for", "Ive been getting some promotions la", 
"FuberNYC saidIve been getting some ", "shumacker saidAnd You feel importan", 
"FuberNYC saidIve been getting some ", "They start coming after few months ", 
"I see people post about getting a w", "you have 2 choices either you drive", 
"More than a year ago I didnt drive ", "yeah i stopped driving for them for", 
"Ive been getting some promotions la", "FuberNYC saidIve been getting some ", 
"shumacker saidAnd You feel importan", "FuberNYC saidIve been getting some ", 
"They start coming after few months ", "I see people post about getting a w", 
"you have 2 choices either you drive", "More than a year ago I didnt drive ", 
"yeah i stopped driving for them for", "Ive been getting some promotions la", 
"FuberNYC saidIve been getting some ", "shumacker saidAnd You feel importan", 
"FuberNYC saidIve been getting some ", "They start coming after few months ", 
"I see people post about getting a w", "you have 2 choices either you drive", 
"More than a year ago I didnt drive ", "yeah i stopped driving for them for", 
"Ive been getting some promotions la", "FuberNYC saidIve been getting some ", 
"shumacker saidAnd You feel importan", "FuberNYC saidIve been getting some ", 
"They start coming after few months ", "I see people post about getting a w", 
"you have 2 choices either you drive", "More than a year ago I didnt drive ", 
"yeah i stopped driving for them for", "Ive been getting some promotions la", 
"FuberNYC saidIve been getting some ", "shumacker saidAnd You feel importan", 
"FuberNYC saidIve been getting some ", "They start coming after few months ", 
"I see people post about getting a w", "you have 2 choices either you drive", 
"More than a year ago I didnt drive ", "yeah i stopped driving for them for", 
"Ive been getting some promotions la", "FuberNYC saidIve been getting some ", 
"shumacker saidAnd You feel importan", "FuberNYC saidIve been getting some ", 
"They start coming after few months ", "I see people post about getting a w", 
"you have 2 choices either you drive", "More than a year ago I didnt drive ", 
"yeah i stopped driving for them for", "Ive been getting some promotions la", 
"FuberNYC saidIve been getting some ", "shumacker saidAnd You feel importan"
)), .Names = c("uber_name", "uber_write"), row.names = c(NA, 
-79L), class = "data.frame")

Here is one answer:

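# Join all user names into a single alternation pattern: name1|name2|...
dont_want <- paste0(data$uber_name, collapse = "|")
# Delete every match of that pattern from each post
data$uber_write2 <- gsub(pattern = dont_want, "", data$uber_write)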
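
A plain alternation also matches inside longer words, which may matter for short names such as "Mir". The sketch below shows the effect on a toy data frame (the posts are made up for illustration) and one possible refinement, wrapping the pattern in \\b word boundaries with perl = TRUE. It assumes the names contain no regex metacharacters and begin and end with a letter or digit.

```r
# Toy data (assumed for illustration); "Mir" and "FuberNYC" are names from the question
data <- data.frame(
  uber_name  = c("Mir", "FuberNYC"),
  uber_write = c("Mirror check before you drive",
                 "FuberNYC saidIve been getting some "),
  stringsAsFactors = FALSE
)

# Plain alternation, as in the answer: "Mir" is also removed from "Mirror"
dont_want <- paste0(data$uber_name, collapse = "|")
plain <- gsub(dont_want, "", data$uber_write)
plain
# "ror check before you drive"  " saidIve been getting some "

# Same alternation between word boundaries: only whole-word matches are removed
bounded <- paste0("\\b(", dont_want, ")\\b")
data$uber_write2 <- gsub(bounded, "", data$uber_write, perl = TRUE)
data$uber_write2
# "Mirror check before you drive"  " saidIve been getting some "
```

Whether the boundary version is the better choice depends on how the forum fuses quoted names with the text that follows, so it is worth checking a few posts by hand.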
