How to clean Aerosmith's discography by looping over agrepl and replacing matches with a vector?

Question

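I scraped some data from a discography website to make an infographic of Aerosmith song categories. The dataset has a songs variable that contains many random/unwanted characters and some punctuation, and some rows have more than one song in the column.

I am trying to loop over the songs with a vector 'y', find approximate matches, and replace each match with the corresponding value from 'y', but I'm getting nowhere. I'm not sure a for loop is the best approach, and I'm basically stuck.

The code below is a reproducible dataset, along with the code I'm using to search and replace.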

library(dplyr)  # provides %>% and re-exports tibble()

# Canonical song titles to map the messy values onto
y <- c('Eat the Rich','Cry\'n','Dream On','Crazy')

set.seed(123)

# tibble() replaces the deprecated data_frame()
alpha <- tibble(
 songs = paste0(sample(c('walkthisway','adfkbjf','dudelookslikealady','cryn','eattherich'),100,replace=TRUE),sample(c('aadfa','aghnds','crazy','wwrrsdg'),100,replace=TRUE)),

 album = sample(c('Toys in the Attic','Get a Grip','Greatest Hits'),100,replace=TRUE))

alpha %>% head()

This is as far as I've gotten with the code; it seems to work when the vector 'y' contains only one value.

alpha[[i]][agrepl(y,alpha[[i]])] <- y

Answer 1

Here is what you need :-)

# Remove special characters
# In this case " " and "'"
foo <- gsub(" |'", "" , y)
# Transform to lower case
foo <- tolower(foo)

for(i in foo) {
    # Get original song name
    bar <- y[which(foo == i)]
    # Find matches and replace with original song
    alpha$songs[grep(i, alpha$songs)] <- bar
}
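
Since the question asks about agrepl specifically, here is a minimal sketch of the same loop using approximate matching instead of grep. agrepl is documented to take a single pattern string, which is likely why the attempt in the question only works when 'y' has one value; looping over the titles one at a time sidesteps that. The `songs` vector below is a small hand-made sample (not the asker's data), and `keys` and `cleaned` are names made up for illustration.

```r
# Canonical song titles (from the question)
y <- c("Eat the Rich", "Cry'n", "Dream On", "Crazy")

# Hypothetical messy values; the first one has a typo ("eatherich")
songs <- c("eatherichaadfa", "adfkbjfcrazy", "crynwwrrsdg", "walkthiswayaghnds")

# Normalise the titles as in the answer: drop spaces and apostrophes, lower-case
keys <- tolower(gsub(" |'", "", y))

cleaned <- songs
for (k in seq_along(y)) {
  # One pattern per call; match against the untouched input vector.
  # The default max.distance (0.1) allows roughly one edit for these short keys.
  hit <- agrepl(keys[k], songs)
  cleaned[hit] <- y[k]
}

cleaned
```

Values that match none of the titles are left unchanged. Because every title is matched against the original `songs`, a row that contains more than one title ends up with whichever matching title comes last in 'y'; that is a choice made for this sketch and may not be what you want for rows holding several songs.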

indexing

loops

r

gsub