Automatically match and replace a word, by position, with its replacement

I am working with Google Analytics conversion path data in R. The data frame I imported looks like the following example:

    Channel_Path                                | Source_Path
Social > Email > Social > Paid Search > Social  | facebook > mailtool > m.facebook.com > google > facebook+instagram
Organic Search > Email > Social                 | google > mailtool > pinterest

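As you can see, the individual channels are separated by the ">" symbol. What I would like to do is:

Replace "Social" in the "Channel_Path" column with the corresponding value from the "Source_Path" column, i.e. the value at the same position, without changing any other values. This should happen for all rows in the dataset.

The result should look like this: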

      Channel_Path                                                   | Source_Path
facebook > Email > m.facebook.com > Paid Search > facebook+instagram | facebook > mailtool > m.facebook.com > google > facebook+instagram
Organic Search > Email > pinterest                                   | google > mailtool > pinterest

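The problem I'm having is that I am working with a large dataset (60,000 rows) and I don't know how to automatically replace values based on their position.

For better reproducibility, here is the code for the example given above: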

df <- data.frame(Channel_Path = c("Social > Email > Social > Paid Search > Social", "Organic Search > Email > Social"),
             Source_Path = c("facebook > mailtool > m.facebook.com > google > facebook+instagram", "google > mailtool > pinterest"))

Thanks!

Input:

df <- data.frame(Channel_Path = c("Social > Email > Social > Paid Search > Social", "Organic Search > Email > Social"),
         Source_Path = c("facebook > mailtool > m.facebook.com > google > facebook+instagram", "google > mailtool > pinterest"))

Function:

library(tidyr)
library(dplyr)
library(stringr)

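# Replace each "Social" step in one channel path with the source at the same position.
google_analytics <- function(col1, col2) {
  str1 <- str_split(col1, " > ")[[1]]  # channel steps
  str2 <- str_split(col2, " > ")[[1]]  # source steps, in the same order
  result <- ""
  for (i in seq_along(str1)) {
    if (str1[i] == "Social") {
      # Note: this collapses the Facebook variants into "facebook";
      # use str2[i] directly to keep the source value unchanged.
      str1[i] <- ifelse(str2[i] %in% c("facebook+instagram", "m.facebook.com"), "facebook", str2[i])
    }
    if (i == length(str1)) {
      # Last step: append without a trailing separator.
      result <- paste0(result, str1[i])
      next
    }
    result <- paste0(result, str1[i], " > ")
  }

  return(result)
}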

df <- df %>% rowwise() %>% dplyr::mutate(Channel_Path=google_analytics(Channel_Path,Source_Path))

Output:

Channel_Path                                   Source_Path                                             
  <chr>                                          <chr>                                                   
1 facebook > Email > facebook > Paid Search > f~ facebook > mailtool > m.facebook.com > google > faceboo~
2 Organic Search > Email > pinterest             google > mailtool > pinterest
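As a complement to the loop above, here is a minimal base R sketch of the same position-wise replacement that keeps the source value exactly as it appears (so m.facebook.com is not collapsed into facebook). The helper name `replace_social` is hypothetical, and the sketch assumes both columns are character vectors with the same number of steps in every row.

```r
df <- data.frame(
  Channel_Path = c("Social > Email > Social > Paid Search > Social",
                   "Organic Search > Email > Social"),
  Source_Path  = c("facebook > mailtool > m.facebook.com > google > facebook+instagram",
                   "google > mailtool > pinterest"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: handles one row (one channel path and its source path).
replace_social <- function(channel_path, source_path) {
  channels <- strsplit(channel_path, " > ", fixed = TRUE)[[1]]
  sources  <- strsplit(source_path,  " > ", fixed = TRUE)[[1]]
  # Assumption: both paths have the same number of steps.
  stopifnot(length(channels) == length(sources))
  is_social <- channels == "Social"
  channels[is_social] <- sources[is_social]  # swap in the source at the same position
  paste(channels, collapse = " > ")
}

# Apply the helper to every pair of values, row by row.
df$Channel_Path <- mapply(replace_social, df$Channel_Path, df$Source_Path,
                          USE.NAMES = FALSE)
print(df$Channel_Path)
```

For the example data this should give the two paths requested in the question.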

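We can reshape the data into long format by splitting both columns on " > ", replace the Channel_Path value with the Source_Path value wherever Channel_Path == 'Social', and then paste the values back together for each row.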

library(dplyr)

df %>%
  mutate(row = row_number()) %>%
  tidyr::separate_rows(Channel_Path, Source_Path, sep = " > ") %>%
  mutate(Channel_Path = ifelse(Channel_Path == 'Social', 
                               Source_Path, Channel_Path)) %>%
  group_by(row) %>%
  summarise(across(everything(), ~ paste(.x, collapse = " > "))) %>%
  select(-row) 

#                                                          Channel_Path
#1 facebook > Email > m.facebook.com > Paid Search > facebook+instagram
#2                                   Organic Search > Email > pinterest
#                                                         Source_Path
#1 facebook > mailtool > m.facebook.com > google > facebook+instagram
#2                                      google > mailtool > pinterest
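Any position-based replacement depends on both columns having the same number of steps in each row. With 60,000 rows it may be worth checking that first. The following is only a sketch of such a check in base R; the variable names (`n_channel`, `n_source`, `mismatched`) are made up for illustration.

```r
df <- data.frame(
  Channel_Path = c("Social > Email > Social > Paid Search > Social",
                   "Organic Search > Email > Social"),
  Source_Path  = c("facebook > mailtool > m.facebook.com > google > facebook+instagram",
                   "google > mailtool > pinterest"),
  stringsAsFactors = FALSE
)

# Count the steps in each path.
n_channel <- lengths(strsplit(df$Channel_Path, " > ", fixed = TRUE))
n_source  <- lengths(strsplit(df$Source_Path,  " > ", fixed = TRUE))

# Row numbers where the two paths cannot be aligned position by position.
mismatched <- which(n_channel != n_source)
print(mismatched)
```

For the example data no rows are reported. Any rows that do show up would need to be inspected or excluded before splitting.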

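We work row by row: for each row, we parse the elements of each column with scan(), use ifelse() to build the vector with the right elements, and collapse it back into the requested output.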

library(dplyr, warn.conflicts = FALSE)

df %>%
  rowwise() %>%
  mutate_at("Channel_Path", ~{
    cp <- scan(text = ., what = character(), sep = ">", strip.white = TRUE, quiet = TRUE)
    sp <- scan(text = Source_Path, what = character(), sep = ">", strip.white = TRUE, quiet = TRUE)
    cp <- ifelse(cp == "Social", sp, cp)
    paste(cp, collapse = " > ")
  }) %>%
  ungroup()
#> # A tibble: 2 x 2
#>   Channel_Path                             Source_Path                          
#>   <chr>                                    <chr>                                
#> 1 facebook > Email > m.facebook.com > Pai~ facebook > mailtool > m.facebook.com~
#> 2 Organic Search > Email > pinterest       google > mailtool > pinterest
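To make the individual steps easier to follow, here is a sketch that walks through them for the first example row only, outside of any data frame. The variable names `channel_path`, `source_path` and `out` are chosen just for this illustration.

```r
channel_path <- "Social > Email > Social > Paid Search > Social"
source_path  <- "facebook > mailtool > m.facebook.com > google > facebook+instagram"

# Split on ">" and strip the surrounding blanks; spaces inside a step
# (e.g. "Paid Search") are kept because the separator is ">" only.
cp <- scan(text = channel_path, what = character(), sep = ">",
           strip.white = TRUE, quiet = TRUE)
sp <- scan(text = source_path, what = character(), sep = ">",
           strip.white = TRUE, quiet = TRUE)

# Element-wise choice: take the source where the channel is "Social".
cp <- ifelse(cp == "Social", sp, cp)

# Collapse back into a single path string.
out <- paste(cp, collapse = " > ")
print(out)
# "facebook > Email > m.facebook.com > Paid Search > facebook+instagram"
```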