Automatically match and replace word & position with its replacement
I am working with Google Analytics conversion path data in R. The data frame I imported looks like the following example:
Channel_Path | Source_Path
Social > Email > Social > Paid Search > Social | facebook > mailtool > m.facebook.com > google > facebook+instagram
Organic Search > Email > Social | google > mailtool > pinterest
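As you can see, the different channels are separated by a ">" sign. What I would like to do is:
Replace "Social" in the "Channel_Path" column with the corresponding value from the "Source_Path" column, without changing any of the other values. This should happen for every row in the dataset.
The result should look like this: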
Channel_Path | Source_Path
facebook > Email > m.facebook.com > Paid Search > facebook+instagram | facebook > mailtool > m.facebook.com > google > facebook+instagram
Organic Search > Email > pinterest | google > mailtool > pinterest
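My problem is that I am working with a large dataset (60.000 rows) and I don't know how to replace the values automatically based on their position.
For better reproducibility, here is the code for the example above: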
df <- data.frame(Channel_Path = c("Social > Email > Social > Paid Search > Social", "Organic Search > Email > Social"),
Source_Path = c("facebook > mailtool > m.facebook.com > google > facebook+instagram", "google > mailtool > pinterest"))
Thanks!
Input:
df <- data.frame(Channel_Path = c("Social > Email > Social > Paid Search > Social", "Organic Search > Email > Social"),
Source_Path = c("facebook > mailtool > m.facebook.com > google > facebook+instagram", "google > mailtool > pinterest"))
Function:
library(dplyr)
library(stringr)

# Replace each "Social" step in one Channel_Path string with the source found
# at the same position in the matching Source_Path string.
google_analytics <- function(col1, col2) {
  str1 <- str_split(col1, " > ")[[1]]
  str2 <- str_split(col2, " > ")[[1]]
  for (i in seq_along(str1)) {
    if (str1[i] == "Social") {
      # Facebook variants are additionally collapsed into plain "facebook"
      str1[i] <- ifelse(str2[i] %in% c("facebook+instagram", "m.facebook.com"), "facebook", str2[i])
    }
  }
  # Join the steps back into a single " > "-separated path
  paste(str1, collapse = " > ")
}

df <- df %>% rowwise() %>% dplyr::mutate(Channel_Path = google_analytics(Channel_Path, Source_Path))
Output:
Channel_Path Source_Path
<chr> <chr>
1 facebook > Email > facebook > Paid Search > f~ facebook > mailtool > m.facebook.com > google > faceboo~
2 Organic Search > Email > pinterest google > mailtool > pinterest
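Note that the function above also collapses m.facebook.com and facebook+instagram into facebook, which differs from the result asked for in the question. If the source values should be kept verbatim, a minimal base R sketch could look like the following. The helper name replace_social and its target argument are made up for illustration, and the sketch assumes both columns are character vectors whose paths have the same number of steps in each row.

```r
# Hypothetical helper (not part of the answer above): swap every `target`
# step for the source at the same position, using base R only.
replace_social <- function(channel_path, source_path, target = "Social") {
  channels <- strsplit(channel_path, " > ", fixed = TRUE)
  sources  <- strsplit(source_path,  " > ", fixed = TRUE)
  # Walk over the rows pairwise; each call handles one path
  mapply(function(ch, src) {
    hit <- ch == target
    ch[hit] <- src[hit]            # positional replacement
    paste(ch, collapse = " > ")
  }, channels, sources, USE.NAMES = FALSE)
}

df <- data.frame(
  Channel_Path = c("Social > Email > Social > Paid Search > Social",
                   "Organic Search > Email > Social"),
  Source_Path  = c("facebook > mailtool > m.facebook.com > google > facebook+instagram",
                   "google > mailtool > pinterest"),
  stringsAsFactors = FALSE
)

df$Channel_Path <- replace_social(df$Channel_Path, df$Source_Path)
df$Channel_Path
# [1] "facebook > Email > m.facebook.com > Paid Search > facebook+instagram"
# [2] "Organic Search > Email > pinterest"
```

Because strsplit() is vectorised over the whole column, this avoids rowwise() and should scale reasonably to the 60.000 rows mentioned in the question, though I have not benchmarked it.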
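We can get the data into long format by separating both columns on " > ", replace the Channel_Path values with the Source_Path values where Channel_Path == 'Social', and paste the values back together.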
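library(dplyr)

df %>%
  mutate(row = row_number()) %>%
  # one row per step; both columns are split in parallel
  tidyr::separate_rows(Channel_Path, Source_Path, sep = " > ") %>%
  mutate(Channel_Path = ifelse(Channel_Path == 'Social',
                               Source_Path, Channel_Path)) %>%
  group_by(row) %>%
  # everything() is spelled out: across() without .cols is deprecated in dplyr >= 1.1.0
  summarise(across(everything(), ~ paste(.x, collapse = " > "))) %>%
  select(-row)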
# Channel_Path
#1 facebook > Email > m.facebook.com > Paid Search > facebook+instagram
#2 Organic Search > Email > pinterest
# Source_Path
#1 facebook > mailtool > m.facebook.com > google > facebook+instagram
#2 google > mailtool > pinterest
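The long-format approach relies on every row having the same number of steps in Channel_Path and Source_Path; as far as I know, separate_rows() fails when the counts differ between the columns it splits together. On a real export it may be worth checking that first. The sketch below is only an illustration: the data frame check and its third row are made up to show what a mismatch looks like.

```r
# Made-up data: rows 1 and 2 are the question's example, row 3 is an
# intentionally broken row (3 channel steps, only 2 sources).
check <- data.frame(
  Channel_Path = c("Social > Email > Social > Paid Search > Social",
                   "Organic Search > Email > Social",
                   "Social > Email > Direct"),
  Source_Path  = c("facebook > mailtool > m.facebook.com > google > facebook+instagram",
                   "google > mailtool > pinterest",
                   "facebook > mailtool"),
  stringsAsFactors = FALSE
)

# Number of steps per row in each column
n_channel <- lengths(strsplit(check$Channel_Path, " > ", fixed = TRUE))
n_source  <- lengths(strsplit(check$Source_Path,  " > ", fixed = TRUE))

# Row indices where the two paths do not line up
mismatched <- which(n_channel != n_source)
mismatched
# [1] 3
```

Rows listed in mismatched can then be inspected or dropped before running any of the positional replacements.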
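We will work rowwise: for each row, we use scan() to parse the elements of each column, then ifelse() to build the vector of the right elements, and finally collapse that vector back into the requested output.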
library(dplyr, warn.conflicts = FALSE)

df %>%
  rowwise() %>%
  mutate_at("Channel_Path", ~{
    cp <- scan(text = ., what = character(), sep = ">", strip.white = TRUE, quiet = TRUE)
    sp <- scan(text = Source_Path, what = character(), sep = ">", strip.white = TRUE, quiet = TRUE)
    cp <- ifelse(cp == "Social", sp, cp)
    paste(cp, collapse = " > ")
  }) %>%
  ungroup()
#> # A tibble: 2 x 2
#> Channel_Path Source_Path
#> <chr> <chr>
#> 1 facebook > Email > m.facebook.com > Pai~ facebook > mailtool > m.facebook.com~
#> 2 Organic Search > Email > pinterest google > mailtool > pinterest