How to pivot_longer on multiple columns with unequal numbers of comma-separated values
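My data is messy: several columns each hold multiple values packed into a single string (comma-separated in the `_dur` columns, one character per value in the `_aoi` columns):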
df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)
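What I'd like to do is `pivot_longer` so that each separated value gets its own row. I *can* do this, but the way I'm doing it feels clumsy, because it involves multiple intermediate steps and temporary `df`s that make the code verbose and heavy: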
library(dplyr)
library(tidyr)
# first temporary `df`:
df1 <- df %>%
  select(-ends_with("dur")) %>%
  pivot_longer(cols = ends_with("aoi"),
               names_to = "Speaker") %>%
  separate_rows(value, sep = "(?!^|$)") %>%
  mutate(Speaker = sub("^(.).*", "\\1", Speaker)) %>%
  rename(AOI = value)
# second temporary `df`:
df2 <- df %>%
  select(-ends_with("aoi")) %>%
  pivot_longer(cols = ends_with("dur")) %>%
  separate_rows(value, sep = ",") %>%
  rename(Dur = value)
# final `df` (aka, the **expected outcome**):
df3 <- cbind(df1, df2[,4])
df3
   Line Utterance Speaker AOI Dur
1     1  hi there       A   C 100
2     1  hi there       A   *  25
3     1  hi there       A   B  30
4     1  hi there       A   *  50
5     1  hi there       A   C 144
6     1  hi there       B   * 777
7     1  hi there       B   A 876
8     2 how're ya       A   * 200
9     2 how're ya       B   *  50
10    2 how're ya       B   A  22
11    2 how're ya       B   *  33
12    2 how're ya       B   A 100
13    2 how're ya       B   * 150
14    2 how're ya       B   C 999
How can this transformation be achieved more concisely?
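I don't know if this is really "more concise", but here is a way to do everything in a single pipe chain.
- The values in every column except `Line` and `Utterance` will end up in the value columns, so pivot all of them longer.
- Separate `Speaker` out of the column names.
- Pivot the value column wider into its final shape (one column for `AOI`, one for `Dur`). This longer-then-wider pattern, keyed on a different part of the name each time, is quite common.
- Finally, split the values so that each one can go on its own row. I don't think `separate_rows` handles a different separator per column well, especially where we need to split between every character, so we split manually with `str_split` and then `unnest` to get the desired output. Note that this relies on `AOI` and `Dur` having the same number of elements in each row, which I assume holds for correct input.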
library(tidyverse)
df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)
df %>%
  pivot_longer(
    cols = matches("(aoi|dur)$"),
    names_to = "name",
    values_to = "value"
  ) %>%
  separate(name, into = c("Speaker", "aoi_dur"), sep = "_", extra = "merge") %>%
  pivot_wider(names_from = aoi_dur, values_from = value) %>%
  rename(AOI = aoi, Dur = aoi_dur) %>%
  mutate(
    AOI = str_split(AOI, pattern = ""),
    Dur = str_split(Dur, pattern = ",")
  ) %>%
  unnest(c(AOI, Dur))
#> # A tibble: 14 x 5
#>     Line Utterance Speaker AOI   Dur
#>    <int> <chr>     <chr>   <chr> <chr>
#>  1     1 hi there  A       C     100
#>  2     1 hi there  A       *     25
#>  3     1 hi there  A       B     30
#>  4     1 hi there  A       *     50
#>  5     1 hi there  A       C     144
#>  6     1 hi there  B       *     777
#>  7     1 hi there  B       A     876
#>  8     2 how're ya A       *     200
#>  9     2 how're ya B       *     50
#> 10     2 how're ya B       A     22
#> 11     2 how're ya B       *     33
#> 12     2 how're ya B       A     100
#> 13     2 how're ya B       *     150
#> 14     2 how're ya B       C     999
Created on 2021-06-24 by the reprex package (v1.0.0)
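The approach above only works when, in every row, the number of `AOI` characters equals the number of comma-separated durations. A minimal base R sketch of how that assumption could be checked before reshaping follows; `lengths_match` is a hypothetical helper introduced here for illustration and is not part of the original answer.

```r
# Same example data as in the question.
df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each row where the number of characters
# in `aoi` equals the number of comma-separated pieces in `dur`.
lengths_match <- function(aoi, dur) {
  nchar(aoi) == lengths(strsplit(dur, ",", fixed = TRUE))
}

ok_A <- lengths_match(df$A_aoi, df$A_aoi_dur)
ok_B <- lengths_match(df$B_aoi, df$B_aoi_dur)

# Stop early if any AOI/duration pair is inconsistent.
stopifnot(all(ok_A), all(ok_B))
```

If the check fails, the offending rows can be inspected with `df[!ok_A, ]` or `df[!ok_B, ]` before attempting the reshape.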
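Here is an option in the tidyverse:
- Rename the columns that end with '_aoi' by pasting '_AOI' onto their names (`rename_with` with `ends_with`)
- Reshape from 'wide' to 'long' with `pivot_longer`
- Insert a `,` delimiter between each character in 'AOI' so that both columns share a common delimiter (`str_replace_all`)
- Finally, use `separate_rows` on the `,` delimiter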
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  rename_with(~ str_c(., "_AOI"), ends_with("_aoi")) %>%
  pivot_longer(cols = contains("_"),
               names_to = c("Speaker", ".value"), names_pattern = "^(.*)_([^_]+$)") %>%
  mutate(AOI = str_replace_all(AOI, "(?<=.)(?=.)", ",")) %>%
  separate_rows(c(AOI, dur), sep = ",", convert = TRUE)
-output
# A tibble: 14 x 5
    Line Utterance Speaker AOI     dur
   <int> <chr>     <chr>   <chr> <int>
 1     1 hi there  A_aoi   C       100
 2     1 hi there  A_aoi   *        25
 3     1 hi there  A_aoi   B        30
 4     1 hi there  A_aoi   *        50
 5     1 hi there  A_aoi   C       144
 6     1 hi there  B_aoi   *       777
 7     1 hi there  B_aoi   A       876
 8     2 how're ya A_aoi   *       200
 9     2 how're ya B_aoi   *        50
10     2 how're ya B_aoi   A        22
11     2 how're ya B_aoi   *        33
12     2 how're ya B_aoi   A       100
13     2 how're ya B_aoi   *       150
14     2 how're ya B_aoi   C       999
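In this output `Speaker` keeps the `_aoi` suffix and the duration column is named `dur`, whereas the question's expected outcome has `Speaker` values `A` and `B` and a column named `Dur`. A possible adjustment is sketched below. It assumes every reshaped column name has the form `<speaker>_aoi_<suffix>` after the rename step. The changed `names_pattern` and the extra `rename()` are additions made here for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same example data as in the question.
df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # A_aoi -> A_aoi_AOI, B_aoi -> B_aoi_AOI (the *_dur columns are untouched)
  rename_with(~ str_c(., "_AOI"), ends_with("_aoi")) %>%
  pivot_longer(
    cols = contains("_"),
    names_to = c("Speaker", ".value"),
    # Assumed variation: consume the literal "_aoi_" so that only the
    # speaker prefix is captured in the first group.
    names_pattern = "^(.*)_aoi_([^_]+)$"
  ) %>%
  rename(Dur = dur) %>%
  # Put a comma between every pair of adjacent characters in AOI
  mutate(AOI = str_replace_all(AOI, "(?<=.)(?=.)", ",")) %>%
  separate_rows(c(AOI, Dur), sep = ",", convert = TRUE)

res
```

The rest of the pipeline is unchanged, so the reliance on `AOI` and the duration column having matching numbers of elements per row still applies.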