How to pivot_longer on multiple columns with unequal numbers of comma-separated values

Question

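My data is messy: several columns each hold multiple delimited values in a single cell (comma-separated for the durations, one character per value for the AOI columns):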

df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)

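What I'd like to do is pivot_longer so that each comma-separated value gets its own row. I can accomplish this, but the way I'm doing it feels clumsy: it involves several intermediate steps and temporary data frames, which makes the code verbose and heavy: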

library(dplyr)
library(tidyr)

# first temporary `df`:
df1 <- df %>%
  select(-ends_with("dur")) %>%
  pivot_longer(cols = ends_with("aoi"),
               names_to = "Speaker") %>%
  separate_rows(value, sep = "(?!^|$)") %>%
  mutate(Speaker = sub("^(.).*", "\\1", Speaker)) %>%  # keep only the first character (backreference needs a doubled backslash in R)
  rename(AOI = value)

# second temporary `df`:
df2 <- df %>%
  select(-ends_with("aoi")) %>%
  pivot_longer(cols = ends_with("dur")) %>%
  separate_rows(value, sep = ",") %>%
  rename(Dur = value)

# final `df` (aka, the **expected outcome**):
df3 <- cbind(df1, df2[,4])

df3
   Line Utterance Speaker AOI Dur
1     1  hi there       A   C 100
2     1  hi there       A   *  25
3     1  hi there       A   B  30
4     1  hi there       A   *  50
5     1  hi there       A   C 144
6     1  hi there       B   * 777
7     1  hi there       B   A 876
8     2 how're ya       A   * 200
9     2 how're ya       B   *  50
10    2 how're ya       B   A  22
11    2 how're ya       B   *  33
12    2 how're ya       B   A 100
13    2 how're ya       B   * 150
14    2 how're ya       B   C 999

How can this transformation be achieved more concisely?

Answer 1

I don't know whether this is really "more concise", but here is a way to do everything in a single pipe chain.

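The values in every column except Line and Utterance will end up in the value column, so pivot all of them longer.
Extract Speaker from the column names.
Pivot the value column wider into its final shape (one column for AOI, one for Dur). This pattern of going longer and then wider on different keys is common.
Finally, split the values so that each one can go on its own row. I don't think separate_rows copes well with a different separator per column, especially when one column has to be split at every character, so we split manually and then unnest to get the desired output. Note that this relies on AOI and Dur having the same number of elements in each row, which I'm assuming holds as long as the input is correct.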

library(tidyverse)
df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)

df %>%
  pivot_longer(
    cols = matches("(aoi|dur)$"),
    names_to = "name",
    values_to = "value"
  ) %>%
  separate(name, into = c("Speaker", "aoi_dur"), sep = "_", extra = "merge") %>%
  pivot_wider(names_from = aoi_dur, values_from = value) %>%
  rename(AOI = aoi, Dur = aoi_dur) %>%
  mutate(
    AOI = str_split(AOI, pattern = ""),
    Dur = str_split(Dur, pattern = ",")
  ) %>%
  unnest(c(AOI, Dur))
#> # A tibble: 14 x 5
#>     Line Utterance Speaker AOI   Dur  
#>    <int> <chr>     <chr>   <chr> <chr>
#>  1     1 hi there  A       C     100  
#>  2     1 hi there  A       *     25   
#>  3     1 hi there  A       B     30   
#>  4     1 hi there  A       *     50   
#>  5     1 hi there  A       C     144  
#>  6     1 hi there  B       *     777  
#>  7     1 hi there  B       A     876  
#>  8     2 how're ya A       *     200  
#>  9     2 how're ya B       *     50   
#> 10     2 how're ya B       A     22   
#> 11     2 how're ya B       *     33   
#> 12     2 how're ya B       A     100  
#> 13     2 how're ya B       *     150  
#> 14     2 how're ya B       C     999

Created on 2021-06-24 by the reprex package (v1.0.0)
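
The caveat about AOI and Dur needing the same number of elements per row can be checked before unnesting. The following is only a minimal sketch under assumptions: `toy` is a hypothetical data frame already in the wide AOI/Dur shape, and the helper columns `n_aoi`, `n_dur` and `ok` are names made up for illustration. The second row is deliberately one duration short.

library(dplyr)
library(stringr)

# Hypothetical input, already in the shape produced by pivot_wider + rename
toy <- data.frame(
  AOI = c("C*B", "*A"),
  Dur = c("100,25,30", "777")  # second row has 2 AOI characters but only 1 duration
)

counts <- toy %>%
  mutate(
    n_aoi = str_length(AOI),           # one AOI value per character
    n_dur = str_count(Dur, ",") + 1L,  # one duration per comma, plus one
    ok    = n_aoi == n_dur             # TRUE where the two columns line up
  )

counts

Rows where `ok` is FALSE would not pair up as intended after unnest, so they are worth inspecting or filtering first.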

Answer 2

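Here is one option in the tidyverse:

Rename the columns that end with 'aoi' by pasting '_AOI' onto their names - ends_with, rename_with
Reshape from 'wide' to 'long' - pivot_longer
Insert a , delimiter between each character in 'AOI' so that both columns share a common delimiter - str_replace_all
Finally, use separate_rows on the , delimiter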

library(dplyr)
library(tidyr)
library(stringr)
df %>% 
    rename_with(~ str_c(., "_AOI"), ends_with("_aoi")) %>% 
    pivot_longer(cols = contains("_"), 
      names_to = c("Speaker", ".value"), names_pattern = "^(.*)_([^_]+$)") %>% 
    mutate(AOI = str_replace_all(AOI, "(?<=.)(?=.)", ",")) %>% 
    separate_rows(c(AOI, dur), sep = ",", convert = TRUE)

-output

# A tibble: 14 x 5
    Line Utterance Speaker AOI     dur
   <int> <chr>     <chr>   <chr> <int>
 1     1 hi there  A_aoi   C       100
 2     1 hi there  A_aoi   *        25
 3     1 hi there  A_aoi   B        30
 4     1 hi there  A_aoi   *        50
 5     1 hi there  A_aoi   C       144
 6     1 hi there  B_aoi   *       777
 7     1 hi there  B_aoi   A       876
 8     2 how're ya A_aoi   *       200
 9     2 how're ya B_aoi   *        50
10     2 how're ya B_aoi   A        22
11     2 how're ya B_aoi   *        33
12     2 how're ya B_aoi   A       100
13     2 how're ya B_aoi   *       150
14     2 how're ya B_aoi   C       999
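
In the output above, Speaker still carries the "_aoi" suffix, whereas the expected outcome in the question has just "A" and "B". One possible adjustment, offered as a sketch rather than as part of the original answer, is to strip the suffix with str_remove in the same mutate call. The name `res` is introduced here only for illustration; everything else follows the pipeline above.

library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)

res <- df %>%
  # A_aoi -> A_aoi_AOI, so both column families end in a distinct last token
  rename_with(~ str_c(., "_AOI"), ends_with("_aoi")) %>%
  pivot_longer(cols = contains("_"),
    names_to = c("Speaker", ".value"), names_pattern = "^(.*)_([^_]+$)") %>%
  mutate(
    Speaker = str_remove(Speaker, "_aoi$"),               # "A_aoi" -> "A" (added step)
    AOI = str_replace_all(AOI, "(?<=.)(?=.)", ",")        # "C*B*C" -> "C,*,B,*,C"
  ) %>%
  separate_rows(c(AOI, dur), sep = ",", convert = TRUE)

res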
