How to pivot_longer on multiple columns with unequal numbers of comma-separated values

I have messy data in which several columns each hold multiple comma-separated values:

df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)

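What I'd like to do is pivot_longer so that each comma-separated value gets its own row. I can accomplish this, but the way I'm doing it feels clumsy: it involves several intermediate steps and temporary dfs, which makes the code verbose and heavy: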

library(dplyr)
library(tidyr)

# first temporary `df`:
df1 <- df %>%
  select(-ends_with("dur")) %>%
  pivot_longer(cols = ends_with("aoi"),
               names_to = "Speaker") %>%
  separate_rows(value, sep = "(?!^|$)") %>%
  mutate(Speaker = sub("^(.).*", "\\1", Speaker)) %>%
  rename(AOI = value)

# second temporary `df`:
df2 <- df %>%
  select(-ends_with("aoi")) %>%
  pivot_longer(cols = ends_with("dur")) %>%
  separate_rows(value, sep = ",") %>%
  rename(Dur = value)

# final `df` (aka, the **expected outcome**):
df3 <- cbind(df1, df2[,4])

df3
   Line Utterance Speaker AOI Dur
1     1  hi there       A   C 100
2     1  hi there       A   *  25
3     1  hi there       A   B  30
4     1  hi there       A   *  50
5     1  hi there       A   C 144
6     1  hi there       B   * 777
7     1  hi there       B   A 876
8     2 how're ya       A   * 200
9     2 how're ya       B   *  50
10    2 how're ya       B   A  22
11    2 how're ya       B   *  33
12    2 how're ya       B   A 100
13    2 how're ya       B   * 150
14    2 how're ya       B   C 999

How can this transformation be achieved more concisely?

I don't know whether this is really "more concise", but here is a way to do all the work in a single pipe chain.

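  1. All values in columns other than Line and Utterance will end up in the value columns, so pivot them all longer
  2. Separate out Speaker from the column name
  3. Pivot the value columns wider into the final shape (one column for AOI, one for Dur). This long-then-wide pattern, driven by different pieces of the key, is common
  4. Finally, split up the values so we can put them in their own rows. I don't think separate_rows works well with a different separator per column, especially where we need to split on every character, so we can split manually and then unnest to get the desired output. Note that this relies on AOI and Dur having the same number of elements, which I'm assuming is true given the input.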
library(tidyverse)
df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)

df %>%
  pivot_longer(
    cols = matches("(aoi|dur)$"),
    names_to = "name",
    values_to = "value"
  ) %>%
  separate(name, into = c("Speaker", "aoi_dur"), sep = "_", extra = "merge") %>%
  pivot_wider(names_from = aoi_dur, values_from = value) %>%
  rename(AOI = aoi, Dur = aoi_dur) %>%
  mutate(
    AOI = str_split(AOI, pattern = ""),
    Dur = str_split(Dur, pattern = ",")
  ) %>%
  unnest(c(AOI, Dur))
#> # A tibble: 14 x 5
#>     Line Utterance Speaker AOI   Dur  
#>    <int> <chr>     <chr>   <chr> <chr>
#>  1     1 hi there  A       C     100  
#>  2     1 hi there  A       *     25   
#>  3     1 hi there  A       B     30   
#>  4     1 hi there  A       *     50   
#>  5     1 hi there  A       C     144  
#>  6     1 hi there  B       *     777  
#>  7     1 hi there  B       A     876  
#>  8     2 how're ya A       *     200  
#>  9     2 how're ya B       *     50   
#> 10     2 how're ya B       A     22   
#> 11     2 how're ya B       *     33   
#> 12     2 how're ya B       A     100  
#> 13     2 how're ya B       *     150  
#> 14     2 how're ya B       C     999

Created on 2021-06-24 by the reprex package (v1.0.0)
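
The caveat in the last step (AOI and Dur must have the same number of elements per row) can be checked before calling unnest. The following is only a sketch under that assumption: the names split_df and mismatched are hypothetical, and the check uses base R's lengths() on the list-columns produced by str_split. As far as I know, unnest either errors on rows of incompatible sizes or recycles a length-1 element, so checking explicitly seems safer.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same example data as in the question
df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)

# Same pipeline as above, stopped just before unnest()
split_df <- df %>%
  pivot_longer(cols = matches("(aoi|dur)$")) %>%
  separate(name, into = c("Speaker", "aoi_dur"), sep = "_", extra = "merge") %>%
  pivot_wider(names_from = aoi_dur, values_from = value) %>%
  rename(AOI = aoi, Dur = aoi_dur) %>%
  mutate(
    AOI = str_split(AOI, pattern = ""),
    Dur = str_split(Dur, pattern = ",")
  )

# Rows where the two list-columns disagree in length (hypothetical check)
mismatched <- split_df %>%
  filter(lengths(AOI) != lengths(Dur))

# Stop early instead of letting unnest() fail or recycle values
stopifnot(nrow(mismatched) == 0)

result <- split_df %>%
  unnest(c(AOI, Dur))
```

If mismatched is not empty, inspecting its Line and Speaker columns shows which input cells need fixing.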

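Here is one option in tidyverse

  1. Rename the columns that end with '_aoi' by pasting '_AOI' onto them (rename_with)
  2. Reshape from 'wide' to 'long' - pivot_longer
  3. Insert a , delimiter between each character in 'AOI' so that both columns share a common delimiter - str_replace_all
  4. Finally, use separate_rows on the , delimiter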
library(dplyr)
library(tidyr)
library(stringr)
df %>% 
    rename_with(~ str_c(., "_AOI"), ends_with("_aoi")) %>% 
    pivot_longer(cols = contains("_"), 
      names_to = c("Speaker", ".value"), names_pattern = "^(.*)_([^_]+$)") %>% 
    mutate(AOI = str_replace_all(AOI, "(?<=.)(?=.)", ",")) %>% 
    separate_rows(c(AOI, dur), sep = ",", convert = TRUE)

-output

# A tibble: 14 x 5
    Line Utterance Speaker AOI     dur
   <int> <chr>     <chr>   <chr> <int>
 1     1 hi there  A_aoi   C       100
 2     1 hi there  A_aoi   *        25
 3     1 hi there  A_aoi   B        30
 4     1 hi there  A_aoi   *        50
 5     1 hi there  A_aoi   C       144
 6     1 hi there  B_aoi   *       777
 7     1 hi there  B_aoi   A       876
 8     2 how're ya A_aoi   *       200
 9     2 how're ya B_aoi   *        50
10     2 how're ya B_aoi   A        22
11     2 how're ya B_aoi   *        33
12     2 how're ya B_aoi   A       100
13     2 how're ya B_aoi   *       150
14     2 how're ya B_aoi   C       999
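
In the output above, Speaker comes out as A_aoi / B_aoi and the duration column is named dur, whereas the expected outcome in the question has A / B and Dur. A possible adjustment, offered only as a sketch: strip the suffix with str_remove and rename the column at the end. The first call also shows the lookaround replacement in isolation; the variable names with_commas and result are hypothetical.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same example data as in the question
df <- data.frame(
  Line = 1:2,
  Utterance = c("hi there", "how're ya"),
  A_aoi = c("C*B*C", "*"),
  A_aoi_dur = c("100,25,30,50,144", "200"),
  B_aoi = c("*A", "*A*A*C"),
  B_aoi_dur = c("777,876", "50,22,33,100,150,999")
)

# "(?<=.)(?=.)" matches the empty position between two adjacent characters,
# so a comma is inserted between every pair of characters
with_commas <- str_replace_all("C*B*C", "(?<=.)(?=.)", ",")

result <- df %>%
  rename_with(~ str_c(., "_AOI"), ends_with("_aoi")) %>%
  pivot_longer(
    cols = contains("_"),
    names_to = c("Speaker", ".value"),
    names_pattern = "^(.*)_([^_]+$)"
  ) %>%
  mutate(
    # Drop the leftover "_aoi" so Speaker is just "A" or "B"
    Speaker = str_remove(Speaker, "_aoi$"),
    AOI = str_replace_all(AOI, "(?<=.)(?=.)", ",")
  ) %>%
  separate_rows(c(AOI, dur), sep = ",", convert = TRUE) %>%
  # Match the column name used in the expected outcome
  rename(Dur = dur)
```

Because convert = TRUE is kept, Dur should come back as an integer column rather than character.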