pivot_longer with regex for two-digit numbers in names_pattern

Question

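I have a large wide-format dataset with many variables measured across multiple waves. There is one column per variable-wave combination (e.g. Age1, Age2, Age3, Age4), plus some time-invariant variables (e.g. ID, Sex). After pivoting, I want each variable to be represented by a single column, alongside a new 'wave' column.

It works almost perfectly, except that I can't get waves 1-9 and waves 10-13 into the same column.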

df <- data.frame(
  ID = c(10001,10002),
  Sex = c(1,2),
  Age1 = c(73,25),
  Age2 = c(74,26),
  Age3=c(75,27),
  Age4 = c(76,28),
  Age5 = c(77,29),
  Age6=c(78,30),
  Age7 = c(79,31),
  Age8 = c(80,31),
  Age9=c(81,33),
  Age10=c(82,34),
  Age11 = c(83,35),
  Age12 = c(84,36),
  Age13=c(85,37)
)

names_test<-names(df) 
no_numb<-grep("*[A-Za-z]$", names_test) #to identify all the column names ending with a letter, which I do NOT want to pivot into longer form

df_long<-pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
                  names_pattern = "(.*)(\\d+)$")

Long-format output:

> df_long
# A tibble: 20 x 5
      ID   Sex wave    Age  Age1
   <dbl> <dbl> <chr> <dbl> <dbl>
 1 10001     1 1        73    83
 2 10001     1 2        74    84
 3 10001     1 3        75    85
 4 10001     1 4        76    NA
 5 10001     1 5        77    NA
 6 10001     1 6        78    NA
 7 10001     1 7        79    NA
 8 10001     1 8        80    NA
 9 10001     1 9        81    NA
10 10001     1 0        NA    82
11 10002     2 1        25    35
12 10002     2 2        26    36
13 10002     2 3        27    37
14 10002     2 4        28    NA
15 10002     2 5        29    NA
16 10002     2 6        30    NA
17 10002     2 7        31    NA
18 10002     2 8        31    NA
19 10002     2 9        33    NA
20 10002     2 0        NA    34

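As you can see, there is an Age column containing the values for waves 1-9, and an Age1 column containing the values labelled 1, 2, 3 and 0 (i.e. Age11, Age12, Age13 and Age10). I assume the problem lies in either the names_to or the names_pattern argument. Any help would be greatly appreciated!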

Answer 1

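.* is greedy, so it takes the longest possible match: for Age10 it captures Age1 and leaves only the final 0 for the digit group. You can use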

pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
                  names_pattern = "(Age)(\\d+)$")

or make it non-greedy by adding ?

pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
                  names_pattern = "(.*?)(\\d+)$")

This returns

# A tibble: 26 x 4
      ID   Sex wave    Age
   <dbl> <dbl> <chr> <dbl>
 1 10001     1 1        73
 2 10001     1 2        74
 3 10001     1 3        75
 4 10001     1 4        76
 5 10001     1 5        77
 6 10001     1 6        78
 7 10001     1 7        79
 8 10001     1 8        80
 9 10001     1 9        81
10 10001     1 10       82
# ... with 16 more rows
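
The difference between the greedy and the non-greedy pattern can be checked on the column names alone, without pivoting. The following is a minimal base R sketch. The variable names (nms, greedy_prefix and so on) are made up for illustration, and perl = TRUE is simply the regex engine chosen here, not necessarily the one pivot_longer uses internally.

```r
# A few column names taken from the question's data frame
nms <- c("Age1", "Age9", "Age10", "Age13")

greedy <- "(.*)(\\d+)$"   # .* grabs as much as it can
lazy   <- "(.*?)(\\d+)$"  # .*? grabs as little as it can

# sub() with a backreference keeps only the chosen capture group
greedy_prefix <- sub(greedy, "\\1", nms, perl = TRUE)
greedy_wave   <- sub(greedy, "\\2", nms, perl = TRUE)
lazy_prefix   <- sub(lazy, "\\1", nms, perl = TRUE)
lazy_wave     <- sub(lazy, "\\2", nms, perl = TRUE)

print(data.frame(nms, greedy_prefix, greedy_wave, lazy_prefix, lazy_wave))
```

With the greedy pattern, Age10 is split into Age1 and 0, which is where the extra Age1 column in the question comes from. The non-greedy pattern splits it into Age and 10.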

Answer 2

We can also use names_sep with a regular expression

library(tidyr)
 pivot_longer(df, cols = starts_with('Age'), 
   names_to = c(".value", "wave"), names_sep = "(?<=[a-z])(?=[0-9])")
# A tibble: 26 × 4
      ID   Sex wave    Age
   <dbl> <dbl> <chr> <dbl>
 1 10001     1 1        73
 2 10001     1 2        74
 3 10001     1 3        75
 4 10001     1 4        76
 5 10001     1 5        77
 6 10001     1 6        78
 7 10001     1 7        79
 8 10001     1 8        80
 9 10001     1 9        81
10 10001     1 10       82
# … with 16 more rows
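
To see where this lookaround separator falls, the zero-width match can be located with base R. This is only a sketch of the idea, not a description of how tidyr applies names_sep internally. The variable names are illustrative, and BMI1 is a hypothetical column name that does not appear in the question.

```r
nms <- c("Age1", "Age10", "Age13")
sep <- "(?<=[a-z])(?=[0-9])"  # between a lowercase letter and a digit

# regexpr() reports the 1-based position of the zero-width match, or -1 if none
pos <- as.integer(regexpr(sep, nms, perl = TRUE))

prefix <- substr(nms, 1, pos - 1)       # everything before the boundary
wave   <- substr(nms, pos, nchar(nms))  # everything from the boundary on
print(data.frame(nms, pos, prefix, wave))

# Hypothetical name with an uppercase letter before the digit:
# [a-z] does not match "I", so no split point is found
no_split <- as.integer(regexpr(sep, "BMI1", perl = TRUE))
print(no_split)
```

Because the boundary is defined by the first letter-to-digit transition, two-digit waves stay intact. If some variable names end in an uppercase letter before the wave number, a lookbehind such as (?<=[A-Za-z]) may be the safer choice.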
