R regex expressions in tidyr and dplyr?
I have a file made up of thousands of lines of this type:
1 number entry size1 size2 value size5 value2 my_id1k "AJKJjsdfe76r55"; my_label “1900”; my_idk2 "49354ytu866"; you_digit "some"; my_copy “jkl”;
1 number entry size3 size4 value size6 value2 my_id1k "xyz804"; my_id2k “FI71"; my_id3k “Sk9000”; my_id4k “ldv”;
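I would like to find a way to extract the content of the my_id1k and my_id2k entries (without the double quotes), as well as some other columns (my code is provided below).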
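To do this, I would like to use the separate() and select() functions from the tidyr and dplyr packages, since they are very fast (and I care about performance), so I have been looking at: http://rpackages.ianhowson.com/cran/tidyr/man/separate.html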
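However, I am not sure how to specify the into and sep options in this heterogeneous case (my last column varies in length) to get the output I want. Some rows clearly contain more information than others, so I would like to know how to write performant tidyr and dplyr code that extracts the desired entries as quickly as possible.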
Here is my work so far:
> library(dplyr)
> library(tidyr)
> library(data.table)
> x <- fread("myfile_MWE.txt")
> x
V1 V2 V3 V4 V5 V6 V7 V8 V9
1: 1 number entry size1 size2 value size5 value2 my_id1k "AJKJjsdfe76r55"; my_label “1900”; my_idk2 "49354ytu866"; you_digit "some"; my_copy “jkl”;
2: 1 number entry size3 size4 value size6 value2 my_id1k "xyz804"; my_id2k “FI71"; my_id3k “Sk9000”; my_id4k “ldv”;
> y <- separate(x, V9, into = paste("V", 1:15, sep = "_"))
> y
V1 V2 V3 V4 V5 V6 V7 V8 V_1 V_2 V_3 V_4 V_5 V_6 V_7 V_8 V_9 V_10 V_11 V_12 V_13 V_14 V_15
1: 1 number entry size1 size2 value size5 value2 my id1k AJKJjsdfe76r55 my label 1900 my idk2 49354ytu866 you digit some my copy jkl
2: 1 number entry size3 size4 value size6 value2 my id1k xyz804 my id2k FI71 my id3k Sk9000 my id4k ldv NA NA
Clearly, because the last column (V9) varies in length, some entries show up as NA, and I cannot reliably extract the content of the my_id1k and my_id2k entries:
> a <- select(y, V1, V7, V_3, V_9)
> a
V1 V7 V_3 V_9
1: 1 size5 AJKJjsdfe76r55 49354ytu866
2: 1 size6 xyz804 Sk9000
> b <- select(y, V1, V7, V_3, V_6)
> b
V1 V7 V_3 V_6
1: 1 size5 AJKJjsdfe76r55 1900
2: 1 size6 xyz804 FI71
Clearly, I need V_9 in one case and V_6 in the other. My desired output is:
1 size5 AJKJjsdfe76r55 49354ytu866
1 size6 xyz804 FI71
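Can I specify the use of V_9 and V_6 conditionally, so that my code is smart enough to recognize that I want to pull out what is contained in the my_id1k and my_id2k entries, for example via regular expressions?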
Here is the data I am using:
data = structure(list(V1 = c(1L, 1L), V2 = c("number", "number"), V3 = c("entry",
"entry"), V4 = c("size1", "size3"), V5 = c("size2", "size4"),
V6 = c("value", "value"), V7 = c("size5", "size6"), V8 = c("value2",
"value2"), V9 = c("my_id1k \"AJKJjsdfe76r55\"; my_label “1900”; my_idk2 \"49354ytu866\"; you_digit \"some\"; my_copy “jkl”;",
"my_id1k \"xyz804\"; my_id2k “FI71\"; my_id3k “Sk9000”; my_id4k “ldv”;"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7",
"V8", "V9"), class = "data.frame", row.names = c(NA, -2L))
Here is the code:
library(dplyr)
library(stringi)
library(tidyr)

result =
  data %>%
  group_by(V9) %>%
  do(.$V9 %>%
       first %>%
       stri_replace_all_fixed("; ", "\n") %>%
       read.table(text = ., stringsAsFactors = FALSE)) %>%
  spread(V1, V2) %>%
  left_join(data)
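The idea behind the pipeline above is to treat V9 as a list of key/value pairs and to look values up by key rather than by position. A minimal base R sketch of that idea follows, as an illustration only and not a replacement for the code above. parse_pairs is a hypothetical helper name, and the curly quotes from the sample data are written as \u escapes so the snippet does not depend on the file encoding.

```r
# One V9 string from the question; curly quotes are written as \u201c / \u201d.
v9 <- "my_id1k \"xyz804\"; my_id2k \u201cFI71\"; my_id3k \u201cSk9000\u201d; my_id4k \u201cldv\u201d;"

# Hypothetical helper: turn one "key value; key value; ..." string into a
# named character vector. Assumes keys contain no spaces.
parse_pairs <- function(s) {
  # Split on ";" plus any following whitespace (a trailing ";" adds no piece)
  pairs <- strsplit(s, ";\\s*", perl = TRUE)[[1]]
  # The key is everything before the first space, the value everything after it
  keys   <- sub(" .*$", "", pairs, perl = TRUE)
  values <- sub("^\\S+ ", "", pairs, perl = TRUE)
  # Drop straight and curly double quotes from the values
  values <- gsub("[\"\u201c\u201d]", "", values, perl = TRUE)
  names(values) <- keys
  values
}

kv <- parse_pairs(v9)
kv[["my_id2k"]]  # "FI71", regardless of where the pair sits in the string
```

Looking values up by name is what removes the V_6 versus V_9 ambiguity from the question; the cost is that every pair is parsed, including the ones that are never used.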
tidyr::extract is a better choice than separate or spread, since there is a lot of junk you don't care about.
library(tidyr)
extract(df, V9, c('my_id1k', 'my_id2k'), 'my_id1k .(\\S+).;.*my_id(?:2k|k2) .(\\S+).;')
# V1 V2 V3 V4 V5 V6 V7 V8 my_id1k my_id2k
# 1 1 number entry size1 size2 value size5 value2 AJKJjsdfe76r55 49354ytu866
# 2 1 number entry size3 size4 value size6 value2 xyz804 FI71
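Note that this assumes my_id2k and my_idk2 are the same, as you assume in your question; my_id1k does not vary, so neither does that part of the regex. It also assumes my_id1k comes before my_id2k. When extending this to new data, be aware of the possible variations and adjust the regex accordingly.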
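If the order of the two keys cannot be relied on, one option is to match each key on its own. The following is a minimal sketch of that variation in base R, not part of the original answer: get_value is a hypothetical helper, values are assumed to contain no spaces, and the curly quotes from the sample data are written as \u escapes so the snippet does not depend on the file encoding.

```r
# The two V9 strings from the question; curly quotes are written as \u201c / \u201d.
v9 <- c(
  "my_id1k \"AJKJjsdfe76r55\"; my_label \u201c1900\u201d; my_idk2 \"49354ytu866\"; you_digit \"some\"; my_copy \u201cjkl\u201d;",
  "my_id1k \"xyz804\"; my_id2k \u201cFI71\"; my_id3k \u201cSk9000\u201d; my_id4k \u201cldv\u201d;"
)

# Hypothetical helper: return the unquoted value that follows `key`,
# or NA when the key is absent. `key` is itself a regular expression.
get_value <- function(x, key) {
  # The key must start the string or follow "; ", so that a longer key
  # ending in the same characters is not matched by accident.
  pattern <- paste0("(?:^|; )(?:", key, ") .(\\S+?).;")
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each element of m holds the full match followed by the captured value
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

ids <- data.frame(
  my_id1k = get_value(v9, "my_id1k"),
  my_id2k = get_value(v9, "my_id2k|my_idk2"),  # treat both spellings as one key
  stringsAsFactors = FALSE
)
ids
#          my_id1k     my_id2k
# 1 AJKJjsdfe76r55 49354ytu866
# 2         xyz804        FI71
```

This scans each string once per key instead of once overall, so it trades a little speed for not depending on key order; rows that lack a key get NA instead of causing the whole row to fail to match.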
Data:
df <- structure(list(V1 = c(1L, 1L), V2 = structure(c(1L, 1L), .Label = "number", class = "factor"),
V3 = structure(c(1L, 1L), .Label = "entry", class = "factor"),
V4 = structure(1:2, .Label = c("size1", "size3"), class = "factor"),
V5 = structure(1:2, .Label = c("size2", "size4"), class = "factor"),
V6 = structure(c(1L, 1L), .Label = "value", class = "factor"),
V7 = structure(1:2, .Label = c("size5", "size6"), class = "factor"),
V8 = structure(c(1L, 1L), .Label = "value2", class = "factor"),
V9 = c("my_id1k \"AJKJjsdfe76r55\"; my_label “1900”; my_idk2 \"49354ytu866\"; you_digit \"some\"; my_copy “jkl”;",
"my_id1k \"xyz804\"; my_id2k “FI71\"; my_id3k “Sk9000”; my_id4k “ldv”;"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7",
"V8", "V9"), row.names = c(NA, -2L), class = "data.frame")