Get id for pattern matches
I want to extract the collocates of the lemma GO.
df <- data.frame(
  id = 1:6,
  go = c("go after it", "here we go", "he went bust", "go get it go",
         "i 'm gon na go", "she 's going berserk"))
I can extract the collocates like this:
# lemma forms:
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
# alternation pattern (backslashes must be doubled inside R strings):
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
# extraction:
library(stringr)
df_GO <- data.frame(
  # word immediately before the node (or start of string)
  left = unlist(str_extract_all(df$go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))),
  # the matched form of GO itself
  node = unlist(str_extract_all(df$go, pattern_GO)),
  # word immediately after the node (or end of string)
  right = unlist(str_extract_all(df$go, paste0("(?<=\\s?", pattern_GO, "\\s?)('?\\b[a-z']+\\b|$)")))
)
The result is fine, but it does not show the id values, i.e., I don't know which 'sentence' a match was extracted from:
df_GO
  left   node   right
1          go   after
2   we     go
3   he   went    bust
4          go     get
5   it     go
6   'm gon na      go
7   na     go
8   's  going berserk
How can I get the id values, so that the result looks like this:
df_GO
  left   node   right id
1          go   after  1
2   we     go          2
3   he   went    bust  3
4          go     get  4
5   it     go          4
6   'm gon na      go  5
7   na     go          5
8   's  going berserk  6
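You are almost there. What you need to do is loop/iterate over your data frame and perform the operation on each row. This also lets you extract and store the id.
To achieve this, we wrap your steps into a function and add the id to its output.
The following uses the tidyverse packages, in particular {purrr} for the iteration.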
library(tidyverse)
# wrap your call into a function that we perform on each row
extract_GO <- function(df_row){
  df_GO <- data.frame(
    id = df_row$id,   # we also store the id for the row we process
    #---------------------- your work - just adapted the variable to function call, df_row
    ## this could have stayed the same, but this way it is easier to understand
    ## what happens here
    left = unlist(str_extract_all(df_row$go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))),
    node = unlist(str_extract_all(df_row$go, pattern_GO)),
    right = unlist(str_extract_all(df_row$go, paste0("(?<=\\s?", pattern_GO, "\\s?)('?\\b[a-z']+\\b|$)")))
  )
}
# --------------- next we iterate with purrr
## try df %>% group_split(id) to see what group_split() does
df %>%
  group_split(id) %>%          # splits data frame into list of bins, i.e. by id
  purrr::map_dfr(extract_GO)   # now we iterate over bins with our function
This yields:
  id left   node   right
1  1          go   after
2  2   we     go
3  3   he   went    bust
4  4          go     get
5  4   it     go
6  5   'm gon na      go
7  5   na     go
8  6   's  going berserk
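If you would rather not iterate row by row, a vectorised variant may also work. This is only a sketch, not part of the approach above: it counts the node matches per sentence with str_count() and repeats each id that many times, so the ids line up with the flattened output of str_extract_all(). The names n_matches, ids and nodes are made up for illustration. The sketch assumes str_count() and str_extract_all() find the same matches for the same pattern.

```r
library(stringr)

df <- data.frame(
  id = 1:6,
  go = c("go after it", "here we go", "he went bust", "go get it go",
         "i 'm gon na go", "she 's going berserk"))

lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# number of node matches found in each sentence (hypothetical helper name)
n_matches <- str_count(df$go, pattern_GO)

# repeat every id once per match; unlist() below flattens the matches
# in the same sentence order, so the two vectors stay aligned
ids <- rep(df$id, times = n_matches)
nodes <- unlist(str_extract_all(df$go, pattern_GO))

# guard against the two vectors drifting apart
stopifnot(length(ids) == length(nodes))

df_nodes <- data.frame(id = ids, node = nodes)
df_nodes
```

The left and right columns from the question could be added to the same data.frame() call, provided each of those patterns returns exactly one match per node. If they do not, the column lengths differ and data.frame() will fail.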