Subset dataframe based on list objects

I have a data frame with speech data in the column Turn:

test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
          "she'd've been so happy cos with all this stuff goin' on",
          "but we're in this together, because y' know things happens",
          "so you can't, cos well, ah because you know why!",
          "not now because it's too late!"), stringsAsFactors = F)

I want to subset the data frame to those rows where cos and/or because is preceded by at least four words. To this end, I compute the indices of cos and because in Turn:

test$Index <- sapply(strsplit(test$Turn, " "), function(x) which(x == 'cos'|x == 'because'))
test
                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6
4           so you can't, cos well, ah because you know why!  4, 7
5                             not now because it's too late!     3

One row has more than one index. That is why my attempt to subset like this fails:

test[test$Index >= 5,]
Error in `[.data.frame`(test, test$Index >= 5, ) : 
  (list) object cannot be coerced to type 'double'
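
The error follows from the shape of Index: row 4 yields two positions, so sapply cannot simplify its result to a plain vector and returns a list instead. A short check, assuming the same test data as above (words and idx are illustrative names, not part of the original code):

```r
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

# Token positions of 'cos' / 'because' in each turn
words <- strsplit(test$Turn, " ")
idx <- sapply(words, function(x) which(x == "cos" | x == "because"))

is.list(idx)   # TRUE: the results differ in length, so no simplification
lengths(idx)   # 1 1 1 2 1: row 4 holds two positions
```

A comparison such as idx >= 5 needs one number per row, which is why the list column has to be reduced first.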

How can test be subset while ignoring the second listed Index value?

Expected result:

test
                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6

I'd be grateful for any answers, including ones that skip the detour via indices and subset with a regex pattern instead.

EDIT:

Within the sapply paradigm the solution is quite simple: just select the first value of each list element:

sapply(test$Index, function(x) x[1])
[1] 8 5 6 4 3
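
Putting this together, the first index per row can drive the subset directly. The following is only a sketch under the assumption that the data are as in the question; first_idx, keep and result are illustrative names, and the NA guard is an addition for turns that contain neither word:

```r
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

words <- strsplit(test$Turn, " ")

# Position of the first 'cos' / 'because' per turn (NA if there is none)
first_idx <- vapply(words,
                    function(x) which(x %in% c("cos", "because"))[1],
                    integer(1))

# At least four words before the keyword means a position of 5 or more
keep <- !is.na(first_idx) & first_idx >= 5
result <- test[keep, , drop = FALSE]  # drop = FALSE keeps a data frame
result$Index <- first_idx[keep]
result
```

vapply is used instead of sapply so that the result is guaranteed to be an integer vector, which makes the comparison with 5 safe.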

I hope this gives you an idea:

test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
          "she'd've been so happy cos with all this stuff goin' on",
          "but we're in this together, because y' know things happens",
          "so you can't, cos well, ah because you know why!",
          "not now because it's too late!"), stringsAsFactors = F)
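# Backslashes must be doubled inside R string literals
rx <- "^\\s*(?:\\S+\\s+){0,3}(?:cos|because)\\b.*(*SKIP)(*F)|(?:\\S+[\\s,]+){4}\\b(cos|because)\\b"
# test has a single column, so this returns a character vector
Turn <- test[grepl(rx, test$Turn, perl=TRUE),]
# Keep the text before the first cos/because and count its words
split <- strsplit(Turn, "\\b(cos|because)\\b")
Index <- sapply(split, function(x) lengths(strsplit(trimws(x[[1]]), "\\s+"))+1)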
test <- data.frame(Turn, Index, stringsAsFactors = F)
test

Output:

                                                        Turn Index
1                             Hi. I'm you an' you are me cos     8
2    she'd've been so happy cos with all this stuff goin' on     5
3 but we're in this together, because y' know things happens     6

See the R demo and the main regex demo.

Regex details:

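  • ^\s*(?:\S+\s+){0,3}(?:cos|because)\b.*(*SKIP)(*F) - matches the start of the string, then zero to three words, then cos or because as a whole word, then the rest of the string, and then skips (discards) that match
  • | - or
  • (?:\S+[\s,]+){4}\b(cos|because)\b - matches cos or because as a whole word preceded by four words.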
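
To see which rows the two alternatives accept or reject, the pattern can be tried on the bare strings. This is a small check, assuming the five turns from the question (turns and hits are illustrative names):

```r
turns <- c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!")

# Same pattern as above; perl = TRUE is required for (*SKIP)(*F)
rx <- "^\\s*(?:\\S+\\s+){0,3}(?:cos|because)\\b.*(*SKIP)(*F)|(?:\\S+[\\s,]+){4}\\b(cos|because)\\b"

hits <- grepl(rx, turns, perl = TRUE)
hits
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```

Rows 4 and 5 are consumed entirely by the first alternative, because their first cos or because comes after at most three words, so the second alternative never gets a chance to match the later because in row 4.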

A tidyverse-based solution looks like this.

library(dplyr)
library(purrr)
library(stringr)

test %>%
  mutate(index = map(str_split(Turn, ' '), 
                     ~ str_which(., 'cos|because')[1])) %>%
  filter(index >= 5)

#                                                         Turn index
# 1                             Hi. I'm you an' you are me cos     8
# 2    she'd've been so happy cos with all this stuff goin' on     5
# 3 but we're in this together, because y' know things happens     6