Subset dataframe based on list objects
I have a dataframe with speech data in the column Turn:
test <- data.frame(
Turn = c("Hi. I'm you an' you are me cos",
"she'd've been so happy cos with all this stuff goin' on",
"but we're in this together, because y' know things happens",
"so you can't, cos well, ah because you know why!",
"not now because it's too late!"), stringsAsFactors = F)
I'd like to subset the dataframe to those rows where at least four words precede cos and/or because. To that end, I compute the indices of cos and because in Turn:
test$Index <- sapply(strsplit(test$Turn, " "), function(x) which(x == 'cos'|x == 'because'))
test
Turn Index
1 Hi. I'm you an' you are me cos 8
2 she'd've been so happy cos with all this stuff goin' on 5
3 but we're in this together, because y' know things happens 6
4 so you can't, cos well, ah because you know why! 4, 7
5 not now because it's too late! 3
One row contains more than one index, so Index is a list column. That's why my attempt to subset like this fails:
test[test$Index >= 5,]
Error in `[.data.frame`(test, test$Index >= 5, ) :
(list) object cannot be coerced to type 'double'
How can I subset test while ignoring the second listed Index value?
Expected result:
test
Turn Index
1 Hi. I'm you an' you are me cos 8
2 she'd've been so happy cos with all this stuff goin' on 5
3 but we're in this together, because y' know things happens 6
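I'd be grateful for any answer, including ones that skip the detour via indices and subset with a regex pattern instead.
EDIT:
In the sapply paradigm the solution is quite simple: just select the first value of each list element:
sapply(test$Index, function(x) x[1])
[1] 8 5 6 4 3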
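To carry that first value through to the actual subsetting, one option is sketched below. It assumes the same test data as in the question; the names idx, first_idx, keep, and result are introduced here for illustration only. vapply is used in place of sapply so that the result is always an integer vector, and which() is used so that rows containing neither word (NA) are dropped.

```r
test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

# Positions of 'cos'/'because' per row; a list, because a row may have several hits
idx <- lapply(strsplit(test$Turn, " "),
              function(x) which(x %in% c("cos", "because")))

# First position per row (NA_integer_ when a row has no hit)
first_idx <- vapply(idx, function(x) x[1], integer(1))

# which() ignores NA, so rows without a hit are dropped instead of becoming NA rows
keep <- which(first_idx >= 5)
result <- data.frame(Turn = test$Turn[keep],
                     Index = first_idx[keep],
                     stringsAsFactors = FALSE)
result
```

Keeping the index as a plain integer vector rather than a list column is what makes the comparison first_idx >= 5 well defined.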
I hope this gives you an idea:
test <- data.frame(
Turn = c("Hi. I'm you an' you are me cos",
"she'd've been so happy cos with all this stuff goin' on",
"but we're in this together, because y' know things happens",
"so you can't, cos well, ah because you know why!",
"not now because it's too late!"), stringsAsFactors = F)
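# Backslashes must be doubled inside R string literals
rx <- "^\\s*(?:\\S+\\s+){0,3}(?:cos|because)\\b.*(*SKIP)(*F)|(?:\\S+[\\s,]+){4}\\b(cos|because)\\b"
# Single-column data frame, so the row subset drops to a character vector
Turn <- test[grepl(rx, test$Turn, perl=TRUE),]
# Text before the first cos/because, then count its words and add 1
split <- strsplit(Turn, "\\b(cos|because)\\b")
Index <- sapply(split, function(x) lengths(strsplit(trimws(x[[1]]), "\\s+"))+1)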
test <- data.frame(Turn, Index, stringsAsFactors = F)
test
Output:
Turn Index
1 Hi. I'm you an' you are me cos 8
2 she'd've been so happy cos with all this stuff goin' on 5
3 but we're in this together, because y' know things happens 6
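See the R demo and the main regex demo.
Regex details:
^\s*(?:\S+\s+){0,3}(?:cos|because)\b.*(*SKIP)(*F) - matches the start of the string, then zero to three words, then cos or because as a whole word, then the rest of the string, and then discards that match so the row is skipped
| - or
(?:\S+[\s,]+){4}\b(cos|because)\b - matches cos or because preceded by four words.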
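As a quick check of the two alternatives described above, the pattern can be applied to the sample strings with grepl alone. This is a minimal sketch: the vector name turns and the variable keep are made up for the example, and the pattern is written with doubled backslashes because it sits inside an R string literal. On this data it should keep the first three strings.

```r
turns <- c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!")

# First alternative: cos/because within the first four words -> skip the whole string
# Second alternative: cos/because preceded by four words -> match
rx <- "^\\s*(?:\\S+\\s+){0,3}(?:cos|because)\\b.*(*SKIP)(*F)|(?:\\S+[\\s,]+){4}\\b(cos|because)\\b"

# perl = TRUE is needed because (*SKIP)(*F) are PCRE backtracking verbs
keep <- grepl(rx, turns, perl = TRUE)
turns[keep]
```

The fourth string is rejected even though because has six words before it: the first alternative already consumed the string at the early cos, which is how the second hit gets ignored.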
A tidyverse-based solution could look as follows.
library(dplyr)
library(purrr)
library(stringr)
test %>%
mutate(index = map(str_split(Turn, ' '),
~ str_which(., 'cos|because')[1])) %>%
filter(index >= 5)
# Turn index
# 1 Hi. I'm you an' you are me cos 8
# 2 she'd've been so happy cos with all this stuff goin' on 5
# 3 but we're in this together, because y' know things happens 6
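A possible variation on the tidyverse approach, offered as a sketch and not as the answerer's code: map_int returns a plain integer column instead of a list column, and the anchored pattern ^(cos|because)$ matches only whole tokens, mirroring the x == 'cos' | x == 'because' test from the question. The name result is introduced here for illustration, and the dplyr, purrr, and stringr packages are assumed to be installed.

```r
library(dplyr)
library(purrr)
library(stringr)

test <- data.frame(
  Turn = c("Hi. I'm you an' you are me cos",
           "she'd've been so happy cos with all this stuff goin' on",
           "but we're in this together, because y' know things happens",
           "so you can't, cos well, ah because you know why!",
           "not now because it's too late!"),
  stringsAsFactors = FALSE)

result <- test %>%
  # Position of the first token that is exactly 'cos' or 'because' (NA if none)
  mutate(index = map_int(str_split(Turn, " "),
                         ~ str_which(.x, "^(cos|because)$")[1])) %>%
  # filter() drops rows where the condition is NA or FALSE
  filter(index >= 5)
result
```

With an unanchored pattern such as 'cos|because', a token like "cosmos" would also count as a hit, which is why the anchors are added here.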