dplyr: filtering a dataset by a value referenced from another dataset returns all rows or no rows

Question

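I'm having trouble filtering a dataset based on a value referenced from another dataset.

I have two datasets. The first, comparison_dt, lists every comparison I need to make, one row per pair of location 1 and location 2. The second, rain_values_dt, contains values collected from those locations at different times. My goal is, for each row of comparison_dt, to filter the rain_values_dt rows collected at location 1, filter the rain_values_dt rows collected at location 2, inner join the two sets of rows, run a paired t-test, and return the test statistic in a column appended to comparison_dt.

The problem is that I can't filter the rows of rain_values_dt by the location name referenced in comparison_dt. Filtering on the name stored in the first row of the comparison table returns all rows of rain_values_dt. Filtering on the name stored in a higher row number returns nothing. I only want to receive the rows for the site I reference in the filter.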


library(data.table)
library(dplyr)

comparison_dt <- data.table(
  location1= c('austin_tx','austin_tx','austin_tx','boston_ma','boston_ma','boston_ma','chicago_il','chicago_il','chicago_il'),
  location2= c('austin_tx','boston_ma','chicago_il','austin_tx','boston_ma','chicago_il','austin_tx','boston_ma','chicago_il'),
  test_statistic= NA_real_  # placeholder, to be filled with the paired t-test statistic
)

rain_values_dt <- data.table(
  location=c('austin_tx','austin_tx','austin_tx','boston_ma','boston_ma','boston_ma','chicago_il','chicago_il','chicago_il'),
  month=c('march','april','may','march','april','may','march','april','may'),
  rainfall=c(1:9)
)

row_n=1

#my intended result, works as expected v
dplyr::filter(rain_values_dt, location == 'austin_tx')

#is pulling the correct name from the comparison table to filter on
comparison_dt[row_n,'location1']

#these are equivalent to each other, so I should be able to substitute, right?
'austin_tx' == comparison_dt[row_n,'location1']

#does not work, returns all values instead of filtering
dplyr::filter(rain_values_dt, location == comparison_dt[row_n,'location1'])

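This is a simplification of a larger dataset in which not all site comparisons are valid, trials have to be matched on many different conditions, and the number of trials per site is uneven.

This worked as expected before. I restarted my R session and it no longer does.

Thinking that I might have imported the datasets differently, I tried changing the location names in the datasets to character or factor type. I tried referencing the location column as a vector and in quotes. I tried unloading and reloading dplyr, and checked whether R was using the base stats version of filter or the dplyr version. This seems like a simple problem, but I have searched this site and the filter() documentation without finding out why the function behaves this way.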

Answer 1

The object on the right-hand side of == is a data.table, not a character vector.

class(comparison_dt[row_n,'location1'])
[1] "data.table" "data.frame"

We need to extract the column as a vector, using $ or [[

dplyr::filter(rain_values_dt, location == 
            comparison_dt[row_n,'location1']$location1)
     location month rainfall
1: austin_tx march        1
2: austin_tx april        2
3: austin_tx   may        3
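The `[[` form mentioned above is not shown, so here is a minimal sketch of the equivalent extractions. The toy data and the names `target_a`, `target_b`, and `target_c` are assumptions made for illustration; only `$`, `[[`, and data.table's unquoted column in `j` are used.

```r
library(data.table)
library(dplyr)

# Toy data shaped like the question's tables (values are illustrative).
comparison_dt <- data.table(
  location1 = c('austin_tx', 'boston_ma', 'chicago_il'),
  location2 = c('boston_ma', 'chicago_il', 'austin_tx')
)
rain_values_dt <- data.table(
  location = rep(c('austin_tx', 'boston_ma', 'chicago_il'), each = 3),
  month    = rep(c('march', 'april', 'may'), times = 3),
  rainfall = 1:9
)

row_n <- 2

# Three ways to get a plain character value instead of a one-cell data.table.
target_a <- comparison_dt[['location1']][row_n]  # [[ then index
target_b <- comparison_dt$location1[row_n]       # $ then index
target_c <- comparison_dt[row_n, location1]      # unquoted column name in j

# All three hold the same length-one character vector.
stopifnot(identical(target_a, target_b), identical(target_a, target_c))

filtered <- dplyr::filter(rain_values_dt, location == target_a)
print(filtered)
```

Storing the value in a variable first also makes it easy to check its class before it reaches filter().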

unlist also produces a vector

dplyr::filter(rain_values_dt, location == 
            unlist(comparison_dt[row_n,'location1']))
    location month rainfall
1: austin_tx march        1
2: austin_tx april        2
3: austin_tx   may        3

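As for why we get all the rows of the dataset: the first element of 'location1' is 'austin_tx', which is also the first element of 'location' in 'rain_values_dt'. The == therefore returns a single TRUE, and that value is recycled across every row.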

comparison_dt[row_n,'location1']
location1
1: austin_tx

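Hypothetically, if the column held 'boston_ma' as its first element, the filter would return 0 rows, because the comparison with the first element of 'location' returns FALSE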

dplyr::filter(rain_values_dt, location == data.table(location1 = 'boston_ma'))
Empty data.table (0 rows and 3 cols): location,month,rainfall
dplyr::filter(rain_values_dt, location == comparison_dt[row_n,'location1'])
     location month rainfall
1:  austin_tx march        1
2:  austin_tx april        2
3:  austin_tx   may        3
4:  boston_ma march        4
5:  boston_ma april        5
6:  boston_ma   may        6
7: chicago_il march        7
8: chicago_il april        8
9: chicago_il   may        9

This becomes clearer if we take the expression out of filter: the output is a single TRUE/FALSE value, which is then recycled

rain_values_dt$location == data.table(location1 = 'boston_ma')
     location1
[1,]     FALSE
rain_values_dt$location == comparison_dt[row_n,'location1']
     location1
[1,]      TRUE
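As a small sketch of the same check, the two forms of the comparison can be put side by side and their lengths compared. The toy data and the names `cmp_dt` and `cmp_vec` are assumptions made for illustration.

```r
library(data.table)

# Toy data shaped like the question's tables (values are illustrative).
rain_values_dt <- data.table(
  location = rep(c('austin_tx', 'boston_ma', 'chicago_il'), each = 3),
  rainfall = 1:9
)
comparison_dt <- data.table(location1 = c('austin_tx', 'boston_ma'))

# Right-hand side is a one-row, one-column data.table:
# a single value comes back, not one value per row of rain_values_dt.
cmp_dt <- rain_values_dt$location == comparison_dt[2, 'location1']
length(cmp_dt)

# Right-hand side is the extracted character value:
# one TRUE/FALSE per row of rain_values_dt.
cmp_vec <- rain_values_dt$location == comparison_dt[2, 'location1']$location1
length(cmp_vec)
sum(cmp_vec)  # number of rows that would survive the filter
```

Checking the length of the condition against nrow(rain_values_dt) is a quick way to spot this kind of mismatch before filtering.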

For a data.frame, data.table, or tibble, the unit is the column, so the length of comparison_dt[, 'location1'] is 1. If we take more rows from 'comparison_dt', the elementwise comparison becomes more apparent

rain_values_dt$location == comparison_dt[3:5,'location1']
     location1
[1,]      TRUE
[2,]     FALSE
[3,]     FALSE

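That is, the first element is TRUE because the first element of 'location' in 'rain_values_dt' is compared with the third element of comparison_dt$location1, and both are 'austin_tx'. The next element is FALSE because 'boston_ma' is compared with the second element of rain_values_dt$location, which is again 'austin_tx'.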
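With the extraction fixed, the workflow described in the question (filter both locations, inner join, paired t-test, append the statistic) could be sketched as follows. This is only one possible approach under stated assumptions: the rainfall values are made up, chosen so that the paired differences are not constant, since t.test() stops with an error on constant differences. The helper name `paired_t` and the choice to join on `month` are also hypothetical.

```r
library(data.table)
library(dplyr)

# Hypothetical rainfall values (not the question's 1:9).
rain_values_dt <- data.table(
  location = rep(c('austin_tx', 'boston_ma', 'chicago_il'), each = 3),
  month    = rep(c('march', 'april', 'may'), times = 3),
  rainfall = c(1.2, 2.5, 3.1, 4.0, 4.4, 6.3, 7.9, 8.2, 8.8)
)

# Only comparisons between different locations are listed here.
comparison_dt <- data.table(
  location1 = c('austin_tx', 'austin_tx', 'boston_ma'),
  location2 = c('boston_ma', 'chicago_il', 'chicago_il')
)

# Hypothetical helper: paired t statistic for one pair of locations.
# loc1 and loc2 are plain character values, so == compares row by row.
paired_t <- function(loc1, loc2) {
  x <- dplyr::filter(rain_values_dt, location == loc1)
  y <- dplyr::filter(rain_values_dt, location == loc2)
  # Pair the observations by month; shared columns get the suffixes _1 and _2.
  both <- dplyr::inner_join(x, y, by = 'month', suffix = c('_1', '_2'))
  unname(t.test(both$rainfall_1, both$rainfall_2, paired = TRUE)$statistic)
}

# Apply the helper to every row and store the result as a new column.
comparison_dt[, test_statistic := mapply(paired_t, location1, location2,
                                         USE.NAMES = FALSE)]
print(comparison_dt)
```

Because mapply() passes the columns element by element, each call to the helper receives length-one character values, which sidesteps the data.table-versus-vector comparison described above.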
