How to do a dplyr inner_join col1 > col2

Question

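I'm having trouble getting dplyr joins to work when the join condition is not the standard "col1" = "col2". Here are two examples of what I've run into.

First: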

library(dplyr)

tableA <- data.frame(col1= c("a","b","c","d"),
                     col2 = c(1,2,3,4))

inner_join(tableA, tableA, by = c("col1"!="col1")) %>% 
  select(col1, col2.x) %>% 
  arrange(col1, col2.x)

Error: by must be a (named) character vector, list, or NULL for natural joins (not recommended in production code), not logical

When I replicate this code using SQL, I get the following:

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

copy_to(con, tableA)

tbl(con, sql("select a.col1, b.col2
              from 
              tableA as a
              inner join 
              tableA as b
              on a.col1 <> b.col1")) %>% 
  arrange(col1, col2)

Result of the SQL query:

# Source:     SQL [?? x 2]
# Database:   sqlite 3.19.3 [:memory:]
# Ordered by: col1, col2
     col1  col2
     <chr> <dbl>
 1     a     2
 2     a     3
 3     a     4
 4     b     1
 5     b     3
 6     b     4
 7     c     1
 8     c     2
 9     c     4
10     d     1
# ... with more rows

The second case is similar to the first:

inner_join(tableA, tableA, by = c("col1" > "col1")) %>% 
   select(col1, col2.x) %>% 
   arrange(col1, col2.x)

Error: by must be a (named) character vector, list, or NULL for natural joins (not recommended in production code), not logical

The SQL equivalent:

tbl(con, sql("select a.col1, b.col2
              from tableA as a
              inner join tableA as b
              on a.col1 > b.col1")) %>% 
   arrange(col1, col2)

Result of the second SQL query:

# Source:     SQL [?? x 2]
# Database:   sqlite 3.19.3 [:memory:]
# Ordered by: col1, col2
   col1  col2
  <chr> <dbl>
1     b     1
2     c     1
3     c     2
4     d     1
5     d     2
6     d     3

Does anyone know how to reproduce these SQL examples using dplyr code?

Answer 1

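A solution using dplyr and tidyr, reproducing the second (col1 > col1) result. The idea is to expand the data frame and then join it back to the original data frame. After that, use fill from tidyr (with .direction = "up") to fill each NA from the record below it. Finally, filter out the records whose values are equal, along with any remaining NA.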

library(dplyr)
library(tidyr)

tableB <- tableA %>%
  complete(col1, col2) %>%
  left_join(tableA %>% mutate(col3 = col2), by = c("col1", "col2")) %>%
  group_by(col1) %>%
  fill(col3, .direction = "up") %>%
  filter(col2 != col3, !is.na(col3)) %>%
  select(-col3) %>%
  ungroup()
tableB
# # A tibble: 6 x 2
#    col1  col2
#   <chr> <dbl>
# 1     b     1
# 2     c     1
# 3     c     2
# 4     d     1
# 5     d     2
# 6     d     3

Data

tableA <- data.frame(col1= c("a","b","c","d"),
                     col2 = c(1,2,3,4), stringsAsFactors = FALSE)
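
To make the expand / join / fill idea easier to follow, here is a minimal sketch that runs the same pipeline one step at a time. The intermediate names (expanded, joined, filled) are only illustrative and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

tableA <- data.frame(col1 = c("a", "b", "c", "d"),
                     col2 = c(1, 2, 3, 4), stringsAsFactors = FALSE)

# Step 1: every col1/col2 combination (4 x 4 = 16 rows)
expanded <- tableA %>% complete(col1, col2)

# Step 2: col3 is non-NA only on the rows that actually exist in tableA
joined <- expanded %>%
  left_join(tableA %>% mutate(col3 = col2), by = c("col1", "col2"))

# Step 3: within each col1 group, copy that group's own col2 upward
filled <- joined %>%
  group_by(col1) %>%
  fill(col3, .direction = "up") %>%
  ungroup()

# Inspect one group: rows with col2 below the group's own value get col3 filled
filter(filled, col1 == "c")
```

Note that the final filter compares col2 values, not col1 values, so this approach matches the SQL result only because col2 increases in the same order as col1 in this data.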

Answer 2

For your first case:

library(dplyr)
library(tidyr)

expand(tableA, col1, col2) %>% 
  left_join(tableA, by = 'col1') %>% 
  filter(col2.x != col2.y) %>% 
  select(col1, col2 = col2.x)

Result:

# A tibble: 12 x 2
     col1  col2
   <fctr> <dbl>
 1      a     2
 2      a     3
 3      a     4
 4      b     1
 5      b     3
 6      b     4
 7      c     1
 8      c     2
 9      c     4
10      d     1
11      d     2
12      d     3
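
If a recent dplyr is available, the first case can probably be written more directly as a cross join followed by a filter on the key column itself. This is a sketch that assumes dplyr 1.1.0 or later (where cross_join() was added); the name res_neq is just illustrative.

```r
library(dplyr)

tableA <- data.frame(col1 = c("a", "b", "c", "d"),
                     col2 = c(1, 2, 3, 4), stringsAsFactors = FALSE)

# cross_join() pairs every row of x with every row of y;
# columns present in both inputs get the default .x / .y suffixes.
res_neq <- cross_join(tableA, tableA) %>%
  filter(col1.x != col1.y) %>%               # SQL: on a.col1 <> b.col1
  select(col1 = col1.x, col2 = col2.y) %>%   # SQL: select a.col1, b.col2
  arrange(col1, col2)

res_neq
```

Unlike the comparison on col2 above, this filters on col1 directly, so it does not depend on col2 being unique per col1. The cross join materializes every pair of rows first, which may be costly on large tables.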

For your second case:

expand(tableA, col1, col2) %>% 
  left_join(tableA, by = 'col1') %>% 
  filter(col2.x < col2.y) %>% 
  select(col1, col2 = col2.x)

Result:

# A tibble: 6 x 2
    col1  col2
  <fctr> <dbl>
1      b     1
2      c     1
3      c     2
4      d     1
5      d     2
6      d     3
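
For the inequality case, newer dplyr versions accept a join specification built with join_by(), which may be closer to the SQL. The sketch below assumes dplyr 1.1.0 or later; res_gt is an illustrative name. join_by() supports >, >=, < and <= but, as far as I know, not !=, so it does not cover the first case.

```r
library(dplyr)

tableA <- data.frame(col1 = c("a", "b", "c", "d"),
                     col2 = c(1, 2, 3, 4), stringsAsFactors = FALSE)

# x$ and y$ make explicit which input each side of the condition refers to.
# For inequality joins both key columns are kept, so they get .x / .y suffixes.
res_gt <- inner_join(tableA, tableA, by = join_by(x$col1 > y$col1)) %>%
  select(col1 = col1.x, col2 = col2.y) %>%   # SQL: select a.col1, b.col2
  arrange(col1, col2)

res_gt
```

On this data it should give the same six rows as the SQL query. String comparison order in R can differ from a database's collation, so it is worth checking the results on real keys.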

r

dplyr

dbplyr