How can I find the length of a column in SparkR?

Question

I am converting plain R code to SparkR to make efficient use of Spark.

I have the column CloseDate below.

CloseDate
2011-01-08
2011-02-07
2012-04-07
2013-04-18
2011-02-07
2010-11-10
2010-12-09
2013-02-18
2010-12-09
2011-03-11
2011-04-10
2013-06-19
2011-04-10
2011-01-06
2011-02-06
2013-04-16
2011-02-06
2015-09-25
2015-09-25
2010-11-10

I want to count the number of times the date increases or decreases from one row to the next. I have the R code below to do that.

dateChange <- function(closeDate, dir){
  close_dt <- as.Date(closeDate)
  num_closedt_out = 0
  num_closedt_in = 0

  for (j in seq_along(close_dt))  # seq_along() also handles an empty vector
  {
    curr <- close_dt[j]
    if (j > 1)
      prev <- close_dt[j-1]
    else 
      prev <- curr
    if (curr > prev){
      num_closedt_out = num_closedt_out + 1
    }
    else if (curr < prev){
      num_closedt_in = num_closedt_in + 1
    }
  }
  if (dir=="inc")
    ret <- num_closedt_out
  else if (dir=="dec")
    ret <- num_closedt_in
  ret
}

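I tried to use the SparkR df$col here. Because Spark evaluates the code lazily, I do not get the value of the length during execution, and I get a NaN error.

Here is the modified code I tried.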

DateDirChanges <- function(closeDate, dir){
  close_dt <- to_date(closeDate)
  num_closedt_out = 0
  num_closedt_in = 0

  col_len <- SparkR::count(close_dt)
  for(j in 1:col_len) 
  {
    curr <- close_dt[j]
    if (j > 1)
      prev <- close_dt[j-1]
    else 
      prev <- curr
    if (curr > prev){
      num_closedt_out = num_closedt_out + 1
    }
    else if (curr < prev){
      num_closedt_in = num_closedt_in + 1
    }
  }
  if (dir=="inc")
    ret <- num_closedt_out
  else if (dir=="dec")
    ret <- num_closedt_in
  ret
}

How can I get the length of the column while this code is executing? Or is there a better way to do this?

Answer 1

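You cannot, because a Column simply has no length. Unlike what you might expect in R, columns do not represent data; they represent SQL expressions and specific data transformations. In addition, the order of values in a Spark DataFrame is arbitrary, so you cannot simply look at the neighboring rows.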
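
To make the point about columns concrete, here is a minimal sketch. It assumes a local Spark installation and the SparkR package; the three-row data frame is made up for illustration. It shows that to_date(df$CloseDate) yields a Column object rather than a vector of dates, and that a row count is a property of the DataFrame, not of the column.

```r
library(SparkR)
sparkR.session(master = "local[1]")  # assumption: a local Spark install is available

# Illustrative data only
local_df <- data.frame(
  CloseDate = c("2011-01-08", "2011-02-07", "2012-04-07"),
  stringsAsFactors = FALSE
)
df <- createDataFrame(local_df)

# A Column only describes a computation; it holds no values and has no length
close_dt <- to_date(df$CloseDate)
is_column <- is(close_dt, "Column")

# Counting rows is an action on the DataFrame and triggers a Spark job
n_rows <- count(df)
```

Note that calling count() on a Column does not return a number either: it builds another Column (an aggregate expression), which is why it cannot be used as the upper bound of a for loop.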

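If the data can be partitioned as in your previous question, you can use window functions the same way I showed there. Otherwise there is no efficient way to handle this using SparkR alone.

Assuming there is a way to determine the order (required) and you can partition the data (desirable for reasonable performance), all you need is something like this: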

SELECT
   CAST(LAG(CloseDate, 1) OVER w > CloseDate AS INT) gt,
   CAST(LAG(CloseDate, 1) OVER w < CloseDate AS INT) lt,
   CAST(LAG(CloseDate, 1) OVER w = CloseDate AS INT) eq
FROM DF
WINDOW w AS (
  PARTITION BY partition_col ORDER BY order_col
)
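
As a rough sketch of how the query above could be run from SparkR and reduced to the two counts the question asks for: the columns grp and idx below are hypothetical stand-ins for partition_col and order_col (the original data has neither), and a local Spark installation is assumed. The comparison is NULL for the first row of each partition, and SUM ignores NULLs.

```r
library(SparkR)
sparkR.session(master = "local[1]")  # assumption: a local Spark install is available

# The dates from the question; grp and idx are hypothetical helper columns
local_df <- data.frame(
  grp = 1L,    # stand-in for partition_col (a single partition here)
  idx = 1:20,  # stand-in for order_col (the original row order)
  CloseDate = as.Date(c(
    "2011-01-08", "2011-02-07", "2012-04-07", "2013-04-18", "2011-02-07",
    "2010-11-10", "2010-12-09", "2013-02-18", "2010-12-09", "2011-03-11",
    "2011-04-10", "2013-06-19", "2011-04-10", "2011-01-06", "2011-02-06",
    "2013-04-16", "2011-02-06", "2015-09-25", "2015-09-25", "2010-11-10"
  ))
)
df <- createDataFrame(local_df)
createOrReplaceTempView(df, "DF")

# Compare each date with the previous one in the window, then sum the 0/1 flags
changes <- sql("
  SELECT
    SUM(CAST(prev < CloseDate AS INT)) AS n_inc,
    SUM(CAST(prev > CloseDate AS INT)) AS n_dec
  FROM (
    SELECT CloseDate, LAG(CloseDate, 1) OVER w AS prev
    FROM DF
    WINDOW w AS (PARTITION BY grp ORDER BY idx)
  ) t
")
result <- collect(changes)
```

With this ordering, result$n_inc should be 11 and result$n_dec should be 7. Putting everything into a single partition, as done here for brevity, moves all rows to one executor, so on real data a meaningful partitioning column matters for performance.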
