How can I convert GroupedData into a DataFrame in SparkR?
Suppose I have the following data frame:
AccountId,CloseDate
1,2015-05-07
2,2015-05-09
3,2015-05-01
4,2015-05-07
1,2015-05-09
1,2015-05-12
2,2015-05-12
3,2015-05-01
3,2015-05-01
3,2015-05-02
4,2015-05-17
1,2015-05-12
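I want to group it by AccountId and then add another column named date_diff that holds the difference in CloseDate between the current row and the previous row. Note that date_diff should be computed only between rows that share the same AccountId, so I need to group the data before adding the column.
Below is the R code I am using: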
df <- read.df(sqlContext, "/home/ubuntu/work/csv/sample.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
df$CloseDate <- to_date(df$CloseDate)
groupedData <- SparkR::group_by(df, df$AccountId)
SparkR::mutate(groupedData, DiffCloseDt = as.numeric(SparkR::datediff((CloseDate),(SparkR::lag(CloseDate,1)))))
I am using mutate to add the new column. However, because group_by returns a GroupedData object, I cannot use mutate here. I get the following error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mutate’ for signature ‘"GroupedData"’
So how can I convert GroupedData into a DataFrame so that I can use mutate to add the column?
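You cannot achieve what you want with group_by. As has already been explained several times on SO, group_by on a DataFrame does not physically group the data. Moreover, the order of operations after applying group_by is nondeterministic.
To get the desired output, you have to use a window function and provide an explicit ordering: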
df <- structure(list(AccountId = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L,
3L, 3L, 4L, 1L), CloseDate = structure(c(3L, 4L, 1L, 3L, 4L,
5L, 5L, 1L, 1L, 2L, 6L, 5L), .Label = c("2015-05-01", "2015-05-02",
"2015-05-07", "2015-05-09", "2015-05-12", "2015-05-17"), class = "factor")),
.Names = c("AccountId", "CloseDate"),
class = "data.frame", row.names = c(NA, -12L))
library(magrittr)  # provides the %>% pipe used at the end; SparkR does not export it
hiveContext <- sparkRHive.init(sc)
sdf <- createDataFrame(hiveContext, df)
registerTempTable(sdf, "df")
query <- "SELECT *, LAG(CloseDate, 1) OVER (
PARTITION BY AccountId ORDER BY CloseDate
) AS DateLag FROM df"
dfWithLag <- sql(hiveContext, query)
withColumn(dfWithLag, "diff", datediff(dfWithLag$CloseDate, dfWithLag$DateLag)) %>%
head()
##   AccountId  CloseDate    DateLag diff
## 1         1 2015-05-07       <NA>   NA
## 2         1 2015-05-09 2015-05-07    2
## 3         1 2015-05-12 2015-05-09    3
## 4         1 2015-05-12 2015-05-12    0
## 5         2 2015-05-09       <NA>   NA
## 6         2 2015-05-12 2015-05-09    3
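For a quick local sanity check of what the window query computes, the same per-account lag can be sketched in base R without Spark. This is only an illustration on the sample data from the question, not a replacement for the SparkR approach; it assumes the data fits in memory, and the use of `ave` here is my own choice rather than anything from the original answer.

```r
# Local (non-Spark) sketch: day difference to the previous row within each account.
df <- data.frame(
  AccountId = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L, 3L, 3L, 4L, 1L),
  CloseDate = as.Date(c("2015-05-07", "2015-05-09", "2015-05-01", "2015-05-07",
                        "2015-05-09", "2015-05-12", "2015-05-12", "2015-05-01",
                        "2015-05-01", "2015-05-02", "2015-05-17", "2015-05-12"))
)

# Explicit ordering, mirroring PARTITION BY AccountId ORDER BY CloseDate.
df <- df[order(df$AccountId, df$CloseDate), ]
rownames(df) <- NULL

# Within each account, the first row has no predecessor (NA);
# every other row gets the number of days since the previous CloseDate.
df$diff <- ave(as.numeric(df$CloseDate), df$AccountId,
               FUN = function(x) c(NA, diff(x)))

head(df)
```

The first six diff values should agree with the SparkR output above (NA, 2, 3, 0, NA, 3). As in the Spark version, the result is only well defined because the rows are sorted explicitly before the lag is taken.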