Dask equivalent of pyspark lead and lag function
Is it possible to get results from a dask dataframe similar to what the lag or lead window functions produce in pyspark? I want to transform the following dataframe
+-------+
| value |
+-------+
|     1 |
|     2 |
|     3 |
+-------+
into this:
+-------+------------+------------+
| value | prev_value | next_value |
+-------+------------+------------+
|     1 | NaN        | 2          |
|     2 | 1          | 3          |
|     3 | 2          | NaN        |
+-------+------------+------------+
Dask dataframes simply mirror the pandas interface. In this case, the method you want is shift:
In [3]: import pandas as pd
In [4]: df = pd.DataFrame({'a': range(5)})
In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=2)
In [7]: out = ddf.assign(prev_a=ddf.a.shift(1), next_a=ddf.a.shift(-1))
In [8]: out.compute()
Out[8]:
a prev_a next_a
0 0 NaN 1.0
1 1 0.0 2.0
2 2 1.0 3.0
3 3 2.0 4.0
4 4 3.0 NaN
However, if you're trying to align rows for some kind of windowed or rolling computation, you may be more interested in map_overlap, which will be more performant.