Shrink every two consecutive rows of a DataFrame and replace the resulting column value with the average of the two rows
I need to shrink every two consecutive rows of my DataFrame and replace the Marks column with the average of the two rows' marks within each category. I am using PySpark 2.4.4 on Azure Databricks. Any idea how I can achieve this? My sample DataFrame looks like this:
+----------+-------+--------+
| Category | Quiz | Marks |
+----------+-------+--------+
| A | 1 | 10 |
| A | 2 | 20 |
| A | 3 | 30 |
| A | 4 | 40 |
| B | 1 | 4 |
| B | 2 | 2 |
| B | 3 | 6 |
| B | 4 | 8 |
+----------+-------+--------+
I need my resulting DataFrame to look like this:
+----------+-------+--------+
| Category | Quiz | Marks |
+----------+-------+--------+
| A | 1 | 15 |
| A | 2 | 35 |
| B | 1 | 3 |
| B | 2 | 7 |
+----------+-------+--------+
In general I have 10K categories, and 300 quizzes with marks for each category.
Import the libraries
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import functions as F

sc = SparkContext('local')
spark = SparkSession(sc)
Create your DataFrame
category = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
quiz = [1, 2, 3, 4, 1, 2, 3, 4]
marks = [10, 20, 30, 40, 4, 2, 6, 8]
df = spark.createDataFrame(zip(category, quiz, marks), schema=['Category', 'Quiz', 'Marks'])
df.show()
+--------+----+-----+
|Category|Quiz|Marks|
+--------+----+-----+
| A| 1| 10|
| A| 2| 20|
| A| 3| 30|
| A| 4| 40|
| B| 1| 4|
| B| 2| 2|
| B| 3| 6|
| B| 4| 8|
+--------+----+-----+
Reduce each even quiz number by one, so that every consecutive pair of quizzes shares the same key
df2 = df.withColumn(
    "Quiz2",
    F.when(
        df['Quiz'] % 2 == 0,
        df['Quiz'] - 1  # even quiz numbers map to the preceding odd number
    ).otherwise(df['Quiz'])  # odd quiz numbers stay as they are
)
df2.show()
+--------+----+-----+-----+
|Category|Quiz|Marks|Quiz2|
+--------+----+-----+-----+
| A| 1| 10| 1|
| A| 2| 20| 1|
| A| 3| 30| 3|
| A| 4| 40| 3|
| B| 1| 4| 1|
| B| 2| 2| 1|
| B| 3| 6| 3|
| B| 4| 8| 3|
+--------+----+-----+-----+
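As a side note, the pair key can also be derived arithmetically instead of with when/otherwise; a minimal sketch (assuming quiz numbers start at 1 and increase by 1, and where df2_alt is a hypothetical name) that also yields the sequential 1, 2, ... numbering the expected output uses:

# (Quiz - 1) // 2 + 1 maps quizzes 1,2 -> key 1 and quizzes 3,4 -> key 2
df2_alt = df.withColumn(
    "Quiz2",
    (F.floor((F.col("Quiz") - 1) / 2) + 1).cast("int")
)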
Compute your final result
df_final = df2.groupBy(['Category', 'Quiz2']).agg({'Marks': 'mean'})
df_final.show()
+--------+-----+----------+
|Category|Quiz2|avg(Marks)|
+--------+-----+----------+
| A| 3| 35.0|
| B| 1| 3.0|
| A| 1| 15.0|
| B| 3| 7.0|
+--------+-----+----------+
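The grouped result keeps the odd pair keys (1, 3, ...) and the auto-generated column name avg(Marks). To match the expected output exactly, a short follow-up sketch (df_result is a hypothetical name) renumbers the keys and renames the columns:

df_result = (
    df_final
    .withColumn("Quiz", ((F.col("Quiz2") + 1) / 2).cast("int"))  # 1, 3, ... -> 1, 2, ...
    .withColumnRenamed("avg(Marks)", "Marks")
    .select("Category", "Quiz", "Marks")
    .orderBy("Category", "Quiz")
)
df_result.show()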