How to apply groupby condition and get all the columns in the result?
My dataframe looks like
+------+-------+-----+----+-----+
| Title| Status|Suite|ID  |Time |
+------+-------+-----+----+-----+
|KIM   | Passed|ABC  |123 |20   |
|KJT   | Passed|ABC  |123 |10   |
|ZXD   | Passed|CDF  |123 |15   |
|XCV   | Passed|GHY  |113 |36   |
|KJM   | Passed|RTH  |456 |45   |
|KIM   | Passed|ABC  |115 |47   |
|JY    | Passed|JHJK |8963|74   |
|KJH   | Passed|SNMP |256 |47   |
|KJH   | Passed|ABC  |123 |78   |
|LOK   | Passed|GHY  |456 |96   |
|LIM   | Passed|RTH  |113 |78   |
|MKN   | Passed|ABC  |115 |74   |
|KJM   | Passed|GHY  |8963|74   |
+------+-------+-----+----+-----+
and can be created with
df = sqlCtx.createDataFrame(
    [
        ('KIM', 'Passed', 'ABC', '123', 20),
        ('KJT', 'Passed', 'ABC', '123', 10),
        ('ZXD', 'Passed', 'CDF', '123', 15),
        ('XCV', 'Passed', 'GHY', '113', 36),
        ('KJM', 'Passed', 'RTH', '456', 45),
        ('KIM', 'Passed', 'ABC', '115', 47),
        ('JY', 'Passed', 'JHJK', '8963', 74),
        ('KJH', 'Passed', 'SNMP', '256', 47),
        ('KJH', 'Passed', 'ABC', '123', 78),
        ('LOK', 'Passed', 'GHY', '456', 96),
        ('LIM', 'Passed', 'RTH', '113', 78),
        ('MKN', 'Passed', 'ABC', '115', 74),
        ('KJM', 'Passed', 'GHY', '8963', 74),
    ], ('Title', 'Status', 'Suite', 'ID', 'Time')
)
I need to apply a group by on ID and an aggregation on Time, and in the result I also need Title, Status & Suite along with the ID.
My output should look like
+------+-------+-----+----+-----+
| Title| Status|Suite|ID  |Time |
+------+-------+-----+----+-----+
|KIM   | Passed|ABC  |123 |30.75|
|XCV   | Passed|GHY  |113 |57   |
|KJM   | Passed|RTH  |456 |70.5 |
|KIM   | Passed|ABC  |115 |60.5 |
|JY    | Passed|JHJK |8963|74   |
|KJH   | Passed|SNMP |256 |47   |
+------+-------+-----+----+-----+
I have tried the code below, but it only gives me the ID values (with the aggregated Time) in the result:
from pyspark.sql.functions import mean

df.groupBy("ID").agg(mean("Time").alias("Time"))
With the modified expected output, you can take an arbitrary value for the remaining columns with first:
from pyspark.sql.functions import avg, first

df.groupBy("id").agg(
    first("Title"), first("Status"), first("Suite"), avg("Time")
).toDF("id", "Title", "Status", "Suite", "Time").show()
# +----+-----+------+-----+-----+
# | id|Title|Status|Suite| Time|
# +----+-----+------+-----+-----+
# | 113| XCV|Passed| GHY| 57.0|
# | 256| KJH|Passed| SNMP| 47.0|
# | 456| KJM|Passed| RTH| 70.5|
# | 115| KIM|Passed| ABC| 60.5|
# |8963| JY|Passed| JHJK| 74.0|
# | 123| KIM|Passed| ABC|30.75|
# +----+-----+------+-----+-----+
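If you'd rather not rely on first picking an arbitrary value inside the aggregation, an alternative sketch (my own illustration, not part of the original answer) is to aggregate Time per ID separately and join it back to one representative row per ID:

from pyspark.sql.functions import avg

# Average Time per ID, computed on its own.
agg_df = df.groupBy("ID").agg(avg("Time").alias("Time"))

# Keep one arbitrary row per ID for the remaining columns, then attach the average.
df.drop("Time").dropDuplicates(["ID"]).join(agg_df, on="ID").show()

The join on ID restores all columns, while dropDuplicates still chooses an arbitrary Title/Status/Suite per group, just like first does.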
Original answer
It looks like you want drop_duplicates:
df.drop_duplicates(subset=["ID"]).show()
# +-----+------+-----+----+
# |Title|Status|Suite| ID|
# +-----+------+-----+----+
# | XCV|Passed| GHY| 113|
# | KJH|Passed| SNMP| 256|
# | KJM|Passed| RTH| 456|
# | KIM|Passed| ABC| 115|
# | JY|Passed| JHJK|8963|
# | KIM|Passed| ABC| 123|
# +-----+------+-----+----+
If you want to use a specific row, please refer to
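For instance, if "specific row" means the row with the largest Time in each ID group, a window-function sketch (my own assumption about which row you want, not part of the answer above) could look like this:

from pyspark.sql import Window
from pyspark.sql.functions import avg, col, row_number

# Rank rows within each ID by Time (largest first) and compute the per-ID average.
w_order = Window.partitionBy("ID").orderBy(col("Time").desc())
w_all = Window.partitionBy("ID")

(df.withColumn("rn", row_number().over(w_order))
   .withColumn("avg_time", avg("Time").over(w_all))
   .filter(col("rn") == 1)   # keep the row with the largest Time per ID
   .drop("rn", "Time")
   .withColumnRenamed("avg_time", "Time")
   .show())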