Filter rows by distinct values in one column in PySpark
Suppose I have the following table:
+--------------------+--------------------+------+------------+--------------------+
| host| path|status|content_size| time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...| 404| 0|1995-08-01 00:07:...|
| tia1.eskimo.com |/pub/winvn/releas...| 404| 0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...| 404| 0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history/history.htm| 404| 0|1995-08-01 01:04:...|
| ras38.srv.net |/elv/DELTA/uncons...| 404| 0|1995-08-01 01:05:...|
| cs1-06.leh.ptd.net | | 404| 0|1995-08-01 01:17:...|
|dialip-24.athenet...|/history/apollo/a...| 404| 0|1995-08-01 01:33:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:35:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:36:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:36:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:36:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:36:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:36:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:36:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:37:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:37:...|
| h96-158.ccnet.com |/history/apollo/a...| 404| 0|1995-08-01 01:37:...|
|hsccs_gatorbox07....|/pub/winvn/releas...| 404| 0|1995-08-01 01:44:...|
|www-b2.proxy.aol....|/pub/winvn/readme...| 404| 0|1995-08-01 01:48:...|
|www-b2.proxy.aol....|/pub/winvn/releas...| 404| 0|1995-08-01 01:48:...|
+--------------------+--------------------+------+------------+--------------------+
How can I filter this table in PySpark so that it contains only the distinct paths? The table should still include all of the columns.
If you want to keep only one row per distinct value in a particular column, call the dropDuplicates method on the DataFrame.
In my case it looked like this:
dataFrame = ...
dataFrame.dropDuplicates(['path'])
where path is the column name.
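A minimal, self-contained sketch of this, assuming the column names from the table above (the sample rows are invented for illustration; note that which of the duplicate rows dropDuplicates keeps is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy rows with the same columns as the table above (values made up).
rows = [
    ("h96-158.ccnet.com", "/history/history.htm", 404, 0, "1995-08-01 01:35:41"),
    ("h96-158.ccnet.com", "/history/history.htm", 404, 0, "1995-08-01 01:36:02"),
    ("tia1.eskimo.com", "/pub/winvn/readme.txt", 404, 0, "1995-08-01 00:28:11"),
]
df = spark.createDataFrame(rows, ["host", "path", "status", "content_size", "time"])

# One row per distinct path; all columns are preserved.
df.dropDuplicates(["path"]).show()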
As for controlling which records are kept and which are dropped: if you can express your condition as a Window expression, you can use something like the following. This is Scala (more or less), but I imagine you can do it in PySpark too.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window.partitionBy('columns, 'to, 'make, 'unique).orderBy('conditionToPutRowToKeepFirst)
dataframe.withColumn("row_number", row_number().over(window)).where('row_number === 1).drop("row_number")
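For reference, a sketch of the same Window idea in PySpark; partitioning on path and ordering on time is just one example choice based on the table above (it keeps, for each distinct path, the row with the earliest time):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number the rows within each path group, earliest time first,
# then keep only the first row of each group.
window = Window.partitionBy("path").orderBy("time")
result = (dataFrame
          .withColumn("row_number", row_number().over(window))
          .where(col("row_number") == 1)
          .drop("row_number"))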