How to change values in a PySpark dataframe based on a condition of that same column?
Consider an example dataframe:
df =
+-------+-----+
| tech|state|
+-------+-----+
| 70|wa |
| 50|mn |
| 20|fl |
| 50|mo |
| 10|ar |
| 90|wi |
| 30|al |
| 50|ca |
+-------+-----+
I want to change the 'tech' column so that any value of 50 becomes 1 and all other values become 0.
The output would look like this:
df =
+-------+-----+
| tech|state|
+-------+-----+
| 0 |wa |
| 1 |mn |
| 0 |fl |
| 1 |mo |
| 0 |ar |
| 0 |wi |
| 0 |al |
| 1 |ca |
+-------+-----+
Here is what I have so far:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType
changing_column = 'tech'
udf_first = UserDefinedFunction(lambda x: 1, IntegerType())
udf_second = UserDefinedFunction(lambda x: 0, IntegerType())
first_df = zero_df.select(*[udf_first(changing_column) if column == 50 else column for column in zero_df])
second_df = first_df.select(*[udf_second(changing_column) if column != 50 else column for column in first_df])
second_df.show()
Hope this helps:
from pyspark.sql.functions import when

df = spark.createDataFrame(
    [(70, 'wa'),
     (50, 'mn'),
     (20, 'fl')],
    ["tech", "state"])

(df
 .select("*", when(df.tech == 50, 1)
              .otherwise(0)
              .alias("tech"))
 .show())
+----+-----+----+
|tech|state|tech|
+----+-----+----+
| 70| wa| 0|
| 50| mn| 1|
| 20| fl| 0|
+----+-----+----+