Zero replacement for missing values in a JSON file with PySpark
The JSON looks like this:
{
"ThresholdTime": "48min",
"FallTime": "Min",
"description": "PowerAmplifier"
}
{
"ThresholdTime": "min",
"FallTime": "200min",
"description": "DolbyDigitall"
}
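I am using regexp_extract to strip the alphabetic characters from the alphanumeric strings:
df.withColumn("NewThresholdTime", regexp_extract("ThresholdTime", r"(\d+)", 1))
How can I put a 0 in the new column when ThresholdTime or FallTime contains no number?
The output should be: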
+--------+-------------+-----------+----------------+
|FallTime|ThresholdTime|NewFallTime|NewThresholdTime|
+--------+-------------+-----------+----------------+
|     Min|        48min|          0|              48|
|  200min|          min|        200|               0|
+--------+-------------+-----------+----------------+
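Assuming we have a DataFrame containing the values from the JSON, you can check whether a column stays the same once its digits are removed. If it does, the value contains no digits, so use 0; otherwise, strip the letters.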
from pyspark.sql import functions as F

# sqlContext is assumed to exist already (e.g. in the pyspark shell);
# with a SparkSession, spark.createDataFrame works the same way.
df = sqlContext.createDataFrame(
    [{"ThresholdTime": "48min",
      "FallTime": "15Min",
      "description": "PowerAmplifier"
      },
     {"ThresholdTime": "min",
      "FallTime": "200min",
      "description": "DolbyDigitall"}])
# What the column would look like without letters
col_without_alphabets = F.regexp_replace(df["ThresholdTime"], "[a-zA-Z]", "")
# What the column would look like without digits
col_without_numerals = F.regexp_replace(df["ThresholdTime"], "[0-9]", "")
# If removing digits leaves the value unchanged, it has no number: use 0.
# Otherwise keep only the digits by removing the letters.
df.withColumn("NewThresholdTime",
              F.when(col_without_numerals == df["ThresholdTime"],
                     F.lit(0))
              .otherwise(col_without_alphabets)).show()
Output:
+--------+-------------+--------------+----------------+
|FallTime|ThresholdTime| description|NewThresholdTime|
+--------+-------------+--------------+----------------+
| 15Min| 48min|PowerAmplifier| 48|
| 200min| min| DolbyDigitall| 0|
+--------+-------------+--------------+----------------+
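To sanity-check the rule on individual values without starting Spark, the same logic can be sketched in plain Python. The helper name expected_new_value is hypothetical and not part of the answer. It only mirrors the when/otherwise expression above for a single string.

```python
import re


def expected_new_value(value):
    """Hypothetical helper mirroring the when/otherwise rule for one string."""
    # Same check as col_without_numerals == column: no digits were removed
    without_numerals = re.sub(r"[0-9]", "", value)
    if without_numerals == value:
        return "0"
    # Same as col_without_alphabets: drop ASCII letters, keep the rest
    return re.sub(r"[a-zA-Z]", "", value)


for sample in ["48min", "min", "200min", "Min"]:
    print(sample, "->", expected_new_value(sample))
```

Note that this rule removes only ASCII letters, so a value such as "48 min" would keep its space.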
Adding to the answer to extend it to any number of columns.
Loop over whichever columns you want to apply the same operation to.
new_columns = list()
for column in ["ThresholdTime", "FallTime"]:
    # What the column would look like without letters
    col_without_alphabets = F.regexp_replace(df[column], "[a-zA-Z]", "")
    # What the column would look like without digits
    col_without_numerals = F.regexp_replace(df[column], "[0-9]", "")
    # If removing digits leaves the value unchanged, use 0; else remove the letters
    new_columns.append(F.when(col_without_numerals == df[column],
                              F.lit(0)).otherwise(col_without_alphabets).alias("New{}".format(column)))
df.select(["*"] + new_columns).show()
Output:
+--------+-------------+--------------+----------------+-----------+
|FallTime|ThresholdTime| description|NewThresholdTime|NewFallTime|
+--------+-------------+--------------+----------------+-----------+
| 15Min| 48min|PowerAmplifier| 48| 15|
| 200min| min| DolbyDigitall| 0| 200|
+--------+-------------+--------------+----------------+-----------+