Translate a function from Python to PySpark
I want to compare two PySpark DataFrames and get the differences in a new table.
I tested the following approach in plain Python.
My DataFrame:
import pandas as pd
import numpy as np

data = {'name': ['NO_VALUE', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, -999999, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df3 = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df3
My reference DataFrame:
data_ref = {'name': ['Jaso', 'Molly', 'Tina', 'Jake', 'Amy'],
            'year': [2012, 202, 2013, 2014, 2014],
            'reports': [4, 24, 31, 2, 3]}
df_ref3 = pd.DataFrame(data_ref, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df_ref3
Comparing the rows:
def compare_datasets(df, df_ref):
    ne_stacked = (df != df_ref).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']
    difference_locations = np.where(df != df_ref)
    changed_from = df.values[difference_locations]
    changed_to = df_ref.values[difference_locations]
    error_test = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    return error_test

compare_datasets(df3, df_ref3)
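For concreteness, here is the pandas approach run end to end on the two sample frames; it yields exactly the two changed cells. This is a self-contained sketch assuming only pandas and numpy:

```python
import pandas as pd
import numpy as np

idx = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']
df3 = pd.DataFrame({'name': ['NO_VALUE', 'Molly', 'Tina', 'Jake', 'Amy'],
                    'year': [2012, -999999, 2013, 2014, 2014],
                    'reports': [4, 24, 31, 2, 3]}, index=idx)
df_ref3 = pd.DataFrame({'name': ['Jaso', 'Molly', 'Tina', 'Jake', 'Amy'],
                        'year': [2012, 202, 2013, 2014, 2014],
                        'reports': [4, 24, 31, 2, 3]}, index=idx)

def compare_datasets(df, df_ref):
    # boolean mask of cells that differ, stacked into (row, column) pairs
    ne_stacked = (df != df_ref).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']
    # pull the old and new values at the differing positions
    rows, cols = np.where(df != df_ref)
    return pd.DataFrame({'from': df.values[rows, cols],
                         'to': df_ref.values[rows, cols]},
                        index=changed.index)

diff = compare_datasets(df3, df_ref3)
print(diff)
# diff has two rows: ('Cochice', 'name'): NO_VALUE -> Jaso
#                    ('Pima', 'year'):    -999999  -> 202
```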
However, I would like to do this in PySpark. Does anyone have an idea?
Thanks!
It may be hard to reproduce this behavior exactly.
Here is a partial solution:
df.show()
+----------+--------+-------+-------+
| index| name| year|reports|
+----------+--------+-------+-------+
| Cochice|NO_VALUE| 2012| 4|
| Pima| Molly|-999999| 24|
|Santa Cruz| Tina| 2013| 31|
| Maricopa| Jake| 2014| 2|
| Yuma| Amy| 2014| 3|
+----------+--------+-------+-------+
df_ref.show()
+----------+-----+----+-------+
| index| name|year|reports|
+----------+-----+----+-------+
| Cochice| Jaso|2012| 4|
| Pima|Molly|2012| 24|
|Santa Cruz| Tina|2013| 31|
| Maricopa| Jake|2014| 2|
| Yuma| Amy|2014| 3|
+----------+-----+----+-------+
df.subtract(df_ref).show()
+-------+--------+-------+-------+
| index| name| year|reports|
+-------+--------+-------+-------+
| Pima| Molly|-999999| 24|
|Cochice|NO_VALUE| 2012| 4|
+-------+--------+-------+-------+
Or, more slowly, you can do:
for col in df_ref.columns:
    out = df.select(col).subtract(df_ref.select(col))
    if out.first():
        print(out.collect())
[Row(name=u'NO_VALUE')]
[Row(year=-999999)]