Parsing an RDD containing JSON data
I have a json file containing the following data:
{"year":"2016","category":"physics","laureates":[{"id":"928","firstname":"David J.","surname":"Thouless","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"2"},{"id":"929","firstname":"F. Duncan M.","surname":"Haldane","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"4"},{"id":"930","firstname":"J. Michael","surname":"Kosterlitz","motivation":"\"for theoretical discoveries of topological phase transitions and topological phases of matter\"","share":"4"}]}
{"year":"2016","category":"chemistry","laureates":[{"id":"931","firstname":"Jean-Pierre","surname":"Sauvage","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"},{"id":"932","firstname":"Sir J. Fraser","surname":"Stoddart","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"},{"id":"933","firstname":"Bernard L.","surname":"Feringa","motivation":"\"for the design and synthesis of molecular machines\"","share":"3"}]}
I need to return an RDD of key-value pairs, with the category as the key and the list of the laureates' surnames as the value. How could I do this using transformations?
For the given dataset it should be:
"physics" - "Thouless", "Haldane", "Kosterlitz"
"chemistry" - "Sauvage", "Stoddart", "Feringa"
Are you limited to RDDs? If you can use DataFrames, then loading is trivial: you get a schema, explode the nested field, group, and then use RDDs to finish off. Here is one way you can do it.
Load the JSON into a DataFrame; you can also confirm your schema:
>>> nobelDF = spark.read.json('/user/cloudera/nobel.json')
>>> nobelDF.printSchema()
root
|-- category: string (nullable = true)
|-- laureates: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- firstname: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- motivation: string (nullable = true)
| | |-- share: string (nullable = true)
| | |-- surname: string (nullable = true)
|-- year: string (nullable = true)
Now you can explode the nested array, then convert to an RDD that you can group (note that explode needs to be imported):
from pyspark.sql.functions import explode
nobelRDD = nobelDF.select('category', explode('laureates.surname')).rdd
FYI, the exploded DataFrame looks like this:
+---------+----------+
| category| col|
+---------+----------+
| physics| Thouless|
| physics| Haldane|
| physics|Kosterlitz|
|chemistry| Sauvage|
|chemistry| Stoddart|
|chemistry| Feringa|
+---------+----------+
Now group by category:
from pyspark.sql.functions import explode, collect_list
nobelRDD = nobelDF.select('category', explode('laureates.surname')).groupBy('category').agg(collect_list('col').alias('sn')).rdd
nobelRDD.collect()
Now you have the RDD you need, although its elements are still Row objects (I added newlines to show the full rows):
>>> for n in nobelRDD.collect():
... print n
...
Row(category=u'chemistry', sn=[u'Sauvage', u'Stoddart', u'Feringa'])
Row(category=u'physics', sn=[u'Thouless', u'Haldane', u'Kosterlitz'])
But getting tuples out of it is a simple map (I added newlines to show the full rows):
>>> nobelRDD.map(lambda x: (x[0],x[1])).collect()
[(u'chemistry', [u'Sauvage', u'Stoddart', u'Feringa']),
(u'physics', [u'Thouless', u'Haldane', u'Kosterlitz'])]
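If you really are limited to RDDs, the same shape can be built without DataFrames: parse each line with json.loads in a map, emit (category, [surnames]) pairs, and combine with reduceByKey. The per-record logic is plain Python, so here is a minimal local sketch of just that logic (the to_pair helper and the inlined sample lines are my own illustration, not from the question's file; in Spark you would start from sc.textFile as shown in the comment):

```python
import json

# Two records in the same line-delimited shape as the nobel.json file above
# (trimmed to the fields this transformation actually uses).
lines = [
    '{"category":"physics","laureates":[{"surname":"Thouless"},{"surname":"Haldane"},{"surname":"Kosterlitz"}]}',
    '{"category":"chemistry","laureates":[{"surname":"Sauvage"},{"surname":"Stoddart"},{"surname":"Feringa"}]}',
]

def to_pair(line):
    """Parse one JSON line into (category, [surnames]) -- the map step."""
    rec = json.loads(line)
    return rec['category'], [p['surname'] for p in rec['laureates']]

# The RDD version would be something like:
#   sc.textFile('/user/cloudera/nobel.json') \
#     .map(to_pair) \
#     .reduceByKey(lambda a, b: a + b)
# reduceByKey concatenates the surname lists when a category spans several lines.
pairs = [to_pair(l) for l in lines]
print(pairs)
```

This avoids the DataFrame machinery entirely, at the cost of doing the schema work (field names, nesting) by hand in to_pair.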