Parsing XML PostHistory.xml from Stack Exchange Data Dump to Data Frame in Databricks
I am at a beginner level and I am trying to do some data processing. I have a dataset from the Stack Exchange Data Dump, and I want to convert the XML files to CSV using PySpark.
I followed the steps below in a Databricks notebook, but I get a table full of null values.
Here is PostHistory.xml:
<?xml version="1.0" encoding="UTF-8"?>
<posthistory>
<row ContentLicense="CC BY-SA 3.0" Text="Basically, the title says it all: is there anything that definitely confirms that Svidrigailov actually committed murder in _Crime and Punishment?_ By anything, I mean either a nuanced passage I might have missed in the actual book, some sort of letter or manuscript by Dostoyevsky, or something else. " UserId="3" CreationDate="2017-01-18T17:20:34.290" RevisionGUID="29137b21-e2d7-45a0-acfb-7309871e8cab" PostId="1" PostHistoryTypeId="2" Id="1"/>
<row ContentLicense="CC BY-SA 3.0" Text="Is there anything that definitely confirms that Svidrigailov actually committed murder in "Crime and Punishment?"" UserId="3" CreationDate="2017-01-18T17:20:34.290" RevisionGUID="29137b21-e2d7-45a0-acfb-7309871e8cab" PostId="1" PostHistoryTypeId="1" Id="2"/>
<row ContentLicense="CC BY-SA 3.0" Text="<crime-and-punishment>" UserId="3" CreationDate="2017-01-18T17:20:34.290" RevisionGUID="29137b21-e2d7-45a0-acfb-7309871e8cab" PostId="1" PostHistoryTypeId="3" Id="3"/>
<row ContentLicense="CC BY-SA 3.0" Text="It's [well known](https://en.wikipedia.org/wiki/Shakespeare's_plays#Shakespeare_and_the_textual_problem) that Shakespeare had no part in publishing the text of his own plays - indeed, many of them were only published posthumously. I've read that a significant proportion of his plays came to press by way of actors in his company, hoping to earn a little extra money, stealing copies of his scripts and smuggling them out to publishers. I've also read that the only parts of the scripts published today which were actually written by Shakespeare are the *lines* - not the stage directions, nor the setting descriptions. Unfortunately, I don't have a reliable source for this claim. Is it true? **Do any of the stage directions in modern publications of Shakespeare plays originate from the man himself?**" UserId="17" CreationDate="2017-01-18T17:25:47.547" RevisionGUID="136be093-d66b-4d40-844f-57b73c71631a" PostId="2" PostHistoryTypeId="2" Id="4"/>
I installed this library on the Databricks cluster: com.databricks:spark-xml_2.12:0.10.0
Here is what I wrote:
from pyspark.sql.types import *

schema_posthistory = StructType([
    StructField("ContentLicense", StringType()),
    StructField("CreationDate", TimestampType()),
    StructField("Id", IntegerType()),
    StructField("PostHistoryTypeId", IntegerType()),
    StructField("PostId", IntegerType()),
    StructField("RevisionGUID", StringType()),
    StructField("Text", StringType()),
    StructField("UserId", IntegerType()),
    StructField("UserDisplayName", StringType()),
    StructField("Comment", StringType())
])

PostHistoryDF = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "row") \
    .option("charset", "UTF8") \
    .schema(schema_posthistory) \
    .option("treatEmptyValuesAsNulls", "true") \
    .load("/FileStore/tables/PostHistory.xml")
This gives me an empty table.
In a Jupyter notebook I tried the following code. It works for some of the other files, but for PostHistory.xml I get errors. I think it reads some symbols (*, _) as separate attributes; I tried to handle that while creating the DataFrame, but it did not seem to work. Also, during parsing it creates 7 columns instead of 10.
from copy import deepcopy
import os
import xml.etree.ElementTree as et

def dict_fun(root):
    # merge the root element's attributes into every child row's attributes
    root_attrib = root.attrib
    for tab in root:
        tab_dict = deepcopy(root_attrib)
        tab_dict.update(tab.attrib)
        yield tab_dict

for frame in ['Badges', 'Comments', 'PostHistory', 'PostLinks', 'Posts', 'Tags', 'Users', 'Votes']:
    link = os.path.join("./LiteratureXML", frame)
    linkcsv = os.path.join("./LiteratureCSV", frame)
    tree = et.parse(link + '.xml')
    root = tree.getroot()
    tab_list = list(dict_fun(root))
    df = spark.createDataFrame(tab_list)
    # note: DataFrame.replace matches whole cell values, not substrings or
    # regular expressions, so these three calls do not clean the Text column;
    # regexp_replace from pyspark.sql.functions would be needed for that
    df = df.replace("\n", " ")
    df = df.replace("\r", " ")
    df = df.replace("[^a-zA-Z0-9]", " ")
    df.write.csv(linkcsv + ".csv", sep="|")
    df = spark.read.csv(linkcsv + ".csv", sep="|")
schema_posthistory = StructType([
    StructField("ContentLicense", StringType()),
    StructField("CreationDate", TimestampType()),
    StructField("Id", IntegerType()),
    StructField("PostHistoryTypeId", IntegerType()),
    StructField("PostId", IntegerType()),
    StructField("RevisionGUID", StringType()),
    StructField("Text", StringType()),
    StructField("UserId", IntegerType()),
    StructField("UserDisplayName", StringType()),
    StructField("Comment", StringType())
])

PostHistory_df = spark.read.format("csv") \
    .schema(schema_posthistory) \
    .option("timestampFormat", "yyyy/MM/dd HH:mm:ss") \
    .option("header", "false") \
    .option("sep", "|") \
    .load("LiteratureCSV/PostHistory.csv")
I do not know how to fix this. Can anyone tell me what I am doing wrong?
I found several issues in your code, based on the way you are trying to read the XML file.
Data issues:
The second record has double quotes inside the value of its Text attribute: "Crime and Punishment?". You must remove these double quotes and replace them with single quotes for the code below to work.
The third record has angle brackets (< and >) inside the value of its Text attribute: <crime-and-punishment>. You must also remove the < and > symbols; then the code below will run perfectly for you.
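If you would rather fix the file programmatically than by hand, a minimal pre-cleaning sketch like the one below should work: instead of deleting the offending characters it escapes them as XML entities, which makes the file well-formed. The /dbfs/ FUSE path and the exact literal strings being matched are assumptions based on the sample in the question.
# minimal pre-cleaning sketch (assumes the /dbfs/ FUSE mount is available and
# the offending substrings look exactly like the sample in the question)
with open("/dbfs/FileStore/tables/PostHistory.xml", "r", encoding="utf-8") as f:
    raw = f.read()
# escape the stray double quotes inside the second record's Text value
raw = raw.replace('in "Crime and Punishment?""',
                  'in &quot;Crime and Punishment?&quot;"')
# escape the angle brackets inside the third record's Text value
raw = raw.replace('Text="<crime-and-punishment>"',
                  'Text="&lt;crime-and-punishment&gt;"')
with open("/dbfs/FileStore/tables/PostHistory_clean.xml", "w", encoding="utf-8") as f:
    f.write(raw)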
Code issues:
First, based on the sample XML data you posted in the question, your XML data does not contain the attributes UserDisplayName and Comment.
Second, in the explicit schema you supplied for reading the XML file, try prefixing every column name with _ (an underscore). Once you do that, you will be able to see the data in the DataFrame. After reading the XML file, you can remove the _ from all the column headers by applying the transformation I show in the code below.
Third, you can also read the XML file without specifying a schema; you should get the same output, with _ in the column headers.
#explicit schema with _ in column names
from pyspark.sql.types import *

schema_posthistory = StructType([
    StructField("_ContentLicense", StringType()),
    StructField("_CreationDate", TimestampType()),
    StructField("_Id", LongType()),
    StructField("_PostHistoryTypeId", LongType()),
    StructField("_PostId", LongType()),
    StructField("_RevisionGUID", StringType()),
    StructField("_Text", StringType()),
    StructField("_UserId", LongType())
])

#reading the xml file
PostHistoryDF = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "row") \
    .option("charset", "UTF8") \
    .option("treatEmptyValuesAsNulls", "true") \
    .schema(schema_posthistory) \
    .load("/FileStore/tables/PostHistory.xml")

#renaming all columns to remove _
from pyspark.sql.functions import *

renamed_df = PostHistoryDF.select([col(colnames).alias(colnames.replace('_', '')) for colnames in PostHistoryDF.columns])
renamed_df.show()
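Since your original goal was to convert the XML files to CSV, the renamed frame can be written straight out from here. A minimal sketch; the output path and writer options are assumptions:
# hypothetical final step: persist the cleaned DataFrame as CSV
# (the output path below is an assumption)
renamed_df.write \
    .option("header", "true") \
    .option("sep", "|") \
    .mode("overwrite") \
    .csv("/FileStore/tables/PostHistoryCSV")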
You can also read the XML file without explicitly specifying a schema. But in that case you must still rename the columns to remove the _ from the column names.
PostHistoryDF = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "row") \
    .option("charset", "UTF8") \
    .option("treatEmptyValuesAsNulls", "true") \
    .load("/FileStore/tables/PostHistory.xml")

#same stuff to rename the columns to remove _
from pyspark.sql.functions import *

renamed_df = PostHistoryDF.select([col(colnames).alias(colnames.replace('_', '')) for colnames in PostHistoryDF.columns])
renamed_df.show()
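To confirm what spark-xml inferred, it can help to print the schema before renaming; every attribute-derived column should show up with a leading underscore:
# inspect the inferred schema; attribute columns appear as
# _ContentLicense, _CreationDate, _Id, and so on
PostHistoryDF.printSchema()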