"normalize" 将句子数据框转换为更大的单词数据框

Question

使用 Python 和 Spark：

假设我有一个包含句子的行的 DataFrame，我如何normalize（根据 DBMS 术语）将句子 DataFrame 转换为另一个 DataFrame，每行包含一个从句子中拆分出来的单词？

例如，假设 df_sentences 看起来像这样：

[Row(sentence_id=1, sentence=u'the dog ran the fastest.'),
 Row(sentence_id=2, sentence=u'the cat sat down.')]

我正在寻找将 df_sentences 转换为 df_words 的方法，它将获取这两行并构建一个更大的（以行数计）DataFrame，如下所示。请注意 sentence_id 被带入新的 table:

[Row(sentence_id=1, word=u'the'),
 Row(sentence_id=1, word=u'the'),
 Row(sentence_id=1, word=u'fastest'), 
 Row(sentence_id=2, word=u'dog'),
 Row(sentence_id=2, word=u'ran'), 
 Row(sentence_id=2, word=u'cat'), 
 ...clip...]

现在，目前我对行数或唯一词并不真正感兴趣，那是因为我想加入 sentence_id 上的其他 RDD 以获取我存储在其他地方的其他有趣数据。

我怀疑 spark 的大部分能力都在于管道中的这些间歇性转换，因此我想了解做事的最佳方式并开始收集我自己的 snippets/etc。

Answer 1

其实很简单。让我们从创建 DataFrame:

开始

from pyspark.sql import Row

df = sc.parallelize([
    Row(sentence_id=1, sentence=u'the dog ran the fastest.'),
     Row(sentence_id=2, sentence=u'the cat sat down.')
]).toDF()

接下来我们需要一个分词器：

from pyspark.ml.feature import RegexTokenizer

tokenizer = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern="\W+")
tokenized = tokenizer.transform(df)

最后我们放弃 sentence 和 explode words:

from pyspark.sql.functions import explode, col

transformed = (tokenized
    .drop("sentence")
    .select(col("sentence_id"), explode(col("words")).alias("word")))

最后结果：

transformed.show()

## +-----------+-------+
## |sentence_id|   word|
## +-----------+-------+
## |          1|    the|
## |          1|    dog|
## |          1|    ran|
## |          1|    the|
## |          1|fastest|
## |          2|    the|
## |          2|    cat|
## |          2|    sat|
## |          2|   down|
## +-----------+-------+

备注:

取决于数据 explode 可能相当昂贵，因为它会复制其他列。请务必在应用 explode 之前应用所有可能的过滤器，例如 StopWordsRemover

"normalize" 将句子数据框转换为更大的单词数据框

"normalize" dataframe of sentences into larger dataframe of words

python

dataframe

apache-spark

apache-spark-sql

pyspark