Filter By Specific words in spark dataframe

I have a Spark dataframe with the following data:

    +---------------------------------------------------------------------------------------------------------------------------------------------------+
    |text                                                                                                                                               |
    +---------------------------------------------------------------------------------------------------------------------------------------------------+
    |Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE                                          |
    |Serasi ade haha @AdeRais "@SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'."        |
    |Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram  ? Music: This I Believe (The Creed) - Hillsong…                          |
    +---------------------------------------------------------------------------------------------------------------------------------------------------+

The dataframe has a single column, 'text', which contains words that include a '#', e.g. '#shutUpAndDANCE'.

I am trying to read each word and filter the rows so that I am left with a list of only the words carrying a hash.

Code:

#Keep only those rows whose text contains a '#'
hashtagList = sqlContext.sql("SELECT text FROM tweetstable WHERE text LIKE '%#%'")
hashtagList.show(100, truncate=False)  # show() already prints; wrapping it in print outputs an extra 'None'

#Process rows to get the words
hashtagList = hashtagList.rdd.map(lambda p: p.text).map(lambda x: x.split(" ")).collect()
print(hashtagList)

The output is:

[[u'Know', u'what', u'you', u"don't", u'do', u'at', u'1:30', u'when', u'you', u"can't", u'sleep?', u'Music', u'shopping.', u'Now', u'I', u'want', u'to', u'dance.', u'#shutUpAndDANCE'], [...]]

Is there a way to filter everything out in my map stage so that I am left with only the #words?

hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" "))<ADD SOMETHING HERE TO FETCH ONLY #>.collect()
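The per-row logic that slot needs is plain Python: split the line into words, then keep only those starting with '#'. A minimal sketch of that filter, shown here without Spark (the helper name `hashtags_in` is illustrative, not from the original post):

```python
# Split a line into words, then keep only the words that start with '#'.
def hashtags_in(line):
    return [word for word in line.split(" ") if word.startswith("#")]

row = "Now I want to dance. #shutUpAndDANCE"
print(hashtags_in(row))  # → ['#shutUpAndDANCE']
```

Dropping this comprehension into the second `map` (or replacing both maps with a single `flatMap`) yields only the hashtag tokens.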

Try this:

from __future__ import print_function  # must come before any other import

from pyspark.sql import Row

# Avoid shadowing the built-in 'str'
text = "Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE Serasi ade haha @AdeRais @SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'.Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram? Music: This I Believe (The Creed) - Hillsong"
df = spark.createDataFrame([Row(text)])

# Flatten each row into words, then keep only the hashtags
words = df.rdd.flatMap(list) \
    .flatMap(lambda line: line.split()) \
    .filter(lambda word: word.startswith("#"))
words.foreach(print)

Or, using the DataFrame API:

>>> from pyspark.sql.functions import split, explode, col
>>>
>>> df.select(explode(split("text", r"\s+")).alias("word")) \
...     .where(col("word").startswith("#"))
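The same extraction can also be sanity-checked without Spark at all: a regular expression pulls out the hashtag tokens directly (the `#\w+` pattern is an assumption about what counts as a hashtag; it stops at punctuation, unlike a whitespace split):

```python
import re

text = "Now I want to dance. #shutUpAndDANCE and #JR50 #flipagram"
hashtags = re.findall(r"#\w+", text)
print(hashtags)  # → ['#shutUpAndDANCE', '#JR50', '#flipagram']
```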