# string methods TypeError: Column is not iterable in pyspark

I am trying to re-implement a sentiment analysis written in Python in PySpark, since I am working with big data. I am new to PySpark syntax, and I get an error when trying to apply a lemmatization function from the NLTK package.

The error: `TypeError: Column is not iterable`. The code and data are below:

+-------+--------------------+--------------------+--------------------+--------------------+
|overall|       reviewsummary|     cleanreviewText|         reviewText1|  filteredreviewText|
+-------+--------------------+--------------------+--------------------+--------------------+
|    5.0|exactly what i ne...|exactly what i ne...|[exactly, what, i...|[exactly, needed,...|
|    2.0|i agree with the ...|i agree with the ...|[i, agree, with, ...|[agree, review, o...|
|    4.0|love these... i a...|love these... i a...|[love, these, i, ...|[love, going, ord...|
|    2.0|too tiny an openi...|too tiny an openi...|[too, tiny, an, o...|[tiny, opening, t...|
|    3.0|    okay three stars|    okay three stars|[okay, three, stars]|[okay, three, stars]|
|    5.0|exactly what i wa...|exactly what i wa...|[exactly, what, i...|[exactly, wanted,...|
|    4.0|these little plas...|these little plas...|[these, little, p...|[little, plastic,...|
|    3.0|mother - in - law...|mother - in - law...|[mother, in, law,...|[mother, law, wan...|
|    3.0|item is of good q...|item is of good q...|[item, is, of, go...|[item, good, qual...|
|    3.0|i had used my las...|i had used my las...|[i, had, used, my...|[used, last, el, ...|
+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 10 rows


In [18]: dfStopwordRemoved.printSchema()
root
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- reviewsummary: string (nullable = true)
 |-- cleanreviewText: string (nullable = true)
 |-- reviewText1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filteredreviewText: array (nullable = true)
 |    |-- element: string (containsNull = true)

The lemmatization functions:

from collections import Counter
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)

    pos_counts = Counter()
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])

    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech

def Lemmatizing_Words(Words):
    Lm = WordNetLemmatizer()
    Lemmatized_Words = []
    for word in Words:
        Lemmatized_Words.append(Lm.lemmatize(word, get_part_of_speech(word)))
    return Lemmatized_Words
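For reference, the POS-picking logic in `get_part_of_speech` works fine in plain Python. A minimal, self-contained sketch of that same logic, with a stubbed list of POS tags standing in for the `.pos()` values of `wordnet.synsets(word)` so it runs without the NLTK corpora:

```python
from collections import Counter

def most_likely_pos(synset_pos_tags):
    # Same selection logic as get_part_of_speech, but on plain POS strings.
    # Seed all four WordNet tags so most_common() has a deterministic
    # fallback ("n") when the word has no synsets at all.
    pos_counts = Counter({"n": 0, "v": 0, "a": 0, "r": 0})
    pos_counts.update(synset_pos_tags)
    return pos_counts.most_common(1)[0][0]

# Hypothetical tags such as wordnet.synsets(word) might yield for a verb:
print(most_likely_pos(["v", "v", "n"]))  # → "v"
```

So the problem is not the lemmatizer itself but how it is being driven over the DataFrame column.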

(the function call)

x2 = list()
for word in dfStopwordRemoved.select('filteredreviewText'):
    x_temp = Lemmatizing_Words(word)
    x2.append(x_temp)

For the full traceback, please refer to the link below: Error

The error message is accurate: you cannot loop over a DataFrame column as if it were a standard Python iterator. To apply standard functions such as sum, mean, etc., we use the withColumn() or select() functions. In your case you have custom functionality, so you need to register your function as a udf and use it with withColumn() or select().

Here is a udf example from the Spark documentation - https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html?highlight=udf#pyspark.sql.functions.udf

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import IntegerType
>>> slen = udf(lambda s: len(s), IntegerType())
>>> @udf
... def to_upper(s):
...     if s is not None:
...         return s.upper()
...
>>> @udf(returnType=IntegerType())
... def add_one(x):
...     if x is not None:
...         return x + 1
...
>>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
>>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")).show()
+----------+--------------+------------+
|slen(name)|to_upper(name)|add_one(age)|
+----------+--------------+------------+
|         8|      JOHN DOE|          22|
+----------+--------------+------------+