How to do LabelEncoding or categorical value in Apache Spark
My dataset contains string columns. How can I encode a string-based column, the way scikit-learn's LabelEncoder does?
We are developing sparkit-learn, which aims to bring scikit-learn functionality and its API to PySpark. You can use SparkLabelEncoder as follows:
$ pip install sparkit-learn
>>> from splearn.preprocessing import SparkLabelEncoder
>>> from splearn import BlockRDD
>>>
>>> data = ["paris", "paris", "tokyo", "amsterdam"]
>>> y = BlockRDD(sc.parallelize(data))
>>>
>>> le = SparkLabelEncoder()
>>> le.fit(y)
>>> le.classes_
array(['amsterdam', 'paris', 'tokyo'],
dtype='|S9')
>>>
>>> test = ["tokyo", "tokyo", "paris"]
>>> y_test = BlockRDD(sc.parallelize(test))
>>>
>>> le.transform(y_test).toarray()
array([2, 2, 1])
>>>
>>> test = [2, 2, 1]
>>> y_test = BlockRDD(sc.parallelize(test))
>>>
>>> le.inverse_transform(y_test).toarray()
array(['tokyo', 'tokyo', 'paris'],
dtype='|S9')
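For intuition, the mapping SparkLabelEncoder produces above matches scikit-learn's LabelEncoder semantics: the classes are the sorted unique labels, and each value is encoded as its index in that sorted list. A minimal pure-Python sketch of that logic (no Spark required; the function names are illustrative, not part of splearn's API):

```python
def fit_labels(values):
    """Return the sorted unique labels, mirroring LabelEncoder.classes_."""
    return sorted(set(values))

def transform(classes, values):
    """Encode each label as its index in the sorted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values]

def inverse_transform(classes, codes):
    """Decode indices back to their original labels."""
    return [classes[i] for i in codes]

classes = fit_labels(["paris", "paris", "tokyo", "amsterdam"])
# classes == ['amsterdam', 'paris', 'tokyo']
codes = transform(classes, ["tokyo", "tokyo", "paris"])
# codes == [2, 2, 1]
labels = inverse_transform(classes, [2, 2, 1])
# labels == ['tokyo', 'tokyo', 'paris']
```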
StringIndexer is exactly what you need:
https://spark.apache.org/docs/1.5.1/ml-features.html#stringindexer
from pyspark.ml.feature import StringIndexer
df = sqlContext.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()
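Note one difference from LabelEncoder: StringIndexer assigns indices by descending label frequency, so the most common label gets 0.0 (here "a" → 0.0, "c" → 1.0, "b" → 2.0). A small pure-Python sketch of that ordering, assuming alphabetical tie-breaking (the function name is illustrative, not a Spark API):

```python
from collections import Counter

def string_indexer_order(values):
    """Map labels to float indices by descending frequency,
    ties broken alphabetically, mimicking StringIndexer's default."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

mapping = string_indexer_order(["a", "b", "c", "a", "a", "c"])
# {'a': 0.0, 'c': 1.0, 'b': 2.0}
```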