CountVectorizer: extracting features
I have the following DataFrame:
+------------------------------------------------+
|filtered |
+------------------------------------------------+
|[human, interface, computer] |
|[survey, user, computer, system, response, time]|
|[eps, user, interface, system] |
|[system, human, system, eps] |
|[user, response, time] |
|[trees] |
|[graph, trees] |
|[graph, minors, trees] |
|[graph, minors, survey] |
+------------------------------------------------+
After running CountVectorizer on the column above, I get the following output:
+------------------------------------------------+---------------------------------------------+
|filtered |features |
+------------------------------------------------+---------------------------------------------+
|[human, interface, computer] |(12,[4,7,9],[1.0,1.0,1.0]) |
|[survey, user, computer, system, response, time]|(12,[0,2,6,7,8,11],[1.0,1.0,1.0,1.0,1.0,1.0])|
|[eps, user, interface, system] |(12,[0,2,4,10],[1.0,1.0,1.0,1.0]) |
|[system, human, system, eps] |(12,[0,9,10],[2.0,1.0,1.0]) |
|[user, response, time] |(12,[2,8,11],[1.0,1.0,1.0]) |
|[trees] |(12,[1],[1.0]) |
|[graph, trees] |(12,[1,3],[1.0,1.0]) |
|[graph, minors, trees] |(12,[1,3,5],[1.0,1.0,1.0]) |
|[graph, minors, survey] |(12,[3,5,6],[1.0,1.0,1.0]) |
+------------------------------------------------+---------------------------------------------+
Now I want to run a map function over the features column and transform it into something like this:
+------------------------------------------------+--------------------------------------------------------+
|features |transformed |
+------------------------------------------------+--------------------------------------------------------+
|(12,[4,7,9],[1.0,1.0,1.0]) |["1 4 1", "1 7 1", "1 9 1"] |
|(12,[0,2,6,7,8,11],[1.0,1.0,1.0,1.0,1.0,1.0]) |["2 0 1", "2 2 1", "2 6 1", "2 7 1", "2 8 1", "2 11 1"] |
|(12,[0,2,4,10],[1.0,1.0,1.0,1.0]) |["3 0 1", "3 2 1", "3 4 1", "3 10 1"] |
[TRUNCATED]
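The transformation takes the middle array out of each feature vector and builds one string per element from it. For example, the first row of the features column contains: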
(12,[4,7,9],[1.0,1.0,1.0])
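Take its middle array [4,7,9] and pair each element with the corresponding frequency in the third array [1.0,1.0,1.0]. Because this is row 1, prepend "1" to each pair to get the following output: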
["1 4 1", "1 7 1", "1 9 1"]
In general, the format is:
["RowNumber MiddleFeatEl CorrespondingFreq", ....]
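The format described above can be sketched in plain Python, without Spark. This is only an illustration: the function name `to_triples` is hypothetical, and the index and frequency lists are copied by hand from rows 1 and 4 of the features column shown earlier.

```python
def to_triples(row_number, indices, values):
    """Build "RowNumber Index Frequency" strings from a sparse vector's parts."""
    return [
        "{} {} {}".format(row_number, int(j), int(v))
        for j, v in zip(indices, values)
        if v  # skip entries whose frequency is zero
    ]

# Row 1: (12,[4,7,9],[1.0,1.0,1.0])
first = to_triples(1, [4, 7, 9], [1.0, 1.0, 1.0])
# Row 4: (12,[0,9,10],[2.0,1.0,1.0])
fourth = to_triples(4, [0, 9, 10], [2.0, 1.0, 1.0])

print(first)   # ['1 4 1', '1 7 1', '1 9 1']
print(fourth)  # ['4 0 2', '4 9 1', '4 10 1']
```

The frequencies are cast to `int` here only to match the "1 4 1" style in the desired output; without the cast they would print as `1.0`.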
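I am unable to extract the middle (index) list and the last (frequency) list separately from the column produced by CountVectorizer when applying a map function. Here is my mapping code: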
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def corpus_create(feats):
    return feats[1]  # Here I want to get [4,7,9] instead of a single feature score.

corpus_udf = udf(lambda feats: corpus_create(feats), StringType())
df3 = df.withColumn("corpus", corpus_udf("features"))
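A likely reason `corpus_create` returns a single score: `feats[1]` indexes into the vector itself, so it yields the value stored at position 1 rather than the middle array of the printed triple. A PySpark sparse vector exposes its two arrays as the `indices` and `values` attributes. The sketch below shows that extraction on a stand-in object; `SimpleNamespace` is used only so the snippet runs without Spark, and the name `indices_and_values` is hypothetical.

```python
from types import SimpleNamespace

def indices_and_values(feats):
    """Return the index list and the frequency list as plain Python lists."""
    # int()/float() turn NumPy scalars into built-in Python types
    indices = [int(j) for j in feats.indices]
    values = [float(v) for v in feats.values]
    return indices, values

# Stand-in (assumption) for the first row's vector (12,[4,7,9],[1.0,1.0,1.0])
vec = SimpleNamespace(size=12, indices=[4, 7, 9], values=[1.0, 1.0, 1.0])

idx, vals = indices_and_values(vec)
print(idx)   # [4, 7, 9]
print(vals)  # [1.0, 1.0, 1.0]
```

If a function like this were wrapped in a UDF, the declared return type would presumably need to describe a list (for example `ArrayType(IntegerType())` for the indices) rather than `StringType()`.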
Row numbers are basically meaningless in Spark SQL, but if you don't mind that:
def f(x):
    # zipWithIndex yields (row, index) pairs; the index starts at 0
    row, i = x
    jvs = (
        # SparseVector
        zip(row.features.indices, row.features.values)
        if hasattr(row.features, "indices")
        # DenseVector
        else enumerate(row.features.toArray()))
    # Keep only non-zero entries, formatted as "row index value"
    s = ["{} {} {}".format(i, j, v) for j, v in jvs if v]
    return row + (s, )

df.rdd.zipWithIndex().map(f).toDF(df.columns + ["transformed"])