How can I print my tokens when using pyspark.ml.feature.Tokenizer?
I would like to see the tokens that are created when using pyspark.ml.feature.Tokenizer. How can I do that? Say I have this code:
tokenizer = Tokenizer(inputCol="SystemInfo", outputCol="words")
I tried print(vars(tokenizer)) to print the tokens, but of course that only returns the transformer's attributes.
The full code can be found here: https://docs.microsoft.com/de-de/azure/hdinsight/spark/apache-spark-ipython-notebook-machine-learning
You need to call transform and then show the result, that's all there is to it. Here is a simple example to guide you. Hope it helps.
from pyspark.ml.feature import Tokenizer
from pyspark.sql import SparkSession

# In a notebook the `spark` session usually already exists; create one otherwise
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
(0, 'Hello and good day'),
(1, 'This is a simple demonstration'),
(2, 'Natural and unnatural language processing')
], ['id', 'sentence'])
df.show(truncate=False)
# +---+-----------------------------------------+
# |id |sentence                                 |
# +---+-----------------------------------------+
# |0  |Hello and good day                       |
# |1  |This is a simple demonstration           |
# |2  |Natural and unnatural language processing|
# +---+-----------------------------------------+
# Tokenizer lowercases the input string and splits it on whitespace
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(df)
tokenized.select('words').show(truncate=False)
# +-----------------------------------------------+
# |words                                          |
# +-----------------------------------------------+
# |[hello, and, good, day]                        |
# |[this, is, a, simple, demonstration]           |
# |[natural, and, unnatural, language, processing]|
# +-----------------------------------------------+
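If you want the tokens back on the driver as plain Python lists, e.g. to print them yourself rather than display them as a table, you can collect the words column. A minimal sketch, continuing from the tokenized DataFrame above:
# Bring the tokens to the driver and print each row as a Python list
for row in tokenized.select('words').collect():
    print(row.words)
# ['hello', 'and', 'good', 'day']
# ['this', 'is', 'a', 'simple', 'demonstration']
# ['natural', 'and', 'unnatural', 'language', 'processing']
Note that collect() pulls every row into driver memory, so only do this on small data. In your case, keep inputCol="SystemInfo" and call tokenizer.transform(df) on your own DataFrame in exactly the same way.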