How do I get Python to plot a histogram of the number of unique words in a column containing text?

Question

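I have a dataset called 'my_data', which I assign to a generic variable called 'data'. In my dataset, I have a column called 'impression'. This 'impression' column contains text from medical notes, for example "Lesion observed in occipital region."

I want to plot a histogram of the number of unique words that appear in this column. Here is the Python script I am using and the error it produces: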

data = my_text_dataset   # assigns my data set to a generic variable called 'data' 

TEXT_COLUMN = 'impression'  # note: one of the columns in this data set is called 'impression'
text = data[TEXT_COLUMN]

def plot_word_number_histogram(text):
    text.str.split().\
        map(lambda x: len(x)).\
        hist()

plot_word_number_histogram(data['impression'])

Python (Jupyter notebook) returns this error:

~\Anaconda3\lib\site-packages\pandas\core\base.py in _map_values(self, mapper, na_action)
   1152 
   1153         # mapper is a function
-> 1154         new_values = map_f(values, mapper)
   1155 
   1156         return new_values

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-68-95bf9c5b8264> in <lambda>(x)
      2 def plot_word_number_histogram(text):
      3     text.str.split().\
----> 4         map(lambda x: len(x)).\
      5         hist()

TypeError: object of type 'float' has no len()

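Note: this script works fine on other text columns. I noticed that the newer dataset I am using also contains some numbers, such as "HISTORY: 1. aneurysm 2. metastasis etc." I suspect this forces a type conversion in Python that breaks my script above, but I could be wrong?

Can anyone suggest an adjustment to my script so that it converts the data from 'float' to 'int', so that it can be passed into the histogram?

Thanks very much!!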

Answer 1

Convert x to a string first and then take its length. You will then get the length of the number.

def plot_word_number_histogram(text):
    # str(x) turns every value, including a float, into a string before len() is called
    text.str.split().\
        map(lambda x: len(str(x))).\
        hist()
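
One caveat, offered tentatively: the float in the traceback is most likely NaN, which is how pandas represents a missing value, rather than a number inside the text, because str.split() passes NaN through unchanged. Also note that len(str(x)) measures the number of characters in the string form of each value, not the number of words. Below is a minimal sketch that keeps the word count, assuming missing notes should count as zero words. The helper name word_counts and the sample series are made up for illustration.

```python
import numpy as np
import pandas as pd


def word_counts(text: pd.Series) -> pd.Series:
    """Number of whitespace-separated words per row; missing rows count as 0."""
    # fillna("") replaces NaN (a float) with an empty string, so every row is a str
    tokens = text.fillna("").astype(str).str.split()
    # .str.len() on a Series of lists returns the length of each list
    return tokens.str.len()


# Hypothetical sample data, including one missing note
sample = pd.Series([
    "Lesion observed in occipital region.",
    np.nan,
    "HISTORY: 1. aneurysm 2. metastasis etc.",
])
counts = word_counts(sample)
```

With the question's data this could be called as word_counts(data['impression']).hist(). If missing notes should be left out of the histogram rather than counted as zero, calling text.dropna() before splitting is an alternative.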

You may want to consider whether to treat numbers the same way as words, or to ignore numbers and special characters.
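
As a rough sketch of the second option: the code below assumes "unique words" means distinct words within each note, and that a word is a run of ASCII letters. Both are assumptions, as are the names unique_word_counts and WORD_RE and the sample data.

```python
import re

import pandas as pd

# Assumption: a "word" is a run of ASCII letters; digits and punctuation are dropped
WORD_RE = re.compile(r"[a-z]+")


def unique_word_counts(text: pd.Series) -> pd.Series:
    """Number of distinct words per row, ignoring case, digits and punctuation."""

    def count_unique(value) -> int:
        # Missing values (NaN/None) and other non-text values count as 0 words
        if not isinstance(value, str):
            return 0
        return len(set(WORD_RE.findall(value.lower())))

    return text.map(count_unique)


# Hypothetical sample data
sample = pd.Series([
    "Lesion observed in occipital region.",
    "HISTORY: 1. aneurysm 2. metastasis etc.",
    "No lesion, no mass.",
    None,
])
unique_counts = unique_word_counts(sample)
```

Calling unique_word_counts(data['impression']).hist() would then plot the distribution. Treating numbers as words instead would only require widening the pattern, for example to r"[a-z0-9]+".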

python

nlp

histogram

pandas
