How to convert text into a DataFrame in Python (applying some rules)

Question

I am new to Python and need to build this fairly complicated function, but I don't know how.

I have a DataFrame of text:

RepID     RepText
---------------------------
1         Math Math Math  English Physics Sport Sport English English English English 
2         Sport English English English Math Math Physics Physics Physics Computer Computer Computer Computer 
3         Chemistry Chemistry Math Math Math English English English Math Math Math Math Math Sport Sport

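I need to create a function named fnClusters.

It simply finds the words that are repeated N times in RepText and returns them in a DataFrame.

If N is 3, a word is counted only when it appears 3 or more times in a row.

So Math Math Math Math English Physics English English Math would be counted as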

Math  English  Physics
------------------------
4       0       0

English English English English English Math Math Math English Math Sports Sports would be counted as

Math  English  Sports
------------------------
4       6       0

How can I build this function in Python?

Answer 1

One way, using pandas.Series.str.split and value_counts:

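import pandas as pd
# Split each row of df["RepText"] on whitespace and count every word per row.
new_df = df["RepText"].str.split().apply(pd.Series.value_counts)
n = 3
# Keep counts of at least n; everything else becomes 0.
print(new_df[new_df.ge(n)].fillna(0))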

Output:

   English  Math  Sport  Physics  Computer  Chemistry
0      5.0   3.0    0.0      0.0       0.0        0.0
1      3.0   0.0    0.0      3.0       4.0        0.0
2      3.0   8.0    0.0      0.0       0.0        0.0
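
Note that value_counts counts every occurrence of a word in a row, whether or not the occurrences are adjacent. If only consecutive repetitions should count, a minimal sketch with itertools.groupby might look like the one below. It assumes that only the words inside a run of at least n are counted, which reproduces the first example in the question (Math 4, English 0, Physics 0) but not the totals in the second one. The parameter names n and text_col are my own choices, not something from the question.

```python
from itertools import groupby

import pandas as pd


def fnClusters(df, n=3, text_col="RepText"):
    """Per row, count the words that sit in a run of at least n equal neighbours.

    Assumption: shorter runs of the same word contribute nothing.
    """
    rows = []
    for text in df[text_col]:
        counts = {}
        # groupby yields one group per run of consecutive equal words.
        for word, run in groupby(text.split()):
            length = sum(1 for _ in run)
            counts.setdefault(word, 0)
            if length >= n:
                counts[word] += length
        rows.append(counts)
    # Words that never occur in a row become 0 instead of NaN.
    return pd.DataFrame(rows, index=df.index).fillna(0).astype(int)


df = pd.DataFrame({
    "RepID": [1, 2, 3],
    "RepText": [
        "Math Math Math  English Physics Sport Sport English English English English ",
        "Sport English English English Math Math Physics Physics Physics Computer Computer Computer Computer ",
        "Chemistry Chemistry Math Math Math English English English Math Math Math Math Math Sport Sport",
    ],
})

result = fnClusters(df, n=3)
print(result)
```

With this rule, row 0 gives English 4 rather than 5, because the single English between Math and Physics is not part of a run of three.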

python

nlp