NLP: split dictionary and then transform it into dataframe

Question

My dataset looks like this:

dt = [{author: ...., text: ....},...,{author: ...., text: ....}]

I want to split each text into chunks and then build a DataFrame of the following form:

df = chunk1 of text1    author of chunk1
     ...............    ................

and so on.

I can generate the chunks with this function:

textwrap.wrap(text, width = 200, break_long_words=False)

and I can convert a list of dictionaries to a DataFrame with:

# Convert the list of dictionaries to dataframe
df = pd.DataFrame.from_dict(dt)

But I don't know how to match each chunk with its author. I would appreciate any help!

Answer 1

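I think you can use a nested list comprehension: iterate over the rows of the original dt and, for each row, iterate over the list of chunks produced by textwrap.wrap, creating a new dictionary that pairs each chunk with the corresponding author. Does the code below give you the expected output?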

import pandas as pd
import textwrap

# Sample data
dt = [{'author': 'A', 'text': 'Hello world!'}, {'author': 'B', 'text': '!dlrow olleH'}]

# Sample width
width=6

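# Build one record per chunk, repeating the author for every chunk of that row's text
new_dt = [
    {'chunk': chunk, 'author': row['author']}
    for row in dt
    for chunk in textwrap.wrap(row['text'], width=width, break_long_words=False)
]

# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(new_dt)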
print(df)
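
If you would rather stay inside pandas, a possible alternative (not the approach above, just a sketch) is to store the list of chunks in a column and let DataFrame.explode repeat the author for each chunk. The helper name chunk_frame and the column name 'chunk' are made up for illustration, and the sketch assumes every record has 'author' and 'text' keys:

```python
import textwrap

import pandas as pd


def chunk_frame(records, width=200):
    """Return one row per text chunk, paired with that text's author.

    Hypothetical helper: assumes each record is a dict with
    'author' and 'text' keys.
    """
    df = pd.DataFrame(records)
    # Each cell of 'chunk' temporarily holds the list of chunks for that text
    df['chunk'] = df['text'].apply(
        lambda t: textwrap.wrap(t, width=width, break_long_words=False)
    )
    # explode() turns each list element into its own row and
    # repeats the other columns (here: author) alongside it
    df = df.explode('chunk').reset_index(drop=True)
    return df[['chunk', 'author']]


# Same sample data and width as in the answer above
dt = [{'author': 'A', 'text': 'Hello world!'}, {'author': 'B', 'text': '!dlrow olleH'}]
df = chunk_frame(dt, width=6)
print(df)
```

One difference to be aware of: textwrap.wrap returns an empty list for an empty text, and explode keeps such a row with a missing value in 'chunk', whereas the list comprehension simply produces no row for it.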
