Split sentences into substrings containing varying number of words using pandas

My question is related to a previous question of mine:

Suppose I have the following DataFrame in pandas:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing tennis.
3   This is the third document and it looks very good today.
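
For reference, the sample frame can be built like this (a quick sketch; df is the name the answer below assumes):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': [
        'I am the first document and I am very happy.',
        'Here is the second document and it likes playing tennis.',
        'This is the third document and it looks very good today.',
    ],
})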

I want to split the text of each id into chunks of a random number of words (varying between two values, e.g. 1 and 5), so that in the end I get something like the following:

id  text
1   I am the
1   first document
1   and I am very
1   happy
2   Here is
2   the second document and it
2   likes playing
2   tennis
3   This is the third
3   document and it
3   looks very
3   good today

Keep in mind that my dataframe may also have other columns besides these two; they should simply be copied over to the new dataframe in the same way as the id above.

What is the most efficient way to do this?

Use itertools.islice:

Define a function that extracts chunks in a random fashion:

from itertools import islice
import random

lo, hi = 3, 5  # change this to whatever bounds you want

def extract_chunks(it):
    # Keep pulling a random number of tokens (between lo and hi) from the
    # iterator; islice returns fewer (or none) once it is exhausted.
    chunks = []
    while True:
        chunk = list(islice(it, random.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(' '.join(chunk))

    return chunks
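
As a quick sanity check, you can call it on a single sentence (illustrative only; the chunk boundaries change from run to run because the sizes are drawn at random):

sample = "I am the first document and I am very happy."
# Note: pass an iterator, not a list, so islice keeps consuming where it left off.
extract_chunks(iter(sample.split()))
# e.g. ['I am the first', 'document and I am', 'very happy.']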

Call the function through a list comprehension to keep the overhead as small as possible, then stack to get the output:

pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']], index=df['id']
).stack()

id   
1   0                    I am the
    1        first document and I
    2              am very happy.
2   0                 Here is the
    1         second document and
    2    it likes playing tennis.
3   0           This is the third
    1       document and it looks
    2            very good today.
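
The stacked result is a Series with a MultiIndex of (id, chunk position). If you want it back as a plain two-column frame like in the question, one option (assuming the stacked Series is stored in a variable, here called s) is:

s.reset_index(level=1, drop=True).rename('text').reset_index()
# columns: id, text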

You can extend the extract_chunks function to do the tokenisation; for now I've used a simple whitespace split, which you can modify.
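
For instance, here is a rough sketch of plugging in a regex-based tokeniser instead (the pattern and the helper name extract_chunks_regex are just illustrations, not part of the original answer):

import re

def extract_chunks_regex(text):
    # Separate words from punctuation rather than splitting on whitespace only.
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    return extract_chunks(iter(tokens))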


Note that if you don't want to touch the other columns, you can do something akin to a melt here.

u = pd.DataFrame([
    extract_chunks(iter(text.split())) for text in df['text']])

(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text'])))
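
Since the rows yield different numbers of chunks, u is padded with NaN, so the melted frame will contain empty values. A possible clean-up (a sketch; the value_name argument, the NaN drop, and the final column drop are my additions):

(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text']), value_name='text')
   .dropna(subset=['text'])          # remove the NaN padding
   .drop(columns='variable'))        # drop the chunk-position column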