Split sentences into substrings containing a varying number of words using pandas
My question is related to a past question of mine.
Suppose I have the following DataFrame in pandas:
id text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing tennis.
3 This is the third document and it looks very good today.
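I want to split the text for each id into tokens made of a random number of words (varying between two values, e.g. 1 and 5), so that I end up with something like the following: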
id text
1 I am the
1 first document
1 and I am very
1 happy
2 Here is
2 the second document and it
2 likes playing
2 tennis
3 This is the third
3 document and
3 looks very
3 very good today
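Keep in mind that my DataFrame may have other columns besides these two, and they should simply be copied to the new DataFrame in the same way as id above.
What is the most efficient way to do this?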
Using itertools.islice:
Define a function that extracts chunks of random size:
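from itertools import islice
import random

import pandas as pd

lo, hi = 3, 5  # inclusive bounds on words per chunk; change this to whatever

def extract_chunks(it):
    chunks = []
    while True:
        # Take the next lo..hi words; an empty list means the iterator is exhausted.
        chunk = list(islice(it, random.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks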
Call the function through a list comprehension to keep the overhead as low as possible, then stack to get your output:
pd.DataFrame(
    [extract_chunks(iter(text.split())) for text in df['text']],
    index=df['id']
).stack()
id
1   0    I am the
    1    first document and I
    2    am very happy.
2   0    Here is the
    1    second document and
    2    it likes playing tennis.
3   0    This is the third
    1    document and it looks
    2    very good today.
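To get from the stacked Series to the flat id/text layout shown in the question, one option is to drop the inner index level and reset the index. This is a minimal sketch, assuming the sample frame from the question. The seeded `rng`, the explicit `dropna()` and the `flat` name are my own additions for illustration, not part of the original answer; chunk boundaries will differ from run to run without a seed.

```python
import random
from itertools import islice

import pandas as pd

lo, hi = 3, 5  # inclusive bounds on words per chunk
rng = random.Random(0)  # assumption: seeded only so the run is repeatable


def extract_chunks(it):
    """Consume iterator `it`, joining successive random-sized groups of words."""
    chunks = []
    while True:
        chunk = list(islice(it, rng.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks


df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': [
        'I am the first document and I am very happy.',
        'Here is the second document and it likes playing tennis.',
        'This is the third document and it looks very good today.',
    ],
})

stacked = pd.DataFrame(
    [extract_chunks(iter(text.split())) for text in df['text']],
    index=df['id'],
).stack().dropna()  # dropna() guards against padding left by shorter rows

# Drop the chunk-position level, then turn the id index back into a column.
flat = stacked.droplevel(1).rename('text').reset_index()
print(flat)
```

Joining the chunks for each id back together with spaces should reproduce the original sentence, which is a handy check that no words were lost.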
You can extend the extract_chunks function to perform tokenization. Right now, I use a simple split on whitespace, which you can modify.
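One way to follow that suggestion is to pass the tokenizer in as a parameter. The sketch below is only an assumption about how that might look: `extract_chunks_with`, its `tokenize` and `rng` parameters, `word_tokens` and the regex are all illustrative names and choices, not part of the original answer. The regex keeps runs of word characters and apostrophes, so trailing punctuation such as the period in "happy." is dropped, as in the desired output in the question.

```python
import random
import re
from itertools import islice


def extract_chunks_with(text, tokenize, lo=1, hi=5, rng=random):
    """Tokenize `text`, then join successive groups of lo..hi tokens."""
    it = iter(tokenize(text))
    chunks = []
    while True:
        chunk = list(islice(it, rng.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks


def word_tokens(text):
    # Hypothetical tokenizer: keep word characters and apostrophes only.
    return re.findall(r"[\w']+", text)


sentence = 'I am the first document and I am very happy.'
chunks = extract_chunks_with(sentence, word_tokens, rng=random.Random(42))
print(chunks)
```

Passing a `random.Random` instance makes the chunk sizes repeatable; leaving `rng` at its default uses the module-level generator.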
Note that if you have other columns that you don't want to touch, you can do something like a melting operation here.
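u = pd.DataFrame(
    [extract_chunks(iter(text.split())) for text in df['text']])

# Keep every column except 'text' as an identifier, then melt the chunk columns.
# drop(columns=...) replaces the positional axis argument, which newer pandas rejects.
(pd.concat([df.drop(columns='text'), u], axis=1)
   .melt(df.columns.difference(['text']).tolist()))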
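On pandas 0.25 or later, `DataFrame.explode` is another way to carry the remaining columns along, and it avoids the padding rows that melting a ragged frame can leave behind. This is a sketch under stated assumptions rather than the answer's own method: the extra `author` column, the seeded `rng` and the `out` name are hypothetical and exist only for illustration.

```python
import random
from itertools import islice

import pandas as pd

lo, hi = 3, 5  # inclusive bounds on words per chunk
rng = random.Random(1)  # assumption: seeded only so the run is repeatable


def extract_chunks(it):
    """Consume iterator `it`, joining successive random-sized groups of words."""
    chunks = []
    while True:
        chunk = list(islice(it, rng.randint(lo, hi)))
        if not chunk:
            break
        chunks.append(' '.join(chunk))
    return chunks


df = pd.DataFrame({
    'id': [1, 2, 3],
    'author': ['a', 'b', 'c'],  # hypothetical extra column to be copied along
    'text': [
        'I am the first document and I am very happy.',
        'Here is the second document and it likes playing tennis.',
        'This is the third document and it looks very good today.',
    ],
})

# Replace each sentence with its list of chunks, then emit one row per chunk.
chunk_lists = pd.Series(
    [extract_chunks(iter(text.split())) for text in df['text']],
    index=df.index,
)
out = df.assign(text=chunk_lists).explode('text').reset_index(drop=True)
print(out)
```

Every column other than `text` is repeated once per chunk, so no explicit list of identifier columns is needed.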