Tokenise text and create more rows for each row in dataframe
I want to do this with Python and pandas.
Suppose I have the following:
file_id text
1 I am the first document. I am a nice document.
2 I am the second document. I am an even nicer document.
What I want to end up with is:
file_id text
1 I am the first document
1 I am a nice document
2 I am the second document
2 I am an even nicer document
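So I want the text of each file to be split at every full stop, with a new row created for each resulting token.
What is the most efficient way to do this?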
import pandas as pd

df = pd.DataFrame({'file_id': [1, 2],
                   'text': ["I am the first document. I am a nice document.",
                            "I am the second document. I am an even nicer document."]})
# Split on periods, trimming whitespace and dropping empty pieces
df['sents'] = df.text.apply(lambda txt: [x.strip() for x in txt.split(".") if x.strip()])
# One column per sentence, then stack those columns into rows
df = (df.set_index(['file_id'])
        .apply(lambda x: pd.Series(x['sents']), axis=1)
        .stack()
        .reset_index(level=1, drop=True))
df = df.reset_index()
df.columns = ['file_id', 'text']
Use:
s = (df.pop('text')
       .str.strip('.')
       .str.split(r'\.\s+', expand=True)
       .stack()
       .rename('text')
       .reset_index(level=1, drop=True))
df = df.join(s).reset_index(drop=True)
print(df)
file_id text
0 1 I am the first document
1 1 I am a nice document
2 2 I am the second document
3 2 I am an even nicer document
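Explanation:
First use DataFrame.pop to extract the column, then remove the last . with Series.str.strip (Series.str.rstrip does the same here) and split with Series.str.split, escaping the . because it is a special regex character. Reshape the result into a Series with DataFrame.stack, drop the extra index level with reset_index, and rename the Series so that DataFrame.join can attach it back to the original DataFrame.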
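For comparison, here is a minimal sketch of the same split-and-reshape idea using DataFrame.explode instead of stack and join. This is not the method from the answer above. It assumes pandas 0.25 or newer (where explode is available) and rebuilds the sample frame from the question; the names sentences and result are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "file_id": [1, 2],
    "text": [
        "I am the first document. I am a nice document.",
        "I am the second document. I am an even nicer document.",
    ],
})

# Drop the trailing period, then split on a period followed by whitespace.
# Each cell becomes a list of sentences.
sentences = df["text"].str.rstrip(".").str.split(r"\.\s+")

# explode() gives every list element its own row and repeats file_id.
result = (
    df.assign(text=sentences)
      .explode("text")
      .reset_index(drop=True)
)
print(result)
```

Whether this is faster than the stack-based version depends on the data, so it is worth timing both on a realistic sample.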