Extract substring from text in a pandas DataFrame as new column
I have a list of words that I want to look for, shown below:
word_list = ['one','three']
I have a column in a pandas DataFrame containing the text below.
TEXT |
-------------------------------------------|
"Perhaps she'll be the one for me." |
"Is it two or one?" |
"Mayhaps it be three afterall..." |
"Three times and it's a charm." |
"One fish, two fish, red fish, blue fish." |
"There's only one cat in the hat." |
"One does not simply code into pandas." |
"Two nights later..." |
"Quoth the Raven... nevermore." |
The expected output is shown below. It keeps the original TEXT column, but extracts only the words from word_list into a new column:
TEXT | EXTRACT
-------------------------------------------|---------------
"Perhaps she'll be the one for me." | one
"Is it two or one?" | one
"Mayhaps it be three afterall..." | three
"Three times and it's a charm." | three
"One fish, two fish, red fish, blue fish." | one
"There's only one cat in the hat." | one
"One does not simply code into pandas." | one
"Two nights later..." |
"Quoth the Raven... nevermore." |
Is there a way to do this in Python 2.7?
Use str.extract:
import re  # needed for the re.IGNORECASE flag
df['EXTRACT'] = df.TEXT.str.extract('({})'.format('|'.join(word_list)),
                                    flags=re.IGNORECASE, expand=False).str.lower().fillna('')
df['EXTRACT']
0 one
1 one
2 three
3 three
4 one
5 one
6 one
7
8
Name: EXTRACT, dtype: object
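Each word in word_list is joined with the regex alternation operator | and the result is passed to str.extract for regex pattern matching. Because the pattern has a single capture group and expand=False, the result is a Series holding the first match in each row, or NaN where nothing matches. The re.IGNORECASE flag makes the comparison case-insensitive, the matches are then lower-cased to agree with your expected output, and fillna('') turns the rows without a match into empty strings.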
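For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample DataFrame from the question and makes two changes that are my own additions, not part of the snippet above: each word is passed through re.escape, and the group is wrapped in \b word boundaries so that a word such as "one" is not picked up inside a longer word like "someone". On the sample data both versions should give the same result.

```python
import re

import pandas as pd

word_list = ['one', 'three']

# Sample data copied from the question
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Is it two or one?",
    "Mayhaps it be three afterall...",
    "Three times and it's a charm.",
    "One fish, two fish, red fish, blue fish.",
    "There's only one cat in the hat.",
    "One does not simply code into pandas.",
    "Two nights later...",
    "Quoth the Raven... nevermore.",
]})

# Assumption (not in the original answer): escape each word and require
# word boundaries, so only whole words from word_list are matched.
pattern = r'\b({})\b'.format('|'.join(re.escape(w) for w in word_list))

df['EXTRACT'] = (
    df['TEXT']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match per row, NaN if none
    .str.lower()   # normalise "One" / "Three" to lower case
    .fillna('')    # rows without a match become empty strings
)

print(df)
```

Note that str.extract only returns the first match in each row. If a row could contain several words from the list and you need all of them, str.findall with the same pattern is probably the better fit.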