Python: Extract non-English words and iterate over a dataframe
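I have a table of about 30,000 rows and need to extract the non-English words from a column named outcome in a dataframe named dummy_df. I need to put the non-English words in an adjacent column named non_english.
A dummy dataset looks like this: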
import pandas
dummy_df = pandas.DataFrame({'outcome': ["I want to go to church", "I love Matauranga", "Take me to Oranga Tamariki"]})
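My idea is to extract the non-English words from a sentence and then repeat that process over the dataframe. I was able to extract the non-English words from a single sentence accurately with the following code: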
import nltk
nltk.download('words')
from nltk.corpus import words
words = set(nltk.corpus.words.words())
sent = "I love Matauranga"
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() not in words or not w.isalpha())
The result of the above code is 'Matauranga', which is exactly right.
But when I try to apply the same logic over the dataframe with this code:
import nltk
nltk.download('words')
from nltk.corpus import words
def no_english(text):
    words = set(nltk.corpus.words.words())
    " ".join(w for w in nltk.wordpunct_tokenize(text['outcome'])
             if w.lower() not in words or not w.isalpha())
dummy_df['non_english'] = dummy_df.apply(no_english, axis = 1)
print(dummy_df)
I get an undesired result: the non_english column contains None values instead of the expected non-English words (see below):
                      outcome  non_english
0      I want to go to church         None
1           I love Matauranga         None
2  Take me to Oranga Tamariki         None
3                                     None
Instead, the desired result should be:
                      outcome      non_english
0      I want to go to church
1           I love Matauranga       Matauranga
2  Take me to Oranga Tamariki  Oranga Tamariki
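Your function is missing a return. Without it, the function implicitly returns None for every row, and that is what apply stores in the column. With the return added: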
import nltk
nltk.download('words')
from nltk.corpus import words
def no_english(text):
    words = set(nltk.corpus.words.words())
    return " ".join(w for w in nltk.wordpunct_tokenize(text['outcome'])
                    if w.lower() not in words or not w.isalpha())
dummy_df['non_english'] = dummy_df.apply(no_english, axis = 1)
print(dummy_df)
Output:
                      outcome      non_english
0      I want to go to church
1           I love Matauranga       Matauranga
2  Take me to Oranga Tamariki  Oranga Tamariki
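One possible refinement for a table of about 30,000 rows: the function above rebuilds the word set on every call, so it may be worth building the set once and passing it in. The sketch below is only an illustration under stated assumptions. The names non_english_words, TOKEN_PATTERN and vocabulary are hypothetical, the tiny vocabulary is a stand-in for set(nltk.corpus.words.words()) so the example runs without downloading the corpus, and the regular expression is, to my knowledge, the pattern that nltk.wordpunct_tokenize is built on.

```python
import re

# Assumed to mirror the pattern behind nltk.wordpunct_tokenize:
# runs of word characters, or runs of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def non_english_words(text, vocabulary):
    """Return the tokens of `text` that are not in `vocabulary`, space-joined."""
    tokens = TOKEN_PATTERN.findall(text)
    return " ".join(
        t for t in tokens
        if t.lower() not in vocabulary or not t.isalpha()
    )


# Hypothetical stand-in vocabulary, for illustration only. In practice it
# would be built once, outside the function, from the NLTK words corpus.
vocabulary = {"i", "want", "to", "go", "church", "love", "take", "me"}

sentences = [
    "I want to go to church",
    "I love Matauranga",
    "Take me to Oranga Tamariki",
]

# The set is created once and reused for every sentence.
results = [non_english_words(s, vocabulary) for s in sentences]
print(results)  # ['', 'Matauranga', 'Oranga Tamariki']
```

On the dataframe itself this would presumably become dummy_df['non_english'] = dummy_df['outcome'].apply(lambda s: non_english_words(s, vocabulary)), which also works on the single column directly instead of passing whole rows with axis = 1.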