Removing words from Python lists?
I'm a complete novice at Python and web scraping, and I've run into some problems early on. I've been able to scrape the headlines from a Dutch news site and split them into words. My objective now is to remove certain words from the results; for example, I don't want words like "het" and "om" to appear in the list. Does anyone know how I can do this? (I'm using Python with requests and BeautifulSoup.)
import requests
from bs4 import BeautifulSoup

url = "http://www.nu.nl"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("span", {"class": "title"})
for item in g_data:
    print(item.text.split())
In natural language processing, the term for common words that you exclude is "stop words".
Do you want to preserve the order and count of each word, or do you just want the set of words that appear on the page?
If you just want the set of words that appear on the page, a set is probably the way to go. Something like the following might work:
# It's probably more common to define your STOP_WORDS in a file and then read
# them into your data structure, to keep things simple when there are large
# numbers of those words.
STOP_WORDS = {
    'het',
    'om',
}

all_words = set()
for item in g_data:
    all_words |= set(item.text.split())
all_words -= STOP_WORDS
print(all_words)
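As the comment above suggests, a longer stop-word list is usually kept in a file, one word per line. A minimal sketch of loading one (the filename `stop_words.txt` is just an example, not something from the original post):

```python
def load_stop_words(path):
    """Read a stop-word file (one word per line) into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Create a small example file, then load it back.
with open("stop_words.txt", "w", encoding="utf-8") as f:
    f.write("het\nom\n")

STOP_WORDS = load_stop_words("stop_words.txt")
print(STOP_WORDS)  # {'het', 'om'} (set order may vary)
```

Keeping the words in a set makes the later `word not in STOP_WORDS` checks O(1) regardless of how long the list grows.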
On the other hand, if you care about order, you can simply avoid adding the stop words to the list in the first place:
words_in_order = []
for item in g_data:
    words_from_span = item.text.split()
    # You might want to break this out into its own function for modularity.
    for word in words_from_span:
        if word not in STOP_WORDS:
            words_in_order.append(word)
print(words_in_order)
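The same order-preserving filter can also be written as a single list comprehension. The sample titles below are made-up stand-ins for the scraped `g_data` spans:

```python
STOP_WORDS = {'het', 'om'}  # the sample stop words from the question

# Hypothetical headline texts standing in for item.text of each span.
titles = ["Nieuws om het weer", "Sport vandaag"]

words_in_order = [word
                  for title in titles
                  for word in title.split()
                  if word not in STOP_WORDS]
print(words_in_order)  # ['Nieuws', 'weer', 'Sport', 'vandaag']
```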
If you don't care about order but you do want frequencies, you can count the words in a dictionary (or a defaultdict, for convenience):
from collections import defaultdict

word_counts = defaultdict(int)
for item in g_data:
    # You might want to break this out into its own function for modularity.
    for word in item.text.split():
        if word not in STOP_WORDS:
            word_counts[word] += 1
for word, count in word_counts.items():
    print('%s: %d' % (word, count))
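For counting, the standard library's `collections.Counter` does the same job as the defaultdict and adds `most_common()` for output sorted by frequency. A sketch, again with made-up sample titles in place of the scraped spans:

```python
from collections import Counter

STOP_WORDS = {'het', 'om'}  # the sample stop words from the question

# Hypothetical headline texts standing in for item.text of each span.
titles = ["het weer om weer", "weer nieuws"]

counts = Counter(word
                 for title in titles
                 for word in title.split()
                 if word not in STOP_WORDS)

for word, count in counts.most_common():
    print('%s: %d' % (word, count))
# weer: 3
# nieuws: 1
```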