Use OR in Lambda function - Web Scraping Python
Using this example -
I wrote a web scraping script to look for keywords in the latest version and in archived versions of a local newspaper.
from bs4 import BeautifulSoup
import requests

# Live front page plus archived snapshots, one per quarter
urls = ["https://www.marinij.com/", 'https://web.archive.org/web/20210811185035/https://www.marinij.com/',
        'https://web.archive.org/web/20210506004633/https://www.marinij.com/','https://web.archive.org/web/20210211022431/https://www.marinij.com/',
        'https://web.archive.org/web/20201111174202/https://www.marinij.com/','https://web.archive.org/web/20200811204359/https://www.marinij.com/',
        'https://web.archive.org/web/20200511165943/https://www.marinij.com/','https://web.archive.org/web/20200209014056/https://www.marinij.com/',
        'https://web.archive.org/web/20191111061843/https://www.marinij.com/']
dates = ['today','aug2021','may2021','feb2021','nov2020','aug2020','may2020','feb2020','nov2019']

for i, (url,date) in enumerate(zip(urls,dates)):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # Match <a> tags that have an href and whose text mentions a keyword
    covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                               'href' in tag.attrs and
                               ('corona' or 'covid') in tag.get_text().lower())
    results = soup.find_all(covid_links)
    num_art = str((len(results)))
    if not results:
        results = ["The term COVID did not appear in the headlines this quarter!\n"]
    # Write one text file per snapshot
    textfile = open("marin_covid_" + date + ".txt", "w")
    for idx, element in enumerate(results):
        element = str(element)
        # print(element)
        if idx == 0:
            textfile.write(date + "\n" + "Number of articles = " + num_art + "\n" + "\n" + element + "\n")
        else:
            textfile.write(element + "\n" + "\n")
    textfile.close()

# Concatenate the per-snapshot files into a single report
files = ['marin_covid_today.txt', 'marin_covid_aug2021.txt', 'marin_covid_may2021.txt', 'marin_covid_feb2021.txt', 'marin_covid_nov2020.txt',
         'marin_covid_aug2020.txt', 'marin_covid_may2020.txt', 'marin_covid_feb2020.txt']
with open("COVID_articles_in_MIJ.txt", "w") as outfile:
    for filename in files:
        print(filename)
        with open(filename) as infile:
            contents = infile.read()
            outfile.write(contents)
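It works great with only 1 keyword, but when I try to use "or" to look for 1 or more keywords, it only searches for the first term. This can be reproduced by swapping the 2 keywords in the example, "covid" and "corona".
I know the problem is in this lambda function, but I'm not sure how to fix it.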
covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           ('corona' or 'covid') in tag.get_text().lower())
This code should be fully executable if you have the prerequisites installed. Any help is appreciated.
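As pointed out in the comments, the problem is that the 'in' operator has to appear on both sides of the 'or' operator so that each keyword is actually tested against the text. The expression ('corona' or 'covid') is evaluated first, and because 'or' returns its first truthy operand, it always evaluates to 'corona'; 'covid' is never checked. Writing the membership test out twice lets tag.get_text().lower() be checked against both 'corona' and 'covid'. The correct lambda function is: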
covid_links = lambda tag: (getattr(tag, 'name', None) == 'a' and
                           'href' in tag.attrs and
                           ('covid' in tag.get_text().lower() or 'corona' in tag.get_text().lower()))
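To see why the original expression only ever looked for the first keyword, and how the fix might generalize to a longer keyword list, here is a minimal sketch that runs without any network access. The helper name contains_any and the sample headline are made up for illustration and are not part of the original script.

```python
# `or` returns its first truthy operand, and any non-empty string is truthy,
# so the parenthesized expression collapses to the first keyword.
first_operand = ('corona' or 'covid')
print(first_operand)  # corona


def contains_any(text, keywords=("covid", "corona")):
    """Return True if any keyword occurs in text, ignoring case.

    Hypothetical helper for illustration; not part of the original script.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# Made-up headline, used only to show the difference in behavior.
headline = "County reports new COVID cases"

# Original form: equivalent to `'corona' in headline.lower()`.
print(('corona' or 'covid') in headline.lower())  # False

# Each keyword gets its own membership test.
print(contains_any(headline))  # True
```

If this approach suits the script, the last condition of the lambda could become contains_any(tag.get_text()), which also calls get_text() only once per tag and makes adding a third keyword a one-word change.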