Traversing Links using Scrapy
I'm having a strange problem with Scrapy. I followed the tutorial for traversing links, but for some reason nothing happens.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup
import pandas as pd
from time import strftime

class Covid_Crawler(scrapy.Spider):
    name = "Covid_Crawler"
    allowed_domains = ['worldometers.info/coronavirus/']
    start_urls = ['https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/']

    def parse(self, response):
        count = 0
        soup = BeautifulSoup(response.text, "lxml")
        try:
            covid_table = soup.find('table')
            df = pd.read_html(str(covid_table))[0]
            print(df)
            df.to_csv("CovidFile.csv", index=False)
        except:
            print("Table not found")

        NEXT_PAGE_SELECTOR = 'a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).getall()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
For some reason, when I run this spider it scrapes the table from the first page just fine, but it won't follow any of the other links. When I run it, I get something like this:
2020-12-12 20:45:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/> (referer: None)
2020-12-12 20:45:15 [numexpr.utils] INFO: NumExpr defaulting to 6 threads.
Country Cases Deaths Region
0 United States 16549366 305082 North America
1 India 9857380 143055 Asia
2 Brazil 6880595 181143 South America
3 Russia 2625848 46453 Europe
4 France 2365319 57761 Europe
.. ... ... ... ...
214 MS Zaandam 9 2 NaN
215 Marshall Islands 4 0 Australia/Oceania
216 Wallis & Futuna 3 0 Australia/Oceania
217 Samoa 2 0 Australia/Oceania
218 Vanuatu 1 0 Australia/Oceania
[219 rows x 4 columns]
2020-12-12 20:45:15 [scrapy.core.scraper] ERROR: Spider error processing <GET
https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/> (referer: None)
Traceback (most recent call last):
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Zach Kunz\Documents\Crawler_Test\Covid_Crawler\Covid_Crawler\spiders\Crawler_spider.py", line 84, in parse
yield response.follow(next_page, callback=self.parse)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\http\response\text.py", line 169, in follow
return super().follow(
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\http\response\__init__.py", line 143, in follow
url = self.urljoin(url)
File "C:\Users\Zach Kunz\anaconda3\lib\site-packages\scrapy\http\response\text.py", line 102, in urljoin
return urljoin(get_base_url(self), url)
File "C:\Users\Zach Kunz\anaconda3\lib\urllib\parse.py", line 512, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "C:\Users\Zach Kunz\anaconda3\lib\urllib\parse.py", line 121, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2020-12-12 20:45:15 [scrapy.core.engine] INFO: Closing spider (finished)
And when I use the scrapy shell to check whether it's picking up the links, I get this:
In [6]: response.css('a::attr(href)').getall()
Out[6]:
['/',
'/coronavirus/',
'/population/',
'/coronavirus/',
'/coronavirus/',
'/coronavirus/coronavirus-cases/',
'/coronavirus/worldwide-graphs/',
'/coronavirus/#countries',
'/coronavirus/coronavirus-death-rate/',
'/coronavirus/coronavirus-incubation-period/',
'/coronavirus/coronavirus-age-sex-demographics/',
'/coronavirus/coronavirus-symptoms/',
'/coronavirus/',
'/coronavirus/coronavirus-death-toll/',
'/coronavirus/#countries',
'/coronavirus/',
'/coronavirus/coronavirus-cases/',
'/coronavirus/coronavirus-death-toll/',
'/coronavirus/coronavirus-death-rate/',
'/coronavirus/coronavirus-incubation-period/',
'/coronavirus/coronavirus-age-sex-demographics/',
'/coronavirus/coronavirus-symptoms/',
'/coronavirus/countries-where-coronavirus-has-spread/',
'/coronavirus/#countries',
'/',
'/about/',
'/faq/',
'/languages/',
'/contact/',
'/newsletter-subscribe/',
'https://twitter.com/Worldometers',
'https://www.facebook.com/Worldometers.info',
'/disclaimer/']
Any help or insight would be greatly appreciated. If you're willing to help with another issue as well, I'm looking for a way to store all the tables I collect into multiple csv or xlsx files. Thanks!
response.follow() can't take a list. You need to pass it a single string URL:
next_pages = response.css(NEXT_PAGE_SELECTOR).getall()
for next_page in next_pages:
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
Alternatively, you can use yield from response.follow_all(next_pages), which does the same thing as what gangabass posted.
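For reference, here is a minimal sketch of how the parse method could look using follow_all, assuming the same spider structure, selector, and CSV export as in the question (allowed_domains is left out to keep the sketch short):

import scrapy
import pandas as pd
from bs4 import BeautifulSoup

class Covid_Crawler(scrapy.Spider):
    name = "Covid_Crawler"
    start_urls = ['https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/']

    def parse(self, response):
        # Parse the first HTML table on the page into a DataFrame and save it.
        soup = BeautifulSoup(response.text, "lxml")
        covid_table = soup.find('table')
        if covid_table is not None:
            df = pd.read_html(str(covid_table))[0]
            df.to_csv("CovidFile.csv", index=False)
        else:
            print("Table not found")

        # follow_all accepts the whole list of hrefs and yields one Request
        # per URL, so there is no need to loop and call follow() yourself.
        # Note this follows every <a> on the page, including external links.
        next_pages = response.css('a::attr(href)').getall()
        yield from response.follow_all(next_pages, callback=self.parse)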