Scrapy is returning content from a different webpage
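I am trying to scrape fight data from Tapology.com, but the content Scrapy retrieves comes from a completely different webpage. For example, I want to extract the fighter names from a bout page, so I open the Scrapy shell with the following command: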
scrapy shell 'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii'
Then I try to extract the fighter names with the following code:
response.css('.fighterNames ::text').getall()
This is the result I get:
['\n',
'\n',
'\n',
'Billy Ayash',
'\n',
'\n',
'\n',
'Dennis Reed',
'\n',
'\n',
'\n',
'\n',
'"The Punisher"',
'\n',
'\n',
'\n']
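As a side note, most of the entries in that list are whitespace-only text nodes, which makes the wrong names harder to spot. A minimal sketch of a filter follows; `clean_text_nodes` is a hypothetical helper name, not part of Scrapy, and the sample list is copied from the result above.

```python
def clean_text_nodes(nodes):
    """Strip each text node and drop the ones that contain only whitespace."""
    return [text.strip() for text in nodes if text.strip()]


# Sample data copied from the shell result above.
raw = ['\n', '\n', '\n', 'Billy Ayash', '\n', '\n', '\n', 'Dennis Reed',
       '\n', '\n', '\n', '\n', '"The Punisher"', '\n', '\n', '\n']

print(clean_text_nodes(raw))
```

In the shell, the same helper could be applied directly to `response.css('.fighterNames ::text').getall()`.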
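As you can see on the webpage if you inspect the HTML, the names returned should be 'Robbie Lawler' and 'Rory MacDonald'. Even stranger, every time I test this page in the shell, Scrapy returns different content. It does not always return the content of the Billy Ayash vs. Dennis Reed bout page.
Is there a problem with Scrapy? Is there a problem with Tapology.com? Any help would be appreciated! I used Scrapy on ufcstats.com both before and after this test without any problems.
The full shell session is below: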
(base) davidwismer@Davids-MacBook-Pro ~ % scrapy shell 'https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii'
2021-03-03 17:18:03 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: scrapybot)
2021-03-03 17:18:03 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep 4 2020, 02:22:02) - [Clang 10.0.0 ], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform macOS-10.15.7-x86_64-i386-64bit
2021-03-03 17:18:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-03 17:18:03 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0}
2021-03-03 17:18:03 [scrapy.extensions.telnet] INFO: Telnet Password: b44d20b5d1bbeb73
2021-03-03 17:18:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-03 17:18:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-03 17:18:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-03 17:18:04 [scrapy.core.engine] INFO: Spider opened
2021-03-03 17:18:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii> (referer: None)
2021-03-03 17:18:05 [asyncio] DEBUG: Using selector: KqueueSelector
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fc4d97c5730>
[s] item {}
[s] request <GET https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii>
[s] response <200 https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii>
[s] settings <scrapy.settings.Settings object at 0x7fc4d97c5e50>
[s] spider <DefaultSpider 'default' at 0x7fc4d9e26100>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
2021-03-03 17:18:05 [asyncio] DEBUG: Using selector: KqueueSelector
In [1]: response.css('.fighterNames ::text').getall()
Out[1]:
['\n',
'\n',
'\n',
'Billy Ayash',
'\n',
'\n',
'\n',
'Dennis Reed',
'\n',
'\n',
'\n',
'\n',
'"The Punisher"',
'\n',
'\n',
'\n']
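I tested it with requests + BeautifulSoup4 and got the same results.
However, when I set the User-Agent header to something else (in the example below, a value taken from my web browser), I got valid results. Here is the code: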
from requests import get
from bs4 import BeautifulSoup


def get_names(with_user_agent: bool):
    # Send a browser-like User-Agent only when requested;
    # otherwise fall back to the default headers used by requests.
    if with_user_agent:
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}
    else:
        headers = {}
    r = get('https://www.tapology.com/fightcenter/bouts/184425-ufc-189-ruthless-robbie-lawler-vs-rory-red-king-macdonald-ii', headers=headers)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, features='html.parser')
    # Each fighter name and nickname sits in a span under .fighterNames.
    names = soup.select('.fighterNames span')
    print('Names:')
    for n in names:
        print(n.text.strip())
    print('---')


if __name__ == '__main__':
    # Fetch the same URL three times per mode to show whether the result is stable.
    print('Without user agent:')
    for i in range(3):
        get_names(False)
    print('\nWith user agent:')
    for i in range(3):
        get_names(True)
Output:
Without user agent:
Names:
Jared Downing
Danny Tims
"Demon Eyes"
---
Names:
Allen Hope
Mike Kent
"Bunzy"
---
Names:
Paweł Sikora
Patryk Domke
"Ponczek"
"Patrykos"
---
With user agent:
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---
Names:
Robbie Lawler
Rory MacDonald
"Ruthless"
"Red King"
---