Scrapy - Does urlparse.urljoin behave in the same way as str.join?
I am trying to use urlparse.urljoin within a Scrapy spider to compile a list of URLs to scrape. At the moment my spider returns nothing, but doesn't throw any errors either, so I am trying to check whether I am compiling the URLs correctly.
My attempt was to test this in IDLE using str.join, as shown below:
>>> href = ['lphs.asp?id=598&city=london',
            'lphs.asp?id=480&city=london',
            'lphs.asp?id=1808&city=london',
            'lphs.asp?id=1662&city=london',
            'lphs.asp?id=502&city=london',]
>>> for x in href:
        base = "http:/www.url-base.com/destination/"
        final_url = str.join(base, x)
        print(final_url)
One line of what that returns:
lhttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/hhttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/.http:/www.url-base.com/destination/ahttp:/www.url-base.com/destination/shttp:/www.url-base.com/destination/phttp:/www.url-base.com/destination/?http:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/5http:/www.url-base.com/destination/9http:/www.url-base.com/destination/8http:/www.url-base.com/destination/&http:/www.url-base.com/destination/chttp:/www.url-base.com/destination/ihttp:/www.url-base.com/destination/thttp:/www.url-base.com/destination/yhttp:/www.url-base.com/destination/=http:/www.url-base.com/destination/lhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/nhttp:/www.url-base.com/destination/dhttp:/www.url-base.com/destination/ohttp:/www.url-base.com/destination/n
I think it is obvious from my example that str.join does not behave in the same way - if that's the case, then that is why my spider isn't following these links! - however it would be good to have confirmation.
If this is not the right way to test it, how can I test this process?
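For reference, str.join(sep, s) is just the unbound form of sep.join(s): when s is a single string, the separator is inserted between every one of its characters, which is exactly what produced the garbled line above. A minimal interpreter check (illustrative, not part of the original session):

>>> base = "http:/www.url-base.com/destination/"
>>> str.join(base, "ab") == base.join("ab")
True
>>> base.join("ab")
'ahttp:/www.url-base.com/destination/b'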
Update
Attempting to use urlparse.urljoin as below:
>>> from urllib.parse import urlparse
>>> for x in href:
        base = "http:/www.url-base.com/destination/"
        final_url = urlparse.urljoin(base, x)
        print(final_url)
This throws AttributeError: 'function' object has no attribute 'urljoin'.
Update - the spider function in question
def parse_links(self, response):
    room_links = response.xpath('//form/table/tr/td/table//a[div]/@href').extract() # insert xpath which contains the href for the rooms
    for link in room_links:
        base_url = "http://www.example.com/followthrough"
        final_url = urlparse.urljoin(base_url, link)
        print(final_url)
        # This is not joining the final_url right
        yield Request(final_url, callback=parse_links)
Update
I have just tested this again in IDLE:
>>> from urllib.parse import urljoin
>>> from urllib import parse
>>> room_links = ['lphs.asp?id=562&city=london',
                  'lphs.asp?id=1706&city=london',
                  'lphs.asp?id=1826&city=london',
                  'lphs.asp?id=541&city=london',
                  'lphs.asp?id=1672&city=london',
                  'lphs.asp?id=509&city=london',
                  'lphs.asp?id=428&city=london',
                  'lphs.asp?id=614&city=london',
                  'lphs.asp?id=336&city=london',
                  'lphs.asp?id=412&city=london',
                  'lphs.asp?id=611&city=london',]
>>> for link in room_links:
        base_url = "http:/www.url-base.com/destination/"
        final_url = urlparse.urljoin(base_url, link)
        print(final_url)
Which threw this:
Traceback (most recent call last):
  File "<pyshell#34>", line 3, in <module>
    final_url = urlparse.urljoin(base_url, link)
AttributeError: 'function' object has no attribute 'urljoin'
Answer
The output you see is because of this:
for x in href:
    base = "http:/www.url-base.com/destination/"
    final_url = str.join(base, href) # <-- 'x' instead of 'href' probably intended here
    print(final_url)
urljoin from the urllib library behaves differently - see the documentation. It is not simple string concatenation.
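For example, urljoin resolves the second argument against the first the way a browser resolves a relative link. An illustrative session follows; note that urljoin will not repair the missing second slash in the "http:/..." base used in the interpreter sessions above, so a well-formed base is used here:

>>> from urllib.parse import urljoin
>>> urljoin("http://www.url-base.com/destination/", "lphs.asp?id=598&city=london")
'http://www.url-base.com/destination/lphs.asp?id=598&city=london'
>>> urljoin("http://www.url-base.com/destination/index.asp", "lphs.asp?id=598&city=london")
'http://www.url-base.com/destination/lphs.asp?id=598&city=london'
>>> urljoin("http://www.url-base.com/destination/", "/lphs.asp?id=598&city=london")
'http://www.url-base.com/lphs.asp?id=598&city=london'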
Edit:
Based on your comments, I suppose you are using Python 3. With that import statement you imported the urlparse function; that is why you get that error. Either import and use the function directly:
from urllib.parse import urljoin
...
final_url = urljoin(base, x)
or import the parse module and use the function from it:
from urllib import parse
...
final_url = parse.urljoin(base, x)
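Applied to the spider function from the question, a minimal sketch could look like the following (assumptions: the method lives inside the spider class, Request is scrapy.Request, and the callback should refer to the bound method self.parse_links; the XPath and base URL are taken from the question). Scrapy responses also offer response.urljoin(link), which joins against response.url, so that is an alternative when the page's own URL is the right base.

from urllib.parse import urljoin
from scrapy import Request

def parse_links(self, response):
    # XPath from the question that selects the room hrefs
    room_links = response.xpath('//form/table/tr/td/table//a[div]/@href').extract()
    for link in room_links:
        base_url = "http://www.example.com/followthrough"
        # resolve the relative href against the base, e.g.
        # urljoin(base_url, 'lphs.asp?id=598&city=london')
        final_url = urljoin(base_url, link)   # or: final_url = response.urljoin(link)
        yield Request(final_url, callback=self.parse_links)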