Extracting href URL with Python Requests
I want to extract a URL from an XPath using the requests package in Python. I can get the text, but nothing I have tried returns the URL. Can anyone help?
ipdb> webpage.xpath(xpath_url + '/text()')
['Text of the URL']
ipdb> webpage.xpath(xpath_url + '/a()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/href()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/url()')
*** lxml.etree.XPathEvalError: Invalid expression
I used this tutorial to get started: http://docs.python-guide.org/en/latest/scenarios/scrape/
It seems like this should be easy, but my searching hasn't turned up anything.
Thanks.
You might be better off using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

response = requests.get('http://testurl.com')
soup = BeautifulSoup(response.text, "lxml")  # lxml is just the parser used to read the HTML
links = soup.find_all('a', href=True)        # this is the line that does what you want: every <a> tag with an href
You can print the result, append it to a list, and so on. To iterate over it, use:
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])  # print just the href value of each anchor
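If you only want the URL strings rather than the full tags, a minimal sketch (reusing the soup object above, with a hypothetical base_url for resolving relative links) could collect them into a list:

from urllib.parse import urljoin

base_url = 'http://testurl.com'  # hypothetical base URL; replace with the page you actually fetched
hrefs = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
print(hrefs)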
Have you tried webpage.xpath(xpath_url + '/@href')?
Here is the full code:
from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
webpage = html.fromstring(page.content)  # parse the HTML into an lxml element tree
webpage.xpath('//a/@href')               # select the href attribute of every <a> element
The result should be:
[
'http://econpy.pythonanywhere.com/ex/002.html',
'http://econpy.pythonanywhere.com/ex/003.html',
'http://econpy.pythonanywhere.com/ex/004.html',
'http://econpy.pythonanywhere.com/ex/005.html'
]
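Since the question mentions already being able to get the text, a small sketch (reusing the same webpage element tree) that pairs each anchor's text with its href might look like this:

# Iterate over the <a> elements themselves so the text and the href
# attribute can be read from the same node.
for a in webpage.xpath('//a'):
    print(a.text_content(), a.get('href'))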
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.***.com')
r.html.links  # set of all links found on the page
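If you need fully qualified URLs instead of the raw href values, requests-html also provides absolute_links; a minimal sketch, assuming the same r response as above:

# absolute_links resolves relative hrefs against the page URL
for url in r.html.absolute_links:
    print(url)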
With the added benefit of a context manager:
import requests_html

with requests_html.HTMLSession() as s:
    try:
        r = s.get('http://econpy.pythonanywhere.com/ex/001.html')
        links = r.html.links
        for link in links:
            print(link)
    except Exception:
        pass  # swallow request/parse errors; handle or log as needed
You can do this easily with selenium:
link = webpage.find_element_by_xpath(xpath_url)  # webpage is a selenium WebDriver; xpath_url points at the link element
url = link.get_attribute('href')
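For completeness, a minimal sketch of the surrounding setup, assuming Chrome and a hypothetical xpath_url (newer Selenium releases prefer find_element(By.XPATH, ...) over the older find_element_by_xpath helper):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a local chromedriver is installed
driver.get('http://econpy.pythonanywhere.com/ex/001.html')

xpath_url = '//a[1]'  # hypothetical XPath pointing at the link element
link = driver.find_element(By.XPATH, xpath_url)
print(link.get_attribute('href'))

driver.quit()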