How to extract or scrape data from an HTML page when the data is stored in the element itself
I currently use lxml to parse the HTML document and get data from HTML elements.
But there is a new challenge: a rating is stored as data inside the HTML element itself.
https://i.stack.imgur.com/bwGle.png
<p data-rating="3">
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
<span class="glyphicon glyphicon-star xh-highlight"></span>
</p>
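It is easy to extract the text between tags, but I have no idea how to get a value that sits inside the tag.
What do you suggest?
The challenge: I want to extract "3".
URL: https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
Br,
Gabriel.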
Try the script below:
import re

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all("div", {"class": "ratings"}):
    # iterate over the direct children of each ratings <div>
    for h in tag.children:
        # convert the child node to its HTML string
        s = str(h)
        # look for the data-rating attribute and capture its numeric value
        m = re.search(r'data-rating="(\d+)"', s)
        # skip children that carry no data-rating attribute
        if m:
            # print only the captured rating value
            print(m.group(1))
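Since data-rating is an ordinary attribute, BeautifulSoup can also read it directly, without a regex. The following is only a minimal sketch: the inline SAMPLE_HTML string (modelled on the snippet in the question, with an assumed div wrapper) and the helper name extract_ratings are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Inline copy of markup shaped like the question's snippet, so the sketch
# runs offline. The surrounding <div class="ratings"> is an assumption.
SAMPLE_HTML = """
<div class="ratings">
  <p data-rating="3">
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
  </p>
</div>
"""


def extract_ratings(html):
    """Return the data-rating value of every <p> that has one, as ints."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={"data-rating": True} keeps only <p> tags that carry the attribute.
    tags = soup.find_all("p", attrs={"data-rating": True})
    # Attribute values are read like dictionary entries and arrive as strings.
    return [int(p["data-rating"]) for p in tags]


print(extract_ratings(SAMPLE_HTML))
```

To run this against the live page, the text returned by requests.get(BASE_URL) could be passed to the helper instead of SAMPLE_HTML, assuming the page still uses the same markup.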
If I understand your question and comments correctly, the following should extract all the ratings on that page:
import lxml.html
import requests
BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
html = requests.get(BASE_URL)
root = lxml.html.fromstring(html.text)
targets = root.xpath('//p[./span[@class]]/@data-rating')
For example:
targets[0]
Output:
3
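Note that the XPath query returns attribute values as strings. As a rough offline sketch of the same idea, the code below parses an inline sample and converts the values to integers; SAMPLE_HTML and the helper name ratings_from_html are made up for illustration, and the XPath is a slightly simpler variant that filters on the attribute itself.

```python
import lxml.html

# Hypothetical sample markup; the second and third <p> are only there to show
# that elements without the attribute are skipped.
SAMPLE_HTML = """
<div>
  <p data-rating="3"><span class="glyphicon glyphicon-star"></span></p>
  <p data-rating="1"><span class="glyphicon glyphicon-star"></span></p>
  <p>no rating here</p>
</div>
"""


def ratings_from_html(html):
    """Return every data-rating attribute value in the document as an int."""
    root = lxml.html.fromstring(html)
    # //p[@data-rating] keeps <p> elements that have the attribute;
    # the trailing /@data-rating selects the attribute value itself.
    values = root.xpath("//p[@data-rating]/@data-rating")
    return [int(value) for value in values]


root = lxml.html.fromstring(SAMPLE_HTML)
# Element.get() reads one attribute from one element (None if it is absent).
first_rating = root.xpath("//p")[0].get("data-rating")

print(ratings_from_html(SAMPLE_HTML))
print(first_rating)
```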