How to get text inside source code with BeautifulSoup
I am trying to scrape this page:
If you view the page source and search for the term "Tamanho" (including the quotes), you should find something like this:
<script>var SKUsCorTamanho = {"34": {"ProdutoId":"330685",
"Codigo":"195241995304",
"Tamanho":"34","PrecoDe":"R$ 0,00",
"PrecoPor":"R$ 699,99",
"PrecoPorSemPromocao":"R$ 699,99",
"ValorParcela":"R$ 58,33",
"ParcelamentoMaximmo":"12","PreVenda":"0","DtLancto":"15\/06\/2021"
}}</script>
How can I get only the size with BeautifulSoup?
import requests
from bs4 import BeautifulSoup

request = requests.get("https://www.nike.com.br/air-max-pre-day-153-169-211-330676")
soup = BeautifulSoup(request.text, "html.parser")
tamanho = soup.find_all(?)  # what should go here?
print(tamanho)
# Result I want in the script output:
# Tamanho 34 or 34
For example, I need this to return the size from the JSON shown at the start of my question. How can I do that?
Find the <script> tag as usual, then parse it with re. This is not the best approach, because re does not understand JS, but it should get the job done.
import requests, bs4, re
a = requests.get("https://www.nike.com.br/air-max-pre-day-153-169-211-330676")
b = bs4.BeautifulSoup(a.text, "html.parser")
d = next(c.text for c in b.find_all('script') if 'Tamanho' in c.text)
size = list(map(lambda i: re.sub('[^0-9,]', '', i), re.findall(r'"Tamanho":"[^"]*"', d)))
print(size)
Output:
['34', '34,5', '35', '35,5', '36', '37', '37,5', '38', '39', '39,5', '40', '40,5', '41', '42', '42,5', '43', '43,5', '44', '45', '46', '47', '48']
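The same idea can be tried offline. The following is a minimal sketch, not the answerer's code: `SAMPLE_HTML` is a hypothetical, trimmed stand-in modeled on the excerpt in the question, and `sizes_from_html` is an illustrative helper name. It uses a capturing group so that no separate `re.sub` cleanup step is needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the question's excerpt; the live page may differ.
SAMPLE_HTML = """
<html><body>
<script>var other = 1;</script>
<script>var SKUsCorTamanho = {"34": {"ProdutoId": "330685", "Tamanho": "34"},
"34,5": {"ProdutoId": "330686", "Tamanho": "34,5"}};</script>
</body></html>
"""

def sizes_from_html(html):
    """Return every "Tamanho" value found inside <script> tags."""
    soup = BeautifulSoup(html, "html.parser")
    sizes = []
    for script in soup.find_all("script"):
        # script.string is None for empty or external scripts, so fall back to "".
        text = script.string or ""
        # The capturing group keeps only the value between the quotes.
        sizes.extend(re.findall(r'"Tamanho"\s*:\s*"([^"]*)"', text))
    return sizes

print(sizes_from_html(SAMPLE_HTML))
```

To run it against the real page, pass `requests.get(url).text` instead of `SAMPLE_HTML`, assuming the page still embeds the data in this form.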
This code was tested on Python 3.9:
from bs4 import BeautifulSoup
import requests
import json

request = requests.get("https://www.nike.com.br/air-max-pre-day-153-169-211-330676")
soup = BeautifulSoup(request.text, "html.parser")
# The tenth <script> tag held the SKU data when this was written; the index may change.
script = soup.find_all('script')[9].string
# Strip the JavaScript assignment so that only the JSON object remains.
script = script[len('var SKUsCorTamanho = '):]
variables = json.loads(script)
# Take the "Tamanho" field of the first SKU entry.
tamanho = next(iter(variables.values()))['Tamanho']
print("Tamanho : ", tamanho)
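A hard-coded script index and a fixed-length prefix are both fragile. As a rough sketch of one alternative, the helper below (`parse_js_object`, a hypothetical name) looks up the assignment by variable name and lets `json.JSONDecoder.raw_decode` stop at the end of the object, so a trailing `;` does not matter. `SCRIPT_TEXT` is an invented two-size sample, and the approach only works when the JavaScript literal happens to be valid JSON.

```python
import json
import re

# Hypothetical script body modeled on the question's excerpt (trimmed to two sizes).
SCRIPT_TEXT = (
    'var SKUsCorTamanho = {"34": {"ProdutoId": "330685", "Tamanho": "34"}, '
    '"35": {"ProdutoId": "330687", "Tamanho": "35"}};'
)

def parse_js_object(script_text, var_name):
    """Decode the JSON value assigned to `var_name` in a script body."""
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", script_text)
    if match is None:
        raise ValueError(f"{var_name} not found in script")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (for example the closing ';').
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj

skus = parse_js_object(SCRIPT_TEXT, "SKUsCorTamanho")
# Collect the size of every SKU rather than only the first one.
sizes = [sku["Tamanho"] for sku in skus.values()]
print(sizes)
```

With BeautifulSoup, the script body could be selected with something like `next(s.string for s in soup.find_all("script") if s.string and "SKUsCorTamanho" in s.string)` before being passed to the helper.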
You can simplify this by reducing the imports. Just use re.findall on the response text:
import requests, re
r = requests.get('https://www.nike.com.br/air-max-pre-day-153-169-211-330676').text
sizes = re.findall(r'"Tamanho":"(.*?)"', r)
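If numeric sizes are more convenient than strings, the matches can be post-processed. This is an optional sketch under assumptions: `sizes_as_numbers` is a hypothetical helper, `sample` is made-up input, and the decimal-comma format ("34,5") is taken from the output shown in the earlier answer.

```python
import re

def sizes_as_numbers(page_text):
    """Return the unique "Tamanho" values as floats, in first-seen order."""
    raw = re.findall(r'"Tamanho":"(.*?)"', page_text)
    # dict.fromkeys drops repeated values while keeping their original order.
    unique = list(dict.fromkeys(raw))
    # Sizes use a decimal comma ("34,5"), so swap it for a dot before converting.
    return [float(value.replace(",", ".")) for value in unique]

sample = '{"Tamanho":"34"},{"Tamanho":"34,5"},{"Tamanho":"34"}'
print(sizes_as_numbers(sample))
```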