Python requests: splitting certain data gives mismatched values
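I'm trying to scrape data from a website, but for some URLs the make and model come out wrong.
For a Honda Civic I get:
make = honda
model = civic
For a Land Rover I get:
make = land
model = rover
where it should be:
make = landrover
model = rangerover
Here is what I tried: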
scala.txt:
https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208
https://www.redbook.com.au/cars/details/2019-holden-astra-rs-black-edition-bk-auto-my19/SPOT-ITM-524534
http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-ed4-pure-tech-manual-my15/SPOT-ITM-410126
http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-sd4-pure-tech-auto-4x4-my15/SPOT-ITM-410136
import requests
from lxml import html

cars = []
with open('scala.txt') as f:
    urls = f.read().splitlines()

for url in urls:
    car_data = {}
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    car_data['url'] = url
    # xpath() returns a list; it is empty when the title is missing
    titles = tree.xpath('//h1[@class="details-title"]/text()')
    if titles:
        full_car_name = titles[0]
        car_data['naming'] = full_car_name
        print(full_car_name)
        car_data['id'] = url.split("SPOT-ITM-")[1].replace("/", "")
        # Splitting on spaces assumes a one-word make and a one-word model
        car_data['year'] = full_car_name.split(" ")[0]
        car_data['make'] = full_car_name.split(" ")[1]
        car_data['model'] = full_car_name.split(" ")[2]
    cars.append(car_data)
The first two are fine, but when it gets to the third URL the make and model each have more than one word.
Output:
{'id': '524208',
'make': 'Honda',
'model': 'Civic',
'naming': '2019 Honda Civic 50 Years Edition Auto MY19',
'url': 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208',
'year': '2019'}
{'id': '410136',
'make': 'Land',
'model': 'Rover',
'naming': '2014 Land Rover Range Rover Evoque SD4 Pure Tech Auto 4x4 MY15',
'url': 'http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-sd4-pure-tech-auto-4x4-my15/SPOT-ITM-410136',
'year': '2014'}
For the Land Rover, make should be Land Rover and model should be Range Rover.
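Try using try/except. Some elements don't have an img, so when the code tries to take image_url from index [0], there is nothing there. You are essentially asking for the first element of an empty list.
The skeleton of a try/except: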
try:
    <code to do something>
    <code>
    <more code>
    ...
except:
    <code to do something if the try fails/throws errors>
    ...
...
So for the image:
...
    car_data = {}
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    try:
        img_urls = tree.xpath('//div[@class="r-module"]/div[@class="csn-results"]/div[@class="content"]/a[@class="item"]//div[@class="photos"]//img/@src')
        # [0] raises IndexError when no <img> matches
        img_url = str(tree.xpath('//ul/li/a/img/@src')[0])
    except IndexError:
        img_url = 'N/A'
...
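If you'd rather not wrap the lookup in try/except at all, a small helper can do the same job. This is only a sketch: `first_or_default` is a name I made up, and the image paths are placeholders, not values from the site.

```python
def first_or_default(items, default="N/A"):
    """Return the first element of items, or default when items is empty."""
    return next(iter(items), default)


# xpath() returns a plain list, so ordinary lists stand in for it here.
print(first_or_default(["/img/car.jpg", "/img/other.jpg"]))  # first match
print(first_or_default([]))                                  # nothing matched
```

In the loop above it would be used as `img_url = first_or_default(tree.xpath('//ul/li/a/img/@src'))`.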
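Here is also some help fixing your json key:values.
The reason you get those results is that you are splitting on white space. In the text/content it is land rover range rover, not landrover rangerover. So when you split (leaving the year aside), it returns ['land', 'rover', 'range', 'rover']. You are grabbing the elements at index 0 and 1, which are 'land' and 'rover'.
Now if the text were 'landrover rangerover', you would get exactly what you wanted. It would split into ['landrover', 'rangerover'], so grabbing the elements at index 0 and 1 would work the way you want.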
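To see the split problem in isolation, and one possible way around it when only the title text is available, here is a rough sketch. `KNOWN_MAKES` and `split_name` are my own hypothetical names, the list of makes is nowhere near complete, and the function cannot tell where the model ends and the trim begins, so it returns everything after the make as one string.

```python
# Hypothetical, incomplete list of makes; multi-word makes must be listed.
KNOWN_MAKES = ["Land Rover", "Honda", "Holden"]


def split_name(full_car_name, known_makes=KNOWN_MAKES):
    """Split 'YEAR MAKE REST' into (year, make, rest) using known makes."""
    year, rest = full_car_name.split(" ", 1)
    # Try the longest makes first so a multi-word make wins over a prefix.
    for make in sorted(known_makes, key=len, reverse=True):
        if rest.lower().startswith(make.lower() + " "):
            return year, make, rest[len(make):].strip()
    # Fallback: assume a one-word make, as the original code did.
    make, _, remainder = rest.partition(" ")
    return year, make, remainder


name = "2014 Land Rover Range Rover Evoque SD4 Pure Tech Auto 4x4 MY15"
print(name.split(" ")[1:3])  # the naive split: ['Land', 'Rover']
print(split_name(name))
```

A lookup list like this needs maintaining, which is why reading the make and model from structured data on the page is usually the less fragile option.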
# Read make and model from the JSON the page embeds in its
# CsnInsights.metaData script instead of splitting the title.
import requests
from bs4 import BeautifulSoup as bs
import re
import json

cars = []
with open('scala.txt') as f:
    urls = f.read().splitlines()

for url in urls:
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    soup = bs(page.content, 'html.parser')
    # Find the <script> whose text assigns CsnInsights.metaData
    script = soup.find('script', string=re.compile("CsnInsights.metaData"))
    # Keep what follows the assignment and drop the trailing ';'
    jsonData = json.loads(script.string.split('CsnInsights.metaData = ')[-1].rsplit(';', 1)[0])
    make = jsonData['make']
    model = jsonData['model']
    car_id = jsonData['networkid'].rsplit('-', 1)[-1]
    # The heading text starts with the year, followed by the full name
    heading = soup.find('div', class_='heading').text
    naming = heading.split(' ', 1)[-1]
    year = heading.split(' ', 1)[0]
    car_data = {'id': car_id,
                'make': make,
                'model': model,
                'naming': naming,
                'url': url,
                'year': year}
    cars.append(car_data)
Output:
print(json.dumps(cars, indent=4))
[
{
"id": "524208",
"make": "Honda",
"model": "Civic",
"naming": "Honda Civic VTi-S Auto MY19",
"url": "https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208",
"year": "2019"
},
{
"id": "524534",
"make": "Holden",
"model": "Astra",
"naming": "Holden Astra RS BK Auto MY19",
"url": "https://www.redbook.com.au/cars/details/2019-holden-astra-rs-black-edition-bk-auto-my19/SPOT-ITM-524534",
"year": "2019"
},
{
"id": "410126",
"make": "Land Rover",
"model": "Range Rover Evoque",
"naming": "Land Rover Range Rover Evoque SD4 Pure Manual 4x4 MY14",
"url": "http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-ed4-pure-tech-manual-my15/SPOT-ITM-410126",
"year": "2014"
},
{
"id": "410136",
"make": "Land Rover",
"model": "Range Rover Evoque",
"naming": "Land Rover Range Rover Evoque SD4 Pure Tech Manual 4x4 MY15",
"url": "http://www.redbook.com.au/cars/research/used/details/2014-land-rover-range-rover-evoque-sd4-pure-tech-auto-4x4-my15/SPOT-ITM-410136",
"year": "2014"
}
]
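The split/rsplit on the script text works as long as the assignment is the last statement and ends with a single ';'. A slightly more forgiving variant, sketched below, lets the JSON decoder find the end of the object itself. `SAMPLE_SCRIPT` is a made-up stand-in for the page's script text (the real script likely has more keys), and `extract_metadata` is a hypothetical helper name.

```python
import json

# Made-up stand-in for the text of the page's <script> element.
SAMPLE_SCRIPT = (
    'var CsnInsights = {}; '
    'CsnInsights.metaData = {"make": "Land Rover", '
    '"model": "Range Rover Evoque", "networkid": "SPOT-ITM-410136"}; '
    'var other = 1;'
)


def extract_metadata(script_text, marker="CsnInsights.metaData"):
    """Return the object assigned to marker as a dict, or None if absent."""
    start = script_text.find(marker)
    if start == -1:
        return None
    brace = script_text.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value starting at brace and ignores
    # whatever follows it (the ';' and any later statements).
    obj, _end = json.JSONDecoder().raw_decode(script_text, brace)
    return obj


meta = extract_metadata(SAMPLE_SCRIPT)
print(meta["make"], "|", meta["model"], "|", meta["networkid"].rsplit("-", 1)[-1])
```

This only helps if the assigned value is valid JSON; a JavaScript object literal with unquoted keys would still fail to parse.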