Web scraping from different links and running a routine to get the data
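Sorry if this is a silly question; this is my first time writing scraping code. I have been trying to scrape the data from a page of tech items and save it, but I am having trouble getting it to work properly.
The code I wrote first builds all the link variants for a category (40 items per page), and up to that point it works fine.
The rest of the code fetches the information. Getting the first 40 items from the first link works fine, but when I try to iterate over the links it gets really messed up: the second part, which gets the data, does not work.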
    #https://www.youtube.com/watch?v=wLRNdCTXmnE
    import requests
    from bs4 import BeautifulSoup as bs
    import itertools
    import numpy as np

    pages=[]
    prices=[]
    ids=[]
    list_codigo=[]
    prices=[]
    url_collected=[]

    # Loop to go over all pages
    pages= np.arange(40,120,40)
    print(pages)

    # loop over the pages to build a list of links
    for page in pages:
        a='https://www.paris.cl/tecnologia/consolas-videojuegos/?start='
        b='&sz=40'
        c=str(page)
        page = a + c + b
        print(page)
        url_collected.append(page)
    print(url_collected)
    #https://www.paris.cl/tecnologia/consolas-videojuegos/?start=40&sza=40
    response=requests.get(page).text
    soup=bs(response,"html.parser")

    # web scraping the data of the links * not working so well
    for object in soup.find_all("div",class_='price-content'):
        final =object.find_all(class_="price__text")
        price =final[0].get('aria-label')
        print(price)
        prices.append(price)
    for object in soup.find_all("div",class_='onecolumn'):
        final2 =object.find_all(class_="product-tile")
        id1 =final2[0].get('data-itemid')
        list_codigo.append(id1)
        print(id1)

    # get data in an array, in a CSV-like format
    for n, v in zip(prices, list_codigo):
        print("{} , {}".format(n, v))
    # price = final[0].get('content')
    #prices.append(price)
Does anyone know what I am doing wrong?
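Don't scrape id, price, name, etc. separately. Some products may have 2 or 3 prices and others may be missing a value; a missing value is simply skipped, and zip() will later create wrong pairs.
It is better to first find all the products (all product-tile elements) and then run a for-loop that processes each product separately, searching for id, price and name within that single product-tile. If a product has many prices you can take just one, and if it has a missing value you can assign None or a default value.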
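The misalignment described above can be reproduced without any network access. The following is only a sketch: the three records are hypothetical, hand-written stand-ins for parsed product tiles (one with two prices, one with none), not data taken from the site.

```python
# Hypothetical records standing in for parsed product tiles.
products = [
    {"id": "A1", "prices": ["399.990 pesos", "419.990 pesos"]},  # two prices
    {"id": "B2", "prices": []},                                   # no price at all
    {"id": "C3", "prices": ["459.990 pesos"]},
]

# Approach 1: collect ids and prices in separate flat lists, then zip() them.
all_ids = [p["id"] for p in products]
all_prices = [price for p in products for price in p["prices"]]
zipped = list(zip(all_ids, all_prices))

# Approach 2: handle one product at a time and default to None when missing.
per_product = [
    (p["id"], p["prices"][0] if p["prices"] else None)
    for p in products
]

print(zipped)       # B2 is paired with A1's second price
print(per_product)  # B2 is paired with None
```

In the first approach the flat price list no longer lines up with the id list, so every pair after the first irregular product is suspect. In the second approach each pair is built from a single record, so an irregular product can only affect its own row.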
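Minimal working code.
I kept only the important elements.
Because the words product and products are so similar that it is easy to make a mistake, I use the prefix all_.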
    import requests
    from bs4 import BeautifulSoup as bs

    url = 'https://www.paris.cl/tecnologia/consolas-videojuegos/'

    params = {
        'start': 0,
        'sz': 40,
    }

    results = []

    # end is set to `121` so the last offset used is `120`;
    # with end `120` the loop would finish on `80`
    for offset in range(0, 121, 40):
        params['start'] = offset

        response = requests.get(url, params=params)
        print('url:', response.url)
        #print('status:', response.status_code)

        soup = bs(response.text, "html.parser")

        # find every product first, then search inside each one
        all_products = soup.find_all('div', {'class': 'product-tile'})

        for product in all_products:
            itemid = product.get('data-itemid')
            print('itemid:', itemid)

            data = product.get('data-product')
            print('data:', data)

            name = product.find('span', {'itemprop': 'name'}).text
            print('name:', name)

            # a product may have more than one price; keep only the first
            all_prices = product.find_all('div', {'class': 'price__text'})
            print('len(all_prices):', len(all_prices))
            price = all_prices[0].get('aria-label')
            print('price:', price)

            results.append( (itemid, name, price, data) )
            print('---')

    # ---

    # ... here you can save all `results` in file ...
Result:
url: https://www.paris.cl/tecnologia/consolas-videojuegos/?start=0&sz=40
itemid: CBELC349
data: {"id":"CBELC349","name":"Consola Nintendo Switch Neon + Switch Mario Kart 8 Deluxe","variant":"CBELC349","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"419990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"","dimension33":"010","dimension12":"","dimension18":"","dimension19":"469990","dimension20":"399990","dimension30":"Nintendo","dimension41":"4.8571","dimension42":14}
name: Consola Nintendo Switch Neon + Switch Mario Kart 8 Deluxe
len(all_prices): 2
price: 399.990 pesos
---
itemid: 259382999
data: {"id":"259382999","name":"Consola Nintendo Switch Neon ","variant":"259382999","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"369990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"010","dimension12":"CONSOLA PORTABLES","dimension18":"","dimension19":"399990","dimension20":"359990","dimension21":"True","dimension30":"Nintendo","dimension41":"4.644","dimension42":191}
name: Consola Nintendo Switch Neon
len(all_prices): 2
price: 359.990 pesos
---
itemid: 292147999
data: {"id":"292147999","name":"Nintendo Switch OLED + White Joy-Con","variant":"292147999","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"459990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"010","dimension12":"CONSOLA PORTABLES","dimension18":"","dimension19":"469990","dimension20":0,"dimension21":"True","dimension30":"Nintendo","dimension41":"4.9574","dimension42":47}
name: Nintendo Switch OLED + White Joy-Con
len(all_prices): 1
price: 459.990 pesos
---
itemid: 590573999
data: {"id":"590573999","name":"Consola Sony PS4 Slim 1TB Black","variant":"590573999","category":"Tecno/Consolas y VideoJuegos/Consolas PlayStation","brand":"Sony","price":"539990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"005","dimension12":"CONSOLA HOME","dimension18":"","dimension19":"549990","dimension20":"529990","dimension21":"True","dimension30":"Sony"}
name: Consola Sony PS4 Slim 1TB Black
len(all_prices): 2
price: 529.990 pesos
---
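The code above leaves saving `results` to a file open. One possible way, sketched here with the standard `csv` module, is to write one row per product. The `save_results` helper and the `products.csv` file name are assumptions made for illustration; the sample rows copy values from the output above, with the `data` string shortened.

```python
import csv

# Sample rows in the same (itemid, name, price, data) shape as `results`;
# values copied from the output above, with `data` shortened for brevity.
results = [
    ("CBELC349", "Consola Nintendo Switch Neon + Switch Mario Kart 8 Deluxe",
     "399.990 pesos", '{"id":"CBELC349"}'),
    ("292147999", "Nintendo Switch OLED + White Joy-Con",
     "459.990 pesos", '{"id":"292147999"}'),
]

def save_results(rows, path):
    """Write the rows to a CSV file, preceded by a header line."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["itemid", "name", "price", "data"])
        writer.writerows(rows)

save_results(results, "products.csv")  # hypothetical file name
```

The `csv` writer quotes fields that contain commas or double quotes, so the JSON text in the `data` column survives a round trip through `csv.reader`.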
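Frankly, you can get most of the values from data-product: it has id, name, price (a plain digit string such as "419990", which only needs dividing by 1000 to match the 419.990 form shown on the page), brand and category. Note that in the output above this price is not always the same as the first price__text value; for the first product the displayed 399.990 matches dimension20 instead.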
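Since the `data-product` attribute holds JSON text, it can be decoded with the standard `json` module instead of being searched for in the HTML. The sketch below is an assumption about how one might do that: `parse_product` is a hypothetical helper, and `raw` is a shortened copy of the first `data:` line from the output above. It also assumes the price is empty or a whole number of pesos.

```python
import json

# Shortened copy of the first product's data-product value from the output above.
raw = ('{"id":"CBELC349",'
       '"name":"Consola Nintendo Switch Neon + Switch Mario Kart 8 Deluxe",'
       '"category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo",'
       '"brand":"Nintendo","price":"419990"}')

def parse_product(raw_json):
    """Return selected fields from a data-product string, or None if unusable."""
    if not raw_json:          # attribute missing or empty
        return None
    try:
        info = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    price = info.get("price")
    return {
        "id": info.get("id"),
        "name": info.get("name"),
        "brand": info.get("brand"),
        "category": info.get("category"),
        # assumes the price is empty or a whole number of pesos
        "price": int(price) if price else None,
    }

product_info = parse_product(raw)
print(product_info["brand"], product_info["price"])
```

Inside the scraping loop this would be called as `parse_product(product.get('data-product'))`, so a tile without the attribute yields None rather than raising an error.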