Web scraping from different links and running a routine to get the data

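Sorry if this is a silly question; this is my first time writing scraping code. I have been trying to scrape the data from a page of tech items and save it...

But I am having trouble getting it to work properly.

The code I wrote is meant to build all the page links for a category (40 items per page), and up to that point it works fine.

The rest of the code fetches the information. For the first 40 items on the first link it works fine, but when I try to iterate over the pages it breaks: the second part, the one that gets the data, does not work.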

#https://www.youtube.com/watch?v=wLRNdCTXmnE
import requests
from bs4 import BeautifulSoup as bs
import itertools
import numpy as np
pages=[]
prices=[]
ids=[]
list_codigo=[]
prices=[]
url_collected=[]
#Loop to go over all pages
pages= np.arange(40,120,40)
print(pages)
#loop over pages to build a list of links
for page in pages:
    a='https://www.paris.cl/tecnologia/consolas-videojuegos/?start='
    b='&sz=40'
    c=str(page)
    page = a + c + b
    print(page)
    url_collected.append(page) 
    print(url_collected)

    
    

    #https://www.paris.cl/tecnologia/consolas-videojuegos/?start=40&sza=40
    response=requests.get(page).text
    soup=bs(response,"html.parser")
    
    #web scraping the data from the links * not working so well
for object in soup.find_all("div",class_='price-content'):
            final =object.find_all(class_="price__text")
            price =final[0].get('aria-label')
            print(price)
            prices.append(price)


for object in soup.find_all("div",class_='onecolumn'):
                final2 =object.find_all(class_="product-tile")
                id1 =final2[0].get('data-itemid') 
                list_codigo.append(id1)
                print(id1)
# print the data as pairs in a CSV-like format
for n, v in zip(prices, list_codigo):
        print("{} , {}".format(n, v))

                   # price = final[0].get('content')
                    #prices.append(price)

Does anyone know what I am doing wrong?

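Don't scrape id, price, name, etc. separately. Some products may have 2 or 3 prices and others may be missing some value, which then gets skipped, and later zip() will create wrong pairs.

It is better to first find all products (all product-tile elements) and then run a for-loop that processes each product separately, searching for id, price, and name inside that single product-tile. If a product has several prices you can take just one, and if it has a missing value you can assign None or a default value.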
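
A small sketch of the difference, using made-up HTML. The tag and class names mirror the ones used on the page, but the three products and their prices are hypothetical and only serve to show the mismatch:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML: product A1 has two prices, B2 has none.
html = """
<div class="product-tile" data-itemid="A1">
  <span itemprop="name">Console A</span>
  <div class="price__text" aria-label="100 pesos"></div>
  <div class="price__text" aria-label="120 pesos"></div>
</div>
<div class="product-tile" data-itemid="B2">
  <span itemprop="name">Console B</span>
</div>
<div class="product-tile" data-itemid="C3">
  <span itemprop="name">Console C</span>
  <div class="price__text" aria-label="300 pesos"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Separate lists: both have 3 entries, but the prices come from only
# two products, so zip() pairs B2 with the second price of A1.
ids = [tile.get("data-itemid") for tile in soup.find_all("div", class_="product-tile")]
prices = [p.get("aria-label") for p in soup.find_all("div", class_="price__text")]
wrong_pairs = list(zip(ids, prices))

# Per-product search: every value is looked up inside a single tile,
# and a missing value becomes None instead of shifting the other rows.
rows = []
for tile in soup.find_all("div", class_="product-tile"):
    name_tag = tile.find("span", {"itemprop": "name"})
    price_tags = tile.find_all("div", class_="price__text")
    rows.append({
        "itemid": tile.get("data-itemid"),
        "name": name_tag.get_text(strip=True) if name_tag else None,
        "price": price_tags[0].get("aria-label") if price_tags else None,
    })

print(wrong_pairs)
print(rows)
```

Here `wrong_pairs` gives B2 a price that belongs to A1, while `rows` keeps B2 with `price` set to None.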


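Minimal working code.

I keep only the important elements.

Because the words product and products are very similar and easy to mix up, I use the prefix all_ for the lists.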

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.paris.cl/tecnologia/consolas-videojuegos/'

params = {
    'start': 0,
    'sz': 40,
}

results = []

for offset in range(0, 121, 40):  # end at `121` so the last offset is `120`; with end `120` it would finish on `80`

    params['start'] = offset

    response = requests.get(url, params=params)
    print('url:', response.url)
    #print('status:', response.status_code)
                    
    soup = bs(response.text, "html.parser")

    all_products = soup.find_all('div', {'class': 'product-tile'})

    for product in all_products:
        itemid = product.get('data-itemid') 
        print('itemid:', itemid)

        data = product.get('data-product') 
        print('data:', data)
        
        name = product.find('span', {'itemprop': 'name'}).text
        print('name:', name)
        
        all_prices = product.find_all('div', {'class': 'price__text'})
        print('len(all_prices):', len(all_prices))
        
        price = all_prices[0].get('aria-label')
        print('price:', price)
        
        results.append( (itemid, name, price, data) )
        print('---')

# ---

# ... here you can save all `results` in file ...

Result:

url: https://www.paris.cl/tecnologia/consolas-videojuegos/?start=0&sz=40

itemid: CBELC349
data: {"id":"CBELC349","name":"Consola Nintendo Switch Neon + Switch Mario Kart 8 Deluxe","variant":"CBELC349","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"419990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"","dimension33":"010","dimension12":"","dimension18":"","dimension19":"469990","dimension20":"399990","dimension30":"Nintendo","dimension41":"4.8571","dimension42":14}
name: Consola Nintendo Switch Neon + Switch Mario Kart 8 Deluxe
len(all_prices): 2
price: 399.990 pesos
---
itemid: 259382999
data: {"id":"259382999","name":"Consola Nintendo Switch Neon                         ","variant":"259382999","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"369990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"010","dimension12":"CONSOLA PORTABLES","dimension18":"","dimension19":"399990","dimension20":"359990","dimension21":"True","dimension30":"Nintendo","dimension41":"4.644","dimension42":191}
name: Consola Nintendo Switch Neon 
len(all_prices): 2
price: 359.990 pesos
---
itemid: 292147999
data: {"id":"292147999","name":"Nintendo Switch OLED + White Joy-Con","variant":"292147999","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"459990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"010","dimension12":"CONSOLA PORTABLES","dimension18":"","dimension19":"469990","dimension20":0,"dimension21":"True","dimension30":"Nintendo","dimension41":"4.9574","dimension42":47}
name: Nintendo Switch OLED + White Joy-Con
len(all_prices): 1
price: 459.990 pesos
---
itemid: 590573999
data: {"id":"590573999","name":"Consola Sony PS4 Slim 1TB Black","variant":"590573999","category":"Tecno/Consolas y VideoJuegos/Consolas PlayStation","brand":"Sony","price":"539990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"005","dimension12":"CONSOLA HOME","dimension18":"","dimension19":"549990","dimension20":"529990","dimension21":"True","dimension30":"Sony"}
name: Consola Sony PS4 Slim 1TB Black
len(all_prices): 2
price: 529.990 pesos
---
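
To save `results` to a file, one option is the standard `csv` module. This is only a sketch: the file name `results.csv` and the helper `save_results` are assumptions, and the sample rows reuse values from the output above, with the `data` column abbreviated for illustration:

```python
import csv

# Sample rows in the same (itemid, name, price, data) layout as `results`.
# Values are taken from the output above; `data` is abbreviated here.
results = [
    ("CBELC349", "Consola Nintendo Switch Neon + Switch Mario Kart 8 Deluxe",
     "399.990 pesos", '{"id":"CBELC349"}'),
    ("292147999", "Nintendo Switch OLED + White Joy-Con",
     "459.990 pesos", '{"id":"292147999"}'),
]


def save_results(rows, path):
    """Write a header line and then one CSV line per scraped product."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["itemid", "name", "price", "data"])
        writer.writerows(rows)


save_results(results, "results.csv")
```

`csv.writer` quotes fields that contain commas or double quotes, so the JSON text in the `data` column survives a round trip through `csv.reader`.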

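Frankly, you can get most values from data-product: it has id, name, price (a digit string such as "419990"; divide by 1000 if you want the 419.990 form the site displays), brand, and category.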
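
Since the `data-product` attribute holds a JSON string, it can be parsed with the standard `json` module. A minimal sketch, using one value copied from the output above with the `dimension...` keys left out to keep it short:

```python
import json

# `data-product` value for one item, copied from the output above
# (the `dimension...` keys are omitted here for brevity).
data = (
    '{"id":"590573999","name":"Consola Sony PS4 Slim 1TB Black",'
    '"variant":"590573999",'
    '"category":"Tecno/Consolas y VideoJuegos/Consolas PlayStation",'
    '"brand":"Sony","price":"539990"}'
)

info = json.loads(data)

# `price` arrives as a string of digits, so convert it before any arithmetic.
price = int(info["price"])

# Some names in the output carry trailing spaces, hence the strip().
print(info["id"], info["name"].strip(), info["brand"], info["category"])

# Show the price with a dot as thousands separator, like the site does.
print(f"{price:,}".replace(",", "."))  # 539.990
```

In the loop from the answer, the same call would be `json.loads(product.get('data-product'))`, guarded against products where the attribute is missing.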