How can I scrape the stock information of this <selfridges.com> website without Selenium?
Website URL: https://www.selfridges.com/GB/en/cat/giorgio-armani-lip-mastero-mattr-6-6ml_317-77011643-LB014200/
In Chrome, via F12 -> Network -> XHR, I can see the requests that return the price information and the stock (inventory) information.
Price API URL:
https://www.selfridges.com/api/cms/ecom/v1/GB/en/price/byId/317-77011643-LB014200
Stock API URL:
https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200
I can get the price response by requesting the API link directly, like this:
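import requests
price_api_url = 'https://www.selfridges.com/api/cms/ecom/v1/GB/en/price/byId/317-77011643-LB014200'
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder: the headers actually used were not shown
s = requests.Session()
response = s.get(price_api_url, headers=headers)
print(response.text)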
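However, this approach does not work for the stock URL: it returns a 403 Forbidden status code.
I tried sending cookies, but the result is the same.
Opening the URL directly in the Chrome browser gives the same result.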
Information that may be useful:
I found the page source that contains the API settings, but I could not find the values for {variantValue} and {variantName}.
"@data_api":"
{"apiKeyValue":"xjut2p34999bad9dx7y868ng",
"apiKey":"Api-Key",
"withCredentials":true,
"priceApi":"/api/cms/ecom/v1/GB/en/price/byId/{partNumber}",
"stockApi":"/api/cms/ecom/v1/GB/en/stock/byId/{partNumber}?option\u003d{variantName}\u0026optionValue\u003d{variantValue}",
"cacheControl":"no-cache",
"addToWishListApiUrl":"/api/cms/ecom/v1/GB/en/wishlist",
"addToBagApiUrl":"/api/cms/ecom/v1/GB/en/cart"
}"
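As an illustration of how the `stockApi` template above might be filled in, here is a minimal sketch. In the JSON, `\u003d` is `=` and `\u0026` is `&`. The function name `stock_url` and the variant name/value in the usage note are hypothetical; the cURL capture elsewhere in this thread calls the endpoint with no query string at all, so the sketch assumes the two variant parameters can be omitted.

```python
from urllib.parse import quote

BASE = "https://www.selfridges.com"
# Template copied from the page's "@data_api" settings, with the
# \u003d and \u0026 escapes decoded to "=" and "&".
STOCK_TEMPLATE = (
    "/api/cms/ecom/v1/GB/en/stock/byId/{partNumber}"
    "?option={variantName}&optionValue={variantValue}"
)


def stock_url(part_number, variant_name=None, variant_value=None):
    """Build the stock API URL; the query string is added only when both
    variant parameters are given (assumption: it is optional)."""
    path, _, query = STOCK_TEMPLATE.partition("?")
    url = BASE + path.format(partNumber=quote(part_number, safe=""))
    if variant_name is not None and variant_value is not None:
        url += "?" + query.format(
            variantName=quote(variant_name, safe=""),
            variantValue=quote(variant_value, safe=""),
        )
    return url
```

Calling `stock_url("317-77011643-LB014200")` yields the stock API URL listed above; passing made-up values such as `"Colour"` and `"Deep Red"` appends `?option=Colour&optionValue=Deep%20Red`.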
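In Chrome or Firefox, check what else the browser sends with the request. It may need special headers, such as the header that marks XHR requests ('X-Requested-With': 'XMLHttpRequest'). Or you may first have to GET the main page to obtain fresh cookies.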
Firefox has developer tools similar to Chrome's, including a "Copy request as cURL command" option. Running that command in the console gives me the stock data.
curl 'https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Accept-Language: pl,en-US;q=0.7,en;q=0.3' --compressed -H 'Content-Type: application/json; charset=utf-8' -H 'Api-Key: xjut2p34999bad9dx7y868ng' -H 'cache-control: no-cache' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'Referer: https://www.selfridges.com/GB/en/cat/giorgio-armani-lip-mastero-mattr-6-6ml_317-77011643-LB014200/' -H 'Cookie: AWSELB=85FF15BB10593ECE847219C9B214EEC5BBD393B7301D90E17B625C66620D7473C3FCE779E5EA1D351A2192C6C975C128815AC60F1118B8968E03001896493C045071A25E98; SF_COUNTRY_LANG=GB_en; COOKIE_NOTICE_SEEN=seen; utag_main=v_id:016df6fcb41700231568089828b001044006200900c48$_sn:1$_ss:0$_pn:2%3Bexp-session$_st:1571808694713$ses_id:1571806819351%3Bexp-session; utag_chan={"channel":"","channel_set":"","channel_converted":false,"awc":""}; Apache=10.77.3.197.1571806819436981; JSESSIONID=0000FBk5q2nb8WGtpDUjLBiNvha:17re3pp2r; WC_PERSISTENT=EBTewrGMk86bvcN%2fwqrCZtv%2bnXk%3d%0a%3b2019%2d10%2d23+05%3a00%3a22%2e442%5f1571806819438%2d1407831%5f10052%5f1480243004%2c%2d1%2cGBP%5f10052; WC_SESSION_ESTABLISHED=true; WC_ACTIVEPOINTER=%2d1%2c10052; WC_AUTHENTICATION_1480243004=1480243004%2cQbYKoQJpwYcMM6iznWYL1ludFS8%3d; WC_USERACTIVITY_1480243004=1480243004%2c10052%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cpfXMuSmw4%2b86xW7eYpU03lFrlirAydf27cytgnreiETU0zdlaTYkdIvAFHFrHmqcOVjtNhcyBowU%0ah%2bD2jUFBMXetfiZdIXQuaegcWHNNUqlIHSvMQrpghGvwCVdLsi%2bVK5UuT9NrO2L6RLVuf2ROuIXl%0avrgeD6slXh2C9RTk%2fKYkbRFJrqWGbiO5BZCmcHU14xftVA%3d%3d; cmTPSet=Y; CoreID6=87385145971315718068242&ci=90262645; 90262645_clogin=v=7&l=62021491571806824206&e=1571808675410; SIGNUP_POPUP_SEEN=seen' -H 'DNT: 1'
Using https://curl.trillworks.com/ I can convert the cURL command to Python requests code, which also returns the stock data.
import requests
cookies = {
'AWSELB': '85FF15BB10593ECE847219C9B214EEC5BBD393B7301D90E17B625C66620D7473C3FCE779E5EA1D351A2192C6C975C128815AC60F1118B8968E03001896493C045071A25E98',
'SF_COUNTRY_LANG': 'GB_en',
'COOKIE_NOTICE_SEEN': 'seen',
'utag_main': 'v_id:016df6fcb41700231568089828b001044006200900c48$_sn:1$_ss:0$_pn:2%3Bexp-session$_st:1571808694713$ses_id:1571806819351%3Bexp-session',
'utag_chan': '{"channel":"","channel_set":"","channel_converted":false,"awc":""}',
'Apache': '10.77.3.197.1571806819436981',
'JSESSIONID': '0000FBk5q2nb8WGtpDUjLBiNvha:17re3pp2r',
'WC_PERSISTENT': 'EBTewrGMk86bvcN%2fwqrCZtv%2bnXk%3d%0a%3b2019%2d10%2d23+05%3a00%3a22%2e442%5f1571806819438%2d1407831%5f10052%5f1480243004%2c%2d1%2cGBP%5f10052',
'WC_SESSION_ESTABLISHED': 'true',
'WC_ACTIVEPOINTER': '%2d1%2c10052',
'WC_AUTHENTICATION_1480243004': '1480243004%2cQbYKoQJpwYcMM6iznWYL1ludFS8%3d',
'WC_USERACTIVITY_1480243004': '1480243004%2c10052%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cpfXMuSmw4%2b86xW7eYpU03lFrlirAydf27cytgnreiETU0zdlaTYkdIvAFHFrHmqcOVjtNhcyBowU%0ah%2bD2jUFBMXetfiZdIXQuaegcWHNNUqlIHSvMQrpghGvwCVdLsi%2bVK5UuT9NrO2L6RLVuf2ROuIXl%0avrgeD6slXh2C9RTk%2fKYkbRFJrqWGbiO5BZCmcHU14xftVA%3d%3d',
'cmTPSet': 'Y',
'CoreID6': '87385145971315718068242&ci=90262645',
'90262645_clogin': 'v=7&l=62021491571806824206&e=1571808675410',
'SIGNUP_POPUP_SEEN': 'seen',
}
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0',
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Language': 'pl,en-US;q=0.7,en;q=0.3',
'Content-Type': 'application/json; charset=utf-8',
'Api-Key': 'xjut2p34999bad9dx7y868ng',
'cache-control': 'no-cache',
'X-Requested-With': 'XMLHttpRequest',
'Connection': 'keep-alive',
'Referer': 'https://www.selfridges.com/GB/en/cat/giorgio-armani-lip-mastero-mattr-6-6ml_317-77011643-LB014200/',
'DNT': '1',
}
response = requests.get('https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200', headers=headers, cookies=cookies)
print(response.text)
But I don't know how long the server will accept this code and its cookies. When I run it later, it may need fresh cookies.
EDIT: A few hours later, the same code still returns data. Sometimes I get a result even with only the Api-Key header:
import requests
headers = { 'Api-Key': 'xjut2p34999bad9dx7y868ng' }
response = requests.get('https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200', headers=headers)
print(response.text)
But sometimes it returns <h1>Developer Inactive</h1>, so I'm not sure whether this is just a temporary problem on the server.
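If the "Developer Inactive" page really is intermittent, one way to cope is to retry until the body parses as JSON. The sketch below is an assumption-laden illustration, not a confirmed fix: `get_json_with_retry` is a hypothetical helper, `fetch` is any zero-argument callable that returns the response body as text (for example a lambda wrapping the `requests.get(...)` call above and returning `.text`), and whether retrying helps at all depends on why the server sends that page.

```python
import json
import time


def get_json_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() until its result parses as JSON, backing off between tries."""
    last_body = ""
    for attempt in range(attempts):
        last_body = fetch()
        try:
            return json.loads(last_body)
        except ValueError:
            # Not JSON, e.g. the "<h1>Developer Inactive</h1>" page.
            if attempt < attempts - 1:
                # Wait delay, 2*delay, 4*delay, ... before the next try.
                time.sleep(delay * (2 ** attempt))
    raise RuntimeError(
        f"no JSON after {attempts} attempts; last body: {last_body[:80]!r}"
    )
```

Taking a callable instead of a URL keeps the retry logic independent of the HTTP library and of which headers or cookies end up being needed.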