Extract name text from a nested table in Python using Beautiful Soup
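I'm fairly new to web scraping with Python, and I'm having trouble extracting the name values from the rows of an HTML table on CoinMarketCap.com. I'm not familiar with how their pages are structured. I've tried several approaches from Stack Overflow and other sites, without success. Here is a snippet of their HTML: https://i.stack.imgur.com/eBamV.png
Here is the code I have so far: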
import requests
from bs4 import BeautifulSoup
page = requests.get("https://coinmarketcap.com/rankings/exchanges/").text
soup = BeautifulSoup(page, features="html.parser")
tags = soup.findAll("div", class_="sc-16r8icm-0 sc-1teo54s-1 dNOTPP")
tables = soup.findChildren('tr')
my_table = tables[0]
rows = my_table.findChildren(['td'])
print(rows)
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print("the value in this cell is %s" % value)
Thanks in advance for your help!
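The value "sc-16r8icm-0 sc-1teo54s-1 dNOTPP" is three separate classes, separated by spaces. Passing the whole string to class_ only matches elements whose class attribute is exactly that string, in that order. If you need to identify an element by several classes, use a CSS selector like this: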
tags = soup.select("div.sc-16r8icm-0.sc-1teo54s-1.dNOTPP")
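As a small illustration of the difference, the sketch below runs both lookups against a made-up HTML fragment. The markup, the class named "extra", and the element texts are assumptions for the demo, not CoinMarketCap's actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div carries the same three classes,
# but in a different order and with an additional class.
html = """
<div class="sc-16r8icm-0 sc-1teo54s-1 dNOTPP">Binance</div>
<div class="dNOTPP sc-16r8icm-0 extra sc-1teo54s-1">Kraken</div>
<div class="sc-16r8icm-0">Other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A space-separated string passed to class_ is compared with the
# full attribute value, so order and extra classes matter.
exact = soup.find_all("div", class_="sc-16r8icm-0 sc-1teo54s-1 dNOTPP")

# The CSS selector matches any div that has all three classes.
selected = soup.select("div.sc-16r8icm-0.sc-1teo54s-1.dNOTPP")

exact_names = [tag.get_text(strip=True) for tag in exact]
selected_names = [tag.get_text(strip=True) for tag in selected]
print(exact_names)
print(selected_names)
```

Generated class names like these tend to change when a site is rebuilt, so either lookup may stop matching over time.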
The data you see is embedded in the page as JSON. To parse it, you can use the following example:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://coinmarketcap.com/rankings/exchanges/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = soup.select_one("#__NEXT_DATA__").text
data = json.loads(data)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
df = pd.json_normalize(data["props"]["initialProps"]["pageProps"]["exchange"])
print(df.head().to_markdown())
Prints:
| | id | name | slug | score | countries | fiats | totalVol24h | spotVol24h | derivativesVol24h | derivativesOpenInterests | derivativesMarketPairs | totalVolChgPct24h | totalVolChgPct7d | visits | liquidity | numMarkets | numCoins | dateLaunched | lastUpdated | marketSharePct | type | makerFee | takerFee | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 270 | Binance | binance | 9.9 | [] | ['AED', 'ARS', 'AUD', 'AZN', 'BRL', 'CAD', 'CHF', 'CLP', 'COP', 'CZK', 'EGP', 'EUR', 'GBP', 'GHS', 'HKD', 'HRK', 'HUF', 'IDR', 'ILS', 'INR', 'ISK', 'JPY', 'KES', 'KRW', 'KZT', 'MXN', 'NGN', 'NOK', 'NZD', 'PEN', 'PHP', 'PLN', 'RON', 'RUB', 'SAR', 'SEK', 'SGD', 'THB', 'TRY', 'TWD', 'UAH', 'UGX', 'USD', 'UYU', 'VND', 'ZAR'] | 5.56801e+10 | 1.42812e+10 | 4.21641e+10 | 1.57537e+10 | 203 | -20.7533 | -65.7038 | 2.20602e+07 | 816 | 1667 | 394 | 2017-07-14T00:00:00.000Z | 2022-05-17T20:08:11.000Z | 0.0023 | | 0.02 | 0.04 | 1 |
| 1 | 524 | FTX | ftx | 8.3819 | [] | ['USD', 'EUR', 'GBP', 'AUD', 'HKD', 'SGD', 'ZAR', 'CAD', 'CHF', 'BRL'] | 7.57339e+09 | 2.12004e+09 | 5.61716e+09 | 3.46104e+09 | 43 | -21.1298 | -58.9183 | 4.71841e+06 | 722 | 466 | 326 | 2019-02-25T00:00:00.000Z | 2022-05-17T20:08:11.000Z | 0.0003 | | 0.02 | 0.07 | 2 |
| 2 | 89 | Coinbase Exchange | coinbase-exchange | 8.303 | [] | ['USD', 'EUR', 'GBP'] | 1.80697e+09 | 1.80757e+09 | nan | nan | nan | -13.3741 | -68.7096 | 2.19108e+06 | 717 | 503 | 173 | 2014-05-24T00:00:00.000Z | 2022-05-17T20:08:11.000Z | 0.0003 | | 0 | 0 | 3 |
| 3 | 24 | Kraken | kraken | 7.9853 | [] | ['USD', 'EUR', 'GBP', 'CAD', 'JPY', 'CHF', 'AUD'] | 8.10391e+08 | 7.66352e+08 | 2.74902e+11 | 4.01852e+07 | 28 | -14.7475 | -63.5845 | 1.72099e+06 | 739 | 542 | 167 | 2011-07-28T00:00:00.000Z | 2022-05-17T20:08:11.000Z | 0.0001 | | 0.02 | 0.05 | 4 |
| 4 | 311 | KuCoin | kucoin | 7.486 | [] | ['USD', 'AED', 'ARS', 'AUD', 'AGN', 'BGN', 'BRL', 'CAD', 'CHF', 'CLP', 'COP', 'CRC', 'CZK', 'DKK', 'DOP', 'EUR', 'GBP', 'GEL', 'HKD', 'HUF', 'ILS', 'INR', 'JPY', 'KRW', 'KZT', 'MAD', 'MDL', 'MXN', 'MYR', 'NAD', 'NGN', 'NOK', 'NZD', 'PEN', 'PHP', 'PLN', 'QAR', 'RON', 'RUB', 'SEK', 'SGD', 'TRY', 'TWD', 'UAH', 'USD', 'UYU', 'UZS', 'ZAR'] | 5.17875e+09 | 1.58063e+09 | 3.61257e+09 | 9.08548e+08 | 112 | -12.0398 | -62.4081 | 2.55465e+06 | 547 | 1291 | 696 | 2017-08-13T00:00:00.000Z | 2022-05-17T20:08:11.000Z | 0.0002 | | 0 | 0 | 5 |
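
If only the exchange names are needed, they can be read straight from the parsed JSON. The sketch below uses a tiny stand-in page instead of the live site, with sample values trimmed from the output above. The key path is the one used in the answer and is assumed to still match what the site serves, which may have changed since.

```python
import json
from bs4 import BeautifulSoup

# Stand-in for the downloaded page: a minimal document with an embedded
# __NEXT_DATA__ script (sample values only, not a live response).
html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"initialProps": {"pageProps": {"exchange": [
  {"id": 270, "name": "Binance", "rank": 1},
  {"id": 524, "name": "FTX", "rank": 2},
  {"id": 89, "name": "Coinbase Exchange", "rank": 3}
]}}}}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The script element holds one JSON string; parse it into Python objects.
script = soup.select_one("#__NEXT_DATA__")
data = json.loads(script.string)

# Walk the same key path as the answer and keep only the "name" field.
exchanges = data["props"]["initialProps"]["pageProps"]["exchange"]
names = [item["name"] for item in exchanges]
print(names)
```

With the DataFrame built in the answer, df["name"].tolist() should give the same list without the manual loop.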