Extracting multiple data from table row in BS4

Question

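In the code below, I am trying to extract the IP addresses and ports from the table at http://free-proxy-list.net using BeautifulSoup.

But every time I get the whole row as one string, which is useless because I can't separate the IP address from its port.

How can I separate the IP and the port?

Here is my code: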

        def get_proxy(self):
            response = requests.get(self.url)
            soup = bs(response.content,'html.parser')
            data_list = [tr for tr in soup.select('tr') if tr.td]

            for i in data_list:
                print(i.text)

Answer 1

Try this. I had to add the isnumeric() condition to make sure the code doesn't pick up data from another table that exists on the same site.

from bs4 import BeautifulSoup as bs
import requests

def get_proxy(url):
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    mapping = {}
    for tr in soup.select('tr'):
        # Look only at the <td> cells; proxy rows are expected to have 8 of them
        cells = tr.find_all('td')
        if len(cells) == 8:
            ip_val = cells[0].text.strip()
            port_val = cells[1].text.strip()
            # A numeric port filters out rows from the other table on the page
            if port_val.isnumeric():
                mapping[ip_val] = port_val

    for ip, port in mapping.items():
        print("IP:", ip)
        print("PORT:", port)

if __name__ == '__main__':
    url = "http://free-proxy-list.net"
    get_proxy(url)
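The same idea can be tried without hitting the network. The following is a minimal sketch, assuming a simplified table layout: SAMPLE_HTML and extract_proxies are hypothetical names made up for illustration, and the markup only imitates the real page. It reads the first two td cells of each row and keeps the row only if the second cell is numeric.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page (an assumption)
SAMPLE_HTML = """
<table>
  <tr><th>IP Address</th><th>Port</th><th>Code</th></tr>
  <tr><td>1.2.3.4</td><td>8080</td><td>US</td></tr>
  <tr><td>5.6.7.8</td><td>3128</td><td>DE</td></tr>
</table>
<table>
  <tr><td>Some other table</td><td>not a port</td></tr>
</table>
"""

def extract_proxies(html):
    """Return (ip, port) tuples for rows whose second cell is numeric."""
    soup = BeautifulSoup(html, "html.parser")
    proxies = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue  # header rows (th only) or rows with too few cells
        ip = cells[0].get_text(strip=True)
        port = cells[1].get_text(strip=True)
        if port.isnumeric():
            proxies.append((ip, port))
    return proxies

if __name__ == "__main__":
    for ip, port in extract_proxies(SAMPLE_HTML):
        print(f"{ip}:{port}")
```

Returning a list of tuples instead of a dict keyed by IP keeps entries that share an IP but use different ports.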

Answer 2

In your code, instead of i.text you can use i.getText(' ,') (or any separator you prefer other than ,). This gives you the IP and port separated by commas.
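To see what the separator argument changes, here is a small sketch on a single hand-written row (row_html is a made-up example, not the real page). It uses get_text, which getText is an alias for, and a plain "," separator to keep the later split simple.

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table used only for illustration
row_html = "<table><tr><td>1.2.3.4</td><td>8080</td><td>US</td></tr></table>"
row = BeautifulSoup(row_html, "html.parser").tr

# Without a separator the cell texts run together
plain = row.text

# With a separator each cell's text stays distinguishable
joined = row.get_text(",")

# The first two fields are the IP and the port
ip, port = joined.split(",")[:2]
print(plain, joined, ip, port)
```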

Also, for convenience, you can load the proxy list into a DataFrame.

Make the following changes/additions to your code:

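import pandas as pd  # add alongside the existing imports

soup = bs(response.content,'html.parser')

data_list = [tr for tr in soup.select('tr') if tr.td]

# Join each row's cell texts with a separator so they can be split later
data_list2 = [tr.getText(' ,') for tr in soup.select('tr') if tr.td]

#for i in data_list:
      #print(i.text)

df = pd.DataFrame(data_list2,columns=['proxy_list'])

# Split on the comma into separate columns and keep the first 300 rows
df_proxyList= df['proxy_list'].str.split(',', expand=True)[0:300]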

df_proxyList ends up with the IP and the port in separate columns (along with a few junk columns).
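The junk columns can be dropped by selecting only the first two columns after the split. This is a rough sketch on made-up rows (the rows list and the ip/port column names are assumptions for illustration), not output from the real site.

```python
import pandas as pd

# Hypothetical rows, shaped like the comma-joined text of each table row
rows = ["1.2.3.4,8080,US", "5.6.7.8,3128,DE"]

df = pd.DataFrame(rows, columns=["proxy_list"])

# expand=True puts each comma-separated piece into its own column (0, 1, 2, ...)
parts = df["proxy_list"].str.split(",", expand=True)

# Keep only the first two columns and give them readable names
proxies = parts[[0, 1]].rename(columns={0: "ip", 1: "port"})
print(proxies)
```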
