Extract content from a web page and save it as a dataframe in Python
I am trying to extract the content inside the blue circle in the image below from this link:
Code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.cspea.com.cn/list/c01/gr2020bj1005297-3'
res = requests.get(url, verify = False)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
# `string=True` matches every text node; it replaces the deprecated `text=True`.
text = soup.find_all(string=True)
output = ''
blacklist = [
    '[document]',
    'a',
    'b',
    'body',
    'div',
    'em',
    'h1',
    'h2',
    'h3',
    'head',
    'html',
    'i',
    'meta',
    'p',
    'script',
    # 'span',
    # 'td',
    # 'th',
    # 'title'
    # there may be more elements you don't want, such as "style", etc.
]
for t in text:
    # Keep only text whose parent tag is not in the blacklist.
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)
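How can I extract the data and save the content as a dataframe?
You can use this example as a basis for scraping the page (I don't read Chinese, so I put every cell into the dataframe; you can drop the rows you don't need afterwards):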
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.cspea.com.cn/list/c01/gr2021bj1000186"
soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")
index, data = [], []
for th in soup.select(".project-detail-left th"):
    h = th.get_text(strip=True)
    # Pair each header cell with the first <td> that follows it.
    t = th.find_next("td").get_text(strip=True)
    index.append(h)
    data.append(t)
df = pd.DataFrame(data, index=index, columns=["value"])
print(df)
Prints:
value
项目名称 海南省三亚市吉阳区溪泽南路18号兰海水都花园29幢
项目编号 GR2021BJ1000186
受让方名称 **
交易方式 网络竞价
...etc.
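To drop the rows you don't need and save the result, something like the following might work. This is only a sketch: the rows are copied from the printed output above rather than scraped, the choice of which row to drop is hypothetical, and the `field` column label is just an illustrative name.

```python
import io

import pandas as pd

# Sample rows copied from the printed output above; a real run would
# build `index` and `data` from the scraped page instead.
index = ["项目名称", "项目编号", "受让方名称", "交易方式"]
data = [
    "海南省三亚市吉阳区溪泽南路18号兰海水都花园29幢",
    "GR2021BJ1000186",
    "**",
    "网络竞价",
]
df = pd.DataFrame(data, index=index, columns=["value"])

# Hypothetical choice: treat the masked "受让方名称" row as unwanted.
# errors="ignore" skips labels that are missing on a given page.
unwanted = ["受让方名称"]
trimmed = df.drop(index=unwanted, errors="ignore")

# Write CSV into an in-memory buffer; pass a file path to to_csv
# instead of the buffer to write to disk.
buffer = io.StringIO()
trimmed.to_csv(buffer, index_label="field")
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a file that will be opened in Excel, passing `encoding="utf-8-sig"` to `to_csv` is a common way to keep the Chinese text readable.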