How to export all the details in a div to Excel/CSV using Beautiful Soup in Python?
I'm new to Python/BeautifulSoup and I'm trying to extract some data.
My site's structure looks like this.
If I open the div with the `border` class, it looks like this (image below).
I have tried something like this:

```python
for P in soup.find_all('p', attrs={'class': 'bid_no pull-left'}):
    print(P.find('a').contents[0])
```
One div's structure looks like the snippet below.
There are about 10 such divs per page.
I want to extract the Item(s), Quantity Required, Bid No, and End Date from each of them.
Please help me.
```html
<div class="border block " style="display: block;">
<div class="block_header">
<p class="bid_no pull-left"> BID NO: <a style="color:#fff !important" href="/showbidDocument/1844736">GEM/2020/B/763154</a></p>
<p class="pull-right view_corrigendum" data-bid="1844736" style="display:none; margin-left: 10px;"><a href="#">View Corrigendum</a></p>
<div class="clearfix"></div>
</div>
<div class="col-block">
<p><strong style="text-transform: none !important;">Item(s): </strong><span>Compatible Cartridge</span></p>
<p><strong>Quantity Required: </strong><span>8</span></p>
<div class="clearfix"></div>
</div>
<div class="col-block">
<p><strong>Department Name And Address:</strong></p>
<p class="add-height">
Ministry Of Railways<br> Na<br> South Central Railway N/a
</p>
<div class="clearfix"></div>
</div>
<div class="col-block">
<p><strong>Start Date: </strong><span>25-08-2020 02:54 PM</span></p>
<p><strong>End Date: </strong><span>04-09-2020 03:00 PM</span></p>
<div class="clearfix"></div>
</div>
<div class="clearfix"></div>
</div>
```
[Error screenshot]
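For reference, the four wanted fields can be pulled from one such div by matching on class names and labels instead of positional indexes. A minimal, self-contained sketch, assuming the HTML above is held in a string named `html` (shortened here to the relevant blocks):

```python
from bs4 import BeautifulSoup

html = """<div class="border block " style="display: block;">
<div class="block_header">
<p class="bid_no pull-left"> BID NO: <a href="/showbidDocument/1844736">GEM/2020/B/763154</a></p>
</div>
<div class="col-block">
<p><strong>Item(s): </strong><span>Compatible Cartridge</span></p>
<p><strong>Quantity Required: </strong><span>8</span></p>
</div>
<div class="col-block">
<p><strong>Start Date: </strong><span>25-08-2020 02:54 PM</span></p>
<p><strong>End Date: </strong><span>04-09-2020 03:00 PM</span></p>
</div>
</div>"""

soup = BeautifulSoup(html, 'html.parser')

# The bid number sits inside the <a> of the "bid_no" paragraph
bid_no = soup.find('p', class_='bid_no').a.get_text(strip=True)

# Every other field is a <strong> label followed by a <span> value
def field(label):
    strong = soup.find('strong', string=lambda s: s and s.strip().startswith(label))
    return strong.find_next('span').get_text(strip=True)

print(bid_no)             # GEM/2020/B/763154
print(field('Item'))      # Compatible Cartridge
print(field('Quantity'))  # 8
print(field('End Date'))  # 04-09-2020 03:00 PM
```

This label-based lookup keeps working even if the page reorders the blocks, which the index-based approach in the answer below cannot guarantee.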
Try the approach below using requests and Beautiful Soup. I built the script around a URL taken from the website, then constructed a dynamic URL to traverse every page and fetch the data.

What exactly the script is doing:

1. First the script builds a URL in which the `page_no` query-string parameter is incremented by 1 each time a pass completes.
2. requests fetches the data from the constructed URL with a GET request and hands it to Beautiful Soup, which parses the HTML structure using lxml.
3. The script then searches the parsed data for the div where the required data actually lives.
4. Finally it loops over all the div text data, page by page.
```python
import requests
from urllib3.exceptions import InsecureRequestWarning
from bs4 import BeautifulSoup as bs

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_bid_data():
    page_no = 1  # initial page number
    while True:
        print('Hold on, creating URL to fetch data...')
        URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no)  # create dynamic URL
        print('URL created: ' + URL)
        scraped_data = requests.get(URL, verify=False)  # request the page
        soup_data = bs(scraped_data.text, 'lxml')  # parse the scraped HTML with lxml
        extracted_data = soup_data.find('div', {'id': 'pagi_content'})  # container div holding the bid blocks
        if extracted_data is None or len(extracted_data) == 0:  # no more data: stop further execution
            break
        else:
            for idx in range(len(extracted_data)):  # loop through the children and print the data
                if idx % 2 == 1:  # the required data sits at odd indexes only
                    bid_data = extracted_data.contents[idx].text.strip().split('\n')
                    print('-' * 100)
                    print(bid_data[0])   # BID number
                    print(bid_data[5])   # Items
                    print(bid_data[6])   # Quantity required
                    print(bid_data[10] + bid_data[12].strip())  # Department name and address
                    print(bid_data[16])  # Start date
                    print(bid_data[17])  # End date
                    print('-' * 100)
        page_no += 1  # move on to the next page

scrap_bid_data()
```
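The script above only prints the fields, while the question asks for a CSV/Excel export. The standard `csv` module can write the same fields out. A sketch with a hard-coded sample row shaped like the ones the loop prints; the file name `bid_data.csv` is my own choice:

```python
import csv

# Sample rows shaped like the fields the scraper prints:
# (bid number, items, quantity, department, start date, end date)
rows = [
    ('GEM/2020/B/763154', 'Compatible Cartridge', '8',
     'Ministry Of Railways South Central Railway N/a',
     '25-08-2020 02:54 PM', '04-09-2020 03:00 PM'),
]

# newline='' is required so the csv module controls line endings itself
with open('bid_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Bid No', 'Items', 'Quantity', 'Department', 'Start Date', 'End Date'])
    writer.writerows(rows)
```

In the scraper itself you would open the file once before the `while` loop and replace each block of `print` calls with a single `writer.writerow((bid_data[0], bid_data[5], bid_data[6], bid_data[10] + bid_data[12].strip(), bid_data[16], bid_data[17]))`; the resulting CSV file opens directly in Excel.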