Python BS4 unwrap() on scraped XML data

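I'm a journalist working on a project that uses web scraping to pull data from a county jail website. I'm still teaching myself Python, and I'm trying to get a list of charges along with the bail assigned to each charge. The site uses XML, and I've been able to pull the charge and bail data and write it to a CSV file, but I'm having trouble using the unwrap() function to remove the tags. I've tried it in several places but can't figure out how it is meant to be used. I'd really like to do this in the code rather than just running find and replace in a spreadsheet.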

from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url="https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
xml = requests.get(url)
response = requests.get(url)
if response.status_code == 200:
   print("Connecting to jail website:")
   print("Connected - Response code:", response)
   print("Scraping Started at ", datetime.now())

   soup = BeautifulSoup(xml.content, 'lxml')

   charges = soup.find_all('ol')
   bail_amt = soup.find_all('ob')

with open('charges-bail.csv', 'a', newline='') as csvfile:
    chargesbail = csv.writer(csvfile, delimiter=',')
    chargesbail.writerow([charges.unwrap(), bail_amt.unwrap()])

The CSV file:

"[<ol>BREAKING AND OR ENTERING (F)</ol>, <ol>POSS STOLEN GOODS/PROP (F)</ol>, <...

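You don't need the unwrap() function; you only need to access the text inside each element. I suggest you search for <of>, which sits above the <ol> and <ob> entries. Doing so keeps your lists of ol and ob entries from getting out of sync, since not every entry has an ob.

Try the following: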
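To see why the original call failed, here is a minimal offline sketch. The SAMPLE markup is a hypothetical stand-in for the real feed (only the of, ol and ob tag names come from the question), and it uses the built-in html.parser so it runs without lxml. find_all() returns a ResultSet, which behaves like a list and has no unwrap() method. unwrap() itself belongs to a single tag and edits the tree in place rather than handing back clean text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the real roster likely has more elements per entry.
SAMPLE = """
<of><ol>BREAKING AND OR ENTERING (F)</ol></of>
<of><ol>POSS STOLEN GOODS/PROP (F)</ol><ob>5000</ob></of>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# find_all() returns a list-like ResultSet, not a single tag.
charges = soup.find_all("ol")
result_type = type(charges).__name__
has_unwrap = hasattr(charges, "unwrap")   # False: unwrap() lives on a Tag

# The text has to be read from each tag individually.
charge_texts = [tag.get_text(strip=True) for tag in charges]

# unwrap() on one tag removes that tag from the tree and leaves its
# contents behind in the parent; it modifies the document in place.
demo = BeautifulSoup("<of><ol>HABEAS CORPUS</ol></of>", "html.parser")
demo.ol.unwrap()
unwrapped_markup = str(demo)

print(result_type, has_unwrap)
print(charge_texts)
print(unwrapped_markup)
```

So unwrap() is meant for restructuring a document, while get_text() (or .text) is the usual way to pull plain strings out for a CSV.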

from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
print("Connecting to jail website:")
req_xml = requests.get(url)
print("Connected - Response code:", req_xml)

if req_xml.status_code == 200:
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile)
        
        print("Scraping Started at ", datetime.now())
        soup = BeautifulSoup(req_xml.content, 'lxml')

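        # Each <of> groups one charge (<ol>) with its bail (<ob>), if any.
        for of in soup.find_all('of'):
            if of.ob:
                ob = of.ob.text
            else:
                ob = ''  # no bail listed for this charge

            chargesbail.writerow([of.ol.text, ob])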

This would give you an output CSV file starting:

BREAKING AND OR ENTERING (F),
LARCENY AFTER BREAK/ENTER,
POSS STOLEN GOODS/PROP (F),5000
HABEAS CORPUS,100000
ELECTRONIC HOUSE ARREST VIOLAT,25000

The code of.ob.text is shorthand for: find the first ob entry inside of, then return the text it contains, or:

of.find('ob').get_text()
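As a quick check of that equivalence, the sketch below compares the two spellings on a small hypothetical sample (the markup is invented for illustration, and html.parser is used so no extra parser is needed). It also shows that of.ob is None when the entry has no ob, which is why the loop tests it before reading .text.

```python
from bs4 import BeautifulSoup

# Hypothetical two-entry sample: one with bail, one without.
SAMPLE = (
    "<of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>"
    "<of><ol>LARCENY AFTER BREAK/ENTER</ol></of>"
)

soup = BeautifulSoup(SAMPLE, "html.parser")
with_bail, without_bail = soup.find_all("of")

# Attribute access is shorthand for find(); .text is shorthand for get_text().
short_form = with_bail.ob.text
long_form = with_bail.find("ob").get_text()

# A missing child comes back as None, so calling .text on it would raise
# AttributeError unless it is checked first.
missing = without_bail.ob

print(short_form, long_form, missing)
```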

To write only the rows where both are present, you could change it to:

for of in soup.find_all('of'):
    if of.ob and of.ob.get_text(strip=True):
        chargesbail.writerow([of.ol.text, of.ob.get_text(strip=True)])
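If it helps to try both variants without hitting the website, the following sketch wraps the loop in a function and writes to an in-memory buffer. The charge_rows name, its require_bail flag and the SAMPLE markup are all hypothetical, made up for this illustration. It also skips any of entry that has no ol, which the code above assumes never happens.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample, not the real feed. The third entry has a blank <ob>.
SAMPLE = """
<of><ol>BREAKING AND OR ENTERING (F)</ol></of>
<of><ol>POSS STOLEN GOODS/PROP (F)</ol><ob>5000</ob></of>
<of><ol>HABEAS CORPUS</ol><ob> </ob></of>
"""


def charge_rows(markup, require_bail=False):
    """Return [charge, bail] rows, one per <of> element in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for of in soup.find_all("of"):
        if of.ol is None:
            continue  # no charge text at all, nothing to write
        # strip=True turns a whitespace-only <ob> into an empty string.
        bail = of.ob.get_text(strip=True) if of.ob else ""
        if require_bail and not bail:
            continue
        rows.append([of.ol.get_text(strip=True), bail])
    return rows


all_rows = charge_rows(SAMPLE)
bail_only = charge_rows(SAMPLE, require_bail=True)

# Write to a StringIO buffer instead of a file to inspect the CSV text.
buffer = io.StringIO()
csv.writer(buffer).writerows(bail_only)
csv_text = buffer.getvalue()
print(csv_text)
```

Swapping the buffer for open('charges-bail.csv', 'a', newline='') would write the same rows to disk as in the code above.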