Python BS4 unwrap() on scraped XML data

Question

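I'm a journalist working on a project that uses web scraping to pull data from a county jail website. I'm still teaching myself Python, and I'm trying to get a list of charges along with the bail amount assigned to each charge. The site uses XML, and I've been able to pull the charge and bail data and write it to a CSV file, but I'm having trouble using the unwrap() function to remove the tags. I've tried it in a few places but can't figure out how it is meant to be used. I'd really like to do this in code rather than just running find-and-replace in a spreadsheet.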

from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url="https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
xml = requests.get(url)
response = requests.get(url)
if response.status_code == 200:
   print("Connecting to jail website:")
   print("Connected - Response code:", response)
   print("Scraping Started at ", datetime.now())

   soup = BeautifulSoup(xml.content, 'lxml')

   charges = soup.find_all('ol')
   bail_amt = soup.find_all('ob')

with open('charges-bail.csv', 'a', newline='') as csvfile:
    chargesbail = csv.writer(csvfile, delimiter=',')
    chargesbail.writerow([charges.unwrap(), bail_amt.unwrap()])

CSV file

"[<ol>BREAKING AND OR ENTERING (F)</ol>, <ol>POSS STOLEN GOODS/PROP (F)</ol>, <...

Answer 1

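You don't need unwrap() here; you only need to access the text inside each element. I suggest searching for <of>, which sits above the <ol> and <ob> entries. Doing so keeps your ol and ob lists from getting out of sync, because not every entry has an ob.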
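To illustrate why unwrap() is the wrong tool, here is a minimal sketch. The SAMPLE string is a made-up fragment that only imitates the <of>/<ol>/<ob> nesting (it is not the real roster), and it uses the built-in "html.parser" instead of 'lxml' so it runs without extra installs. It shows that find_all() returns a list-like ResultSet with no unwrap() method, and that unwrap() on a single tag edits the tree rather than returning the text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that imitates the <of>/<ol>/<ob> nesting.
SAMPLE = "<of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>"
soup = BeautifulSoup(SAMPLE, "html.parser")

# find_all() returns a list-like ResultSet, which has no unwrap() method.
charges = soup.find_all("ol")
try:
    charges.unwrap()
    resultset_error = None
except AttributeError as exc:
    resultset_error = exc

# Reading the text of each Tag in the list is what the CSV needs.
charge_texts = [tag.get_text() for tag in charges]

# unwrap() works on one Tag: it replaces the tag with its contents
# inside the tree and returns the now-empty tag, not the text.
removed = soup.find("ol").unwrap()
tree_after_unwrap = str(soup)
```

So the text comes from get_text() (or .text) on each tag, while unwrap() is meant for restructuring the document itself.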

Try the following:

from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
print("Connecting to jail website:")
req_xml = requests.get(url)
print("Connected - Response code:", req_xml)

if req_xml.status_code == 200:
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile)
        
        print("Scraping Started at ", datetime.now())
        soup = BeautifulSoup(req_xml.content, 'lxml')

        for of in soup.find_all('of'):
            if of.ob:
                ob = of.ob.text
            else:
                ob = ''
                
            chargesbail.writerow([of.ol.text, ob])

This gives you an output CSV file that starts:

BREAKING AND OR ENTERING (F),
LARCENY AFTER BREAK/ENTER,
POSS STOLEN GOODS/PROP (F),5000
HABEAS CORPUS,100000
ELECTRONIC HOUSE ARREST VIOLAT,25000
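The rows with an empty second column are charges that have no <ob> entry. As a rough sketch of why that matters, the snippet below compares pairing two flat find_all() lists by position with looking up the bail inside each <of>. The SAMPLE data and the <roster> wrapper are invented for illustration, and "html.parser" is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first charge has no <ob> (bail) element.
SAMPLE = (
    "<roster>"
    "<of><ol>LARCENY AFTER BREAK/ENTER</ol></of>"
    "<of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>"
    "</roster>"
)
soup = BeautifulSoup(SAMPLE, "html.parser")

# Two flat lists: pairing by position attaches the bail to the wrong charge.
charges = [tag.get_text() for tag in soup.find_all("ol")]
bails = [tag.get_text() for tag in soup.find_all("ob")]
misaligned = list(zip(charges, bails))

# Looking up <ob> inside each <of> keeps every bail with its own charge.
aligned = [
    (of.ol.get_text(), of.ob.get_text() if of.ob else "")
    for of in soup.find_all("of")
]
```

Here misaligned pairs the 100000 bail with the larceny charge, while aligned leaves that charge's bail empty and keeps 100000 with HABEAS CORPUS.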

The code of.ob.text is shorthand for: find the first ob entry inside of, then return the text it contains. In other words:

of.find('ob').get_text()
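A small sketch of that equivalence, using an invented two-entry sample (the <roster> wrapper and the values are assumptions). It also shows that a missing child comes back as None, which is why the bail lookup has to be guarded before reading its text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second <of> has no <ob>.
SAMPLE = (
    "<roster>"
    "<of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>"
    "<of><ol>LARCENY AFTER BREAK/ENTER</ol></of>"
    "</roster>"
)
with_bail, without_bail = BeautifulSoup(SAMPLE, "html.parser").find_all("of")

# Attribute access is shorthand for find() on the first matching child.
short_form = with_bail.ob.text
long_form = with_bail.find("ob").get_text()

# When the child is absent, both forms return None, so calling .text
# on the result without a check would raise AttributeError.
missing_short = without_bail.ob
missing_long = without_bail.find("ob")
```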

To write only the rows where both values are present, change the loop to:

for of in soup.find_all('of'):
    if of.ob and of.ob.get_text(strip=True):
        chargesbail.writerow([of.ol.text, of.ob.get_text(strip=True)])
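For trying the filtering logic without hitting the website, the following sketch runs the same check against an in-memory sample and writes to a string buffer. The helper name charges_with_bail, the SAMPLE data, and the extra guard on of.ol are all assumptions added for illustration; "html.parser" stands in for 'lxml' to keep the example dependency-free.

```python
import csv
import io

from bs4 import BeautifulSoup


def charges_with_bail(xml_text):
    """Return (charge, bail) pairs for <of> entries with a non-empty <ob>."""
    soup = BeautifulSoup(xml_text, "html.parser")
    rows = []
    for of in soup.find_all("of"):
        # Skip entries that lack a charge, lack a bail, or have a blank bail.
        if of.ol and of.ob and of.ob.get_text(strip=True):
            rows.append((of.ol.get_text(strip=True), of.ob.get_text(strip=True)))
    return rows


# Hypothetical sample: one complete entry, one blank bail, one missing bail.
SAMPLE = (
    "<roster>"
    "<of><ol>HABEAS CORPUS</ol><ob>100000</ob></of>"
    "<of><ol>BREAKING AND OR ENTERING (F)</ol><ob> </ob></of>"
    "<of><ol>LARCENY AFTER BREAK/ENTER</ol></of>"
    "</roster>"
)

rows = charges_with_bail(SAMPLE)

# Write to an in-memory buffer instead of charges-bail.csv.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
```

Swapping the buffer for open('charges-bail.csv', 'a', newline='') should give the same rows on disk.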

