Python BS4 unwrap() scraped xml data
I'm a journalist working on a project that uses web scraping to pull data from a county jail website. I'm still teaching myself Python, and I'm trying to get a list of charges along with the bail assigned for each charge. The site uses XML, and I've been able to extract the charge and bail data and write it to a CSV file, but I'm having trouble using the unwrap() function to remove the tags. I've tried it in several places and can't seem to figure out how to use it. I'd really like to do this in the code rather than just running find-and-replace in a spreadsheet.
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"
xml = requests.get(url)
response = requests.get(url)

if response.status_code == 200:
    print("Connecting to jail website:")
    print("Connected - Response code:", response)
    print("Scraping Started at ", datetime.now())

soup = BeautifulSoup(xml.content, 'lxml')
charges = soup.find_all('ol')
bail_amt = soup.find_all('ob')

with open('charges-bail.csv', 'a', newline='') as csvfile:
    chargesbail = csv.writer(csvfile, delimiter=',')
    chargesbail.writerow([charges.unwrap(), bail_amt.unwrap()])
CSV file:
"[<ol>BREAKING AND OR ENTERING (F)</ol>, <ol>POSS STOLEN GOODS/PROP (F)</ol>, <...
There is no need to use the unwrap() function; you just need to access the text inside each element. I would suggest you search for the <of> tag, which sits above the <ol> and <ob> entries. Doing it that way avoids your lists of ol and ob entries going out of sync, since not all entries have an ob.
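Since the live roster XML isn't reproduced here, the sync problem can be sketched offline with the standard library's xml.etree.ElementTree on an invented two-entry fragment (the <roster> wrapper and its contents are assumptions about the feed's structure, matching the tags used above):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking the assumed roster structure:
# each <of> holds one charge (<ol>) and, only sometimes, a bail amount (<ob>).
sample = """<roster>
  <of><ol>BREAKING AND OR ENTERING (F)</ol></of>
  <of><ol>POSS STOLEN GOODS/PROP (F)</ol><ob>5000</ob></of>
</roster>"""

root = ET.fromstring(sample)

# Naive approach: two separate flat lists drift out of sync
# as soon as one entry is missing its <ob>.
charges = [e.text for e in root.iter('ol')]   # 2 items
bails = [e.text for e in root.iter('ob')]     # only 1 item

# Iterating the parent <of> keeps each charge paired with its own bail.
rows = []
for of in root.iter('of'):
    ob = of.find('ob')
    rows.append((of.find('ol').text, ob.text if ob is not None else ''))

print(rows)
# → [('BREAKING AND OR ENTERING (F)', ''), ('POSS STOLEN GOODS/PROP (F)', '5000')]
```

The same parent-first iteration is what the BeautifulSoup answer below relies on.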
Try the following:
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime

url = "https://legacyweb.randolphcountync.gov/sheriff/jailroster.xml"

print("Connecting to jail website:")
req_xml = requests.get(url)
print("Connected - Response code:", req_xml)

if req_xml.status_code == 200:
    with open('charges-bail.csv', 'a', newline='') as csvfile:
        chargesbail = csv.writer(csvfile)
        print("Scraping Started at ", datetime.now())
        soup = BeautifulSoup(req_xml.content, 'lxml')

        for of in soup.find_all('of'):
            if of.ob:
                ob = of.ob.text
            else:
                ob = ''
            chargesbail.writerow([of.ol.text, ob])
This would give you an output CSV file starting:
BREAKING AND OR ENTERING (F),
LARCENY AFTER BREAK/ENTER,
POSS STOLEN GOODS/PROP (F),5000
HABEAS CORPUS,100000
ELECTRONIC HOUSE ARREST VIOLAT,25000
The code of.ob.text is shorthand for: find the first ob entry inside of, then return the text it contains, i.e.:

of.find('ob').get_text()

To write only rows where both are present, you could change it to:
for of in soup.find_all('of'):
    if of.ob and of.ob.get_text(strip=True):
        chargesbail.writerow([of.ol.text, of.ob.get_text(strip=True)])
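As a side note, the way csv.writer renders the filtered rows can be checked offline with io.StringIO, without touching the live site (the (charge, bail) pairs below are invented stand-ins for what the soup loop would produce):

```python
import csv
import io

# Hypothetical pre-extracted (charge, bail) pairs standing in for the soup loop.
rows = [
    ("BREAKING AND OR ENTERING (F)", ""),
    ("POSS STOLEN GOODS/PROP (F)", "5000"),
]

buf = io.StringIO()
writer = csv.writer(buf)
for charge, bail in rows:
    if bail:  # mimic the "only write rows where both are present" filter
        writer.writerow([charge, bail])

print(buf.getvalue().strip())
# → POSS STOLEN GOODS/PROP (F),5000
```

Writing to an in-memory buffer like this is a cheap way to sanity-check CSV output before pointing the script at a real file.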