Scraping information from library catalog
I'm working on a project to scrape catalog information for books from a particular library. So far, my script can scrape all of the cells from the table. However, I'm stuck on how to return only the specific cells for the New Britain library.
import requests
from bs4 import BeautifulSoup

mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)
soup = BeautifulSoup(response.text, 'html.parser')

data = []
table = soup.find('table', attrs={'class':'itemTable'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values

for index, libraryinfo in enumerate(data):
    print(index, libraryinfo)
Here is sample output from the script for the New Britain library:
["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf']
Instead of returning all of the cells, how can I return only the cells about the New Britain library? I only want the library name and the checkout status.
The desired output would be:
["New Britain, Main Library - Children's Department", 'Check Shelf']
There can be multiple cells, since a library can hold more than one copy of the same book.
Filtering out the rows that are not related to New Britain only requires checking whether the first element of cols (i.e. cols[0]) contains the library's name. Getting just the library name and checkout status is also simple: you only need the first and third elements of cols (i.e. [cols[0], cols[2]]), since those hold the library name and the checkout status, respectively.
You can try replacing data.append([ele for ele in cols if ele]) with the following:
# We need this to skip empty rows.
if len(cols) == 0:
    continue
if 'New Britain' in cols[0]:
    data.append([cols[0], cols[2]])
Your code will then look like this:
import requests
from bs4 import BeautifulSoup

mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)
soup = BeautifulSoup(response.text, 'html.parser')

data = []
table = soup.find('table', attrs={'class':'itemTable'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    if len(cols) == 0:
        continue
    if 'New Britain' in cols[0]:
        data.append([cols[0], cols[2]])

for index, libraryinfo in enumerate(data):
    print(index, libraryinfo)
Output:
0 ["New Britain, Jefferson Branch - Children's Department", 'Check Shelf']
1 ["New Britain, Main Library - Children's Department", 'Check Shelf']
2 ["New Britain, Main Library - Children's Department", 'Check Shelf']
To filter the data based on a specific field (the first field, in your example), you can simply use a list comprehension:
[element for element in data if 'New Britain' in element[0]]
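For instance, applied to rows shaped like the script's output, only the New Britain rows survive. A quick sketch (the Avon row below is invented purely for illustration):

data = [
    ["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf'],
    ["Avon Free Public Library - Children's Department", 'J FIC PALACIO', 'Check Shelf'],  # made-up row
]
new_britain_only = [element for element in data if 'New Britain' in element[0]]
print(new_britain_only)
# [["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf']]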
The example you provided drops the empty values, which leaves data elements of different lengths. That makes it harder to know which field corresponds to each piece of the data. Using dictionaries, we can make the data easier to understand and work with.
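A minimal sketch of that idea: zip the header row with each data row, so every row becomes a dict keyed by column name. Only 'Location' and 'Status' are header names confirmed by this table; the middle one is a guess used here for illustration:

headers = ['Location', 'Call No.', 'Status']  # 'Call No.' is an assumed header name
cols = ["New Britain, Main Library - Children's Department", 'J FIC PALACIO', 'Check Shelf']
row = dict(zip(headers, cols))
print(row['Location'], '->', row['Status'])
# New Britain, Main Library - Children's Department -> Check Shelf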
Some fields also seem to contain blank chunks (runs of whitespace characters such as '\n', '\r', '\t', and ' ') inside them, which strip does not remove. Combining it with a simple regular expression helps clean this up. I wrote a small function to do that:
def squish(s):
    return re.sub(r'\s+', ' ', s)
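For example, applied to a hypothetical cell value with embedded newlines and tabs:

import re

def squish(s):
    return re.sub(r'\s+', ' ', s)

print(squish("New Britain,\n\t Main Library - Children's Department"))
# New Britain, Main Library - Children's Department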
Putting it all together, I believe this will help:
import re
import requests
from bs4 import BeautifulSoup

def squish(s):
    return re.sub(r'\s+', ' ', s)

def filter_by_location(data, location_name):
    return [x for x in data if location_name.lower() in x['Location'].lower()]

mypage = 'http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt'
response = requests.get(mypage)
soup = BeautifulSoup(response.text, 'html.parser')

data = []
table = soup.find('table', attrs={'class':'itemTable'})
headers = [squish(element.text.strip()) for element in table.find('tr').find_all('th')]
for row in table.find_all('tr')[1:]:
    cols = [squish(element.text.strip()) for element in row.find_all('td')]
    data.append({k: v for k, v in zip(headers, cols)})

filtered_data = filter_by_location(data, 'New Britain')
for x in filtered_data:
    print('Location: {}'.format(x['Location']))
    print('Status: {}'.format(x['Status']))
    print()
Running it, I got the following result:
Location: New Britain, Jefferson Branch - Children's Department
Status: Check Shelf
Location: New Britain, Main Library - Children's Department
Status: Check Shelf
Location: New Britain, Main Library - Children's Department
Status: Check Shelf
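If you also want to keep the filtered results, a minimal sketch using the standard csv module could be appended to the script above (the filename is just an example, not something from the question):

import csv

# headers and filtered_data come from the script above;
# 'new_britain_items.csv' is an arbitrary example filename
with open('new_britain_items.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerows(filtered_data)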
Try this to get the content you want:
import requests
from bs4 import BeautifulSoup

URL = "http://lci-mt.iii.com/iii/encore/record/C__Rb1872125__S%28*%29%20f%3Aa%20c%3A47__P0%2C3__Orightresult__U__X6?lang=eng&suite=cobalt"

res = requests.get(URL)
soup = BeautifulSoup(res.text, "lxml")
for items in soup.find("table", class_="itemTable").find_all("tr"):
    if "New Britain" in items.text:
        data = items.find_all("td")
        name = data[0].a.get_text(strip=True)
        status = data[2].get_text(strip=True)
        print(name, status)
Output:
New Britain, Jefferson Branch - Children's Department Check Shelf
New Britain, Main Library - Children's Department Check Shelf
New Britain, Main Library - Children's Department Check Shelf
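One caveat: data[0].a assumes every matching row has a link in its first cell, and data[2] assumes at least three cells per row. If the markup varies, a slightly more defensive version of the same loop (a sketch under those assumptions, using the soup object built above) could look like this:

for items in soup.find("table", class_="itemTable").find_all("tr"):
    if "New Britain" in items.text:
        data = items.find_all("td")
        if len(data) < 3:
            continue  # skip header or otherwise incomplete rows
        link = data[0].a
        name = link.get_text(strip=True) if link else data[0].get_text(strip=True)
        status = data[2].get_text(strip=True)
        print(name, status)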