Same Data Showing Several Times With BeautifulSoup Query
I have the following Python code:
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
import xlrd
import re
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
res3 = requests.get("https://web.archive.org/web/20220521203053/https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm")
soup3 = BeautifulSoup(res3.content,'lxml')
BBMF_2022 = []
#BBMF_elem = soup3.find_all('a', string=re.compile(r'between|Flypast'))
for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    #li2 = li1.find_previous().font
    #print(link)
    print(li1)
    #print(li2)
    #BBMF_2022.append(li1)
#df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
#df
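The problem is that when I run the code, the data from 28 May to 29 May is printed as 15 copies of the same 15 entries. I am not sure why that is. Can anyone suggest the cause, and tell me what I need to change in the code so that this data is printed only once instead of 15 times? I am trying to scrape the entries from the website whose 'a' tags contain the words between or Flypast. The rest of the printed data is correct: the entries for 21 May, for example, are printed only once, and they look right.
I inspected the page and noticed that the <i> tags are not present in the data for 28 to 29 May, whereas they are present in the data for 21 May and so on.
When I use these lines of code instead: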
for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    #li2 = li1.find_previous().font
    #print(link)
    #print(li1)
    #print(li2)
    BBMF_2022.append(li1)
df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])
df
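the first entry for 28 May appears 15 times in the output DataFrame, instead of the 15 separate entries from 28 May to 29 May that I mentioned earlier. I am confused; where am I going wrong? I am using the web.archive.org link because the data from a week ago was removed from the website the other day.
For the first version of the Python code, my desired output is: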
21 May - S - Rickmansworth Festival, Hertfordshire Flypast - 3.35pm
21 May - S - The Great Woodford Vintage Festival, Thrapston, Northamptonshire Flypast - between 3.50 & 4.35pm
21 May - Moira Canal Festival, Leicestershire Flypast - between 1.25 & 2.10pm
21 May - L - Wild West & American History Photography Day, Abbots Bromley, Staffs Flypast - between 1.10 & 1.55pm
21 May - Haworth 1940's event, Haworth, West Yorkshire Flypast - between 12.40 & 1.30pm
21 May - L - Etwall Well Dressing Festival, Derbyshire Flypast - between 1.15 & 2.05pm
21 May - Supercars & Classics Weekend, Stonor Park, Oxfordshire Flypast - between 3.25 & 4.15pm
21 May - S - VW Breakout, Santa Pod Raceway, Northamptonshire Flypast - 3.59pm
21 May - The Wartime Village, Skegness, Lincolnshire Flypast - between 2.45 & 3.30pm
22 May - L - Norfolk & Suffolk Aviation Museum, Flixton, Suffolk Flypast - between 10.00 & 10.45am
22 May - S or H - VE Day Event (Royal Air Force Association), Bridlington, E.Yorkshire Flypast - between 11.45 & 12.30pm
22 May - Haworth 1940's event, Haworth, West Yorkshire Flypast - between 12.10 & 1.00pm
22 May - L - Etwall Well Dressing Festival, Derbyshire Flypast - between 11.55 & 12.40pm
22 May - Moira Canal Festival, Leicestershire Flypast - between 11.50 & 12.30pm
22 May - L - The Great Woodford Vintage Festival, Thrapston, Northamptonshire Flypast - between 3.50 & 4.35pm
22 May - L - Rickmansworth Festival, Hertfordshire Flypast - 11.20am
22 May - Supercars & Classics Weekend, Stonor Park, Oxfordshire Flypast - between 10.40 & 11.30am
22 May - L - VW Breakout, Santa Pod Raceway, Northamptonshire Flypast - 11.38am
22 May - The Wartime Village, Skegness, Lincolnshire Flypast - between 11.20 & 12.05pm
28 May - Vintage Rally, Smallwood, Cheshire - between 1.45 & 2.30pm
28 May - Carrington Steam & Heritage Show, Lincolnshire - between 1.15 & 2.00pm
28 May - H - The Shropshire County Show - between 2.05 & 2.45pm
28 May - The Ironbridge WW2 Weekend, Shropshire - between 2.05 & 2.50pm
28 May - H - Middy in the 1940s, Wetheringsett, Suffolk - between 2.15 & 3.00pm
28 May - S - FIA/FIM, Santa Pod Raceway, Northamptonshire - between 3.25 & 4.10pm
28 May - Prescott Historique, Bishops Cleeve, Gloucestershire - between 11.45 & 12.30pm
28 May - S - WARAG Weekend, Somerset - between 2.45 & 3.30pm
28 May - Lechlade Festival, Gloucestershire - between 3.05 & 3.55pm
28 May - H - Heathfield Agricultural Show, East Sussex - between 1.45 & 2.30pm
29 May - Carrington Steam & Heritage Show, Lincolnshire - between 4.15 & 5.00pm
29 May - Vintage Rally, Smallwood, Cheshire - between 3.45 & 4.30pm
29 May - SH - FIA/FIM, Santa Pod Raceway, Northamptonshire - between 12.10 & 12.55pm
29 May - Lechlade Festival, Gloucestershire - between 3.05 & 3.55pm
29 May - SH - Classic Wings & Wheels, Bidford Gliding Club, Warwickshire - between 12.30 & 1.00pm
02 June - L - Lanc, Tank and Military Machines, East Kirkby, Lincs. Flypast
02 July - S - Hollowell Steam and Vintage Rally Flypast - 12.48pm
03 July - SH - Hollowell Steam and Vintage Rally Flypast - 2.01pm
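I would like the same output when using the DataFrame lines of code.
I also tried the latest version of the webpage, for June, and would like the June output to be in the same format as the output I posted above. The problem with the June data is that this time the between and Flypast text is not inside the 'a' href tags, so I am not sure which tag to combine the re.compile line with; it seems to be in the font tags?
I used this line of code for June: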
for item in soup3.find_all('b', string=re.compile(r'June')):
But because I did not include between and Flypast in that line, a lot of unwanted data was output. As before, the June data was repeated as many times as there are entries.
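Given the irregularity of the html and of some of the entries, you will first need to search for several patterns (with your current pattern you miss some dates altogether). This can be done with css OR syntax, passing a list of month abbreviations to search for within the specified tags.
You then process the returned list according to the tag type. In the case of b tags, you can build the relevant event entry from each matched node plus some of its sibling nodes.
I use the bullet points as a kind of anchor to identify my target elements, and then use further css selectors to restrict the match to the elements of interest on the page.
Given the nature of the html, the solution below is more fragile than I would like.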
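As a small illustration of the month list mentioned above, the sketch below builds the comma-separated, quoted string on its own so it can be inspected before it is interpolated into a selector. The variable name abbreviations is only for illustration, and the reading that the leading space is there so that " May" matches text such as "21 May" is an assumption.

```python
import calendar

# calendar.month_abbr[0] is an empty string, so it is skipped.
# The names follow the current locale; English names are assumed here.
abbreviations = list(calendar.month_abbr)[1:]

# Each abbreviation is wrapped in double quotes and prefixed with a space,
# giving a value list that can be placed inside :-soup-contains(...).
months = '" ' + '"," '.join(abbreviations) + '"'

print(months)
```

Printing the string first makes it easier to check the quoting before debugging a long selector.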
import requests
from bs4 import BeautifulSoup as bs
import calendar
import pandas as pd
# quoted, space-prefixed month abbreviations for use inside :-soup-contains()
months = '" ' + '"," '.join(list(calendar.month_abbr)[1:]) + '"'
r = requests.get(
    'https://web.archive.org/web/20220521203053/https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm')
soup = bs(r.content, 'lxml')
results = []
for i in soup.select(f'[color=black]:-soup-contains("•") ~ i:has(b:-soup-contains({months})), \
    [color=black]:-soup-contains("•") + b:-soup-contains({months}):not(i [color=black]:-soup-contains("•") + \
    b:-soup-contains({months}))'):
    if i.name != 'b':
        if '\n' in i.text:  # handle odd case of late May
            results.extend(i.text.replace('• ', '').strip().split('\n'))
        else:
            results.append(i.text.replace('• ', '').strip())
    else:
        # b tag: build the entry from the node plus the following text nodes
        s = i.text + i.next_sibling + i.next_sibling.find_next(string=True)
        ss = i.next_sibling.find_next(string=True).find_next(string=True)
        if ss.strip() == '-':
            results.append(s + ss + ss.find_next('a').text.strip())
        else:
            results.append(s.strip())
df = pd.DataFrame(results, columns=['event'])
print(df.to_markdown(index=False))  # to_markdown needs the tabulate package
df.to_csv('events.csv', encoding='utf-8-sig', index=False)
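Late May:
Here there is a long list separated by br tags inside a single parent i tag. I split this content on \n and then extend my overall results with the resulting list where appropriate.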
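The repeated output in the question can be reproduced on a small scale. The sketch below uses made-up markup (the event names and the three-entry block are hypothetical, not taken from the page) in which several a tags share one parent. It shows that find_parent().text returns the whole block once per link, and one possible way to keep each shared parent only once before splitting it into lines.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three entries inside one parent, separated by <br>.
html = (
    '<i>'
    '28 May - Show A - <a href="#a">between 1.45 &amp; 2.30pm</a><br>\n'
    '28 May - Show B - <a href="#b">between 1.15 &amp; 2.00pm</a><br>\n'
    '29 May - Show C - <a href="#c">between 4.15 &amp; 5.00pm</a>'
    '</i>'
)
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
parent_texts = [a.find_parent().text for a in links]

# Every link has the same parent, so the full block comes back once per link.
all_same = len(set(parent_texts)) == 1

# Visit each distinct parent once, then split its text into separate lines.
seen = set()
entries = []
for a in links:
    parent = a.find_parent()
    if id(parent) in seen:
        continue
    seen.add(id(parent))
    entries.extend(line.strip() for line in parent.text.split('\n') if line.strip())

print(len(parent_texts), all_same)
print(entries)
```

If the same happens on the real page, each DataFrame cell would hold the whole multi-line block, which may be why the DataFrame appears to show only the first entry repeated.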
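Well, this was not easy, but we got there:
(I did this for June because I could not access the May page normally, but the code should work for May as well.)
1. Import the modules, and get the URL and the HTML: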
from bs4 import BeautifulSoup
from lxml import etree
import pandas as pd  # needed for the DataFrame built in step 4
import requests
URL = "https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm"
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))
2. Get every text() node from the descendants of the relevant i element:
all = dom.xpath('/html/body/div[6]/div/div[1]/div/div[2]/i[4]/descendant::text()')
2.1 For convenience, I do a first round of cleaning here:
all = [i for i in all if i != '\n' and i != ' ']
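The filter above only drops strings that are exactly '\n' or ' '. As a hedged alternative, the sketch below drops any string that is empty after stripping whitespace. The sample list raw is invented for illustration and only imitates the kind of values descendant::text() might return.

```python
# Hypothetical text nodes, similar in shape to an XPath text() result.
raw = ['•', ' ', '02 June', '\n', ' - Hessle, N.Yorkshire - ', '\n  ', 'between 6.45 & 7.30pm']

# Keep a node only if something is left after stripping whitespace.
cleaned = [text for text in raw if text.strip()]

print(cleaned)
```

This also removes mixed runs such as '\n  ', which the equality checks would let through.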
3. I wrote a small function that starts a new line/row at every occurrence of '•':
def split_list(input_list, delimiter):
    result_list = []
    while len(input_list) > 0:
        elem = input_list.pop(0)
        if elem == delimiter:
            if 'sub_list' in locals():
                result_list.append(sub_list)
            sub_list = [elem]
        elif len(input_list) == 0:
            sub_list.append(elem)
            result_list.append(sub_list)
        else:
            sub_list.append(elem)
    return result_list
a = split_list(all, '•')
This function works with any delimiter you like, so you can reuse it elsewhere ;)
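For reuse elsewhere, a variant that does not empty its input and does not rely on locals() may be easier to reason about. The following is only a sketch: split_on_delimiter and the sample tokens are hypothetical names and data, not part of the answer. It deliberately differs from split_list in two edge cases, by ignoring items that appear before the first delimiter and by keeping a trailing delimiter as its own group.

```python
def split_on_delimiter(items, delimiter):
    """Group items into sub-lists, each starting at a delimiter.

    Items before the first delimiter are dropped; the input is left unchanged.
    """
    groups = []
    current = None
    for item in items:
        if item == delimiter:
            # A new delimiter closes the group that was being built.
            if current is not None:
                groups.append(current)
            current = [item]
        elif current is not None:
            current.append(item)
    # Close the last group, if any delimiter was seen at all.
    if current is not None:
        groups.append(current)
    return groups


tokens = ['•', '02 June', ' - Hessle', '•', '05 June', ' - Maidstone']
groups = split_on_delimiter(tokens, '•')
print(groups)
```

Because the input list is only read, the same tokens can still be inspected after the call.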
4. Now we can loop over the cleaned list to create the DataFrame:
rows = []
for i in a:
    date = (i[1])
    event = (','.join(i[2:])).replace(',', '')
    rows.append([date,event])
df = pd.DataFrame(rows, columns=["Date", "Event"])
df
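One side effect of the join-then-replace expression in the loop above is that it removes every comma, including commas that belong to the place names, which matches output such as "Kingston-Upon-Hull E.Yorkshire". The sketch below compares it with a plain empty-string join that keeps the punctuation. The sample group is invented for illustration and only imitates the [bullet, date, fragments...] shape.

```python
# Hypothetical group in the shape produced by the splitting step.
group = ['•', '02 June', ' - Kingston-Upon-Hull, E.Yorkshire - ', 'between 6.50 & 7.35pm']

date = group[1]

# Same expression as in the loop above: all commas disappear.
event_without_commas = ','.join(group[2:]).replace(',', '')

# Joining on an empty string keeps the original punctuation;
# strip() also removes the leading space.
event_with_commas = ''.join(group[2:]).strip()

print(date)
print(event_without_commas)
print(event_with_commas)
```

Which form is preferable depends on whether the commas are wanted in the final Event column.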
5. Output:
index | Date | Event |
---|---|---|
0 | 02 June | - BBMF aircraft will take part in the Queen's Platinum Jubilee Flypast over Buckingham Palace at 1.00pm |
1 | 02 June | - Kingston-Upon-Hull E.Yorkshire - between 6.50 & 7.35pm |
2 | 02 June | - Hessle N.Yorkshire - between 6.45 & 7.30pm |
...
index | Date | Event |
---|---|---|
105 | 05 June | - Ingatestone Essex - between 12.45 & 1.30pm |
106 | 05 June | - Maidstone Kent - between 3.45 & 4.30pm |
107 | 05 June | - H - The Overlord Show Denmead Hampshire - between 3.10 & 4.00pm |