Same Data Showing Several Times With Beautifulsoup Query

Question

I have the following Python code:

import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup
import xlrd
import re

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

res3 = requests.get("https://web.archive.org/web/20220521203053/https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm")
soup3 = BeautifulSoup(res3.content,'lxml')

BBMF_2022 = []

#BBMF_elem = soup3.find_all('a', string=re.compile(r'between|Flypast'))

for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    #li2 = li1.find_previous().font
    #print(link)
    print(li1)
    #print(li2)

    #BBMF_2022.append(li1)

#df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])

#df

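The problem I am having is that when I run the code, the data for 28 May to 29 May, which consists of 15 entries, is printed 15 times. I am not sure why this is happening. Can anyone suggest the cause, and tell me what I need to change in the code so that this data is printed only once rather than 15 times? I am trying to scrape the entries from the website whose 'a' tags contain the word between or Flypast. The rest of the printed data is correct, e.g. the entries for 21 May are printed only once and look as they should.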

I inspected the page source and noticed that the <i> tags are absent from the 28 to 29 May data, whereas they are present in the data for 21 May and so on.

When I use these lines of code instead:

for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    #li2 = li1.find_previous().font
    #print(link)
    #print(li1)
    #print(li2)

    BBMF_2022.append(li1)

df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])

df

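The first entry for 28 May is printed 15 times in the output DataFrame, rather than the 15 separate entries from 28 May to 29 May that I mentioned earlier. I am confused; where am I going wrong? I am using the web.archive.org link because the data from a week ago was removed from the website the other day.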

For the first version of the Python code, the output I want is:

21 May - S - Rickmansworth Festival, Hertfordshire Flypast - 3.35pm
21 May - S - The Great Woodford Vintage Festival, Thrapston, Northamptonshire Flypast - between 3.50 & 4.35pm
21 May - Moira Canal Festival, Leicestershire Flypast - between 1.25 & 2.10pm
21 May - L - Wild West & American History Photography Day, Abbots Bromley, Staffs Flypast - between 1.10 & 1.55pm
21 May - Haworth 1940's event, Haworth, West Yorkshire Flypast - between 12.40 & 1.30pm
21 May - L - Etwall Well Dressing Festival, Derbyshire Flypast - between 1.15 & 2.05pm
21 May - Supercars & Classics Weekend, Stonor Park, Oxfordshire Flypast - between 3.25 & 4.15pm
21 May - S - VW Breakout, Santa Pod Raceway, Northamptonshire Flypast - 3.59pm
21 May - The Wartime Village, Skegness, Lincolnshire Flypast - between 2.45 & 3.30pm
22 May - L - Norfolk & Suffolk Aviation Museum, Flixton, Suffolk Flypast - between 10.00 & 10.45am
22 May  - S or H - VE Day Event (Royal Air Force Association), Bridlington, E.Yorkshire Flypast - between 11.45 & 12.30pm
22 May - Haworth 1940's event, Haworth, West Yorkshire Flypast - between 12.10 & 1.00pm
22 May - L - Etwall Well Dressing Festival, Derbyshire Flypast - between 11.55 & 12.40pm
22 May - Moira Canal Festival, Leicestershire Flypast - between 11.50 & 12.30pm
22 May - L - The Great Woodford Vintage Festival, Thrapston, Northamptonshire Flypast - between 3.50 & 4.35pm
22 May - L - Rickmansworth Festival, Hertfordshire Flypast - 11.20am
22 May - Supercars & Classics Weekend, Stonor Park, Oxfordshire Flypast - between 10.40 & 11.30am
22 May - L - VW Breakout, Santa Pod Raceway, Northamptonshire Flypast - 11.38am
22 May - The Wartime Village, Skegness, Lincolnshire Flypast - between 11.20 & 12.05pm

28 May - Vintage Rally, Smallwood, Cheshire - between 1.45 & 2.30pm
28 May - Carrington Steam & Heritage Show, Lincolnshire - between 1.15 & 2.00pm
28 May - H - The Shropshire County Show - between 2.05 & 2.45pm
28 May - The Ironbridge WW2 Weekend, Shropshire - between 2.05 & 2.50pm
28 May - H - Middy in the 1940s, Wetheringsett, Suffolk - between 2.15 & 3.00pm
28 May - S - FIA/FIM, Santa Pod Raceway, Northamptonshire - between 3.25 & 4.10pm
28 May - Prescott Historique, Bishops Cleeve, Gloucestershire - between 11.45 & 12.30pm
28 May - S - WARAG Weekend, Somerset - between 2.45 & 3.30pm
28 May - Lechlade Festival, Gloucestershire - between 3.05 & 3.55pm
28 May - H - Heathfield Agricultural Show, East Sussex - between 1.45 & 2.30pm
29 May - Carrington Steam & Heritage Show, Lincolnshire - between 4.15 & 5.00pm
29 May - Vintage Rally, Smallwood, Cheshire - between 3.45 & 4.30pm
29 May - SH - FIA/FIM, Santa Pod Raceway, Northamptonshire - between 12.10 & 12.55pm
29 May - Lechlade Festival, Gloucestershire - between 3.05 & 3.55pm
29 May - SH - Classic Wings & Wheels, Bidford Gliding Club, Warwickshire - between 12.30 & 1.00pm

02 June - L - Lanc, Tank and Military Machines, East Kirkby, Lincs.  Flypast
02 July - S - Hollowell Steam and Vintage Rally Flypast - 12.48pm
03 July - SH - Hollowell Steam and Vintage Rally Flypast - 2.01pm

I want the same output when using the DataFrame lines of code.

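I have also tried the latest version of the web page, for June. I would like the June output in the same format as the output I posted above. The problem with the June data is that this time the between and Flypast text is not inside the 'a' href tags, so I am not sure which tag to combine the re.compile line with; the text seems to be inside font tags?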

For June I used this line of code:

for item in soup3.find_all('b', string=re.compile(r'June')):

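But because I did not include between and Flypast in that line, a lot of unwanted data is output. And, as before, the June data is repeated as many times as there are entries.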

Answer 1

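Given the irregularity of the HTML and of some of the entries, you will first need to search for several patterns (your current pattern misses some dates entirely). This can be done with CSS OR syntax and by passing a list of month abbreviations to search for within the specified tags.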
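As a rough illustration of the month-abbreviation list mentioned above, the sketch below builds a quoted, comma-separated string with the standard calendar module and places it inside a `:-soup-contains()` selector. The leading space in each quoted item and the choice of the `b` tag are assumptions made for illustration, and the abbreviations come out in English only if the active locale is English (the default for a fresh Python process).

```python
import calendar

# month_abbr[0] is an empty string, so skip it
abbrs = list(calendar.month_abbr)[1:]

# Quote each abbreviation. The leading space is an assumption: it expects
# the abbreviation to follow a space in the page text, as in "21 May".
months = ", ".join(f'" {m}"' for m in abbrs)

# :-soup-contains() accepts a comma-separated list of strings and matches
# an element if ANY of them occurs in its text.
selector = f"b:-soup-contains({months})"
print(selector)
```

Several such selectors can then be joined with commas, which is the CSS way of saying OR.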

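You then process the returned list according to the tag type. In the case of a b tag, you can build the relevant event entry from each matched node plus some of its sibling nodes.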
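To make the sibling idea concrete, here is a minimal sketch on a hypothetical fragment of markup (the real page may be structured differently). It walks `next_siblings` of each b tag until the next br, which is a simpler technique than the chained `find_next(string=True)` calls used in this answer's solution, and it is not meant as a replacement for it.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup, loosely modelled on the bulleted entries
html = (
    '<p><font color="black">•</font> <b>02 July</b> - S - Hollowell Steam and Vintage Rally '
    '<a href="#">Flypast</a> - 12.48pm<br>'
    '<font color="black">•</font> <b>03 July</b> - SH - Hollowell Steam and Vintage Rally '
    '<a href="#">Flypast</a> - 2.01pm<br></p>'
)
soup = BeautifulSoup(html, "html.parser")

events = []
for b in soup.find_all("b"):
    parts = [b.get_text()]
    for sib in b.next_siblings:
        # A <br> marks the end of the current entry
        if isinstance(sib, Tag) and sib.name == "br":
            break
        # Tags contribute their text, bare strings are used as they are
        parts.append(sib.get_text() if isinstance(sib, Tag) else str(sib))
    events.append("".join(parts).strip())

for event in events:
    print(event)
```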

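I use the bullet points as a kind of anchor to identify my target elements, then use further CSS selectors to restrict the match to the elements of interest on the page.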

Given the nature of the HTML, the solution below is more brittle than I would like.

import requests
from bs4 import BeautifulSoup as bs
import calendar
import pandas as pd

months = '" ' + '"," '.join(list(calendar.month_abbr)[1:]) + '"'
r = requests.get(
    'https://web.archive.org/web/20220521203053/https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm')
soup = bs(r.content, 'lxml')
results = []

for i in soup.select(f'[color=black]:-soup-contains("•") ~ i:has(b:-soup-contains({months})), \
                       [color=black]:-soup-contains("•") + b:-soup-contains({months}):not(i [color=black]:-soup-contains("•") + \
                       b:-soup-contains({months}))'):
    if i.name != 'b':
        if '\n' in i.text:  # handle odd case of late May
            results.extend(i.text.replace('• ', '').strip().split('\n'))
        else:
            results.append(i.text.replace('• ', '').strip())
    else:
        s = i.text + i.next_sibling + i.next_sibling.find_next(string=True)
        ss = i.next_sibling.find_next(string=True).find_next(string=True)
        if ss.strip() == '-':
            results.append(s + ss + ss.find_next('a').text.strip())
        else:
            results.append(s.strip())

df = pd.DataFrame(results, columns=['event'])
print(df.to_markdown(index=False))  # to_markdown needs the optional tabulate package
df.to_csv('events.csv', encoding='utf-8-sig', index=False)

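Late May:

There is a long br-separated list inside a single parent i tag.

I split this content on \n and then extend my overall results with the resulting list where appropriate.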
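The shared parent is also a plausible explanation for the repetition reported in the question: every matching a tag returns the same parent, so the parent's text (the whole block) comes back once per link. The sketch below reproduces this on a small, hypothetical fragment; the markup is an assumption, not a copy of the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: several links separated by <br> inside ONE <i>
html = """<i>
<a href="#">28 May - Vintage Rally, Smallwood, Cheshire - between 1.45 &amp; 2.30pm</a><br>
<a href="#">28 May - The Shropshire County Show - between 2.05 &amp; 2.45pm</a><br>
<a href="#">29 May - Lechlade Festival, Gloucestershire - between 3.05 &amp; 3.55pm</a><br>
</i>"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a", string=re.compile(r"between|Flypast"))

# Each link has the same parent, so each item here is the whole block
parent_texts = [a.find_parent().text for a in links]
print(len(parent_texts), len(set(parent_texts)))

# Keep each distinct block once (order preserved), then split it on newlines
unique_blocks = list(dict.fromkeys(parent_texts))
entries = [
    line.strip()
    for block in unique_blocks
    for line in block.split("\n")
    if line.strip()
]
for entry in entries:
    print(entry)
```

Reading the text of each link directly (`a.text`) would avoid the repetition in this fragment as well, but it would lose any text that sits outside the a tags.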


Answer 2

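Well, this was not easy, but we got there. (I did it for June because I could not access the May page properly, but the code should work for May as well.)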

1. Import the modules, then get the URL and the HTML:

from bs4 import BeautifulSoup
from lxml import etree
import requests
import pandas as pd  # used for the DataFrame in step 4
  
URL = "https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm"
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.content, "html.parser")
dom = etree.HTML(str(soup))

2. Get all the text() nodes from the descendants after the first i:

all = dom.xpath('/html/body/div[6]/div/div[1]/div/div[2]/i[4]/descendant::text()')

2.1 For convenience, I do a first clean-up here:

all = [i for i in all if i != '\n' and i != ' ']
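The filter above removes only strings that are exactly '\n' or ' '. If the page also yields other whitespace-only strings, a looser test on `str.strip()` may be worth considering. The sketch below compares the two on made-up text nodes; the sample data is hypothetical.

```python
# Hypothetical text nodes, similar in shape to what descendant::text() returns
nodes = ['•', '02 June', '\n', ' - Hessle - between 6.45 & 7.30pm', ' ', '\n  ', '•', '03 June']

# Same test as the clean-up step: drops only exact '\n' and ' ' strings
exact = [t for t in nodes if t != '\n' and t != ' ']

# Looser variant: drops every string that is empty after stripping whitespace
loose = [t for t in nodes if t.strip()]

print(len(exact), len(loose))
```

Here the exact filter keeps the '\n  ' node while the looser one drops it.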

3. I wrote a small function that lets us split the lines/rows at each occurrence of '•':

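def split_list(input_list, delimiter):
    """Split input_list into sub-lists, each starting with delimiter."""
    result_list = []
    sub_list = None  # no group has been started yet
    for elem in input_list:
        if elem == delimiter:
            # close the previous group, if any, and start a new one
            if sub_list is not None:
                result_list.append(sub_list)
            sub_list = [elem]
        elif sub_list is not None:
            sub_list.append(elem)
        # elements before the first delimiter are skipped

    # append the last group, which has no delimiter after it
    if sub_list is not None:
        result_list.append(sub_list)
    return result_list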

a = split_list(all, '•')

This function works with any delimiter you like, so you can reuse it elsewhere ;)
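As a quick illustration of the "any delimiter" point, here is a compact, self-contained variant of the same grouping idea applied to two made-up lists. The name `split_on` and the sample data are hypothetical, and the variant is a sketch rather than the function from step 3.

```python
def split_on(items, delimiter):
    """Group items into sub-lists, each starting with the delimiter."""
    groups = []
    for item in items:
        if item == delimiter:
            groups.append([item])       # start a new group
        elif groups:
            groups[-1].append(item)     # extend the current group
        # items before the first delimiter are ignored
    return groups


tokens = ['•', '02 June', ' - Hessle', '•', '03 June', ' - Maidstone', ' - 3.45pm']
by_bullet = split_on(tokens, '•')
by_pipe = split_on(['a', '|', 'b', 'c', '|', 'd'], '|')

print(by_bullet)
print(by_pipe)
```

Because the delimiter stays at index 0 of each group, the date ends up at index 1, which is what the loop in step 4 relies on.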

4. Now we can loop over the clean list with a for loop to build the DataFrame:

rows = []

for i in a:
    date = i[1]  # i[0] is the '•' delimiter
    event = ','.join(i[2:]).replace(',', '')
    rows.append([date, event])

df = pd.DataFrame(rows, columns=["Date", "Event"])
df
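One side effect of the loop above is worth noting: joining with ',' and then removing every ',' also strips commas that were part of the event text itself. The sketch below shows the difference against a plain `''.join(...)` on a made-up group; the sample strings are hypothetical, and whether the real page contains such commas is an assumption.

```python
# Hypothetical group as produced by the splitting step: bullet, date, text parts
group = ['•', '02 June', ' - Hessle, ', 'N.Yorkshire', ' - between 6.45 & 7.30pm']

date = group[1]

# Same expression as in the loop: commas inside the text are removed too
stripped = ','.join(group[2:]).replace(',', '')

# Alternative: concatenate the parts and keep the original punctuation
kept = ''.join(group[2:])

print(date)
print(stripped)
print(kept)
```

Which form is preferable depends on whether the commas are wanted in the final Event column.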

5. Output:

index	Date	Event
0	02 June	- BBMF aircraft will take part in the Queen's Platinum Jubilee Flypast over Buckingham Palace at 1.00pm
1	02 June	- Kingston-Upon-Hull E.Yorkshire - between 6.50 & 7.35pm
2	02 June	- Hessle N.Yorkshire - between 6.45 & 7.30pm

...

index	Date	Event
105	05 June	- Ingatestone Essex - between 12.45 & 1.30pm
106	05 June	- Maidstone Kent - between 3.45 & 4.30pm
107	05 June	- H - The Overlord Show Denmead Hampshire - between 3.10 & 4.00pm


python

beautifulsoup

dataframe

pandas