Google scrape returns no description or email
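I'm trying to get the description and email from each Google search result, but only the title and link are returned. I use Selenium to open the page and bs4 to scrape the actual content.
What am I doing wrong? Please help.
Thanks!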
import re

from bs4 import BeautifulSoup, Tag

# driver is the Selenium WebDriver that has already loaded the results page
soup = BeautifulSoup(driver.page_source, 'lxml')
result_div = soup.find_all('div', attrs={'class': 'g'})
links = []
titles = []
descriptions = []
emails = []
phones = []
for r in result_div:
    # Checks if each element is present, else, raise exception
    try:
        # link
        link = r.find('a', href=True)
        # title
        title = None
        title = r.find('h3')
        if isinstance(title, Tag):
            title = title.get_text()
        # desc
        description = None
        description = r.find('div', attrs={'class': 'IsZvec'})
        #description = r.find('span')
        if isinstance(description, Tag):
            description = description.get_text()
            print(description)
        # email
        email = r.find(text=re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*'))
    except Exception:
        # the posted snippet is cut off here; skipping the result is an assumption
        continue
The main issue here is that the class names are dynamic, so you have to change your strategy and select your elements by tag or id.
...
data = []
# select each result container by its tag structure instead of by class name
for e in soup.select('div:has(> div > a h3)'):
    data.append({
        'title': e.h3.text,
        'url': e.a.get('href'),
        'desc': e.next_sibling.text,
        'email': m.group(0) if (m := re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text)) else None
    })
data
Output
[{'title': 'Email design at Stack Overflow',
'url': 'https://stackoverflow.design/email/guidelines/getting-started/',
'desc': 'An email design system that helps us work together to create consistently-designed, properly-rendered email for all Stack Overflow users.',
'email': None},
{'title': 'Is email from do-not-reply@stackoverflow.email legit? - Meta ...',
'url': 'https://meta.stackoverflow.com/questions/338332/is-email-from-do-not-replystackoverflow-email-legit',
'desc': '23.11.2016 · 1\xa0AntwortYes it is legit. We use it to protect stackoverflow.com user cookies from third parties. The links in the email are all rewritten to a\xa0...',
'email': 'do-not-reply@stackoverflow.email'},
{'title': "Newest 'email' Questions - Stack Overflow",
'url': 'https://stackoverflow.com/questions/tagged/email',
'desc': 'Use this tag for questions involving code to send or receive email messages. Posting to ask why the emails you send are marked as spam is off-topic for Stack\xa0...',
'email': None},
{'title': 'Contact information - contact us today - Stack Overflow',
'url': 'https://stackoverflow.co/company/contact',
'desc': "A private, secure home for your team's questions and answers. Perfect for teams of 10-500 members. No more digging through stale wikis and lost emails—give your\xa0...",
'email': None},
{'title': 'How can I get the email of a stackoverflow user? - Meta Stack ...',
'url': 'https://meta.stackoverflow.com/questions/64970/how-can-i-get-the-email-of-a-stackoverflow-user',
'desc': '18.09.2010 · 1\xa0AntwortYou can\'t. Read your own profile. The e-mail box says "never displayed". The closest we have to private messaging is commenting as a reply\xa0...',
'email': None},...]
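A side note on the email part: as far as I know, find(text=re.compile(...)) in bs4 returns the whole text node that contains a match, not the matched address itself, which is why the snippet above applies re.search to the text and takes m.group(0). A minimal sketch of just that step, using only the standard library; the helper name first_email and the example.com address are made up for illustration, and stripping trailing punctuation is my own assumption about what you want:

```python
import re

# Same pattern as in the answer above: local part, "@", domain label, ".", rest.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')


def first_email(text):
    """Return the first email-like substring in text, or None (hypothetical helper)."""
    m = EMAIL_RE.search(text)
    if m is None:
        return None
    # The last character class also accepts "." and "-", so a sentence-ending
    # period can end up in the match; strip it (an assumption, not a rule).
    return m.group(0).rstrip('.-')


# Made-up strings standing in for the text of one search result.
print(first_email('Is email from do-not-reply@example.com legit? - Meta ...'))
print(first_email('Contact information - contact us today'))
```

This pattern is a heuristic, not a validator, so expect the occasional false positive in snippet text.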