How can I make sure that I am on the "About Us" page of a particular website?
Here is a piece of code that I am trying to use to retrieve all the links from a website, given the URL of its homepage.
import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))

def getURL(page):
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print url
    else:
        break
The result is:
/uconnect
#
/
/
/
/nanodegree
/courses/all
#
/legal/tos
/nanodegree
/courses/all
/nanodegree
uconnect
/
/course/machine-learning-engineer-nanodegree--nd009
/course/data-analyst-nanodegree--nd002
/course/ios-developer-nanodegree--nd003
/course/full-stack-web-developer-nanodegree--nd004
/course/senior-web-developer-nanodegree--nd802
/course/front-end-web-developer-nanodegree--nd001
/course/tech-entrepreneur-nanodegree--nd007
http://blog.udacity.com
http://support.udacity.com
/courses/all
/veterans
https://play.google.com/store/apps/details?id=com.udacity.android
https://itunes.apple.com/us/app/id819700933?mt=8
/us
/press
/jobs
/georgia-tech
/business
/employers
/success
#
/contact
/catalog-api
/legal
http://status.udacity.com
/sitemap/guides
/sitemap
https://twitter.com/udacity
https://www.facebook.com/Udacity
https://plus.google.com/+Udacity/posts
https://www.linkedin.com/company/udacity
Process finished with exit code 0
I only want to get the URL of the website's "about us" page, which differs from site to site, for example:
For Udacity it is https://www.udacity.com/us
For artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/
I mean, I could try searching the URLs for keywords like "about", but as said, this approach might miss some pages. Could anyone suggest a good approach?
It would not be easy to cover all possible variations of an "About us" page link, but here is an initial idea that works in both cases you have shown - check for "about" in both the href attribute and the text of a elements:
def about_links(elm):
    # .get() avoids a KeyError for <a> tags that have no href attribute
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())
Usage:
soup.find_all(about_links) # or soup.find(about_links)
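As a minimal end-to-end sketch of how this filter behaves (the HTML snippet and its URLs here are made up for illustration; a real page's content would be fetched with requests first):

```python
from bs4 import BeautifulSoup

def about_links(elm):
    # match <a> elements whose href or visible text contains "about"
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())

# a tiny hypothetical page standing in for response.content
html = """
<nav><a href="/courses/all">Courses</a></nav>
<footer>
  <a href="/us">About Us</a>
  <a href="/contact">Contact</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all(about_links):
    print(link["href"])  # -> /us
```

Note that the Udacity-style case is caught here through the link *text* ("About Us"), even though the href ("/us") contains no "about" - which is exactly why checking both attributes matters.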
You could also reduce the number of false positives by checking only the "footer" part of the page - e.g., find the footer element, or an element with id="footer" or a footer class.
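A sketch of the footer idea (this assumes the page actually has a <footer> element or an element with id="footer"; both selectors are guesses about the target page's markup):

```python
from bs4 import BeautifulSoup

html = """
<a href="/about-team">About</a>  <!-- a potential false positive elsewhere on the page -->
<footer id="footer">
  <a href="/about-us">About Us</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

# look for a <footer> tag first, then fall back to id="footer"
footer = soup.find("footer") or soup.find(id="footer")
if footer is not None:
    # only search anchors inside the footer subtree
    links = footer.find_all("a", href=lambda h: h and "about" in h.lower())
    print([a["href"] for a in links])  # -> ['/about-us']
```

Restricting the search to the footer subtree means the "/about-team" link in the page body is never considered, at the cost of missing sites that put the About link elsewhere.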
Another idea is to "outsource" the About-us-page classification: google (from within your script, of course) for "about" plus the webpage URL and grab the first search result.
As a side note, I noticed you are still using BeautifulSoup version 3 - it is no longer being developed and maintained, and you should switch to BeautifulSoup 4 as soon as possible. Install it via:
pip install --upgrade beautifulsoup4
and change your import to:
from bs4 import BeautifulSoup
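With BeautifulSoup 4, the manual string-scanning loop from the question can be replaced by a single find_all call. A sketch (the sample HTML below stands in for fetched page content; with a live page you would pass response.content the same way):

```python
from bs4 import BeautifulSoup

def get_urls(html):
    """Return every href found in the document, in order."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all
    return [a["href"] for a in soup.find_all("a", href=True)]

# with a live page: get_urls(requests.get("https://www.udacity.com").content)
sample = '<a href="/uconnect">U</a> <a name="top">top</a> <a href="/legal/tos">T</a>'
print(get_urls(sample))  # -> ['/uconnect', '/legal/tos']
```

Besides being shorter, this is also more robust than string scanning: it handles single-quoted and unquoted attribute values, and it never slices at the wrong quote character.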