Extracting text from an href in a website using Python and BeautifulSoup
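I am trying to scrape data from a website and I need the title text of each link. Right now my code prints the whole tag: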
[<a href="http://www.thegolfcourses.net/golfcourses/TX/38468.htm" rel="bookmark">Feather Bay Golf Course and Resort</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/AZ/174830.htm" rel="bookmark">Paradise Valley Country Club</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/IL/129935.htm" rel="bookmark">The Golf Club at Waters Edge</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/NY/10630.htm" rel="bookmark">1000 Acres Ranch Resort</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/VA/995731.htm" rel="bookmark">1757 Golf Club, 1757 Golf Club Front 9 Golf Course</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/WI/320815.htm" rel="bookmark">27 Pines Golf Course</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/WY/823145.htm" rel="bookmark">3 Creek Ranch Golf Club</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/CA/18431.htm" rel="bookmark">3 Par At Four Points</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/AZ/470720.htm" rel="bookmark">3 Parks Fairways</a>]
[<a href="http://www.thegolfcourses.net/golfcourses/IA/074920.htm" rel="bookmark">3-30 Golf & Country Club</a>]
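I used the code below to get this far, but I am having trouble writing the part that extracts just the text. Any good ideas on how to approach this?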
import csv
import requests
from bs4 import BeautifulSoup
courses_list = []
for i in range(1):
    url = "http://www.thegolfcourses.net/page/{}?ls&location=California&orderby=title&radius=6750#038;location=California&orderby=title&radius=6750".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("article")
    for item in g_data2:
        try:
            # The links live in the sixth child node of each <article>
            name = item.contents[5].find_all("a")
            print(name)
        except:
            name = ''
If I understand correctly, this related question may help: "Is there InnerText equivalent in BeautifulSoup / python?"
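Basically, try the .text attribute. Because find_all returns a list of tags, read .text from one element of that list rather than from the list itself:
name = item.contents[5].find_all("a")[0].text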
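EDIT: Sorry, I could not get the formatting right. Try this instead; it is ugly, but it works on the tag as a plain string:
x = "<a> text </a>"
y = x.split(">")[1]
z = y.split("<")[0]
print(z)
This prints text (with the surrounding spaces still attached).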
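A minimal sketch of the .text suggestion, run against an inline HTML snippet instead of the live site. The snippet and the `link_texts` helper are made up for illustration; `find`, `find_all`, `.text` and `get_text()` are BeautifulSoup's own. It also shows why calling `.text` directly on the result of `find_all` fails.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one <article> from the page (assumed markup).
SAMPLE_HTML = """
<article>
  <h2><a href="http://www.thegolfcourses.net/golfcourses/CA/18431.htm" rel="bookmark">3 Par At Four Points</a></h2>
</article>
"""


def link_texts(html):
    """Return the visible text of every <a> tag in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a ResultSet (a list of tags), so read the text per tag.
    return [a.get_text(strip=True) for a in soup.find_all("a")]


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    # A single tag returned by find() has a .text attribute ...
    print(soup.find("a").text)
    # ... whereas the ResultSet returned by find_all() does not.
    try:
        soup.find_all("a").text
    except AttributeError as exc:
        print("find_all(...).text fails:", type(exc).__name__)
    print(link_texts(SAMPLE_HTML))
```

Compared with splitting the markup on ">" and "<", this keeps working when a link carries extra attributes or nested tags.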
Use the string attribute:
name = item.contents[5].find_all("a")[0].string
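Remember that find_all returns a list (a ResultSet object), so if you know there is only one match you can simply take index 0 of that list.
Alternatively, use find if there is only one result you are interested in; it returns the first matching tag directly:
name = item.contents[5].find("a").string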
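Putting the find and .string advice together, here is a sketch that collects one title per course. It assumes each `<article>` holds a link marked `rel="bookmark"`, as in the output printed in the question, and it selects that link by attribute instead of by position in `item.contents`. The sample HTML (including the nested `<b>` tag) and the `extract_titles` name are invented for illustration. One caveat: `.string` is `None` when a tag has more than one child, so the sketch falls back to `get_text()` in that case.

```python
from bs4 import BeautifulSoup

# Invented sample mirroring the links shown in the question (assumed layout).
SAMPLE_HTML = """
<article><h2><a href="http://www.thegolfcourses.net/golfcourses/WI/320815.htm" rel="bookmark">27 Pines Golf Course</a></h2></article>
<article><h2><a href="http://www.thegolfcourses.net/golfcourses/AZ/470720.htm" rel="bookmark">3 <b>Parks</b> Fairways</a></h2></article>
<article><p>No link here</p></article>
"""


def extract_titles(html):
    """Return the text of the first rel="bookmark" link in each <article>."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for article in soup.find_all("article"):
        # find() returns a single tag, or None when nothing matches.
        link = article.find("a", rel="bookmark")
        if link is None:
            continue
        # .string is None if the tag has several children (e.g. nested <b>).
        title = link.string
        if title is None:
            title = link.get_text()
        titles.append(title.strip())
    return titles


if __name__ == "__main__":
    for title in extract_titles(SAMPLE_HTML):
        print(title)
```

Checking for `None` explicitly stands in for the bare try/except in the question, so an article without a link is skipped rather than silently producing an empty name.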