How can I make my python scraping function execute between a certain range of posts?
I'm a beginner. I created a function that scrapes between a certain number of posts. It works, but it's far too long and looks very novice. I want to condense the code, and make it behave so that if the initial count is too high, it reduces the number of posts it scrapes by 1. So if it tries to scrape 15 and there are only 14, it will drop down to 14 instead of stopping. Here is my code:
import random

import requests
from bs4 import BeautifulSoup

# headers is assumed to be defined elsewhere
def scrape_world():
    url = 'http://www.example.org'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = []
    if len(titles) > 15:
        titles = soup.find_all('section', 'box')[:15]
        random.shuffle(titles)
        print(len(titles))
    elif len(titles) > 14:
        titles = soup.find_all('section', 'box')[:14]
        # random.shuffle(titles)
        print(len(titles))
    elif len(titles) > 13:
        titles = soup.find_all('section', 'box')[:13]
        random.shuffle(titles)
        print(len(titles))
    elif len(titles) > 12:
        titles = soup.find_all('section', 'box')[:12]
        random.shuffle(titles)
        print(len(titles))
    elif len(titles) > 11:
        titles = soup.find_all('section', 'box')[:11]
        random.shuffle(titles)
        print(len(titles))
    elif len(titles) > 10:
        titles = soup.find_all('section', 'box')[:10]
        random.shuffle(titles)
        print(len(titles))
    elif len(titles) > 9:
        titles = soup.find_all('section', 'box')[:9]
        random.shuffle(titles)
        print(len(titles))
    else:
        titles = soup.find_all('section', 'box')[:8]
        random.shuffle(titles)
        print(len(titles))
    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in titles]
    # random.shuffle(entries)
    return entries
I tried:
if len(titles) > 15 || < 9:
but that didn't work.
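(Note: Python has no || operator; the keyword is or, and each side of the comparison has to be written out in full. The corrected syntax for that line, if you wanted it, would be:

if len(titles) > 15 or len(titles) < 9:
    ...
)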
Update: print(titles) output:
[<section class="box">
<a class="video-box" href="/videos/video.php?v=wshh2Nw4BKk0vav380lx">
<img alt="" height="125" src="http://i.ytimg.com/vi/clPaWvb6lWk/maxresdefault.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshh2Nw4BKk0vav380lx">Spodee - All I Want</a></strong>
<div>
<span class="views">18,781</span>
<span class="comments"><a data-disqus-identifier="95018" href="http://www.worldstarhiphop.com/videos/video.php?v=wshh2Nw4BKk0vav380lx#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshh058e7C1B1Ey8qwNT">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/t9OWyXfcdYQm.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshh058e7C1B1Ey8qwNT">Sheesh: Dude Grill Is On Another Level!</a></strong>
<div>
<span class="views">182,832</span>
<span class="comments"><a data-disqus-identifier="95013" href="http://www.worldstarhiphop.com/videos/video.php?v=wshh058e7C1B1Ey8qwNT#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhrXYCnHFIj4h2GQjE">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/M1itOMKyh7zj.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhrXYCnHFIj4h2GQjE">Back At It: Brock Lesnar To Return At UFC 200, WWE Approved!</a></strong>
<div>
<span class="views">124,237</span>
<span class="comments"><a data-disqus-identifier="95016" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhrXYCnHFIj4h2GQjE#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhj7V8H8GXx08iH2V9">
<img alt="" height="125" src="http://i.ytimg.com/vi/YRlsJtuZ09s/maxresdefault.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhj7V8H8GXx08iH2V9">Jose Guapo - Off Top</a></strong>
<div>
<span class="views">16,462</span>
<span class="comments"><a data-disqus-identifier="95017" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhj7V8H8GXx08iH2V9#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhfOnhy45f780tHqQG">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/wn03kuXW3v2a.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhfOnhy45f780tHqQG">Tulsa Candidate Angry About Not Being Involved In The Mayoral Debate, Runs Up There Anyway!</a></strong>
<div>
<span class="views">115,333</span>
<span class="comments"><a data-disqus-identifier="95014" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhfOnhy45f780tHqQG#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhrYcD83QWN1n0665g">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/14H17jc8ZTIw.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhrYcD83QWN1n0665g">This Motel Has An Interesting Key Policy!</a></strong>
<div>
<span class="views">16,015</span>
<span class="comments"><a data-disqus-identifier="95019" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhrYcD83QWN1n0665g#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhs2kTRq49K0gXYbuu">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/e2VMzdzmKwFe.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhs2kTRq49K0gXYbuu">Yonio & AG - Holy (Freestyle) [Houston Unsigned Artist] </a></strong>
<div>
<span class="views">4,076</span>
<span class="comments"><a data-disqus-identifier="95012" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhs2kTRq49K0gXYbuu#disqus_thread"></a></span>
</div>
</section>, <section class="box">
<a class="video-box" href="/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL">
<img alt="" height="125" src="http://hw-static.worldstarhiphop.com/u/pic/2016/06/dVjLEzVRc1YQ.jpg" width="222"/>
</a>
<strong class="title"><a href="/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL">Messed Up: 6-Year Old Polish Boy Beats His Mother And Pulls Her Hair!</a></strong>
<div>
<span class="views">201,996</span>
<span class="comments"><a data-disqus-identifier="95015" href="http://www.worldstarhiphop.com/videos/video.php?v=wshhDQZ3eC6yJE6Y5hjL#disqus_thread"></a></span>
</div>
</section>]
OK, BeautifulSoup returns a different kind of structure than I expected. However, I did ask for clarification as a premise for my answer, so I'll post it and retract it if there's a problem.
def scrape_world():
    url = 'http://www.example.org'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')
    cleaned_titles = [title for title in titles if title is not None]
    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in cleaned_titles]
    return entries
It's always better to include an actual example of what you're trying to do in your question, so that people can reproduce your problem more easily.
As the comments said, your code goes straight to titles[:8], because titles = [] before the loop means len(titles) is 0. The soup.find_all function is smart enough to know how big your data set is, so there is no need to specify a length. Based on your print(titles) output, I assume you pointed the code at url = 'http://www.worldstarhiphop.com', so that is used below. When scraping this particular url, there is a "SUBMIT YOUR VIDEO" result at titles[11] that throws an error when you build the entries dictionaries. roganjosh's answer is the right basic approach, but in this case it won't catch titles[11], which is not None; unfortunately, it is just a different format. If you update cleaned_titles to the following, it should work for you:
cleaned_titles = [title for title in titles if title.a.get('href') != 'vsubmit.php']
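As an aside, this is also why the giant if/elif chain in the question is unnecessary: slicing a Python list that is shorter than the requested range simply returns the whole list rather than raising an error, so a single [:15] already "drops down" to however many posts exist. A minimal sketch of that behavior:

posts = ['a', 'b', 'c']    # only 3 posts available
print(posts[:15])          # ['a', 'b', 'c'] -- no error, no padding
print(len(posts[:15]))     # 3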
Putting it all together:
def scrape_world():
    url = 'http://www.worldstarhiphop.com'
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html5lib')
    titles = soup.find_all('section', 'box')
    cleaned_titles = [title for title in titles if title.a.get('href') != 'vsubmit.php']
    entries = [{'href': url + box.a.get('href'),
                'src': box.img.get('src'),
                'text': box.strong.a.text,
                } for box in cleaned_titles]
    return entries
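A quick usage sketch, assuming requests, bs4, and a headers dict are available as in the snippets above:

entries = scrape_world()
print(len(entries))           # number of posts actually scraped
print(entries[0]['href'])     # full link of the first post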