How can I extract only certain parts of the body of an article?
In my text_scraper(page_soup), I realized that toward the end I end up with extraneous information that has nothing to do with my article. What is a general way to strip out this irrelevant content?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Initializing our dictionary
dictionary = {}
# Initializing our url key
url_key = 'url'
dictionary.setdefault(url_key, [])
# Initializing our text key
text_key = 'text'
dictionary.setdefault(text_key, [])

def text_scraper(page_soup):
    text_body = ''
    # Concatenate the text of every <p> tag; the tail of this ends up
    # being irrelevant to the article
    for p in page_soup.find_all('p'):
        text_body += p.text
    return text_body

def article_scraper(url):
    # Opening up the connection, grabbing the page
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()
    # HTML parsing
    page_soup = soup(page_html, "html.parser")
    dictionary['url'].append(url)
    dictionary['text'] = text_scraper(page_soup)
    return dictionary

articles_zero = 'https://www.sfchronicle.com/news/bayarea/heatherknight/article/Special-education-teacher-a-prime-example-of-13560483.php'
article = article_scraper(articles_zero)
article
If you only want the text related to the article, you can adjust the selector in your text_scraper method to grab only the <p> tags inside the article's <section>:
def text_scraper(page_soup):
    text_body = ''
    # Find only the section that holds the article text:
    article_section = page_soup.find('section', {'class': 'body'})
    for p in article_section.find_all('p'):
        # Skip the footer paragraph that directly follows an <em> tag
        if p.previous_sibling and p.previous_sibling.name != 'em':
            text_body += p.text
    return text_body
The article will then be returned without the text inside the footer (Heather Knight is a columnist [...] and their struggles.)

Edit: added a check on the parent to avoid the last part (San Francisco Chronicle [...] Twitter: @hknightsf).
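To see the sibling-based filtering in isolation, here is a minimal sketch against a made-up HTML snippet (the markup below only mimics the structure described above; it is not the actual SFChronicle page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: article paragraphs live inside <section class="body">,
# and the footer bio paragraph is immediately preceded by an <em> tag.
html = """
<html><body>
  <p>Site navigation text</p>
  <section class="body">
    <p>First article paragraph.</p>
    <p>Second article paragraph.</p>
    <em>Heather Knight is a columnist</em><p>Footer bio text.</p>
  </section>
</body></html>
"""

page_soup = BeautifulSoup(html, "html.parser")
section = page_soup.find("section", {"class": "body"})

text_body = ""
for p in section.find_all("p"):
    # Skip any paragraph immediately preceded by an <em> (the footer bio)
    if p.previous_sibling is not None and p.previous_sibling.name == "em":
        continue
    text_body += p.text

print(text_body)  # the two article paragraphs, without footer or navigation
```

Note that for ordinary paragraphs previous_sibling is usually a whitespace NavigableString (whose .name is None), so only the paragraph that directly follows the closing </em> gets skipped.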