How to scrape news content and remove the irrelevant parts

Question

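My goal is to use BeautifulSoup and a for-loop to scrape the text of 100 news articles and store it in a list, myarticle. I want myarticle to contain only the content of the news articles, which I found all sits in <p> tags. However, the result I get includes many irrelevant parts, such as "Thanks for contacting us. We've received your submission." and "This story has been shared 205,105 times. 205,105", and so on.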

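Another problem is that print(myarticle[0]) gives me many news articles, although I expected it to give me just one.

I would like to know how to remove the irrelevant parts and keep only the main content that we read on the news site. How should I adjust the code so that print(myarticle[0]) gives me the first news article?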

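One of the 100 news articles is on this page: https://nypost.com/2020/04/21/missouri-sues-china-over-coronavirus-deceit/

The other news articles I want to scrape are on this site: https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance

Below are the lines of code relevant to my question.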

    for pagelink in pagelinks:
        # get the page text
        page = requests.get(pagelink)
        # parse with BeautifulSoup
        soup = bs(page.text, 'lxml')
        articletext = soup.find_all('p')
        for paragraph in articletext[:-1]:
            # get the text only
            text = paragraph.get_text()
            paragraphtext.append(text)

        # combine all paragraphs into an article
        thearticle.append(paragraphtext)

    # join paragraphs to re-create the article
    myarticle = [''.join(article) for article in thearticle]
    # show the first string of the list
    print(myarticle[0])

Answer 1

soup.find_all('p')

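This call finds every P element on the web page. P is a very common tag that is used for most text, which is why you are picking up text that is not part of the article.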
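To see the effect in isolation, here is a small sketch that runs find_all('p') on a made-up page. The markup, including the class name share-widget, is hypothetical and only loosely modelled on a news page; it is not copied from nypost.com. It uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page structure may differ.
html = """
<html><body>
  <div class="entry-content">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <div class="share-widget">
    <p>This story has been shared 205,105 times.</p>
  </div>
  <form><p>Thanks for contacting us. We've received your submission.</p></form>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all("p") searches the whole document, so the widget and form
# paragraphs are returned alongside the two article paragraphs.
all_paragraphs = [p.get_text() for p in soup.find_all("p")]
for text in all_paragraphs:
    print(text)
```

All four paragraphs are printed, not only the two inside the article container.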

I would first find the div that contains the article and then get the text from it, for example:

container = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
articletext = container.find_all('p')
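Putting the pieces together, a minimal sketch might look like the following. It also touches the second problem in the question: if paragraphtext is created once before the loop (an assumption, since that line is not shown), every article appends to the same list, so myarticle[0] ends up holding text from all pages. Building a fresh list per article avoids that. The helper name extract_article and the sample pages are hypothetical, the class names are the ones used above and may not match the site's current markup, and the paragraphs are joined with newlines here as a readability choice.

```python
from bs4 import BeautifulSoup


def extract_article(html, parser="html.parser"):
    """Return the article body as one string, or '' if no container is found."""
    soup = BeautifulSoup(html, parser)
    # Class names taken from the answer above; the site's markup may differ.
    container = soup.find("div", class_=["entry-content", "entry-content-read-more"])
    if container is None:
        return ""
    # A new list is built on every call, so paragraphs never leak between articles.
    paragraphs = [p.get_text().strip() for p in container.find_all("p")]
    return "\n".join(paragraphs)


# Hypothetical stand-ins for the downloaded pages.
pages = [
    '<div class="entry-content"><p>Article one, first paragraph.</p>'
    '<p>Article one, second paragraph.</p></div>'
    '<p>Thanks for contacting us.</p>',
    '<div class="entry-content"><p>Article two.</p></div>'
    '<p>This story has been shared 205,105 times.</p>',
]

# One string per article.
myarticle = [extract_article(html) for html in pages]
print(myarticle[0])
```

In the original loop, the same helper could be called as extract_article(requests.get(pagelink).text, 'lxml') for each pagelink. Unlike the original code, this sketch does not drop the last paragraph with [:-1]; whether that is still needed depends on what the container actually holds.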

Tags: python, for-loop, beautifulsoup, web-crawler, web-scraping