
Web-crawler for unstructured data

Are there any web crawlers suited to parsing many unstructured websites (news, articles) and extracting the main block of content from them without predefined rules?

What I mean is: when I parse a news feed, I want to extract the main content block from each article to do some NLP work. I have a lot of websites, and studying each site's DOM and writing rules for every one of them would take forever.

I tried using Scrapy to grab all the text in the body, stripped of tags and scripts, but it includes a lot of irrelevant stuff, such as menu items, ad blocks, etc.

site_body = selector.xpath('//body').extract_first()

But doing NLP on that kind of content won't be very precise.
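The problem described above can be reproduced with the standard library alone; the HTML snippet below is invented purely for illustration, showing how menu and ad text gets mixed in with the article when you take everything in the body:

```python
from html.parser import HTMLParser

class BodyText(HTMLParser):
    """Collect all visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

page = """
<body>
  <ul class="menu"><li>Home</li><li>Politics</li></ul>
  <script>trackAds();</script>
  <article><p>The actual news story.</p></article>
  <div class="ad">Buy now!</div>
</body>
"""
parser = BodyText()
parser.feed(page)
text = ' '.join(parser.parts)
print(text)  # menu items and ad text come along with the article
```

Running this prints `Home Politics The actual news story. Buy now!`, i.e. the navigation and ad text survives even after scripts are stripped, which is exactly why NLP over the raw body is noisy.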

So, are there any other tools or approaches for doing this kind of task?

I tried to solve this problem with pattern matching: you annotate the source of the web page itself and use it as a matching template, with no need to write special rules.

For example, if you look at the source of this page, you will see:

<td class="postcell">
<div>
    <div class="post-text" itemprop="text">

<p>Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?</p>

Then you delete the text and add {.} to mark that location as relevant, and you get:

<td class="postcell">
<div>
<div class="post-text" itemprop="text">
{.}

(Usually you also need to close the tags, but for a single element that is not necessary.)

Then pass it as a pattern to Xidel (SO seems to block the default user agent, so it needs to be changed):

xidel '<page URL>' --user-agent "Mozilla/5.0 (compatible; Xidel)" -e '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'

and it will output your text:

Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?

I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them.

I was trying to use Scrapy and get all text without tags and scripts, placed in a body, but it include a lot of un-relevant stuff, like menu items, ad blocks, etc.

site_body = selector.xpath('//body').extract_first()


But doing NLP over such kind of content will not be very precise.

So is there any other tools or approaches for doing such tasks?

You can use Beautiful Soup and its get_text() in your parse():

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(response.body, 'html.parser')

yield {'body': soup.get_text() }

You can also remove unwanted content manually (and if you find you want to keep certain markup, tags such as <H1> or <b> may be useful signals):

# Remove invisible tags
for i in soup.findAll(lambda tag: tag.name in ['script', 'link', 'meta']):
    i.extract()

You can do something similar to whitelist the tags you want to keep.
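A minimal sketch of such a whitelist, assuming you only want the text of a few chosen tags (the tag list and the HTML snippet are just illustrations):

```python
from bs4 import BeautifulSoup

page = """
<body>
  <ul class="menu"><li>Home</li></ul>
  <h1>Headline</h1>
  <p>First paragraph of the article.</p>
  <div class="ad">Buy now!</div>
</body>
"""
soup = BeautifulSoup(page, 'html.parser')

# Keep only text inside whitelisted tags; everything else is dropped
whitelist = ['h1', 'h2', 'p', 'b']
text = ' '.join(tag.get_text(strip=True)
                for tag in soup.find_all(whitelist))
print(text)  # 'Headline First paragraph of the article.'
```

Here find_all() is given a list of tag names, so the menu item and the ad block never make it into the extracted text.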