Android: Extract article main content

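I am currently building an Android application that extracts the main content and images from a website. Right now I am using the Jsoup API to extract all p tags from the HTML. However, this is not a good solution. Are there any suggestions or better approaches for extracting the main content and images of a website on Android?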

Why do you think using Jsoup is not a good solution?

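I have written many web scrapers for different web pages, and in my experience Jsoup is the best way to accomplish that task. You should study the Jsoup selector syntax. It is very powerful, and with the right selectors you can extract most of the information in an HTML document very easily. Extraction usually only becomes harder when the document has no id or class attributes or other distinguishing features.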

Other HTML parsers you might be interested in are JTidy and TagSoup.

I did not find anything that worked for me, so I published Goose for Android, here: https://github.com/milosmns/goose

Here is some of the description...

Document cleaning

When you pass a URL to Goose, the first thing it does is clean up the document to make it easier to parse. It goes through the whole document and removes comments and common social network sharing elements, converts em and other tags to plain text nodes, tries to convert divs used as text nodes to paragraphs, and does a general document cleanup (spaces, new lines, quotes, encoding, etc).

Content / Images Extraction

When dealing with random article links you're bound to come across the craziest of HTML files. Some sites even like to include 2 or more HTML files per site. Goose uses a scoring system based on clustering of English stop words and other factors that you can find in the code. Goose also applies descending scoring: the further down the page a node sits, the lower its score becomes. The goal is to find the strongest grouping of text nodes inside a parent container and assume that's the relevant group of content, as long as it's high enough (up) on the page.
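To make the scoring idea more concrete, here is a minimal sketch in plain Java. It is not Goose's actual code: the class name `StopWordScorer`, the tiny stop-word list, and the linear position decay are all assumptions chosen for illustration, and it scores flat text blocks rather than parent containers.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: scores text blocks by stop-word count, weighted down with position. */
final class StopWordScorer {
    // Tiny hypothetical stop-word list; a real one would be much longer.
    private static final Set<String> STOP_WORDS = Set.of(
            "the", "a", "an", "and", "or", "of", "to", "in",
            "is", "it", "that", "for", "on", "with", "as");

    /** Counts how many words in the text are stop words. */
    static int countStopWords(String text) {
        int count = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (STOP_WORDS.contains(word)) {
                count++;
            }
        }
        return count;
    }

    /** Stop-word count scaled by a weight that shrinks for blocks lower on the page. */
    static double score(String text, int index, int total) {
        // Assumed decay: weight 1.0 for the first block, approaching 0.5 for the last.
        double positionWeight = 1.0 - 0.5 * index / (double) total;
        return countStopWords(text) * positionWeight;
    }

    /** Returns the index of the highest-scoring block, or -1 if the list is empty. */
    static int bestBlock(List<String> blocks) {
        int best = -1;
        double bestScore = -1.0;
        for (int i = 0; i < blocks.size(); i++) {
            double s = score(blocks.get(i), i, blocks.size());
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }
}
```

Navigation bars and footers tend to contain few stop words, so they score low even when they appear first. Real article paragraphs are dense with stop words and win as long as they are not too far down the page.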

Image extraction is the step that takes the longest. Finding the most important image on a page proved to be challenging and required downloading all the images and inspecting them with external tools (not all images are considered: Goose checks mime types, dimensions, byte sizes, compression quality, etc). Java's Image functions were just too unreliable and inaccurate. On Android, Goose uses the BitmapFactory class, which is well documented, tested, fast and accurate. Images are analyzed starting from the top node that Goose finds the content in, followed by a recursive run outwards trying to find good images. Goose also checks whether those images are ads, banners or author logos, and ignores them if so.
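The kind of filtering described above could look roughly like the following sketch. It is an assumption-heavy illustration, not Goose's implementation: the `Candidate` holder, the accepted mime types, and every threshold are hypothetical. The code is plain Java and takes the metadata as given; on Android the width and height could be read without decoding the full bitmap by setting `inJustDecodeBounds` on `BitmapFactory.Options`.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Illustrative only: rejects unlikely article images using cheap metadata checks. */
final class ImageCandidateFilter {
    /** Hypothetical metadata holder for one downloaded image. */
    static final class Candidate {
        final String url;
        final String mimeType;
        final int width;
        final int height;
        final long byteSize;

        Candidate(String url, String mimeType, int width, int height, long byteSize) {
            this.url = url;
            this.mimeType = mimeType;
            this.width = width;
            this.height = height;
            this.byteSize = byteSize;
        }
    }

    // All thresholds below are assumptions picked for the example.
    private static final Set<String> ACCEPTED_MIME = Set.of("image/jpeg", "image/png");
    private static final int MIN_SIDE = 100;         // smaller images are likely icons or logos
    private static final long MIN_BYTES = 4_000;     // very small files are likely spacers
    private static final double MAX_ASPECT = 3.0;    // very wide or tall images look like banners

    static boolean isAcceptable(Candidate c) {
        if (!ACCEPTED_MIME.contains(c.mimeType)) return false;
        if (c.width < MIN_SIDE || c.height < MIN_SIDE) return false;
        if (c.byteSize < MIN_BYTES) return false;
        double aspect = (double) Math.max(c.width, c.height) / Math.min(c.width, c.height);
        return aspect <= MAX_ASPECT;
    }

    /** Picks the acceptable candidate with the largest pixel area, if there is one. */
    static Optional<Candidate> pickLargest(List<Candidate> candidates) {
        return candidates.stream()
                .filter(ImageCandidateFilter::isAcceptable)
                .max(Comparator.comparingLong((Candidate c) -> (long) c.width * c.height));
    }
}
```

Choosing the largest surviving image is only one possible heuristic. Preferring images closest to the content node, as the description suggests, would need the DOM position as an extra input.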

Output Formatting

Once Goose has the top node where we think the content is, Goose will try to format the content of that node for the output. For example, for NLP-type applications, Goose's output formatter will just suck all the text and ignore everything else, and other (custom) extractors can be built to offer a more Flipboardy-type experience.

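You can try the textracto API. It automatically identifies the main content of an HTML file. It can also parse OpenGraph metadata, so you will be able to extract images (og:image) as well.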
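Independently of any particular API, reading og:image yourself is not hard. The sketch below is a dependency-free illustration using only java.util.regex. The class name `OpenGraphImage` is made up for the example, and regex-based HTML handling is fragile, so treat it as a demonstration of the idea rather than a robust parser. If you already use Jsoup, `doc.select("meta[property=og:image]").attr("content")` is likely the better route.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: finds the og:image URL in raw HTML without an HTML parser. */
final class OpenGraphImage {
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches property="og:image" exactly, so og:image:width and similar are skipped.
    private static final Pattern PROPERTY =
            Pattern.compile("\\bproperty\\s*=\\s*[\"']og:image[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
            Pattern.compile("\\bcontent\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    /** Returns the content of the first og:image meta tag, if any. Entities are not decoded. */
    static Optional<String> find(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!PROPERTY.matcher(tag).find()) {
                continue;
            }
            Matcher content = CONTENT.matcher(tag);
            if (content.find()) {
                String value = content.group(1) != null ? content.group(1) : content.group(2);
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

Checking the property and content attributes separately means the attribute order inside the meta tag does not matter.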