Android: Extract article main content
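I am currently building an Android app that extracts the main content and images from a website. Right now I am using the Jsoup API to pull all the p tags out of the HTML, but this is not a good solution. Are there any suggestions or better solutions for extracting the main content and images from a website on Android?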
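Why do you think Jsoup is not a good solution?
I have written many web scrapers for different pages, and in my experience Jsoup is the best tool for the job. You should study the Jsoup selector syntax: it is very powerful, and with the right selectors you can extract most information from an HTML document quite easily. Extraction usually gets harder when the document has no id or class attributes or other distinguishing features.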
I did not find anything that worked for me, so I published Goose for Android here: https://github.com/milosmns/goose
Here is some of the description...
Document cleaning
When you pass a URL to Goose, the first thing it starts to do is clean
up the document to make it easier to parse. It will go through the
whole document and remove comments, common social network sharing
elements, convert em and other tags to plain text nodes, try to
convert divs used as text nodes to paragraphs, as well as do a general
document cleanup (spaces, new lines, quotes, encoding, etc).
Content / Images Extraction
When dealing with random article links you're bound to come across the
craziest of HTML files. Some sites even like to include 2 or more HTML
files per site. Goose uses a scoring system based on clustering of
English stop words and other factors that you can find in the code.
Goose also does descending scoring so as the nodes move down - the
lower their scores become. The goal is to find the strongest grouping
of text nodes inside a parent container and assume that's the relevant
group of content as long as it's high enough (up) on the page.
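
The scoring idea described above can be illustrated with a small, dependency-free sketch. This is not Goose's actual code: the class name StopWordScorer, the tiny stop-word list, and the linear position weight (1.0 for the first block, shrinking to 0.5 for the last) are all assumptions chosen only to show how stop-word density and "descending" scoring might combine.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of stop-word scoring; not taken from Goose. */
final class StopWordScorer {
    // Tiny illustrative list; a real stop-word list would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
            "in", "is", "it", "of", "on", "that", "the", "this", "to", "was", "with"));

    /** Counts how many words in the text are stop words. */
    static int countStopWords(String text) {
        int count = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (STOP_WORDS.contains(word)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Scores one text block: stop-word count times a position weight.
     * The weight is an assumption: 1.0 for the first block, 0.5 for the last.
     */
    static double score(String text, int index, int total) {
        double positionWeight = total <= 1 ? 1.0 : 1.0 - 0.5 * index / (total - 1);
        return countStopWords(text) * positionWeight;
    }

    /** Returns the index of the highest-scoring block, or -1 for an empty list. */
    static int bestBlock(List<String> blocks) {
        int best = -1;
        double bestScore = -1.0;
        for (int i = 0; i < blocks.size(); i++) {
            double s = score(blocks.get(i), i, blocks.size());
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }
}
```

Navigation bars and footers tend to contain few stop words, so real sentences usually win; the position weight only breaks ties in favor of blocks higher up the page. In practice the blocks would come from an HTML parser such as Jsoup rather than from plain strings.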
Image extraction is the one that takes the longest. Trying to find the
most important image on a page proved to be challenging and required
downloading all the images to manually inspect them using external
tools (not all images are considered, Goose checks mime types,
dimensions, byte sizes, compression quality, etc). Java's Image
functions were just too unreliable and inaccurate. On Android, Goose
uses the BitmapFactory class, which is well documented, tested, fast,
and accurate. Images are analyzed from the top node that Goose
finds the content in, then comes a recursive run outwards trying to
find good images - Goose also checks if those images are ads, banners
or author logos, and ignores them if so.
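
As a rough illustration of the filtering step described above, here is a minimal sketch that rejects images by dimensions, byte size, aspect ratio, and URL keywords. It is not Goose's implementation: the class ImageCandidateFilter, the Candidate type, and every threshold and keyword are hypothetical. It assumes width and height are already known; on Android they could be read without decoding the full bitmap by setting BitmapFactory.Options.inJustDecodeBounds.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of image filtering; thresholds are illustrative only. */
final class ImageCandidateFilter {
    /** Image metadata assumed to have been collected beforehand. */
    static final class Candidate {
        final String url;
        final int width;
        final int height;
        final long byteSize;

        Candidate(String url, int width, int height, long byteSize) {
            this.url = url;
            this.width = width;
            this.height = height;
            this.byteSize = byteSize;
        }
    }

    // All values below are assumptions, not numbers taken from Goose.
    static final int MIN_SIDE_PX = 100;
    static final long MIN_BYTES = 5_000;
    static final double MAX_ASPECT_RATIO = 3.0;
    static final String[] SUSPICIOUS_KEYWORDS = {"/ads/", "banner", "logo", "avatar"};

    /** Returns true if the image passes the size, shape, and URL checks. */
    static boolean isLikelyContentImage(Candidate c) {
        if (c.width < MIN_SIDE_PX || c.height < MIN_SIDE_PX || c.byteSize < MIN_BYTES) {
            return false; // tracking pixels, icons, tiny thumbnails
        }
        double aspect = Math.max(c.width, c.height) / (double) Math.min(c.width, c.height);
        if (aspect > MAX_ASPECT_RATIO) {
            return false; // very wide or very tall images are often banners
        }
        String url = c.url.toLowerCase(Locale.ROOT);
        for (String keyword : SUSPICIOUS_KEYWORDS) {
            if (url.contains(keyword)) {
                return false;
            }
        }
        return true;
    }

    /** Picks the passing candidate with the largest pixel area, or null if none pass. */
    static Candidate pickBest(List<Candidate> candidates) {
        Candidate best = null;
        long bestArea = -1;
        for (Candidate c : candidates) {
            long area = (long) c.width * c.height;
            if (isLikelyContentImage(c) && area > bestArea) {
                best = c;
                bestArea = area;
            }
        }
        return best;
    }
}
```

Keyword matching on URLs is crude and will produce false positives (a legitimate photo whose file name contains "logo", for example), so a real extractor would likely combine several weaker signals rather than reject on any single one.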
Output Formatting
Once Goose has the top node where we think the content is, Goose will
try to format the content of that node for the output. For example,
for NLP-type applications, Goose's output formatter will just suck all
the text and ignore everything else, and other (custom) extractors can
be built to offer a more Flipboardy-type experience.
You can try the textracto API, which automatically identifies the main content of an HTML file. It can also parse OpenGraph metadata, so you can extract images as well (og:image).
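
For reference, reading og:image yourself is not hard. The sketch below is a dependency-free illustration using regular expressions; it is not the textracto API, and the class name OpenGraphImage is made up. Regex matching on HTML is fragile, so in a real app an HTML parser such as Jsoup would be the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: finds the og:image URL in raw HTML using regexes. */
final class OpenGraphImage {
    // Any <meta ...> tag; assumes attribute values do not contain '>'.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // property="og:image" with single or double quotes (not og:image:width etc.).
    private static final Pattern PROPERTY =
            Pattern.compile("\\bproperty\\s*=\\s*([\"'])og:image\\1", Pattern.CASE_INSENSITIVE);
    // content="..." with single or double quotes; group 2 is the value.
    private static final Pattern CONTENT =
            Pattern.compile("\\bcontent\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the first og:image URL found, or null if there is none. */
    static String find(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (PROPERTY.matcher(tag).find()) {
                Matcher content = CONTENT.matcher(tag);
                if (content.find()) {
                    return content.group(2);
                }
            }
        }
        return null;
    }
}
```

Calling OpenGraphImage.find(html) on a page's source returns the declared preview image, which many news sites and blogs set to the article's lead image. Pages without OpenGraph tags return null, so a fallback such as scanning the article body for images is still needed.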