Strip down a webpage to just text on iOS (Objective-C)

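My main goal is to achieve something like Readability or Safari's Reader, where the main content of a web page is converted to text. I don't actually want to display any images, just get all the important text of the page. I am currently using some fairly long self-built code to parse the page for <h1>s to figure out what the title might be, and I am also parsing for <p>s, which I hope contain most of the page's content.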

-(void)interpretAndDisplay {
NSURL *URL = [NSURL URLWithString:self.url];
NSData *data = [NSData dataWithContentsOfURL:URL];
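// [data bytes] is not NUL-terminated, so decode by length instead of using stringWithUTF8String:.
NSString *html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];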

//Getting the H1s
NSMutableArray *h1Full = [self stringsBetweenString:@"<h1" andString:@">" andText:html];

if ([h1Full count] > 0) {
    NSMutableArray *h1Content = [self stringsBetweenString:[NSString stringWithFormat:@"<h1%@>",[h1Full firstObject]] andString:@"</h1>" andText:html];
    NSMutableArray *h1Sanitize = [self stringsBetweenString:@"<" andString:@">" andText:html];

    if ([h1Content count] > 0) {
        NSString *finalTitle = [h1Content firstObject];

        for (int i = 0; i < [h1Sanitize count]; i++) {
            NSString *toRemove = [NSString stringWithFormat:@"<%@>",[h1Sanitize objectAtIndex:i]];
            finalTitle = [finalTitle stringByReplacingOccurrencesOfString:toRemove withString:@""];
            finalTitle = [finalTitle stringByReplacingOccurrencesOfString:@"\n" withString:@""];

        }

        finalTitle = [self sanitizeString:finalTitle];

        [self.titleLabel setText:finalTitle];
    }

}

//Now for the body!
NSMutableArray *pTag = [self stringsBetweenString:@"<p" andString:@">" andText:html];
if ([pTag count] > 0) {
    NSMutableArray *pContent = [self stringsBetweenString:[NSString stringWithFormat:@"<p%@>",[pTag firstObject]] andString:@"</p>" andText:html];

    NSMutableArray *pSanitize = [self stringsBetweenString:@"<" andString:@">" andText:html];

    if ([pContent count] > 0) {

        for (int i = 0; i < [pContent count]; i++) {
            NSString *pToEdit = [pContent objectAtIndex:i];

            // Use j here so the inner loop does not shadow the outer index i.
            for (int j = 0; j < [pSanitize count]; j++) {
                NSString *toRemove = [NSString stringWithFormat:@"<%@>",[pSanitize objectAtIndex:j]];
                pToEdit = [pToEdit stringByReplacingOccurrencesOfString:toRemove withString:@""];
            }

            [pContent replaceObjectAtIndex:i withObject:pToEdit];
        }

        for (int i = 0; i < [pContent count]; i++) {
            NSString *pToEdit = [pContent objectAtIndex:i];
            pToEdit = [pToEdit stringByReplacingOccurrencesOfString:@"\n" withString:@""];
            [pContent replaceObjectAtIndex:i withObject:pToEdit];
        }

        NSString *finalBody = @"";

        for (int i = 0; i < [pContent count]; i++) {

            if ([finalBody isEqualToString:@""]) {
                finalBody = [NSString stringWithFormat:@"%@",[pContent objectAtIndex:i]];
            }

            else {
                finalBody = [NSString stringWithFormat:@"%@\n\n%@",finalBody,[pContent objectAtIndex:i]];
            }
        }

        finalBody = [self sanitizeString:finalBody];

        [self.textLabel setText:finalBody];
    }

}
}

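The code above does a good job of extracting all the elements and cleaning them up using methods I wrote. The problem is that analyzing only the P tags sometimes fails to strip the page down to its content at all, while analyzing every possible content tag could scramble the order and layout of the content.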

Is there a better approach, or a framework, that can convert all the text into a nice string?

EDIT

Looking around, I found the Boilerpipe project (https://github.com/k-bx/boilerpipe/wiki/QuickStart), which makes extracting the text extremely easy. It looks as simple as: String text = ArticleExtractor.INSTANCE.getText(url);

Can I do this in Objective-C?

EDIT 2

There seems to be a Boilerpipe API, but its requests are limited. I am mainly looking for a user-side solution.

I don't think regex is the most forgiving approach.

I would try to find an existing open-source implementation (e.g. https://github.com/Kerrick/readability-js) and use WebKit to inject the JS into the web page after it has loaded.

After that you can inject another piece of JS that extracts the processed content (using the appropriate class from the source).

Then, using JavaScriptCore, you can pass the contents of the div to Objective-C (JS provides many ways to do this).
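
A rough sketch of those steps, under several assumptions. It uses WKWebView's evaluateJavaScript:completionHandler: to hand the result back to Objective-C rather than going through JavaScriptCore directly. The class ReaderExtractor, its properties, and the element id 'readability-content' are all hypothetical: the real id or class has to be taken from the source of whichever readability script you inject, and the script's text is assumed to have been loaded into readabilitySource by the caller.

```objc
#import <Foundation/Foundation.h>
#import <WebKit/WebKit.h>

// Hypothetical helper; the class, property, and method names are illustrative.
@interface ReaderExtractor : NSObject <WKNavigationDelegate>
@property (nonatomic, strong) WKWebView *webView;
// Source text of the readability script. Must be set by the caller before
// extractTextFromURL: is called (for example, read from a file in the app bundle).
@property (nonatomic, copy) NSString *readabilitySource;
// Receives the extracted text, or nil together with an error.
@property (nonatomic, copy) void (^completion)(NSString *text, NSError *error);
- (void)extractTextFromURL:(NSURL *)url;
@end

@implementation ReaderExtractor

- (void)extractTextFromURL:(NSURL *)url {
    // The web view is never added to a view hierarchy; it only hosts the page.
    self.webView = [[WKWebView alloc] initWithFrame:CGRectZero];
    self.webView.navigationDelegate = self;
    [self.webView loadRequest:[NSURLRequest requestWithURL:url]];
}

// Called once the page has finished loading.
- (void)webView:(WKWebView *)webView didFinishNavigation:(WKNavigation *)navigation {
    // Step 1: inject the readability script. "null" is appended so that the
    // script evaluates to a value WebKit can convert back to Objective-C.
    NSString *inject = [self.readabilitySource stringByAppendingString:@"\n;null;"];
    [webView evaluateJavaScript:inject completionHandler:^(id result, NSError *error) {
        if (error) {
            if (self.completion) self.completion(nil, error);
            return;
        }
        // Step 2: inject a second script that returns the text of the processed
        // container. ASSUMPTION: 'readability-content' is a placeholder id; if it
        // is not found, this falls back to the text of the whole body.
        NSString *extract =
            @"(function () {"
             "  var node = document.getElementById('readability-content') || document.body;"
             "  return node.innerText;"
             "})()";
        [webView evaluateJavaScript:extract completionHandler:^(id text, NSError *extractError) {
            // A JavaScript string arrives as an NSString; anything else is treated as no result.
            NSString *body = [text isKindOfClass:[NSString class]] ? text : nil;
            if (self.completion) self.completion(body, extractError);
        }];
    }];
}

// Report load failures instead of leaving the caller waiting.
- (void)webView:(WKWebView *)webView didFailProvisionalNavigation:(WKNavigation *)navigation withError:(NSError *)error {
    if (self.completion) self.completion(nil, error);
}

@end
```

The caller needs to keep a strong reference to the ReaderExtractor until the completion block has run. If the injected script does its work asynchronously, the second script may have to be delayed or triggered by the script itself; that depends on the script you choose.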