How can I drop 2-3 elements from an HTML node and scrape the rest?

Question

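To be precise, I have a class, say A, that I select in rvest with html_nodes. A can contain many child classes and many HTML tags, such as links and img tags. I want to drop a few specific classes and tags from A and scrape the rest of the data. I don't know the classes of the remaining data; I only know what I want to blacklist.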

HTML (hypothetical). The <div class="messageContent"> element is repeated up to 25 times in the document, with different content but the same structure.

<div class="messageContent">
<article>
<blockquote class="messageText SelectQuoteContainer ugc baseHtml">
<div class="bbCodeBlock bbCodeQuote" data-author="Generic">

<aside>
<div class="attribution type">Generic said:
<a href="goto/post?id=32554#post-32754" class="AttributionLink">&uarr;</a>
</div>
<blockquote class="quoteContainer"><div class="quote">I see what you did there.</div><div class="quoteExpand">Click to expand...</div></blockquote>
</aside>

</div><img src="styles/default/xenforo/clear.png" class="mceSmilieSprite      mceSmilie9" alt=":o" title="Eek!    :o"/> Really?
<aside>
<div class="attribution type">Generic said:
<a href="goto/post?id=32554#post-32754" class="AttributionLink">&uarr;</a>
</div>
<blockquote class="quoteContainer"><div class="quote">I see what you did there.</div><div class="quoteExpand">Click to expand...</div></blockquote>
</aside>

<div class="messageTextEndMarker">&nbsp;</div>
</blockquote>
</article>
</div>

So the page I am scraping contains several of these elements. I run

posts <- page %>% html_nodes(".messageContent")

This gives me a list of 25 HTML nodes, each containing a variation of the HTML above.

I want to remove everything inside the <aside> and </aside> tags (which can occur in multiple places within a post) and capture the rest of the content with html_text() %>% as.character()

Can I do this with rvest?

Testing @Mirosław Zalewski's solution:

test <- page %>% html_node(".messageContent") %>%
    html_nodes(xpath='//*[not(ancestor::aside or name()="aside")]/text()')

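This returned all the text on the page that is not inside an aside, not just the text of the first post, because an XPath that starts with // searches the whole document rather than only the selected node. A small tweak gave me: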

page %>% html_nodes(xpath='(//div[@class="messageContent"])[1]//*[not(ancestor::aside or name()="aside")]/text()') %>% html_text() %>% as.character()

Iterating the index [1] over the 25 posts gives me exactly what I need.
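The index-based query above can probably also be written with a relative XPath, so that each post is searched on its own. The following is only a sketch under assumptions: the two-post HTML string is a hypothetical reduction of the markup in the question, and post_text is an illustrative name, not something from the original code.

```r
library(rvest)

# Hypothetical two-post page, reduced from the markup in the question.
html <- '
<div class="messageContent"><article><blockquote>
  <aside><div class="quote">quoted one</div></aside>
  First reply
</blockquote></article></div>
<div class="messageContent"><article><blockquote>
  <aside><div class="quote">quoted two</div></aside>
  Second reply
</blockquote></article></div>'

page  <- read_html(html)
posts <- html_nodes(page, ".messageContent")

# ".//" starts the search at the current post;
# a bare "//" would start at the document root instead.
xp <- './/*[not(ancestor::aside or name()="aside")]/text()'

# One string per post: trim every text node, drop the empty ones, join the rest.
post_text <- vapply(posts, function(post) {
  pieces <- html_text(html_nodes(post, xpath = xp), trim = TRUE)
  paste(pieces[nzchar(pieces)], collapse = " ")
}, character(1))

print(post_text)
```

This avoids building one query string per index, at the cost of an explicit loop over the node set.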

Answer 1

With XPath, you can select all nodes that are neither <aside> nor descendants of <aside>:

page %>% html_node(".messageContent") %>%
    html_nodes(xpath='//*[not(ancestor::aside or name()="aside")]')

Unfortunately, this also matches elements that contain <aside>. If you pass the result to html_text(), it will still return the text content of <aside>.
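A small illustration of that pitfall, assuming a hypothetical one-post snippet (the HTML string and the name kept are made up for this sketch):

```r
library(rvest)

# Hypothetical one-post snippet: a quote in <aside>, followed by the reply.
page <- read_html(
  '<div class="messageContent"><blockquote><aside>quoted text</aside>Really?</blockquote></div>'
)

kept <- html_nodes(
  page,
  xpath = '//div[@class="messageContent"]//*[not(ancestor::aside or name()="aside")]'
)

# The <aside> element itself is filtered out; only its parent remains.
print(html_name(kept))

# html_text() on the parent still includes the text of the <aside> inside it.
print(html_text(kept))
```

The filter removes <aside> from the set of matched elements, but html_text() collects the text of all descendants of whatever was matched, so the quoted text comes back through the enclosing <blockquote>.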

This can be fixed by adding another condition to the query. A good candidate is "everything that is a text node":

page %>% html_node(".messageContent") %>%
    html_nodes(xpath='//*[not(ancestor::aside or name()="aside")]/text()')

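In fact, /text() returns only text nodes, which almost lets you skip the html_text() call entirely. However, many of those text nodes are junk (whitespace only), and html_text() can trim them (trim = TRUE), so you may still want to call it.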
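A minimal sketch of that clean-up step, again on a hypothetical snippet (the names raw, trimmed and kept are illustrative only):

```r
library(rvest)

# Hypothetical snippet with indentation around the <aside>.
page <- read_html('<div class="messageContent"><blockquote>
  <aside>quoted text</aside>
  Really?
</blockquote></div>')

nodes <- html_nodes(
  page,
  xpath = '//div[@class="messageContent"]//*[not(ancestor::aside or name()="aside")]/text()'
)

raw     <- html_text(nodes)               # text nodes as parsed, whitespace included
trimmed <- html_text(nodes, trim = TRUE)  # leading and trailing whitespace stripped
kept    <- trimmed[nzchar(trimmed)]       # drop nodes that were whitespace only

print(kept)
```

Trimming alone leaves empty strings behind for whitespace-only nodes, which is why the last line filters them out.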

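Note that this solution also skips any non-text content, such as image references (possibly including smilies). Your original proposal would do the same, but it is not clear to me whether that is intentional.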
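If the images do matter, one possible alternative (not part of the XPath approach above) is to delete the <aside> nodes from the parsed document with xml2::xml_remove() and then read whatever is left. This is only a sketch on a hypothetical snippet; smilies and text are illustrative names.

```r
library(rvest)
library(xml2)

# Hypothetical snippet: a quote in <aside>, a smilie image, then the reply.
page <- read_html('<div class="messageContent"><blockquote>
  <aside><div class="quote">I see what you did there.</div></aside>
  <img src="clear.png" class="mceSmilie9" alt=":o"/> Really?
</blockquote></div>')

post <- html_node(page, ".messageContent")

# xml_remove() edits the parsed document in place,
# so `page` loses its <aside> nodes as well.
xml_remove(html_nodes(post, xpath = ".//aside"))

smilies <- html_attr(html_nodes(post, "img"), "alt")  # alt text of remaining images
text    <- html_text(post, trim = TRUE)               # remaining text of the post

print(smilies)
print(text)
```

Because the removal is destructive, re-read the page (or work on a fresh parse) if the quoted text is needed later.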

r

rvest