Get HTML tags in order in VBA

Question

I am scraping a website so that I can write its data to an Excel sheet.

<div class="article-content">
  <h4> ３１. 出勤 – しゅっきん : đi làm</h4>
  <h5> Ví dụ :</h5>
  <p> 毎朝８時に出勤している:
     <br> Hàng sáng tôi đi làm vào lúc 8h
     <br> 多くの会社では出勤時間は９時だ
     <br> Nhiều công ty đều quy định giờ làm việc là 9h</p>

  <h4> ３２. 出世 – しゅっせ : thăng tiến</h4>
  <h5> Ví dụ :</h5>
  <p>出世もしたいが、仕事ばかりの人生の嫌だ.
     <br> Tớ muốn thăng tiến nhưng mà lại gét cuộc sống toàn công việc
     <br> 同期の中で、山田さんが一番出世が早い。
     <br> Trong số những người cùng khóa, anh yamada là người thăng tiến nhanh nhất</p>
</div>

I want to extract the text like this:

３１. 出勤 – しゅっきん : đi làm
毎朝８時に出勤している:
<br> Hàng sáng tôi đi làm vào lúc 8h
<br> 多くの会社では出勤時間は９時だ
<br> Nhiều công ty đều quy định giờ làm việc là 9h

３２. 出世 – しゅっせ : thăng tiến
出世もしたいが、仕事ばかりの人生の嫌だ.
<br> Tớ muốn thăng tiến nhưng mà lại gét cuộc sống toàn công việc
<br> 同期の中で、山田さんが一番出世が早い。
<br> Trong số những người cùng khóa, anh yamada là người thăng tiến nhanh nhất

My code so far: I have extracted the contents of all the p tags, but I also need the h4 tags to organize them.

Dim IE As Object
Set IE = CreateObject("internetexplorer.application")

IE.Visible = True
IE.navigate "https://tuhoconline.net/tu-vung-n2-sach-mimi-kara-oboeru-4.html"

' Wait until the page has finished loading (readyState 4 = complete)
Do While IE.readyState <> 4
    DoEvents
Loop

Set doc = IE.document

For i = 0 To 50
    inputText = doc.getElementsByTagName("p")(i).innerHTML
    outputStr() = Split(inputText, "<br>")
Next i

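I tried getElementsByClassName and getElementsByTagName, but each returns its own separate collection, so I cannot combine the results in the order I want. Has anyone solved this in VBA? Any help is much appreciated.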

Answer 1

CSS selectors:

You can use CSS selectors to target the information you want.

Pattern 1:

.article-content h4

This matches the h4 tags inside the element with class article-content. There are 10 on the page. The "." is a class selector.

Pattern 2:

.article-content h4 + h5 + p

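You want the paragraph that comes immediately after the h5 tag that comes immediately after an h4 tag, all inside the element with class article-content. There are 10 of these as well, so the two sets of matches line up one-to-one.

"+" is the adjacent sibling combinator. It separates two selectors and matches the second element only if it immediately follows the first and both are children of the same parent element.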

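To see the two patterns side by side, here is a minimal sketch run against a small in-memory document rather than the live page. The markup is a hypothetical, shortened stand-in for the question's sample, the procedure name DemoSiblingSelector is made up for illustration, and the code assumes a reference to Microsoft HTML Object Library is set.

```vba
Option Explicit

' Sketch only: requires a reference to Microsoft HTML Object Library.
Public Sub DemoSiblingSelector()
    Dim doc As New HTMLDocument
    Dim headings As Object, paragraphs As Object
    Dim i As Long

    ' Hypothetical markup modelled on the question's sample.
    ' The "Extra" paragraph follows a p, not an h5, so pattern 2 should skip it.
    doc.body.innerHTML = _
        "<div class=""article-content"">" & _
        "<h4>Entry A</h4><h5>Example:</h5><p>First A<br>Second A</p>" & _
        "<p>Extra paragraph</p>" & _
        "<h4>Entry B</h4><h5>Example:</h5><p>First B</p>" & _
        "</div>"

    ' Pattern 1: every h4 inside .article-content
    Set headings = doc.querySelectorAll(".article-content h4")
    ' Pattern 2: the p directly after the h5 directly after an h4
    Set paragraphs = doc.querySelectorAll(".article-content h4 + h5 + p")

    ' Both lists are in document order, so the same index pairs them up.
    For i = 0 To headings.Length - 1
        Debug.Print headings.item(i).innerText
        Debug.Print paragraphs.item(i).innerText
    Next i
End Sub
```

Because the extra paragraph is not directly preceded by an h5, it is not part of the second list, which is what keeps the headings and paragraphs aligned by index.
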
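Notes:

Because each of the two patterns matches more than one item, you use the document's querySelectorAll method, which returns a nodeList of the matching elements. You then loop over the .Length of that list, indexing into the nodeList to retrieve each item.

I dropped the browser automation and issue an XMLHttpRequest GET request instead. This is a faster way to retrieve the page content.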
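
One thing to watch with the request-based approach is text encoding, since the page contains Japanese and Vietnamese characters. The following is a sketch of one common way to decode the response bytes, assuming the page is served as UTF-8 (an assumption, not something stated above). FetchHtmlAsUtf8 is a hypothetical helper name; it uses late-bound MSXML2.XMLHTTP and ADODB.Stream objects.

```vba
Option Explicit

' Hypothetical helper: downloads a page and decodes the body as UTF-8.
' Assumes the server really sends UTF-8; adjust the charset otherwise.
Public Function FetchHtmlAsUtf8(ByVal url As String) As String
    Dim body As Variant

    ' Synchronous GET request; no browser window is opened.
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .send
        body = .responseBody   ' raw bytes of the response
    End With

    ' Write the bytes into a binary stream, then read them back as UTF-8 text.
    With CreateObject("ADODB.Stream")
        .Type = 1              ' adTypeBinary
        .Open
        .Write body
        .Position = 0
        .Type = 2              ' adTypeText
        .Charset = "utf-8"
        FetchHtmlAsUtf8 = .ReadText
        .Close
    End With
End Function
```

A call such as sResponse = FetchHtmlAsUtf8("https://tuhoconline.net/tu-vung-n2-sach-mimi-kara-oboeru-4.html") could then stand in for a byte-to-string conversion that depends on the system code page.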

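CSS query in action (applying the CSS selectors)

[Image: pattern 1, sample of matching results]

[Image: pattern 2, sample of matching results]

VBA:

As follows: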

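Option Explicit

' Requires a reference to Microsoft HTML Object Library (Tools > References).
Public Sub GetInfo()
    Dim sResponse As String, html As New HTMLDocument, i As Long
    Dim hNodeList As Object, pNodeList As Object

    ' Fetch the page with a synchronous GET request instead of automating a browser.
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://tuhoconline.net/tu-vung-n2-sach-mimi-kara-oboeru-4.html", False
        .send
        ' StrConv decodes the bytes with the system code page, so
        ' non-ANSI text (for example UTF-8 Japanese) may come out garbled.
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    ' Drop anything that precedes the doctype declaration.
    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))

    With html
        .body.innerHTML = sResponse
        Set hNodeList = .querySelectorAll(".article-content h4")
        Set pNodeList = .querySelectorAll(".article-content h4 + h5 + p")

        ' Both lists are in document order, so index i pairs each heading with its paragraph.
        For i = 0 To hNodeList.Length - 1
            Debug.Print hNodeList.item(i).innerText
            Debug.Print pNodeList.item(i).innerText
        Next i
    End With
End Sub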

References (set in the VBA editor via Tools > References):

Microsoft HTML Object Library
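
Since the goal in the question is an Excel sheet rather than the Immediate window, here is a rough sketch of how the paired lists might be written out, one line per row. WriteEntriesToSheet is a hypothetical helper, not part of the answer above; it assumes both node lists have the same length and that the paragraph lines are separated by br tags, as in the sample markup.

```vba
Option Explicit

' Hypothetical helper: writes each heading, then each line of its paragraph,
' down column A of the given worksheet. Assumes both lists have equal length.
Public Sub WriteEntriesToSheet(ByVal hNodeList As Object, _
                               ByVal pNodeList As Object, _
                               ByVal ws As Worksheet)
    Dim i As Long, j As Long, r As Long
    Dim parts() As String
    Dim lineText As String

    r = 1
    For i = 0 To hNodeList.Length - 1
        ' Heading row
        ws.Cells(r, 1).Value = Trim$(hNodeList.item(i).innerText)
        r = r + 1

        ' Split the paragraph on <br>, ignoring case because the parser
        ' may serialise the tag in upper case.
        parts = Split(pNodeList.item(i).innerHTML, "<br>", -1, vbTextCompare)
        For j = LBound(parts) To UBound(parts)
            ' Remove line breaks left over from the source formatting.
            lineText = Replace(Replace(parts(j), vbCr, ""), vbLf, "")
            ws.Cells(r, 1).Value = Trim$(lineText)
            r = r + 1
        Next j

        r = r + 1   ' blank row between entries
    Next i
End Sub
```

It could be called once both lists are populated, for example with WriteEntriesToSheet hNodeList, pNodeList, ThisWorkbook.Worksheets(1). Because innerHTML is used, HTML entities in the text would be written as-is.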
