用 VBA 解析 HTML 内容

Question

目前，我正在努力解析 data.cnbc.com/quotes/sdrl 中的引用 table，并将 innerhtml 放入我指定的代码旁边的列中。

所以，我会从 A2 中获取品种，然后将收益率数据放入 C2，然后移动到下一个品种。

HTML 看起来像：

<table id="fundamentalsTableOne">
  <tbody>
    <tr scope="row">
        <th scope="row">EPS</th>
        <td>8.06</td>
    </tr>
    <tr scope="row">
        <th scope="row">Market Cap</th>
        <td>5.3B</td>
    </tr>
    <tr scope="row">
        <th scope="row">Shares Out</th>
        <td>492.8M</td>
    </tr>
    <tr scope="row">
        <th scope="row">Price/Earnings</th>
        <td>1.3x</td>
    </tr>
</tbody>
</table>
<table id="fundamentalsTableTwo">
  <tbody>
    <tr scope="row">
        <th scope="row">Revenue (TTM)</th>
        <td>5.0B</td>   
    </tr>
    <tr scope="row">
        <th scope="row">Beta</th>
        <td>1.84</td>
    </tr>
    <tr scope="row">
        <th scope="row">Dividend</th>
        <td>--</td>
    </tr>
    <tr scope="row">
        <th scope="row">Yield</th>
        <td><span class="pos">0.00%</span></td>
    </tr>
  </tbody>
</table>

目前，我有：

Sub getInfoWeb()

Dim cell As Integer
Dim xhr As MSXML2.XMLHTTP60
Dim doc As MSHTML.HTMLDocument
Dim table As MSHTML.HTMLTable
Dim tableCells As MSHTML.IHTMLElementCollection

Set xhr = New MSXML2.XMLHTTP60

For cell = 2 To 5

ticker = Cells(cell, 1).Value

    With xhr

        .Open "GET", "http://data.cnbc.com/quotes/" & ticker, False
        .send

        If .readyState = 4 And .Status = 200 Then
            Set doc = New MSHTML.HTMLDocument
            doc.body.innerHTML = .responseText
        Else
            MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
            vbNewLine & "HTTP request status: " & .Status
        End If

    End With

    Set table = doc.getElementById("fundamentalsTableOne")
    Set tableCells = table.getElementsByTagName("td")

    For Each tableCell In tableCells

            Cells(cell, 2).Value = tableCell.NextSibling.innerHTML

    Next tableCell

Next cell

End Sub

但是，我收到一个 "access is denied" 错误，并且在我的 set tablecells 行出现运行时 91。是不是每行只有一个元素，tablecells被设置为一个集合？另外，"access is denied" 错误是由于 javascript 生成的 HTML 造成的吗？我认为这不是问题。

如果有人知道如何使这项工作正常进行，我们将不胜感激。谢谢。

Answer 1

我刚刚浏览了该站点，我认为您可以在没有浏览器的情况下完成此操作 object。

问题是这些网站通常使用 Ajax 之类的东西来动态更新较小的 div 而无需刷新整个页面。新数据通常仍然到达 html（尽管可能被压缩），因此它仍然可以在 HTML 文档中被解析，但是它来自对不同 URL 的调用。

特别是对于这个站点，您最初从 quotes.cnbc.com 获取 GET，然后您的浏览器在后台悄悄地从 data.cnbc.com 获取另一个，最后从 [=27] 获取您想要的 table =].如果所有这些都是必要的，您仍然可以使用 http 请求 object 来完成所有这些，如果不需要 cookie 并且 post 数据不是由JS在前两个。

我建议你下载一个像Fiddler 4这样的网络流量监视器。它是免费的，而且在这样的项目中必不可少。

这是第一次有点混乱，所以这里有一个快速入门。在您打开它并第一次拨打 CNBC 之后，在左侧面板中找到它并突出显示。然后在右上角的面板中单击 "inspectors" 选项卡，然后单击 "raw"。这将显示您的浏览器发送给 CNBC 的 header 和 post 数据，这是您要在 HTTP 请求中复制的内容。在右下方的面板中，您可以单击原始以查看响应 header 和 body，以及状态代码、HTML 语法、呈现 html（没有 css) 等...您可以使用这些来确定哪个请求 returns 您真正想要的数据，并确切地查看它是如何到达的。

我想你会惊讶于你们之间的距离到底有多近。

Answer 2

下面是一个示例，展示了如何获取所需的数据：

GetData "sdrl"

Sub GetData(sSymbol)
    Dim sRespText, arrName, oDict, sResult, sItem
    XmlHttpRequest "GET", "http://data.cnbc.com/quotes/" & sSymbol, "", "", "", sRespText
    ParseToNestedArr "<span data-field=""name"">([\s\S]*?)</span>", sRespText, arrName
    XmlHttpRequest "GET", "http://apps.cnbc.com/company/quote/newindex.asp?symbol=" & sSymbol, "", "", "", sRespText
    ParseToDict "<tr[\s\S]*?><th[\s\S]*?>([\s\S]*?)</th><td>(?:<span[\s\S]*?>)*([\s\S]*?)(?:</span>)*</td></tr>", sRespText, oDict
    sResult = arrName(0)(0) & vbCrLf & vbCrLf
    For Each sItem in oDict.Keys
        sResult = sResult & sItem & " = " & oDict(sItem) & vbCrLf
    Next
    MsgBox sResult
End Sub

Sub ParseToDict(sPattern, sResponse, oList)
    Dim oMatch, arrSMatches
    Set oList = CreateObject("Scripting.Dictionary")
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        For Each oMatch In .Execute(sResponse)
            oList(oMatch.SubMatches(0)) = oMatch.SubMatches(1)
        Next
    End With
End Sub

Sub ParseToNestedArr(sPattern, sResponse, arrMatches)
    Dim oMatch, arrSMatches, sSubMatch
    arrMatches = Array()
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        For Each oMatch In .Execute(sResponse)
            arrSMatches = Array()
            For Each sSubMatch in oMatch.SubMatches
                PushItem arrSMatches, sSubMatch
            Next
            PushItem arrMatches, arrSMatches
        Next
    End With
End Sub

Sub PushItem(arrList, varItem)
    ReDim Preserve arrList(UBound(arrList) + 1)
    arrList(UBound(arrList)) = varItem
End Sub

Sub XmlHttpRequest(sMethod, sUrl, arrSetHeaders, sFormData, sRespHeaders, sRespText)
    Dim arrHeader
    With CreateObject("Msxml2.ServerXMLHTTP.3.0")
        .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
        .Open sMethod, sUrl, False
        If IsArray(arrSetHeaders) Then
            For Each arrHeader In arrSetHeaders
                .SetRequestHeader arrHeader(0), arrHeader(1)
            Next
        End If
        .Send sFormData
        sRespHeaders = .GetAllResponseHeaders
        sRespText = .ResponseText
    End With
End Sub

它使用后期绑定，因为最初的目标语言是 VBScript，但如果需要，将它们更改为早期绑定并不难。第二个 link http://apps.cnbc.com/company/quote/newindex.asp?symbol=SDRL 你可以在网页内容中找到 iframe 源。

用 VBA 解析 HTML 内容

Parse HTML Content with VBA

html

excel

parsing

vba

web-scraping