Web scraping Morningstar financial data with VBA

I am trying to get the insider ownership figure from Morningstar. URL: http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US

This is the code I am using:

Sub test()

    Dim appIE As Object

    Set appIE = CreateObject("InternetExplorer.Application")
    With appIE
        .Navigate "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US"
        .Visible = True
    End With
    While appIE.Busy
        DoEvents
    Wend
    Set allRowOfData = appIE.Document.getElementById("currentInsiderVal")
    Debug.Print allRowOfData
    Dim myValue As String: myValue = allRowOfData.Cells(0).innerHTML
    appIE.Quit
    Set appIE = Nothing

    Range("A30").Value = myValue

End Sub

I get run-time error 13 (type mismatch) on this line:

Set allRowOfData = appIE.Document.getElementById("currentInsiderVal")

But I cannot see any mismatch. What is going on?

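Looking at the website, the id of the element you are trying to retrieve contains a typo. Try currrentInsiderVal (with three r's) instead of currentInsiderVal and you should be able to retrieve the data correctly.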

It may be worth adding some error trapping so that similar problems are caught for any other fields you retrieve.
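As a rough illustration of that kind of error trapping, here is a minimal sketch. The helper name GetTextById is hypothetical (it is not part of the original code), and it assumes that a failed lookup either raises an error or returns Nothing; both cases fall through to the same check.

```vba
' Hypothetical helper: returns the inner text of the element with the given id,
' or an empty string (plus a note in the Immediate window) if it cannot be found.
Function GetTextById(doc As Object, elementId As String) As String
    Dim el As Object

    ' If the lookup raises an error, el simply stays Nothing
    On Error Resume Next
    Set el = doc.getElementById(elementId)
    On Error GoTo 0

    If el Is Nothing Then
        Debug.Print "Element not found: " & elementId
        GetTextById = ""
    Else
        GetTextById = el.innerText
    End If
End Function
```

It could be called as myValue = GetTextById(appIE.Document, "currrentInsiderVal"), so a misspelled id shows up as a message rather than a run-time error.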

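After seeing your comment I took a closer look. Your problem seems to be that you are trying to grab a single cell by its id rather than navigating down the object tree. I modified the code to retrieve the table row you are after and then set myValue to the correct cell in that row. It seems to work when I try it. Give it a go?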

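Sub test()

    Dim appIE As Object
    Dim allRowOfData As Object
    Dim myValue As String

    Set appIE = CreateObject("InternetExplorer.Application")

    With appIE
        .Navigate "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US"
        .Visible = True
    End With

    ' Wait until the page has finished loading (readyState 4 = complete)
    While appIE.Busy Or appIE.readyState <> 4
        DoEvents
    Wend

    ' Walk down from the table to the sixth row of its body (index 5),
    ' then read the third cell of that row (index 2)
    Set allRowOfData = appIE.Document.getElementById("tableTest").getElementsByTagName("tbody")(0).getElementsByTagName("tr")(5)
    myValue = allRowOfData.Cells(2).innerHTML

    appIE.Quit
    Set appIE = Nothing

    Range("A30").Value = myValue
End Sub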

You can use just XHR and RegEx instead of the cumbersome IE automation:

Sub Test()
    Dim sContent
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US", False
        .Send
        sContent = .ResponseText
    End With
    With CreateObject("VBScript.RegExp")
        .Pattern = ",""currInsiderVal"":(.*?),"
        Range("A30").Value = .Execute(sContent).Item(0).SubMatches(0)
    End With
End Sub

Here is an explanation of how the code works:

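First, an MSXML2.XMLHTTP ActiveX instance is created. A GET request to the target URL is opened in synchronous mode (execution pauses until the response is received).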
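The request step can also be wrapped so that a failed download is reported instead of being passed silently to the regex. This is only a sketch: the function name FetchText and the error number are illustrative choices, not something from the original answer.

```vba
' Hypothetical helper: fetches a URL synchronously and returns the response body.
' Raises an error if the server does not answer with HTTP status 200.
Function FetchText(url As String) As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False      ' False = synchronous request
        .Send
        If .Status <> 200 Then
            Err.Raise vbObjectError + 513, "FetchText", _
                "HTTP " & .Status & " " & .StatusText & " for " & url
        End If
        FetchText = .ResponseText
    End With
End Function
```

With this in place, sContent = FetchText("http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US") would replace the first With block of the code above.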

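Then a VBScript.RegExp instance is created. By default its .IgnoreCase, .Global and .MultiLine properties are False. The pattern is ,"currInsiderVal":(.*?), where (.*?) is a capturing group, . matches any character except a line break, .* matches zero or more characters, and .*? matches as few characters as possible (lazy matching). The other characters in the pattern are matched literally. The .Execute method returns a collection of matches, which contains only one match object because .Global is False. That match object has a collection of submatches, which contains only one submatch because the pattern has a single capturing group.

There are some useful MSDN articles on regular expressions: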
Microsoft Beefs Up VBScript with Regular Expressions
Introduction to Regular Expressions
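To make the match and submatch collections described above concrete, here is a small sketch that runs the same pattern against a fragment copied from the page's script data (the procedure name RegexDemo and the shortened sample string are just for illustration).

```vba
Sub RegexDemo()
    Dim sample As String
    Dim matches As Object

    ' Shortened fragment of the script data found in the response
    sample = """currInstVal"":4370.46,""currInsiderVal"":143.51,""use10YearData"":""false"","

    With CreateObject("VBScript.RegExp")
        .Pattern = ",""currInsiderVal"":(.*?),"
        Set matches = .Execute(sample)
    End With

    Debug.Print matches.Count                      ' 1, because .Global is False
    If matches.Count > 0 Then
        Debug.Print matches(0).Value               ' ,"currInsiderVal":143.51,
        Debug.Print matches(0).SubMatches.Count    ' 1, one capturing group
        Debug.Print matches(0).SubMatches(0)       ' 143.51
    End If
End Sub
```

Checking matches.Count before reading Item(0) avoids a run-time error if the page layout changes and the pattern no longer matches.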

Here is a description of how I arrived at the code:

First, I used the browser to find the element containing the target value in the page DOM.

The corresponding node is:

<td align="right" id="currrentInsiderVal">143.51</td>

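Then I made the XHR and found this node in the response HTML, but it did not contain the value (after refreshing the page, the response can be found in the Network tab of the browser developer tools):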

<td align="right" id="currrentInsiderVal">
</td>

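This behavior is typical of DHTML. Dynamic HTML content is generated by scripts after the page has loaded, either after retrieving data from the web via XHR or by processing data that was already loaded with the page. I then simply searched the response for the value 143.51, and the snippet ,"currInsiderVal":143.51, turned up inside a JS function: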

            fundsArr = {"fundTotalHistVal":132.61,"mutualFunds":[[1,89,"#a71620"],[2,145,"#a71620"],[3,152,"#a71620"],[4,198,"#a71620"],[5,155,"#a71620"],[6,146,"#a71620"],[7,146,"#a71620"],[8,132,"#a71620"]],"insiderHisMaxVal":3.535,"institutions":[[1,273,"#283862"],[2,318,"#283862"],[3,351,"#283862"],[4,369,"#283862"],[5,311,"#283862"],[6,298,"#283862"],[7,274,"#283862"],[8,263,"#283862"]],"currFundData":[2,2202,"#a6001d"],"currInstData":[1,4370,"#283864"],"instHistMaxVal":369,"insiders":[[5,0.042,"#ff6c21"],[6,0.057,"#ff6c21"],[7,0.057,"#ff6c21"],[8,3.535,"#ff6c21"],[5,0],[6,0],[7,0],[8,0]],"currMax":4370,"histLineQuars":[[1,"Q2"],[2,"Q3"],[3,"Q4"],[4,"Q1<br>2015"],[5,"Q2"],[6,"Q3"],[7,"Q4"],[8,"Q1<br>2016"]],"fundHisMaxVal":198,"currInsiderData":[3,143,"#ff6900"],"currFundVal":2202.85,"quarters":[[1,"Q2"],[2,""],[3,""],[4,"Q1<br>2015"],[5,""],[6,""],[7,""],[8,"Q1<br>2016"]],"insiderTotalHistVal":3.54,"currInstVal":4370.46,"currInsiderVal":143.51,"use10YearData":"false","instTotalHistVal":263.74,"maxValue":369};

So the regex pattern built from it should find the fragment ,"currInsiderVal":<some text>, where <some text> is our target value.
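The same idea extends to the other fields in that script data, such as currInstVal. Below is a minimal sketch; the function name ExtractValue is hypothetical, and it assumes the key contains no regex metacharacters and that the value is both preceded and followed by a comma, as in the fragment above.

```vba
' Hypothetical helper: returns the text between ,"<key>": and the next comma,
' or Empty if the key is not found in the content.
Function ExtractValue(content As String, key As String) As Variant
    Dim matches As Object

    With CreateObject("VBScript.RegExp")
        ' Same shape as the original pattern, with the key name substituted in
        .Pattern = ",""" & key & """:(.*?),"
        Set matches = .Execute(content)
    End With

    If matches.Count = 0 Then
        ExtractValue = Empty
    Else
        ExtractValue = matches(0).SubMatches(0)
    End If
End Function
```

For example, ExtractValue(sContent, "currInstVal") would be expected to return 4370.46 for the response shown above, and IsEmpty(...) can be used to detect a missing key.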