Web scraping Morningstar financial data with VBA
I am trying to scrape the insider ownership figure from this Morningstar URL:
http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US
This is the code I am using:
Sub test()
    Dim appIE As Object
    Set appIE = CreateObject("InternetExplorer.Application")
    With appIE
        .Navigate "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US"
        .Visible = True
    End With
    While appIE.Busy
        DoEvents
    Wend
    Set allRowOfData = appIE.Document.getElementById("currentInsiderVal")
    Debug.Print allRowOfData
    Dim myValue As String: myValue = allRowOfData.Cells(0).innerHTML
    appIE.Quit
    Set appIE = Nothing
    Range("A30").Value = myValue
End Sub
I get run-time error 13 (type mismatch) on this line:
Set allRowOfData = appIE.Document.getElementById("currentInsiderVal")
but I can't see any mismatch. What is going on?
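Looking at the site, the id of the element you are trying to retrieve is misspelled in the page itself; try currrentInsiderVal (with three r's) instead of currentInsiderVal and you should be able to retrieve the data.
It may be worth adding some error trapping to catch similar problems for any other fields you retrieve.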
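A minimal sketch of the error trapping suggested above, assuming a hypothetical helper named TryGetElementText (it is not part of the original code). It treats both a raised error and a Nothing result as "element not found", because which of the two a missing id produces may depend on how the document object is bound:

```vba
' Hypothetical helper: returns True and fills result when the element exists.
Function TryGetElementText(ByVal doc As Object, ByVal elementId As String, _
                           ByRef result As String) As Boolean
    Dim el As Object

    On Error Resume Next                 ' a missing id must not stop the macro
    Set el = doc.getElementById(elementId)
    On Error GoTo 0

    If el Is Nothing Then Exit Function  ' function result stays False
    result = el.innerText
    TryGetElementText = True
End Function
```

It could be called as If TryGetElementText(appIE.Document, "currrentInsiderVal", txt) Then Range("A30").Value = txt, with a message logged in the Else branch so that a misspelled id shows up immediately.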
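After seeing your comment I took a closer look. The problem appears to be that you are trying to grab a single cell by its id rather than navigating down the object tree. I modified the code to retrieve the table row you are after and then set myValue to the right cell in that row. It seems to work when I try it. Give it a go?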
Sub test()
    Dim appIE As Object
    Dim allRowOfData As Object
    Dim myValue As String
    Set appIE = CreateObject("InternetExplorer.Application")
    With appIE
        .Navigate "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US"
        .Visible = True
    End With
    ' Wait until IE reports that it is no longer busy
    While appIE.Busy
        DoEvents
    Wend
    ' Walk down from the table to its sixth body row, then read the third cell
    Set allRowOfData = appIE.Document.getElementById("tableTest").getElementsByTagName("tbody")(0).getElementsByTagName("tr")(5)
    myValue = allRowOfData.Cells(2).innerHTML
    appIE.Quit
    Set appIE = Nothing
    Range("A30").Value = myValue
End Sub
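Hard-coded row and cell indexes fail with a run-time error as soon as the page layout changes. The sketch below shows one way to walk the same table, row, cell path with bounds checks. GetCellText is a hypothetical helper name, and unlike the code above it counts every tr in the table (header rows included) rather than only those in the first tbody, so the indexes are not interchangeable:

```vba
' Hypothetical helper: walks table -> tr -> cell and returns "" when any step is missing.
Function GetCellText(ByVal doc As Object, ByVal tableId As String, _
                     ByVal rowIndex As Long, ByVal cellIndex As Long) As String
    Dim tbl As Object, trList As Object, tr As Object

    On Error Resume Next                        ' tolerate a missing table id
    Set tbl = doc.getElementById(tableId)
    On Error GoTo 0
    If tbl Is Nothing Then Exit Function

    Set trList = tbl.getElementsByTagName("tr") ' all rows of the table, zero-based
    If rowIndex < 0 Or rowIndex >= trList.Length Then Exit Function

    Set tr = trList.Item(rowIndex)
    If cellIndex < 0 Or cellIndex >= tr.Cells.Length Then Exit Function

    GetCellText = tr.Cells.Item(cellIndex).innerText
End Function
```

An empty return value then signals "not found" and can be checked before anything is written to the worksheet.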
You can use XHR and RegEx instead of the cumbersome IE:
Sub Test()
    Dim sContent
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US", False
        .Send
        sContent = .ResponseText
    End With
    With CreateObject("VBScript.RegExp")
        .Pattern = ",""currInsiderVal"":(.*?),"
        Range("A30").Value = .Execute(sContent).Item(0).SubMatches(0)
    End With
End Sub
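Here is how the code works:
First, an MSXML2.XMLHTTP ActiveX instance is created. A GET request to the target URL is opened in synchronous mode (execution blocks until the response is received).
Then a VBScript.RegExp instance is created. By default its .IgnoreCase, .Global and .MultiLine properties are False. The pattern is ,"currInsiderVal":(.*?), where (.*?) is a capturing group, . matches any character except a line break, .* matches zero or more characters, and .*? matches as few characters as possible (lazy matching). The other characters in the pattern are matched literally. The .Execute method returns a collection of matches; it contains only one match object because .Global is False. That match object has a collection of submatches containing a single submatch, because the pattern has only one capturing group.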
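The one-liner above raises an error when the pattern finds nothing, because .Item(0) does not exist on an empty match collection. A minimal sketch of a more defensive variant follows. ExtractValue is a hypothetical helper, it swaps the lazy (.*?) for a negated character class, and it assumes the key contains no regex metacharacters:

```vba
' Hypothetical helper: returns the raw text after "key": up to the next comma or brace.
Function ExtractValue(ByVal source As String, ByVal key As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = False                          ' the default: stop at the first match
    re.Pattern = """" & key & """:([^,}]+)"    ' i.e. "key":(one or more chars that are not , or })

    Set matches = re.Execute(source)
    If matches.Count = 0 Then Exit Function    ' key absent: return ""
    ExtractValue = matches.Item(0).SubMatches(0)
End Function

Sub DemoExtractValue()
    Dim sample As String
    sample = "{""currInstVal"":4370.46,""currInsiderVal"":143.51,""use10YearData"":""false""}"
    Debug.Print ExtractValue(sample, "currInsiderVal")   ' prints 143.51
    Debug.Print ExtractValue(sample, "missingKey") = ""  ' prints True
End Sub
```

Checking matches.Count first keeps a changed page from stopping the macro; the caller can decide what to do with an empty result.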
There are some useful MSDN articles on regular expressions:
Microsoft Beefs Up VBScript with Regular Expressions
Introduction to Regular Expressions
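Here is how I arrived at the code:
First, I used the browser to find the element in the page DOM that contains the target value. The corresponding node is: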
<td align="right" id="currrentInsiderVal">143.51</td>
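Then I made the XHR and found this node in the response HTML, but it does not contain the value (after refreshing the page, the response can be found in the Network tab of the browser developer tools):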
<td align="right" id="currrentInsiderVal">
</td>
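This behavior is typical of DHTML. Dynamic HTML content is generated by scripts after the page has loaded, either after retrieving data from the web via XHR or by processing data that was already loaded with the page. I then simply searched the response for the value 143.51 and found the fragment ,"currInsiderVal":143.51, inside a JS function: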
fundsArr = {"fundTotalHistVal":132.61,"mutualFunds":[[1,89,"#a71620"],[2,145,"#a71620"],[3,152,"#a71620"],[4,198,"#a71620"],[5,155,"#a71620"],[6,146,"#a71620"],[7,146,"#a71620"],[8,132,"#a71620"]],"insiderHisMaxVal":3.535,"institutions":[[1,273,"#283862"],[2,318,"#283862"],[3,351,"#283862"],[4,369,"#283862"],[5,311,"#283862"],[6,298,"#283862"],[7,274,"#283862"],[8,263,"#283862"]],"currFundData":[2,2202,"#a6001d"],"currInstData":[1,4370,"#283864"],"instHistMaxVal":369,"insiders":[[5,0.042,"#ff6c21"],[6,0.057,"#ff6c21"],[7,0.057,"#ff6c21"],[8,3.535,"#ff6c21"],[5,0],[6,0],[7,0],[8,0]],"currMax":4370,"histLineQuars":[[1,"Q2"],[2,"Q3"],[3,"Q4"],[4,"Q1<br>2015"],[5,"Q2"],[6,"Q3"],[7,"Q4"],[8,"Q1<br>2016"]],"fundHisMaxVal":198,"currInsiderData":[3,143,"#ff6900"],"currFundVal":2202.85,"quarters":[[1,"Q2"],[2,""],[3,""],[4,"Q1<br>2015"],[5,""],[6,""],[7,""],[8,"Q1<br>2016"]],"insiderTotalHistVal":3.54,"currInstVal":4370.46,"currInsiderVal":143.51,"use10YearData":"false","instTotalHistVal":263.74,"maxValue":369};
So the regular expression pattern built from it should find the fragment ,"currInsiderVal":<some text>, where <some text> is our target value.
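The captured target value is text, not a number. If a numeric value is wanted, one option is Val, which always treats the period as the decimal separator, whereas CDbl follows the system's regional settings. A small sketch, with the captured string hard-coded for illustration:

```vba
Sub DemoToNumber()
    Dim captured As String, amount As Double
    captured = "143.51"          ' what SubMatches(0) would hold for the response shown above
    amount = Val(captured)       ' Val recognizes only "." as the decimal separator
    Debug.Print amount > 143 And amount < 144   ' prints True
End Sub
```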