如何从多个网页中提取列出的数据 - 找不到 table 标签
How to pull listed data from several web pages - Cannot locate table tag
首先,我已经在网上阅读了很多与该主题相关的不同答案,但我不得不承认,我真的很难使它们适应我的需要,所以非常感谢您的帮助!
我需要提取以下网页(第 1-7 页)中列出的数据,即基金名称、价格、货币等 https://toolkit.financialexpress.net/santanderam 并将这些数据提取到 excel。
我有以下代码可以打开 IE 页面(正在运行):
' return the document containg the DOM of the page strWebAddress
' returns Nothing if the timeout lngTimeoutInSeconds was reached
Public Function GetIEDocument(ByVal strWebAddress As String, Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLDocument
Dim IE As SHDocVw.InternetExplorer
Dim IEDocument As MSHTML.HTMLDocument
Dim dateNow As Date
' create an IE application, representing a tab
Set IE = New SHDocVw.InternetExplorer
' optionally make the application visible, though it will work perfectly fine in the background otherwise
IE.Visible = True
' open a webpage in the tab represented by IE and wait until the main request successfully finished
' times out after lngTimeoutInSeconds with a warning
IE.Navigate strWebAddress
dateNow = Now
Do While IE.Busy
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop
' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
' times out after lngTimeoutInSeconds with a warning
Set IEDocument = IE.Document
dateNow = Now
Do While IEDocument.ReadyState <> "complete"
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop
Set GetIEDocument = IEDocument
End Function
但是我找不到包含我感兴趣的所有其他标签的 table 标签,以允许其余代码提取数据,下面的代码是我目前所拥有的:
Public Sub GetTeamData()
Dim strWebAddress As String
Dim strH2AnchorContent As String
Dim IEDocument As MSHTML.HTMLDocument
Dim objH2 As MSHTML.HTMLHeaderElement
Dim objTable As MSHTML.HTMLTable
Dim objRow As MSHTML.HTMLTableRow
Dim objCell As MSHTML.HTMLTableCell
Dim lngRow As Long
Dim lngColumn As Long
' initialize some variables that should probably better be passed as paramaters or defined as constants
strWebAddress = "https://toolkit.financialexpress.net/santanderam"
strH2AnchorContent = " "
' open page
Set IEDocument = GetIEDocument(strWebAddress)
If IEDocument Is Nothing Then
MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
Exit Sub
End If
' retrieve anchor element
For Each objH2 In IEDocument.getElementsByTagName("h2")
If objH2.innerText = strH2AnchorContent Then Exit For
Next objH2
If objH2 Is Nothing Then
MsgBox "Could not find """ & strH2AnchorContent & """ in DOM!", vbCritical
Exit Sub
End If
' traverse HTML tree to desired table element
' * move up one element in the hierarchy
' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
' * move down two elements n the hierarchy
Set objTable = objH2.parentElement _
.NextSibling.NextSibling _
.NextSibling.NextSibling _
.NextSibling.NextSibling _
.Children(0) _
.Children(0)
' iterate over the table and output its contents
lngRow = 1
For Each objRow In objTable.Rows
lngColumn = 1
For Each objCell In objRow.Cells
Cells(lngRow, lngColumn) = objCell.innerText
lngColumn = lngColumn + 1
Next objCell
lngRow = lngRow + 1
Next
End Sub
我假设我是否可以找到正确的 table 标记以在下面的行中输入:
strH2AnchorContent = " "
那么上面的方法行得通吗?如果是这样,任何人都可以帮助找到正确的标签或建议我在上面的地方出了什么问题吗?
再次感谢任何帮助!
谢谢
编辑 1
更新代码:
' open a webpage in the tab represented by IE and wait until the main request successfully finished
' times out after lngTimeoutInSeconds with a warning
IE.Navigate strWebAddress
dateNow = Now
Do While IE.Busy
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop
' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
' times out after lngTimeoutInSeconds with a warning
Set IEDocument = IE.Document
dateNow = Now
Do While IEDocument.ReadyState <> "complete"
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop
Set GetIEDocument = IEDocument
End Function
Public Sub GetTeamData()
Dim strWebAddress As String
Dim strH2AnchorContent As String
Dim IEDocument As MSHTML.HTMLDocument
Dim objH2 As MSHTML.HTMLHeaderElement
Dim obTable As MSHTML.HTMLTable
Dim objRow As MSHTML.HTMLTableRow
Dim objCell As MSHTML.HTMLTableCell
Dim lngRow As Long
Dim lngColumn As Long
' initialize some variables that should probably better be passed as paramaters or defined as constants
strWebAddress = "https://toolkit.financialexpress.net/santanderam"
' open page
Set IEDocument = GetIEDocument(strWebAddress)
If IEDocument Is Nothing Then
MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
Exit Sub
End If
' retrieve anchor element
Set oTable = IEDocument.getElementById("Price_1_1")
Debug.Print oTable.innerText
' iterate over the table and output its contents
lngRow = 1
For Each objRow In oTable.Rows
lngColumn = 1
For Each objCell In objRow.Cells
Cells(lngRow, lngColumn) = objCell.innerText
lngColumn = lngColumn + 1
Next objCell
lngRow = lngRow + 1
Next
End Sub
您的代码工作正常,问题是您试图在加载之前从 table 捕获数据。我添加了一个简单的 Wait
循环 5 秒,您当前的代码捕获了数据。下面是我在 Set oTable = IEDocument.getElementById("Price_1_1")
语句之前添加的循环:
dateNow = Now
bExitLoop = False
lngTimeoutInSeconds = 5
Do While Not bExitLoop
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Do
Loop
上面的代码是静态的 5 秒等待。你可以让它更有活力..我会把它留在那里作为脑筋急转弯:)
首先,我已经在网上阅读了很多与该主题相关的不同答案,但我不得不承认,我真的很难使它们适应我的需要,所以非常感谢您的帮助!
我需要提取以下网页(第 1-7 页)中列出的数据,即基金名称、价格、货币等 https://toolkit.financialexpress.net/santanderam 并将这些数据提取到 excel。
我有以下代码可以打开 IE 页面(正在运行):
' return the document containg the DOM of the page strWebAddress
' returns Nothing if the timeout lngTimeoutInSeconds was reached
Public Function GetIEDocument(ByVal strWebAddress As String, Optional ByVal lngTimeoutInSeconds As Long = 15) As MSHTML.HTMLDocument
Dim IE As SHDocVw.InternetExplorer
Dim IEDocument As MSHTML.HTMLDocument
Dim dateNow As Date
' create an IE application, representing a tab
Set IE = New SHDocVw.InternetExplorer
' optionally make the application visible, though it will work perfectly fine in the background otherwise
IE.Visible = True
' open a webpage in the tab represented by IE and wait until the main request successfully finished
' times out after lngTimeoutInSeconds with a warning
IE.Navigate strWebAddress
dateNow = Now
Do While IE.Busy
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop
' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
' times out after lngTimeoutInSeconds with a warning
Set IEDocument = IE.Document
dateNow = Now
Do While IEDocument.ReadyState <> "complete"
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop
Set GetIEDocument = IEDocument
End Function
但是我找不到包含我感兴趣的所有其他标签的 table 标签,以允许其余代码提取数据,下面的代码是我目前所拥有的:
Public Sub GetTeamData()
Dim strWebAddress As String
Dim strH2AnchorContent As String
Dim IEDocument As MSHTML.HTMLDocument
Dim objH2 As MSHTML.HTMLHeaderElement
Dim objTable As MSHTML.HTMLTable
Dim objRow As MSHTML.HTMLTableRow
Dim objCell As MSHTML.HTMLTableCell
Dim lngRow As Long
Dim lngColumn As Long
' initialize some variables that should probably better be passed as paramaters or defined as constants
strWebAddress = "https://toolkit.financialexpress.net/santanderam"
strH2AnchorContent = " "
' open page
Set IEDocument = GetIEDocument(strWebAddress)
If IEDocument Is Nothing Then
MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
Exit Sub
End If
' retrieve anchor element
For Each objH2 In IEDocument.getElementsByTagName("h2")
If objH2.innerText = strH2AnchorContent Then Exit For
Next objH2
If objH2 Is Nothing Then
MsgBox "Could not find """ & strH2AnchorContent & """ in DOM!", vbCritical
Exit Sub
End If
' traverse HTML tree to desired table element
' * move up one element in the hierarchy
' * skip two elements to proceed to the third (interjected each time with whitespace that is interpreted as an element of its own)
' * move down two elements n the hierarchy
Set objTable = objH2.parentElement _
.NextSibling.NextSibling _
.NextSibling.NextSibling _
.NextSibling.NextSibling _
.Children(0) _
.Children(0)
' iterate over the table and output its contents
lngRow = 1
For Each objRow In objTable.Rows
lngColumn = 1
For Each objCell In objRow.Cells
Cells(lngRow, lngColumn) = objCell.innerText
lngColumn = lngColumn + 1
Next objCell
lngRow = lngRow + 1
Next
End Sub
我假设我是否可以找到正确的 table 标记以在下面的行中输入:
strH2AnchorContent = " "
那么上面的方法行得通吗?如果是这样,任何人都可以帮助找到正确的标签或建议我在上面的地方出了什么问题吗?
再次感谢任何帮助!
谢谢
编辑 1
更新代码:
' open a webpage in the tab represented by IE and wait until the main request successfully finished
' times out after lngTimeoutInSeconds with a warning
IE.Navigate strWebAddress
dateNow = Now
Do While IE.Busy
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop
' retrieve the webpage's content (that is, the HTML DOM) and wait until everything is loaded (images, etc.)
' times out after lngTimeoutInSeconds with a warning
Set IEDocument = IE.Document
dateNow = Now
Do While IEDocument.ReadyState <> "complete"
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Function
Loop
Set GetIEDocument = IEDocument
End Function
Public Sub GetTeamData()
Dim strWebAddress As String
Dim strH2AnchorContent As String
Dim IEDocument As MSHTML.HTMLDocument
Dim objH2 As MSHTML.HTMLHeaderElement
Dim obTable As MSHTML.HTMLTable
Dim objRow As MSHTML.HTMLTableRow
Dim objCell As MSHTML.HTMLTableCell
Dim lngRow As Long
Dim lngColumn As Long
' initialize some variables that should probably better be passed as paramaters or defined as constants
strWebAddress = "https://toolkit.financialexpress.net/santanderam"
' open page
Set IEDocument = GetIEDocument(strWebAddress)
If IEDocument Is Nothing Then
MsgBox "Timeout reached opening this address:" & vbNewLine & strWebAddress, vbCritical
Exit Sub
End If
' retrieve anchor element
Set oTable = IEDocument.getElementById("Price_1_1")
Debug.Print oTable.innerText
' iterate over the table and output its contents
lngRow = 1
For Each objRow In oTable.Rows
lngColumn = 1
For Each objCell In objRow.Cells
Cells(lngRow, lngColumn) = objCell.innerText
lngColumn = lngColumn + 1
Next objCell
lngRow = lngRow + 1
Next
End Sub
您的代码工作正常,问题是您试图在加载之前从 table 捕获数据。我添加了一个简单的 Wait
循环 5 秒,您当前的代码捕获了数据。下面是我在 Set oTable = IEDocument.getElementById("Price_1_1")
语句之前添加的循环:
dateNow = Now
bExitLoop = False
lngTimeoutInSeconds = 5
Do While Not bExitLoop
If Now > DateAdd("s", lngTimeoutInSeconds, dateNow) Then Exit Do
Loop
上面的代码是静态的 5 秒等待。你可以让它更有活力..我会把它留在那里作为脑筋急转弯:)