Scraping Wikipedia for Season and Episodes (VBA EXCEL)

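I am trying to build a program using Excel 2013 and Visual Basic. As part of that program, I will list TV shows in a list box. I want to be able to double-click one of them to open another form containing a list box with all of that show's seasons and episodes.

The best way I have found to do this is to scrape Wikipedia.org. I think it will be one of the only sites that offers this information in roughly the same format for every show. I also intend to do the same thing with books.

I first read about scraping on this site: http://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm

However, I have never done anything with the getElementBy* methods, so I am not sure how they work. Any help would be much appreciated. After searching online, here is the best code I could piece together: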

 Private Sub cmdTest_Click()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'for iteration
Dim i As Integer
Dim j As Integer

'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to Episodes ..."
DoEvents
Loop

'show text of HTML document returned
Set html = ie.document

'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""

'clear old data out and put titles in
Cells.Clear

'put heading across the top of row 3
Range("A3").Value = "Season"
Range("B3").Value = "Episode"

i = 4

For Each ele In html.getElementsByClassName("summary")
    Sheets("Wiki2").Range("B" & i).Value = ele.innerText
    i = i + 1
Next

i = 4

For Each ele In html.getElementsByClassName("mw-headline")
     Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
     i = i + 1
 Next
End Sub

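The first part seems to be a fairly generic way of getting the source of a given page. I currently have it set up to pull the episodes of the TV show "Archer". The following code: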

i = 4

    For Each ele In html.getElementsByClassName("summary")
        Sheets("Wiki2").Range("B" & i).Value = ele.innerText
        i = i + 1
    Next

    i = 4

    For Each ele In html.getElementsByClassName("mw-headline")
         Sheets("Wiki2").Range("A" & i).Value = Left(ele.innerText, 8)
         i = i + 1
     Next

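is what I use to extract the text I am looking for. What I need help with is combining these two loops. I need the code to go through and find each instance of "mw-headline" and, each time it finds one, look for the class name "summary". If it finds a summary, it should put the heading's (i.e. the season's) inner text in a cell in column A and the summary (i.e. the episode name) in the adjacent cell in column B.

At the moment these loops work independently. If you run this code, you get all of the show's episodes starting at cell B4, and a list of everything tagged "mw-headline" starting at cell A4. The problem is that "mw-headline" is used not only for seasons but for a few other things as well, so the code needs to check whether there is a "summary" class beneath each heading. That would also eliminate the case where the page says the show has a new season but, beneath the heading, only says that the new season is coming soon. Without a "summary" tag, the heading should not be listed. I want the season to appear in column A next to every episode in column B, so if each season has 10 episodes, column A would have 10 instances of "Season 1", then 10 instances of "Season 2", and so on.

Thanks for your help. For anyone who runs into this later and does not know: you need to put the following code at the top of the code window: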

Enum READYSTATE
    READYSTATE_UNINITIALIZED = 0
    READYSTATE_LOADING = 1
    READYSTATE_LOADED = 2
    READYSTATE_INTERACTIVE = 3
    READYSTATE_COMPLETE = 4
End Enum

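PS - In the code "For Each ele In *", is ele an undeclared variable, or is it a word that VBA recognizes as meaning "element"? I got this from copying and pasting, but I don't understand it. Thanks.

I suggest trying IMDb as well. The following code shows how to scrape seasons and episodes from both IMDb and Wikipedia via HTTP requests.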

Option Explicit

Sub ExtractDataWikipedia()
    Dim y, sUrl, sRespText, arrMatchSeasons, arrSeason, arrMatchEpisodes, arrEpisode

    sUrl = "https://en.wikipedia.org/wiki/List_of_Archer_episodes"
    ' sUrl = "https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
    ' sUrl = "https://en.wikipedia.org/wiki/List_of_DuckTales_episodes"

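    ' Download the page source
    XmlHttpRequest "GET", sUrl, "", "", "", sRespText
    ' Each match holds a season heading (submatch 0) and the HTML of the table that follows it (submatch 1)
    ParseToArray "<span class=""mw-headline"" id=""Season[\s\S]*?>.*?(Season.*?)<[\s\S]*?(<table[\s\S]*?</table>)", sRespText, arrMatchSeasons
    y = 1
    For Each arrSeason In arrMatchSeasons
        ' Collect every "summary" cell (episode title) inside this season's table
        ParseToArray "(<td class=""summary""[\s\S]*?</td>)", arrSeason(1), arrMatchEpisodes
        For Each arrEpisode In arrMatchEpisodes
            ' Season in column A, episode title (tags stripped) in column B
            Cells(y, 1).Value = arrSeason(0)
            Cells(y, 2).Value = GetInnerText(arrEpisode(0))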
            y = y + 1
        Next
    Next
End Sub

Sub ExtractDataIMDb()
    Dim y, sUrl, sRespText, arrData, arrMatchSeasons, arrSeason, sUrlEp, arrMatchEpisodes, arrEpisode

    sUrl = "http://www.imdb.com/title/tt1486217/episodes" ' Archer
    ' sUrl = "http://www.imdb.com/title/tt0096697/episodes" ' The Simpsons
    ' sUrl = "http://www.imdb.com/title/tt0092345/episodes" ' DuckTales

    XmlHttpRequest "GET", sUrl, "", "", "", sRespText
    ParseToArray "(<select id=""bySeason""[\s\S]*?</select>)", sRespText, arrData
    ParseToArray "<option[\s\S]*?value=""([\d]*)"">", arrData(0)(0), arrMatchSeasons
    y = 1
    For Each arrSeason In arrMatchSeasons
        DoEvents
        sUrlEp = sUrl & "?season=" & arrSeason(0)
        XmlHttpRequest "GET", sUrlEp, "", "", "", sRespText
        ParseToArray "itemprop=""episodes""[\s\S]*?itemprop=""name""[\s\S]*?>([\s\S]*?)</a>", sRespText, arrMatchEpisodes
        For Each arrEpisode In arrMatchEpisodes
            Cells(y, 3).Value = "Season " & arrSeason(0)
            Cells(y, 4).Value = arrEpisode(0)
            y = y + 1
        Next
    Next
    MsgBox "Completed"
End Sub

Sub XmlHttpRequest(sMethod, sUrl, arrSetHeaders, sFormData, sRespHeaders, sRespText)
    Dim arrHeader
    With CreateObject("Msxml2.ServerXMLHTTP.3.0")
        .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
        .Open sMethod, sUrl, False
        If IsArray(arrSetHeaders) Then
            For Each arrHeader In arrSetHeaders
                .SetRequestHeader arrHeader(0), arrHeader(1)
            Next
        End If
        .Send sFormData
        sRespHeaders = .GetAllResponseHeaders
        sRespText = .ResponseText
    End With
End Sub

Sub ParseToArray(sPattern, sResponse, arrMatches)
    Dim oMatch, arrSMatches, sSubMatch
    arrMatches = Array()
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = sPattern
        For Each oMatch In .Execute(sResponse)
            arrSMatches = Array()
            For Each sSubMatch In oMatch.SubMatches
                PushItem arrSMatches, sSubMatch
            Next
            PushItem arrMatches, arrSMatches
        Next
    End With
End Sub

Sub PushItem(arrList, varItem)
    ReDim Preserve arrList(UBound(arrList) + 1)
    arrList(UBound(arrList)) = varItem
End Sub

Function GetInnerText(sText)
    With CreateObject("htmlfile")
        .Write ("<body>" & sText & "</body>")
        GetInnerText = .DocumentElement.Document.GetElementsByTagName("body")(0).InnerText
    End With
End Function

Regarding parsing HTML with RegExp: here is a disclaimer and alternative

Update

For example, the code below retrieves a table from IMDb containing the season, episode, title, and air date:
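As one possible DOM-based alternative to the "summary" regular expression, the table HTML captured for a season could be loaded into a late-bound "htmlfile" document (the same object GetInnerText already uses) and its cells filtered by class name. This is only a sketch: the function name GetSummaryCells is made up for illustration, and it assumes the episode cells carry exactly the class "summary", as in the pattern above.

```vba
Option Explicit

' Hypothetical helper: returns the inner text of every <td class="summary">
' cell found in the HTML fragment sHtml.
Function GetSummaryCells(sHtml As String) As Collection
    Dim oDoc As Object
    Dim oCell As Object
    Dim colResult As New Collection

    ' Parse the fragment with the built-in MSHTML document object
    Set oDoc = CreateObject("htmlfile")
    oDoc.Write "<html><body>" & sHtml & "</body></html>"
    oDoc.Close

    ' Walk the table cells and keep those whose class is exactly "summary"
    For Each oCell In oDoc.getElementsByTagName("td")
        If oCell.className = "summary" Then colResult.Add oCell.innerText
    Next

    Set GetSummaryCells = colResult
End Function
```

In ExtractDataWikipedia, the inner ParseToArray call could then be replaced by looping over GetSummaryCells(arrSeason(1)) and writing each item to column B.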

Option Explicit

Sub ExtractDataIMDB()

    Dim i As Long
    Dim sURL As String
    Dim sRespText As String
    Dim aData
    Dim aMatchSeasons
    Dim aSeason
    Dim sUrlEp As String
    Dim aMatchEpisodes
    Dim aEpisode
    Dim aResult() As String
    Dim aCells

    ReDim aResult(1 To 4, 1 To 1)
    aResult(1, 1) = "Season"
    aResult(2, 1) = "Episode"
    aResult(3, 1) = "Title"
    aResult(4, 1) = "Air date"

    sURL = "http://www.imdb.com/title/tt1486217/episodes" ' Archer
    ' sUrl = "http://www.imdb.com/title/tt0096697/episodes" ' The Simpsons
    ' sUrl = "http://www.imdb.com/title/tt0092345/episodes" ' DuckTales

    XmlHttpRequest "GET", sURL, "", "", "", sRespText
    ParseToArray "(<select id=""bySeason""[\s\S]*?</select>)", sRespText, aData
    ParseToArray "<option[\s\S]*?value=""([\d]*)"">", aData(0)(0), aMatchSeasons
    i = 2
    For Each aSeason In aMatchSeasons
        DoEvents
        sUrlEp = sURL & "?season=" & aSeason(0)
        XmlHttpRequest "GET", sUrlEp, "", "", "", sRespText
        ParseToArray "itemprop=""episodes""[\s\S]*?itemprop=""episodeNumber"" content=""(.*?)""[\s\S]*?<div class=""airdate"">[\r\n\s]*([\s\S]*?)[\r\n\s]*</div>[\s\S]*?itemprop=""name""[\s\S]*?>([\s\S]*?)</a>", sRespText, aMatchEpisodes
        For Each aEpisode In aMatchEpisodes
            ReDim Preserve aResult(1 To 4, 1 To i)
            aResult(1, i) = aSeason(0)
            aResult(2, i) = aEpisode(0)
            aResult(3, i) = aEpisode(2)
            aResult(4, i) = aEpisode(1)
            i = i + 1
        Next
    Next
    aCells = WorksheetFunction.Transpose(aResult)
    Cells.Delete
    Output Cells(1, 1), aCells

    MsgBox "Completed"
End Sub

Sub XmlHttpRequest(sMethod, sURL, aSetHeaders, sFormData, sRespHeaders, sRespText)
    Dim aHeader
    With CreateObject("Msxml2.ServerXMLHTTP.3.0")
        .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
        .Open sMethod, sURL, False
        If IsArray(aSetHeaders) Then
            For Each aHeader In aSetHeaders
                .SetRequestHeader aHeader(0), aHeader(1)
            Next
        End If
        .Send sFormData
        sRespHeaders = .GetAllResponseHeaders
        sRespText = .ResponseText
    End With
End Sub

Sub ParseToArray(sPattern, sResponse, aMatches)
    Dim oMatch, aSubMatches, sSubMatch
    aMatches = Array()
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .Pattern = sPattern
        For Each oMatch In .Execute(sResponse)
            aSubMatches = Array()
            For Each sSubMatch In oMatch.SubMatches
                PushItem aSubMatches, sSubMatch
            Next
            PushItem aMatches, aSubMatches
        Next
    End With
End Sub

Sub PushItem(aArray, vElement)
    ReDim Preserve aArray(UBound(aArray) + 1)
    aArray(UBound(aArray)) = vElement
End Sub

Sub Output(oDstRng As Range, aCells As Variant)
    With oDstRng
        .Parent.Select
        With .Resize( _
            UBound(aCells, 1) - LBound(aCells, 1) + 1, _
            UBound(aCells, 2) - LBound(aCells, 2) + 1 _
        )
            .NumberFormat = "@"
            .Value = aCells
            .Columns.AutoFit
        End With
    End With
End Sub

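Here is a possible solution. I looked at the HTML of that particular page, and it poses quite a challenge for associating seasons with episodes. I took a step back and figured that, since the seasons are presumably in numerical order, we don't need to scrape anything for the season number. On the page you provided, the episodes of each season are in the same table, so I just scrape every episode from one table and assume it is season 1, assume all the episodes from the next table are season 2, and so on.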

Private Sub cmdTest_Click()
'to refer to the running copy of Internet Explorer
Dim ie As InternetExplorer
'to refer to the HTML document returned
Dim html As HTMLDocument
'for iteration
Dim i As Integer
Dim j As Integer

'open Internet Explorer in memory, and go to website
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "http://en.wikipedia.org/wiki/List_of_Archer_episodes"
'ie.navigate "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"

'Wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Trying to go to Episodes ..."
DoEvents
Loop

'show text of HTML document returned
Set html = ie.document

'close down IE and reset status bar
Set ie = Nothing
Application.StatusBar = ""

'clear old data out and put titles in
Cells.Clear

'put heading across the top of row 3
Range("A3").Value = "Season"
Range("B3").Value = "Episode"

i = 4
Dim season As Integer: season = 1
'loop variables (declared so the code also compiles under Option Explicit)
Dim tableTag As Object, objEpisode As Object

'look through each table for "summary" (you could change this to be something a bit more discriminating!)
For Each tableTag In html.getElementsByTagName("table")

    If InStr(1, tableTag.innerHTML, "summary") > 0 Then
        For Each objEpisode In tableTag.getElementsByClassName("summary")
            'repeat the season label next to every episode, as the question asks
            Sheets(1).Range("A" & i).Value = "Season " & season
            Sheets(1).Range("B" & i).Value = objEpisode.innerText
            i = i + 1
        Next
        season = season + 1
    End If

Next

End Sub
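
If you would rather keep the real heading text than assume that the tables are in season order, one option is to walk the elements in document order, remember the most recent "mw-headline", and emit a row only when a "summary" cell is found. The sketch below is an illustration only: the function name PairSeasonsWithEpisodes is hypothetical, it works on an HTML string through a late-bound "htmlfile" document, and it assumes the page still uses exactly the class names "mw-headline" and "summary" from the question.

```vba
Option Explicit

' Hypothetical helper: returns a Collection of two-item arrays,
' Array(headingText, episodeTitle), one per "summary" cell in sHtml.
Function PairSeasonsWithEpisodes(sHtml As String) As Collection
    Dim oDoc As Object
    Dim oEl As Object
    Dim sSeason As String
    Dim colRows As New Collection

    Set oDoc = CreateObject("htmlfile")
    oDoc.Write "<html><body>" & sHtml & "</body></html>"
    oDoc.Close

    ' getElementsByTagName("*") yields the elements in document order
    For Each oEl In oDoc.getElementsByTagName("*")
        Select Case oEl.className
            Case "mw-headline"
                ' Remember the latest heading; it is only used if episodes follow
                sSeason = oEl.innerText
            Case "summary"
                If Len(sSeason) > 0 Then colRows.Add Array(sSeason, oEl.innerText)
        End Select
    Next

    Set PairSeasonsWithEpisodes = colRows
End Function
```

Headings with no "summary" cells beneath them (a references section, or a season that has only been announced) never produce a row, and each returned item can be written out with item(0) in column A and item(1) in column B.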