Excel Web Power Query: Excel merges data strings in cells --> How do I delimit the data?

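I am using Excel 2016 and want to download odds from Oddschecker.com into an Excel spreadsheet using the Power Query "From Web" feature.

More specifically, I am trying to download the data from this page: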

https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history

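The problem I am running into is that some of the odds on this site end up merged into a single cell, with no space between them.

Is there any way in Power Query to delimit the data strings/odds so that they are not merged?

Any help you can offer is much appreciated.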

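I can't test this, because the site is blacklisted in the Russian segment of the internet, but my guess is that there are <cr><lf> characters in there that are not being converted to new lines. What you need is to run Text.Replace on all cells that contain data, to replace those characters. After that you will probably want those values as separate rows, though, and that is a much more complex task. :)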
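
A minimal sketch of that replacement, assuming the merged values really are separated by CR/LF pairs (not verified against the site). The sample table and the ";" separator are made up for illustration; Table.ReplaceValue applies the text replacement to every listed column in one step:

```powerquery
let
    // Hypothetical sample standing in for the scraped table
    Source = #table(
        {"Bookie", "Odds"},
        {{"A", "11/4#(cr)#(lf)13/5"}, {"B", "5/2"}}
    ),
    // Replace each CR/LF pair with a visible separator in all columns.
    // Replacer.ReplaceText expects text values, so this assumes text-only columns.
    Cleaned = Table.ReplaceValue(
        Source,
        "#(cr)#(lf)",
        ";",
        Replacer.ReplaceText,
        Table.ColumnNames(Source)
    )
in
    Cleaned
```

If the columns are not all text, the column list would have to be narrowed to the text columns first.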

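Inspired by Gil Raviv's http://datachant.com/2017/03/30/web-scraping-power-bi-excel-power-query/

Edit, April 11, 2017: this solution depends heavily on the structure of the website. In other words, it worked fine yesterday, but unfortunately it no longer does today.

The following query, together with its associated function, works for me: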

let
    Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
    Data0 = Source{1}[Data],
    Children = Data0{0}[Children],
    Children1 = Children{1}[Children],
    Children2 = Children1{4}[Children],
    Children3 = Children2{0}[Children],
    Children4 = Children3{0}[Children],
    Children5 = Children4{0}[Children],
    Children6 = Children5{3}[Children],
    Children7 = Children6{0}[Children],
    Children8 = Children7{1}[Children],
    Children9 = Children8{3}[Children],
    Children10 = Children9{0}[Children],
    Children11 = Children10{2}[Children],
    Children12 = Children11{2}[Children],
    Children13 = Children12{0}[Children],
    Children14 = Children13{1}[Children],
    #"Removed Other Columns" = Table.SelectColumns(Children14,{"Children"}),
    #"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "EpandTables", each EpandTables([Children])),
    #"Expanded EpandTables" = Table.ExpandTableColumn(#"Invoked Custom Function", "EpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded EpandTables",{"Children"}),
    #"Removed Blank Rows" = Table.SelectRows(#"Removed Columns", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
    #"Parsed Date" = Table.TransformColumns(#"Removed Blank Rows",{{"Column1", each Date.From(DateTimeZone.From(_)), type date}})
in
    #"Parsed Date"

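Function ExpandTables (the query above calls it under its original, misspelled name "EpandTables"; edit: the #"Added Custom" line was adjusted by adding Table.SelectRows):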

(ChildTable as table) =>
let
    #"Removed Other Columns1" = Table.SelectColumns(ChildTable,{"Children"}),
    #"Added Custom" = Table.AddColumn(#"Removed Other Columns1", "Custom", each try if [Children] is null then null else if [Children][Text]{0} <> null then [Children][Text]{0} else Lines.ToText(List.Transform(Table.SelectRows([Children], each [Children] <> null)[Children], each _[Text]{0})) otherwise null),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Children"}),
    #"Transposed Table" = Table.Transpose(#"Removed Columns")
in
    #"Transposed Table"
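
To make the last step of the function easier to follow, here is a small sketch of what Table.Transpose does to a single column of cell texts. The sample values are made up and stand in for the cells of one TR row:

```powerquery
let
    // Hypothetical cell texts of one table row, stored one per row
    Cells = #table({"Custom"}, {{"2017-04-10"}, {"11/4"}, {"13/5"}}),
    // Transposing turns the single column into a single row;
    // the new columns get the default names Column1, Column2, Column3
    AsRow = Table.Transpose(Cells)
in
    AsRow
```

Those default names are why the calling query expands "Column1" through "Column29": it assumes no row has more than 29 cells.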

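An alternative approach, in the code below, uses a recursive function fnSearchTR (embedded in the query) that drills down into the HTML document until it finds the name "TR", or until it has run 100 iterations, which is only there to prevent endless recursion. I noticed that "TR" is where the required data lives, at least today. Remark: I also adjusted the second step in the code to select "Document".

This is a more dynamic solution, because the position of "TR" in the document structure no longer matters. If the document structure changes, it is still possible that some other "TR" is found first, but so far it works. Other "TR" content is picked up as well, but after the data type of the first column is changed to date, those rows show up as errors or nulls and are filtered out.

This query also uses the function "ExpandTables" from my earlier answer (I corrected the typo in its name by adding the "x"; otherwise the function is unchanged).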

let
    Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
    Data0 = Table.SelectRows(Source, each [Caption] = "Document"){0}[Data],
    ChildrenWithTable = Table.SelectRows(Data0, each [Children] is table),

    fnSearchTR = (newChildren as table, counter as number) as table =>
    let
        Combined = Table.Buffer(Table.Combine(newChildren[Children])),
        ChildrensChildrenWithTable = Table.AddColumn(newChildren, "ChildrensChildren", each Table.SelectRows([Children], each [Children] is table)),
        ChildrensChildrenCombined = Table.Combine(ChildrensChildrenWithTable[ChildrensChildren]),
        CombinedAll = if ChildrensChildrenCombined[Name]{0} = "TR" 
                        then ChildrensChildrenCombined 
                        else if Table.RowCount(ChildrensChildrenCombined) = 0 or counter = 100
                            then Combined 
                            else @fnSearchTR(ChildrensChildrenCombined, counter + 1) 
    in
        CombinedAll,

    CombinedAll = if Table.RowCount(ChildrenWithTable) = 0 then Data0 else fnSearchTR(ChildrenWithTable, 0),
    #"Filtered Rows" = Table.SelectRows(CombinedAll, each ([Name] = "TR")),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Children"}),
    #"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "ExpandTables", each ExpandTables([Children])),
    #"Removed Columns" = Table.RemoveColumns(#"Invoked Custom Function",{"Children"}),
    #"Expanded ExpandTables" = Table.ExpandTableColumn(#"Removed Columns", "ExpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded ExpandTables",{{"Column1", type date}}),
    #"Removed Errors" = Table.RemoveRowsWithErrors(#"Changed Type", {"Column1"}),
    #"Filtered Rows1" = Table.SelectRows(#"Removed Errors", each ([Column1] <> null))
in
    #"Filtered Rows1"
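
The recursive drill-down described above can be sketched on a much smaller scale. This is a simplified illustration, not the query itself: the tree is made up, nodes are records with a list of children instead of the nested tables that Web.Page returns, and fnFindLevel is a hypothetical name. The @ scoping operator is what lets the function refer to itself:

```powerquery
let
    // Hypothetical document tree: each node has a Name and a list of child nodes
    Tree = [Name = "HTML", Children = {
        [Name = "BODY", Children = {
            [Name = "TABLE", Children = {
                [Name = "TR", Children = {}],
                [Name = "TR", Children = {}]
            }]
        }]
    }],

    // Walk down one level at a time and return the first level that contains a "TR".
    // Returns an empty list if nothing is found or the depth cap of 100 is reached.
    fnFindLevel = (nodes as list, depth as number) as list =>
        let
            Next = List.Combine(List.Transform(nodes, each [Children]))
        in
            if List.IsEmpty(Next) or depth >= 100 then {}
            else if List.AnyTrue(List.Transform(Next, each [Name] = "TR")) then Next
            else @fnFindLevel(Next, depth + 1),

    Found = fnFindLevel({Tree}, 0),
    Names = List.Transform(Found, each [Name])
in
    Names
```

Checking for an empty level before looking at any names keeps the function from failing when a branch has no further children.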

The problem is that the HTML of one of the combined cells is:

<td><div class="oo">11/4</div><div class="oi">13/5</div><div class="oo">11/4</div></td>

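As far as I know, the layout rules for div do not imply a line break, so Power Query does not insert one. We don't run a full layout engine, so we have no way of knowing that the column width forces each div onto its own line.

(If anyone knows more about HTML layout semantics, let me know and I can suggest a fix to my team.)

You can text-replace the HTML like this to insert your own separator ; between the div elements: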

let
    WebPageWithReplace = (url as text, old as text, new as text) => 
        let
            Source = Web.Contents(url),
            TextReplace = Text.ToBinary(Text.Replace(Text.FromBinary(Source), old, new)),
            Page = Web.Page(TextReplace)
        in
            Page,
    Invoked = WebPageWithReplace(
        "https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history",
        "</div><div",
        "</div>;<div"),
    Data = Invoked{1}[Data]
in
    Data

This way, Web.Page still finds and parses the HTML table.
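
Once the cells contain ";"-separated values, they can be split in a follow-up step. A minimal sketch, assuming a made-up table with "Date" and "Odds" columns (the real column names will differ), that turns each odds value into its own row:

```powerquery
let
    // Hypothetical rows as they might look after the ";" replacement
    Source = #table(
        {"Date", "Odds"},
        {{"2017-04-10", "11/4;13/5;11/4"}, {"2017-04-11", "5/2"}}
    ),
    // Turn each cell into a list of individual odds
    AsLists = Table.TransformColumns(Source, {{"Odds", each Text.Split(_, ";")}}),
    // Expand the lists so that every odds value gets its own row
    AsRows = Table.ExpandListColumn(AsLists, "Odds")
in
    AsRows
```

If separate columns are preferred over separate rows, Table.SplitColumn with Splitter.SplitTextByDelimiter(";") should do the same job column-wise.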