Excel

Question

这是我从事了一年多的海量数据清理任务的第三个也是最后一个遗留问题。感谢 Stack Overflow 社区帮助解决问题：

问题 1- .
问题 2- .

我不是 100% 确定在 excel 中是否可以实现以下目标，但我会尽力描述我面临的数据清理和组织挑战。

我有一系列数据 markers/attributes，它们在 24 列中以随机顺序排列，跨越 500,000 多行。下面的图片 1 是原始数据的示例，为简化说明，跨 12 列和 22 行呈现。 A 到 L 列包含原始数据，M 到 X 列代表所需的输出。

任务摘要：需要完成的是一系列匹配函数，这些函数搜索所有索引列（在本例中为 A 到 L 列）以识别唯一值（例如1），搜索范围内的值（在本例中为A2:L21范围），识别唯一值的相邻值（对于值1，相邻值为2和13-XR），然后将它们输出到在包含任何相关值的每一行中从最频繁出现的值到最不频繁出现的降序序列（在这种情况下，1 出现 5 次并放置在 M2 到 M6 中；2 出现 3 次并放置在 N2 到N6；13-XR 出现 2 次并置于 O2 到 O6 中）。

为了澄清，下面是使用颜色来说明原始数据中的模式匹配（A 列到 L 列）以及这些模式随后应如何在输出中呈现（M 列到 X 列）的分步说明。我已将以下每个图像分成原始数据中的六个模式。

上图是 VBA 解决方案识别的第一个模式。它会将“1”标识为唯一值，并在 A:L 范围内搜索“1”的实例数（以蓝色突出显示），然后标识可以在同一行中找到的所有相邻值：“ 2" 在第 3、5 和 6 行（以绿色突出显示）；和第 4 行和第 5 行中的“13-XR”（以粉红色突出显示）。然后需要对“2”进行此操作，识别相邻值（“1”和“13-XR”），然后对“13-XR”进行识别（“1”和“2”作为相邻值） .输出将 return 列 M 中出现次数最多的唯一值（“1”出现 5 次），然后是列 N 中出现次数第二多的唯一值（“2”出现 3 次），以及出现次数第三多的唯一值在 O 列中（“13-XR”出现 2 次）。

以上稍微复杂一些。 VBA 会将“3”标识为唯一值，在 A:L 范围内搜索“3”的其他实例并标识与其相邻的所有值（在本例中为“4” 、“7”和“9”）。然后它会对“4”做同样的事情，识别所有相邻的值（只有“3”）；然后对于“7”，识别相邻值（“9”、“3”和“12”）；然后为“9”识别（“7”和“3”）；最后，对于“12”识别相邻值（仅“7”）。然后对于存在任何这些值的每一行，输出将 return M 列中的“3”（出现三次）和 N 列中的“7”（也出现三次）；如果计数相等，它们可以以 A 到 Z 或从最小到最大的升序方式呈现......或者只是随机的，相等计数的排序对我来说是任意的。 “9”将在 O 列中被 return 编辑，因为它出现了两次，然后是 P 列中的“4”和 Q 列中的“12”，因为它们都出现了一次，但 12 大于 4。

上图表示很可能是一种普遍现象，其中只有一个唯一值。此处，范围内的任何其他列均未标识“5”。因此，对于存在“5”的每一行，它在 M 列中被 returned 为“5”。

这将是另一种更常见的情况，其中一行中可能存在一个值，而另一行中可能存在两个值。在这种情况下，“6”仅在范围内标识一次，“8”是唯一找到的相邻值。当搜索“8”时，只有 return 是相邻值“6”的一个实例。此处，“8”出现两次，“6”仅出现一次，因此无论行中是否存在“8”或“6”，都会在 M 列中估算“8”，在 N 列中估算“6”。

此处“10”、“111”、“112”、“543”、“433”、“444”和“42-FG”被标识为 [=100 中相互关联的唯一值=] 范围。除“10”外的所有值都出现两次，return按降序排列在 M 到 S 列中。

最终模式的识别方式与上述相同，只是具有更多的唯一值 (n=10)。

最后的说明： 我不知道如何在 excel 内完成这个，但我希望其他人有知识将这个问题向前推进。以下是一些可能有助于解决问题的数据的附加说明：

第一列将始终按升序排序。如果它简化了事情，我可以做额外的自定义排序。
在大约 500,000 行中，15% 只有一个属性值（A 列中有一个值），30% 有两个属性值（A 列中有 1 个值，B 列中有 1 个值），13% 有三个属性值（A、B 和 C 列中有 1 个值）。
我在这个例子中给出了小数字。每个单元格中的实际原始数据值的长度将接近 20 个字符。
一个除了按降序显示模式之外的所有事情的解决方案将是绝对酷的。排序会很好，但如果它造成太多麻烦，我可以没有它。

如果此描述中的任何内容需要进一步说明，或者我可以提供更多信息，请告诉我，我会根据需要进行调整。

提前感谢任何可以帮助解决我的最后一个挑战的人。

附录：

整个数据集发生内存错误。 @ambie 发现错误的来源是相邻链（结果）在 1000 中编号（试图 return 结果跨 1000 列）。似乎问题不在于解决方案或数据，只是达到 excel 内的限制。一个可能的解决方案是（见下图）添加两个新列（ATT_COUNT 作为 M 列；ATT_ALL 作为 Z 列）。 M 列中的 ATT_COUNT 将 return 通常跨列 return 的唯一值总数。只有前 12 个最常出现的值会在 N 到 Y 列（ATT_1_CL 到 ATT_12_CL 中 returned。为了解决 ATT_COUNT > 12（超过 1000+）的情况，我们可以 return space 分隔格式的所有唯一值 ATT_ALL（Z 列).例如，在下图中，第 17、18、19 和 21 行在链中有 17 个唯一值。 N 到 Y 列中仅显示前 12 个最常出现的值。所有 17 个值均以 space 分隔格式显示在 Z 列中。

这里是a link to this mini example test data.

这里是a link to a mid sized sample of test data of ~50k rows.

这里是a link to the full sized sample test data of ~500k rows。

Answer 1

我们通常不提供 'code for you service' 但我知道在之前的问题中您已经提供了一些您已经尝试过的示例代码，我知道您不知道从哪里开始.

对于您未来的编码工作，诀窍是将问题分解为单独的任务。对于您的问题，这些将是：

识别所有唯一值并获取所有相邻值的列表 - 相当简单。
创建一个 'chains' 的列表，其中 link 一个相邻值到下一个 - 这更尴尬，因为虽然列表看起来已排序，但相邻值却没有，因此值相对较低列表中的 down 可能与已经是链的一部分的更高值相邻（样本中的 3 就是一个例子）。所以最简单的事情就是在读取所有唯一值后才分配链。
将每个唯一值映射到其适当的 'chain' - 我通过为链创建索引并将相关索引分配给唯一值来完成此操作。

Collection 对象非常适合您，因为它们可以处理重复问题，允许您填充未知大小的列表，并使用它们的 Key 属性使值映射变得容易.为了使代码易于阅读，我创建了一个包含一些字段的 class。所以首先，插入一个Class模块并命名为cItem。 class 背后的代码是：

Option Explicit

Public Element As String
Public Frq As Long
Public AdjIndex As Long
Public Adjs As Collection

Private Sub Class_Initialize()
    Set Adjs = New Collection
End Sub

在您的模块中，任务可以编码如下：

Dim data As Variant, adj As Variant
Dim uniques As Collection, chains As Collection, chain As Collection
Dim oItem As cItem, oAdj As cItem
Dim r As Long, c As Long, n As Long, i As Long, maxChain As Long
Dim output() As Variant

'Read the data.
'Note: Define range as you need.
With Sheet1
    data = .Range(.Cells(2, "A"), _
                  .Cells(.Rows.Count, "A").End(xlUp)) _
           .Resize(, 12) _
           .Value2
End With

'Find the unique values
Set uniques = New Collection
For r = 1 To UBound(data, 1)
    For c = 1 To UBound(data, 2)
        If IsEmpty(data(r, c)) Then Exit For
        Set oItem = Nothing: On Error Resume Next
        Set oItem = uniques(CStr(data(r, c))): On Error GoTo 0
        If oItem Is Nothing Then
            Set oItem = New cItem
            oItem.Element = CStr(data(r, c))
            uniques.Add oItem, oItem.Element
        End If
        oItem.Frq = oItem.Frq + 1
        'Find the left adjacent value
        If c > 1 Then
            On Error Resume Next
            oItem.Adjs.Add uniques(CStr(data(r, c - 1))), CStr(data(r, c - 1))
            On Error GoTo 0
        End If
        'Find the right adjacent value
        If c < UBound(data, 2) Then
            If Not IsEmpty(data(r, c + 1)) Then
                On Error Resume Next
                oItem.Adjs.Add uniques(CStr(data(r, c + 1))), CStr(data(r, c + 1))
                On Error GoTo 0
            End If
        End If
    Next
Next

'Define the adjacent indexes.
For Each oItem In uniques
    'If the item has a chain index, pass it to the adjacents.
    If oItem.AdjIndex <> 0 Then
        For Each oAdj In oItem.Adjs
            oAdj.AdjIndex = oItem.AdjIndex
        Next
    Else
        'If an adjacent has a chain index, pass it to the item.
        i = 0
        For Each oAdj In oItem.Adjs
            If oAdj.AdjIndex <> 0 Then
                i = oAdj.AdjIndex
                Exit For
            End If
        Next
        If i <> 0 Then
            oItem.AdjIndex = i
            For Each oAdj In oItem.Adjs
                oAdj.AdjIndex = i
            Next
        End If
        'If we're still missing a chain index, create a new one.
        If oItem.AdjIndex = 0 Then
            n = n + 1
            oItem.AdjIndex = n
            For Each oAdj In oItem.Adjs
                oAdj.AdjIndex = n
            Next
        End If
    End If
Next

'Populate the chain lists.
Set chains = New Collection
For Each oItem In uniques
    Set chain = Nothing: On Error Resume Next
    Set chain = chains(CStr(oItem.AdjIndex)): On Error GoTo 0
    If chain Is Nothing Then
        'It's a new chain so create a new collection.
        Set chain = New Collection
        chain.Add oItem.Element, CStr(oItem.Element)
        chains.Add chain, CStr(oItem.AdjIndex)
    Else
        'It's an existing chain, so find the frequency position (highest first).
        Set oAdj = uniques(chain(chain.Count))
        If oItem.Frq <= oAdj.Frq Then
            chain.Add oItem.Element, CStr(oItem.Element)
        Else
            For Each adj In chain
                Set oAdj = uniques(adj)
                If oItem.Frq > oAdj.Frq Then
                    chain.Add Item:=oItem.Element, Key:=CStr(oItem.Element), Before:=adj
                    Exit For
                End If
            Next
        End If
    End If
    'Get the column count of output array
    If chain.Count > maxChain Then maxChain = chain.Count
Next

'Populate each row with the relevant chain
ReDim output(1 To UBound(data, 1), 1 To maxChain)
For r = 1 To UBound(data, 1)
    Set oItem = uniques(CStr(data(r, 1)))
    Set chain = chains(CStr(oItem.AdjIndex))
    c = 1
    For Each adj In chain
        output(r, c) = adj
        c = c + 1
    Next
Next

'Write the output to sheet.
'Note: adjust range to suit.
Sheet1.Range("M2").Resize(UBound(output, 1), UBound(output, 2)).Value = output

这不是最有效的方法，但它确实使每个任务对您来说更加明显。我不确定我是否理解您的数据结构的全部复杂性，但上面的代码确实重现了您的示例，因此它应该为您提供一些有用的东西。

更新

好的，现在我已经看到了你的评论和真实数据，下面是一些修改后的代码，它应该更快并且处理了表面上 'empty' 单元格实际上是空字符串的事实。

首先创建一个名为cItem的class并在后面添加代码：

Option Explicit

Public Name As String
Public Frq As Long
Public Adj As Collection
Private mChainIndex As Long
Public Property Get ChainIndex() As Long
    ChainIndex = mChainIndex
End Property
Public Property Let ChainIndex(val As Long)
    Dim oItem As cItem
    If mChainIndex = 0 Then
        mChainIndex = val
        For Each oItem In Me.Adj
            oItem.ChainIndex = val
        Next
    End If            
End Property
Public Sub AddAdj(oAdj As cItem)
    Dim t As cItem

    On Error Resume Next
    Set t = Me.Adj(oAdj.Name)
    On Error GoTo 0
    If t Is Nothing Then Me.Adj.Add oAdj, oAdj.Name
End Sub
Private Sub Class_Initialize()
    Set Adj = New Collection
End Sub

现在创建另一个名为 cChain 的 class，后面的代码为：

Option Explicit

Public Index As Long
Public Members As Collection
Public Sub AddItem(oItem As cItem)
    Dim oChainItem As cItem
    With Me.Members
        Select Case .Count
            Case 0 'First item so just add it.
                .Add oItem, oItem.Name
            Case Is < 12 'Fewer than 12 items, so add to end or in order.
                Set oChainItem = .item(.Count)
                If oItem.Frq <= oChainItem.Frq Then 'It's last in order so just add it.
                    .Add oItem, oItem.Name
                Else 'Find its place in order.
                    For Each oChainItem In Me.Members
                        If oItem.Frq > oChainItem.Frq Then
                            .Add oItem, oItem.Name, before:=oChainItem.Name
                            Exit For
                        End If
                    Next
                End If
            Case 12 'Full list, so find place and remove last item.
                Set oChainItem = .item(12)
                If oItem.Frq > oChainItem.Frq Then
                    For Each oChainItem In Me.Members
                        If oItem.Frq > oChainItem.Frq Then
                            .Add oItem, oItem.Name, before:=oChainItem.Name
                            .Remove 13
                            Exit For
                        End If
                    Next
                End If
        End Select
    End With
End Sub
Private Sub Class_Initialize()
    Set Members = New Collection
End Sub

最后，您的模块代码将是：

Option Explicit

Public Sub ProcessSheet()
    Dim data As Variant
    Dim items As Collection, chains As Collection
    Dim oItem As cItem, oAdj As cItem
    Dim oChain As cChain
    Dim txt As String
    Dim r As Long, c As Long, n As Long
    Dim output() As Variant
    Dim pTick As Long, pCount As Long, pTot As Long, pTask As String

    'Read the data.
    pTask = "Reading data..."
    Application.StatusBar = pTask
    With Sheet1
        data = .Range(.Cells(2, "A"), _
                      .Cells(.Rows.Count, "A").End(xlUp)) _
               .Resize(, 12) _
               .Value2
    End With

    'Collect unique and adjacent values.
    pTask = "Finding uniques "
    pCount = 0: pTot = UBound(data, 1): pTick = 0
    Set items = New Collection
    For r = 1 To UBound(data, 1)
        If ProgressTicked(pTot, pCount, pTick) Then
            Application.StatusBar = pTask & pTick & "%"
            DoEvents
        End If
        For c = 1 To UBound(data, 2)
            txt = data(r, c)
            If Len(txt) = 0 Then Exit For
            Set oItem = GetOrCreateItem(items, txt)
            oItem.Frq = oItem.Frq + 1

            'Take adjacent on left.
            If c > 1 Then
                txt = data(r, c - 1)
                If Len(txt) > 0 Then
                    Set oAdj = GetOrCreateItem(items, txt)
                    oItem.AddAdj oAdj
                End If
            End If
            'Take adjacent on right.
            If c < UBound(data, 2) Then
                txt = data(r, c + 1)
                If Len(txt) > 0 Then
                    Set oAdj = GetOrCreateItem(items, txt)
                    oItem.AddAdj oAdj
                End If
            End If

        Next
    Next

    'Now that we have all the items and their frequencies,
    'we can find the adjacent chain indexes by a recursive
    'call of the ChainIndex set property.
    pTask = "Find chain indexes "
    pCount = 0: pTot = items.Count: pTick = 0
    Set chains = New Collection
    n = 1 'Chain index.
    For Each oItem In items
        If ProgressTicked(pTot, pCount, pTick) Then
            Application.StatusBar = pTask & pTick & "%"
            DoEvents
        End If
        If oItem.ChainIndex = 0 Then
            oItem.ChainIndex = n
            Set oChain = New cChain
            oChain.Index = n
            chains.Add oChain, CStr(n)
            n = n + 1
        End If
    Next

    'Build the chains.
    pTask = "Build chains "
    pCount = 0: pTot = items.Count: pTick = 0
    For Each oItem In items
        If ProgressTicked(pTot, pCount, pTick) Then
            Application.StatusBar = pTask & pTick & "%"
            DoEvents
        End If
        Set oChain = chains(CStr(oItem.ChainIndex))
        oChain.AddItem oItem
    Next

    'Write the data to our output array.
    pTask = "Populate output "
    pCount = 0: pTot = UBound(data, 1): pTick = 0
    ReDim output(1 To UBound(data, 1), 1 To 12)
    For r = 1 To UBound(data, 1)
        If ProgressTicked(pTot, pCount, pTick) Then
            Application.StatusBar = pTask & pTick & "%"
            DoEvents
        End If
        Set oItem = items(data(r, 1))
        Set oChain = chains(CStr(oItem.ChainIndex))
        c = 1
        For Each oItem In oChain.Members
            output(r, c) = oItem.Name
            c = c + 1
        Next
    Next

    'Write the output to sheet.
    'Note: adjust range to suit.
    pTask = "Writing data..."
    Application.StatusBar = pTask
    Sheet1.Range("M2").Resize(UBound(output, 1), UBound(output, 2)).Value = output
    Application.StatusBar = "Ready"
End Sub

Private Function GetOrCreateItem(col As Collection, key As String) As cItem
    Dim obj As cItem

    'If the item already exists then return it,
    'otherwise create a new item.
    On Error Resume Next
    Set obj = col(key)
    On Error GoTo 0

    If obj Is Nothing Then
        Set obj = New cItem
        obj.Name = key
        col.Add obj, key
    End If

    Set GetOrCreateItem = obj

End Function
Public Function ProgressTicked(ByVal t As Long, ByRef c As Long, ByRef p As Long) As Boolean
    c = c + 1
    If Int((c / t) * 100) > p Then
        p = p + 1
        ProgressTicked = True
    End If
End Function

Excel - 识别唯一值模式和 return 跨列降序输出，针对 500,000 多行进行了优化

Excel - Identify unique value patterns and return output in descending order across columns, optimized for 500,000+ rows

vba

list

unique

pattern-matching