PowerShell: How to merge two arrays based on a property

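My first array is a list of pet owners with their mobile numbers, etc. The second array is a list of pets, each with its owner's name. The pets list has about 1k entries and the owners list about 3k.

What is the fastest way to add the owner information to the pet information? My script currently takes almost a minute to run, which seems like too much.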

foreach ($pet in $pets) {
    $owner = $owners | Where-Object { $_.name -eq $pet.owner }
    if ($owner) {
        $pet | Add-Member -MemberType NoteProperty -Name "Owner" -Value $owner.name
    }
}

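Performing a linear lookup in the owners array for each pet is what makes your approach fundamentally slow: for each pet, all 3,000 owner objects must be searched, resulting in 1,000 x 3,000 = 3 million comparisons.

Performance is also affected by incidental implementation choices: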

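  • As powerful and elegant as the PowerShell pipeline is, its one-by-one streaming is usually noticeably slower than iterating over an array in an expression / language statement, such as a foreach loop.

  • Additionally, as of PowerShell 7.2.2, the Where-Object and ForEach-Object cmdlets are inefficiently implemented, which adds additional overhead - see GitHub issue #10982 and this answer.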

    • A functional limitation of Where-Object worsens performance further: there is no way to stop the enumeration once a match has been found; that is, the input is always processed in full, and all matches are output.

    • By contrast, the similar .Where() array method does offer a way to stop processing once the first match is found (e.g., (1..10).Where({ $_ -ge 5 }, 'First')). Potentially bringing the same functionality to the Where-Object cmdlet is the subject of GitHub issue #13834.
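To make the difference concrete, here is a minimal sketch contrasting the two filtering styles on a small range. The variable names are illustrative only; the .Where() method assumes PowerShell v4 or later.

```powershell
$numbers = 1..10

# Where-Object always processes the entire input and returns every match.
$allMatches = $numbers | Where-Object { $_ -ge 5 }

# The .Where() method with the 'First' mode stops at the first match.
# It returns a collection, so index into it to get the element itself.
$firstMatch = $numbers.Where({ $_ -ge 5 }, 'First')[0]

"All matches: $($allMatches -join ', ')"
"First match: $firstMatch"
```

In the pets scenario, stopping early matters because each owner name is expected to match at most one owner object.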


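Therefore, you have two options:

  • (A) A pragmatic solution: stick with the fundamentally inefficient approach, but make the implementation more efficient, so that the resulting performance may be good enough:

    • Solution (A) below is much faster than your original Where-Object-based approach, by a factor of about 33 to 38, depending on the PowerShell version used; see the benchmarks in the next section.
  • (B) The proper, scalable, better-performing, but more complex solution: use an auxiliary data structure that supports efficient lookup of owner objects by name, such as a hashtable, as suggested by Darin.

    • Solution (B) below is about 7 to 13 times faster than solution (A), depending on the PowerShell version used, and therefore approx. 260 to 420 (!) times faster than the Where-Object solution; see the benchmarks in the next section.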

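Note:

  • In the code below, I've modified your example so that a property other than the owner name is added as a new property to each pet object (.Address, as .OwnerAddress), given that the owner name is already present to begin with.

  • Also, for brevity, the -MemberType NoteProperty argument and the -Name and -Value parameter names are omitted from the Add-Member calls (they are implied).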
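As a quick illustration of the shorthand mentioned in the note, the following sketch adds the same property once with the explicit parameters and once positionally. The two sample objects and their values are made up for the demonstration.

```powershell
# Two hypothetical sample pets, used only for this comparison.
$petA = [pscustomobject] @{ Name = 'Pet A' }
$petB = [pscustomobject] @{ Name = 'Pet B' }

# Verbose form: member type, name, and value are spelled out.
Add-Member -InputObject $petA -MemberType NoteProperty -Name OwnerAddress -Value 'Address 1'

# Short form, as used in the solutions: property name and value are passed positionally.
Add-Member -InputObject $petB OwnerAddress 'Address 1'

# Both objects now carry an equivalent OwnerAddress note property.
$petA.OwnerAddress -eq $petB.OwnerAddress
```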

Solution (A): Replace the Where-Object pipeline with an (inner) foreach statement:

# Create 1000 sample pets and 3000 sample owners.
$pets = foreach ($i in 1..1000) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
$owners = foreach ($i in 1..3000) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

foreach ($pet in $pets) { 
  # Perform the lookup more efficiently via an inner `foreach` loop.
  $owner = foreach ($o in $owners) { if ($o.Name -eq $pet.Owner) { $o; break } }
  if ($owner) {
    Add-Member -InputObject $pet OwnerAddress $owner.Address
  }
}

Solution (B): Create a hashtable that maps owner names to owner objects, for efficient lookup:

# Create 1000 sample pets and 3000 sample owners.
$pets = foreach ($i in 1..1000) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
$owners = foreach ($i in 1..3000) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

# Create a hashtable that maps owner names to owner objects,
# for efficient lookup by name.
$ownerMap = @{}; foreach ($owner in $owners) { $ownerMap[$owner.Name] = $owner }

foreach ($pet in $pets) { 
  # Look up the pet's owner in the owner map (hashtable); returns $null if not found.
  $owner = $ownerMap[$pet.Owner]
  if ($owner) {
    Add-Member -InputObject $pet OwnerAddress $owner.Address
  }
}
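One detail worth checking when switching from -eq to a hashtable is whether lookups still behave the same way. The sketch below, using a single made-up owner entry, suggests they do: a hashtable created with @{} matches string keys case-insensitively, just like -eq, and a missing key yields $null rather than an error.

```powershell
# A one-entry owner map, for illustration only.
$ownerMap = @{}
$ownerMap['Owner 6'] = [pscustomobject] @{ Name = 'Owner 6'; Address = 'Address 6' }

# Key lookup ignores case, consistent with the -eq comparison in the original code.
$hit = $ownerMap['owner 6']

# A key that is not present returns $null, which the `if ($owner)` test treats as "not found".
$miss = $ownerMap['Owner 9999']

"Hit address: $($hit.Address)"
"Miss is null: $($null -eq $miss)"
```

Note that if several owners shared the same name, the assignment in the loop would keep only the last one; the sample data assumes unique owner names.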

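Benchmarks

  • Below are sample timings comparing the three approaches, averaged over 10 runs.

  • Timing commands is never an exact science in PowerShell, and performance varies based on many factors, not least the hardware as far as absolute times are concerned, but the results below provide a sense of relative performance, as reflected in the Factor output column: 1.00 denotes the fastest command, listed first, with the slower commands expressed as multiples of it, in descending order of speed.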

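  • The source code is included at the bottom, allowing you to run these benchmarks yourself.

    • Caveat: With the given collection sizes, these benchmarks run for quite a while (up to 10 minutes or more), mostly due to how slow the Where-Object solution is.

    • For best results, run the benchmarks while your machine isn't (too) busy doing other things.

  • Note that the overall performance of the cross-platform PowerShell (Core) edition seems to have improved significantly relative to Windows PowerShell.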

Windows PowerShell 5.1 on Windows 10:

Factor Secs (10-run avg.) Command
------ ------------------ -------
1.00   0.234              # Hashtable-assisted lookups....
6.85   1.605              # Nested foreach statements...
261.95 61.353             # Pipeline with Where-Object...

PowerShell (Core) 7.2.2 on Windows 10:

Factor Secs (10-run avg.) Command
------ ------------------ -------
1.00   0.096              # Hashtable-assisted lookups.…
12.70  1.216              # Nested foreach statements…
424.40 40.624             # Pipeline with Where-Object…
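For a rough single-run check without any external helper, the built-in Measure-Command cmdlet can time one approach at a time. The sketch below times only the hashtable-assisted approach on the same sample data; it is not a substitute for averaged runs, and the absolute number will differ by machine.

```powershell
# Sample data, same shape as in the solutions above.
$owners = foreach ($i in 1..3000) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }
$pets = foreach ($i in 1..1000) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }

# Measure-Command returns a [timespan] describing how long the script block ran.
$elapsed = Measure-Command {
  $ownerMap = @{}; foreach ($owner in $owners) { $ownerMap[$owner.Name] = $owner }
  foreach ($pet in $pets) {
    $owner = $ownerMap[$pet.Owner]
    if ($owner) { Add-Member -InputObject $pet OwnerAddress $owner.Address }
  }
}

'Hashtable approach: {0:N3} seconds' -f $elapsed.TotalSeconds
```

Because the pet objects are modified by the run, re-create $pets before timing another approach against them.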

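Benchmark source code:

  • The following benchmark code uses function Time-Command from this Gist.

  • Unless it is already present, you are prompted to automatically download and define this function in your session. (I can personally assure you that doing so is safe, but you should always check the source code yourself.)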

# Download and define function `Time-Command` on demand (will prompt).
# To be safe, inspect the source code at the specified URL first.
if (-not (Get-Command -ErrorAction Ignore Time-Command)) {
  $gistUrl = 'https://gist.github.com/mklement0/9e1f13978620b09ab2d15da5535d1b27/raw/Time-Command.ps1'
  if ((Read-Host "`n====`n  OK to download and define benchmark function ``Time-Command```n  from Gist ${gistUrl}?`n=====`n(y/n)?").Trim() -notin 'y', 'yes') { Write-Warning 'Aborted.'; exit 2 }
  Invoke-RestMethod $gistUrl | Invoke-Expression
  if (-not ${function:Time-Command}) { exit 2 }
}

# Define the collection sizes
$petCount = 1000
$ownerCount = 3000

# Define a sample owners array.
$owners = foreach ($i in 1..$ownerCount) { [pscustomobject] @{ Name = "Owner $i"; Address = "Address $i" } }

# Define a script block that creates a sample pets array.
# Note: We use a script block, because the array must be re-created
#       for each run, since the pet objects get modified.
$petGenerator = {
  foreach ($i in 1..$petCount) { [pscustomobject] @{ Name = "Pet $i"; Owner = 'Owner {0}' -f (6 * $i) } }
}

# Define script blocks with the commands to time.
$commands = @(
  { # Nested foreach statements
    $pets = & $petGenerator
    foreach ($pet in $pets) { 
      $owner = foreach ($o in $owners) { if ($o.Name -eq $pet.Owner) { $o; break } }
      if ($owner) {
        Add-Member -ea stop -InputObject $pet OwnerAddress $owner.Address
      }
    }
  },
  { # Pipeline with Where-Object
    $pets = & $petGenerator
    foreach ($pet in $pets) { 
      $owner = $owners | Where-Object { $_.name -eq $pet.Owner }
      if ($owner) {
        Add-Member -InputObject $pet OwnerAddress $owner.Address
      }
    }
  },
  { # Hashtable-assisted lookups.
    $pets = & $petGenerator
    $ownerMap = @{}; foreach ($owner in $owners) { $ownerMap[$owner.Name] = $owner }
    foreach ($pet in $pets) { 
      $owner = $ownerMap[$pet.Owner]
      if ($owner) {
        Add-Member -InputObject $pet OwnerAddress $owner.Address
      }
    }
  }
)

Write-Verbose -Verbose 'Running benchmarks...'

# Average 10 runs.
# Add -OutputToHost to print script-block output, if desired.
Time-Command -Count 10 $commands