Removing Strings if Substring of the String is Present in Same Array

Question

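Newbie here.

I am trying to whittle down a list of domains by eliminating every subdomain whose parent domain is also in the list. After some searching and reading, I managed to cobble together a PowerShell script that more or less does this. The output is not exactly what I want, but it is workable. The problem with my solution is that it takes a very long time to run because of the size of my initial list (tens of thousands of entries).

UPDATE: I have updated my example to clarify my question.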

Example "parent.txt" list:

adk2.co
adk2.com
adobe.com
helpx.adobe.com
manage.com
list-manage.com
graph.facebook.com

Example output "repeats.txt" file:

adk2.com (different top level domain than adk2.co but that's ok)
helpx.adobe.com
list-manage.com (not subdomain of manage.com but that's ok)

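I then take the repeats and remove them from the parent list, leaving a list of "unique" subdomains and domains. I do this in a separate script.

Example final list with my current script:

adk2.co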
adobe.com
manage.com
graph.facebook.com (it's not facebook.com because facebook.com wasn't in the original list.)

Ideal final list:

adk2.co
adk2.com (since adk2.co and adk2.com are actually distinct domains)
adobe.com
manage.com
graph.facebook.com

Below is my code:

I take my list of hosts (parent.txt), check it against itself, and write every match to a new file.

$parent = Get-Content("parent.txt")
$hosts = Get-Content("parent.txt")
$repeats =@()

$out_file     = "$PSScriptRoot\repeats.txt"

$hosts | where { 
    $found = $FALSE
    foreach($domains in $parent){
        if($_.Contains($domains) -and $_ -ne $domains){
            $found = $TRUE
            $repeats += $_
        }
        if($found -eq $TRUE){
            break
        }
    }
    $found
}

$repeats     = $repeats -join "`n"

[System.IO.File]::WriteAllText($out_file,$repeats)

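This seems like a very inefficient way to do it, since for every element I loop over the whole array again. Any suggestions on how best to optimize it? I have a few ideas, such as putting more conditions on which elements get checked and which they are checked against, but I feel that a radically different approach would be better.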

Answer 1

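One approach is to use a hash table to store all the parent values and then, for each repeat, remove it from the table. The value 1 used when adding to the hash table does not matter, because we only test whether the key exists.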

$parent = @(
'adk2.co',
'adk2.com',
'adobe.com',
'helpx.adobe.com',
'manage.com',
'list-manage.com'
)

$repeats = @(
'adk2.com',
'helpx.adobe.com',
'list-manage.com'
)

$domains = @{}
$parent | % {$domains.Add($_, 1)}
$repeats | % {if ($domains.ContainsKey($_)) {$domains.Remove($_)}}

$domains.Keys | Sort-Object
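
The same idea can also be sketched with a .NET HashSet, which may be convenient if the parent list could contain duplicate lines (Hashtable.Add() throws on a duplicate key, whereas a set simply ignores it). This is only a variant under assumptions, not part of the original answer: the variable $remaining is an illustrative name, the case-insensitive comparer is chosen to mirror the case-insensitive keys of @{}, and the ::new() syntax assumes PowerShell 5 or later.

```powershell
# Sample data copied from the answer above.
$parent  = 'adk2.co', 'adk2.com', 'adobe.com', 'helpx.adobe.com', 'manage.com', 'list-manage.com'
$repeats = 'adk2.com', 'helpx.adobe.com', 'list-manage.com'

# Build a case-insensitive set; duplicate entries are ignored rather than raising an error.
$comparer = [System.StringComparer]::OrdinalIgnoreCase
$domains  = [System.Collections.Generic.HashSet[string]]::new([string[]]$parent, $comparer)

# Remove every repeat in one call; repeats that are not in the set are skipped.
$domains.ExceptWith([string[]]$repeats)

# $remaining is a hypothetical name for the final, sorted list.
$remaining = $domains | Sort-Object
$remaining
```

With the sample data this should leave adk2.co, adobe.com and manage.com.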

Answer 2

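First, a solution based strictly on shared domain names (e.g., helpx.adobe.com and adobe.com are considered to belong to the same domain, but list-manage.com and manage.com are not). This is not what you asked for, but it may be more useful to future readers: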

Get-Content parent.txt | Sort-Object -Unique { ($_ -split '\.')[-2,-1] -join '.' }

Assuming list.manage.com instead of list-manage.com in your sample input, the above command yields:

adk2.co
adk2.com
adobe.com
graph.facebook.com
manage.com

{ ($_ -split '\.')[-2,-1] -join '.' } sorts the input lines by their last two domain components (e.g., adobe.com).
-Unique discards the duplicates.
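
To make the sort key concrete, here is a small sketch that applies the same expression to a few of the sample hosts. The function name Get-DomainKey is hypothetical and only wraps the script block shown above for illustration.

```powershell
# Reduce a host name to its last two labels, exactly as the sort key above does.
# Get-DomainKey is an illustrative helper name, not part of the original answer.
function Get-DomainKey([string] $HostName) {
    ($HostName -split '\.')[-2, -1] -join '.'
}

Get-DomainKey 'helpx.adobe.com'   # adobe.com
Get-DomainKey 'adobe.com'         # adobe.com (same key, so -Unique keeps only one of the two)
Get-DomainKey 'list-manage.com'   # list-manage.com (a different key from manage.com)
```

Because the key is always the last two labels, hosts under a two-label suffix such as example.co.uk would all share the key co.uk, so this approach fits best when the list uses single-label top-level domains.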

A shared-suffix solution, as requested:

# Helper function for (naively) reversing a string.
# Note: Does not work properly with Unicode combining characters
#       and surrogate pairs.
function reverse($str) { $a = $str.ToCharArray(); [Array]::Reverse($a); -join $a }

# * Sort the reversed input lines, which effectively groups them by shared suffix
#   with the shortest entry first (e.g., the reverse of 'manage.com' before the
#   reverse of 'list-manage.com').
# * It is then sufficient to output only the first entry in each group, using
#   wildcard matching with -notlike to determine group boundaries.
# * Finally, sort the re-reversed results.
Get-Content parent.txt | ForEach-Object { reverse $_ } | Sort-Object |
  ForEach-Object { $prev = $null } {
    if ($null -eq $prev -or $_ -notlike "$prev*" ) { 
      reverse $_ 
      $prev = $_
    }
  } | Sort-Object
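
If only true subdomains should be dropped, so that list-manage.com survives next to manage.com, one possible variant is to look up each host's parent domains in a hash set, matching on whole labels only. This is a rough sketch under that assumption rather than what the question literally asked for; the names $hosts, $known and $kept are illustrative, and the ::new() syntax assumes PowerShell 5 or later. Each host needs only as many lookups as it has labels, so the work grows roughly linearly with the size of the list.

```powershell
# Sample input from the question.
$hosts = 'adk2.co', 'adk2.com', 'adobe.com', 'helpx.adobe.com',
         'manage.com', 'list-manage.com', 'graph.facebook.com'

# Case-insensitive set of every entry, for fast membership tests.
$comparer = [System.StringComparer]::OrdinalIgnoreCase
$known    = [System.Collections.Generic.HashSet[string]]::new([string[]]$hosts, $comparer)

$kept = foreach ($h in $hosts) {
    $labels    = $h -split '\.'
    $hasParent = $false
    # Try each proper parent: drop one leading label, then two, and so on.
    for ($i = 1; $i -lt $labels.Count; $i++) {
        $candidate = $labels[$i..($labels.Count - 1)] -join '.'
        if ($known.Contains($candidate)) { $hasParent = $true; break }
    }
    # Emit the host only when none of its parents is in the list.
    if (-not $hasParent) { $h }
}
$kept
```

With the sample input, only helpx.adobe.com is dropped, because adobe.com is the only parent that matches on a label boundary.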
