
With two very large lists/collections - how to detect and/or remove duplicates efficiently

Why

When comparing and deduplicating between two lists, coders under time pressure usually don't land on the most runtime-efficient implementation. Two nested for loops are a common go-to solution for many programmers, and one might try a CROSS JOIN with LINQ, but that is plainly inefficient. Coders need an approach that is memorable and code-efficient, while also being reasonably runtime-efficient.

This question was created after seeing a more specific question: - it was more specifically about the use of DataSets, and the word "dataset" won't help people searching for this later. No other general question was found.

What

I use the terms List/Collection to help with this more general coding question.

var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M 

var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M

var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);

By implementing the deduplicationFunction, the input data and output should be clear. The output can be an IEnumerable. The expected output for this example input is the even numbers in 1-1M: {2,4,6,8,...}

Note: there may be duplicates in the reference set. The values in both sets are illustrative only, so I'm not looking for a mathematical solution - it should also work for random-number input in both sets.

Handling this with simple LINQ functions would be too slow, O(1M*0.5M). For sets this large, a faster approach is needed.
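For reference, the naive LINQ version looks something like this - a sketch of the slow approach being ruled out, not a recommendation:

//O(n*m): List<T>.Contains scans the whole reference list for every item
var deduplicatedSet = setToDeduplicate
    .Where(x => !referenceSet.Contains(x))
    .Distinct();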

Speed matters, but incremental improvements bought with a lot of code are worth less. Ideally it would also work for other data types, including data-model objects, though answering this specific question is enough; other data types would just involve more preprocessing or a slight change to the answer.

Solution summary

The test code is below, and the results follow it:

using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Preparing...");

            List<int> set1 = new List<int>();
            List<int> set2 = new List<int>();

            Random r = new Random();
            var max = 10000;

            for (int i = 0; i < max; i++)
            {
                set1.Add(r.Next(0, max));
                set2.Add(r.Next(0, max/2) * 2);
            }

            Console.WriteLine("First run...");

            Stopwatch sw = new Stopwatch();
            IEnumerable<int> result;
            int count;

            while (true)
            {
                sw.Start();
                result = deduplicationFunction(set1, set2);
                var results1 = result.ToList();
                count = results1.Count;
                sw.Stop();
                Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                sw.Start();
                result = deduplicationFunction2(set1, set2);
                var results2 = result.ToList();
                count = results2.Count;
                sw.Stop();
                Console.WriteLine("  HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction3(set1, set2);
                var results3 = result.ToList();
                count = results3.Count;
                sw.Stop();
                Console.WriteLine("     Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction4(set1, set2);
                var results4 = result.ToList();
                count = results4.Count;
                sw.Stop();
                Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test

                sw.Start();
                result = deduplicationFunction5(set1, set2);
                var results5 = result.ToList();
                count = results5.Count;
                sw.Stop();
                Console.WriteLine("        Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                Console.ReadLine();

                Console.WriteLine("");
                Console.WriteLine("Next Run");
                Console.WriteLine("");
            }

        }


        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var ReferenceHashSet = Reference
                                .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                                .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

            int throwAway;
            return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
        }

        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var SetAsHash = new HashSet<int>();

            Set.ForEach(x =>
            {
                if (SetAsHash.Contains(x))
                    return;

                SetAsHash.Add(x);
            }); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.

            SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing

            return SetAsHash.AsEnumerable();
        }

        static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
        {
            Set.Sort();
            Reference.Sort();
            Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

            return deduplicationFunction4(Set, Reference);
        }

        static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
        {
            int i1 = 0;
            int i2 = 0;
            int thisValue = Set[i1];
            int thisReference = Reference[i2];
            while (true)
            {
                var difference = thisReference - thisValue;

                if (difference < 0)
                {
                    i2++; //Compare side is too low, there might be an equal value to be found
                    if (i2 == Reference.Count)
                        break;
                    thisReference = Reference[i2];
                    continue;
                }

                if (difference > 0) //Not a duplicate - include in output
                    yield return thisValue;

                GoFurther:
                i1++;
                if (i1 == Set.Count)
                    break;
                if (Set[i1] == thisValue) //Eliminates duplicates
                    goto GoFurther; //I rarely use goto statements, but this is a good situation

                thisValue = Set[i1];
            }
        }

        static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
        {
            var found = false;
            var lastValue = 0;
            var thisValue = 0;
            for (int i = 0; i < Set.Count; i++)
            {
                thisValue = Set[i];

                if (thisValue == lastValue)
                    continue;

                lastValue = thisValue;

                found = false;
                for (int x = 0; x < Reference.Count; x++)
                {
                    if (thisValue != Reference[x])
                        continue;

                    found = true;
                    break;
                }

                if (found)
                    continue;

                yield return thisValue;
            }
        }
    }
}

I'll use this to compare the performance of several approaches. (At this stage I'm mostly interested in the hash approach vs the dual-index-on-sorted approach, although ExceptWith enables a very concise solution.)

Results so far on 10k items per set (a good run):

First run

Good run

Selected answer:

HashSet and Where

You have to use a hash set (or dictionary) for speed:

//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = Reference
                        .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                        .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

    int throwAway;
    return Set.Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
}

This is the lambda expression version. It uses a Dictionary, which gives the adaptability of mutating the value if needed. A literal for loop could be used, and perhaps some further incremental performance squeezed out, but relative to two nested loops this is already a dramatic improvement.
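As the code comments note, .NET Framework 4.7.2 (and .NET Core 2.0) added Enumerable.ToHashSet, which makes building the lookup structure a one-liner - a minimal sketch:

static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    var ReferenceHashSet = Reference.ToHashSet(); //A HashSet simply ignores duplicate additions
    return Set.Where(y => !ReferenceHashSet.Contains(y)).Distinct();
}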

Learning a few things from the other answers, here is a faster implementation:

static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference);
    return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}

Importantly, this approach (while a little slower than @backs' answer) is still versatile enough to be used for database entities; other types can easily use it by checking duplicates on a field.

Here is an example of how easily the code can be adjusted for a list of Person database entities:

static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
    return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
            .GroupBy(p => p.ID).Select(p => p.First()); //The groupby and select should accomplish DistinctBy(..p.ID)
}
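For completeness, a minimal Person entity consistent with the sketch above (hypothetical - the original doesn't define it):

public class Person
{
    public int ID { get; set; } //The duplicate-check key used above
    public string Name { get; set; }
}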

Use a HashSet for your initial list, and use the ExceptWith method to get the resulting set:

var setToDeduplicate = new HashSet<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M 

var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M

setToDeduplicate.ExceptWith(referenceSet);
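Since the question asks about detecting and/or removing duplicates: to detect them instead, HashSet<T>.IntersectWith is the counterpart operation (a sketch):

var duplicates = new HashSet<int>(setToDeduplicate);
duplicates.IntersectWith(referenceSet); //Keeps only the values present in both sets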

Here are some more results; basically I wanted to test distinct and non-distinct inputs against the various solutions. In the non-distinct version I had to call Distinct on the final output where needed.

Mode             : Release (64Bit)
Test Framework   : .NET Framework 4.7.1

Operating System : Microsoft Windows 10 Pro
Version          : 10.0.17134

CPU Name         : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Description      : Intel64 Family 6 Model 58 Stepping 9

Cores (Threads)  : 4 (8)      : Architecture  : x64
Clock Speed      : 3901 MHz   : Bus Speed     : 100 MHz
L2Cache          : 1 MB       : L3Cache       : 8 MB

Benchmarks Runs : Inputs (1) * Scales (5) * Benchmarks (6) * Runs (100) = 3,000

Results: distinct input

--- Random Set 1 ---------------------------------------------------------------------
| Value         |   Average |   Fastest |      Cycles |    Garbage | Test |     Gain |
--- Scale 100 --------------------------------------------------------- Time 0.334 ---
| Backs         |  0.008 ms |  0.007 ms |      31,362 |   8.000 KB | Pass |  68.34 % |
| ListUnsafe    |  0.009 ms |  0.008 ms |      35,487 |   8.000 KB | Pass |  63.45 % |
| HasSet        |  0.012 ms |  0.011 ms |      46,840 |   8.000 KB | Pass |  50.03 % |
| ArrayUnsafe   |  0.013 ms |  0.011 ms |      49,388 |   8.000 KB | Pass |  47.75 % |
| HashSetUnsafe |  0.018 ms |  0.013 ms |      66,866 |  16.000 KB | Pass |  26.62 % |
| Todd          |  0.024 ms |  0.019 ms |      90,763 |  16.000 KB | Base |   0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.377 ---
| Backs         |  0.070 ms |  0.060 ms |     249,374 |  28.977 KB | Pass |  57.56 % |
| ListUnsafe    |  0.078 ms |  0.067 ms |     277,080 |  28.977 KB | Pass |  52.67 % |
| HasSet        |  0.093 ms |  0.083 ms |     329,686 |  28.977 KB | Pass |  43.61 % |
| ArrayUnsafe   |  0.096 ms |  0.082 ms |     340,154 |  36.977 KB | Pass |  41.72 % |
| HashSetUnsafe |  0.103 ms |  0.085 ms |     367,681 |  55.797 KB | Pass |  37.07 % |
| Todd          |  0.164 ms |  0.151 ms |     578,933 | 112.664 KB | Base |   0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.965 ---
| ListUnsafe    |  0.706 ms |  0.611 ms |   2,467,327 | 258.516 KB | Pass |  48.60 % |
| Backs         |  0.758 ms |  0.654 ms |   2,656,610 | 180.297 KB | Pass |  44.81 % |
| ArrayUnsafe   |  0.783 ms |  0.696 ms |   2,739,156 | 276.281 KB | Pass |  43.02 % |
| HasSet        |  0.859 ms |  0.752 ms |   2,999,230 | 198.063 KB | Pass |  37.47 % |
| HashSetUnsafe |  0.864 ms |  0.783 ms |   3,029,086 | 332.273 KB | Pass |  37.07 % |
| Todd          |  1.373 ms |  1.251 ms |   4,795,929 | 604.742 KB | Base |   0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 5.535 ---
| ListUnsafe    |  5.624 ms |  4.874 ms |  19,658,154 |   2.926 MB | Pass |  40.36 % |
| HasSet        |  7.574 ms |  6.548 ms |  26,446,193 |   2.820 MB | Pass |  19.68 % |
| Backs         |  7.585 ms |  5.634 ms |  26,303,794 |   2.009 MB | Pass |  19.57 % |
| ArrayUnsafe   |  8.287 ms |  6.219 ms |  28,923,797 |   3.583 MB | Pass |  12.12 % |
| Todd          |  9.430 ms |  7.326 ms |  32,880,985 |   2.144 MB | Base |   0.00 % |
| HashSetUnsafe |  9.601 ms |  7.859 ms |  32,845,228 |   5.197 MB | Pass |  -1.81 % |
--- Scale 1,000,000 -------------------------------------------------- Time 47.652 ---
| ListUnsafe    | 57.751 ms | 44.734 ms | 201,477,028 |  29.309 MB | Pass |  22.14 % |
| Backs         | 65.567 ms | 49.023 ms | 228,772,283 |  21.526 MB | Pass |  11.61 % |
| HasSet        | 73.163 ms | 56.799 ms | 254,703,994 |  25.904 MB | Pass |   1.36 % |
| Todd          | 74.175 ms | 53.739 ms | 258,760,390 |   9.144 MB | Base |   0.00 % |
| ArrayUnsafe   | 86.530 ms | 67.803 ms | 300,374,535 |  13.755 MB | Pass | -16.66 % |
| HashSetUnsafe | 97.140 ms | 77.844 ms | 337,639,426 |  39.527 MB | Pass | -30.96 % |
--------------------------------------------------------------------------------------

Results: random lists, calling Distinct on the result where needed

--- Random Set 1 ---------------------------------------------------------------------
| Value         |    Average |   Fastest |      Cycles |    Garbage | Test |    Gain |
--- Scale 100 --------------------------------------------------------- Time 0.272 ---
| Backs         |   0.007 ms |  0.006 ms |      28,449 |   8.000 KB | Pass | 72.96 % |
| HasSet        |   0.010 ms |  0.009 ms |      38,222 |   8.000 KB | Pass | 62.05 % |
| HashSetUnsafe |   0.014 ms |  0.010 ms |      51,816 |  16.000 KB | Pass | 47.52 % |
| ListUnsafe    |   0.017 ms |  0.014 ms |      64,333 |  16.000 KB | Pass | 33.84 % |
| ArrayUnsafe   |   0.020 ms |  0.015 ms |      72,468 |  16.000 KB | Pass | 24.70 % |
| Todd          |   0.026 ms |  0.021 ms |      95,500 |  24.000 KB | Base |  0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.361 ---
| Backs         |   0.061 ms |  0.053 ms |     219,141 |  28.977 KB | Pass | 70.46 % |
| HasSet        |   0.092 ms |  0.080 ms |     325,353 |  28.977 KB | Pass | 55.78 % |
| HashSetUnsafe |   0.093 ms |  0.079 ms |     331,390 |  55.797 KB | Pass | 55.03 % |
| ListUnsafe    |   0.122 ms |  0.101 ms |     432,029 |  73.016 KB | Pass | 41.19 % |
| ArrayUnsafe   |   0.133 ms |  0.113 ms |     469,560 |  73.016 KB | Pass | 35.88 % |
| Todd          |   0.208 ms |  0.173 ms |     730,661 | 148.703 KB | Base |  0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.870 ---
| Backs         |   0.620 ms |  0.579 ms |   2,174,415 | 180.188 KB | Pass | 55.31 % |
| HasSet        |   0.696 ms |  0.635 ms |   2,440,300 | 198.063 KB | Pass | 49.87 % |
| HashSetUnsafe |   0.731 ms |  0.679 ms |   2,563,125 | 332.164 KB | Pass | 47.32 % |
| ListUnsafe    |   0.804 ms |  0.761 ms |   2,818,293 | 400.492 KB | Pass | 42.11 % |
| ArrayUnsafe   |   0.810 ms |  0.751 ms |   2,838,680 | 400.492 KB | Pass | 41.68 % |
| Todd          |   1.388 ms |  1.271 ms |   4,863,651 | 736.953 KB | Base |  0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 6.616 ---
| Backs         |   5.604 ms |  4.710 ms |  19,600,934 |   2.009 MB | Pass | 62.92 % |
| HasSet        |   6.607 ms |  5.847 ms |  23,093,963 |   2.820 MB | Pass | 56.29 % |
| HashSetUnsafe |   8.565 ms |  7.465 ms |  29,239,067 |   5.197 MB | Pass | 43.34 % |
| ListUnsafe    |  11.447 ms |  9.543 ms |  39,452,865 |   5.101 MB | Pass | 24.28 % |
| ArrayUnsafe   |  11.517 ms |  9.841 ms |  39,731,502 |   5.483 MB | Pass | 23.81 % |
| Todd          |  15.116 ms | 11.369 ms |  51,963,309 |   3.427 MB | Base |  0.00 % |
--- Scale 1,000,000 -------------------------------------------------- Time 55.310 ---
| Backs         |  53.766 ms | 44.321 ms | 187,905,335 |  21.526 MB | Pass | 51.32 % |
| HasSet        |  60.759 ms | 50.742 ms | 212,409,649 |  25.904 MB | Pass | 44.99 % |
| HashSetUnsafe |  79.248 ms | 67.130 ms | 275,455,545 |  39.527 MB | Pass | 28.25 % |
| ListUnsafe    | 106.527 ms | 90.159 ms | 370,838,650 |  39.153 MB | Pass |  3.55 % |
| Todd          | 110.444 ms | 93.225 ms | 384,636,081 |  22.676 MB | Base |  0.00 % |
| ArrayUnsafe   | 114.548 ms | 98.033 ms | 398,219,513 |  38.974 MB | Pass | -3.72 % |
--------------------------------------------------------------------------------------

Data

private Tuple<List<int>, List<int>> GenerateData(int scale)
{
   return new Tuple<List<int>, List<int>>(
      Enumerable.Range(0, scale)
                .Select(x => x)
                .ToList(),
      Enumerable.Range(0, scale)
                .Select(x => Rand.Next(10000))
                .ToList());
}
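Note: Rand isn't shown in the original benchmark code; presumably it is a field on the fixture, something like:

private static readonly Random Rand = new Random();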

Code

public class Backs : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var hashSet = new HashSet<int>(Input.Item1); 
      hashSet.ExceptWith(Input.Item2); 
      return hashSet.ToList(); 
   }
}

public class HasSet : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{

   protected override List<int> InternalRun()
   {
      var hashSet = new HashSet<int>(Input.Item2); 

      return Input.Item1.Where(y => !hashSet.Contains(y)).ToList(); 
   }
}

public class Todd : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var referenceHashSet = Input.Item2.Distinct()                 
                                      .ToDictionary(x => x, x => x);

      return Input.Item1.Where(y => !referenceHashSet.TryGetValue(y, out _)).ToList();
   }
}

public unsafe class HashSetUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new HashSet<int>();
      fixed (int* pAry = Input.Item1.ToArray())
      {
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               result.Add(*p);
         }
      }
      return result.ToList(); 
   }
}
public unsafe class ListUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new List<int>(Input.Item1.Count); //Capacity for the worst case: every item of Set kept

      fixed (int* pAry = Input.Item1.ToArray())
      {
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               result.Add(*p);
         }
      }
      return result.ToList(); 
   }
}

public unsafe class ArrayUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new int[Input.Item1.Count];

      fixed (int* pAry = Input.Item1.ToArray(), pRes = result)
      {
         var j = 0;
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               *(pRes+j++) = *p;
         }
         return result.Take(j).ToList(); 
      }

   }
}

Summary

No real surprises here: starting with an already-distinct list favours some solutions, and if not, the simplest HashSet version is best.

Dual index in a single loop

As @PepitoSh suggested in the question comments:

I think HashSet is a very generic solution to a rather specific problem. If your lists are ordered, scanning them parallel and compare the current items is the fastest

This is very different from two nested loops. Instead there is a single shared loop, with the two indexes incremented in parallel according to the relative difference of the values. That difference is basically the output of any normal comparison function: { negative, 0, positive }.

static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
    int i1 = 0;
    int i2 = 0;
    int thisValue = Set[i1];
    int thisReference = Reference[i2];
    while (true)
    {
        var difference = thisReference - thisValue;

        if (difference < 0)
        {
            i2++; //Compare side is too low, there might be an equal value to be found
            if (i2 == Reference.Count)
                break;
            thisReference = Reference[i2];
            continue;
        }

        if (difference > 0) //Not a duplicate - include in output
            yield return thisValue;

        GoFurther:
        i1++;
        if (i1 == Set.Count)
            break;
        if (Set[i1] == thisValue) //Eliminates duplicates
            goto GoFurther; //I rarely use goto statements, but this is a good situation

        thisValue = Set[i1];
    }
}

How to call this function if the lists aren't already sorted:

Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

return deduplicationFunction4(Set, Reference);

This performed best for me in the benchmarks. Unsafe code could probably be tried here as well for some further speedup. In scenarios where the data is already sorted, this is by far the best approach; a faster sorting algorithm could also be chosen, but that's not the subject of this question.

Note: this method removes duplicates within the Set as it runs.
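Tying it back to the question's example input, a usage sketch (using the sorting wrapper from the test code above, deduplicationFunction3):

var setToDeduplicate = Enumerable.Range(1, 1000000).ToList(); //All integer values 1-1M
var referenceSet = Enumerable.Range(0, 500000).Select(i => 2 * i + 1).ToList(); //All odd values 1-999999
var deduplicatedSet = deduplicationFunction3(setToDeduplicate, referenceSet).ToList();
//deduplicatedSet now holds the even numbers {2, 4, 6, ..., 1000000}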

I had actually written a loop pattern like this before, for determining text search results, except I had N arrays to check for "closeness", so I had an array of indexes - array[index[i]]. So I'm sure a single loop with controlled index incrementing isn't a new concept, but it's certainly a neat solution here.