With two very large lists/collections - how to detect and/or remove duplicates efficiently
Why
When comparing and de-duplicating across two lists, coders under time pressure usually don't arrive at the most runtime-efficient implementation. Two nested for loops are a common go-to solution for many programmers. One might try a CROSS JOIN in LINQ, but that is clearly inefficient. Coders need an approach that is memorable and code-efficient, while also being reasonably runtime-efficient.
This question was created after seeing a more specific question: - that one was specific to working with DataSets, and the word "dataset" won't help people searching later. No other general question on this was found.
What
I use the terms List/Collection to frame this more general coding problem.
var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M
var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M
var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);
With deduplicationFunction implemented, the input data and output should be clear. The output could be an IEnumerable. The expected output for this example input is the even numbers from 1-1M: {2,4,6,8,...}
Note: there may be duplicates in the reference set. The values in both sets are illustrative only, so I'm not looking for a mathematical solution - it must also work for random-number input in both sets.
Handled with simple LINQ functions this would be far too slow, O(1M*0.5M). For sets this large a faster approach is needed; the naive shape being ruled out is sketched below for reference.
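A minimal sketch of that naive approach; List<T>.Contains is a linear scan, so the Where clause alone costs O(n·m) over the two list sizes:

//Too slow for large sets: every element scans the whole reference list
var deduplicatedSet = setToDeduplicate
    .Where(x => !referenceSet.Contains(x)) //O(m) per element
    .Distinct();                           //also removes duplicates within the set itself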
Speed is important, but incremental improvements bought with a lot of code are worth less. Ideally the approach would also work for other data types, including data-model objects, but answering this specific question is enough; other data types just involve more preprocessing or a slight change to the answer.
Solution summary
Test code follows, with the results after it:
using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Preparing...");
            List<int> set1 = new List<int>();
            List<int> set2 = new List<int>();
            Random r = new Random();
            var max = 10000;
            for (int i = 0; i < max; i++)
            {
                set1.Add(r.Next(0, max));
                set2.Add(r.Next(0, max / 2) * 2);
            }
            Console.WriteLine("First run...");
            Stopwatch sw = new Stopwatch();
            IEnumerable<int> result;
            int count;

            while (true)
            {
                sw.Start();
                result = deduplicationFunction(set1, set2);
                var results1 = result.ToList();
                count = results1.Count;
                sw.Stop();
                Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction2(set1, set2);
                var results2 = result.ToList();
                count = results2.Count;
                sw.Stop();
                Console.WriteLine("  HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction3(set1, set2);
                var results3 = result.ToList();
                count = results3.Count;
                sw.Stop();
                Console.WriteLine("     Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction4(set1, set2);
                var results4 = result.ToList();
                count = results4.Count;
                sw.Stop();
                Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test
                sw.Start();
                result = deduplicationFunction5(set1, set2);
                var results5 = result.ToList();
                count = results5.Count;
                sw.Stop();
                Console.WriteLine("        Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                Console.ReadLine();
                Console.WriteLine("");
                Console.WriteLine("Next Run");
                Console.WriteLine("");
            }
        }

        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var ReferenceHashSet = Reference
                .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer
            int throwAway;
            return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
        }

        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var SetAsHash = new HashSet<int>();
            Set.ForEach(x =>
            {
                if (SetAsHash.Contains(x))
                    return;
                SetAsHash.Add(x);
            }); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.
            SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing
            return SetAsHash.AsEnumerable();
        }

        static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
        {
            Set.Sort();
            Reference.Sort();
            Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an in-built stop clause. This is easy for int list items, just + 1 on the last item.
            return deduplicationFunction4(Set, Reference);
        }

        static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
        {
            int i1 = 0;
            int i2 = 0;
            int thisValue = Set[i1];
            int thisReference = Reference[i2];
            while (true)
            {
                var difference = thisReference - thisValue;
                if (difference < 0)
                {
                    i2++; //Compare side is too low, there might be an equal value to be found
                    if (i2 == Reference.Count)
                        break;
                    thisReference = Reference[i2];
                    continue;
                }
                if (difference > 0) //Not present in the reference set: keep it
                    yield return thisValue;
            GoFurther:
                i1++;
                if (i1 == Set.Count)
                    break;
                if (Set[i1] == thisValue) //Eliminates duplicates within Set
                    goto GoFurther; //I rarely use goto statements, but this is a good situation
                thisValue = Set[i1];
            }
        }

        static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
        {
            var found = false;
            var lastValue = 0;
            var thisValue = 0;
            for (int i = 0; i < Set.Count; i++)
            {
                thisValue = Set[i];
                if (thisValue == lastValue)
                    continue;
                lastValue = thisValue;
                found = false;
                for (int x = 0; x < Reference.Count; x++)
                {
                    if (thisValue != Reference[x])
                        continue;
                    found = true;
                    break;
                }
                if (found)
                    continue;
                yield return thisValue;
            }
        }
    }
}
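As the comment in deduplicationFunction2 notes, on .NET Framework 4.7.2+ (or .NET Core 2.0+) the manual HashSet build collapses to one line via LINQ's ToHashSet - a minimal sketch:

//Equivalent of the Set.ForEach block above on newer frameworks; duplicates are dropped automatically
var SetAsHash = Set.ToHashSet();
SetAsHash.ExceptWith(Reference);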
I'll use this to compare the performance of the various approaches. (At this stage I'm particularly interested in the hash approach vs the dual-index-on-sorted approach, although ExceptWith enables a concise solution.)
Results so far on 10k items per set (good runs):
First run
- Dictionary and Where - Count: 3565, Milliseconds: 16.38.
- HashSet ExceptWith - Count: 3565, Milliseconds: 5.33.
- Sort Dual Index - Count: 3565, Milliseconds: 6.34.
- Presorted Dual Index - Count: 3565, Milliseconds: 1.14.
- Nested Index - Count: 3565, Milliseconds: 964.16.
Good run
- Dictionary and Where - Count: 3565, Milliseconds: 1.21.
- HashSet ExceptWith - Count: 3565, Milliseconds: 0.94.
- Sort Dual Index - Count: 3565, Milliseconds: 1.09.
- Presorted Dual Index - Count: 3565, Milliseconds: 0.76.
- Nested Index - Count: 3565, Milliseconds: 628.60.
Chosen answers:
- @backs' HashSet.ExceptWith approach - marginally the fastest with the least code, using an interesting function, ExceptWith, but weakened by a lack of versatility, and that function isn't widely known.
- One of my answers: HashSet > Where(..Contains..) - only a touch slower than @backs', but uses a LINQ code pattern that is very versatile beyond a plain list of elements. I believe that's the more common situation I meet while coding, and I'm sure many other coders would agree.
- Special thanks to @TheGeneral for benchmarking several of the answers plus some interesting unsafe versions, and for helping make @Backs' answer more efficient in subsequent tests.
HashSet and Where
You must use a hash set (or a dictionary) for speed:
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = Reference
        .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
        .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer
    int throwAway;
    return Set.Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false); //Keep only items NOT found in the reference set
}
This is the lambda-expression version. It uses a Dictionary, which adds the flexibility of mutating the value should that ever be needed. A literal for loop could be used, and might gain a little more incremental performance (a sketch follows), but relative to two nested loops this is already an amazing improvement.
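A minimal sketch of that literal-loop variant (my naming; same algorithm with the LINQ iterators removed):

static List<int> deduplicationFunctionLoop(List<int> Set, List<int> Reference)
{
    var referenceHashSet = new HashSet<int>(Reference);
    var seen = new HashSet<int>(); //tracks duplicates within Set itself
    var output = new List<int>();
    for (int i = 0; i < Set.Count; i++)
    {
        var value = Set[i];
        if (!referenceHashSet.Contains(value) && seen.Add(value)) //Add returns false for an already-seen value
            output.Add(value);
    }
    return output;
}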
Having learned a thing or two from the other answers, here is a faster implementation:
static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference);
    return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}
Importantly, this approach (while a touch slower than @backs' answer) is still versatile enough to be used with database entities, and other types can easily be used on the duplicate-check field.
Here's an example of how the code is easily adjusted for a list of Person-type database entities:
static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
    return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
        .GroupBy(p => p.ID).Select(p => p.First()); //The GroupBy and Select accomplish DistinctBy(..p.ID)
}
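If the GroupBy/Select pair reads awkwardly, a tiny DistinctBy extension does the same job - a sketch (.NET 6+ ships Enumerable.DistinctBy built in, so this is only needed on older frameworks):

public static class EnumerableExtensions
{
    //Yields the first item seen for each key; HashSet.Add returns false for an already-seen key
    public static IEnumerable<T> DistinctBy<T, TKey>(this IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        var seenKeys = new HashSet<TKey>();
        foreach (var item in source)
            if (seenKeys.Add(keySelector(item)))
                yield return item;
    }
}

With that in place, the last line of deduplicatePeople becomes .DistinctBy(p => p.ID).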
Use a HashSet for your initial list, and use the ExceptWith method to get the resulting set:
var setToDeduplicate = new HashSet<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M
var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M
setToDeduplicate.ExceptWith(referenceSet);
Here are a few more; basically I wanted to test distinct and non-distinct inputs against the various solutions. For the non-distinct version I had to call Distinct on the final output where it was needed.
Mode : Release (64Bit)
Test Framework : .NET Framework 4.7.1
Operating System : Microsoft Windows 10 Pro
Version : 10.0.17134
CPU Name : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Description : Intel64 Family 6 Model 58 Stepping 9
Cores (Threads) : 4 (8) : Architecture : x64
Clock Speed : 3901 MHz : Bus Speed : 100 MHz
L2Cache : 1 MB : L3Cache : 8 MB
Benchmarks Runs : Inputs (1) * Scales (5) * Benchmarks (6) * Runs (100) = 3,000
Results: distinct inputs
--- Random Set 1 ---------------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 --------------------------------------------------------- Time 0.334 ---
| Backs | 0.008 ms | 0.007 ms | 31,362 | 8.000 KB | Pass | 68.34 % |
| ListUnsafe | 0.009 ms | 0.008 ms | 35,487 | 8.000 KB | Pass | 63.45 % |
| HasSet | 0.012 ms | 0.011 ms | 46,840 | 8.000 KB | Pass | 50.03 % |
| ArrayUnsafe | 0.013 ms | 0.011 ms | 49,388 | 8.000 KB | Pass | 47.75 % |
| HashSetUnsafe | 0.018 ms | 0.013 ms | 66,866 | 16.000 KB | Pass | 26.62 % |
| Todd | 0.024 ms | 0.019 ms | 90,763 | 16.000 KB | Base | 0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.377 ---
| Backs | 0.070 ms | 0.060 ms | 249,374 | 28.977 KB | Pass | 57.56 % |
| ListUnsafe | 0.078 ms | 0.067 ms | 277,080 | 28.977 KB | Pass | 52.67 % |
| HasSet | 0.093 ms | 0.083 ms | 329,686 | 28.977 KB | Pass | 43.61 % |
| ArrayUnsafe | 0.096 ms | 0.082 ms | 340,154 | 36.977 KB | Pass | 41.72 % |
| HashSetUnsafe | 0.103 ms | 0.085 ms | 367,681 | 55.797 KB | Pass | 37.07 % |
| Todd | 0.164 ms | 0.151 ms | 578,933 | 112.664 KB | Base | 0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.965 ---
| ListUnsafe | 0.706 ms | 0.611 ms | 2,467,327 | 258.516 KB | Pass | 48.60 % |
| Backs | 0.758 ms | 0.654 ms | 2,656,610 | 180.297 KB | Pass | 44.81 % |
| ArrayUnsafe | 0.783 ms | 0.696 ms | 2,739,156 | 276.281 KB | Pass | 43.02 % |
| HasSet | 0.859 ms | 0.752 ms | 2,999,230 | 198.063 KB | Pass | 37.47 % |
| HashSetUnsafe | 0.864 ms | 0.783 ms | 3,029,086 | 332.273 KB | Pass | 37.07 % |
| Todd | 1.373 ms | 1.251 ms | 4,795,929 | 604.742 KB | Base | 0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 5.535 ---
| ListUnsafe | 5.624 ms | 4.874 ms | 19,658,154 | 2.926 MB | Pass | 40.36 % |
| HasSet | 7.574 ms | 6.548 ms | 26,446,193 | 2.820 MB | Pass | 19.68 % |
| Backs | 7.585 ms | 5.634 ms | 26,303,794 | 2.009 MB | Pass | 19.57 % |
| ArrayUnsafe | 8.287 ms | 6.219 ms | 28,923,797 | 3.583 MB | Pass | 12.12 % |
| Todd | 9.430 ms | 7.326 ms | 32,880,985 | 2.144 MB | Base | 0.00 % |
| HashSetUnsafe | 9.601 ms | 7.859 ms | 32,845,228 | 5.197 MB | Pass | -1.81 % |
--- Scale 1,000,000 -------------------------------------------------- Time 47.652 ---
| ListUnsafe | 57.751 ms | 44.734 ms | 201,477,028 | 29.309 MB | Pass | 22.14 % |
| Backs | 65.567 ms | 49.023 ms | 228,772,283 | 21.526 MB | Pass | 11.61 % |
| HasSet | 73.163 ms | 56.799 ms | 254,703,994 | 25.904 MB | Pass | 1.36 % |
| Todd | 74.175 ms | 53.739 ms | 258,760,390 | 9.144 MB | Base | 0.00 % |
| ArrayUnsafe | 86.530 ms | 67.803 ms | 300,374,535 | 13.755 MB | Pass | -16.66 % |
| HashSetUnsafe | 97.140 ms | 77.844 ms | 337,639,426 | 39.527 MB | Pass | -30.96 % |
--------------------------------------------------------------------------------------
Results: random lists, with Distinct applied to the results where needed
--- Random Set 1 ---------------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 --------------------------------------------------------- Time 0.272 ---
| Backs | 0.007 ms | 0.006 ms | 28,449 | 8.000 KB | Pass | 72.96 % |
| HasSet | 0.010 ms | 0.009 ms | 38,222 | 8.000 KB | Pass | 62.05 % |
| HashSetUnsafe | 0.014 ms | 0.010 ms | 51,816 | 16.000 KB | Pass | 47.52 % |
| ListUnsafe | 0.017 ms | 0.014 ms | 64,333 | 16.000 KB | Pass | 33.84 % |
| ArrayUnsafe | 0.020 ms | 0.015 ms | 72,468 | 16.000 KB | Pass | 24.70 % |
| Todd | 0.026 ms | 0.021 ms | 95,500 | 24.000 KB | Base | 0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.361 ---
| Backs | 0.061 ms | 0.053 ms | 219,141 | 28.977 KB | Pass | 70.46 % |
| HasSet | 0.092 ms | 0.080 ms | 325,353 | 28.977 KB | Pass | 55.78 % |
| HashSetUnsafe | 0.093 ms | 0.079 ms | 331,390 | 55.797 KB | Pass | 55.03 % |
| ListUnsafe | 0.122 ms | 0.101 ms | 432,029 | 73.016 KB | Pass | 41.19 % |
| ArrayUnsafe | 0.133 ms | 0.113 ms | 469,560 | 73.016 KB | Pass | 35.88 % |
| Todd | 0.208 ms | 0.173 ms | 730,661 | 148.703 KB | Base | 0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.870 ---
| Backs | 0.620 ms | 0.579 ms | 2,174,415 | 180.188 KB | Pass | 55.31 % |
| HasSet | 0.696 ms | 0.635 ms | 2,440,300 | 198.063 KB | Pass | 49.87 % |
| HashSetUnsafe | 0.731 ms | 0.679 ms | 2,563,125 | 332.164 KB | Pass | 47.32 % |
| ListUnsafe | 0.804 ms | 0.761 ms | 2,818,293 | 400.492 KB | Pass | 42.11 % |
| ArrayUnsafe | 0.810 ms | 0.751 ms | 2,838,680 | 400.492 KB | Pass | 41.68 % |
| Todd | 1.388 ms | 1.271 ms | 4,863,651 | 736.953 KB | Base | 0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 6.616 ---
| Backs | 5.604 ms | 4.710 ms | 19,600,934 | 2.009 MB | Pass | 62.92 % |
| HasSet | 6.607 ms | 5.847 ms | 23,093,963 | 2.820 MB | Pass | 56.29 % |
| HashSetUnsafe | 8.565 ms | 7.465 ms | 29,239,067 | 5.197 MB | Pass | 43.34 % |
| ListUnsafe | 11.447 ms | 9.543 ms | 39,452,865 | 5.101 MB | Pass | 24.28 % |
| ArrayUnsafe | 11.517 ms | 9.841 ms | 39,731,502 | 5.483 MB | Pass | 23.81 % |
| Todd | 15.116 ms | 11.369 ms | 51,963,309 | 3.427 MB | Base | 0.00 % |
--- Scale 1,000,000 -------------------------------------------------- Time 55.310 ---
| Backs | 53.766 ms | 44.321 ms | 187,905,335 | 21.526 MB | Pass | 51.32 % |
| HasSet | 60.759 ms | 50.742 ms | 212,409,649 | 25.904 MB | Pass | 44.99 % |
| HashSetUnsafe | 79.248 ms | 67.130 ms | 275,455,545 | 39.527 MB | Pass | 28.25 % |
| ListUnsafe | 106.527 ms | 90.159 ms | 370,838,650 | 39.153 MB | Pass | 3.55 % |
| Todd | 110.444 ms | 93.225 ms | 384,636,081 | 22.676 MB | Base | 0.00 % |
| ArrayUnsafe | 114.548 ms | 98.033 ms | 398,219,513 | 38.974 MB | Pass | -3.72 % |
--------------------------------------------------------------------------------------
Data
private Tuple<List<int>, List<int>> GenerateData(int scale)
{
    return new Tuple<List<int>, List<int>>(
        Enumerable.Range(0, scale)
            .Select(x => x)
            .ToList(),
        Enumerable.Range(0, scale)
            .Select(x => Rand.Next(10000))
            .ToList());
}
Code
public class Backs : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
    protected override List<int> InternalRun()
    {
        var hashSet = new HashSet<int>(Input.Item1);
        hashSet.ExceptWith(Input.Item2);
        return hashSet.ToList();
    }
}

public class HasSet : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
    protected override List<int> InternalRun()
    {
        var hashSet = new HashSet<int>(Input.Item2);
        return Input.Item1.Where(y => !hashSet.Contains(y)).ToList();
    }
}

public class Todd : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
    protected override List<int> InternalRun()
    {
        var referenceHashSet = Input.Item2.Distinct()
            .ToDictionary(x => x, x => x);
        return Input.Item1.Where(y => !referenceHashSet.TryGetValue(y, out _)).ToList();
    }
}

public unsafe class HashSetUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
    protected override List<int> InternalRun()
    {
        var reference = new HashSet<int>(Input.Item2);
        var result = new HashSet<int>();
        fixed (int* pAry = Input.Item1.ToArray())
        {
            var len = pAry + Input.Item1.Count;
            for (var p = pAry; p < len; p++)
            {
                if (!reference.Contains(*p))
                    result.Add(*p);
            }
        }
        return result.ToList();
    }
}

public unsafe class ListUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
    protected override List<int> InternalRun()
    {
        var reference = new HashSet<int>(Input.Item2);
        var result = new List<int>(Input.Item2.Count);
        fixed (int* pAry = Input.Item1.ToArray())
        {
            var len = pAry + Input.Item1.Count;
            for (var p = pAry; p < len; p++)
            {
                if (!reference.Contains(*p))
                    result.Add(*p);
            }
        }
        return result.ToList();
    }
}

public unsafe class ArrayUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
    protected override List<int> InternalRun()
    {
        var reference = new HashSet<int>(Input.Item2);
        var result = new int[Input.Item1.Count];
        fixed (int* pAry = Input.Item1.ToArray(), pRes = result)
        {
            var j = 0;
            var len = pAry + Input.Item1.Count;
            for (var p = pAry; p < len; p++)
            {
                if (!reference.Contains(*p))
                    *(pRes + j++) = *p;
            }
            return result.Take(j).ToList();
        }
    }
}
Summary
No real surprises here: if you start with a distinct list, that favors certain solutions; if not, the simplest HashSet version is the best.
Single Loop, Dual Index
As @PepitoSh recommended in the comments on the question:
I think HashSet is a very generic solution to a rather specific problem. If your lists are ordered, scanning them parallel and compare the current items is the fastest
This is very different from having two nested loops. Instead, there is a single overall loop, with the indexes incremented in parallel according to the relative difference of the values. That difference is basically the output of any normal compare function: { negative, 0, positive }
static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
    int i1 = 0;
    int i2 = 0;
    int thisValue = Set[i1];
    int thisReference = Reference[i2];
    while (true)
    {
        var difference = thisReference - thisValue;
        if (difference < 0)
        {
            i2++; //Compare side is too low, there might be an equal value to be found
            if (i2 == Reference.Count)
                break;
            thisReference = Reference[i2];
            continue;
        }
        if (difference > 0) //Not present in the reference set: keep it
            yield return thisValue;
    GoFurther:
        i1++;
        if (i1 == Set.Count)
            break;
        if (Set[i1] == thisValue) //Eliminates duplicates within Set
            goto GoFurther; //I rarely use goto statements, but this is a good situation
        thisValue = Set[i1];
    }
}
Here's how to call the function if the lists are not already sorted:
Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.
return deduplicationFunction4(Set, Reference);
This gave me the best performance in my benchmarking. Unsafe code could probably be tried here too for even more speed in some scenarios. Where the data is already sorted, this is by far the best. A faster sorting algorithm could also be chosen, but that's not the topic of this question.
Note: this method deduplicates as it runs.
I have actually written a loop pattern like this before, for scoring text search results, except I had N arrays to check for "closeness", so I had an array of indexes - array[index[i]]. So I'm sure a single loop with controlled index incrementing isn't a new concept, but it's certainly a good solution here.
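To illustrate the generality mentioned earlier, here is a hedged sketch of the same dual-index idea over an arbitrary int ordering key (my naming; both lists must already be sorted by that key, and it bounds-checks directly instead of appending a sentinel):

static IEnumerable<T> DeduplicateSorted<T>(List<T> set, List<T> reference, Func<T, int> key)
{
    int i1 = 0, i2 = 0;
    while (i1 < set.Count)
    {
        var k = key(set[i1]);
        while (i2 < reference.Count && key(reference[i2]) < k)
            i2++; //advance the reference side until it catches up
        if (i2 == reference.Count || key(reference[i2]) > k)
            yield return set[i1]; //no equal key in reference: keep this item
        do { i1++; } while (i1 < set.Count && key(set[i1]) == k); //skip duplicates within set
    }
}

For the Person example earlier, this would be called as DeduplicateSorted(people, referencePeople, p => p.ID).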