Find a common pattern among a set of URLs

Question

I have a file of URLs. The file looks like this:

http://www.example.com/images/1
http://www.example.com/images/2
.
.
.
http://www.example.com/images/2000
http://www.example.org/p/q/r/1/s/t
http://www.example.org/p/q/r/2/s/t
http://www.example.org/p/q/r/3/s/t
.
.
.
http://www.example.org/p/q/r/5000/s/t

and so on. The URLs are not sorted; I have only arranged them here to make the explanation clearer.

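I have to process these URLs so that if two URLs differ in exactly one word (a word being the text between two slashes), and there are more than 1000 occurrences of that, I replace that word with a *.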

For example, for the file above I would get:

http://www.example.com/images/*
http://www.example.org/p/q/r/*/s/t

The file is hundreds of GB in size. Can someone help me solve this?

Answer 1

We proposed a MapReduce algorithm for this problem in this paper (end of page 3), which is a parallel adaptation of the sequential "infix extraction" algorithm introduced in another paper (page 4, Algorithm 1).

The sequential algorithm is quoted here:

1. sort URIs
2. tokenize URIs at all special characters
3. cluster URIs according to the first n tokens 
4. for all clusters do
5.   for all URIs in the cluster do
6.     for all possible prefixes do
7.       find the set of (distinct) next tokens T
8.     end for
9.   end for
10.   for all URIs in the cluster do
11.     set as a prefix the one with the largest |T|
12.     set as infix the substring following the prefix
13.   end for
14. end for
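
To make the quoted steps concrete, here is a minimal in-memory Python sketch of them. It is not the code from either paper: the names `tokenize` and `extract_infixes`, the default `n=2`, the choice of treating every non-alphanumeric character as "special", and the tie-breaking rule (shortest prefix wins) are all assumptions made for illustration. As in the quoted text, the infix is simply everything that follows the chosen prefix.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z0-9]+")  # assumption: anything else is "special"

def tokenize(uri):
    """Step 2: return (token, start_offset) pairs for one URI."""
    return [(m.group(), m.start()) for m in TOKEN.finditer(uri)]

def extract_infixes(uris, n=2):
    """Return {uri: (prefix, infix)} following steps 1-14 of the quoted algorithm."""
    clusters = defaultdict(list)
    for uri in sorted(set(uris)):                      # step 1
        toks = tokenize(uri)                           # step 2
        key = tuple(t for t, _ in toks[:n])
        clusters[key].append((uri, toks))              # step 3

    result = {}
    for members in clusters.values():                  # step 4
        next_tokens = defaultdict(set)                 # prefix -> set T
        for _, toks in members:                        # steps 5-9
            names = tuple(t for t, _ in toks)
            for k in range(1, len(names)):
                next_tokens[names[:k]].add(names[k])
        for uri, toks in members:                      # steps 10-13
            names = tuple(t for t, _ in toks)
            if len(names) < 2:                         # nothing can follow a prefix
                result[uri] = (uri, "")
                continue
            # Pick the prefix length whose set T is largest (first one on ties).
            best = max(range(1, len(names)),
                       key=lambda k: len(next_tokens[names[:k]]))
            cut = toks[best][1]                        # offset where the infix starts
            result[uri] = (uri[:cut], uri[cut:])
    return result
```

Calling `extract_infixes` on a handful of URLs shaped like the ones in the question splits each one just before the varying number. Everything is held in memory, so this only illustrates the logic on a small sample.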

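The main idea of the parallel version is to create the clusters (step 3) by using the second token of each URI (the one after http://) as the map output key, and then to perform something similar to steps 4-14 in each reducer, i.e., within each cluster. The source code of our parallel version can be found here.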
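
The map/reduce split described above can be simulated locally in plain Python. This is only a sketch under assumptions, not the linked source code: `map_uri`, `reduce_cluster` and `run_job` are hypothetical names, `run_job` stands in for the framework's shuffle phase, and tokens are assumed to be runs of alphanumeric characters. Note that if every URL starts with `http://www.`, as in the question, the second token is always `www`, so a different key (for instance the host name) would probably be needed to spread the load.

```python
import re
from collections import defaultdict
from itertools import groupby

TOKEN = re.compile(r"[A-Za-z0-9]+")  # assumption: anything else is "special"

def map_uri(uri):
    """Mapper: emit (second token, uri); the first token is the scheme."""
    tokens = TOKEN.findall(uri)
    yield (tokens[1] if len(tokens) > 1 else ""), uri

def reduce_cluster(key, uris):
    """Reducer: roughly steps 4-14 on one cluster; emit (uri, prefix, infix)."""
    spans = {u: [(m.group(), m.start()) for m in TOKEN.finditer(u)] for u in uris}
    next_tokens = defaultdict(set)                     # prefix -> set T
    for toks in spans.values():
        names = tuple(t for t, _ in toks)
        for k in range(1, len(names)):
            next_tokens[names[:k]].add(names[k])
    for uri, toks in spans.items():
        names = tuple(t for t, _ in toks)
        if len(names) < 2:
            yield uri, uri, ""
            continue
        best = max(range(1, len(names)),
                   key=lambda k: len(next_tokens[names[:k]]))
        cut = toks[best][1]                            # offset where the infix starts
        yield uri, uri[:cut], uri[cut:]

def run_job(uris):
    """Local stand-in for the framework: map, shuffle by key, reduce."""
    pairs = sorted(kv for uri in uris for kv in map_uri(uri))
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield from reduce_cluster(key, [uri for _, uri in group])
```

In a real job each reducer would see only the URIs that share its key, so a single cluster still has to fit in one reducer's memory.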

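Once you have extracted the "infix" of each URI, you can easily replace it with any character you like, e.g. '*'. Keep in mind that this is an expensive process, and doing it in MapReduce only makes sense when you have millions of URIs or more, which you do seem to have.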
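
As a rough illustration of that last replacement step, the sketch below takes (prefix, infix) pairs and applies the threshold from the question. The function name `to_patterns`, the input shape, and the decision to replace only the leading token of the infix (so that a trailing part such as `/s/t` is kept) are assumptions, not part of either paper.

```python
import re
from collections import defaultdict

HEAD = re.compile(r"[A-Za-z0-9]*")  # leading token of an infix (may be empty)

def to_patterns(parts, threshold=1000):
    """Collapse (prefix, infix) pairs into a set of URL patterns.

    The leading infix token becomes '*' when its prefix is followed by more
    than `threshold` distinct tokens; otherwise the URL is kept unchanged.
    """
    parts = list(parts)
    variants = defaultdict(set)                 # prefix -> distinct leading tokens
    for prefix, infix in parts:
        variants[prefix].add(HEAD.match(infix).group())

    patterns = set()
    for prefix, infix in parts:
        head = HEAD.match(infix).group()
        if len(variants[prefix]) > threshold:
            patterns.add(prefix + "*" + infix[len(head):])
        else:
            patterns.add(prefix + infix)
    return patterns
```

With a small threshold for testing, five pairs sharing the prefix `http://www.example.org/p/q/r/` and infixes `1/s/t` to `5/s/t` collapse to the single pattern `http://www.example.org/p/q/r/*/s/t`. On a file of hundreds of GB this counting would itself have to run per cluster rather than in one process.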

url

mapreduce

distributed-computing

amazon-web-services