Find similar texts across a Python dataframe

Question

Suppose I have a Python dataframe as follows:

data['text']

abc.google.com
d-2667808233512566908.ampproject.net
d-27973032622323999654.ampproject.net
def.google.com
d-28678547673442325000.ampproject.net
i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net
d-29763453703185417167.ampproject.net
poi.google.com
d-3064948553577027059.ampproject.net
i1-io-0-4-1-20431-1341659986-s.init.cedexis-radar.net
d-2914631797784843280.ampproject.net
i1-j1-18-24-1-11326-1053733564-s.init.cedexis-radar.net

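I want to find the common text shared by similar entries and group on it. For example, abc.google.com, def.google.com and poi.google.com should all map to google.com, and so on.

The required output is: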

google.com
ampproject.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
s.init.cedexis-radar.net

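This is really a data-cleaning exercise, where I strip out the parts I don't need. One approach is to inspect the data manually and hard-code a rule for every possible group, but I will have millions of lines of text. Is there a way, or a package, to do this in Python?

Sorry for asking without having tried anything. I did try to research this, but without much success, and I'm not sure where to start. If anyone could point me toward an approach to take, that would help.

Thanks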

Answer 1

For cleaning, if you are sure of the specific format of the text segments in your dataset, you can use regular expressions.
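As a rough illustration of the regular-expression route, here is a minimal sketch. It assumes that only the first dot-separated label varies, and that when this label ends in "-<letters>" (as in the cedexis-radar lines) those trailing letters should be kept. The pattern and the helper name strip_variable_prefix are hypothetical and tuned to the sample rows in the question, so they would need adjusting for other formats.

```python
import re

# Assumption: only the first dot-separated label varies. If that label ends in
# "-<letters>" (e.g. "...-s"), the trailing letters are kept; otherwise the
# whole label is dropped.
FIRST_LABEL = re.compile(r'^(?:[^.]*-([a-z]+)|[^.]+)\.')


def strip_variable_prefix(host):
    """Return host without its variable first label (hypothetical helper)."""
    def repl(match):
        kept = match.group(1)  # None when the label has no "-<letters>" tail
        return kept + '.' if kept else ''
    return FIRST_LABEL.sub(repl, host, count=1)


samples = [
    'abc.google.com',
    'd-2667808233512566908.ampproject.net',
    'i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net',
]
for host in samples:
    print(strip_variable_prefix(host))
```

If the data lives in a pandas column such as data['text'], the same function could be applied with data['text'].map(strip_variable_prefix). Note that a row which is already a bare domain (for example google.com) would lose its first label under this rule.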

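Another approach is to try matching common patterns. For example, google.com appears in many of the text segments. You can use this information during pre-processing.

Example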

lines = ['abc.google.com',
         'd-2667808233512566908.ampproject.net',
         'd-27973032622323999654.ampproject.net',
         'def.google.com',
         'd-28678547673442325000.ampproject.net',
         'i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net',
         'd-29763453703185417167.ampproject.net',
         'poi.google.com',
         'd-3064948553577027059.ampproject.net',
         'i1-io-0-4-1-20431-1341659986-s.init.cedexis-radar.net',
         'd-2914631797784843280.ampproject.net',
         'i1-j1-18-24-1-11326-1053733564-s.init.cedexis-radar.net']


def commonSubstringFinder(string1, string2):
    common_substring = ""
    split1 = string1.split('.')
    split2 = string2.split('.')
    index1 = len(split1) - 1
    index2 = len(split2) - 1
    size = 0
    while index1 >= 0 and index2 >= 0:
        if split1[index1] == split2[index2]:
            if common_substring:
                common_substring = split1[index1] + '.' + common_substring
            else:
                common_substring += split1[index1]
            size += 1
        else:
            ind1 = len(split1[index1]) - 1
            ind2 = len(split2[index2]) - 1
            if split1[index1][ind1] == split2[index2][ind2]:
                common_substring = '.' + common_substring
            while ind1 >= 0 and ind2 >= 0:
                if split1[index1][ind1] == split2[index2][ind2] and split1[index1][ind1].isalpha():
                    if common_substring:
                        common_substring = split1[index1][ind1] + common_substring
                    else:
                        common_substring += split1[index1][ind1]
                else:
                    break
                ind1 -= 1
                ind2 -= 1

            break
        index1 -= 1
        index2 -= 1

    if size > 1:
        return common_substring
    else:
        return ""

output = []
for line in lines:
    flag = True
    for i in range(len(output)):
        result = commonSubstringFinder(output[i], line)
        if len(result) > 0:
            output[i] = result
            output.append(result)
            flag = False
            break
    if flag:
        output.append(line)

for item in output:
    print(item)

This outputs:

google.com
ampproject.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
google.com
ampproject.net
s.init.cedexis-radar.net
ampproject.net
s.init.cedexis-radar.net
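
The loop above compares each new line against the entries already collected in output, and output gains one entry per line, so the work grows roughly quadratically with the number of lines. For millions of rows, one possible alternative is to count dot-separated suffixes once and then map each line to its longest suffix that is shared with at least one other line. This is only a sketch of a different technique, not the method above: the names suffixes, group_by_shared_suffix, min_labels and min_count are hypothetical, and because it works on whole labels it yields init.cedexis-radar.net rather than s.init.cedexis-radar.net.

```python
from collections import Counter


def suffixes(host, min_labels=2):
    """List the dot-separated suffixes of host, longest first."""
    labels = host.split('.')
    return ['.'.join(labels[i:]) for i in range(len(labels) - min_labels + 1)]


def group_by_shared_suffix(hosts, min_count=2):
    """Map each host to its longest suffix shared by at least min_count distinct hosts."""
    # Count over distinct hosts so that exact duplicates do not form their own group.
    counts = Counter(s for h in set(hosts) for s in suffixes(h))
    result = []
    for h in hosts:
        # suffixes() is ordered longest first, so the first hit is the longest shared one.
        # A host with no shared suffix is kept unchanged.
        shared = next((s for s in suffixes(h) if counts[s] >= min_count), h)
        result.append(shared)
    return result


hosts = [
    'abc.google.com',
    'd-2667808233512566908.ampproject.net',
    'def.google.com',
    'i1-j4-20-1-1-13960-2081004232-s.init.cedexis-radar.net',
    'd-27973032622323999654.ampproject.net',
    'i1-io-0-4-1-20431-1341659986-s.init.cedexis-radar.net',
]
for group in group_by_shared_suffix(hosts):
    print(group)
```

Each line is split and looked up a fixed number of times, so the cost scales with the total number of labels rather than with the number of pairs of lines.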
