Pandas and python: deduplication of dataset by several fields

Question

I have a dataset of companies. Each company has a tax ID number, an address, a phone number and some other fields. Here is the Pandas code I got from Roméo Després:

import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})
print(df)

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
3      D      3       x
4      E      4       y
5      A      5       x
6      B      0       t
7      C      0       z
8      F      6       u
9      E      3       v

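I need to deduplicate the dataset by these fields, meaning that non-unique companies can be linked by just one of these fields. That is, a company is definitely unique in my list only if it has no match on any of the key fields. If a company shares a tax ID with another entity, and that entity shares an address with a third one, then all three entities are the same company. The expected output in terms of unique companies should be: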

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
8      F      6       u

The expected output, along with the index of the unique company for each duplicate, should look like this:

  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     1
2      C      2       z                     2
3      D      3       x                     0
4      E      4       y                     1
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     3

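How can I filter out the duplicates in this case using python/pandas?

The only algorithm that comes to my mind is the following straightforward approach:

I group the dataset by the first key, collecting the other keys as sets in the resulting dataset.
Then I iteratively walk the sets of the second key and add to my grouped dataset the new second-key values found for each value of the first key, iterating over them again and again.
Once there is nothing more to add, I repeat this for the third key.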
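
One way to read the steps above is as iterative label propagation: every row starts with its own index as a label, and each pass replaces a row's label with the smallest label found among the rows that share one of its key values, until nothing changes. The sketch below is an interpretation under that assumption, not the asker's code; the function name `propagate_labels` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})


def propagate_labels(df, keys):
    """Hypothetical helper: repeat 'take the group minimum' until stable."""
    # Each row starts out as its own representative.
    labels = pd.Series(df.index, index=df.index)
    while True:
        new = labels
        for key in keys:
            # Rows sharing a value of `key` all take the smallest label
            # currently present in that group.
            new = new.groupby(df[key]).transform("min")
        if new.equals(labels):
            # A full pass changed nothing, so the labels are final.
            return labels
        labels = new


labels = propagate_labels(df, ["tax_id", "phone", "address"])
print(labels.tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 8, 0]
```

On the sample data this takes three passes (the third only confirms that nothing changes). Because links are followed transitively, every row except row 8 ends up in the group of row 0. The number of passes grows with the length of the longest chain of links, which is the performance concern with this approach.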

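This does not look very promising in terms of performance or simplicity of coding.

Are there any other ways to remove duplicates by one of several keys?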

Answer 1

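You can solve this with the graph analysis library networkx: treat each row as a node, connect rows that share a value in any column, and take the connected components.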

import itertools

import networkx as nx
import pandas as pd


df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

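def iter_edges(df):
    """Yield all relationships between rows."""
    # DataFrame.iteritems() was removed in pandas 2.0; iterate over column names.
    for name in df.columns:
        # .indices maps each distinct value to the positions of the rows holding it.
        for nodes in df.groupby(name).indices.values():
            # Every pair of rows sharing that value becomes an edge.
            yield from itertools.combinations(nodes, 2)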

def iter_representatives(graph):
    """Yield all elements and their representative."""
    for component in nx.connected_components(graph):
        representative = min(component)
        for element in component:
            yield element, representative


graph = nx.Graph()
graph.add_nodes_from(df.index)
graph.add_edges_from(iter_edges(df))

df["representative_index"] = pd.Series(dict(iter_representatives(graph)))

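In the end, df looks like this. Rows 1, 2, 4 and 9 land in group 0 because links are followed transitively: for example, row 1 shares tax_id B with row 6, which shares phone 0 with row 0.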

  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     0
2      C      2       z                     0
3      D      3       x                     0
4      E      4       y                     0
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     0
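
A side note on scaling: `itertools.combinations` emits every pair inside a group, so a value shared by n rows produces n(n-1)/2 edges. Connected components only need each group to be connected, so chaining consecutive rows should be enough. The sketch below is a possible variant, not part of the original answer; `iter_path_edges` is a hypothetical name, and like `iter_edges` it yields row positions, which coincide with index labels only for a default RangeIndex.

```python
import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})


def iter_path_edges(df, columns):
    """Hypothetical variant of iter_edges: link each group as a chain."""
    for name in columns:
        for nodes in df.groupby(name).indices.values():
            nodes = nodes.tolist()  # positions of the rows sharing this value
            # n rows give n - 1 edges instead of n * (n - 1) / 2.
            yield from zip(nodes[:-1], nodes[1:])


edges = list(iter_path_edges(df, ["tax_id", "phone", "address"]))
print(len(edges))  # 11 edges, versus 13 with all pairs
```

The generator could be passed to `graph.add_edges_from` in place of `iter_edges(df)`; the resulting components are the same, since a chain connects a group just as well as a complete set of pairs.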

Note that you can then call df.drop_duplicates("representative_index") to get the unique rows:

  tax_id  phone address  representative_index
0      A      0       x                     0
8      F      6       u                     8
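
If adding networkx as a dependency is not an option, the same grouping can be computed with a small union-find (disjoint-set) structure. This is a minimal sketch of a swapped-in technique, not the answer's method; the names `find` and `representative_index` are hypothetical helpers, and the smaller root is always kept so that each group's representative is its first row.

```python
import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})


def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def representative_index(df, keys):
    """Hypothetical helper: label each row with the first row of its group."""
    parent = list(range(len(df)))  # one singleton set per row position
    for key in keys:
        for positions in df.groupby(key).indices.values():
            positions = positions.tolist()
            first = find(parent, positions[0])
            for pos in positions[1:]:
                root = find(parent, pos)
                # Keep the smaller root so the representative is the
                # smallest row position in the merged set.
                if root < first:
                    parent[first] = root
                    first = root
                else:
                    parent[root] = first
    roots = [find(parent, i) for i in range(len(df))]
    # Translate row positions back into index labels.
    return pd.Series(df.index[roots], index=df.index, name="representative_index")


rep = representative_index(df, ["tax_id", "phone", "address"])
print(rep.tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 8, 0]
```

Assigning `df["representative_index"] = rep` gives the same column as the networkx version on this sample.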
