Pandas and Python: deduplicating a dataset by several fields

I have a dataset of companies. Each company has a tax ID, an address, a phone number and some other fields. Here is some Pandas code that I took from Roméo Després:

import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})
print(df)

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
3      D      3       x
4      E      4       y
5      A      5       x
6      B      0       t
7      C      0       z
8      F      6       u
9      E      3       v

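I need to deduplicate the dataset by these fields, where two records count as the same company if they are linked by any one of these fields. In other words, a company is definitely unique in my list only if it has no match on any of the key fields. If a company shares a tax ID with another entity, and that entity shares an address with a third one, then all three are the same company. The expected output of unique companies should be: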

  tax_id  phone address
0      A      0       x
1      B      1       y
2      C      2       z
8      F      6       u

The expected output, with the index of the unique company for each duplicate, should look like this:

  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     1
2      C      2       z                     2
3      D      3       x                     0
4      E      4       y                     1
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     3

How can I filter out the duplicates with python/pandas in this case?

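The only algorithm that comes to my mind is the following direct approach:

  1. I group the dataset by the first key, collecting the other keys as sets in the resulting dataset.
  2. Then I walk through the sets of the second key and, for each value of the first key, add any new values of the second key to my grouped dataset, iterating over them again and again.
  3. When there is nothing left to add, I repeat this for the third key.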
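
One way to make these steps concrete is to propagate the smallest row label through each key until nothing changes. This is only a sketch of the idea, assuming a default integer index and no missing values in the key columns; `propagate_labels` is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})


def propagate_labels(df, keys):
    """Give every row the smallest index reachable through shared key values."""
    # Start with each row labelled by its own index.
    labels = pd.Series(df.index, index=df.index)
    while True:
        previous = labels.copy()
        for key in keys:
            # Rows sharing a value of `key` all take the smallest label among them.
            labels = labels.groupby(df[key]).transform("min")
        # Stop once a full pass over all keys changes nothing.
        if labels.equals(previous):
            return labels


labels = propagate_labels(df, ["tax_id", "phone", "address"])
print(labels.tolist())
```

Each pass can only lower labels, so the loop terminates, but the number of passes depends on how long the chains of linked rows are.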

This does not look very promising in terms of performance or simplicity of coding.

Are there any other ways to remove duplicates that match on any one of several keys?

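You could solve this using the graph analysis library networkx: treat each row as a node, connect rows that share a value in any column, and take each connected component as one company.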

import itertools

import networkx as nx
import pandas as pd


df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})

def iter_edges(df):
    """Yield all relationships between rows."""
    for name in df.columns:
        for nodes in df.groupby(name).indices.values():
            yield from itertools.combinations(nodes, 2)

def iter_representatives(graph):
    """Yield all elements and their representative."""
    for component in nx.connected_components(graph):
        representative = min(component)
        for element in component:
            yield element, representative


graph = nx.Graph()
graph.add_nodes_from(df.index)
graph.add_edges_from(iter_edges(df))

df["representative_index"] = pd.Series(dict(iter_representatives(graph)))

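In the end, df looks like this. Every row except row 8 is linked, directly or transitively, to row 0 (for example, row 1 shares tax_id B with row 6, which shares phone 0 with row 0):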

  tax_id  phone address  representative_index
0      A      0       x                     0
1      B      1       y                     0
2      C      2       z                     0
3      D      3       x                     0
4      E      4       y                     0
5      A      5       x                     0
6      B      0       t                     0
7      C      0       z                     0
8      F      6       u                     8
9      E      3       v                     0

Note that you can then call df.drop_duplicates("representative_index") to obtain the unique rows:

  tax_id  phone address  representative_index
0      A      0       x                     0
8      F      6       u                     8
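
If adding networkx as a dependency is not an option, the same connected components can be found with a small union-find (disjoint-set) structure. The sketch below is an alternative to the code above, not part of the original answer; `find` and `representative_positions` are hypothetical helper names, and it assumes there are no missing values in the key columns. Linking each row in a group only to the group's current root also avoids generating every pair of rows.

```python
import pandas as pd

df = pd.DataFrame({
    "tax_id": ["A", "B", "C", "D", "E", "A", "B", "C", "F", "E"],
    "phone": [0, 1, 2, 3, 4, 5, 0, 0, 6, 3],
    "address": ["x", "y", "z", "x", "y", "x", "t", "z", "u", "v"],
})


def find(parent, i):
    """Return the root of i, compressing the path along the way."""
    root = i
    while parent[root] != root:
        root = parent[root]
    while parent[i] != root:
        parent[i], i = root, parent[i]
    return root


def representative_positions(df, keys):
    """Map each row position to the smallest position in its component."""
    parent = list(range(len(df)))
    for key in keys:
        # .indices maps each distinct value to the positions of its rows.
        for positions in df.groupby(key).indices.values():
            positions = positions.tolist()
            first = find(parent, positions[0])
            for other in positions[1:]:
                root = find(parent, other)
                # Attach the larger root to the smaller one, so the root
                # of a component is always its smallest position.
                if root < first:
                    parent[first] = root
                    first = root
                else:
                    parent[root] = first
    return [find(parent, i) for i in range(len(df))]


reps = representative_positions(df, ["tax_id", "phone", "address"])
# Translate positions back to index labels, so a non-default index also works.
df["representative_index"] = df.index.to_numpy()[reps]
print(df)
```

As before, df.drop_duplicates("representative_index") then keeps one row per company.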