Counting items in pandas column of dictionaries

Question

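I have a dataframe in which one column contains dictionaries. I want to count the occurrences of the dictionary keys across the whole column.

One way of doing this is as follows: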

import pandas as pd
from collections import Counter

df = pd.DataFrame({"data": [{"weight": 3, "color": "blue"},
{"size": 5, "weight": 2},{"size": 3, "color": "red"}]})

c = Counter()

for index, row in df.iterrows():
  for item in list(row["data"].keys()):
    c[item] += 1

print(c)

This gives

Counter({'weight': 2, 'color': 2, 'size': 2})

Is there a faster way to do this?

Answer 1

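A much faster approach is to flatten the column with itertools.chain and build a Counter from the result (iterating over the dictionaries yields only their keys):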

from itertools import chain

Counter(chain.from_iterable(df.data.values.tolist()))
# Counter({'weight': 2, 'color': 2, 'size': 2})
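
To see why this works: iterating over a dictionary yields its keys, so chaining the dictionaries together produces one flat stream of keys for Counter to tally. Below is a minimal sketch using a plain list in place of the column; the names `rows`, `flat_keys` and `key_counts` are only for illustration and are not part of the answer.

```python
from collections import Counter
from itertools import chain

# Stand-in for df.data.values.tolist(): a plain list of dictionaries.
rows = [
    {"weight": 3, "color": "blue"},
    {"size": 5, "weight": 2},
    {"size": 3, "color": "red"},
]

# Iterating a dict yields its keys, so chaining the dicts flattens
# the column into a single sequence of keys.
flat_keys = list(chain.from_iterable(rows))
print(flat_keys)

# Counter then tallies how often each key appears.
key_counts = Counter(flat_keys)
print(key_counts)
```

The values stored in the dictionaries never enter the count; only the keys do.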

Timings:

def OP(df):
    c = Counter()
    for index, row in df.iterrows():
        for item in list(row["data"].keys()):
            c[item] += 1
    return c

%timeit OP(df)
# 570 µs ± 49.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit Counter(chain.from_iterable(df.data.values.tolist()))
# 14.2 µs ± 902 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Answer 2

We can use

pd.DataFrame(df.data.tolist()).notnull().sum().to_dict()
Out[653]: {'color': 2, 'size': 2, 'weight': 2}
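
One caveat that may be worth checking on your own data: this approach counts non-null values rather than keys, so a key whose value is None or NaN would not be counted. Here is a small sketch with made-up data; the None value and the names `df_null`, `value_counts` and `key_counts` are assumptions for illustration and do not come from the question.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical data: the first dict has a key whose value is None.
df_null = pd.DataFrame({"data": [{"weight": None, "color": "blue"},
                                 {"size": 5, "weight": 2}]})

# Expands the dicts into columns, then counts non-null values per column.
value_counts = pd.DataFrame(df_null["data"].tolist()).notnull().sum().to_dict()

# Counts the keys directly, regardless of the values they map to.
key_counts = dict(Counter(chain.from_iterable(df_null["data"])))

print(value_counts)
print(key_counts)
```

With the question's data no value is null, so both ways of counting agree there.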

Answer 3

First of all, creating an empty Counter seems unnecessary to me. If you give Counter a list, it does the counting for you. That is its main purpose, I would say.

I would do:

from functools import reduce
c = reduce(lambda x, y : x+y, [Counter(x.keys()) for x in df['data']])

and c is:

Counter({'color': 2, 'size': 2, 'weight': 2})

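To explain what the line above does: first, it uses a list comprehension to create a list of Counter objects. It iterates over the column and builds a Counter from the keys of each dictionary.
The reduce function then adds these counters together, since Counter supports addition.

On my machine, with the provided input, this approach is roughly 4 times faster than the OP's method.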
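
The two steps can be seen separately in the sketch below, which uses a plain list standing in for df['data']. The names `rows`, `per_row`, `first_two` and `total` are illustrative only.

```python
from collections import Counter
from functools import reduce

# Stand-in for the column: a plain list of dictionaries.
rows = [
    {"weight": 3, "color": "blue"},
    {"size": 5, "weight": 2},
    {"size": 3, "color": "red"},
]

# Step 1: one Counter per dictionary, each key counted once.
per_row = [Counter(d.keys()) for d in rows]
print(per_row[0])

# Step 2: adding two Counters sums their counts key by key.
first_two = per_row[0] + per_row[1]
print(first_two)

# reduce applies that addition across the whole list.
total = reduce(lambda x, y: x + y, per_row)
print(total)
```

Each addition builds a new Counter, so this is a sketch of how the one-liner works, not a claim about how it performs on larger inputs.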

Answer 4

Sticking with pandas functionality:

In [171]: df['data'].apply(pd.Series).count(axis=0).to_dict()                                                   
Out[171]: {'weight': 2, 'color': 2, 'size': 2}

python

counter

pandas