Count missing instances in subgroups
I have a dataframe in Pandas with collected data:
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})
+-------+----------+-----+
| Group | Subgroup | Obs |
+-------+----------+-----+
| A | Blue | 1 |
| A | Blue | 2 |
| A | Blue | 4 |
| A | Red | 1 |
| A | Red | 2 |
| A | Red | 3 |
| A | Red | 4 |
| B | Blue | 1 |
| B | Blue | 2 |
| B | Blue | 3 |
| B | Blue | 6 |
| B | Red | 1 |
| B | Red | 2 |
| B | Red | 3 |
+-------+----------+-----+
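The observations ('Obs') are supposed to be numbered without gaps, but you can see that we 'missed' Blue 3 in group A and Blue 4 and 5 in group B. The desired outcome is the percentage of all 'missed' observations ('Obs') per group, so in the example: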
+-------+--------------------+--------+--------+
| Group | Total Observations | Missed | % |
+-------+--------------------+--------+--------+
| A | 8 | 1 | 12.5% |
| B | 9 | 2 | 22.22% |
+-------+--------------------+--------+--------+
I tried both for loops and groupby, for example:
groups = df.groupby(['Group','Subgroup']).sum()
print(groups.head())
but I can't seem to get it to work in any way I try. Am I going about this the wrong way?
From another answer (shout-out to @Lie Ryan) I found a function that finds missing elements, but I don't quite understand how to apply it here:
from itertools import chain, islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def missing_elements(L):
    # For every consecutive pair (x, y) with a gap, emit the skipped integers
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)
Can anyone point me in the right direction?
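One way to wire a helper like `missing_elements` into pandas is to call it once per (Group, Subgroup) and add up the results per Group. This is only a minimal sketch: it assumes the 'Obs' values are integers and unique within each subgroup, it re-implements the helper compactly with `zip` so the block is self-contained, and the names `missed`, `hit` and `result` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAAAAABBBBBBB'),
    'Subgroup': ['Blue'] * 3 + ['Red'] * 4 + ['Blue'] * 4 + ['Red'] * 3,
    'Obs': [1, 2, 4, 1, 2, 3, 4, 1, 2, 3, 6, 1, 2, 3],
})

def missing_elements(values):
    """Return the integers skipped between consecutive sorted values."""
    ordered = sorted(values)
    return [m for x, y in zip(ordered, ordered[1:]) for m in range(x + 1, y)]

# Count the skipped numbers of every subgroup and accumulate them per Group
missed = {}
for (group, subgroup), sub in df.groupby(['Group', 'Subgroup']):
    gaps = missing_elements(sub['Obs'].tolist())
    missed[group] = missed.get(group, 0) + len(gaps)

hit = df.groupby('Group').size()  # observations actually present per Group
result = pd.DataFrame({'Missed': pd.Series(missed)})
result['Total Observations'] = hit + result['Missed']
result['%'] = result['Missed'] / result['Total Observations'] * 100
print(result)
```

The explicit loop is slower than a vectorised solution on large data, but it keeps the link to the helper function visible.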
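Simple enough: you need groupby here.
- Use groupby + diff to work out how many observations are missing per Group and Subgroup
- Group df on Group, and compute the size and sum of the column calculated in the previous step
- A couple more straightforward steps (calculating the %) give you the desired output.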
f = [   # declare an aggfunc list in advance, we'll need it later
    ('Total Observations', 'size'),
    ('Missed', 'sum')
]
g = df.groupby(['Group', 'Subgroup'])\
      .Obs.diff()\
      .sub(1)\
      .groupby(df.Group)\
      .agg(f)
g['Total Observations'] += g['Missed']
g['%'] = g['Missed'] / g['Total Observations'] * 100
g
Total Observations Missed %
Group
A 8.0 1.0 12.500000
B 9.0 2.0 22.222222
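To see why the chain above works, it can help to look at the intermediate per-row values. The following is a small sketch on the same example data; the column name `gap` and the variable names are introduced here for illustration, and it assumes 'Obs' is sorted in ascending order within each subgroup (otherwise sort first).

```python
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAAAAABBBBBBB'),
    'Subgroup': ['Blue'] * 3 + ['Red'] * 4 + ['Blue'] * 4 + ['Red'] * 3,
    'Obs': [1, 2, 4, 1, 2, 3, 4, 1, 2, 3, 6, 1, 2, 3],
})

# Difference to the previous Obs within the same (Group, Subgroup).
# The first row of each subgroup has no predecessor, so its value is NaN.
step = df.groupby(['Group', 'Subgroup'])['Obs'].diff()

# A step of 1 means no gap; a step of k means k - 1 numbers were skipped.
df_gaps = df.assign(gap=step.sub(1))

# Keep only the rows that follow a gap
rows_after_gap = df_gaps[df_gaps['gap'] > 0]
print(rows_after_gap)
```

Only two rows have a positive gap: A/Blue at Obs 4 (one number skipped) and B/Blue at Obs 6 (two numbers skipped). The NaN rows are ignored by `sum` but still counted by `size`, which is why `size` gives the number of recorded observations.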
A similar approach using groupby, apply and assign:
(
    df.groupby(['Group','Subgroup']).Obs
      # per subgroup: [expected count, missed count]
      .apply(lambda x: [x.max()-x.min()+1, x.max()-x.min()+1-len(x)])
      .apply(pd.Series)
      .groupby(level=0).sum()
      .assign(pct=lambda x: x[1]/x[0]*100)
      # set_axis returns a new object; the inplace argument was removed in pandas 2.0
      .set_axis(['Total Observations', 'Missed', '%'], axis=1)
)
Out[75]:
Total Observations Missed %
Group
A 8 1 12.500000
B 9 2 22.222222
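The same min/max idea can also be written with named aggregation instead of `apply(pd.Series)`. This is a sketch rather than a drop-in replacement: the column names `lo`, `hi`, `seen`, `expected` and `missed` are made up for this example, and it assumes 'Obs' values are unique within each subgroup and that numbering runs from the smallest to the largest value seen.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': list('AAAAAAABBBBBBB'),
    'Subgroup': ['Blue'] * 3 + ['Red'] * 4 + ['Blue'] * 4 + ['Red'] * 3,
    'Obs': [1, 2, 4, 1, 2, 3, 4, 1, 2, 3, 6, 1, 2, 3],
})

# One row per (Group, Subgroup) with the smallest, largest and number of Obs
spans = df.groupby(['Group', 'Subgroup'])['Obs'].agg(lo='min', hi='max', seen='count')

# A gap-free run from lo to hi would contain hi - lo + 1 observations
spans['expected'] = spans['hi'] - spans['lo'] + 1
spans['missed'] = spans['expected'] - spans['seen']

# Roll the subgroups up to their Group and compute the percentage
out = spans.groupby(level='Group')[['expected', 'missed']].sum()
out['%'] = out['missed'] / out['expected'] * 100
print(out)
```

Unlike the diff-based version, this does not depend on the row order, but duplicated 'Obs' values would make `seen` too large and hide gaps.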
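Another option is to enumerate the expected observations and count the ones that are absent:
from collections import Counter
gs = ['Group', 'Subgroup']
# every (Group, Subgroup, Obs) combination that is actually present
old_tups = set(zip(*df.values.T))
missed = pd.Series(Counter(
    g for (g, s), d in df.groupby(gs)
    for o in range(d.Obs.min(), d.Obs.max() + 1)
    if (g, s, o) not in old_tups
), name='Missed')
# count(level=0) was removed in pandas 2.0; group on the index level instead
hit = df.set_index(gs).Obs.groupby(level=0).count()
total = hit.add(missed).rename('Total')
ratio = missed.div(total).rename('%')
pd.concat([total, missed, ratio], axis=1).reset_index()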
Group Total Missed %
0 A 8 1 0.125000
1 B 9 2 0.222222