How to find the total length of a column value that has multiple values in different rows for another column, by location

Question

This is part 2 of a previous question.

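Is there a way to find the IDs that have both Apple and Strawberry and get their total count, and likewise the IDs that have only Apple and the IDs that have only Strawberry, all broken down by location?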

df:

        ID           Fruit        Location
0       ABC          Apple        NY            <-ABC has Apple and Strawberry
1       ABC          Strawberry   NY            <-ABC has Apple and Strawberry
2       EFG          Apple        LA            <-EFG has Apple only
3       XYZ          Apple        HOUSTON       <-XYZ has Apple and Strawberry
4       XYZ          Strawberry   HOUSTON       <-XYZ has Apple and Strawberry 
5       CDF          Strawberry   BOSTON        <-CDF has Strawberry
6       AAA          Apple        CHICAGO       <-AAA has Apple only

Desired output:

IDs that has Apple and Strawberry:
NY       1
HOUSTON  1
IDs that has Apple only:
LA       1
CHICAGO  1
IDs that has Strawberry only:
BOSTON   1

The previous code was:

v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print(out)
>>> 2

I tried the following, but it did not work; the result is the same:

v = ['Apple','Strawberry']
out = df.groupby('ID', 'LOCATION')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print(out)
>>> 2

Thanks!

Answer 1

A not very efficient solution using groupby and agg (with a lambda):

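# One row per ID: its fruits as a sorted tuple (so row order does not matter) and its location
x = df.groupby('ID').agg({'Fruit': lambda s: tuple(sorted(set(s))), 'Location': 'first'})
# For each fruit combination, count how many IDs fall in each location
y = x.groupby('Fruit')['Location'].value_counts()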

y:

Fruit                Location
(Apple,)             CHICAGO     1
                     LA          1
(Apple, Strawberry)  HOUSTON     1
                     NY          1
(Strawberry,)        BOSTON      1
Name: Location, dtype: int64
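
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample frame from the question. It is a variation, not the exact code above: it joins the sorted fruits into a plain string label instead of a tuple, and counts with size(). The names per_id, counts and Fruits are my own, chosen only for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
    'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple',
              'Strawberry', 'Strawberry', 'Apple'],
    'Location': ['NY', 'NY', 'LA', 'HOUSTON', 'HOUSTON', 'BOSTON', 'CHICAGO'],
})

# One row per ID: a string label such as 'Apple and Strawberry',
# plus the ID's location (assumes each ID has a single location)
per_id = df.groupby('ID').agg(
    Fruits=('Fruit', lambda s: ' and '.join(sorted(set(s)))),
    Location=('Location', 'first'),
)

# Number of IDs for every (fruit combination, location) pair
counts = per_id.groupby(['Fruits', 'Location']).size()
print(counts)
```

Because the labels are plain strings, a single combination can be selected with counts.loc['Apple and Strawberry'], which avoids indexing with tuple-valued labels.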

for index in set(y.index.get_level_values(0)):
    if len(index)==2:
        print(f"IDs that has {index[0]} and {index[1]}:")
        print(y.loc[index].to_string())
    else:
        print(f"IDs that has {index[0]} only:")
        print(y.loc[index].to_string())

IDs that has Apple only:
Location
CHICAGO    1
LA         1
IDs that has Apple and Strawberry:
Location
HOUSTON    1
NY         1
IDs that has Strawberry only:
Location
BOSTON    1
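
As a side note on the attempt in the question: groupby expects several keys as a list, and the column is named 'Location', not 'LOCATION'. Also, calling .sum() right away collapses everything into one number, which is why the location breakdown was lost. A hedged sketch of how the original set comparison could be kept and then counted per location (has_both and out are names I made up):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
    'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple',
              'Strawberry', 'Strawberry', 'Apple'],
    'Location': ['NY', 'NY', 'LA', 'HOUSTON', 'HOUSTON', 'BOSTON', 'CHICAGO'],
})

v = {'Apple', 'Strawberry'}

# True for every (ID, Location) pair whose fruits are exactly the set v
has_both = df.groupby(['ID', 'Location'])['Fruit'].apply(lambda s: set(s) == v)

# Keep only the True pairs, then count them per location
out = has_both[has_both].groupby(level='Location').size()
print(out)
```

Changing v to {'Apple'} or {'Strawberry'} gives the "only" cases in the same way.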

Tags: python, lambda, apply, dataframe, pandas