How to create pandas dataframe automatically from nested for loop?

Question

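This is a purely contrived example, but it demonstrates what I need. My current code gets the results I want, but I would like to write a nested for loop that creates the lists/dataframes automatically without hard coding (or anything else that reduces the hard coding).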

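In this case my data has Age Group and Gender columns. I want to create a stacked bar chart with Plotly for each age group, broken down by gender. I am also using pandas to handle the data.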

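The problem I have is that the age groups and genders can change. For example, the current dataset has Age Groups: 20s, 30s, 40s, 50s, 60s, 70s, 80s, 90+, but other age groups (90s, 100s, 110s, etc.) could be added later, so I would have to go back and add them manually.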

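Similarly, the current dataset has Genders: Female, Male, Unspecified, but other categories could be added later. If a new gender category is added, I would have to go back into the code and add it manually.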

import plotly.offline as pyo
import plotly.graph_objs as go
import pandas as pd

# source = "https://data.ontario.ca/dataset/f4112442-bdc8-45d2-be3c-12efae72fb27/resource/455fd63b-603d-4608-8216-7d8647f43350/download/conposcovidloc.csv"
df = pd.read_csv("conposcovidloc.csv")

# Age_Group = ['<20', '20s', '30s', '40s', '50s', '60s', '70s','80s', '90+', 'UNKNOWN']
Age_Group = df["Age_Group"].unique().tolist()


# Client_Gender = df["Client_Gender"].unique().tolist()

count_female = []
count_male = []
count_unspecified = []
count_diverse = []

for age in Age_Group:
    count_female.append(df[(df["Age_Group"]==age) & (df["Client_Gender"]=="FEMALE")]["Age_Group"].count())
    count_male.append(df[(df["Age_Group"]==age) & (df["Client_Gender"]=="MALE")]["Age_Group"].count())
    count_unspecified.append(df[(df["Age_Group"]==age) & (df["Client_Gender"]=="UNSPECIFIED")]["Age_Group"].count())
    count_diverse.append(df[(df["Age_Group"]==age) & (df["Client_Gender"]=="GENDER DIVERSE")]["Age_Group"].count())

trace1 = go.Bar(x=Age_Group, y=count_female, name="Female", marker={"color": "#FFD700"})
trace2 = go.Bar(x=Age_Group, y=count_male, name="Male", marker={"color": "#9EA0A1"})
trace3 = go.Bar(x=Age_Group, y=count_unspecified, name="Unspecified", marker={"color": "#CD7F32"})
trace4 = go.Bar(x=Age_Group, y=count_diverse, name="Gender Diverse", marker={"color": "#000000"})

data = [trace1, trace2, trace3, trace4]
layout = go.Layout(title="Ontario COVID-19 Case Breakdown by Age Group and Gender", barmode="stack")

fig = go.Figure(data=data, layout=layout)
pyo.plot(fig, filename="bar.html")

I was thinking that maybe something like this could work to get a new dataframe:

df2 = []

for age in Age_Group:
    for gender in Client_Gender:
        count_female.append(df[(df["Age_Group"]==age) & (df["Client_Gender"]==gender)]["Age_Group"].count())
        df2.append()

trace = go.Bar(x=Age_Group, y=Client_Gender, name=Client_Gender)

Maybe I'm approaching this completely wrong.

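EDIT: Thanks to @samir-hinojosa's suggestion to use globals(), I'm almost there. Here is my revised code, which is nearly what I need. The output of my for loop looks like it is duplicated multiple times, and I'm not sure why.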

import plotly.offline as pyo
import plotly.graph_objs as go
import pandas as pd

url = "https://data.ontario.ca/dataset/f4112442-bdc8-45d2-be3c-12efae72fb27/resource/455fd63b-603d-4608-8216-7d8647f43350/download/conposcovidloc.csv"
df = pd.read_csv(url)

Age_Group = df["Age_Group"].unique().tolist()
Client_Gender = df["Client_Gender"].unique().tolist()

data = []
for gender in df["Client_Gender"].unique():
    globals()["count_" + gender] = []

for gender in Client_Gender:
    for age in Age_Group:
        globals()["count_" + gender].append(df[(df["Age_Group"]==age) & (df["Client_Gender"]==gender)]["Client_Gender"].count())
        trace = go.Bar(x=Age_Group, y=globals()["count_" + gender], name=gender)
        data.append(trace)

layout = go.Layout(title="Ontario COVID-19 Case Breakdown by Age Group and Gender") # Remove barmode to get nested 

fig = go.Figure(data=data, layout=layout)
pyo.plot(fig, filename="html/bar.html")

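The numbers and the shape of the chart look correct, but the legend shows each gender multiple times, and I don't know how to fix that. There should be only 4 genders in the legend.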

Answer 1

You can use globals(). You can see an example below.

import pandas as pd
url = "https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/openclassrooms_intro2nlp_sentiment_vegetables.csv"
df = pd.read_csv(url)
df.head()

tweet_id    search_keyword  sentiment   text    neg pos
0   1340355010299908096 parsnip 1   @user @user All the best @user you cheeky litt...   0.009569    0.874337
1   1340093851143450624 green beans 1   RT @user @user lamb chops , green beans , maca...   0.001479    0.966661
2   1340089889984012290 eggplant    1   @user I make the best eggplant parmesan 0.002113    0.955990
3   1340053955792035840 yams    0   They candied yams go stupid!    0.918229    0.011744
4   1339085046548897792 spinach 0   @user Cooked spinach. Just kidding that stuff ...   0.871717    0.014765

df["search_keyword"].unique()

array(['parsnip', 'green beans', 'eggplant', 'yams', 'spinach', 'celery',
       'leek', 'carrot', 'tomato', 'chickpea', 'avocado', 'asparagus',
       'mushroom', 'cabbage', 'kale', 'lettuce', 'quinoa', 'potato',
       'onion', 'cucumber', 'rice', 'cauliflower', 'brocolli', 'turnip',
       'lentils', 'pumpkin', 'corn', 'okra', 'radish', 'artichoke',
       'squash', 'garlic', 'endive', 'zuchinni'], dtype=object)

In this case, I am going to create multiple dataframes dynamically, based on the search_keyword list.

for search_keyword in df["search_keyword"].unique():
    globals()["df_" + search_keyword] = df[df["search_keyword"]==search_keyword]

Now you can access each dataframe by its name, which is "df_" followed by one of the values in df["search_keyword"].unique().

df_eggplant.head()

tweet_id    search_keyword  sentiment   text    neg pos
2   1340089889984012290 eggplant    1   @user I make the best eggplant parmesan 0.002113    0.955990
33  1340284341449076736 eggplant    1   Just guys no the later today? only using a bit...   0.009440    0.838516
62  1338954692173258753 eggplant    1   @user Oh wow, lucky eggplant!   0.003778    0.946546
182 1338889204575526919 eggplant    0   RT @user destyal hotfuck 27cm. Fucked. Gand ... 0.885911    0.013338
308 1339045305027686403 eggplant    0   bachelorette BacheloretteABC TheBacheloretteAB...   0.980897    0.002719

In the same way, you can use globals() to access each dataframe. For example:

my_dataframes = ['parsnip', 'mushroom', 'cauliflower']

for dataframe in my_dataframes:
    display(globals()["df_" + dataframe].head(3))


tweet_id    search_keyword  sentiment   text    neg pos
0   1340355010299908096 parsnip 1   @user @user All the best @user you cheeky litt...   0.009569    0.874337
350 1340251679149744129 parsnip 0   @user It is worse than Martin Heidegger. My br...   0.875097    0.011754
541 1340426164188237825 parsnip 1   New burger invention? Cheesy parsnip latkes wi...   0.002752    0.946687

tweet_id    search_keyword  sentiment   text    neg pos
14  1338944115279495168 mushroom    0   @user Trump has never "administered" anything ...   0.913989    0.006175
20  1339156461327437824 mushroom    1   @user You'd probably be more careful than me a...   0.006338    0.960806
35  1340401530864873479 mushroom    1   This Creamy Mushroom Chicken Pasta is so cream...   0.002506    0.980949

tweet_id    search_keyword  sentiment   text    neg pos
39  1339992494025617410 cauliflower 0   @user @user no love for the cauliflower   0.841673    0.011049
63  1340349399529119745 cauliflower 1   Grab yourself a delicious dinner today @user "...   0.001387    0.921891
92  1340344141012750336 cauliflower 1   A comfort food classic, this Cauliflower, Panc...   0.000985    0.968648
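
As a hedged aside, the same per-keyword lookup can also be sketched with a plain dictionary instead of writing names into globals(). The snippet below uses a tiny made-up frame (the rows are invented, not the real tweet data) and the hypothetical name frames_by_keyword. It assumes only that pandas is installed.

```python
import pandas as pd

# Toy stand-in for the tweet data; these rows are invented for illustration.
toy = pd.DataFrame({
    "search_keyword": ["parsnip", "green beans", "parsnip", "eggplant"],
    "sentiment": [1, 1, 0, 1],
})

# groupby yields (key, sub-frame) pairs, so a dict comprehension
# collects one dataframe per keyword without touching globals().
frames_by_keyword = {
    keyword: group for keyword, group in toy.groupby("search_keyword")
}

# Keys containing spaces, such as "green beans", work as ordinary dict keys.
for keyword in ["parsnip", "green beans"]:
    print(keyword, len(frames_by_keyword[keyword]))
```

A dictionary keeps the generated frames in one place, so they can be listed or looped over without knowing the variable names in advance.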

Answer 2

Based on your requirements, I think you are looking for the following:

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

url = "https://data.ontario.ca/dataset/f4112442-bdc8-45d2-be3c-12efae72fb27/resource/455fd63b-603d-4608-8216-7d8647f43350/download/conposcovidloc.csv"
df = pd.read_csv(url)

df_temp = df[["Row_ID", "Age_Group", "Client_Gender"]].groupby(["Age_Group", "Client_Gender"]).count().reset_index()
df_temp.columns = ["Age group", "Client gender", "Value"]

fig, ax1 = plt.subplots(figsize=(10, 5))
plot = sns.barplot(x="Age group", y="Value", hue="Client gender", data=df_temp, ax=ax1)
plt.title("Comparison of age group and client genders", size=20)
plt.legend(bbox_to_anchor=(1.004, 1), borderaxespad=0, title="Client gender")
plt.tight_layout()
plt.xlabel("Age group", size=12)
plt.ylabel("Value", size=12)  # the y-axis shows the counts, not the gender
#plt.savefig("img/comparison.png")
sns.despine(fig)

If you want to verify:

df[(df["Client_Gender"]=="FEMALE") & (df["Age_Group"]=="20s")].shape
(112076, 18)

df[(df["Client_Gender"]=="MALE") & (df["Age_Group"]=="20s")].shape
(106093, 18)

df[(df["Client_Gender"]=="FEMALE") & (df["Age_Group"]=="50s")].shape
(70978, 18)

df[(df["Client_Gender"]=="MALE") & (df["Age_Group"]=="50s")].shape
(64816, 18)

As you can see, the values look right.
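
For the Plotly chart the question asks about, the same counts can be sketched with pd.crosstab, building one bar trace per gender so that each gender appears only once in the legend. This is a minimal sketch on invented rows, not the Ontario data. The traces are written as plain dicts (which go.Figure accepts) so that the snippet runs with pandas alone, and the names toy, counts and traces are illustrative.

```python
import pandas as pd

# Invented rows standing in for the Ontario CSV (illustration only).
toy = pd.DataFrame({
    "Age_Group": ["20s", "20s", "30s", "30s", "30s", "40s"],
    "Client_Gender": ["FEMALE", "MALE", "FEMALE", "FEMALE", "UNSPECIFIED", "MALE"],
})

# Rows are age groups, columns are genders, cells are case counts.
# Age groups or genders that appear later in the data become
# new rows or columns automatically, with no hard-coded lists.
counts = pd.crosstab(toy["Age_Group"], toy["Client_Gender"])

# One trace per gender column, created outside any per-age loop.
traces = [
    {
        "type": "bar",
        "x": counts.index.tolist(),
        "y": counts[gender].tolist(),
        "name": gender.title(),
    }
    for gender in counts.columns
]

print(counts)

# With plotly installed, the traces could then be drawn as a stacked chart:
# import plotly.graph_objs as go
# fig = go.Figure(data=traces, layout=go.Layout(barmode="stack"))
```

In the edited code from the question, the trace is created and appended inside the inner age loop, which would add one trace per age/gender pair. That is a likely reason for the repeated legend entries.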

python

dataframe

pandas

plotly