Loop through unique strings in one column and create a dictionary or array of data from 2 other columns associated with each unique string

Question

I have many CSV files that all share the same data structure, shown below:

y	x	variant_name
82	12	F^W#Bfr18
76	3	F^W#Bfr18
45	18	*BCDS%q3rn
59	14	*BCDS%q3rn
...	...	...

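I am trying to iterate over each file, use the groupby function on the variant_name column, collect the corresponding data from the x and y columns, and generate a scatter plot (with x and y as the axes, matching the column names in this example).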

group = df.groupby('variant_name')

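I think I could use a lambda function to store all the x and y values associated with a particular variant_name, but I'm completely stuck. I hope this makes sense. Let me know if I need to clarify anything. Thanks!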

Answer 1

Use the following code:

df.groupby('variant_name').agg({'x': list, 'y':list})

You get:

                     x         y
variant_name                    
*BCDS%q3rn    [18, 14]  [45, 59]
F^W#Bfr18      [12, 3]  [82, 76]

You can then iterate over the variants and plot them:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
# One scatter series per variant, labelled with the variant name
for t in df.groupby('variant_name').agg({'x': list, 'y': list}).itertuples():
    ax.scatter(t.x, t.y, label=t.Index)
ax.legend()
plt.show()
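
The title also asks for a dictionary or array of the data per variant. A minimal sketch of one way to build that, assuming the sample rows from the question; the name xy_by_variant is only illustrative:

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "y": [82, 76, 45, 59],
    "x": [12, 3, 18, 14],
    "variant_name": ["F^W#Bfr18", "F^W#Bfr18", "*BCDS%q3rn", "*BCDS%q3rn"],
})

# Map each unique variant_name to an (n_points, 2) NumPy array of [x, y] rows.
xy_by_variant = {
    name: group[["x", "y"]].to_numpy()
    for name, group in df.groupby("variant_name")
}

print(xy_by_variant["F^W#Bfr18"])
```

Each value is a plain array, so xy_by_variant[name][:, 0] gives the x values and xy_by_variant[name][:, 1] the y values for that variant.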

Edit

If you want one figure per variant, you can move the figure creation into the body of the for loop:

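alpha = 1  # lower this (e.g. 0.3) when many points overlap
for t in df.groupby('variant_name').agg({'x': list, 'y': list}).itertuples():
    # A separate figure for each variant, named after the variant
    fig, ax = plt.subplots(1, 1, num=t.Index)
    plt.suptitle(t.Index)
    ax.scatter(t.x, t.y, label=t.Index, alpha=alpha)
plt.show()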

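I added an alpha parameter here because, if you have many points, making the markers semi-transparent improves the plot and helps you see the density of your data.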

On the other hand, if you are moving toward more complex plots, I suggest separating the plotting code from the rest of your code:

def _plot_variant(variant_data, alpha=1):
    fig_title = variant_data.variant_name
    fig, ax = plt.subplots(1, 1, num=fig_title)
    plt.suptitle(fig_title)
    ax.scatter(variant_data.x, variant_data.y, alpha=alpha)

df.groupby('variant_name', as_index=False).agg({'x': list, 'y':list}).apply(_plot_variant, axis=1)
plt.show()
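
The question mentions many CSV files with the same structure. A rough sketch of loading them into one DataFrame before grouping, assuming the files are tab-separated as the sample suggests; the helper name load_variants and the source_file column are hypothetical, not part of the original answer:

```python
from pathlib import Path

import pandas as pd


def load_variants(folder, pattern="*.csv", sep="\t"):
    """Read every matching file in `folder` and stack the rows into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        frame = pd.read_csv(path, sep=sep)
        frame["source_file"] = path.name  # remember which file each row came from
        frames.append(frame)
    # pd.concat raises ValueError if no file matched the pattern.
    return pd.concat(frames, ignore_index=True)
```

With something like df = load_variants("data") (the folder name is a placeholder), the groupby calls above work unchanged, and grouping by ['source_file', 'variant_name'] keeps the files apart. Pass sep="," if the files are comma-separated.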

Answer 2

You can use .groupby to get the mean of each variant and plot the result as a scatter plot:

# Keep the raw df intact; store the per-variant means separately
means = df.groupby('variant_name', as_index=False).mean()
means.plot(kind='scatter', x='x', y='y')
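
Note that this approach plots a single point per variant. A small sketch of what the aggregation produces, assuming the sample rows from the question:

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "y": [82, 76, 45, 59],
    "x": [12, 3, 18, 14],
    "variant_name": ["F^W#Bfr18", "F^W#Bfr18", "*BCDS%q3rn", "*BCDS%q3rn"],
})

# One row per variant, holding the mean of its x and y values.
means = df.groupby("variant_name", as_index=False).mean()
print(means)
```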

Alternatively, you can skip the groupby and pass hue to sns.scatterplot:

import seaborn as sns
sns.scatterplot(data=df, x='x', y='y', hue='variant_name')

Answer 3

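Seaborn is probably the right choice here. Splitting the variants into their own plots is straightforward. FacetGrid has many options for controlling the number of rows and columns, among other things.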

import seaborn as sns
g = sns.FacetGrid(df, col='variant_name')
g.map_dataframe(sns.scatterplot, x='x', y='y')
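
As an illustration of those layout options, here is a sketch that wraps the facets after a fixed number of columns; the values col_wrap=2 and height=3 and the sample data are assumptions for the example:

```python
import pandas as pd
import seaborn as sns

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "y": [82, 76, 45, 59],
    "x": [12, 3, 18, 14],
    "variant_name": ["F^W#Bfr18", "F^W#Bfr18", "*BCDS%q3rn", "*BCDS%q3rn"],
})

# col_wrap starts a new row of facets after every 2 variants;
# height sets the size of each facet in inches.
g = sns.FacetGrid(df, col="variant_name", col_wrap=2, height=3)
g.map_dataframe(sns.scatterplot, x="x", y="y")
g.set_titles(col_template="{col_name}")  # title each facet with its variant name
g.set_axis_labels("x", "y")
```

Call plt.show() from matplotlib.pyplot afterwards to display the grid when running outside a notebook.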

python

matplotlib

pandas

data-wrangling