Select subset of one column, then compare to another

Question

I have a csv file in pyspark with a large amount of sales information: units, store ID, total sales, customer loyalty, product number, and so on.

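I need to compare the sales of customers who are in the loyalty program with the sales of customers who are not. Every customer in the loyalty program is identified by a positive integer in the "collector_key" field, while customers outside the program have a negative integer, as shown below: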

>>> df.head(10)
  collector_key  sales
0             -1  42.72
1             -1  27.57
2   139517343969  62.44
3             -1   0.00
4             -1   0.00
5             -1   7.32
6             -1  64.51
7             -1   0.00
8   134466064080  20.72
9             -1   0.00

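At first I thought I might use an if/else statement to sort users into loyalty and non-loyalty lists. But then it occurred to me that it might be more efficient to filter the loyalty customers into their own dataframe, do the same for the non-loyalty customers, and then subtract one result from the other. I thought perhaps I could apply a regex to the "collector_key" column: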

non_loy_cust = test_1.filter(regex='^(-?\d+)\s+')

But I'm not sure how to keep the "sales" column, since "regex" and "items" are mutually exclusive.

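On top of that, I need to sum the sales column so that I end up with a single number each for loyalty and non-loyalty customers, but I think (once I get past the previous hurdle) that can be done with something like: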

loy_sales = df.groupby('sales').max()
non_loy_sales = df2.groupby('sales').max()

Or is there perhaps a simpler third option that I'm overlooking?

Answer 1

I think you're looking for .transform()

# set the group first: 1 = loyalty customer (positive key), 0 = non-loyalty
# kept as a separate Series so the original collector_key values are preserved
loyal = (df['collector_key'] > 0).astype(int)


# loyalty (1) vs non-loyalty (0) sales
df.groupby(loyal)['sales'].sum()

collector_key
0    142.12
1     83.16

# add a column holding the max sale within each group
df['max_sales'] = df.groupby(loyal)['sales'].transform('max')

    collector_key   sales   max_sales
0        -1         42.72   64.51
1        -1         27.57   64.51
2    139517343969   62.44   62.44
3        -1         0.00    64.51
4        -1         0.00    64.51
5        -1         7.32    64.51
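
For what it's worth, the "filter into two dataframes, then subtract" idea from the question can also be done directly with a boolean mask. In pandas, `DataFrame.filter(regex=...)` matches axis labels (column or index names), not cell values, so it cannot select rows by the contents of "collector_key". The sketch below assumes the data is a pandas DataFrame shaped like the `df.head(10)` output in the question; the names `is_loyal`, `loy_cust`, and `difference` are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question's df.head(10) output.
df = pd.DataFrame({
    "collector_key": [-1, -1, 139517343969, -1, -1, -1, -1, -1, 134466064080, -1],
    "sales": [42.72, 27.57, 62.44, 0.00, 0.00, 7.32, 64.51, 0.00, 20.72, 0.00],
})

# Boolean mask: True for loyalty customers (positive key), False otherwise.
is_loyal = df["collector_key"] > 0

# Indexing with the mask filters rows and keeps every column, including sales.
loy_cust = df[is_loyal]
non_loy_cust = df[~is_loyal]

# One total per group, plus the difference between the two.
loy_sales = loy_cust["sales"].sum()
non_loy_sales = non_loy_cust["sales"].sum()
difference = loy_sales - non_loy_sales

print(round(loy_sales, 2), round(non_loy_sales, 2), round(difference, 2))
# prints: 83.16 142.12 -58.96
```

The totals match the groupby result above; the mask version is mainly useful if you also want the two filtered dataframes for further work.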


apache-spark

pyspark

spark-dataframe