Using pivot table returns only indexed columns, omits pivoted columns

Question

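I am trying to pivot my measure column so that its values become fields.

That is, net_revenue and vic should each become their own field.

In the image below, the left side is the input and the right side is the desired output:

I know measure has duplicate keys (for example, net_revenue appears more than once), but I am also indexing on date_budget for that block of data. date_budget does repeat, but only when measure changes, so the index columns never contain truly duplicate rows.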
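The claim that the key columns never repeat can be checked directly before pivoting. The following is a minimal sketch, assuming a small hypothetical frame laid out like the one in this question; the numeric values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows mirroring the question's layout; values are made up.
budget = pd.DataFrame({
    'country': ['us'] * 4,
    'customer': ['customer1'] * 4,
    'measure': ['net_revenue', 'net_revenue', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '1/1/2018', '2/1/2018'],
    'monthly_budget_phasing': [55, 23, 29, 35],
})

key_cols = ['country', 'customer', 'date_budget', 'measure']

# measure on its own repeats...
measure_repeats = bool(budget['measure'].duplicated().any())

# ...but the full key combination should not.
has_duplicate_keys = bool(budget.duplicated(subset=key_cols, keep=False).any())

print(measure_repeats, has_duplicate_keys)
```

If `has_duplicate_keys` comes back True on the real data, the pivot needs an aggregation function (or a de-duplication step) to decide which value wins.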

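Problem: In the Pentaho CPython script step, when I look at the script's output, I only get back the index columns, not the pivoted columns net_revenue and vic. Why is that?

Script: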

import pandas as pd

budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'], errors='coerce')

# Perform the pivot.
budget = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure'
    )

budget.reset_index(inplace=True)

result_df = budget

Sample DataFrame:

d = {
    'country': ['us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us'],
    'customer': ['customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2',],
    'measure': ['net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic', 'net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018'],
    'monthly_budget_phasing': ['', '', '', '', '', '', '', '', '', '', '', '']
    }
df = pd.DataFrame(data=d)

This worked in pandas with aggfunc='first', but it does not work in Pentaho. Pentaho still outputs only country, customer, measure.

Pandas terminal output:

   country   customer date_budget      measure monthly_budget_phasing
0       us  customer1    1/1/2018  net_revenue                    
1       us  customer1    2/1/2018  net_revenue                    
2       us  customer1    3/1/2018  net_revenue                    
3       us  customer1    1/1/2018          vic                    
4       us  customer1    2/1/2018          vic                    
5       us  customer1    3/1/2018          vic                    
6       us  customer2    1/1/2018  net_revenue                    
7       us  customer2    2/1/2018  net_revenue                    
8       us  customer2    3/1/2018  net_revenue                    
9       us  customer2    1/1/2018          vic                    
10      us  customer2    2/1/2018          vic                    
11      us  customer2    3/1/2018          vic                    
measure country   customer date_budget net_revenue  vic
0            us  customer1    1/1/2018           
1            us  customer1    2/1/2018           
2            us  customer1    3/1/2018           
3            us  customer2    1/1/2018           
4            us  customer2    2/1/2018           
5            us  customer2    3/1/2018

Even though the Python above works, the Pentaho 8.0 CPython plugin still causes the problem.

First I melt the dates:

Then I un-melt the measures:

Where are my net_revenue and vic fields?

Answer 1

It looks like you need to add replace:

# Strip the dollar sign before converting. It must be escaped, because a bare '$' in a regex is the end-of-string anchor.
budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'].replace(r'\$', '', regex=True), errors='coerce')
# Alternative, if every value is a clean integer after stripping:
#budget['monthly_budget_phasing'] = budget['monthly_budget_phasing'].replace(r'\$', '', regex=True).astype(int)


df = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure',
    aggfunc='first'
    ).reset_index()
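As a self-contained illustration of the cleaning step above, here is a minimal sketch. It assumes the raw values carry a `$` prefix (the question's sample shows them blank, so the strings below are hypothetical) and shows that without the cleaning every value is coerced to NaN.

```python
import pandas as pd

# Hypothetical input: the '$' prefix is an assumption about the raw data.
budget = pd.DataFrame({
    'country': ['us'] * 4,
    'customer': ['customer1'] * 4,
    'measure': ['net_revenue', 'net_revenue', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '1/1/2018', '2/1/2018'],
    'monthly_budget_phasing': ['$55', '$23', '$29', '$35'],
})

# Without cleaning, none of the strings parse, so every value becomes NaN.
uncleaned = pd.to_numeric(budget['monthly_budget_phasing'], errors='coerce')

# Escape the dollar sign so it is matched literally rather than as an anchor.
budget['monthly_budget_phasing'] = pd.to_numeric(
    budget['monthly_budget_phasing'].replace(r'\$', '', regex=True),
    errors='coerce',
)

pivoted = pd.pivot_table(
    budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure',
    aggfunc='first',
).reset_index()

print(pivoted)
```

A values column that is entirely NaN gives the pivot nothing to spread across the measure columns, which is one plausible reason for ending up with index columns only.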

Alternative:

cols = ['country', 'customer', 'date_budget', 'measure']
# If there are duplicate key rows, remove them first.
df = budget.drop_duplicates(cols)
# Pivot by unstacking the last index level (measure).
df = df.set_index(cols)['monthly_budget_phasing'].unstack().reset_index()

print(df)
measure country   customer date_budget  net_revenue  vic
0            us  customer1    1/1/2018           55   29
1            us  customer1    2/1/2018           23   35
2            us  customer1    3/1/2018           42   98
3            us  customer2    1/1/2018           87   90
4            us  customer2    2/1/2018           77   75
5            us  customer2    3/1/2018           34   12
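The reason for dropping duplicates before unstacking can be seen in a small sketch: `unstack` refuses to reshape an index that contains duplicate entries. The frame below is hypothetical and deliberately contains one repeated key row.

```python
import pandas as pd

# Hypothetical rows; the last row repeats the key of the second one.
budget = pd.DataFrame({
    'country': ['us', 'us', 'us'],
    'customer': ['customer1', 'customer1', 'customer1'],
    'measure': ['net_revenue', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '1/1/2018', '1/1/2018'],
    'monthly_budget_phasing': [55, 29, 29],
})

cols = ['country', 'customer', 'date_budget', 'measure']

# unstack raises ValueError when the index has duplicate entries.
try:
    budget.set_index(cols)['monthly_budget_phasing'].unstack()
    raised = False
except ValueError:
    raised = True

# After removing the duplicate key rows, the reshape succeeds.
deduped = budget.drop_duplicates(cols)
wide = deduped.set_index(cols)['monthly_budget_phasing'].unstack().reset_index()

print(raised)
print(wide)
```

`pivot_table` with `aggfunc='first'` sidesteps the error by aggregating duplicates instead, so the two approaches differ mainly in whether duplicates are handled explicitly or implicitly.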

Answer 2

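Kettle needs to know the columns that every step will produce before the transformation runs, which is why I don't think this can be done in Python (select * queries are something of an exception, but they too quietly fetch the metadata before the transformation runs). The usual way to pivot in Kettle is the Row denormalizer step. That step requires you to specify the column names that the pivoted values will go into, but if you can't hard-code them, you can pass them in based on your data through the ETL Metadata Injection step.

To pass the values dynamically, create two transformations. The sub-transformation takes the input data from the parent transformation and performs the pivot with the Row denormalizer. The parent transformation reads the input data, gets the unique values that will become column names, and passes them to the ETL Metadata Injection step. The injection step fills in the Row denormalizer metadata with those column names and executes the sub-transformation, feeding it your input data.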
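If the measure names are known ahead of time, the same "fixed columns" idea can at least be mirrored on the pandas side. The sketch below is not part of the Kettle approach described above; it assumes a hypothetical `EXPECTED_MEASURES` list and only shows how to make the script emit the same columns on every run. Whether the CPython step then exposes those fields depends on how its output fields are configured, which is not verified here.

```python
import pandas as pd

# Assumption: the full set of measure names is known before the script runs.
EXPECTED_MEASURES = ['net_revenue', 'vic']
INDEX_COLS = ['country', 'customer', 'date_budget']

# Hypothetical batch in which 'vic' happens to be absent.
budget = pd.DataFrame({
    'country': ['us', 'us'],
    'customer': ['customer1', 'customer1'],
    'measure': ['net_revenue', 'net_revenue'],
    'date_budget': ['1/1/2018', '2/1/2018'],
    'monthly_budget_phasing': [55.0, 23.0],
})

wide = budget.pivot_table(
    values='monthly_budget_phasing',
    index=INDEX_COLS,
    columns='measure',
    aggfunc='first',
)

# Force the same value columns every run; absent measures become NaN columns.
wide = wide.reindex(columns=EXPECTED_MEASURES).reset_index()
wide.columns.name = None  # drop the leftover 'measure' label on the column axis

result_df = wide[INDEX_COLS + EXPECTED_MEASURES]
print(result_df)
```

With the column list pinned this way, the output schema no longer depends on which measures appear in a given batch.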

python

pentaho

pandas