Restricting feature generation to a particular entity in FeatureTools

I'm trying to understand how to specify primitive_options in FeatureTools (version 0.16) so that only particular entities are included. Based on the docs, I should use include_entities:

List of entities to be included when creating features for the primitive(s). All other entities will be ignored (list[str]).

Simple case

Here is some example code:

import pprint

import featuretools as ft
from featuretools.primitives import GreaterThanScalar

# Load the demo EntitySet (customers, sessions, transactions, products).
esd1 = ft.demo.load_mock_customer(return_entityset=True)

def run_dfs(esd, primitive_options=None):
    # features_only=True returns feature definitions without computing a feature matrix.
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count", GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        primitive_options=primitive_options or {},
        max_depth=4,
        features_only=True
    )
    pprint.pprint(feature_defs)

run_dfs(esd1)

This produces:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions) > 0>,
 <Feature: COUNT(transactions) > 0>]

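Say I'm interested in the session and transaction counts, and in whether the number of sessions is greater than 0. Based on the documentation, I'd reach for include_entities here: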

run_dfs(esd1, primitive_options={
          "greater_than_scalar":{
              "include_entities":['sessions']}
        })

However, the output is:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>]

Both GreaterThanScalar features are now gone. If I use ignore_entities instead, I get:

run_dfs(esd1, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions"],
            }
        })

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions) > 0>]

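So that works, but I'm not sure why ignore_entities gives me the result I need and include_entities doesn't. Am I missing something?

More complex case

While I can get the simple case to work, what I really want is something slightly more complicated. I'd like a boolean feature that tells me whether there were more than zero sessions on a particular device.

Doing this: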

esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)

Yields:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions) > 0>,
 <Feature: COUNT(sessions) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>,
 <Feature: COUNT(sessions WHERE device = desktop) > 0>,
 <Feature: COUNT(sessions WHERE device = tablet) > 0>,
 <Feature: COUNT(sessions WHERE device = mobile) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]

The features I need are the fourth through sixth from the bottom. If I try to restrict dfs to the sessions entity and the device variable:

run_dfs(esd2, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions"],
                "include_variables":{"sessions":["device"]}
            }
        })

The result is:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>]

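There are no GreaterThanScalar features.

Is there a way to get dfs to produce only the three GreaterThanScalar features I want?

Update: a third case

Is there a way to restrict what gets computed under a where clause? For example: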

esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()

run_dfs(esd3, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions","sessions"],
            },
            "count":{
                "ignore_variables":{"transactions":['session_id']}
            }
        })

Gives:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE products.brand = B)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>,
 <Feature: COUNT(transactions WHERE products.brand = A)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>]

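Is it possible to restrict the COUNT(transactions WHERE ...) features to products only? I'd still like to keep the COUNT(sessions ...) features.

Adding 'session_id' from the 'sessions' entity to the include_variables option will generate the features you are looking for: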

primitive_options={
    "greater_than_scalar":{
         "ignore_entities":["transactions"],
         "include_variables":{"sessions":["session_id", "device"]}}}

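The Count primitive uses the index of an entity as its base, along with any where columns. If you include only the where columns in the GreaterThanScalar primitive options, dfs ends up ignoring all the Count features for GreaterThanScalar, since they all use an implicitly ignored column (the entity index). In this case the desired Count features use the 'sessions' entity, so adding the 'sessions' entity index ('session_id') to the include_variables option allows the desired features to be generated.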
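To make that reasoning concrete, here is a toy model in plain Python. It is not featuretools' implementation: the helper `allowed_as_input` and the list-of-pairs representation are hypothetical, and the rule it encodes (a feature is usable only if every variable it is built on is listed) is a simplification of the behaviour described above.

```python
# Toy model only: NOT featuretools code. The helper name and data layout are hypothetical.
def allowed_as_input(base_variables, include_variables):
    """Return True if every (entity, variable) pair a feature is built on
    appears in the include_variables mapping."""
    return all(
        variable in include_variables.get(entity, [])
        for entity, variable in base_variables
    )

# COUNT(sessions WHERE device = mobile) is built on the sessions index
# plus the where column.
count_where_device = [("sessions", "session_id"), ("sessions", "device")]

only_where_column = {"sessions": ["device"]}
index_and_where_column = {"sessions": ["session_id", "device"]}

print(allowed_as_input(count_where_device, only_where_column))       # False
print(allowed_as_input(count_where_device, index_and_where_column))  # True
```

Under this simplified rule, listing only 'device' excludes the Count feature because its index variable is missing, which matches the empty result in the question.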

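Additionally, in the first example with include_entities, the GreaterThanScalar features are lost because the 'customers' entity (the target entity) is not included. The Count features are all aggregation features on the 'customers' entity; they represent the number of things for each customer. To use the Count features, the GreaterThanScalar primitive needs to be allowed to use both the 'customers' entity, where the Count features live, and the entity that the desired Count features are based on ('sessions' in this case).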
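Following that explanation, the include_entities version of the first example would presumably need to list both entities. The options below are a sketch inferred from the reasoning above and have not been run against version 0.16:

```python
# Assumption: listing the target entity ('customers') alongside 'sessions'
# lets GreaterThanScalar build on COUNT(sessions). Untested against 0.16.
primitive_options = {
    "greater_than_scalar": {
        "include_entities": ["customers", "sessions"],
    }
}
```

This dictionary can be passed as the primitive_options argument of the run_dfs helper from the question.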