Restricting feature generation to a particular entity in FeatureTools
I'm trying to understand how to specify primitive_options in FeatureTools (version 0.16) so that only particular entities are included. Based on the docs, I should use include_entities:
List of entities to be included when creating features for the primitive(s). All other entities will be ignored (list[str]).
Simple case
Here's some example code:
import pprint

import featuretools as ft
from featuretools.primitives import GreaterThanScalar

esd1 = ft.demo.load_mock_customer(return_entityset=True)

def run_dfs(esd, primitive_options=None):
    # Print the feature definitions that dfs generates for the "customers" entity.
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count", GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        primitive_options=primitive_options or {},
        max_depth=4,
        features_only=True,
    )
    pprint.pprint(feature_defs)

run_dfs(esd1)
This produces:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions) > 0>]
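Suppose I'm interested in the session and transaction counts, and in whether the session count is greater than 0. Based on the docs, I would reach for include_entities here:
run_dfs(esd1, primitive_options={
    "greater_than_scalar": {
        "include_entities": ["sessions"],
    }
})
However, the output is: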
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>]
Both GreaterThanScalar features are now gone. If I use ignore_entities instead, I get:
run_dfs(esd1, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
    }
})
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>]
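So that works, but I'm not sure why ignore_entities gives the result I need while include_entities doesn't. Am I missing something?
More complicated case
While I can get the simple case to work, what I really want is something slightly more complicated. I'd like a boolean feature that tells me whether there were more than zero sessions on a particular device.
Doing this: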
esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)
yields:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions) > 0>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(sessions WHERE device = desktop) > 0>,
<Feature: COUNT(sessions WHERE device = tablet) > 0>,
<Feature: COUNT(sessions WHERE device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]
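The features I need are the 4th through 6th from the bottom. If I try to restrict dfs to the sessions entity and the device variable:
run_dfs(esd2, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": {"sessions": ["device"]},
    }
})
the result is: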
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>]
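There are no GreaterThanScalar features.
Is there a way to get dfs to give me only the three GreaterThanScalar features I want?
Update: third case
Is there a way to restrict what gets computed under where? For example: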
esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()
run_dfs(esd3, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions", "sessions"],
    },
    "count": {
        "ignore_variables": {"transactions": ["session_id"]},
    }
})
gives:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE products.brand = B)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(transactions WHERE products.brand = A)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>]
Is it possible to restrict the COUNT(transactions WHERE ...) features to products only? I'd still like to keep the COUNT(sessions ...) features.
Adding 'session_id' from the 'sessions' entity to the include_variables option will generate the features you're looking for:
primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": {"sessions": ["session_id", "device"]},
    }
}
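The Count primitive uses the entity's index as its base, along with any where columns. If you include only the where column in the primitive options for GreaterThanScalar, dfs ends up ignoring all of the Count features for GreaterThanScalar, because they all use a column that is implicitly ignored (the entity index). In this case, the desired Count features are built on the 'sessions' entity, so adding the 'sessions' entity index ('session_id') to the include_variables option allows the desired features to be generated.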
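To make the index requirement harder to forget, here is a minimal sketch of a helper that prepends each entity's index column to an include_variables mapping. The name with_index and the entity_indexes argument are hypothetical and not part of FeatureTools; the index name 'session_id' comes from the explanation above.

```python
def with_index(include_variables, entity_indexes):
    """Return a copy of include_variables in which every entity's index
    column is present (hypothetical helper, not a FeatureTools API)."""
    result = {}
    for entity, variables in include_variables.items():
        index = entity_indexes[entity]
        if index in variables:
            result[entity] = list(variables)
        else:
            # Count features are based on the index, so list it first.
            result[entity] = [index] + list(variables)
    return result


options = {
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": with_index(
            {"sessions": ["device"]},
            {"sessions": "session_id"},  # index column named in the answer
        ),
    }
}
print(options)
```

The resulting dictionary matches the primitive_options shown above and can be passed to the question's run_dfs helper as run_dfs(esd2, primitive_options=options).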
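Additionally, in the first example using include_entities, the GreaterThanScalar features are missing because the 'customers' entity (the target entity) was not included. The Count features are all aggregation features in the 'customers' entity; they represent the number of something per customer. To use the Count features, the GreaterThanScalar primitive needs to be allowed to use both the 'customers' entity, where the Count features live, and the entity that the desired Count features are based on (in this case, 'sessions').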
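As a rough sketch of that include_entities variant, the options for the first example could be built as follows. The helper name include_entities_options is hypothetical and only illustrates the rule just described: list the target entity together with the entities the aggregations are based on.

```python
def include_entities_options(primitive_name, target_entity, base_entities):
    """Build a primitive_options dict that whitelists the target entity
    plus the entities the aggregated features are built from
    (hypothetical helper, not a FeatureTools API)."""
    entities = [target_entity]
    for entity in base_entities:
        if entity not in entities:
            entities.append(entity)
    return {primitive_name: {"include_entities": entities}}


options = include_entities_options(
    "greater_than_scalar",  # primitive name used in the question
    "customers",            # target entity, where the Count features live
    ["sessions"],           # entity the desired Count features are based on
)
print(options)
```

Passing this to the question's helper as run_dfs(esd1, primitive_options=options) is the include_entities counterpart of the ignore_entities call that already worked.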