Why does my association model find subgroups in a dataset when there shouldn't be any?
I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I am working on a project whose goal is to detect subgroups in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining, as I am currently taking a class on the subject.
I have 42 variables in total. Of these, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide a group into.
def Freedman_Diaconis(column_values):
    # sort the list first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    # float exponent: in Python 2, (-1/3) is integer division and evaluates to -1
    n_to_pow = len(column_values[1]) ** (-1.0 / 3)
    h = 2 * iqr * n_to_pow  # Freedman-Diaconis bin width
    # number of bins = data range / bin width (index 0 is the minimum after sorting)
    retval = (column_values[1][-1] - column_values[1][0]) / h
    test = int(retval + 1)
    return test
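As a sanity check, here is a rough NumPy equivalent of the same rule that I wrote for comparison (just a sketch, not part of my pipeline; numpy.percentile replaces my index-based quartile lookup):

import numpy as np

def fd_bin_count(values):
    # Freedman-Diaconis: bin width h = 2 * IQR * n^(-1/3), bins = range / h
    values = np.asarray(values, dtype=float)
    iqr = np.percentile(values, 75) - np.percentile(values, 25)
    h = 2 * iqr * len(values) ** (-1.0 / 3)
    return int((values.max() - values.min()) / h) + 1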
From there, I used min-max normalization
from sklearn import preprocessing

def min_max_transform(column_of_data, num_bins):
    # scale the raw values onto [1, num_bins] so truncating gives a bin label
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    # note: the older scikit-learn I used accepts a 1-D list here; newer
    # versions require a 2-D array such as np.array(values).reshape(-1, 1)
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    # truncate each scaled value to an integer bin label
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
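To make the binning concrete, here is a toy run of the two steps above; the column tuple (name, values) mirrors how my columns are stored, and the values are made up:

# hypothetical column: my functions only read index 1 of the tuple
toy_column = ('x1', [0.5, 1.2, 3.3, 4.8, 2.1, 0.9, 4.0, 2.7])
num_bins = Freedman_Diaconis(toy_column)  # note: sorts the values in place
# caveat: as noted above, newer scikit-learn needs a 2-D array here
labels = min_max_transform(toy_column, num_bins)
print(labels)  # a list of integer bin labels between 1 and num_bins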
I then also wrote a function to combine this value with the variable name.
def string_transform(prefix, column, index):
    # turn each bin label into an item string tagged with its column
    transformed_list = []
    transformed = ""
    if index < 4:
        # the first four columns keep the prefix as given
        for entry in column[1]:
            transformed = prefix + str(entry)
            transformed_list.append(transformed)
    else:
        # for the x-columns, move the column number in front: x14, 1 -> 14x1
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1]) + 'x' + str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to distinguish variables that have the same value but appear in different columns. For example, a value of 1 for variable x14 does not mean the same thing as a value of 1 for variable x20. The string transform function would create 14x1 and 20x1 for the example just mentioned.
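For instance, with a hypothetical column of bin labels:

print(string_transform('x14', ('x14', [1, 2, 1]), 14))
# -> ['14x1', '14x2', '14x1']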
After this, I wrote everything out to a file in basket format
import csv
import os

def create_basket(list_of_lists, headers):
    if not os.path.exists('baskets'):
        os.makedirs('baskets')
    down_length = len(list_of_lists[0])
    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        # one row per patient; column k of list_of_lists holds the k-th variable
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
Then I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    # only report rules whose right-hand side mentions the treatment
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found many association rules in my testing data.
THIS IS WHERE I HAVE A PROBLEM
Looking through the notes that came with the training data, there is this comment:
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is this:
why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?
I am getting lift numbers above 2, as opposed to the 1 you would expect if everything were as random as the notes state.
Supp Conf Rule
0.3 0.7 6x0 -> trt1
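For reference, lift is confidence divided by the baseline frequency of the consequent: lift(A -> B) = conf(A -> B) / supp(B). Here is a sketch of how lift can be recomputed from the raw transactions (my own helper, not part of the Orange API):

def lift(transactions, antecedent, consequent):
    # transactions: list of item sets; antecedent/consequent: sets of items
    n = float(len(transactions))
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    confidence = both / float(ante)
    return confidence / (cons / n)

# toy example: conf(6x0 -> trt1) = 2/3, supp(trt1) = 3/4, lift = 8/9
print(lift([{'6x0', 'trt1'}, {'6x0'}, {'trt1'}, {'6x0', 'trt1'}],
           {'6x0'}, {'trt1'}))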
Even though my code runs, I am not getting results anywhere close to what I should be expecting, which leads me to believe I messed something up. I am just not sure what it is.
After some research, I realized that my sample size is too small for the number of variables I have: I would need a much larger sample to really use this method. In fact, the approach I am trying to use was developed under the assumption that it runs on databases with hundreds of thousands or millions of rows.
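A quick simulation illustrates the point: with purely random data and a small sample, scanning 40 columns will regularly turn up some item that co-occurs with a treatment far above its baseline rate just by chance (the sizes and the 1.5 threshold below are arbitrary choices of mine):

import random

random.seed(1)
n_rows, n_cols, trials = 30, 40, 200
spurious = 0
for _ in range(trials):
    trt = [random.choice(['trt0', 'trt1']) for _ in range(n_rows)]
    best_lift = 0.0
    for _ in range(n_cols):
        # each column is independent random noise, like the notes describe
        x = [random.choice([0, 1]) for _ in range(n_rows)]
        both = sum(1 for i in range(n_rows) if x[i] == 1 and trt[i] == 'trt1')
        ante = sum(x)
        cons = trt.count('trt1')
        if ante and cons:
            lift_val = (both / float(ante)) / (cons / float(n_rows))
            best_lift = max(best_lift, lift_val)
    if best_lift > 1.5:
        spurious += 1
# a sizeable fraction of trials produce a "rule" with lift well above 1
print("%d of %d trials had a spurious rule with lift > 1.5" % (spurious, trials))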