How to get observation frequency counts from multiple datasets into one table?
I have a bunch of large datasets.
DS_1 (contains all unique IDs and names):
ID Name
1 Apple
2 Banana
3 Cherry
DS_2:
ID Observation
1 Apple detail
1 Apple detail
1 Apple detail
2 Banana detail
2 Banana detail
3 Cherry detail
3 Cherry detail
3 Cherry detail
DS_3:
ID Observation
2 Banana detail
2 Banana detail
3 Cherry detail
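I would like to create a new dataset that shows the frequency counts from each dataset (and ultimately calculates Total_Obs). The output would look something like this: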
ID Name DS_2 DS_3 Total_Obs
1 Apple 3 0 3
2 Banana 2 2 4
3 Cherry 3 1 4
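The datasets are fairly large. Is there a more efficient way to do this than concatenating the datasets and running a frequency table? Or do I have to create a bunch of sorted frequency tables and then merge them all by ID?
You can do it as below: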
Select t1.id As id
,t1.name As name
,coalesce(DS_2_obs,0) as DS_2_obs
,coalesce(DS_3_obs,0) as DS_3_obs
,coalesce(DS_2_obs,0) + coalesce(DS_3_obs,0) As Total_Obs
from DS_1 t1
left join (Select id, count(1) as DS_2_obs from DS_2 group by id) t2
on t1.id = t2.id
left join (Select id, count(1) as DS_3_obs from DS_3 group by id) t3
on t1.id = t3.id;
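Also, you should always tag the database you are using.
If the SQL above takes a long time, then instead of using t2 and t3 as inline queries, consider creating aggregated observation tables that hold the frequency counts and indexing them on ID. That way, when you join the observation aggregates to the main table, the join can run faster because it can use the index.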
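The pre-aggregation idea can be sketched as follows. This is a minimal, runnable illustration and not a tuned solution: the thread does not name a database, so Python's built-in sqlite3 is assumed here as a stand-in, and the table names DS_2_counts and DS_3_counts and both index names are hypothetical. The data is the small sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real database

conn.executescript("""
-- Sample data from the question.
CREATE TABLE DS_1 (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE DS_2 (id INTEGER, observation TEXT);
CREATE TABLE DS_3 (id INTEGER, observation TEXT);
INSERT INTO DS_1 VALUES (1, 'Apple'), (2, 'Banana'), (3, 'Cherry');
INSERT INTO DS_2 VALUES
    (1, 'Apple detail'), (1, 'Apple detail'), (1, 'Apple detail'),
    (2, 'Banana detail'), (2, 'Banana detail'),
    (3, 'Cherry detail'), (3, 'Cherry detail'), (3, 'Cherry detail');
INSERT INTO DS_3 VALUES
    (2, 'Banana detail'), (2, 'Banana detail'), (3, 'Cherry detail');

-- Aggregate each detail table once (hypothetical table names).
CREATE TABLE DS_2_counts AS
    SELECT id, COUNT(*) AS DS_2_obs FROM DS_2 GROUP BY id;
CREATE TABLE DS_3_counts AS
    SELECT id, COUNT(*) AS DS_3_obs FROM DS_3 GROUP BY id;

-- Index the aggregates on the join key (hypothetical index names).
CREATE INDEX idx_ds2_counts_id ON DS_2_counts (id);
CREATE INDEX idx_ds3_counts_id ON DS_3_counts (id);
""")

# Join the small, indexed aggregates to the master table.
rows = conn.execute("""
SELECT t1.id,
       t1.name,
       COALESCE(t2.DS_2_obs, 0) AS DS_2_obs,
       COALESCE(t3.DS_3_obs, 0) AS DS_3_obs,
       COALESCE(t2.DS_2_obs, 0) + COALESCE(t3.DS_3_obs, 0) AS Total_Obs
FROM DS_1 t1
LEFT JOIN DS_2_counts t2 ON t1.id = t2.id
LEFT JOIN DS_3_counts t3 ON t1.id = t3.id
ORDER BY t1.id
""").fetchall()

for row in rows:
    print(row)
```

Whether this actually beats the inline subqueries depends on the database and its optimizer, so it is worth comparing the query plans on real data.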
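You can use a view to stack the source datasets. The view does not consume a large amount of disk space. The stack can then be used by various procedures or steps to produce an output report or an actual transposed summary of the counts.
Example: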
* generate some sample data, a master table and 30 detail tables;
data master;
  do id = 1 to 200; * ids 1 to 200;
    length name $24; /* the length is missing in the source, $24 is an assumed value */
    name = repeat (cats(id,'_'),5);
    output;
  end;
run;
data _null_;
  call streaminit(123);
  retain id 0;
  declare hash h(multidata:'yes');
  h.defineKey('id');
  h.defineData('id');
  h.defineDone();
  do index = 1 to 30;
    do i = 1 to rand('integer', 100); * up to 100 ids;
      id = rand('integer', 200); * random id;
      do j = 1 to rand('integer', 20); * replicated up to 20 times;
        h.add();
      end;
    end;
    h.output (dataset:cats('child_',index));
    h.clear();
  end;
run;
Stacking view
data assemble_v / view=assemble_v;
  if (_n_ = 1) then do;
    * load the id-to-name lookup once;
    declare hash names(dataset:'master');
    length id 8 name $24; /* the name length is missing in the source, $24 is an assumed value */
    names.defineKey('id');
    names.defineData('name');
    names.defineDone();
  end;
  set child_1-child_30 indsname=_source;
  source = _source; * keep the name of the contributing dataset;
  if names.find() ne 0 then name = '*missing*';
  retain unity 1;
run;
Output representation of counts
options missing = '0';
proc tabulate data=assemble_v out=partitions(index=(order=(id name)));
title "TABULATE: ID Counts for detail tables";
class id name;
class source / order=data;
table id*name, n=''*(source=' ' all='Total') / nocellmerge;
run;
options missing = '0';
proc report data=assemble_v;
title "REPORT: ID Counts for detail tables";
columns id name source,unity ('Total' unity=Total);
define id / group ;
define name / group;
define source / ' ' across order=data;
define unity / ' ' ;
define total / ' ' ;
run;
Data transformation of counts
proc freq data=assemble_v noprint ;
table id*name*source / out=combos(keep=id name source count);
run;
proc transpose data=combos out=wide(drop=_name_ _label_);
by id name;
var count;
id source;
run;
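For comparison, the same stack-then-count idea can be sketched in SQL against the question's small sample. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for the unnamed database, the view name stacked_v is hypothetical, and the per-source columns are written out by hand rather than transposed dynamically as PROC TRANSPOSE does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real database

conn.executescript("""
-- Sample data from the question.
CREATE TABLE DS_1 (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE DS_2 (id INTEGER, observation TEXT);
CREATE TABLE DS_3 (id INTEGER, observation TEXT);
INSERT INTO DS_1 VALUES (1, 'Apple'), (2, 'Banana'), (3, 'Cherry');
INSERT INTO DS_2 VALUES
    (1, 'Apple detail'), (1, 'Apple detail'), (1, 'Apple detail'),
    (2, 'Banana detail'), (2, 'Banana detail'),
    (3, 'Cherry detail'), (3, 'Cherry detail'), (3, 'Cherry detail');
INSERT INTO DS_3 VALUES
    (2, 'Banana detail'), (2, 'Banana detail'), (3, 'Cherry detail');

-- The view stacks the detail tables and tags each row with its source.
-- It stores no rows of its own (hypothetical view name).
CREATE VIEW stacked_v AS
    SELECT id, 'DS_2' AS source FROM DS_2
    UNION ALL
    SELECT id, 'DS_3' AS source FROM DS_3;
""")

# Count per source with conditional aggregation; the LEFT JOIN keeps ids
# that have no observations, and their counts come out as 0.
wide = conn.execute("""
SELECT m.id,
       m.name,
       SUM(CASE WHEN s.source = 'DS_2' THEN 1 ELSE 0 END) AS DS_2,
       SUM(CASE WHEN s.source = 'DS_3' THEN 1 ELSE 0 END) AS DS_3,
       COUNT(s.id) AS Total_Obs
FROM DS_1 m
LEFT JOIN stacked_v s ON s.id = m.id
GROUP BY m.id, m.name
ORDER BY m.id
""").fetchall()

for row in wide:
    print(row)
```

With many detail tables the CASE columns would have to be generated, which is where the transpose step in the SAS example is more convenient.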