How to get observation frequency counts from multiple datasets into one table?

I have a bunch of large datasets.

DS_1 (contains all unique IDs and names):

ID   Name
1    Apple
2    Banana
3    Cherry

DS_2:

ID   Observation
1    Apple detail
1    Apple detail
1    Apple detail
2    Banana detail
2    Banana detail
3    Cherry detail
3    Cherry detail
3    Cherry detail

DS_3:

ID   Observation
2    Banana detail
2    Banana detail
3    Cherry detail

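I want to create a new dataset that shows the frequency counts from each dataset (and eventually computes Total_Obs). The output would look something like this: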

ID   Name      DS_2    DS_3   Total_Obs
1    Apple     3       0      3
2    Banana    2       2      4
3    Cherry    3       1      4

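The datasets are fairly large. Is there a more efficient way to do this than concatenating the datasets and running a frequency table? Or do I have to create a bunch of sorted frequency tables and then merge them all by ID?

You can do the following: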

Select t1.id As id
      ,t1.name As name
      ,coalesce(DS_2_obs,0) as DS_2_obs
      ,coalesce(DS_3_obs,0) as DS_3_obs
      ,coalesce(DS_2_obs,0) + coalesce(DS_3_obs,0) As Total_Obs
from DS_1 t1
left join (Select id, count(1) as DS_2_obs from DS_2 group by id) t2
on t1.id = t2.id
left join (Select id, count(1) as DS_3_obs from DS_3 group by id) t3
on t1.id = t3.id;

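Also, you should always tag the database you are using.

If the SQL above takes a long time, then instead of using t2 and t3 as inline queries, consider creating aggregated observation tables that hold the frequency counts, and index them on ID. That way, when you join the observation aggregates to the main table, the join can use the index and run faster.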
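
The pre-aggregation idea can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module to make the SQL runnable end to end, the sample rows are the ones from the question, and the names DS_2_agg, DS_3_agg, ix_ds_2_agg_id and ix_ds_3_agg_id are hypothetical. The exact syntax for creating tables and indexes, and whether the index actually helps, depends on the database you are using.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Sample data copied from the question.
CREATE TABLE DS_1 (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE DS_2 (id INTEGER, observation TEXT);
CREATE TABLE DS_3 (id INTEGER, observation TEXT);
INSERT INTO DS_1 VALUES (1,'Apple'),(2,'Banana'),(3,'Cherry');
INSERT INTO DS_2 VALUES (1,'Apple detail'),(1,'Apple detail'),(1,'Apple detail'),
                        (2,'Banana detail'),(2,'Banana detail'),
                        (3,'Cherry detail'),(3,'Cherry detail'),(3,'Cherry detail');
INSERT INTO DS_3 VALUES (2,'Banana detail'),(2,'Banana detail'),(3,'Cherry detail');

-- Aggregate each detail table once (table names are hypothetical).
CREATE TABLE DS_2_agg AS SELECT id, COUNT(*) AS DS_2_obs FROM DS_2 GROUP BY id;
CREATE TABLE DS_3_agg AS SELECT id, COUNT(*) AS DS_3_obs FROM DS_3 GROUP BY id;

-- Index the aggregates on the join key.
CREATE UNIQUE INDEX ix_ds_2_agg_id ON DS_2_agg (id);
CREATE UNIQUE INDEX ix_ds_3_agg_id ON DS_3_agg (id);
""")

# Join the small, indexed aggregates to the master table.
rows = cur.execute("""
SELECT t1.id,
       t1.name,
       COALESCE(t2.DS_2_obs, 0) AS DS_2_obs,
       COALESCE(t3.DS_3_obs, 0) AS DS_3_obs,
       COALESCE(t2.DS_2_obs, 0) + COALESCE(t3.DS_3_obs, 0) AS Total_Obs
FROM DS_1 t1
LEFT JOIN DS_2_agg t2 ON t1.id = t2.id
LEFT JOIN DS_3_agg t3 ON t1.id = t3.id
ORDER BY t1.id
""").fetchall()

for row in rows:
    print(row)
```

Each detail table is scanned once to build its aggregate, so later joins only touch one row per ID instead of every observation.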

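You can use a view to stack the source datasets. The view will not consume much disk space. The stack can then be used by various procedures or steps to produce an output report, or an actual transposed summary of the counts.

Example: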

* generate some sample data, a master table and 30 detail tables;

data master;
  do id = 1 to 200;  * ids 1 to 200;
    length name $ 32;  * assumed length, longest generated name is 24 characters;
    name = repeat (cats(id,'_'),5);
    output;
  end;
run;

data _null_;

  call streaminit(123);

  retain id 0;

  declare hash h(multidata:'yes');
  h.defineKey('id');
  h.defineData('id');
  h.defineDone();

  do index = 1 to 30;
    do i = 1 to rand('integer', 100);  * up to 100 ids;

      id = rand('integer', 200);  * random id;
      do j = 1 to rand('integer', 20); * replicated up to 20 times;
        h.add();
      end;
    end;

    h.output (dataset:cats('child_',index));
    h.clear();

  end;
run;

Stacking view

data assemble_v / view=assemble_v;
  if (_n_ = 1) then do;
    declare hash names(dataset:'master');
    length id 8 name $ 32;  * assumed length, should match name in master;
    names.defineKey('id');
    names.defineData('name');
    names.defineDone();
  end;

  set child_1-child_30 indsname=_source;

  source = _source;
  if names.find() ne 0 then name = '*missing*';

  retain unity 1;
run;

Output representation of the counts


options missing = '0';
proc tabulate data=assemble_v out=partitions(index=(order=(id name)));
  title "TABULATE: ID Counts for detail tables";
  class id name;
  class source / order=data;
  table id*name, n=''*(source=' ' all='Total') / nocellmerge;
run;

options missing = '0';

proc report data=assemble_v;
  title "REPORT: ID Counts for detail tables";
  columns id name source,unity ('Total' unity=Total);
  define id / group ;
  define name / group;
  define source / ' ' across order=data;
  define unity / ' ' ;
  define total / ' ' ;
run;



Data transformation of the counts


proc freq data=assemble_v noprint ;
  table id*name*source / out=combos(keep=id name source count);
run;

proc transpose data=combos out=wide(drop=_name_ _label_);
  by id name;
  var count;
  id source;
run;