Mean of columns with same label

I have two vectors:

data vector: A = [1 2 2 1 2 6; 2 3 2 3 3 5]
label vector: B = [1 2 1 2 3 NaN]

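I want to take the mean of all columns that share the same label and output the means as a matrix sorted by label number, ignoring NaN. So, in this example I would want: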

labelmean(A,B) = [1.5 1.5 2; 2 3 3]

This can be done with a for loop like this:

function out = labelmean(data,label)
out=[];
for i=unique(label)
    if isnan(i); continue; end
    out = [out, mean(data(:,label==i),2)];
end 

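However, I am working with huge arrays containing many data points and labels. In addition, this snippet will be executed often. I am wondering whether there is a more efficient way to do this without looping over every individual label.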

This would be a good case for accumarray. Think of accumarray as a miniature MapReduce paradigm: there are keys and values, and the job of accumarray is to group all of the values that share the same key so that you can do something with them. In your case, the keys are the elements of B, and the values are the column locations in A that correspond to those elements. Basically, the position of each value in B tells you which column of A it labels. Therefore, we simply need to gather all of the column locations that map to the same label, access those columns of A, then take the mean across them. We need to be careful to ignore labels that are NaN; we can filter these out before calling accumarray. The "something" that you do in accumarray traditionally should output a single number, but here we are outputting a column vector for each label. Therefore, a trick is to wrap the output in a cell array, then use cat combined with a comma-separated list to convert the output into a matrix.

Therefore, something like this should work:

% Sample data
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Find non-NaN locations
mask = ~isnan(B);

% Generate column locations that are not NaN as well as the labels
ind = 1 : numel(B);
Bf = B(mask).';
ind = ind(mask).';

% Find label-wise means
C = accumarray(Bf, ind, [], @(x) {mean(A(:,x), 2)});

% Convert to numeric matrix
out = cat(2, C{:});

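If you don't like using temporary variables to find the non-NaN values, we can do this in fewer lines of code, but you still need the vector of column indices to determine where to sample from: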

% Sample data
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Solution
ind = 1 : numel(B);
C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
out = cat(2, C{:});

With your data, we get:

>> out

out =

    1.5000    1.5000    2.0000
    2.0000    3.0000    3.0000
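
To make the grouping step concrete, here is a minimal sketch that shows which column indices accumarray hands to the anonymous function for each label, using the sample data from the question. The variable name groups is my own, and the indices in each group are sorted explicitly because accumarray does not guarantee their order in general.

```matlab
% Sample data from the question
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

% Keep only the columns whose label is not NaN
mask = ~isnan(B);
ind = find(mask).';   % column indices into A
Bf = B(mask).';       % matching labels

% Collect the column indices per label instead of averaging them
groups = accumarray(Bf, ind, [], @(x) {sort(x).'});

% groups{1} is [1 3], groups{2} is [2 4], groups{3} is 5
celldisp(groups)

% Averaging A over each group gives the label-wise means
out = cell2mat(cellfun(@(x) mean(A(:,x), 2), groups.', 'UniformOutput', false));
```

Seeing the groups first can help when debugging: if a label is paired with the wrong columns, the problem is in the index vector rather than in the mean.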

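Here's one way:

  1. Get the indices of the labels that are not NaN.
  2. Create a sparse matrix of zeros and ones that, when A is multiplied by it, gives the sum of the columns sharing each label.
  3. Divide that matrix by the sum of each of its columns, so that the sums become means.
  4. Apply the matrix multiplication to get the result, and convert it to a full matrix.

Code: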

I = find(~isnan(B));                                 % step 1
t = sparse(I, B(I), 1, size(A,2), max(B(I)));        % step 2
t = bsxfun(@rdivide, t, sum(t,1));                   % step 3
result = full(A*t);                                  % step 4
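
For the sample data from the question, the intermediate matrices can be inspected directly. This is only an illustrative sketch: the names t0 and counts are my own, and the code keeps bsxfun to match the snippet above.

```matlab
A = [1 2 2 1 2 6; 2 3 2 3 3 5];
B = [1 2 1 2 3 NaN];

I = find(~isnan(B));                            % columns with a valid label
t0 = sparse(I, B(I), 1, size(A,2), max(B(I)));  % t0(j,k) = 1 if column j has label k

full(t0)
% ans =
%      1     0     0
%      0     1     0
%      1     0     0
%      0     1     0
%      0     0     1
%      0     0     0

counts = full(sum(t0, 1));                      % [2 2 1]: number of columns per label
t = bsxfun(@rdivide, t0, sum(t0, 1));           % each column of t now sums to 1
result = full(A*t);                             % [1.5 1.5 2; 2 3 3]
```

One thing to watch for, as far as I can tell: a label between 1 and max(B) that never occurs has a count of zero, so its column in the result comes out as NaN rather than being dropped.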

This answer is not a new method but a benchmark of the given answers, because when you talk about performance you always have to benchmark.

clear all;
% I tried to make a real-life dataset (the original author may provide a
% better one)
A = [1:3e4; 1:10:3e5; 1:100:3e6]; % large dataset
B = repmat(1:1e3, 1, 3e1); % large number of labels

labelmean(A,B);
labelmeanLuisMendoA(A,B);
labelmeanLuisMendoB(A,B);
labelmeanRayryeng(A,B);

function out = labelmean(data,label)
    tic
    out=[];
    for i=unique(label)
        if isnan(i); continue; end
        out = [out, mean(data(:,label==i),2)];
    end
    toc
end

function out = labelmeanLuisMendoA(A,B)
    tic
    I = find(~isnan(B)); % columns of A whose label is not NaN
    t = full(sparse(I,B(I),1,size(A,2),max(B(I)))); % template matrix
    out = A*t; % sum of columns that share a label
    out = bsxfun(@rdivide, out, sum(t,1)); % convert sum into mean
    toc
end

function out = labelmeanLuisMendoB(A,B)
    tic
    I = find(~isnan(B));                                 % step 1
    t = sparse(I, B(I), 1, size(A,2), max(B(I)));        % step 2
    t = bsxfun(@rdivide, t, sum(t,1));                   % step 3
    out = full(A*t);                                  % step 4
    toc
end

function out = labelmeanRayryeng(A,B)
    tic
    ind = 1 : numel(B);
    C = accumarray(B(~isnan(B)).', ind(~isnan(B)).', [], @(x) {mean(A(:,x), 2)});
    out = cat(2, C{:});
    toc
end

The output is:

Elapsed time is 0.080415 seconds. % original
Elapsed time is 0.088427 seconds. % LuisMendo original answer
Elapsed time is 0.004223 seconds. % LuisMendo optimised version
Elapsed time is 0.037347 seconds. % rayryeng answer

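For this dataset, the optimised LuisMendo version is the clear winner, while his first version is slower than the original loop.

=> Don't forget to benchmark your code's performance!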
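
A single tic/toc measurement can be noisy, since it includes any first-call overhead. As a hedged alternative, MATLAB's timeit runs a function handle several times and reports a typical run time. The sketch below is self-contained and times only two of the approaches; meanLoop and meanSparse are hypothetical names for copies of the loop and sparse versions with the tic and toc lines removed. It assumes a release that allows local functions in scripts (R2016b or later).

```matlab
A = [1:3e4; 1:10:3e5; 1:100:3e6];   % same dataset as in the benchmark
B = repmat(1:1e3, 1, 3e1);

% timeit calls each handle repeatedly and returns a typical time in seconds
tLoop   = timeit(@() meanLoop(A, B));
tSparse = timeit(@() meanSparse(A, B));
fprintf('loop:   %.6f s\nsparse: %.6f s\n', tLoop, tSparse);

% Hypothetical helper: the original loop without tic/toc
function out = meanLoop(data, label)
    out = [];
    for i = unique(label)
        if isnan(i); continue; end
        out = [out, mean(data(:,label==i),2)];
    end
end

% Hypothetical helper: the sparse-matrix version without tic/toc
function out = meanSparse(A, B)
    I = find(~isnan(B));
    t = sparse(I, B(I), 1, size(A,2), max(B(I)));
    t = bsxfun(@rdivide, t, sum(t,1));
    out = full(A*t);
end
```

No timings are shown here because they depend on the machine and MATLAB release.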

EDIT: Test platform specs

  • Matlab R2016b
  • Ubuntu 64-bit
  • 15.6 GiB RAM
  • Intel® Core™ i7-5600U CPU @ 2.60GHz × 4