How can we calculate the starting and ending indices of specific strings in a cell array in MATLAB?

Question

Suppose we have this cell array:

strings = {'a'; 'a'; 'a'; 'a'; 'a'; 'a'; 'b'; 'b'; 'b'; 'b'; 'm'; 'm'; 'm'; 'm'};

I would like output like this:

a  1    6
b  7    10
m  11   14

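The numbers show the start and end indices of each unique string. However, this is only an example; my cell array has more than 100 unique strings. What is an efficient way to do this in MATLAB?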

Answer 1

The outputs of unique should give you what you need right away:

strings = {'a'; 'a'; 'a'; 'a'; 'a'; 'a'; 'b'; 'b'; 'b'; 'b'; 'm'; 'm'; 'm'; 'm'};
[uniquestrings, start, bin] = unique(strings);

where:

uniquestrings = 

    'a'    'b'    'm'


start =

     1     7    11


bin =

     1     1     1     1     1     1     2     2     2     2     3     3     3     3

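Although this works for the data provided, I would like to see a more representative, real-world dataset, which might allow the approach to be made more general.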
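
The start vector above only gives the first index of each string. As a minimal sketch, assuming every string occupies a single contiguous run (as in the question's example), the end indices can be derived from where the next run starts. The 'stable' flag and the variable name stop are my additions for illustration, not part of the answer above:

```matlab
strings = {'a'; 'a'; 'a'; 'a'; 'a'; 'a'; 'b'; 'b'; 'b'; 'b'; 'm'; 'm'; 'm'; 'm'};

% 'stable' keeps the unique strings in order of first appearance,
% so start is an increasing column vector of first-occurrence indices
[uniquestrings, start] = unique(strings, 'stable');

% Assumption: each string forms exactly one contiguous run, so a run ends
% right before the next run starts; the last run ends at the final element
stop = [start(2:end) - 1; numel(strings)];
```

If a string can reappear later in the array, this shortcut no longer holds and a run-based or min/max-based approach is safer.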

Answer 2

Start with unique, which maps your data to indices:

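[~, ~, ix] = unique(strings);
d = [];
% End indices: positions where the group index changes, plus the last element
d(:,2) = [find(diff(ix)); numel(ix)]
% Start indices: 1, then one past each preceding end index
d(:,1) = [1; d(1:end-1,2) + 1]
% Strings corresponding to each run
e = strings(d(:,1))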

After the last statement, the results are d = [1 6; 7 10; 11 14] and e = {'a'; 'b'; 'm'}.
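
Because this approach looks at where the group index changes, it reports every run separately, even when the same string comes back later. A small sketch with a hypothetical input (stringsRepeated, runStart, runEnd and runLabel are names I made up for illustration):

```matlab
% Hypothetical input in which 'a' appears in two separate runs
stringsRepeated = {'a'; 'a'; 'b'; 'a'};
[~, ~, ix] = unique(stringsRepeated);

% A run ends wherever the group index changes, and at the last element
runEnd = [find(diff(ix)); numel(ix)];
% A run starts at 1 and right after each preceding run end
runStart = [1; runEnd(1:end-1) + 1];
% Label of each run, taken from its first element
runLabel = stringsRepeated(runStart);
```

Here 'a' is listed twice, once per run. Whether that is what you want depends on whether you need one row per run or one row per unique string.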

Answer 3

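Assuming that your strings are arranged in contiguous runs, and that each run is the only place a particular unique string appears, you can use a combination of unique and accumarray. First, use unique to get a list of all the unique strings and to assign each one a unique ID, from 1 up to the number of unique strings. The catch is that unique, by default, assigns IDs after sorting the strings. You don't want that here, because you want to use the positions of the strings as they are to determine where each run begins and ends. Therefore, you need the 'stable' flag. Take the first output, which gives you the unique strings in the array (used later), and the third output, which gives you the ID assignment: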

strings = {'a'; 'a'; 'a'; 'a'; 'a'; 'a'; 'b'; 'b'; 'b'; 'b'; 'm'; 'm'; 'm'; 'm'};
[s,~,id] = unique(strings, 'stable');

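With that in hand, use accumarray to group the data by ID. Here, the values to group are the position numbers associated with each string, so all position numbers that share the same ID are binned together. For each bin, we output a two-element vector holding the minimum and maximum position of that run, which gives a cell array with one element per unique string: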

out = accumarray(id, (1:numel(strings)).', [], @(x) {[min(x), max(x)]});

You can then display this nicely in a table:

T = table(s, vertcat(out{:}), 'VariableNames', {'Letter', 'BeginEnd'});

We get:

T = 

    Letter    BeginEnd
    ______    ________

    'a'        1     6
    'b'        7    10
    'm'       11    14

However, if you just want the first and last indices in a matrix, simply do:

ind = vertcat(out{:});

The first column gives the starting position of each string, and the second column gives its ending position.
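
To illustrate what the 'stable' flag changes, here is a small sketch on a hypothetical input whose runs are not in alphabetical order (stringsUnsorted and the other variable names are made up for this example):

```matlab
% Hypothetical input whose runs are not in alphabetical order
stringsUnsorted = {'m'; 'm'; 'a'; 'a'; 'a'; 'b'};

% Default behavior: unique strings are sorted, so IDs follow alphabetical order
[sSorted, ~, idSorted] = unique(stringsUnsorted);

% 'stable': unique strings and IDs follow the order of first appearance
[sStable, ~, idStable] = unique(stringsUnsorted, 'stable');

% Same accumarray step as in the answer, applied to the stable IDs
pos = (1:numel(stringsUnsorted)).';
outStable = accumarray(idStable, pos, [], @(x) {[min(x), max(x)]});
indStable = vertcat(outStable{:});
```

With 'stable', the rows of indStable follow the order in which the strings appear in the array, so they line up with sStable from top to bottom.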

Answer 4

Another approach using unique:

strings = {'a'; 'a'; 'a'; 'a'; 'a'; 'a'; 'b'; 'b'; 'b'; 'b'; 'm'; 'm'; 'm'; 'm'};
[u, l] = unique(strings, 'last');
[~, f] = unique(strings, 'first');

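This gives u = {'a'; 'b'; 'm'}, f = [1; 7; 11] and l = [6; 10; 14].

Alternatively, you can concatenate the results into a cell array: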

result = [u num2cell([f l])]

which produces

result = 
    'a'    [ 1]    [ 6]
    'b'    [ 7]    [10]
    'm'    [11]    [14]
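
If you want plain text close to the layout asked for in the question, one possible sketch is to format each row with sprintf. The variable name report and the exact spacing are my own choices, not something from the answer above:

```matlab
strings = {'a'; 'a'; 'a'; 'a'; 'a'; 'a'; 'b'; 'b'; 'b'; 'b'; 'm'; 'm'; 'm'; 'm'};
[u, l] = unique(strings, 'last');   % last occurrence of each unique string
[~, f] = unique(strings, 'first');  % first occurrence of each unique string

% Build one text line per unique string: label, first index, last index
report = cell(numel(u), 1);
for k = 1:numel(u)
    report{k} = sprintf('%s  %d  %d', u{k}, f(k), l(k));
end

% Print one line per unique string
fprintf('%s\n', report{:});
```

Like the answer itself, this treats the first and last occurrence as the start and end, so it assumes each string appears in a single contiguous run.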

