Convert text file of multiple types to matrix

I am working with the iris data set, which looks like this...

5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
...

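As you can see, the data contains mixed types: the first four fields are floats and the last one is a string. Because of that I can't use dlmread; when I try, I get an error.

I tried using fscanf, but my solution doesn't give me what I want...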

filename = "train.txt"
A = fopen(filename, 'r')
data = fscanf(A, '%f %f %f %f %s')

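This gives data as a 1x1 array.

What I want is to convert the data into a matrix in which I can access values by row and column, so that data(1,1) would be 5.4. I'm not very familiar with I/O in Octave, so any help is much appreciated.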

Regular expressions can be very helpful in problems like this. They allow you to search for a particular pattern or patterns. For example, using regexp you can find all instances of a pattern in your data file and read them into an array, with out = regexp(str, expression, 'match'). Depending on how you set up the program, the matches will likely come back as a 1xn cell array of strings. But if you know the number of columns in each row, you can easily convert that to a matrix with something like vec2mat.
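As a rough sketch of the regexp approach, the snippet below pulls the numbers and the labels out of a string and arranges the numbers into rows. The sample string, the two patterns, and the variable names are assumptions made for illustration. It also swaps in reshape for vec2mat, since vec2mat, as far as I know, needs the Communications Toolbox in MATLAB or the communications package in Octave.

```matlab
% Hypothetical sample text standing in for the contents of the data file
str = sprintf('5.4,3.7,1.5,0.2,Iris-setosa\n4.8,3.4,1.6,0.2,Iris-setosa\n');

% Match every decimal number; the result is a 1xN cell array of strings
num_tokens = regexp(str, '\d+\.\d+', 'match');

% Match every label (assumes all labels start with "Iris-")
x_str = regexp(str, 'Iris-\w+', 'match').';

% Convert the matched strings to doubles, then arrange them in rows.
% reshape fills column by column, so build a 4xM matrix and transpose it.
ncols = 4;                          % known number of numeric columns per row
vals = str2double(num_tokens);
x_num = reshape(vals, ncols, []).';
```

Here x_num(1,1) picks out the first number of the first row, and x_str{1} the matching label. The number pattern is deliberately simple and would miss integers, negative values, or exponent notation.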

The following worked for me in Matlab R2017a and Octave 4.2.1. See the textscan documentation for details.

fid = fopen('filename.txt');
x = textscan(fid, '%f,%f,%f,%f,%s');
fclose(fid);
x_num = [x{1:4}];
x_str = x{5};

This gives

x_num =
   5.400000000000000   3.700000000000000   1.500000000000000   0.200000000000000
   4.800000000000000   3.400000000000000   1.600000000000000   0.200000000000000
   4.800000000000000   3.000000000000000   1.400000000000000   0.100000000000000
   4.300000000000000   3.000000000000000   1.100000000000000   0.100000000000000
   5.800000000000000   4.000000000000000   1.200000000000000   0.200000000000000

x_str =
  5×1 cell array
    'Iris-setosa'
    'Iris-setosa'
    'Iris-setosa'
    'Iris-setosa'
    'Iris-setosa'

You can easily achieve this by using the textscan function with the CollectOutput parameter set to true:

Logical indicator determining data concatenation, specified as the comma-separated pair consisting of 'CollectOutput' and either true or false. If true, then the importing function concatenates consecutive output cells of the same fundamental MATLAB® class into a single array.

Example:

filename = 'train.txt';
fid = fopen(filename, 'r');
data = textscan(fid,'%f%f%f%f%s','CollectOutput',true,'Delimiter',',');
fclose(fid);

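The data variable is returned as a cell array in which the file contents are grouped by underlying type. The first cell contains the numeric values, while the second cell contains the string values. You can retrieve them separately as follows: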

numerics = data{1};
texts = data{2};
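To tie this back to the row-and-column access asked for in the question, here is a minimal end-to-end sketch. It writes two sample rows to a temporary file (a hypothetical stand-in for train.txt), reads them back with the same textscan call, and indexes the result. It assumes your MATLAB or Octave version supports the CollectOutput option.

```matlab
% Write two sample rows to a temporary file (stand-in for train.txt)
tmpfile = [tempname() '.txt'];
fid = fopen(tmpfile, 'w');
fprintf(fid, '5.4,3.7,1.5,0.2,Iris-setosa\n4.8,3.4,1.6,0.2,Iris-setosa\n');
fclose(fid);

% Read it back: four numeric columns followed by one string column
fid = fopen(tmpfile, 'r');
data = textscan(fid, '%f%f%f%f%s', 'CollectOutput', true, 'Delimiter', ',');
fclose(fid);
delete(tmpfile);

numerics = data{1};             % N-by-4 double matrix
texts = data{2};                % N-by-1 cell array of strings

first_value = numerics(1, 1);   % row 1, column 1 of the numeric block
first_label = texts{1};         % label belonging to row 1
```

The numeric block is an ordinary matrix, so numerics(1,1) is the 5.4 from the first row. The labels stay in a cell array and are indexed with curly braces.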