Parse an array of strings

Question

I have a one-dimensional array of strings (Array{String,1}) that describes a matrix of floats (see below). I need to parse this matrix. Any slick suggestions?

Julia 1.5
macOS

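Yes, I did read this from a file. I don't want to use CSV to read the whole file, because I want to keep the option of reading the entire file with memory-mapped I/O, which I don't think CSV offers. I also have some complicated lines that mix strings and numbers, and others that are strings only, which I need to parse as well; that rules out DelimitedFiles. The columns are separated by two spaces.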

julia> lines[24+member_total:idx-1]
49-element Array{String,1}:
 "0.0000000E+00  0.0000000E+00  0.0000000E+00  1.3308000E+01"
 "0.0000000E+00  0.0000000E+00  1.9987500E-01  1.3308000E+01"
 "0.0000000E+00  0.0000000E+00  1.1998650E+00  1.3308000E+01"
 "0.0000000E+00  0.0000000E+00  2.1998550E+00  1.3308000E+01"
 "0.0000000E+00  0.0000000E+00  3.1998450E+00  1.3308000E+01"
 "0.0000000E+00  0.0000000E+00  4.1998350E+00  1.3308000E+01"
 ⋮
 "0.0000000E+00  0.0000000E+00  5.9699895E+01  1.4000000E-01"
 "0.0000000E+00  0.0000000E+00  6.0199890E+01  1.0100000E-01"
 "0.0000000E+00  0.0000000E+00  6.0699885E+01  6.2000000E-02"
 "0.0000000E+00  0.0000000E+00  6.1199880E+01  2.3000000E-02"
 "0.0000000E+00  0.0000000E+00  6.1500000E+01  0.0000000E+00"

Answer 1

I solved this. It's not the slickest thing, but it does work...

function rmspaces(line)
    line = replace(line, "\t" => " ")
    # println("line: ", line)
    while occursin("  ", line)
        line = replace(line, "  "=>" ")
        # println("line: ", line)
    end

    return line
end

function readmatrix(lines, numcolumns::Int64; type=Float64)
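    # Collapse runs of whitespace to a single space (note: this modifies `lines` in place)
    for i=1:length(lines)
        lines[i] = rmspaces(lines[i])
    end

    # Allocate with the requested element type so that `type` is honored in the result
    matrix = zeros(type, length(lines), numcolumns)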

    for i=1:length(lines)
        idx = 1 # set the initial stop at the beginning
        spot = 1
        for j=1:length(lines[i])
            if lines[i][j]==' ' && j>1 #Stops at spaces
                number = parse(type,lines[i][idx:j]) #from the last stop to this one
                idx = j #Set this stop in memory
                matrix[i,spot] = number
                spot += 1
            end
        end
        if spot<numcolumns+1 #If there isn't a space after the last number,
            #we need to attach the last number in every row. If the last number
            #was appended, then the spot will be increased to be more than the number
            #of columns.
            number = parse(type, lines[i][idx:end])
            matrix[i,spot] = number
        end
    end
    return matrix
end

Answer 2

strs = ["0.0000000E+00  0.0000000E+00  0.0000000E+00  1.3308000E+01",
        "0.0000000E+00  0.0000000E+00  1.9987500E-01  1.3308000E+01",
        "0.0000000E+00  0.0000000E+00  1.1998650E+00  1.3308000E+01"]

mapreduce(vcat, strs) do s
    (parse.(Float64, split(s, "  ")))'
end

3×4 Array{Float64,2}:
 0.0  0.0  0.0       13.308
 0.0  0.0  0.199875  13.308
 0.0  0.0  1.19986   13.308
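
A variant of the same idea, as a sketch rather than a definitive implementation: calling `split(s)` with no delimiter splits on any run of whitespace, so it should also tolerate tabs or single spaces, and `reduce(hcat, ...)` builds the matrix in one pass instead of repeated `vcat` calls. The name `parse_rows` and its `T` keyword are made up for illustration, and the sketch assumes every line has the same number of fields.

```julia
# Hypothetical helper (not from the answers above): parse whitespace-separated
# numeric rows into a matrix. Assumes all rows have the same number of fields.
function parse_rows(strs::AbstractVector{<:AbstractString}; T::Type=Float64)
    # split(s) with no delimiter splits on runs of whitespace and drops empty fields
    cols = [parse.(T, split(s)) for s in strs]
    # hcat places each parsed row in a column; permutedims turns columns back into rows
    return permutedims(reduce(hcat, cols))
end

strs = ["0.0000000E+00  0.0000000E+00  0.0000000E+00  1.3308000E+01",
        "0.0000000E+00  0.0000000E+00  1.9987500E-01  1.3308000E+01",
        "0.0000000E+00  0.0000000E+00  1.1998650E+00  1.3308000E+01"]

m = parse_rows(strs)
```

A line with a different number of fields makes `hcat` throw a `DimensionMismatch`, which is arguably preferable to silently producing a misaligned matrix.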

Answer 3

I strongly advise against reinventing the wheel with a custom-made parser, because such solutions rarely prove robust in production.

If your file is in a single String, use:

using DelimitedFiles
readdlm(IOBuffer(strs))

If your file is a Vector of Strings, use:

cat(readdlm.(IOBuffer.(strsa))...,dims=1)
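
As a possible alternative to broadcasting `readdlm` over every line, the lines can be joined back into one newline-separated buffer and parsed in a single call. This is a sketch under the assumption that all rows are purely numeric and have the same number of columns; `strsa` here is sample data standing in for the real vector of lines.

```julia
using DelimitedFiles

# Sample data standing in for the vector of lines read from the file
strsa = ["0.0000000E+00  0.0000000E+00  0.0000000E+00  1.3308000E+01",
         "0.0000000E+00  0.0000000E+00  1.9987500E-01  1.3308000E+01",
         "0.0000000E+00  0.0000000E+00  1.1998650E+00  1.3308000E+01"]

# Join the lines into one buffer and parse it in a single readdlm call.
# With no delimiter argument, readdlm treats runs of whitespace as the separator.
m = readdlm(IOBuffer(join(strsa, "\n")), Float64)
```

Passing `Float64` explicitly fixes the element type of the result instead of leaving it to `readdlm` to infer.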

Finally, there is no conflict between memory mapping and using CSV:

using Mmap

s = open("d.txt") # d.txt contains your lines
                  # if you want to read & write, use the "r+" option ("w+" would truncate the file)

m = Mmap.mmap(s, Vector{UInt8}) # memory mapping of your file

readdlm(IOBuffer(m))
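
For a self-contained illustration of the memory-mapping route, the sketch below writes a small sample file to a temporary path, maps it, and parses the mapped bytes. The file contents are made-up sample data and the temporary path stands in for `d.txt`.

```julia
using DelimitedFiles
using Mmap

# Write a small sample file to a temporary path (a stand-in for the real data file)
path, io = mktemp()
write(io, "0.0000000E+00  1.3308000E+01\n1.9987500E-01  1.3308000E+01\n")
close(io)

m = open(path) do s
    bytes = Mmap.mmap(s, Vector{UInt8})   # memory-map the file contents as raw bytes
    readdlm(IOBuffer(bytes), Float64)     # parse directly from the mapped bytes
end
```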

Also, regardless of memory mapping, you can always seek back to the beginning of the stream and read the data:

seek(s,0)
readdlm(s)
seek(s,0) # reset the stream

text-parsing

julia