Parsing a text file into a pandas DataFrame

Question

I have a text file containing continuous data. The sample below shows 2 lines from it:

123@#{} 456@$%
1 23

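The 3 columns I need in the DataFrame have widths 2, 3 and 4. I want to parse the file so that the first column takes the first 2 characters, the second column the next 3 characters, and the third column the next 4, following the given widths (2, 3, 4). The next group of characters should then form the next row, and so on, even when a row spans a line break in the file. So my DataFrame should look like this: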

colA    colB       colC
12       3@#       {} 4    
56       @$%       1 23

Can anyone suggest how to do this?

Answer 1

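There is no built-in method that does this, so what I would do is cut the text into pieces of the full row length (9 characters), collect them in a list, and then split each piece into columns: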

In [216]:

import pandas as pd

t = '123@#{} 456@$%1 23'
# cut the text into 9-character rows (2 + 3 + 4)
l = [t[x:x+9] for x in range(0, len(t), 9)]
l
Out[216]:
['123@#{} 4', '56@$%1 23']
In [218]:
# construct a df
df = pd.DataFrame({'data':l})
df
Out[218]:
        data
0  123@#{} 4
1  56@$%1 23
In [220]:
# now call the vectorised str methods to split the text data into 3 columns
df['colA'] = df.data.str[0:2]
df['colB'] = df.data.str[2:5]
df['colC'] = df.data.str[5:9]
df
Out[220]:
        data colA colB  colC
0  123@#{} 4   12  3@#  {} 4
1  56@$%1 23   56  @$%  1 23
In [221]:
# drop the data column
df = df.drop('data', axis=1)
df
Out[221]:
  colA colB  colC
0   12  3@#  {} 4
1   56  @$%  1 23
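
The steps in the session above can also be wrapped in a small helper that derives the slice offsets from the column widths. This is only a sketch: `split_fixed_width` and its parameters are names made up for illustration (they are not part of pandas), and it assumes the text has already had its line breaks removed. A trailing partial row, if any, is kept as a shorter row.

```python
import pandas as pd


def split_fixed_width(text, widths, names):
    """Cut `text` into rows of sum(widths) characters, then into columns.

    Hypothetical helper for illustration; not part of pandas.
    """
    record_len = sum(widths)
    # One string per row, e.g. '123@#{} 4'
    records = [text[i:i + record_len] for i in range(0, len(text), record_len)]
    data = pd.Series(records, dtype="object")

    columns = {}
    start = 0
    for name, width in zip(names, widths):
        # Vectorised string slicing, the same idea as df.data.str[0:2]
        columns[name] = data.str[start:start + width]
        start += width
    return pd.DataFrame(columns)


df = split_fixed_width('123@#{} 456@$%1 23', [2, 3, 4], ['colA', 'colB', 'colC'])
```

Changing the widths or column names then only means changing the two lists passed in.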

Edit

To handle your updated data file, I've added some code that parses the text file and fills a dictionary:

In [35]:

d = {'data': []}
line_len = 9  # 2 + 3 + 4 characters per row
with open(r'c:\data\date.csv') as f:
    temp = ''
    for line in f:
        # drop the newline so it is not counted as data
        line = line.rstrip('\n')
        if len(line) >= line_len:
            # the first line_len characters form a complete row
            d['data'].append(line[:line_len])
        # consume the rest of the line
        if len(temp) != line_len:
            if len(line) >= line_len:
                temp = line[line_len:]
            else:
                temp += line
        if len(temp) == line_len:
            d['data'].append(temp)
            temp = ''

df = pd.DataFrame(d)
df['colA'] = df.data.str[0:2]
df['colB'] = df.data.str[2:5]
df['colC'] = df.data.str[5:9]
df = df.drop('data', axis=1)
df
Out[35]:
  colA colB  colC
0   12  3@#  {} 4
1   56  @$%  1 23
2   12  3@#  {} 4
3   56  @$%  1 23
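
The loop above relies on the layout of the sample file (a long line followed by a short one). A more general variant, sketched below, buffers characters across line breaks and emits a row whenever 9 characters are available. `iter_records` is a hypothetical helper name, and an `io.StringIO` object stands in for the real file so the example is self-contained.

```python
import io


def iter_records(lines, record_len):
    """Yield consecutive `record_len`-character rows, ignoring line breaks.

    `iter_records` is a hypothetical helper, not a pandas function.
    """
    buffer = ''
    for line in lines:
        buffer += line.rstrip('\n')
        # Emit every complete row currently held in the buffer
        while len(buffer) >= record_len:
            yield buffer[:record_len]
            buffer = buffer[record_len:]
    # Leftover characters (an incomplete final row) are dropped here


# Stand-in for the real file; an open file object could be passed instead
sample = io.StringIO('123@#{} 456@$%\n1 23\n123@#{} 456@$%\n1 23\n')
records = list(iter_records(sample, 9))
```

The resulting list can be passed to `pd.DataFrame({'data': records})` and sliced into `colA`, `colB` and `colC` exactly as before.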

Answer 2

Split the data into equal-sized chunks and use read_fwf:

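import io
import pandas as pd
data = '123@#{} 456@$%1 23'  # file contents with line breaks removed
row_length = 9  # 2 + 3 + 4
lines = [data[i:i+row_length] for i in range(0, len(data), row_length)]
buf = io.StringIO("\n".join(lines))
df = pd.read_fwf(buf, colspecs=[(0,2), (2,5), (5,9)], header=None)
print(df)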

The result will be:

    0    1     2
0  12  3@#  {} 4
1  56  @$%  1 23
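
Two details of `read_fwf` are worth noting for this kind of data. By default it infers column types, so a column holding `12` and `56` comes back as integers, and it trims leading and trailing spaces from each field. The sketch below, which assumes the same sample string and a row length of 9, uses `widths` instead of `colspecs` and passes `dtype=str` to keep every column as text. The column names are taken from the question.

```python
import io

import pandas as pd

data = '123@#{} 456@$%1 23'
row_length = 9
lines = [data[i:i + row_length] for i in range(0, len(data), row_length)]

# widths=[2, 3, 4] describes the same columns as
# colspecs=[(0, 2), (2, 5), (5, 9)];
# dtype=str stops pandas from converting '12' and '56' to integers
df = pd.read_fwf(io.StringIO("\n".join(lines)), widths=[2, 3, 4],
                 header=None, dtype=str)
df.columns = ['colA', 'colB', 'colC']
```

If spaces at the edges of a field are significant, the string-slicing approach from Answer 1 may be the safer choice, since it leaves them untouched.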

But I think a direct approach without pandas would be easier.
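
One possible reading of "a direct approach without pandas" is plain string slicing. The following is a minimal sketch under that assumption, using only the standard library; the variable names are illustrative, and the sample string is the file content with line breaks removed.

```python
from itertools import accumulate

widths = [2, 3, 4]
# Offsets 0, 2, 5, 9 mark where each field starts and ends within a row
offsets = [0] + list(accumulate(widths))
record_len = offsets[-1]

data = '123@#{} 456@$%1 23'
rows = [
    tuple(data[start + a:start + b] for a, b in zip(offsets, offsets[1:]))
    for start in range(0, len(data), record_len)
]
```

Each element of `rows` is a tuple of three strings, one per column, which can be used directly or handed to `pd.DataFrame(rows, columns=[...])` later if a DataFrame is still wanted.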
