Parsing array from txt file to Pandas dataframe in Python
Hi, I have an array like this in a .txt file:
n|vechicle.car.characteristics[0].speed|180
n|vechicle.car.characteristics[0].weight|3
c|vechicle.car.characteristics[0].color|black
c|vechicle.car.characteristics[0].fuel|95
n|vechicle.car.characteristics[1].speed|160
n|vechicle.car.characteristics[1].weight|4
c|vechicle.car.characteristics[1].color|green
c|vechicle.car.characteristics[1].fuel|92
n|vechicle.car.characteristics[2].speed|200
n|vechicle.car.characteristics[2].weight|5
c|vechicle.car.characteristics[2].color|white
c|vechicle.car.characteristics[2].fuel|95
I want to parse it into a dataframe like this:
   speed  weight  color  fuel
0    180       3  black    95
1    160       4  green    92
2    200       5  white    95
Here is how I solved it:
import re
import pandas as pd

df_output_list = {}
df_output_dict = []
match_counter = 1
with open('sample_car.txt', encoding='utf-8') as file:
    line = file.readline()
    while line:
        result = re.split(r'\|', line.rstrip())
        result2 = re.findall(r'.(?<=\[)(\d+)(?=\])', result[1])
        regex = re.compile('vechicle.car.characteristics.')
        match = re.search(regex, result[1])
        if match:
            if match_counter == 1:
                ArrInd = 0
                match_counter += 1
            #print(df_output_list)
            if ArrInd == int(result2[0]):
                df_output_list[result[1].split('.')[3]] = result[2]
                ArrInd = int(result2[0])
            else:
                df_output_dict.append(df_output_list)
                df_output_list = {}
                df_output_list[result[1].split('.')[3]] = result[2]
                ArrInd = int(result2[0])
        line = file.readline()
df_output_dict.append(df_output_list)
#print(df_output_dict)
df_output = pd.DataFrame(df_output_dict)
print(df_output)
I find this too complicated. Can it be simplified?
The column names should be parsed automatically.
Read the file as CSV with sep='|', take the last column, which holds the values, and reshape it to the appropriate shape.
>>> columns=['speed','weight','color','fuel']
>>> s = pd.read_csv('filename.txt', sep='|', header=None).iloc[:,-1]
>>> df = pd.DataFrame(s.to_numpy().reshape(-1,4), columns=columns)
>>> df
   speed  weight  color  fuel
0    180       3  black    95
1    160       4  green    92
2    200       5  white    95
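For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same reshape idea. The `io.StringIO` stand-in for the file, the `SAMPLE` name, and the choice of which columns to convert to numbers are my assumptions, not part of the answer above. Since the value column mixes numbers and text, it is read as strings and the numeric columns are converted afterwards.

```python
import io

import pandas as pd

# In-memory stand-in for the .txt file (assumption: same layout as the sample).
SAMPLE = """\
n|vechicle.car.characteristics[0].speed|180
n|vechicle.car.characteristics[0].weight|3
c|vechicle.car.characteristics[0].color|black
c|vechicle.car.characteristics[0].fuel|95
n|vechicle.car.characteristics[1].speed|160
n|vechicle.car.characteristics[1].weight|4
c|vechicle.car.characteristics[1].color|green
c|vechicle.car.characteristics[1].fuel|92
n|vechicle.car.characteristics[2].speed|200
n|vechicle.car.characteristics[2].weight|5
c|vechicle.car.characteristics[2].color|white
c|vechicle.car.characteristics[2].fuel|95
"""

columns = ['speed', 'weight', 'color', 'fuel']

# The last column holds the values; read everything as str to keep it uniform.
values = pd.read_csv(io.StringIO(SAMPLE), sep='|', header=None, dtype=str).iloc[:, -1]

# Relies on every record having exactly len(columns) consecutive rows.
df = pd.DataFrame(values.to_numpy().reshape(-1, len(columns)), columns=columns)

# Assumption: only the fields flagged 'n' in the file should become numbers.
for col in ['speed', 'weight']:
    df[col] = pd.to_numeric(df[col])

print(df)
```

Note that the reshape only works if every record lists the same fields in the same order; a missing line would shift all following values into the wrong columns.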
If every line has a fixed format like n|vechicle.car.characteristics[0].speed|180, the column names can be derived from the data as well:
>>> df = pd.read_csv('d.csv', sep='|', header=None)
>>> columns = df.iloc[:,1].str.split('.').str[-1].unique()
>>> df_out = pd.DataFrame(df.iloc[:,-1].to_numpy().reshape(-1,len(columns)), columns=columns)
>>> df_out
   speed  weight  color  fuel
0    180       3  black    95
1    160       4  green    92
2    200       5  white    95
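If the file cannot be trusted to list the same fields in the same order for every record, a pivot on the array index may be safer than a reshape. The following is only a sketch of that alternative, not the method from the answers above; the column names `kind`, `path`, `value`, `idx`, `field` and the in-memory `SAMPLE` are made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for the .txt file (assumption: same layout as the sample).
SAMPLE = """\
n|vechicle.car.characteristics[0].speed|180
n|vechicle.car.characteristics[0].weight|3
c|vechicle.car.characteristics[0].color|black
c|vechicle.car.characteristics[0].fuel|95
n|vechicle.car.characteristics[1].speed|160
n|vechicle.car.characteristics[1].weight|4
c|vechicle.car.characteristics[1].color|green
c|vechicle.car.characteristics[1].fuel|92
n|vechicle.car.characteristics[2].speed|200
n|vechicle.car.characteristics[2].weight|5
c|vechicle.car.characteristics[2].color|white
c|vechicle.car.characteristics[2].fuel|95
"""

raw = pd.read_csv(io.StringIO(SAMPLE), sep='|', header=None,
                  names=['kind', 'path', 'value'], dtype=str)

# Pull the array index and the trailing field name out of a path such as
# 'vechicle.car.characteristics[0].speed'.
parts = raw['path'].str.extract(r'\[(?P<idx>\d+)\]\.(?P<field>[^.]+)$')
raw['idx'] = parts['idx'].astype(int)
raw['field'] = parts['field']

# One row per array index, one column per field name.
df_out = raw.pivot(index='idx', columns='field', values='value')

# pivot sorts the columns alphabetically; restore first-seen order.
df_out = df_out[raw['field'].unique()]
df_out.index.name = None
df_out.columns.name = None

print(df_out)
```

With this layout a record that lacks a field would simply get NaN in that column instead of shifting the remaining values. All values stay strings here; convert the numeric columns with pd.to_numeric if needed.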