Convert a text file into a DataFrame with multiple custom delimiters in Python

I am new to Python. I have a txt file that contains data such as:

0: 480x640 2 persons, 1 cat, 1 clock, 1: 480x640 2 persons, 1 chair, Done. date (0.635s) Tue, 05 April 03:54:02 
0: 480x640 3 persons, 1 cat, 1 laptop, 1 clock, 1: 480x640 4 persons, 2 chairs, Done. date (0.587s) Tue, 05 April 03:54:05 
0: 480x640 3 persons, 1 chair, 1: 480x640 4 persons, 2 chairs, Done. date (0.582s) Tue, 05 April 03:54:07 

I tried to convert it into a pandas DataFrame using multiple delimiters.

This is the code I tried:

import pandas as pd

student_csv = pd.read_csv('output.txt', names=['a', 'b', 'date', 'status'], sep='[0: 480x640, 1: 480x640 , date]')

student_csv.to_csv('txttocsv.csv', index=None)

Now, how do I convert it into a pandas DataFrame like this...

     a               b                       c           
    
2 persons    2 persons,  Done    Tue, 05 April03:54:02   

How can I convert the text file into a DataFrame?

It is tricky to know exactly what your splitting rules are. You can use a regular expression as the separator.
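One likely reason the attempt in the question does not split as intended: in a regular expression, square brackets denote a character class, so `[0: 480x640, 1: 480x640 , date]` matches any single one of the listed characters rather than the whole phrases. The sketch below uses `re.split` on a shortened, made-up line (an assumption for illustration, not a line from the real file) to contrast a character class with alternation via `|`:

```python
import re

# Shortened, hypothetical line used only for illustration.
line = "0: 480x640 2 persons, 1: 480x640 1 chair"

# Character class: splits on every single character listed inside the brackets.
char_class = re.split(r'[0: 480x640, 1: 480x640 , date]', line)

# Alternation: splits only where one of the whole phrases occurs.
alternation = re.split(r'0: 480x640|1: 480x640', line)

print(len(char_class))   # many small fragments
print(alternation)       # ['', ' 2 persons, ', ' 1 chair']
```

The leading empty string in the alternation result appears because the line starts with a separator, which is why an extra first column usually has to be dropped afterwards.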

Here is a working example that splits the lists and the date into columns, though you may need to adapt it to your exact rules:

df = pd.read_csv('output.txt', sep=r'(?:,\s*|^)(?:\d+: \d+x\d+|Done[^)]+\)\s*)',
                 header=None, engine='python', names=(None, 'a', 'b', 'date')).iloc[:, 1:]

Output:

                                      a                     b                    date
0             2 persons, 1 cat, 1 clock    2 persons, 1 chair  Tue, 05 April 03:54:02
1   3 persons, 1 cat, 1 laptop, 1 clock   4 persons, 2 chairs  Tue, 05 April 03:54:05
2                    3 persons, 1 chair   4 persons, 2 chairs  Tue, 05 April 03:54:07
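To see how that separator pattern behaves before adapting it, it can help to apply it to a single line with `re.split`. This is only a sketch: the python engine of `read_csv` is assumed to split each line in a comparable way, and the explicit `strip()` is added here for readability.

```python
import re

# Same separator pattern as in the read_csv call above.
SEP = r'(?:,\s*|^)(?:\d+: \d+x\d+|Done[^)]+\)\s*)'

# First line of the sample data from the question.
line = ("0: 480x640 2 persons, 1 cat, 1 clock, "
        "1: 480x640 2 persons, 1 chair, "
        "Done. date (0.635s) Tue, 05 April 03:54:02")

# Split on the pattern and trim surrounding whitespace from each field.
fields = [f.strip() for f in re.split(SEP, line)]
print(fields)
# ['', '2 persons, 1 cat, 1 clock', '2 persons, 1 chair', 'Tue, 05 April 03:54:02']
```

The first field is empty because the line begins with a separator (`0: 480x640`), which is why the answer assigns a throwaway first column name and drops it with `.iloc[:, 1:]`.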

You can use | in the sep parameter to specify multiple delimiters:

df = pd.read_csv('data.txt', sep=r'0: 480x640|1: 480x640|date \(.*\)',
                 engine='python', names=('None', 'a', 'b', 'c')).drop('None', axis=1)
print(df)

                                        a                             b  \
0             2 persons, 1 cat, 1 clock,     2 persons, 1 chair, Done.
1   3 persons, 1 cat, 1 laptop, 1 clock,    4 persons, 2 chairs, Done.
2                    3 persons, 1 chair,    4 persons, 2 chairs, Done.

                     c
0  Tue, 05 April 03:54:02
1  Tue, 05 April 03:54:05
2  Tue, 05 April 03:54:07
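Columns a and b in the output above still carry the trailing comma and `Done.`. If cleaner values are needed, one alternative (a different technique from `sep`, shown only as a sketch) is to capture the parts with named groups via `Series.str.extract`. The pattern, the `seconds` column name, and the inline `lines` list are assumptions based on the three sample lines in the question; real data may need a looser pattern.

```python
import pandas as pd

# Sample lines copied from the question; in practice these would be read from the file.
lines = [
    "0: 480x640 2 persons, 1 cat, 1 clock, 1: 480x640 2 persons, 1 chair, "
    "Done. date (0.635s) Tue, 05 April 03:54:02 ",
    "0: 480x640 3 persons, 1 cat, 1 laptop, 1 clock, 1: 480x640 4 persons, 2 chairs, "
    "Done. date (0.587s) Tue, 05 April 03:54:05 ",
    "0: 480x640 3 persons, 1 chair, 1: 480x640 4 persons, 2 chairs, "
    "Done. date (0.582s) Tue, 05 April 03:54:07 ",
]

# Hypothetical pattern: one named group per output column.
PATTERN = (
    r'^0: \d+x\d+ (?P<a>.*?), '
    r'1: \d+x\d+ (?P<b>.*?), '
    r'Done\. date \((?P<seconds>[\d.]+)s\) '
    r'(?P<date>.*\S)'
)

# Each named group becomes a column of the resulting DataFrame.
df = pd.Series(lines).str.extract(PATTERN)
df['seconds'] = df['seconds'].astype(float)
print(df)
```

Lines that do not match the pattern come back as rows of NaN, which makes malformed input easy to spot.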