Convert Text File into Pandas Dataframe

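I would like to create a dataframe from a text file. I scraped some data from a website and wrote it to a .txt file. As the first 10 lines of the text file show, there are 10 'columns'. Can anyone help me separate the lines into their respective columns in a pandas dataframe? Much appreciated!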

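Below is an example of the text file. I would like the first 10 lines to be the column names, with the subsequent lines placed under their respective columns.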

NFT Collection
Volume (ETH)
Market Cap (ETH)
Max price (ETH)
Avg price (ETH)
Min price (ETH)
% Opensea+Rarible
#Transactions
#Wallets
Contract date
Axies | Axie Infinity
4,884
480,695
5.24
.0563
.0024
0
86,807
2,389,981
189d ago
Sandbox's LANDs
578
112,989
6
1.11
.108
100%
394
12,879
700d ago
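Update

Filling the dataframe directly in the loop should be the most memory-efficient approach. This method also avoids loading the whole text file at once: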

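import pandas as pd

txt_file = "path/to/your/file"

COL_COUNT = 10

with open(txt_file, "r") as f:
    # The first COL_COUNT lines are the column names
    col = [next(f).strip() for _ in range(COL_COUNT)]
    df = pd.DataFrame(columns=col)
    i = COL_COUNT
    while line := f.readline():  # walrus operator requires Python 3.8+
        if i % COL_COUNT == 0:
            row = []  # start a new record
        row.append(line.strip())
        if i % COL_COUNT == COL_COUNT - 1:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            df = pd.concat([df, pd.DataFrame([row], columns=col)], ignore_index=True)
        i += 1

    df.set_index(col[0], inplace=True)  # get rid of row index
    print(df)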

Output:

                      Volume (ETH) Market Cap (ETH) Max price (ETH) Avg price (ETH) Min price (ETH) % Opensea+Rarible #Transactions   #Wallets Contract date
NFT Collection
Axies | Axie Infinity        4,884          480,695            5.24           .0563           .0024                 0        86,807  2,389,981      189d ago
Sandbox's LANDs                578          112,989               6            1.11            .108              100%           394     12,879      700d ago
Update 2

The list approach is still faster, but may use more memory for large files:

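import pandas as pd

txt_file = "path/to/your/file"

COL_COUNT = 10

table = []
with open(txt_file, "r") as f:
    # The first COL_COUNT lines are the column names
    col = [next(f).strip() for _ in range(COL_COUNT)]
    i = COL_COUNT
    while line := f.readline():  # walrus operator requires Python 3.8+
        if i % COL_COUNT == 0:
            row = []  # start a new record
        row.append(line.strip())
        if i % COL_COUNT == COL_COUNT - 1:
            table.append(row)  # record complete
        i += 1

    # Build the dataframe once, from the full list of rows
    df = pd.DataFrame(table, columns=col)
    df.set_index(col[0], inplace=True)  # get rid of row index
    print(df)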
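
The same list-based logic can be wrapped in a small reusable function. The following is only a sketch of one way to do that, not part of the original answer: the name `records_to_frame` and its `n_cols` parameter are made up for illustration, and it assumes that blank lines should be skipped and that an incomplete trailing record should raise an error rather than be dropped silently.

```python
import pandas as pd


def records_to_frame(lines, n_cols=10):
    """Build a DataFrame from lines where every n_cols lines form one record.

    Hypothetical helper: the first n_cols non-blank lines become the column names.
    """
    # Strip whitespace and drop blank lines (an assumption about the input)
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) < n_cols or len(cleaned) % n_cols != 0:
        raise ValueError(
            f"expected a multiple of {n_cols} non-blank lines, got {len(cleaned)}"
        )
    # Slice the flat list into consecutive chunks of n_cols items
    chunks = [cleaned[i:i + n_cols] for i in range(0, len(cleaned), n_cols)]
    header, rows = chunks[0], chunks[1:]
    return pd.DataFrame(rows, columns=header)
```

It accepts any iterable of strings, so an open file object works directly, for example `with open(txt_file) as f: df = records_to_frame(f)`. All values are still strings at this point.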

Assuming your text file is called foo.txt, we can first build a dictionary for your data, using:

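from itertools import islice
import pandas as pd

foo = {}
with open('foo.txt') as f:
    head = [next(f).strip() for _ in range(10)]  # column names
    i = 0
    # Read 10 lines per record until the file runs out, instead of assuming 500 records
    while len(row := [line.strip() for line in islice(f, 10)]) == 10:
        foo[i] = row
        i += 1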

Then simply create the dataframe with the from_dict method:

pd.DataFrame.from_dict(foo, columns=head, orient='index')

Which gives you:

    NFT Collection  Volume (ETH)    Market Cap (ETH)    Max price (ETH) Avg price (ETH) Min price (ETH) % Opensea+Rarible   #Transactions   #Wallets    Contract date
0   Axies | Axie Infinity   4,884   480,695 5.24    .0563   .0024   0   86,807  2,389,981   189d ago
1   Sandbox's LANDs 578 112,989 6   1.11    .108    100%    394 12,879  700d ago

Like this:

import pandas as pd

text = """NFT Collection
Volume (ETH)
Market Cap (ETH)
Max price (ETH)
Avg price (ETH)
Min price (ETH)
% Opensea+Rarible
#Transactions
#Wallets
Contract date
Axies | Axie Infinity
4,884
480,695
5.24
.0563
.0024
0
86,807
2,389,981
189d ago
Sandbox's LANDs
578
112,989
6
1.11
.108
100%
394
12,879
700d ago"""

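# Split into lines, then group every 10 lines into one row
text = text.split('\n')
text = [text[i:(i+10)] for i in range(0,len(text),10)]
# The first group holds the column names, the rest are the data rows
df = pd.DataFrame(text[1:],columns=text[0])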

Here is another variant:

from io import StringIO

import pandas as pd

with open("input.txt", "r") as file:
    data = [line.strip() for line in file]
# Join every 10 lines into one ";"-separated row so read_csv can parse the result
data = StringIO("\n".join(";".join(data[i:i+10]) for i in range(0, len(data), 10)))
df = pd.read_csv(data, delimiter=";")

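Pro: you don't have to convert the numbers from strings to int/float etc. yourself, pd.read_csv can do that. Note that values with thousands separators such as 4,884 are only parsed as numbers if you also pass thousands=",". Con: you have to make sure the separator (used in both join and pd.read_csv) is a character that does not occur in the input.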
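
To illustrate the type-conversion point, here is a minimal sketch of the same join-then-read_csv idea with the `thousands` option of `pd.read_csv`. It is not the answerer's code: it uses a shortened in-memory sample with 3 columns instead of the 10 in the question's file, and `raw`, `N_COLS` and `joined` are names chosen for this example only.

```python
from io import StringIO

import pandas as pd

# Shortened sample in the question's layout: 3 header lines, then 3 lines per record
raw = """NFT Collection
Volume (ETH)
Avg price (ETH)
Axies | Axie Infinity
4,884
.0563
Sandbox's LANDs
578
1.11"""

N_COLS = 3  # assumption for this shortened sample; the question's file has 10
lines = [line.strip() for line in raw.split("\n")]
# ";" must not occur anywhere in the data for this to work
joined = "\n".join(";".join(lines[i:i + N_COLS]) for i in range(0, len(lines), N_COLS))

# thousands="," lets read_csv parse "4,884" as the number 4884
df = pd.read_csv(StringIO(joined), delimiter=";", thousands=",")
print(df.dtypes)
```

Columns such as "% Opensea+Rarible" (with values like 100%) or "Contract date" (189d ago) would still come back as strings and need their own handling.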