Convert Text File into Pandas Dataframe
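I want to create a dataframe from a text file. I scraped some data from a website and wrote it to a .txt file. As the first 10 lines of the text file show, there are 10 'columns'. Can anyone help me split the lines into their respective columns in a pandas dataframe? Thanks a lot!
Below is an example of the text file. I would like the first 10 lines to be the column names, with the subsequent lines placed under their respective columns.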
NFT Collection
Volume (ETH)
Market Cap (ETH)
Max price (ETH)
Avg price (ETH)
Min price (ETH)
% Opensea+Rarible
#Transactions
#Wallets
Contract date
Axies | Axie Infinity
4,884
480,695
5.24
.0563
.0024
0
86,807
2,389,981
189d ago
Sandbox's LANDs
578
112,989
6
1.11
.108
100%
394
12,879
700d ago
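Update
Populating the dataframe directly in the loop should be the most memory-efficient approach. It also avoids loading the entire text file at once: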
import pandas as pd

txt_file = "path/to/your/file"
COL_COUNT = 10

with open(txt_file, "r") as f:
    # the first COL_COUNT lines hold the column names
    col = [next(f).strip() for i in range(COL_COUNT)]
    df = pd.DataFrame(columns=col)
    i = COL_COUNT
    while line := f.readline():  # the walrus operator needs Python 3.8+
        if i % COL_COUNT == 0:
            row = []  # start a new row
        row.append(line.strip())
        if i % COL_COUNT == COL_COUNT - 1:
            # row is complete; DataFrame.append was removed in pandas 2.0
            df.loc[len(df)] = row
        i += 1

df.set_index(col[0], inplace=True)  # get rid of row index
print(df)
Output:
                       Volume (ETH)  Market Cap (ETH)  Max price (ETH)  Avg price (ETH)  Min price (ETH)  % Opensea+Rarible  #Transactions   #Wallets  Contract date
NFT Collection
Axies | Axie Infinity         4,884           480,695             5.24            .0563            .0024                  0         86,807  2,389,981       189d ago
Sandbox's LANDs                 578           112,989                6             1.11             .108               100%            394     12,879       700d ago
Update 2
Collecting the rows in a list is still faster, but it may use more memory for large files:
import pandas as pd

txt_file = "path/to/your/file"
COL_COUNT = 10
table = []

with open(txt_file, "r") as f:
    # the first COL_COUNT lines hold the column names
    col = [next(f).strip() for i in range(COL_COUNT)]
    i = COL_COUNT
    while line := f.readline():  # the walrus operator needs Python 3.8+
        if i % COL_COUNT == 0:
            row = []  # start a new row
        row.append(line.strip())
        if i % COL_COUNT == COL_COUNT - 1:
            table.append(row)  # row is complete
        i += 1

df = pd.DataFrame(table, columns=col)
df.set_index(col[0], inplace=True)  # get rid of row index
print(df)
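Both updates rely on the same idea: read the file line by line and close a row after every COL_COUNT lines. A minimal sketch of that grouping as a generator, assuming Python 3.8+ and using an in-memory io.StringIO in place of the real file. The names iter_rows and sample are made up for illustration, and the sample is cut down to three columns to keep it short:

```python
from io import StringIO
from itertools import islice

import pandas as pd

COL_COUNT = 3  # the question's file uses 10; 3 keeps the sample short


def iter_rows(lines, width):
    """Yield lists of `width` stripped lines; an incomplete trailing group is dropped."""
    it = (line.strip() for line in lines)
    while len(row := list(islice(it, width))) == width:
        yield row


# hypothetical stand-in for the scraped text file
sample = StringIO(
    "NFT Collection\nVolume (ETH)\nContract date\n"
    "Axies | Axie Infinity\n4,884\n189d ago\n"
    "Sandbox's LANDs\n578\n700d ago\n"
)

rows = iter_rows(sample, COL_COUNT)
col = next(rows)  # the first group holds the column names
df = pd.DataFrame(list(rows), columns=col).set_index(col[0])
print(df)
```

Passing an open file object instead of sample should work the same way, since only line iteration is used.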
Assuming your text file is named foo.txt, we can first build a dictionary from your data:
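import pandas as pd

foo = {}
with open('foo.txt') as f:
    head = [next(f).strip() for x in range(10)]  # column names
    rest = [line.strip() for line in f]          # all remaining values

# one dictionary entry per record of 10 lines; this replaces a hard-coded
# range(500), which fails if the file holds fewer than 500 records
for i in range(len(rest) // 10):
    foo[i] = rest[10 * i:10 * (i + 1)]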
Then simply use the from_dict method to create the dataframe:
pd.DataFrame.from_dict(foo, columns=head, orient='index')
This gives you:
          NFT Collection  Volume (ETH)  Market Cap (ETH)  Max price (ETH)  Avg price (ETH)  Min price (ETH)  % Opensea+Rarible  #Transactions   #Wallets  Contract date
0  Axies | Axie Infinity         4,884           480,695             5.24            .0563            .0024                  0         86,807  2,389,981       189d ago
1        Sandbox's LANDs           578           112,989                6             1.11             .108               100%            394     12,879       700d ago
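Another option is to split the text into lines and slice them into chunks of 10, like this:
import pandas as pd

text = """NFT Collection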
Volume (ETH)
Market Cap (ETH)
Max price (ETH)
Avg price (ETH)
Min price (ETH)
% Opensea+Rarible
#Transactions
#Wallets
Contract date
Axies | Axie Infinity
4,884
480,695
5.24
.0563
.0024
0
86,807
2,389,981
189d ago
Sandbox's LANDs
578
112,989
6
1.11
.108
100%
394
12,879
700d ago"""
text = text.split('\n')
# slice the lines into chunks of 10: the first chunk is the header, the rest are rows
text = [text[i:(i + 10)] for i in range(0, len(text), 10)]
df = pd.DataFrame(text[1:], columns=text[0])
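A frame built from sliced lists like this holds every value as a string, for example "4,884". If numeric columns are needed, one possible clean-up is to strip the thousands separators and call pd.to_numeric. This is a sketch rather than part of the original answers: the helper name to_number, the shortened four-column frame, and the choice of which columns count as numeric are all assumptions.

```python
import pandas as pd

# small string-only frame shaped like the ones built above
df = pd.DataFrame(
    [["Axies | Axie Infinity", "4,884", ".0563", "189d ago"],
     ["Sandbox's LANDs", "578", "1.11", "700d ago"]],
    columns=["NFT Collection", "Volume (ETH)", "Avg price (ETH)", "Contract date"],
)


def to_number(series):
    """Drop thousands separators, then parse; unparseable values become NaN."""
    return pd.to_numeric(series.str.replace(",", "", regex=False), errors="coerce")


numeric_cols = ["Volume (ETH)", "Avg price (ETH)"]  # assumed to be numeric
df[numeric_cols] = df[numeric_cols].apply(to_number)
print(df.dtypes)
```

Columns such as "% Opensea+Rarible" or "Contract date" mix digits with other characters and would need their own parsing rules.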
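Here is another variant, which joins every 10 lines into one delimited record and lets pd.read_csv parse the result: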
from io import StringIO

import pandas as pd

with open("input.txt", "r") as file:
    data = [line.strip() for line in file]

# join every 10 lines with ";" so that each record becomes one CSV row
data = StringIO("\n".join(";".join(data[i:i+10]) for i in range(0, len(data), 10)))
df = pd.read_csv(data, delimiter=";")
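Advantage: you don't have to convert the numbers from strings to int/float etc. yourself, because pd.read_csv can do that. Disadvantage: you have to make sure that the separator (used in join and in pd.read_csv) is a character that does not occur in the input.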
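To make both points concrete, here is a hedged sketch of the same idea run on an in-memory sample. It checks that the chosen separator does not occur in the data before joining, and it passes thousands="," so that values such as 4,884 are parsed as integers. The thousands argument, the names WIDTH and SEP, and the shortened three-column sample are additions for illustration and not part of the answer above.

```python
from io import StringIO

import pandas as pd

WIDTH = 3   # the question's file has 10 lines per record
SEP = ";"   # must not occur anywhere in the input

# hypothetical stand-in for the lines read from input.txt
data = ["NFT Collection", "Volume (ETH)", "Max price (ETH)",
        "Axies | Axie Infinity", "4,884", "5.24",
        "Sandbox's LANDs", "578", "6"]

# fail early if the separator would split a value
if any(SEP in line for line in data):
    raise ValueError(f"separator {SEP!r} occurs in the input")

csv_text = "\n".join(SEP.join(data[i:i + WIDTH]) for i in range(0, len(data), WIDTH))
df = pd.read_csv(StringIO(csv_text), sep=SEP, thousands=",")
print(df.dtypes)
```

Without thousands=",", a column containing "4,884" would stay a string column, so the automatic conversion only covers values that already look like plain numbers.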