How do I load raw data from a text file into a pandas DataFrame?
My data is in a text file in the following format:
heading1:blah
heading2:blah
heading3:blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah (the text wraps onto a new line, for heading3 only)
heading1:blah
heading2:blah
heading3:blah blah blah blah blah blah blah blah blah blah
and so on...
Notes:
- The heading3 data continues onto the next line.
- Here is a link to the dataset as a Zip file
Thanks for posting the link to the data. If it is public, it helps to do that from the start. I ran this on the full dataset; it took a few seconds on a decent laptop.
import pandas as pd

with open('rfa_all.NL-SEPARATED.txt', 'r') as f:
    data = f.readlines()

# Create a dictionary that maps each key to a list of values.
# If the values are not lists, .append() below raises an error.
d = {'SRC': [], 'TGT': [], 'VOT': [], 'RES': [], 'YEA': [], 'DAT': [], 'TXT': []}

for line in data:  # go through the file line by line
    if line != '\n':  # skip blank lines between records
        line = line.replace('\n', '')  # strip the trailing newline from the field
        key, val = line.split(':', 1)  # split on the first colon only, so values may contain ':'
        d[key].append(val)

# Every list must end up the same length, i.e. every record has all seven keys.
df = pd.DataFrame(d)
df
I got a lot of help from this post:
I'm sure there is a faster way to set this up, but I think this will work.
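The question notes that heading3 text can spill onto the next line, and a line without a `key:` prefix would make `line.split(':', 1)` fail in the loop above. Below is a minimal sketch of one way to handle that, assuming a line that does not start with a known key is a continuation of the previous field. The function name `parse_records` and the inline sample data are hypothetical, made up for illustration; they are not from the original dataset.

```python
import io
import pandas as pd

def parse_records(lines, keys):
    """Parse 'KEY:value' lines into a list of dicts, one dict per record."""
    records, current, last_key = [], {}, None
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines
        key, sep, val = line.partition(':')
        if sep and key in keys:
            if key in current:
                # Seeing a key a second time means a new record has started.
                records.append(current)
                current = {}
            current[key] = val
            last_key = key
        elif last_key is not None:
            # Assumption: no known key prefix means the previous field wrapped.
            current[last_key] += ' ' + line.strip()
    if current:
        records.append(current)
    return records

# Hypothetical sample mirroring the format in the question.
sample = io.StringIO(
    "heading1:a\n"
    "heading2:b\n"
    "heading3:first part\n"
    "second part\n"
    "\n"
    "heading1:c\n"
    "heading2:d\n"
    "heading3:e\n"
)
keys = ['heading1', 'heading2', 'heading3']
df = pd.DataFrame(parse_records(sample, keys), columns=keys)
print(df)
```

Building a list of dicts rather than a dict of lists means a record with a missing key ends up as NaN in that column instead of making the columns unequal in length. To use it on a real file, pass the open file object in place of `sample`.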