Error reading pandas DataFrame from file
I am trying to read a file using DataFrame.from_csv() in Python pandas. The file contains these values:
TICKER,date,ASKHI,PRC,BIDLO,PortfolioDate,PortfolioName
MSFT,2012-06-29 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-07-31 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-08-31 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-09-28 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-10-31 00:00:00,28.88,28.54,28.5,2010-12-31 00:00:00,SAP500
However, when I read it into a DataFrame, the resulting frame looks like this:
date ASKHI PRC BIDLO PortfolioDate \
TICKER
MSFT 2012-06-29 00:00:00 NaN NaN NaN 2010-12-31 00:00:00
MSFT 2012-07-31 00:00:00 NaN NaN NaN 2010-12-31 00:00:00
MSFT 2012-08-31 00:00:00 NaN NaN NaN 2010-12-31 00:00:00
MSFT 2012-09-28 00:00:00 NaN NaN NaN 2010-12-31 00:00:00
MSFT 2012-10-31 00:00:00 28.88 28.54 28.5 2010-12-31 00:00:00
PortfolioName
TICKER
MSFT SAP500
MSFT SAP500
MSFT SAP500
MSFT SAP500
MSFT SAP500
When I select the 'date' column with frame['date'], the result is:
TICKER
MSFT 2012-06-29 00:00:00
MSFT 2012-07-31 00:00:00
MSFT 2012-08-31 00:00:00
MSFT 2012-09-28 00:00:00
MSFT 2012-10-31 00:00:00
My code is:
frame = DataFrame.from_csv('/home/raghu/log.txt',sep=',');
I am new to this. Is there something I am missing? Why is the first column treated like this?
Edit: pandas version: '0.14.1'
Don't use from_csv; it is no longer maintained. Use read_csv instead:
In [112]:
import io
import pandas as pd
temp="""TICKER,date,ASKHI,PRC,BIDLO,PortfolioDate,PortfolioName
MSFT,2012-06-29 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-07-31 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-08-31 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-09-28 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-10-31 00:00:00,28.88,28.54,28.5,2010-12-31 00:00:00,SAP500"""
df = pd.read_csv(io.StringIO(temp))
df
Out[112]:
TICKER date ASKHI PRC BIDLO PortfolioDate \
0 MSFT 2012-06-29 00:00:00 NaN NaN NaN 2010-12-31 00:00:00
1 MSFT 2012-07-31 00:00:00 NaN NaN NaN 2010-12-31 00:00:00
2 MSFT 2012-08-31 00:00:00 NaN NaN NaN 2010-12-31 00:00:00
3 MSFT 2012-09-28 00:00:00 NaN NaN NaN 2010-12-31 00:00:00
4 MSFT 2012-10-31 00:00:00 28.88 28.54 28.5 2010-12-31 00:00:00
PortfolioName
0 SAP500
1 SAP500
2 SAP500
3 SAP500
4 SAP500
In [113]:
df['date']
Out[113]:
0 2012-06-29 00:00:00
1 2012-07-31 00:00:00
2 2012-08-31 00:00:00
3 2012-09-28 00:00:00
4 2012-10-31 00:00:00
Name: date, dtype: object
The first column looks odd to you because from_csv treats the first column as the index (the default value for index_col is 0), whereas read_csv does not (its default value for index_col is None).
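To make the effect of index_col concrete, here is a minimal sketch that reads the same data twice with read_csv: once with the default and once with index_col=0, which mimics the layout from_csv produced. The variable names (csv_text, default_df, indexed_df) are only illustrative, and the sample is shortened to two rows from the question.

```python
import io
import pandas as pd

# Two rows taken from the question's file, kept short for illustration.
csv_text = """TICKER,date,ASKHI,PRC,BIDLO,PortfolioDate,PortfolioName
MSFT,2012-09-28 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-10-31 00:00:00,28.88,28.54,28.5,2010-12-31 00:00:00,SAP500"""

# Default (index_col=None): TICKER stays a regular column and the
# rows are labelled 0, 1, ...
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (TICKER) becomes the row index,
# which is the layout the question shows.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(indexed_df.index.name)
```

With index_col=0, TICKER no longer appears among the columns, which is why it is printed on its own line above the row labels in the question's output.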
Edit
To fix this without upgrading, just pass index_col=None to from_csv:
In [115]:
df = pd.DataFrame.from_csv(io.StringIO(temp), index_col=None)
df['date']
Out[115]:
0 2012-06-29 00:00:00
1 2012-07-31 00:00:00
2 2012-08-31 00:00:00
3 2012-09-28 00:00:00
4 2012-10-31 00:00:00
Name: date, dtype: object
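If a frame has already been loaded with TICKER as its index, another option is to move the index back into a regular column with reset_index(). The sketch below is an assumption-laden illustration rather than part of the original answer: it uses read_csv with index_col=0 as a stand-in for the from_csv result (as far as I know, from_csv was removed in later pandas releases), and the names csv_text, frame and flat are hypothetical.

```python
import io
import pandas as pd

# Shortened sample of the question's data.
csv_text = """TICKER,date,ASKHI,PRC,BIDLO,PortfolioDate,PortfolioName
MSFT,2012-09-28 00:00:00,NA,NA,NA,2010-12-31 00:00:00,SAP500
MSFT,2012-10-31 00:00:00,28.88,28.54,28.5,2010-12-31 00:00:00,SAP500"""

# Stand-in for the frame from the question: TICKER is the index.
frame = pd.read_csv(io.StringIO(csv_text), index_col=0)

# reset_index() turns the index back into an ordinary first column
# and replaces it with the default 0, 1, ... labels.
flat = frame.reset_index()

print(list(flat.columns))
```

This returns a new frame and leaves the original untouched, so it needs to be assigned to a variable as shown.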