Pandas creates new columns with append
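I am trying to compile multiple text files into one DataFrame. However, when I concatenate the DataFrames with the pandas concat function, the resulting DataFrame gains new columns. In the code sample below, df3 has 12 columns instead of 8. Why?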
**Input:**
import pandas as pd
df1 = pd.read_csv('2011-12-01-data.txt',sep = None, engine = 'python')
df2 = pd.read_csv('2011-12-02-data.txt',sep = None, engine = 'python')
df3= pd.concat([df1, df2])
print(df1.shape)
print(df2.shape)
print(df3.shape)
**Output:**
df1 shape = (26986, 8)
df2 shape = (27266, 8)
df3 shape = (54252, 12)
The flight data is available online.
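I think you need the header=None parameter, which gives the default column names 0-7, because the files have no headers. Also, if the separator is a tab, you can specify it.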
df1 = pd.read_csv('2011-12-01-data.txt',sep = '\t', engine = 'python', header=None)
df2 = pd.read_csv('2011-12-02-data.txt',sep = '\t', engine = 'python', header=None)
df3= pd.concat([df1, df2])
print(df1.shape)
print(df2.shape)
print(df3.shape)
(26987, 8)
(27267, 8)
(54254, 8)
print(df1.columns)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
print(df2.columns)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
print(df3.columns)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
Another solution is to pass the names parameter to set new column names:
names= ['col1','col2','col3','col4','col5','col6','col7','col8']
df1 = pd.read_csv('2011-12-01-data.txt',sep = '\t', engine = 'python', names=names)
df2 = pd.read_csv('2011-12-02-data.txt',sep = '\t', engine = 'python', names=names)
df3= pd.concat([df1, df2])
print(df1.shape)
print(df2.shape)
print(df3.shape)
(26987, 8)
(27267, 8)
(54254, 8)
print(df1.columns)
print(df2.columns)
print(df3.columns)
Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object')
Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object')
Index(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8'], dtype='object')
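Since the goal in the question is to combine many files, the two-file pattern above can be generalized. The following is only a sketch: the helper name read_headerless and the in-memory sample data are made up for illustration, and it assumes every file is tab-separated with no header row.

```python
import io
import pandas as pd

def read_headerless(sources, sep="\t"):
    """Read header-less delimited files and stack them into one DataFrame."""
    # header=None keeps the first line as data and assigns integer column names
    frames = [pd.read_csv(src, sep=sep, header=None) for src in sources]
    # ignore_index=True renumbers the rows instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)

# Stand-ins for two daily files (made-up data, tab-separated, no header row)
day1 = io.StringIO("aa\tAA-1\tF78\nbb\tAA-2\tC20\n")
day2 = io.StringIO("aa\tAA-1\tD5\ncc\tAA-3\tA3\n")

combined = read_headerless([day1, day2])
print(combined.shape)  # (4, 3)
```

With real files you could pass a list of paths instead of the StringIO objects, for example one built with glob.glob and a pattern that matches your file names.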
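You get 12 columns because some values in the first row of the two DataFrames are the same, and the column names are created from those first rows. After concat, only the columns whose names match are aligned. Where the names differ, there is no alignment and you get NaNs.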
print(df1.columns)
Index(['aa', 'AA-1007-TPA-MIA', '12/01/2011 01:55 PM', '12/01/2011 02:07 PM',
'F78', '12/01/2011 03:00 PM', '12/01/2011 02:57 PM', 'D5'],
dtype='object')
print(df2.columns)
Index(['aa', 'AA-1007-TPA-MIA', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM',
'F78', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'D5'],
dtype='object')
print(df3.columns)
Index(['12/01/2011 01:55 PM', '12/01/2011 02:07 PM', '12/01/2011 02:57 PM',
'12/01/2011 03:00 PM', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM',
'12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'AA-1007-TPA-MIA', 'D5',
'F78', 'aa'],
dtype='object')
print(df3.head())
   12/01/2011 01:55 PM       12/01/2011 02:07 PM       12/01/2011 02:57 PM  \
0                  NaN      12/1/2011 2:07PM EST      12/1/2011 2:51PM EST
1                  NaN  12/1/11 2:06 PM (-05:00)  12/1/11 2:51 PM (-05:00)
2                  NaN  12/1/11 2:06 PM (-05:00)  12/1/11 2:51 PM (-05:00)
3                  NaN  12/1/11 2:06 PM (-05:00)  12/1/11 2:51 PM (-05:00)
4                  NaN  12/1/11 2:06 PM (-05:00)  12/1/11 2:51 PM (-05:00)
   12/01/2011 03:00 PM  12/02/2011 01:55 PM  12/02/2011 02:13 PM  \
0                  NaN                  NaN                  NaN
1                  NaN                  NaN                  NaN
2                  NaN                  NaN                  NaN
3                  NaN                  NaN                  NaN
4                  NaN                  NaN                  NaN
   12/02/2011 03:00 PM  12/02/2011 03:05 PM  AA-1007-TPA-MIA   D5  F78  \
0                  NaN                  NaN  AA-1007-TPA-MIA  NaN  NaN
1                  NaN                  NaN  AA-1007-TPA-MIA  NaN  NaN
2                  NaN                  NaN  AA-1007-TPA-MIA  NaN  NaN
3                  NaN                  NaN  AA-1007-TPA-MIA  NaN  NaN
4                  NaN                  NaN  AA-1007-TPA-MIA  NaN  NaN
                aa
0   flightexplorer
1  airtravelcenter
2       myrateplan
3      helloflight
4        flytecomm
print(df3.tail())
       12/01/2011 01:55 PM  12/01/2011 02:07 PM  12/01/2011 02:57 PM  \
27261                  NaN                  NaN                  NaN
27262                  NaN                  NaN                  NaN
27263                  NaN                  NaN                  NaN
27264                  NaN                  NaN                  NaN
27265                  NaN                  NaN                  NaN
       12/01/2011 03:00 PM     12/02/2011 01:55 PM     12/02/2011 02:13 PM  \
27261                  NaN        Dec 02 - 10:20pm        Dec 02 - 10:23pm
27262                  NaN             10:20pDec 2             10:23pDec 2
27263                  NaN     2011-12-02 10:20 PM                     NaN
27264                  NaN     2011-12-02 10:20 pm                     NaN
27265                  NaN  2011-12-02 10:20PM CST  2011-12-02 10:31PM CST
          12/02/2011 03:00 PM     12/02/2011 03:05 PM  AA-1007-TPA-MIA   D5  \
27261        Dec 02 - 11:59pm       Dec 02 - 11:51pm*  AA-2059-DFW-SLC   A3
27262             11:43pDec 2                     NaN  AA-2059-DFW-SLC   A3
27263     2011-12-02 11:59 PM                     NaN  AA-2059-DFW-SLC  NaN
27264                     NaN                     NaN  AA-2059-DFW-SLC  NaN
27265  2011-12-02 11:35PM MST  2011-12-02 11:43PM MST  AA-2059-DFW-SLC   A3
         F78           aa
27261  C20/C  travelocity
27262    C20       orbitz
27263    NaN      weather
27264    C20          dfw
27265    C20   flightwise
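jezrael's answer solves the problem. However, let me try to explain why pandas added new columns to your concatenated DataFrame and what went wrong.
pandas misreads the header
When you do not set header=None, pandas reads the first row of the file as the header and uses its values as the column names. With your code, which omits header=None, these are the two sets of column names the DataFrames end up with.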
df1:
['aa',
'AA-1007-TPA-MIA',
'12/01/2011 01:55 PM',
'12/01/2011 02:07 PM',
'F78',
'12/01/2011 03:00 PM',
'12/01/2011 02:57 PM',
'D5']
df2:
['aa',
'AA-1007-TPA-MIA',
'12/02/2011 01:55 PM',
'12/02/2011 02:13 PM',
'F78',
'12/02/2011 03:00 PM',
'12/02/2011 03:05 PM',
'D5']
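To see the header behavior in isolation, here is a minimal sketch that reads the same text twice. The two data rows are made up, and io.StringIO stands in for a real file; only the header argument differs between the calls.

```python
import io
import pandas as pd

# Made-up tab-separated content with no header row
raw = "aa\tAA-1007-TPA-MIA\tF78\nbb\tAA-2059-DFW-SLC\tC20\n"

# Default (header='infer'): the first line is consumed as column names
inferred = pd.read_csv(io.StringIO(raw), sep="\t")
# header=None: the first line stays in the data, columns are numbered
explicit = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

print(list(inferred.columns), inferred.shape)  # ['aa', 'AA-1007-TPA-MIA', 'F78'] (1, 3)
print(list(explicit.columns), explicit.shape)  # [0, 1, 2] (2, 3)
```

This is also why the shapes in the thread differ by one row per file: 26986 rows without header=None versus 26987 with it.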
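Non-shared columns are appended
Finally, when you concatenate the two DataFrames, every column that df1 and df2 do not have in common is appended as a separate column. 'aa', 'AA-1007-TPA-MIA', 'F78' and 'D5' appear in both df1 and df2, so they are aligned, while everything else is added to the column list.
This results in 4 (shared by df1 and df2) + 4 (df1 only) + 4 (df2 only) = 12 columns.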
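The arithmetic can be checked with a small sketch that reuses the two sets of column names printed earlier in the thread. The single placeholder row in each DataFrame is made up; only the labels matter here.

```python
import pandas as pd

cols1 = ['aa', 'AA-1007-TPA-MIA', '12/01/2011 01:55 PM', '12/01/2011 02:07 PM',
         'F78', '12/01/2011 03:00 PM', '12/01/2011 02:57 PM', 'D5']
cols2 = ['aa', 'AA-1007-TPA-MIA', '12/02/2011 01:55 PM', '12/02/2011 02:13 PM',
         'F78', '12/02/2011 03:00 PM', '12/02/2011 03:05 PM', 'D5']

# One placeholder row each (made-up values)
df1 = pd.DataFrame([["x"] * 8], columns=cols1)
df2 = pd.DataFrame([["y"] * 8], columns=cols2)

# concat defaults to an outer join on column labels, so it keeps their union
df3 = pd.concat([df1, df2])

shared = set(cols1) & set(cols2)
print(len(shared), len(set(cols1) - shared), len(set(cols2) - shared))  # 4 4 4
print(df3.shape)                    # (2, 12)
print(int(df3.isna().sum().sum()))  # 8
```

Each row is NaN in the four columns that exist only in the other DataFrame, which matches the NaN blocks in the head() and tail() output above.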