Split a column into 3 columns in pandas
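I have a column called Names that looks like the sample below. I need to compare it with columns in another pandas DataFrame that has last name and first name but, unlike this one, no middle initial. I am trying to split the initial out into a new column using a space as the delimiter, but I may need to do this for the entire string. I tried this: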
transpose_enron['lastname'], transpose_enron['firstname'], transpose_enron['middle initial'] = zip(*transpose_enron['Names'].apply(lambda x: x.split(' ', 1)))
It gives me this error:
"ValueError: need more than 1 value to unpack"
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
8 BELFER ROBERT
Any ideas on how to do this?
You can use the DataFrame constructor, and drop if you need to delete the original column:
print(df)
Names
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
3 BELFER ROBERT
df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
# if you want to delete the original column
df = df.drop('Names', axis=1)
print(df)
lastname firstname middle initial
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
3 BELFER ROBERT None
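The constructor approach above can be sketched for Python 3 as follows. This is a minimal sketch, not the answer's exact code: the name `parts`, the explicit `index=df.index` argument, and the use of `join` are assumptions added here so that the split columns line up with the original rows even when the index is not 0..n-1 (as in the question, where the last row has index 8).

```python
import pandas as pd

# Sample data modelled on the question; note the non-consecutive index.
df = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]},
    index=[0, 1, 2, 8],
)

# Split each name on whitespace; shorter rows are padded with a missing value.
# Passing index=df.index keeps the new rows aligned with the original ones.
parts = pd.DataFrame([x.split() for x in df["Names"].tolist()], index=df.index)
parts.columns = ["lastname", "firstname", "middle initial"]

# Drop the original column and attach the three new ones.
result = df.drop(columns="Names").join(parts)
print(result)
```

Rows with only two tokens get a missing value in the last column, which matches the `None` shown in the output above.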
Timings, with len(df) = 10000 * 4:
df = pd.concat([df]*10000).reset_index(drop=True)
print(df.head())
def jez(df):
df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
return df
def edc(df):
df[['lastname', 'firstname', 'middle initial']] = df['Names'].str.split(expand=True)
return df
print(jez(df).head())
print(edc(df).head())
For a larger DataFrame, my solution is faster than Edchum's solution:
In [51]: %timeit jez(df)
10 loops, best of 3: 30.1 ms per loop
In [52]: %timeit edc(df)
10 loops, best of 3: 78 ms per loop
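Outside IPython, the same comparison can be reproduced with the standard `timeit` module. This is only a sketch: the names `base`, `big`, `COLS`, `t_jez` and `t_edc` are made up here, each function works on a copy so the runs do not affect each other, and the numbers you get will depend on your pandas version and machine, so they may not match the timings quoted above.

```python
import timeit

import pandas as pd

COLS = ["lastname", "firstname", "middle initial"]

base = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]}
)
# 4 rows repeated 10000 times, with a fresh 0..n-1 index.
big = pd.concat([base] * 10000).reset_index(drop=True)


def jez(frame):
    # List comprehension + DataFrame constructor.
    out = frame.copy()
    out[COLS] = pd.DataFrame([x.split() for x in out["Names"].tolist()], index=out.index)
    return out


def edc(frame):
    # Vectorized str.split with expand=True.
    out = frame.copy()
    out[COLS] = out["Names"].str.split(expand=True)
    return out


# Average seconds per call over a few runs.
t_jez = timeit.timeit(lambda: jez(big), number=3) / 3
t_edc = timeit.timeit(lambda: edc(big), number=3) / 3
print(f"jez: {t_jez * 1000:.1f} ms per call")
print(f"edc: {t_edc * 1000:.1f} ms per call")
```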
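Edit, because of the error reported in the comments:
The problem is in the data: it contains 3 separators instead of 2, so you need to split into four columns and then drop the temporary column tmp:
print(df)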
Names
0 ALLEN PHILLIP K
1 BADUM JAMES P tttt
2 BANNANTINE JAMES M
df[['lastname', 'firstname', 'middle initial', 'tmp']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
print(df)
Names lastname firstname middle initial tmp
0 ALLEN PHILLIP K ALLEN PHILLIP K None
1 BADUM JAMES P tttt BADUM JAMES P tttt
2 BANNANTINE JAMES M BANNANTINE JAMES M None
# if you want to delete the original column and the temporary one
df = df.drop(['Names', 'tmp'], axis=1)
print(df)
lastname firstname middle initial
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
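If you do not know in advance how many extra tokens a row may contain, one alternative to naming a tmp column is to split fully and keep only the first three columns. This is a sketch of that variation rather than the method shown above; `parts` and `result` are names chosen here, and `reindex(columns=...)` is used so that exactly three columns come out even if no row has three tokens.

```python
import pandas as pd

df = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P tttt", "BANNANTINE JAMES M"]}
)

# expand=True yields integer-labelled columns 0, 1, 2, ...;
# reindex keeps columns 0..2 and discards any extras.
parts = df["Names"].str.split(expand=True).reindex(columns=list(range(3)))
parts.columns = ["lastname", "firstname", "middle initial"]

result = df.drop(columns="Names").join(parts)
print(result)
```

Note that this silently discards anything after the third token, so it is worth checking those rows first if the extra text might be meaningful.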
Use the vectorized str.split with expand=True, which unpacks the lists into new columns:
In [17]:
df[['lastname', 'firstname', 'middle initial']] = df['name'].str.split(expand=True)
df
Out[17]:
name lastname firstname middle initial
index
0 ALLEN PHILLIP K ALLEN PHILLIP K
1 BADUM JAMES P BADUM JAMES P
2 BANNANTINE JAMES M BANNANTINE JAMES M
8 BELFER ROBERT BELFER ROBERT None
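A self-contained version of the `str.split(expand=True)` approach, written as a sketch with sample data modelled on the question. The column is called `Names` here, as in the question, and the index is deliberately non-consecutive to show that the result of `str.split` keeps the original index, so the assignment lines up row by row.

```python
import pandas as pd

df = pd.DataFrame(
    {"Names": ["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M", "BELFER ROBERT"]},
    index=[0, 1, 2, 8],
)

# With no separator given, str.split splits on runs of whitespace.
# expand=True returns a DataFrame with one column per token.
df[["lastname", "firstname", "middle initial"]] = df["Names"].str.split(expand=True)
print(df)
```

This assignment expects the split to produce exactly three columns; if any row has more tokens, pandas raises an error because the number of new column names no longer matches.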