如何使用与 excel 文件中的串联数据相关的值创建列
How to create a column with values related to Concatenated data from excel file
我创建了一个代码,将具有相同名称但在不同工作表中的列与一个 excel
文件连接起来
这是我的#updated 代码
import os
import pandas as pd
import numpy as np
mydir = (os.getcwd()).replace('\', '/') + '/'
cluster_cell = pd.read_excel(r'' + mydir + 'Cluster_Cell.xlsx', usecols=['CELLNAME'], encoding='latin-1')
GUL_EP_GSM = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="GSM_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
print(GUL_EP_GSM)
GUL_EP_UMTS = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="UMTS_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
print(GUL_EP_UMTS)
GUL_EP_LTE = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="LTE_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
print(GUL_EP_LTE)
appended_cell_name = pd.DataFrame({'CELLNAME': cluster_cell['CELLNAME'] + GUL_EP_GSM['CELLNAME'] + GUL_EP_UMTS['CELLNAME'] + GUL_EP_LTE['CELLNAME']})
mask1 = appended_cell_name['CELLNAME'].isin(GUL_EP_GSM['CELLNAME'])
mask2 = appended_cell_name['CELLNAME'].isin(GUL_EP_UMTS['CELLNAME'])
mask3 = appended_cell_name['CELLNAME'].isin(GUL_EP_LTE['CELLNAME'])
cond=[mask1,mask2,mask3]
values=['GSM','UMTS','LTE']
appended_cell_name['system']=np.select(cond,values)
GUL_EP_GSM['CELLNAME']
GUL_EP_UMTS['CELLNAME']
GUL_EP_LTE['CELLNAME']
appended_cell_name['system'].unique()
print(appended_cell_name)
appended_cell_name.to_excel("Cluser_Cell_Final.xlsx", index=False)
所以现在我需要为这个 Concatenated 创建一个名为“Systemcontains
valuesrelated to
pandas` Objects I have created
的列
让它深入....
例如:-
对于 GUL_EP_GSM
对象,我想将值添加到新创建的 column
中,值为 GSM
,对于 GUL_EP_UMTS
值,我想将值添加到新创建的 column
中UMTS
,对于 GUL_EP_LTE
值,我想添加到新创建的 column
LTE
喜欢下面的图片
这是以前的,只是我的代码
before
这是我想做的事
after
知道怎么做吗?
而且我需要以最简单的方式简化我的代码....
错误
Traceback (most recent call last):
File "C:/Users/mwx825326/PycharmProjects/MyExcelCombine/MyCopyAndPaste.py", line 11, in <module>
appended_cell_name=pd.DataFrame({'CELLNAME':cluster_cell + GUL_EP_GSM + GUL_EP_UMTS + GUL_EP_LTE})
File "C:\Users\mwx825326\PycharmProjects\MyExcelCombine\venv\lib\site-packages\pandas\core\frame.py", line 411, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\mwx825326\PycharmProjects\MyExcelCombine\venv\lib\site-packages\pandas\core\internals\construction.py", line 257, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\mwx825326\PycharmProjects\MyExcelCombine\venv\lib\site-packages\pandas\core\internals\construction.py", line 77, in arrays_to_mgr
index = extract_index(arrays)
File "C:\Users\mwx825326\PycharmProjects\MyExcelCombine\venv\lib\site-packages\pandas\core\internals\construction.py", line 358, in extract_index
raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index
打印输出
appended_cell_name
CELLNAME system
0 3129648.0 0
1 3129652.0 0
2 3129656.0 0
3 3141617.0 0
4 3141621.0 0
... ... ...
2347 NaN 0
2348 NaN 0
2349 NaN 0
2350 NaN 0
2351 NaN 0
GUL_EP_GSM
CELLNAME
0 10169
1 10170
2 10171
3 10349
4 10350
.. ...
124 54237
125 54996
126 54997
127 54998
128 55966
GUL_EP_UMTS
CELLNAME
0 1106382
1 1106383
2 1106384
3 1019810
4 1019811
.. ...
262 1102891
263 1102892
264 1018393
265 1018394
266 1018395
GUL_EP_LTE
CELLNAME
0 2002388
1 2002389
2 2002390
3 2100118
4 2100119
.. ...
538 2304376
539 2304377
540 2205890
541 2205891
542 2205892
mask1
输出
0 False
1 False
2 False
3 False
4 False
...
2347 False
2348 False
2349 False
2350 False
2351 False
Name: CELLNAME, Length: 2352, dtype: bool
mask2
输出
0 False
1 False
2 False
3 False
4 False
...
2347 False
2348 False
2349 False
2350 False
2351 False
Name: CELLNAME, Length: 2352, dtype: bool
mask3
输出
0 False
1 False
2 False
3 False
4 False
...
2347 False
2348 False
2349 False
2350 False
2351 False
Name: CELLNAME, Length: 2352, dtype: bool
的输出
print(GUL_EP_LTE.info())
print(GUL_EP_UMTS.info())
print(GUL_EP_LTE.info())
是
[2352 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 543 entries, 0 to 542
Data columns (total 1 columns):
CELLNAME 543 non-null int64
dtypes: int64(1)
memory usage: 4.3 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267 entries, 0 to 266
Data columns (total 1 columns):
CELLNAME 267 non-null int64
dtypes: int64(1)
memory usage: 2.1 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 543 entries, 0 to 542
Data columns (total 1 columns):
CELLNAME 543 non-null int64
dtypes: int64(1)
memory usage: 4.3 KB
None
print(appended_cell_name. info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2352 entries, 0 to 2351
Data columns (total 2 columns):
CELLNAME 129 non-null float64
system 2352 non-null object
dtypes: float64(1), object(1)
memory usage: 45.9+ KB
None
不太懂,不过可以在拼接前加这样一列:
GUL_EP_GSM['system']='GSM'
GUL_EP_UMTS['system']='UMTS'
GUL_EP_LTE['system']='LTE'
然后连接
你需要:
GUL_EP_GSM=GUL_EP_GSM.astpye(float)
GUL_EP_UMTS=GUL_EP_UMTS.astpye(float)
GUL_EP_LTE=GUL_EP_LTE.astpye(float)
#appended_cell_name['CELLNAME']=appended_cell_name['CELLNAME'].astype(int) #or convert appended_cell_name to int
因为appended_cell_name['CELLNAME']的类型是float
那么您的代码将是:
import os
import pandas as pd
import numpy as np
mydir = (os.getcwd()).replace('\', '/') + '/'
cluster_cell = pd.read_excel(r'' + mydir + 'Cluster_Cell.xlsx', usecols=['CELLNAME'], encoding='latin-1')
GUL_EP_GSM = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="GSM_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
GUL_EP_UMTS = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="UMTS_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
GUL_EP_LTE = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="LTE_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
#converting to float
GUL_EP_GSM=GUL_EP_GSM.astpye(float)
GUL_EP_UMTS=GUL_EP_UMTS.astpye(float)
GUL_EP_LTE=GUL_EP_LTE.astpye(float)
#print dataframes astype float
print(GUL_EP_GSM)
print(GUL_EP_UMTS)
print(GUL_EP_LTE)
#creating appended_cell_name dataframe
appended_cell_name = pd.DataFrame({'CELLNAME': cluster_cell['CELLNAME'] + GUL_EP_GSM['CELLNAME'] + GUL_EP_UMTS['CELLNAME'] + GUL_EP_LTE['CELLNAME']})
#creating mask with isin
mask1 = appended_cell_name['CELLNAME'].isin(GUL_EP_GSM['CELLNAME'])
mask2 = appended_cell_name['CELLNAME'].isin(GUL_EP_UMTS['CELLNAME'])
mask3 = appended_cell_name['CELLNAME'].isin(GUL_EP_LTE['CELLNAME'])
#using np.select
cond=[mask1,mask2,mask3]
values=['GSM','UMTS','LTE']
appended_cell_name['system']=np.select(cond,values)
#print appended_cell_name
print(appended_cell_name)
#save in excell
appended_cell_name.to_excel("Cluser_Cell_Final.xlsx", index=False)
你有三个数据框:GUL_EP_GSM、GUL_EP_UMTS、GUL_EP_LTE。然后你需要检查 Series.isin
如果appended_cell_name数据框的CELLNAME列中的值在每个那些数据框的列中并相应地填写系统列。
您唯一需要更改我的解决方案的是每个示例的列标签:
GUL_EP_GSM['name_of_colum_in_your data']
而不是
GUL_EP_GSM ['GUL_EP_GSM']
使用np.select + Series.isin:
正在准备数据
GUL_EP_GSM=pd.DataFrame({'CELLNAME':[53119,53120,53121,54438,54439,54440,55976,55977,55978]})
GUL_EP_UMTS=pd.DataFrame({'CELLNAME':[41773,41774,41775,44620,44621,44622]})
GUL_EP_LTE=pd.DataFrame({'GELLNAME':[2303909,2303923,2303924]})
appended_cell_name=pd.DataFrame({'CELLNAME':[53119,53120,53121,54438,54439,54440,55976,55977,55978,41773,41774,41775,44620,44621,44622,2303909,2303923,2303924]})
正在创建遮罩
mask1=appended_cell_name['CELLNAME'].isin(GUL_EP_GSM['CELLNAME'])
mask2=appended_cell_name['CELLNAME'].isin(GUL_EP_UMTS['CELLNAME'])
mask3=appended_cell_name['CELLNAME'].isin(GUL_EP_LTE['CELLNAME'])
Solve with np.select
然后你申请np.select:
cond=[mask1,mask2,mask3]
values=['GSM','UMTS','LTE']
appended_cell_name['system']=np.select(cond,values)
print(appended_cell_name)
Also you can use Series.mask(This is other solution):
appended_cell_name['system']= ( appended_cell_name['CELLNAME'].mask(mask1,'GSM')
.mask(mask2,'UMTS')
.mask(mask3,'LTE') )
print(appended_cell_name)
输出:
CELLNAME system
0 53119 GSM
1 53120 GSM
2 53121 GSM
3 54438 GSM
4 54439 GSM
5 54440 GSM
6 55976 GSM
7 55977 GSM
8 55978 GSM
9 41773 UMTS
10 41774 UMTS
11 41775 UMTS
12 44620 UMTS
13 44621 UMTS
14 44622 UMTS
15 2303909 LTE
16 2303923 LTE
17 2303924 LTE
使用的数据帧:
print(GUL_EP_GSM)
GUL_EP_GSM
0 53119
1 53120
2 53121
3 54438
4 54439
5 54440
6 55976
7 55977
8 55978
print(GUL_EP_UMTS)
CELLNAME
0 41773
1 41774
2 41775
3 44620
4 44621
5 44622
print(GUL_EP_LTE)
CELLNAME
0 2303909
1 2303923
2 2303924
我创建了一个代码,将具有相同名称但在不同工作表中的列与一个 excel
文件连接起来
这是我的#updated 代码
import os
import pandas as pd
import numpy as np
mydir = (os.getcwd()).replace('\', '/') + '/'
cluster_cell = pd.read_excel(r'' + mydir + 'Cluster_Cell.xlsx', usecols=['CELLNAME'], encoding='latin-1')
GUL_EP_GSM = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="GSM_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
print(GUL_EP_GSM)
GUL_EP_UMTS = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="UMTS_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
print(GUL_EP_UMTS)
GUL_EP_LTE = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="LTE_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
print(GUL_EP_LTE)
appended_cell_name = pd.DataFrame({'CELLNAME': cluster_cell['CELLNAME'] + GUL_EP_GSM['CELLNAME'] + GUL_EP_UMTS['CELLNAME'] + GUL_EP_LTE['CELLNAME']})
mask1 = appended_cell_name['CELLNAME'].isin(GUL_EP_GSM['CELLNAME'])
mask2 = appended_cell_name['CELLNAME'].isin(GUL_EP_UMTS['CELLNAME'])
mask3 = appended_cell_name['CELLNAME'].isin(GUL_EP_LTE['CELLNAME'])
cond=[mask1,mask2,mask3]
values=['GSM','UMTS','LTE']
appended_cell_name['system']=np.select(cond,values)
GUL_EP_GSM['CELLNAME']
GUL_EP_UMTS['CELLNAME']
GUL_EP_LTE['CELLNAME']
appended_cell_name['system'].unique()
print(appended_cell_name)
appended_cell_name.to_excel("Cluser_Cell_Final.xlsx", index=False)
所以现在我需要为这个 Concatenated 创建一个名为“Systemcontains
valuesrelated to
pandas` Objects I have created
让它深入....
例如:-
对于 GUL_EP_GSM
对象,我想将值添加到新创建的 column
中,值为 GSM
,对于 GUL_EP_UMTS
值,我想将值添加到新创建的 column
中UMTS
,对于 GUL_EP_LTE
值,我想添加到新创建的 column
LTE
喜欢下面的图片
这是以前的,只是我的代码
before
这是我想做的事
after
知道怎么做吗?
而且我需要以最简单的方式简化我的代码....
错误
Traceback (most recent call last):
File "C:/Users/mwx825326/PycharmProjects/MyExcelCombine/MyCopyAndPaste.py", line 11, in <module>
appended_cell_name=pd.DataFrame({'CELLNAME':cluster_cell + GUL_EP_GSM + GUL_EP_UMTS + GUL_EP_LTE})
File "C:\Users\mwx825326\PycharmProjects\MyExcelCombine\venv\lib\site-packages\pandas\core\frame.py", line 411, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\mwx825326\PycharmProjects\MyExcelCombine\venv\lib\site-packages\pandas\core\internals\construction.py", line 257, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\mwx825326\PycharmProjects\MyExcelCombine\venv\lib\site-packages\pandas\core\internals\construction.py", line 77, in arrays_to_mgr
index = extract_index(arrays)
File "C:\Users\mwx825326\PycharmProjects\MyExcelCombine\venv\lib\site-packages\pandas\core\internals\construction.py", line 358, in extract_index
raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index
打印输出
appended_cell_name
CELLNAME system
0 3129648.0 0
1 3129652.0 0
2 3129656.0 0
3 3141617.0 0
4 3141621.0 0
... ... ...
2347 NaN 0
2348 NaN 0
2349 NaN 0
2350 NaN 0
2351 NaN 0
GUL_EP_GSM
CELLNAME
0 10169
1 10170
2 10171
3 10349
4 10350
.. ...
124 54237
125 54996
126 54997
127 54998
128 55966
GUL_EP_UMTS
CELLNAME
0 1106382
1 1106383
2 1106384
3 1019810
4 1019811
.. ...
262 1102891
263 1102892
264 1018393
265 1018394
266 1018395
GUL_EP_LTE
CELLNAME
0 2002388
1 2002389
2 2002390
3 2100118
4 2100119
.. ...
538 2304376
539 2304377
540 2205890
541 2205891
542 2205892
mask1
输出
0 False
1 False
2 False
3 False
4 False
...
2347 False
2348 False
2349 False
2350 False
2351 False
Name: CELLNAME, Length: 2352, dtype: bool
mask2
输出
0 False
1 False
2 False
3 False
4 False
...
2347 False
2348 False
2349 False
2350 False
2351 False
Name: CELLNAME, Length: 2352, dtype: bool
mask3
输出
0 False
1 False
2 False
3 False
4 False
...
2347 False
2348 False
2349 False
2350 False
2351 False
Name: CELLNAME, Length: 2352, dtype: bool
的输出
print(GUL_EP_LTE.info())
print(GUL_EP_UMTS.info())
print(GUL_EP_LTE.info())
是
[2352 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 543 entries, 0 to 542
Data columns (total 1 columns):
CELLNAME 543 non-null int64
dtypes: int64(1)
memory usage: 4.3 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267 entries, 0 to 266
Data columns (total 1 columns):
CELLNAME 267 non-null int64
dtypes: int64(1)
memory usage: 2.1 KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 543 entries, 0 to 542
Data columns (total 1 columns):
CELLNAME 543 non-null int64
dtypes: int64(1)
memory usage: 4.3 KB
None
print(appended_cell_name. info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2352 entries, 0 to 2351
Data columns (total 2 columns):
CELLNAME 129 non-null float64
system 2352 non-null object
dtypes: float64(1), object(1)
memory usage: 45.9+ KB
None
不太懂,不过可以在拼接前加这样一列:
GUL_EP_GSM['system']='GSM'
GUL_EP_UMTS['system']='UMTS'
GUL_EP_LTE['system']='LTE'
然后连接
你需要:
GUL_EP_GSM=GUL_EP_GSM.astpye(float)
GUL_EP_UMTS=GUL_EP_UMTS.astpye(float)
GUL_EP_LTE=GUL_EP_LTE.astpye(float)
#appended_cell_name['CELLNAME']=appended_cell_name['CELLNAME'].astype(int) #or convert appended_cell_name to int
因为appended_cell_name['CELLNAME']的类型是float
那么您的代码将是:
import os
import pandas as pd
import numpy as np
mydir = (os.getcwd()).replace('\', '/') + '/'
cluster_cell = pd.read_excel(r'' + mydir + 'Cluster_Cell.xlsx', usecols=['CELLNAME'], encoding='latin-1')
GUL_EP_GSM = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="GSM_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
GUL_EP_UMTS = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="UMTS_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
GUL_EP_LTE = pd.read_excel(r'' + mydir + 'GU EP20191016171519.xlsx', sheet_name="LTE_CDDData", usecols=['CELLNAME'],
encoding='latin-1')
#converting to float
GUL_EP_GSM=GUL_EP_GSM.astpye(float)
GUL_EP_UMTS=GUL_EP_UMTS.astpye(float)
GUL_EP_LTE=GUL_EP_LTE.astpye(float)
#print dataframes astype float
print(GUL_EP_GSM)
print(GUL_EP_UMTS)
print(GUL_EP_LTE)
#creating appended_cell_name dataframe
appended_cell_name = pd.DataFrame({'CELLNAME': cluster_cell['CELLNAME'] + GUL_EP_GSM['CELLNAME'] + GUL_EP_UMTS['CELLNAME'] + GUL_EP_LTE['CELLNAME']})
#creating mask with isin
mask1 = appended_cell_name['CELLNAME'].isin(GUL_EP_GSM['CELLNAME'])
mask2 = appended_cell_name['CELLNAME'].isin(GUL_EP_UMTS['CELLNAME'])
mask3 = appended_cell_name['CELLNAME'].isin(GUL_EP_LTE['CELLNAME'])
#using np.select
cond=[mask1,mask2,mask3]
values=['GSM','UMTS','LTE']
appended_cell_name['system']=np.select(cond,values)
#print appended_cell_name
print(appended_cell_name)
#save in excell
appended_cell_name.to_excel("Cluser_Cell_Final.xlsx", index=False)
你有三个数据框:GUL_EP_GSM、GUL_EP_UMTS、GUL_EP_LTE。然后你需要检查 Series.isin 如果appended_cell_name数据框的CELLNAME列中的值在每个那些数据框的列中并相应地填写系统列。
您唯一需要更改我的解决方案的是每个示例的列标签:
GUL_EP_GSM['name_of_colum_in_your data']
而不是
GUL_EP_GSM ['GUL_EP_GSM']
使用np.select + Series.isin:
正在准备数据
GUL_EP_GSM=pd.DataFrame({'CELLNAME':[53119,53120,53121,54438,54439,54440,55976,55977,55978]})
GUL_EP_UMTS=pd.DataFrame({'CELLNAME':[41773,41774,41775,44620,44621,44622]})
GUL_EP_LTE=pd.DataFrame({'GELLNAME':[2303909,2303923,2303924]})
appended_cell_name=pd.DataFrame({'CELLNAME':[53119,53120,53121,54438,54439,54440,55976,55977,55978,41773,41774,41775,44620,44621,44622,2303909,2303923,2303924]})
正在创建遮罩
mask1=appended_cell_name['CELLNAME'].isin(GUL_EP_GSM['CELLNAME'])
mask2=appended_cell_name['CELLNAME'].isin(GUL_EP_UMTS['CELLNAME'])
mask3=appended_cell_name['CELLNAME'].isin(GUL_EP_LTE['CELLNAME'])
Solve with np.select
然后你申请np.select:
cond=[mask1,mask2,mask3]
values=['GSM','UMTS','LTE']
appended_cell_name['system']=np.select(cond,values)
print(appended_cell_name)
Also you can use Series.mask(This is other solution):
appended_cell_name['system']= ( appended_cell_name['CELLNAME'].mask(mask1,'GSM')
.mask(mask2,'UMTS')
.mask(mask3,'LTE') )
print(appended_cell_name)
输出:
CELLNAME system
0 53119 GSM
1 53120 GSM
2 53121 GSM
3 54438 GSM
4 54439 GSM
5 54440 GSM
6 55976 GSM
7 55977 GSM
8 55978 GSM
9 41773 UMTS
10 41774 UMTS
11 41775 UMTS
12 44620 UMTS
13 44621 UMTS
14 44622 UMTS
15 2303909 LTE
16 2303923 LTE
17 2303924 LTE
使用的数据帧:
print(GUL_EP_GSM)
GUL_EP_GSM
0 53119
1 53120
2 53121
3 54438
4 54439
5 54440
6 55976
7 55977
8 55978
print(GUL_EP_UMTS)
CELLNAME
0 41773
1 41774
2 41775
3 44620
4 44621
5 44622
print(GUL_EP_LTE)
CELLNAME
0 2303909
1 2303923
2 2303924