PySpark: automatically rename duplicate columns
I want to automatically rename the duplicate columns of a df. For example:
df
Out[4]: DataFrame[norep1: string, num1: string, num1: bigint, norep2: bigint, num1: bigint, norep3: bigint]
Applying some function should give me a df like this:
f_rename_repcol(df)
Out[4]: DataFrame[norep1: string, num1_1: string, num1_2: bigint, norep2: bigint, num1_3: bigint, norep3: bigint]
I have written my own function and it works, but I'm sure there is a shorter and better way:
def f_df_col_renombra_rep(df):
    from collections import Counter
    import numpy as np  # was missing in the original snippet
    import pandas as pd

    columnas_original = np.array(df.columns)
    d1 = Counter(df.columns)
    # Boolean mask over the distinct names: True where a name occurs more than once
    i_corrige = [a > 1 for a in d1.values()]
    var_corrige = np.array(list(d1.keys()))[i_corrige]
    var_corrige_2 = [a for a in columnas_original if a in var_corrige]
    # For each repeated name, build its numbered replacements: name_1, name_2, ...
    columnas_nuevas = {}
    for var in var_corrige:
        # exact match; a substring test would also catch names such as "num" inside "num1"
        aux_corr = [a for a in var_corrige_2 if a == var]
        i = 0
        columnas_nuevas_aux = []
        for valor in aux_corr:
            i += 1
            nombre_nuevo = valor + "_" + str(i)
            columnas_nuevas_aux.append(nombre_nuevo)
        columnas_nuevas[var] = iter(columnas_nuevas_aux)
    indice_cambio = pd.Series(columnas_original).isin(var_corrige)
    i = 0
    colsalida = [None] * len(df.columns)
    for col in df.columns:
        if indice_cambio[i]:
            # take the next numbered name that belongs to this column's own name
            colsalida[i] = next(columnas_nuevas[col])
        else:
            colsalida[i] = col  # name unchanged
        i += 1
    df_out = df.toDF(*colsalida)
    return df_out
You can adapt the renaming logic here to suit your needs, but in general I find this the best way to rename all the duplicate columns:
old_col=df.schema.names
running_list=[]
new_col=[]
i=0
for column in old_col:
    if(column in running_list):
        new_col.append(column+"_"+str(i))
        i=i+1
    else:
        new_col.append(column)
        running_list.append(column)
print(new_col)
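This is the conversion I did. Which suffix the duplicate columns get makes no difference to me, as long as the name (the prefix) stays the same and I can save the file.
To update the columns, simply run: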
df=df.toDF(*new_col)
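This updates the column names so that no duplicate names remain. The columns themselves are kept; only their names change.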
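One caveat with generated suffixes: a new name can clash with a column that already exists (for example, the frame already has a column called `num1_1`). Below is a minimal sketch of a collision-aware variant that works on a plain list of names. The helper name `make_unique` is hypothetical and not part of PySpark.

```python
def make_unique(names):
    """Return a list of names in which every entry is distinct.

    The first occurrence of a name is kept as-is; later occurrences get
    the first suffix _1, _2, ... that is not already in use.
    """
    taken = set(names)  # every original name; generated names must avoid these
    seen = set()        # names whose first occurrence has already been emitted
    result = []
    for name in names:
        if name not in seen:
            seen.add(name)
            result.append(name)
            continue
        k = 1
        candidate = f"{name}_{k}"
        while candidate in taken:
            k += 1
            candidate = f"{name}_{k}"
        taken.add(candidate)  # reserve the generated name
        result.append(candidate)
    return result


print(make_unique(["num1", "num1_1", "num1"]))
# ['num1', 'num1_1', 'num1_2']
```

With a DataFrame this would be used as df = df.toDF(*make_unique(df.columns)).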
If you want to keep the numbering as _1, _2, _3, you can use a dictionary with a try/except block:
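seen={}     # name -> occurrences so far (renamed so it does not shadow the built-in dict)
new_col=[]  # start from an empty list again
for column in old_col:
    try:
        i=seen[column]+1
        new_col.append(column+"_"+str(i))
        seen[column]=i
    except KeyError:
        # first occurrence; note that unique names also receive the "_1" suffix
        seen[column]=1
        new_col.append(column+"_"+str(1))
print(new_col)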
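The dictionary version above numbers every column, including the ones that are not repeated. To get exactly the output asked for in the question, where only repeated names are suffixed, one option is to count the names first. This is a minimal sketch on a plain list of names; the helper name `rename_repeated` is hypothetical.

```python
from collections import Counter


def rename_repeated(names):
    """Suffix only the names that occur more than once with _1, _2, ..."""
    totals = Counter(names)  # how often each name occurs overall
    seen = Counter()         # how often each name has been seen so far
    renamed = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            renamed.append(f"{name}_{seen[name]}")
        else:
            renamed.append(name)  # unique names stay untouched
    return renamed


cols = ["norep1", "num1", "num1", "norep2", "num1", "norep3"]
print(rename_repeated(cols))
# ['norep1', 'num1_1', 'num1_2', 'norep2', 'num1_3', 'norep3']
```

With a DataFrame this would be used as df = df.toDF(*rename_repeated(df.columns)).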
My simple way of doing this is:
def col_duplicates(self):
    '''rename dataframe with dups'''
    columnas = self.columns.copy()
    for i in range(len(columnas)-1):
        for j in range(i+1, len(columnas), 1):
            if columnas[i] == columnas[j]:
                columnas[j] = columnas[i] + '_dup_' + str(j) # this line controls how to rename
    return self.toDF(*columnas)
Used as:
new_df_without_duplicates = col_duplicates(df_with_duplicates)