Convert NumPy arrays to a Pandas DataFrame with columns
I want to standardize both my categorical and my numerical values.
cols = df.columns.values.tolist()
df_num = df.drop(CAT_COLUMNS, axis=1)
df_num = df_num.as_matrix()
df_num = preprocessing.StandardScaler().fit_transform(df_num)
df.fillna('NA', inplace=True)
df_cat = df.T.to_dict().values()
vec_cat = DictVectorizer( sparse=False )
df_cat = vec_cat.fit_transform(df_cat)
After that I need to combine the two NumPy arrays back into a pandas DataFrame, but the approach below does not work for me.
mas = np.hstack((df_num, df_cat))
df = pd.DataFrame(data=mas, columns=cols)
Error message: ValueError: Shape of passed values is (475, 243), indices imply (83, 243)
One more approach:
columns = df.columns.values.tolist()
for col in columns:
    try:
        if col in CAT_COLUMNS:
            df[col] = pd.get_dummies(df[col])
        else:
            df[col] = df[col].apply(preprocessing.StandardScaler().fit)
    except Exception, err:
        print 'Column: %s and msg=%s' % (col, err.message)
Error messages:
Column: DATE and msg=Singleton array array(1444424400.0) cannot be considered a valid collection.
Column: QTR_HR_START and msg=Singleton array array(21600000L, dtype=int64) cannot be considered a valid collection.
...
PS. Is there any way to avoid NumPy altogether? For example, I would like to use the pandas_ml library.
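What you are looking for is pandas.get_dummies(). It performs one-hot encoding on a categorical column and returns a DataFrame as the result. From there you can use pandas.concat([existing_df, new_df], axis=1) to add the new columns to the existing DataFrame. This avoids working with NumPy arrays.
Example of how to use it: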
for cat_column in CAT_COLUMNS:
    dummy_df = pd.get_dummies(df[cat_column])
    # Optionally rename columns to indicate the categorical feature name
    dummy_df.columns = ["%s_%s" % (cat_column, col) for col in dummy_df.columns]
    df = pd.concat([df, dummy_df], axis=1)
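A self-contained sketch of the same idea, which also standardizes the numeric columns without leaving pandas. The toy data and the `num_columns` name are assumptions made up for illustration; `CAT_COLUMNS` follows the question. Dividing by `std(ddof=0)` is meant to mirror what StandardScaler does, since it uses the population standard deviation.

```python
import pandas as pd

# Hypothetical toy data; CAT_COLUMNS mirrors the name used in the question.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.0, 3.0, 4.0],
    "weight": [10.0, 20.0, 30.0, 40.0],
})
CAT_COLUMNS = ["color"]
num_columns = [c for c in df.columns if c not in CAT_COLUMNS]

# Standardize the numeric columns with plain pandas arithmetic.
# ddof=0 gives the population standard deviation, as StandardScaler uses.
df_num = (df[num_columns] - df[num_columns].mean()) / df[num_columns].std(ddof=0)

# One-hot encode the categorical columns; each new column is named
# "<source column>_<category>".
df_cat = pd.get_dummies(df[CAT_COLUMNS], columns=CAT_COLUMNS)

# axis=1 places the frames side by side (adds columns, not rows).
result = pd.concat([df_num, df_cat], axis=1)
print(result.columns.tolist())
```

Note that the original categorical column is left out of `result` here, whereas the loop above keeps it next to its dummy columns; drop it afterwards if you do not need it.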
How about the following very simple approach?
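def normalize_dataframe(df):
    # Expects pd, sklearn's preprocessing and CAT_COLUMNS to be available
    for col in df.columns.values.tolist():
        try:
            if col in CAT_COLUMNS:
                # get_dummies returns several columns, so swap them in for the original one
                dummies = pd.get_dummies(df[col], prefix=col)
                df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
            else:
                # StandardScaler expects 2-D input, hence df[[col]]
                df[col] = preprocessing.StandardScaler().fit_transform(df[[col]]).ravel()
        except Exception as err:
            print('Column: %s and msg=%s' % (col, err))
    return df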