Replace outliers in grouped columns with the group mean, based on a defined z-score
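I have a very large DataFrame with many data points on a map, and the dataset contains outliers (in latitude and longitude) that are very close to each other. I would like to group all rows by column A as shown below, calculate their z-scores, and replace every value in a group whose z-score is > 1.5 with the mean of that group.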
df =
[data][1]
I tried the following, without success; it only gives me a table of z-score values:
zscore = lambda x: (x - x.mean()) / x.std()
grouped_df = df.groupby("A")
transformed_df = grouped_df.transform(zscore)
Here transformed_df is a table of z-scores.
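You can use haversine_distances from scikit-learn to compute the distance between each point and the centroid of the points in its group. Since your points should be very close to each other, you can approximate the latitude and longitude of the centroid with the mean latitude and longitude of the points in the group.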
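As a quick illustration of the call itself: haversine_distances expects [latitude, longitude] pairs in radians and returns angular distances on the unit sphere, so the result has to be multiplied by the Earth's radius to get a length. A minimal sketch, assuming a mean Earth radius of 6371 km (an approximation) and using the coordinates of two Essex towns as inputs:

```python
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

# [latitude, longitude] in degrees
abberton = [51.83440, 0.91066]
abbess_end = [51.78000, 0.28172]

# Convert to radians, then compute the matrix of pairwise angular distances
points_rad = np.radians([abberton, abbess_end])
dist_km = haversine_distances(points_rad) * EARTH_RADIUS_KM

# Off-diagonal entry: distance between the two towns in kilometers
print(dist_km[0, 1])
```

The diagonal of the returned matrix is zero (each point compared with itself), and the matrix is symmetric.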
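Here is an example based on UK towns data (you can download a free sample from here). In particular, the data contains the coordinates and the county of each town (you can think of the county as a group in your setting):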
name county latitude longitude
0 Aaron's Hill Surrey 51.18291 -0.63098
1 Abbas Combe Somerset 51.00283 -2.41825
2 Abberley Worcestershire 52.30522 -2.37574
3 Abberton Essex 51.83440 0.91066
4 Abberton Worcestershire 52.17955 -2.00817
5 Abberwick Northumberland 55.41325 -1.79720
6 Abbess End Essex 51.78000 0.28172
7 Abbess Roding Essex 51.77815 0.27685
8 Abbey Devon 50.88896 -3.22276
9 Abbeycwmhir / Abaty Cwm-hir Powys 52.33104 -3.38988
Here is the code, adapted to solve your problem:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])

# Compute the coordinates of the centroid of each county (group)
dist_county = df.groupby('county').agg({'latitude': 'mean', 'longitude': 'mean'})

# Convert latitude and longitude to radians (required by haversine_distances)
df[['latitude_radians', 'longitude_radians']] = np.radians(df[['latitude', 'longitude']])
dist_county[['latitude_radians', 'longitude_radians']] = np.radians(dist_county[['latitude', 'longitude']])

# Compute the distance of each town w.r.t. the centroid of its county
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
    lambda x: haversine_distances(
        [x[['latitude_radians', 'longitude_radians']].values],
        [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
    )[0][0] * 6371000 / 1000,  # multiply by the Earth's radius (in m), then convert to kilometers
    axis=1
)

# Compute the mean and std of the distances by county
county_stats = df.groupby('county').agg({'dist': ['mean', 'std']})

# Compute the z-score from the distance of each town to the centroid of its county,
# and the mean and std of the distances within that county
df['zscore'] = df.apply(
    lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')]) / county_stats.loc[x['county']][('dist', 'std')],
    axis=1
)

# Replace the latitude and longitude of the outliers with those of the centroid of their county
df.loc[df.zscore > 1.5, ['latitude', 'longitude']] = df[df.zscore > 1.5].merge(
    dist_county, left_on='county', right_on=dist_county.index, how='left'
)[['latitude_y', 'longitude_y']].values
The resulting DataFrame df looks like:
name county latitude longitude latitude_radians longitude_radians dist zscore
0 Aaron's Hill Surrey 51.18291 -0.63098 0.893310 -0.011013 12.479147 -0.293419
1 Abbas Combe Somerset 51.00283 -2.41825 0.890167 -0.042206 35.205157 1.088695
2 Abberley Worcestershire 52.30522 -2.37574 0.912898 -0.041464 17.014249 0.266168
3 Abberton Essex 51.83440 0.91066 0.904681 0.015894 24.504285 -0.254400
4 Abberton Worcestershire 52.17955 -2.00817 0.910705 -0.035049 11.906150 -0.663460
... ... ... ... ... ... ... ... ...
1795 Ayton Berwickshire 55.84232 -2.12285 0.974632 -0.037051 5.899085 0.007876
1796 Ayton Tyne and Wear 54.89416 -1.55643 0.958084 -0.027165 3.192591 -0.935937
If you look at the outliers for the county of Essex, the new coordinates correspond to those of the centroid, i.e. (51.846594, 0.554532):
name county latitude longitude
414 Aimes Green Essex 51.846594 0.554532
1721 Aveley Essex 51.846594 0.554532
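The row-wise apply calls in the code above can be slow on a very large DataFrame. The following is a vectorized sketch of the same idea, built on groupby(...).transform as in the question. It is not the code that produced the output above: the haversine formula is written by hand in NumPy, the Earth radius of 6371 km is an approximation, and the names haversine_km and replace_outliers_with_centroid, as well as the small demo data, are made up for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees (hypothetical helper)."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def replace_outliers_with_centroid(df, group_col, lat_col="latitude",
                                   lon_col="longitude", threshold=1.5):
    """Return a copy of df where outliers are moved to their group's centroid."""
    out = df.copy()
    grouped = out.groupby(group_col)
    # Centroid of each group, broadcast back to every row of that group
    lat_c = grouped[lat_col].transform("mean")
    lon_c = grouped[lon_col].transform("mean")
    # Distance of each point to the centroid of its own group
    out["dist"] = haversine_km(out[lat_col], out[lon_col], lat_c, lon_c)
    # z-score of that distance within the group
    dist_grouped = out.groupby(group_col)["dist"]
    out["zscore"] = (out["dist"] - dist_grouped.transform("mean")) / dist_grouped.transform("std")
    # Overwrite the coordinates of the outliers with the centroid coordinates
    outliers = out["zscore"] > threshold
    out.loc[outliers, lat_col] = lat_c[outliers]
    out.loc[outliers, lon_col] = lon_c[outliers]
    return out


# Made-up demo data: group A has one point far from the rest, group B has none
demo = pd.DataFrame({
    "group": ["A"] * 8 + ["B"] * 3,
    "latitude": [51.00, 51.01, 51.02, 51.00, 51.01, 51.02, 51.01, 51.50, 40.00, 40.01, 40.02],
    "longitude": [0.00, 0.00, 0.00, 0.01, 0.01, 0.01, 0.02, 0.50, 3.00, 3.01, 3.02],
})
result = replace_outliers_with_centroid(demo, "group")
print(result[["group", "latitude", "longitude", "zscore"]])
```

Note that with the sample standard deviation a group of n points can never produce a z-score above (n - 1) / sqrt(n), so groups with fewer than four points will never be flagged at a threshold of 1.5.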