根据定义的 zscore 将分组列的离群值替换为组的平均值

Replace grouped columns' outliers with mean of the group based on defined zscore

我有一个非常大的数据框,地图上有许多数据点,数据集(纬度和经度)上的异常值彼此非常接近。我想按如下所示对 A 列的所有行进行分组,计算它们的 zscore 并将 zscore > 1.5 的组中的每个值替换为该组的平均值。

df =

[data][1]

我尝试了 zscore 值 table 但没有成功

<**zscore = lambda x : (x - x.mean()) / x.std()
grouped_df = df.groupby("A")
transformed_df = grouped_df.transform(zscore)
transformed_df which gives me a table with zscores**>

您可以使用 scikit-learn 中的 haversine_distances 来计算同一组中某个点与该点的质心之间的距离。鉴于你应该有非常接近的点,你可以用组中点的经纬度的平均值来近似质心的经纬度。

这是一个基于英国城镇数据的示例(您可以从 here 下载免费示例)。特别是,数据包含每个城市的坐标和县(您可以将其视为您设置中的一个组):

                          name          county  latitude  longitude
0                 Aaron's Hill          Surrey  51.18291   -0.63098
1                  Abbas Combe        Somerset  51.00283   -2.41825
2                     Abberley  Worcestershire  52.30522   -2.37574
3                     Abberton           Essex  51.83440    0.91066
4                     Abberton  Worcestershire  52.17955   -2.00817
5                    Abberwick  Northumberland  55.41325   -1.79720
6                   Abbess End           Essex  51.78000    0.28172
7                Abbess Roding           Essex  51.77815    0.27685
8                        Abbey           Devon  50.88896   -3.22276
9  Abbeycwmhir / Abaty Cwm-hir           Powys  52.33104   -3.38988

此处更改代码以解决您的问题:

from math import radians

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])

# Compute coordinates of the centroid for each county (group)
dist_county = pd.DataFrame(df.groupby('county').agg({'latitude': np.mean, 'longitude': np.mean}))

# Convert latitude and longitude to radians (it is needed by the function to compute haversine distance)
df[['latitude_radians', 'longitude_radians']] = df[['latitude', 'longitude']].applymap(radians)
dist_county[['latitude_radians', 'longitude_radians']] = dist_county[['latitude', 'longitude']].applymap(radians)

# Compute the distance of each town w.r.t. the centroid of its conunty
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
    lambda x: haversine_distances(
        [x[['latitude_radians', 'longitude_radians']].values],
        [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
    )[0][0] * 6371000/1000,  # multiply by Earth radius to get kilometers,
    axis=1
)

# Compute mean and std of distances by county
county_stats = df.groupby('county').agg({'dist': [np.mean, np.std]})

# Compute the z-score using the distance of each town w.r.t. the centroid of its county, and the mean and std of distances for that county
df['zscore'] = df.apply(
    lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')] ) / county_stats.loc[x['county']][('dist', 'std')],
    axis=1
)

# Change latitude and longitude of the outliers with those of the centroid of their counties
df.loc[df.zscore > 1.5, ['latitude', 'longitude']] = df[df.zscore > 1.5].merge(
    dist_county, left_on='county', right_on=dist_county.index, how='left'
)[['latitude_y', 'longitude_y']].values

生成的 DataFrame df 看起来像:

              name           county  latitude  longitude  latitude_radians  longitude_radians       dist    zscore
0     Aaron's Hill           Surrey  51.18291   -0.63098          0.893310          -0.011013  12.479147 -0.293419
1      Abbas Combe         Somerset  51.00283   -2.41825          0.890167          -0.042206  35.205157  1.088695
2         Abberley   Worcestershire  52.30522   -2.37574          0.912898          -0.041464  17.014249  0.266168
3         Abberton            Essex  51.83440    0.91066          0.904681           0.015894  24.504285 -0.254400
4         Abberton   Worcestershire  52.17955   -2.00817          0.910705          -0.035049  11.906150 -0.663460
...            ...              ...       ...        ...               ...                ...        ...       ...
1795         Ayton     Berwickshire  55.84232   -2.12285          0.974632          -0.037051   5.899085  0.007876
1796         Ayton    Tyne and Wear  54.89416   -1.55643          0.958084          -0.027165   3.192591 -0.935937

如果您查看埃塞克斯县的异常值,新坐标对应于质心的坐标,即 (51.846594, 0.554532):

             name county   latitude  longitude
414   Aimes Green  Essex  51.846594   0.554532
1721       Aveley  Essex  51.846594   0.554532