Replace grouped columns' outliers with mean of the group based on defined zscore

Question

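I have a very large dataframe with many data points on a map. The dataset (latitudes and longitudes) contains outliers, while the remaining points are very close to each other. I want to group all rows by column A as shown below, calculate their zscore, and replace each value in a group that has zscore > 1.5 with the mean of that group.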

df =

[data][1]

So far I have only managed to get a table of zscore values, without success on the replacement:

zscore = lambda x: (x - x.mean()) / x.std()
grouped_df = df.groupby("A")
transformed_df = grouped_df.transform(zscore)
# transformed_df is a table with the zscores

Answer 1

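You can use haversine_distances from scikit-learn to compute the distance between each point and the centroid of its group. Since your points should be very close to each other, you can approximate the latitude and longitude of the centroid with the mean latitude and longitude of the points in the group.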
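To make the distance step concrete, here is a minimal NumPy sketch of the haversine (great-circle) formula, which is the quantity haversine_distances returns before scaling by the Earth radius. The helper name haversine_km and the constant EARTH_RADIUS_KM are my own, used only for illustration, and 6371 km is the usual mean-radius approximation.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    # The formula works on radians, so convert all inputs first
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Two points on the equator, 90 degrees of longitude apart:
# this is a quarter of a great circle
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
```

Because the function only uses NumPy ufuncs, it also accepts arrays or pandas Series, so it can be applied to whole columns at once.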

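Here is an example based on data about UK towns (you can download a free sample from here). In particular, the data contains the coordinates and the county of each town (you can think of the county as a group in your setting):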

                          name          county  latitude  longitude
0                 Aaron's Hill          Surrey  51.18291   -0.63098
1                  Abbas Combe        Somerset  51.00283   -2.41825
2                     Abberley  Worcestershire  52.30522   -2.37574
3                     Abberton           Essex  51.83440    0.91066
4                     Abberton  Worcestershire  52.17955   -2.00817
5                    Abberwick  Northumberland  55.41325   -1.79720
6                   Abbess End           Essex  51.78000    0.28172
7                Abbess Roding           Essex  51.77815    0.27685
8                        Abbey           Devon  50.88896   -3.22276
9  Abbeycwmhir / Abaty Cwm-hir           Powys  52.33104   -3.38988

Here is the code, which you can adapt to your problem:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])

# Compute the coordinates of the centroid of each county (group)
dist_county = df.groupby('county')[['latitude', 'longitude']].mean()

# Convert latitude and longitude to radians, as required by haversine_distances
df[['latitude_radians', 'longitude_radians']] = np.radians(df[['latitude', 'longitude']])
dist_county[['latitude_radians', 'longitude_radians']] = np.radians(dist_county[['latitude', 'longitude']])

# Compute the distance of each town w.r.t. the centroid of its county
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
    lambda x: haversine_distances(
        [x[['latitude_radians', 'longitude_radians']].values],
        [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
    )[0][0] * 6371,  # multiply by the Earth radius in kilometers
    axis=1
)

# Compute mean and std of distances by county
county_stats = df.groupby('county').agg({'dist': ['mean', 'std']})

# Compute the z-score using the distance of each town w.r.t. the centroid of its county, and the mean and std of distances for that county
df['zscore'] = df.apply(
    lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')] ) / county_stats.loc[x['county']][('dist', 'std')],
    axis=1
)

# Replace the latitude and longitude of the outliers with those of the centroid of their county
outliers = df['zscore'] > 1.5
df.loc[outliers, ['latitude', 'longitude']] = dist_county.loc[
    df.loc[outliers, 'county'], ['latitude', 'longitude']
].values
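Since the dataframe in the question is very large, the row-wise apply calls may be slow. Below is a sketch of the same idea written with groupby(...).transform and a NumPy haversine formula instead of scikit-learn. The function name, the toy data and the default column names are assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, an approximation


def replace_outliers_with_centroid(df, group_col, lat_col="latitude",
                                   lon_col="longitude", threshold=1.5):
    """Move points whose distance z-score exceeds `threshold` to the group centroid."""
    out = df.copy()
    grouped = out.groupby(group_col)
    # Centroid approximated by the mean latitude/longitude, broadcast to every row
    cen_lat = grouped[lat_col].transform("mean")
    cen_lon = grouped[lon_col].transform("mean")

    # Haversine distance (km) between each point and the centroid of its group
    lat1, lon1 = np.radians(out[lat_col]), np.radians(out[lon_col])
    lat2, lon2 = np.radians(cen_lat), np.radians(cen_lon)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    out["dist"] = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

    # z-score of the distance within each group (sample standard deviation)
    dist_grouped = out.groupby(group_col)["dist"]
    out["zscore"] = (out["dist"] - dist_grouped.transform("mean")) / dist_grouped.transform("std")

    # Overwrite the coordinates of the outliers with the centroid of their group
    mask = out["zscore"] > threshold
    out.loc[mask, lat_col] = cen_lat[mask]
    out.loc[mask, lon_col] = cen_lon[mask]
    return out


# Toy data: county "a" has five clustered points and one far away (index 5)
towns = pd.DataFrame({
    "county": ["a"] * 6 + ["b"] * 3,
    "latitude": [51.00, 51.01, 50.99, 51.00, 51.01, 52.00, 55.0, 55.1, 54.9],
    "longitude": [0.00, 0.01, 0.00, -0.01, 0.01, 1.00, -2.0, -2.1, -1.9],
})
result = replace_outliers_with_centroid(towns, "county")
```

Note that the centroid and the distance statistics are computed with the outliers included, as in the code above. A group needs at least five points before any of them can reach a z-score above 1.5 with the sample standard deviation.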

The resulting DataFrame df looks like:

              name           county  latitude  longitude  latitude_radians  longitude_radians       dist    zscore
0     Aaron's Hill           Surrey  51.18291   -0.63098          0.893310          -0.011013  12.479147 -0.293419
1      Abbas Combe         Somerset  51.00283   -2.41825          0.890167          -0.042206  35.205157  1.088695
2         Abberley   Worcestershire  52.30522   -2.37574          0.912898          -0.041464  17.014249  0.266168
3         Abberton            Essex  51.83440    0.91066          0.904681           0.015894  24.504285 -0.254400
4         Abberton   Worcestershire  52.17955   -2.00817          0.910705          -0.035049  11.906150 -0.663460
...            ...              ...       ...        ...               ...                ...        ...       ...
1795         Ayton     Berwickshire  55.84232   -2.12285          0.974632          -0.037051   5.899085  0.007876
1796         Ayton    Tyne and Wear  54.89416   -1.55643          0.958084          -0.027165   3.192591 -0.935937

If you look at the outliers for the county of Essex, the new coordinates correspond to those of the centroid, i.e. (51.846594, 0.554532):

             name county   latitude  longitude
414   Aimes Green  Essex  51.846594   0.554532
1721       Aveley  Essex  51.846594   0.554532
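If you instead want the literal version from the question, i.e. replacing values of a single numeric column with the group mean where the per-group zscore exceeds 1.5, a sketch along the same lines could look like this. The column names A and B and the toy values are hypothetical, and the group mean is computed with the outliers included.

```python
import pandas as pd


def replace_outliers_with_group_mean(df, group_col, value_col, threshold=1.5):
    """Replace values whose within-group z-score exceeds `threshold` by the group mean."""
    out = df.copy()
    grouped = out.groupby(group_col)[value_col]
    mean = grouped.transform("mean")  # group mean, aligned with the rows
    std = grouped.transform("std")    # sample standard deviation per group
    zscore = (out[value_col] - mean) / std
    # Keep values that are not outliers, use the group mean otherwise
    out[value_col] = out[value_col].where(~(zscore > threshold), mean)
    return out


# Hypothetical data: the value 10 in group "x" is the only outlier
data = pd.DataFrame({
    "A": ["x"] * 6 + ["y"] * 3,
    "B": [1, 1, 1, 1, 1, 10, 5, 6, 7],
})
result = replace_outliers_with_group_mean(data, "A", "B")
```

The comparison is one-sided, as in the question. Use zscore.abs() if values far below the mean should be replaced as well.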

python

dataframe

data-science