Find rows with given values; if no result, return highest value possible

Question

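I have a Spark DataFrame. I need to get the average property price for a given area in a given month. If no data is found for that month, I need to get the data for the latest month available (if there is any).

My data looks like this: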

+-----+-------------+-----------+----------+-------------+--------+---------+
|month|property_type|postal_code|avg_price |city         |dpt_code|city_code|
+-----+-------------+-----------+----------+-------------+--------+---------+
|  11 |House        |XXXXX      |2834      |FOO          |123     |1        |
|  11 |House        |XXXXY      |870       |NEAR_FOO     |123     |2        |
|   2 |House        |YYYYY      |732       |LA           |100     |5        |
|   3 |House        |YYYYX      |2361      |NEAR_LA      |100     |6        |
|  11 |House        |ZZZZZ      |2162      |ATL          |105     |9        |
+-----+-------------+-----------+----------+-------------+--------+---------+

So suppose I select dpt_code = 123 and month = 11. I get:

+-----+-------------+-----------+----------+-------------+--------+---------+
|month|property_type|postal_code|avg_price |city         |dpt_code|city_code|
+-----+-------------+-----------+----------+-------------+--------+---------+
|  11 |House        |XXXXX      |2834      |FOO          |123     |1        |
|  11 |House        |XXXXY      |870       |NEAR_FOO     |123     |2        |
+-----+-------------+-----------+----------+-------------+--------+---------+

That is the simplest case. What I don't know how to implement is this:

Suppose dpt_code = 100 and month = 11. I want this returned:

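+-----+-------------+-----------+----------+-------------+--------+---------+
|month|property_type|postal_code|avg_price |city         |dpt_code|city_code|
+-----+-------------+-----------+----------+-------------+--------+---------+
|   2 |House        |YYYYY      |732       |LA           |100     |5        |
|   3 |House        |YYYYX      |2361      |NEAR_LA      |100     |6        |
+-----+-------------+-----------+----------+-------------+--------+---------+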

For now, I think this code works for the first part:

def get_mean_prices_by_dept_and_month(df, dept, month):
    # TODO: if the month condition isn't satisfied, do the same request but
    # with df["month"] = latest month possible for that dept_code, else null.
    return df.filter((df["dept_code"] == dept) & (df["month"] == month))

Answer 1

In your last case, shouldn't it return only month 3, since that is the latest month for which data exists? You can do it like this:

import pandas as pd

df = pd.DataFrame({
    'month': [11, 11, 2, 3, 11],
    'avg_price': [2834, 870, 732, 2361, 2162],
    'dept_code': [123, 123, 100, 100, 105]
})

def get_mean_prices_by_dept_and_month(df, dept, month):
    df2 = df[(df["dept_code"] == dept) & (df["month"] == month)]
    if df2.empty:
        # Start the range at 12 instead of `month` to search backwards from the last month of the year
        for i in range(month, 0, -1):
            df2 = df[(df["dept_code"] == dept) & (df["month"] == i)]
            if not df2.empty:
                break
    return df2
    
df_filter = get_mean_prices_by_dept_and_month(df, 100, 11)
print(df_filter)

Result:

   month  avg_price  dept_code
3      3       2361        100
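
The backward search can also be written without probing one month at a time: collect the months that have data for the department, keep those not later than the requested month, and take the largest. The following is only a minimal sketch in plain Python, assuming the rows are available as a list of dicts; the helper name `pick_month` is hypothetical and not part of pandas or Spark.

```python
def pick_month(rows, dept, month):
    """Return the rows for `dept` in `month`, or in the latest earlier month that has data."""
    # Months with at least one row for this department, not later than the request.
    candidates = {
        r["month"] for r in rows
        if r["dept_code"] == dept and r["month"] <= month
    }
    if not candidates:
        return []  # nothing at or before the requested month
    best = max(candidates)  # equals `month` whenever the exact month exists
    return [r for r in rows if r["dept_code"] == dept and r["month"] == best]


rows = [
    {"month": 11, "avg_price": 2834, "dept_code": 123},
    {"month": 11, "avg_price": 870, "dept_code": 123},
    {"month": 2, "avg_price": 732, "dept_code": 100},
    {"month": 3, "avg_price": 2361, "dept_code": 100},
    {"month": 11, "avg_price": 2162, "dept_code": 105},
]

print(pick_month(rows, 123, 11))  # both month-11 rows for department 123
print(pick_month(rows, 100, 11))  # the single month-3 row for department 100
```

Like the loop above, this only looks at months up to the requested one; dropping the `r["month"] <= month` condition would instead fall back to the latest month overall.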

Answer 2

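You have to count the rows of the DataFrame filtered by dept_code and month to know whether it contains any data. If it does not, find the maximum month available for that dept_code and return the rows for that month.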

from pyspark.sql import functions as F

data = [(11, "House", "XXXXX", "2834", "FOO", 123, 1,),
 (11, "House", "XXXXY", "870", "NEAR_FOO", 123, 2,),
 (2, "House", "YYYYY", "732", "LA", 100, 5,),
 (3, "House", "YYYYX", "2361", "NEAR_LA", 100, 6,),
 (11, "House", "ZZZZZ", "2162", "ATL", 105, 9,), ]

df = spark.createDataFrame(data, ("month", "property_type", "postal_code", "avg_price", "city", "dept_code", "city_code",))

def get_mean_prices_by_dept_and_month(df, dept, month):
    department_data = df.filter(df["dept_code"] == dept)
    given_month_df = department_data.where(department_data["month"] == month)
    # Return the requested month when it has data for this department.
    if given_month_df.count() > 0:
        return given_month_df
    # Otherwise fall back to the latest month available for the department.
    latest_month = department_data.select(F.max("month").alias("latest_month")).head()["latest_month"]
    if latest_month is None:
        return None  # the department has no rows at all
    return department_data.where(department_data["month"] == latest_month)

get_mean_prices_by_dept_and_month(df, 100, 11).show()

"""
+-----+-------------+-----------+---------+-------+---------+---------+
|month|property_type|postal_code|avg_price|   city|dept_code|city_code|
+-----+-------------+-----------+---------+-------+---------+---------+
|    3|        House|      YYYYX|     2361|NEAR_LA|      100|        6|
+-----+-------------+-----------+---------+-------+---------+---------+
"""

python

pyspark