Find rows with given values; if no result, return highest value possible
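I have a Spark DataFrame.
I need to get the average property price for a given department in a given month. If no data is found for that month, I need to fall back to the latest month that has data (if there is one).
My data looks like this: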
+-----+-------------+-----------+---------+--------+---------+---------+
|month|property_type|postal_code|avg_price|city    |dept_code|city_code|
+-----+-------------+-----------+---------+--------+---------+---------+
|11   |House        |XXXXX      |2834     |FOO     |123      |1        |
|11   |House        |XXXXY      |870      |NEAR_FOO|123      |2        |
|2    |House        |YYYYY      |732      |LA      |100      |5        |
|3    |House        |YYYYX      |2361     |NEAR_LA |100      |6        |
|11   |House        |ZZZZZ      |2162     |ATL     |105      |9        |
+-----+-------------+-----------+---------+--------+---------+---------+
So, say I pick dept_code = 123 and month = 11.
I get:
+-----+-------------+-----------+---------+--------+---------+---------+
|month|property_type|postal_code|avg_price|city    |dept_code|city_code|
+-----+-------------+-----------+---------+--------+---------+---------+
|11   |House        |XXXXX      |2834     |FOO     |123      |1        |
|11   |House        |XXXXY      |870      |NEAR_FOO|123      |2        |
+-----+-------------+-----------+---------+--------+---------+---------+
That is the simple case.
What I don't know how to implement is this:
Say dept_code = 100 and month = 11.
I want this returned:
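+-----+-------------+-----------+---------+--------+---------+---------+
|month|property_type|postal_code|avg_price|city    |dept_code|city_code|
+-----+-------------+-----------+---------+--------+---------+---------+
|2    |House        |YYYYY      |732      |LA      |100      |5        |
|3    |House        |YYYYX      |2361     |NEAR_LA |100      |6        |
+-----+-------------+-----------+---------+--------+---------+---------+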
For now, I think this code works for the first part:
def get_mean_prices_by_dept_and_month(df, dept, month):
    return df.filter((df["dept_code"] == dept) & (df["month"] == month))
    # if month condition isn't satisfied then do same request but
    # with df["month"] = latest month possible for that dept_code, else null.
In your last case, shouldn't it return only month 3? That is the latest month for which data exists.
You can do it like this:
import pandas as pd

df = pd.DataFrame({
    'month': [11, 11, 2, 3, 11],
    'avg_price': [2834, 870, 732, 2361, 2162],
    'dept_code': [123, 123, 100, 100, 105]
})

def get_mean_prices_by_dept_and_month(df, dept, month):
    df2 = df[(df["dept_code"] == dept) & (df["month"] == month)]
    if df2.empty:
        # Walk backwards from the requested month until a month with data is found.
        # Change 'month' to 12 here to start from the last month of the year instead.
        for i in range(month, 0, -1):
            df2 = df[(df["dept_code"] == dept) & (df["month"] == i)]
            if not df2.empty:
                break
    return df2

df_filter = get_mean_prices_by_dept_and_month(df, 100, 11)
print(df_filter)
Result:
   month  avg_price  dept_code
3      3       2361        100
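The loop above runs one filter per candidate month. A loop-free sketch of the same idea follows. It assumes, as the loop does, that the fallback should be the latest month at or before the requested one; the function name `prices_for_month_or_earlier` is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [11, 11, 2, 3, 11],
    "avg_price": [2834, 870, 732, 2361, 2162],
    "dept_code": [123, 123, 100, 100, 105],
})

def prices_for_month_or_earlier(df, dept, month):
    # Hypothetical helper: keep the department's rows up to the requested month.
    candidates = df[(df["dept_code"] == dept) & (df["month"] <= month)]
    if candidates.empty:
        return candidates  # no data at or before that month
    # The largest remaining month is either the requested month itself
    # or the latest earlier month that has data.
    return candidates[candidates["month"] == candidates["month"].max()]

result = prices_for_month_or_earlier(df, 100, 11)
print(result)
```

If the fallback should instead ignore the requested month and use the department's overall latest month, dropping the `df["month"] <= month` condition is not enough on its own: an exact-match check would have to come first.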
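You have to count the rows of the DataFrame filtered by dept_code and month to know whether it contains any data. Then, depending on that count, find the maximum month for the matching dept_code and return the data for that maximum month.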
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [(11, "House", "XXXXX", "2834", "FOO", 123, 1,),
        (11, "House", "XXXXY", "870", "NEAR_FOO", 123, 2,),
        (2, "House", "YYYYY", "732", "LA", 100, 5,),
        (3, "House", "YYYYX", "2361", "NEAR_LA", 100, 6,),
        (11, "House", "ZZZZZ", "2162", "ATL", 105, 9,), ]
df = spark.createDataFrame(data, ("month", "property_type", "postal_code", "avg_price", "city", "dept_code", "city_code",))

def get_mean_prices_by_dept_and_month(df, dept, month):
    department_data = df.filter(df["dept_code"] == dept)
    given_month_df = department_data.where(department_data["month"] == month)
    # Return the requested month if it has any rows.
    if given_month_df.count() > 0:
        return given_month_df
    # Otherwise fall back to the latest month available for this department.
    latest_month = department_data.select(F.max("month").alias("latest_month")).head()["latest_month"]
    if latest_month is None:
        return None  # no data at all for this department
    return department_data.where(department_data["month"] == latest_month)

get_mean_prices_by_dept_and_month(df, 100, 11).show()
"""
+-----+-------------+-----------+---------+-------+---------+---------+
|month|property_type|postal_code|avg_price| city|dept_code|city_code|
+-----+-------------+-----------+---------+-------+---------+---------+
| 3| House| YYYYX| 2361|NEAR_LA| 100| 6|
+-----+-------------+-----------+---------+-------+---------+---------+
"""