Find the closest lower date between two sets of dates on a large data set (MySQL)
I have two tables:
- "visit" basically stores every visit to a website
| visitdate | city |
----------------------------------
| 2014-12-01 00:00:02 | Paris |
| 2015-01-03 00:00:02 | Marseille|
- "cityweather" stores weather information for many cities, 3 times a day
| weatherdate | city | temp |
-------------------------------------------
| 2014-12-01 09:00:02 | Paris | 20 |
| 2014-12-01 09:00:02 | Marseille| 22 |
Note that table visit may contain cities that are not in cityweather, and vice versa; I only need the cities common to both tables.
So my question is: how can I SELECT, for each visitdate, the MAX(weatherdate) that is lower than the visit date?
It should look like this:
| visitdate | city | beforedate |
--------------------------------------------------------
| 2014-12-01 00:00:02 | Paris | 2014-11-30 21:00:00 |
| 2015-01-03 15:07:26 | Marseille| 2015-01-03 09:00:00 |
I tried something like this:
SELECT t.city, t.visitdate, d.weatherdate as beforedate
FROM visitsub as t
JOIN cityweatherfrsub as d
ON d.weatherdate =
( SELECT MAX(d.weatherdate)
FROM cityweatherfrsub
WHERE d.weatherdate <= t.visitdate AND d.city=t.city
)
AND d.city = t.city;
But the size of the tables makes it impossible to compute in a "reasonable" time (10^14 steps):
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
---------------------------------------------------------------------------------------------------------------------------------------------------------
| 1 | PRIMARY | d | ALL | idx_city,Idx_citydate | NULL | NULL | NULL | 1204305 | Using where |
| 1 | PRIMARY | t | ref | Idxcity, Idxcitydate | Idxcitydate | 303 | meteo.d.city | 111 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | cityweather | index | NULL | Idx_date | 6 | NULL | 1204305 | Using where; Using index |
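The 10^14 figure can be reproduced from the plan above by multiplying the three "rows" estimates. This is only a rough sketch: it assumes the dependent subquery is re-run for every joined (d, t) pair, and Python is used here purely as a calculator.

```python
# Row estimates copied from the EXPLAIN output above.
rows_d = 1_204_305         # full scan of d (type ALL)
rows_t = 111               # visit rows matched per weather row (type ref)
rows_subquery = 1_204_305  # index scan performed by the dependent subquery

# Assumption: the subquery is evaluated once per (d, t) pair.
estimated_steps = rows_d * rows_t * rows_subquery
print(f"{estimated_steps:.1e}")  # on the order of 10^14
```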
I am now looking into user variables such as @variable, but I am very new to them and have only written something that does not work (Error Code: 1111. Invalid use of group function):
SET @j :=0;
SET @k :=0;
SET @l :=0;
SET @m :=0;
CREATE TABLE intermedweather
SELECT @l as city, @k as visitdate, @j as beforedate
FROM visitsub t
JOIN cityweatherfrsub d
WHERE (@j := d.weatherdate) <= (@k := t.visitdate)
AND (@l := d.city) = (@m := t.city)
AND @j = MAX(d.weatherdate);
A similar post can be found here, but its approach does not work for my problem.
Maybe something like this:
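select
    V.*,
    (
        -- latest weather reading strictly before the visit, same city
        select
            MAX(W.weatherdate)
        from cityweather W
        where
            W.weatherdate < V.visitdate and
            W.city = V.city
    ) beforedate
from visit V
where
    -- keep only cities that appear in both tables
    exists ( select 1 from cityweather W where W.city = V.city )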
Try this:
SELECT t.visitdate, t.city, max(d.weatherdate) beforedate
FROM visit t inner JOIN cityweather d
on t.city=d.city
and d.weatherdate <= t.visitdate -- only readings at or before the visit
group by t.city,t.visitdate
I'm not sure this is what you need, but it should do the trick.
SELECT t.visitdate, d.city, MAX(d.weatherdate) as beforedate
FROM cityweather d
JOIN visit t
ON d.weatherdate <= t.visitdate
AND d.city=t.city
GROUP BY t.visitdate, d.city;
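To check what this join-and-group query returns, here is a small sketch that runs it on a handful of rows. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained), and the rows are made up to be consistent with the expected output in the question; Lyon and Nice are hypothetical cities that each appear in only one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visit (visitdate TEXT, city TEXT);
    CREATE TABLE cityweather (weatherdate TEXT, city TEXT, temp INTEGER);

    INSERT INTO visit VALUES
        ('2014-12-01 00:00:02', 'Paris'),
        ('2015-01-03 15:07:26', 'Marseille'),
        ('2015-01-03 10:00:00', 'Lyon');            -- no weather rows for Lyon

    INSERT INTO cityweather VALUES
        ('2014-11-30 15:00:00', 'Paris', 18),
        ('2014-11-30 21:00:00', 'Paris', 15),
        ('2014-12-01 09:00:02', 'Paris', 20),
        ('2015-01-03 09:00:00', 'Marseille', 22),
        ('2015-01-03 21:00:00', 'Marseille', 19),
        ('2015-01-03 09:00:00', 'Nice', 17);        -- no visits for Nice
""")

# ISO-formatted timestamps stored as text compare in chronological order.
rows = conn.execute("""
    SELECT t.visitdate, d.city, MAX(d.weatherdate) AS beforedate
    FROM cityweather d
    JOIN visit t
      ON d.weatherdate <= t.visitdate
     AND d.city = t.city
    GROUP BY t.visitdate, d.city
    ORDER BY t.visitdate
""").fetchall()

for row in rows:
    print(row)
```

Cities present in only one of the two tables drop out through the inner join, which matches the requirement in the question.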
An alternative approach that avoids MAX():
SELECT v.visitdate, v.city, w.weatherdate AS beforedate
FROM visit v
JOIN cityweather w
ON v.city = w.city
AND v.visitdate >= w.weatherdate
AND NOT EXISTS ( SELECT * FROM cityweather nx
WHERE nx.city = v.city
AND nx.weatherdate <= v.visitdate
AND nx.weatherdate > w.weatherdate
);
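In the end I found the answer myself. It all comes down to narrowing the selection from table cityweather. I did it in two steps to avoid the O(n^2) problem we had run into so far, by reducing the size of the first table (sometimes a virtual table) built in the other answers: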
Step one (the crucial one):
CREATE TABLE intermedtable
SELECT t.city, t.visitdate, d.weatherdate
FROM visit as t
JOIN cityweather as d
WHERE d.city=t.city AND d.weatherdate <= t.visitdate AND d.weatherdate + interval 1 day >= t.visitdate;
Compared to before, the key here is the condition d.weatherdate + interval 1 day >= t.visitdate. It took "only" 22 minutes.
The second step is to find the MAX(weatherdate) for each (city, visitdate) pair:
Create table beforedatetable
SELECT city, visitdate, max(weatherdate) as beforedate
FROM intermedtable
GROUP BY city, visitdate;
With this solution I went from a 16-hour computation (which crashed in the end) to 32 minutes.
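The core of this answer is to shrink the virtual table created in the previous answers by adding the condition d.weatherdate + interval 1 day >= t.visitdate. This relies on the fact that the weather date of interest cannot be more than one day before the visit date.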
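The two steps can be tried on a few rows with the sketch below. It again uses Python's sqlite3 module in place of MySQL, so the MySQL expression d.weatherdate + interval 1 day is written as datetime(d.weatherdate, '+1 day'); the sample rows are invented for illustration. The second visit has no reading in the preceding day, which shows what happens when the one-day assumption does not hold: that visit disappears from the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visit (visitdate TEXT, city TEXT);
    CREATE TABLE cityweather (weatherdate TEXT, city TEXT, temp INTEGER);

    INSERT INTO visit VALUES
        ('2014-12-01 00:00:02', 'Paris'),
        ('2014-12-05 08:00:00', 'Paris');   -- no reading in the previous day

    INSERT INTO cityweather VALUES
        ('2014-11-29 21:00:00', 'Paris', 14),
        ('2014-11-30 15:00:00', 'Paris', 18),
        ('2014-11-30 21:00:00', 'Paris', 15),
        ('2014-12-01 09:00:02', 'Paris', 20);

    -- Step 1: keep only readings taken at most one day before each visit.
    CREATE TABLE intermedtable AS
    SELECT t.city, t.visitdate, d.weatherdate
    FROM visit AS t
    JOIN cityweather AS d
      ON d.city = t.city
     AND d.weatherdate <= t.visitdate
     AND datetime(d.weatherdate, '+1 day') >= t.visitdate;

    -- Step 2: latest remaining reading per (city, visitdate).
    CREATE TABLE beforedatetable AS
    SELECT city, visitdate, MAX(weatherdate) AS beforedate
    FROM intermedtable
    GROUP BY city, visitdate;
""")

candidate_count = conn.execute(
    "SELECT COUNT(*) FROM intermedtable"
).fetchone()[0]
result = conn.execute(
    "SELECT city, visitdate, beforedate FROM beforedatetable"
).fetchall()

print(candidate_count)  # candidate pairs kept by the one-day window
print(result)
```

Without the one-day bound, step one would keep every earlier reading for every visit (7 pairs for these rows instead of 2), which is where the quadratic blow-up on the real tables comes from.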