Find the closest lower date among two sets of dates with a big data set (MySQL)

I have two tables, visit:

    | visitdate           | city     |
    ----------------------------------
    | 2014-12-01 00:00:02 | Paris    |
    | 2015-01-03 00:00:02 | Marseille|

and cityweather:

    | weatherdate           | city     | temp |
    -------------------------------------------
    | 2014-12-01 09:00:02   | Paris    | 20   |
    | 2014-12-01 09:00:02   | Marseille| 22   |

Note that the visit table may contain cities that are not in cityweather, and vice versa; I only need the cities that both tables have in common.

So my question is:

How do I SELECT, for each visitdate, the MAX(weatherdate) that is lower than the visit date?

The result should look like this:

    | visitdate           | city     | beforedate          |
    --------------------------------------------------------
    | 2014-12-01 00:00:02 | Paris    | 2014-11-30 21:00:00 |
    | 2015-01-03 15:07:26 | Marseille| 2015-01-03 09:00:00 |
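
To pin down the semantics being asked for, here is a minimal Python sketch, not taken from the thread. The sample rows are made up, the function name `closest_lower_dates` is hypothetical, and the comparison is assumed to be inclusive (`<=`), as in the attempt that follows. It sorts the weather dates per city and uses a binary search to find the latest one at or before each visit.

```python
from bisect import bisect_right
from collections import defaultdict

# Made-up sample rows. Timestamps in one ISO-like format compare
# correctly as plain strings.
visits = [
    ("2014-12-01 00:00:02", "Paris"),
    ("2015-01-03 15:07:26", "Marseille"),
    ("2015-01-03 15:07:26", "Lyon"),  # city without weather rows
]
weather = [
    ("2014-11-30 21:00:00", "Paris"),
    ("2014-12-01 09:00:02", "Paris"),
    ("2015-01-03 09:00:00", "Marseille"),
    ("2015-01-03 18:00:00", "Marseille"),
]


def closest_lower_dates(visits, weather):
    """Return (visitdate, city, beforedate) for cities present in both inputs."""
    by_city = defaultdict(list)
    for weatherdate, city in weather:
        by_city[city].append(weatherdate)
    for dates in by_city.values():
        dates.sort()

    result = []
    for visitdate, city in visits:
        dates = by_city.get(city)
        if not dates:
            continue  # city missing from the weather data
        pos = bisect_right(dates, visitdate)  # number of dates <= visitdate
        if pos:
            result.append((visitdate, city, dates[pos - 1]))
    return result


result = closest_lower_dates(visits, weather)
```

Each visit costs one binary search, so the work grows roughly as n log n rather than with the product of the two table sizes.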

I tried something like this:

    SELECT t.city, t.visitdate, d.weatherdate as beforedate
        FROM visitsub as t
        JOIN cityweatherfrsub as d
        ON  d.weatherdate =
            ( SELECT MAX(d.weatherdate)
                FROM cityweatherfrsub
                WHERE d.weatherdate <= t.visitdate AND d.city=t.city
            )
        AND d.city = t.city;

But the size of the tables makes it impossible to compute this in a "reasonable" amount of time (about 10^14 steps):

    | id | select_type        | table       | type  | possible_keys         | key          | key_len | ref          | rows    | Extra                     |
    ---------------------------------------------------------------------------------------------------------------------------------------------------------
    | 1  | PRIMARY            | d           | ALL   | idx_city,Idx_citydate | NULL         | NULL    | NULL         | 1204305 | Using where               |
    | 1  | PRIMARY            | t           | ref   | Idxcity, Idxcitydate  | Idxcitydate  | 303     | meteo.d.city | 111     | Using where; Using index  |
    | 2  | DEPENDENT SUBQUERY | cityweather | index | NULL                  | Idx_date     | 6       | NULL         | 1204305 | Using where; Using index  |
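
The "10^14 steps" figure can be reproduced from the `rows` column of this plan. The sketch below is only a rough estimate, assuming the dependent subquery is re-evaluated once for every joined row; the variable names are illustrative.

```python
# Row estimates copied from the EXPLAIN output above.
rows_d = 1_204_305         # full scan of d (type ALL)
rows_t = 111               # rows of t matched per row of d (type ref)
rows_subquery = 1_204_305  # index scan inside the dependent subquery

# Assumption: the subquery runs once per joined (d, t) row.
estimated_steps = rows_d * rows_t * rows_subquery
print(f"{estimated_steps:.2e}")  # on the order of 10^14
```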

I am now looking into user variables such as @variable, but I am new to them and have only written something that does not work. It fails with Error Code: 1111. Invalid use of group function:

    SET @j :=0;
    SET @k :=0;
    SET @l :=0;
    SET @m :=0;
    CREATE TABLE intermedweather
        SELECT @l as city, @k as visitdate, @j as beforedate
        FROM visitsub t
        JOIN cityweatherfrsub d
        WHERE (@j := d.weatherdate) <= (@k := t.visitdate)
          AND (@l := d.city) = (@m := t.city)
          AND  @j = MAX(d.weatherdate);
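
Error 1111 is most likely raised because MAX() appears directly in the WHERE clause, where aggregate functions are not allowed. The sketch below illustrates the same rule with Python's built-in `sqlite3` module standing in for MySQL (an assumption: the engines differ, but both reject an aggregate in WHERE). It also shows the usual workaround of moving the aggregate into a subquery. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cityweather (weatherdate TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO cityweather VALUES (?, ?)",
    [("2014-11-30 21:00:00", "Paris"), ("2014-12-01 09:00:02", "Paris")],
)

# An aggregate directly in WHERE is rejected when the statement is prepared.
try:
    conn.execute(
        "SELECT city FROM cityweather WHERE weatherdate = MAX(weatherdate)"
    )
    aggregate_in_where_rejected = False
except sqlite3.OperationalError:
    aggregate_in_where_rejected = True

# Workaround: compute the aggregate in a subquery and compare against it.
latest = conn.execute(
    "SELECT city, weatherdate FROM cityweather "
    "WHERE weatherdate = (SELECT MAX(weatherdate) FROM cityweather)"
).fetchall()
```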

I found a similar post, but its approach does not work for my problem.

Maybe something like this:

    select
        V.*,
        (
            select
                MAX(weatherdate)
            from Weather W
            where
                W.weatherdate < V.visitdate and
                W.city = V.city
        ) beforedate
    from Visit V
    where
        -- the alias W must be declared here too, otherwise W.city is unknown
        exists ( select 1 from Weather W where V.city = W.city)
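
As a quick sanity check of this correlated-subquery form, the sketch below runs it on made-up rows using Python's `sqlite3` module in place of MySQL (an assumption; the query itself uses only standard SQL). The index name `idx_weather_city_date` is hypothetical: a composite index on (city, weatherdate) is what would let each subquery be answered by an index lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Visit   (visitdate TEXT, city TEXT);
    CREATE TABLE Weather (weatherdate TEXT, city TEXT, temp INTEGER);
    -- Hypothetical composite index for the per-visit lookup.
    CREATE INDEX idx_weather_city_date ON Weather (city, weatherdate);
""")
conn.executemany("INSERT INTO Visit VALUES (?, ?)", [
    ("2014-12-01 00:00:02", "Paris"),
    ("2015-01-03 15:07:26", "Marseille"),
    ("2015-01-03 15:07:26", "Lyon"),  # not present in Weather
])
conn.executemany("INSERT INTO Weather VALUES (?, ?, ?)", [
    ("2014-11-30 21:00:00", "Paris", 18),
    ("2014-12-01 09:00:02", "Paris", 20),
    ("2015-01-03 09:00:00", "Marseille", 22),
])

rows = conn.execute("""
    SELECT V.visitdate, V.city,
           (SELECT MAX(W.weatherdate)
              FROM Weather W
             WHERE W.weatherdate < V.visitdate
               AND W.city = V.city) AS beforedate
      FROM Visit V
     WHERE EXISTS (SELECT 1 FROM Weather W WHERE W.city = V.city)
     ORDER BY V.visitdate
""").fetchall()
```

Note that a visit in a shared city with no earlier weather row would still be returned, with a NULL beforedate.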

Try this:

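    SELECT t.visitdate, t.city, max(d.weatherdate) beforedate
      FROM visit t inner JOIN cityweather d
      on t.city=d.city
      -- without this filter, MAX() would return the latest weather date overall
      and d.weatherdate <= t.visitdate
      group by t.city,t.visitdate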

I'm not sure this is exactly what you need, but it should do the trick.

    SELECT t.visitdate, d.city, MAX(d.weatherdate) as beforedate
       FROM cityweather d
       JOIN visit t
       ON d.weatherdate <= t.visitdate
       AND d.city=t.city
       GROUP BY t.visitdate, d.city;

An alternative approach that avoids MAX():

    SELECT v.visitdate, v.city, w.weatherdate AS beforedate
    FROM visit v
    JOIN cityweather w
            ON v.city = w.city
            AND v.visitdate >= w.weatherdate
            AND NOT EXISTS ( SELECT * FROM cityweather nx
                    WHERE nx.city = v.city
                    AND nx.weatherdate <= v.visitdate
                    AND nx.weatherdate > w.weatherdate
            );
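
The JOIN plus GROUP BY form and the NOT EXISTS form should return the same rows as long as cityweather holds no duplicate (city, weatherdate) pairs. The sketch below checks this on a few made-up rows, with Python's `sqlite3` module standing in for MySQL (an assumption; both queries use only standard SQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visit       (visitdate TEXT, city TEXT);
    CREATE TABLE cityweather (weatherdate TEXT, city TEXT, temp INTEGER);
""")
conn.executemany("INSERT INTO visit VALUES (?, ?)", [
    ("2014-12-01 00:00:02", "Paris"),
    ("2015-01-03 15:07:26", "Marseille"),
])
conn.executemany("INSERT INTO cityweather VALUES (?, ?, ?)", [
    ("2014-11-29 21:00:00", "Paris", 17),
    ("2014-11-30 21:00:00", "Paris", 18),
    ("2014-12-01 09:00:02", "Paris", 20),
    ("2015-01-03 09:00:00", "Marseille", 22),
])

# Variant 1: join on the date range, then keep the latest date per visit.
group_by_rows = conn.execute("""
    SELECT t.visitdate, d.city, MAX(d.weatherdate) AS beforedate
      FROM cityweather d
      JOIN visit t
        ON d.weatherdate <= t.visitdate AND d.city = t.city
     GROUP BY t.visitdate, d.city
""").fetchall()

# Variant 2: keep a weather row only if no later one still precedes the visit.
not_exists_rows = conn.execute("""
    SELECT v.visitdate, v.city, w.weatherdate AS beforedate
      FROM visit v
      JOIN cityweather w
        ON v.city = w.city
       AND v.visitdate >= w.weatherdate
       AND NOT EXISTS (SELECT * FROM cityweather nx
                        WHERE nx.city = v.city
                          AND nx.weatherdate <= v.visitdate
                          AND nx.weatherdate > w.weatherdate)
""").fetchall()

same_result = sorted(group_by_rows) == sorted(not_exists_rows)
```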

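In the end I found the answer myself. It all comes down to narrowing the selection from the cityweather table. I did it in two steps to avoid the O(n^2) problem we had been running into so far: I reduced the size of the first table (sometimes a virtual one) built in the other answers.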

First step (the crucial one):

    CREATE TABLE intermedtable
       SELECT t.city, t.visitdate, d.weatherdate
          FROM visit as t
          JOIN cityweather as d
          WHERE d.city=t.city
            AND d.weatherdate <= t.visitdate
            AND d.weatherdate + interval 1 day >= t.visitdate;

Compared with the earlier attempts, the key here is the condition d.weatherdate + interval 1 day >= t.visitdate. This step "only" took 22 minutes.

The second step finds MAX(weatherdate) for each (city, visitdate) pair:

    CREATE TABLE beforedatetable
       SELECT city, visitdate, max(weatherdate) as beforedate
           FROM intermedtable
           GROUP BY city, visitdate;
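
The two steps can be tried end to end on a handful of made-up rows. This sketch uses Python's `sqlite3` module rather than MySQL, so two things are adapted and should be read as assumptions: `d.weatherdate + interval 1 day` becomes SQLite's `datetime(d.weatherdate, '+1 day')`, and `CREATE TABLE ... SELECT` becomes `CREATE TABLE ... AS SELECT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visit       (visitdate TEXT, city TEXT);
    CREATE TABLE cityweather (weatherdate TEXT, city TEXT, temp INTEGER);
""")
conn.executemany("INSERT INTO visit VALUES (?, ?)", [
    ("2014-12-01 00:00:02", "Paris"),
    ("2015-01-03 15:07:26", "Marseille"),
    ("2015-02-10 12:00:00", "Paris"),  # no weather row within one day
])
conn.executemany("INSERT INTO cityweather VALUES (?, ?, ?)", [
    ("2014-11-20 09:00:00", "Paris", 15),  # older than one day: filtered out
    ("2014-11-30 21:00:00", "Paris", 18),
    ("2014-12-01 09:00:02", "Paris", 20),
    ("2015-01-02 18:00:00", "Marseille", 21),
    ("2015-01-03 09:00:00", "Marseille", 22),
])

# Step 1: keep only weather rows in the one-day window before each visit.
conn.execute("""
    CREATE TABLE intermedtable AS
    SELECT t.city, t.visitdate, d.weatherdate
      FROM visit AS t
      JOIN cityweather AS d
        ON d.city = t.city
       AND d.weatherdate <= t.visitdate
       AND datetime(d.weatherdate, '+1 day') >= t.visitdate
""")
window_rows = conn.execute("SELECT COUNT(*) FROM intermedtable").fetchone()[0]

# Step 2: latest remaining weather date per (city, visitdate) pair.
conn.execute("""
    CREATE TABLE beforedatetable AS
    SELECT city, visitdate, MAX(weatherdate) AS beforedate
      FROM intermedtable
     GROUP BY city, visitdate
""")
rows = conn.execute(
    "SELECT city, visitdate, beforedate FROM beforedatetable "
    "ORDER BY city, visitdate"
).fetchall()
```

The third visit disappears from the result because it has no weather row within the preceding day, so the shortcut is only safe when weather readings are at most one day apart.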

With this solution, I went from 16 hours of computation (which ended in a crash) down to 32 minutes.

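The core of this answer is to shrink the virtual table created in the previous answers by adding the condition d.weatherdate + interval 1 day >= t.visitdate. It relies on the fact that the weather date of interest cannot be more than one day before the visit date.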