Slow MySQL database query time in a Python for loop

Question

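I have a task to run 8 identical queries (1 query per country) and return the data from a MySQL database. I can't run 1 query for all countries because each country needs different column names. The results also need to be updated daily using a dynamic date range (the last 7 days). Yes, I could run all countries at once and do the column naming and everything else in Pandas, but I thought the following solution would be more efficient. So my solution was to create a for loop that uses predefined lists with all the countries, their respective dimensions, and date range variables that change according to the current date.

The problem is that the MySQL queries running in the loop take far more time than the same query run directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. I don't know which part exactly is causing the problem or how to solve it.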
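The single-query alternative mentioned above (fetch all countries at once, then name the columns per country afterwards) could look roughly like the sketch below. It assumes the query is changed to also select and group by the country, so each row arrives as (country, ID, subType, score). The function name `split_by_country` and the row layout are illustrative assumptions; the country-to-dimension mapping is taken from the lists in the code sample. It uses only the standard library; with Pandas the same idea would be a group-by followed by a column rename.

```python
from collections import defaultdict

# Score column name per country, taken from the question's
# `countries` and `score_dim` lists.
SCORE_COLUMN = {
    "EE": "ga:dimension11",
    "FI": "ga:dimension11",
    "LV": "ga:dimension19",
    "LT": "ga:dimension14",
}


def split_by_country(rows):
    """Group (country, id, sub_type, score) rows into per-country tables.

    Returns {country: (header, data_rows)}, where the score column in
    the header is named after that country's dimension.
    """
    grouped = defaultdict(list)
    for country, app_id, sub_type, score in rows:
        grouped[country].append((app_id, sub_type, score))
    return {
        country: (("ID", "subType", SCORE_COLUMN[country]), data)
        for country, data in grouped.items()
    }
```

Each (header, data_rows) pair can then be written to its own CSV file, so the database is queried once instead of once per country. Whether this is actually faster depends on how the DWH handles the larger result set.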

Here is my code sample, with some smaller "tests" implemented:

#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date

#Create a connection to new DWH:
coon = mysql.connector.connect(
  host="the host goes here",
  user="the user goes here",
  passwd="the password goes here"
)

#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'

cursor = coon.cursor()

#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']

#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)

#Create a loop
for c,s in zip(countries, score_dim):
    start_time = time.time()
    #Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
    from aio.CreditApplication ca
    join aio.ScoringResult sr 
    on sr.creditApplication_ID = ca.ID 
    join aio.ScorecardVariableLine svl 
    on svl.id = sr.scorecardVariableLine_ID
    join aio.ScorecardVariable sv 
    on sv.ID = svl.scorecardVariable_ID
    where sv.country='{c}'  
    #and sv.subType ="asc"
    and sv.subType != 'fsc'
    and sr.created >= '2020-01-01'
    and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
    group by ca.id,sv.subType"""

    #Check of sql query
    print('query is done', time.time()-start_time)

    start_time = time.time()
    sql = pd.read_sql_query(query, coon)
    #check of assigning sql:
    print ('sql is assigned',time.time()-start_time)

    start_time = time.time()
    df = pd.DataFrame(sql
                      #, columns = ['created','ID','state']
                      )
    #Check the df assignment:
    print ('df has been assigned', time.time()-start_time)

    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index = False, header=True, encoding='utf-8', sep=';')
    #Check csv file creation:
    print ('csv has been created',time.time()-start_time)

#Close the cursor and the connection once, after the loop has finished:
start_time = time.time()
cursor.close()
coon.close()

#Check the session closing:
print('The cursor is closed',time.time()-start_time)

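This example has 4 countries because I tried halving the number, but that didn't help either. At that point I thought there was some kind of query limit on the DWH side, because the main slowdown always started with the fifth country. Running the countries separately takes almost the same time per query, which is still far too long. So my tests show that the loop always lags at the step that queries the data. Every other step takes less than a second, but the query time climbs to 140-500 seconds, sometimes more, as mentioned above. So, what do you think is the problem?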
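One way to narrow this down is to time each step with a small helper and to pass the country and dates as bound parameters instead of formatting them into the SQL text. The sketch below is only an illustration under stated assumptions: `last_7_days`, `timed`, and `build_query` are hypothetical helper names, and the redundant `sr.created >= '2020-01-01'` condition is dropped because the `between` clause already covers it. It will not make the query itself faster; it only makes the measurements and the query construction easier to trust.

```python
import time
from contextlib import contextmanager
from datetime import date, timedelta


def last_7_days(today=None):
    """Return (start, end) as in the question: end is yesterday,
    start is 7 days before that."""
    today = today or date.today()
    end = today - timedelta(days=1)
    start = end - timedelta(days=7)
    return start, end


@contextmanager
def timed(label, timings):
    """Store how long the wrapped block took, in seconds, under `label`."""
    started = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - started


def build_query(score_alias):
    """Build the SQL text; values stay as %s placeholders for the driver."""
    # A column alias cannot be sent as a bound parameter, so it is
    # checked and then formatted into the text.
    if not score_alias.replace(":", "").isalnum():
        raise ValueError(f"unexpected alias: {score_alias!r}")
    return f"""
        select ca.ID, sv.subType, SUM(svl.score) as '{score_alias}'
        from aio.CreditApplication ca
        join aio.ScoringResult sr on sr.creditApplication_ID = ca.ID
        join aio.ScorecardVariableLine svl on svl.id = sr.scorecardVariableLine_ID
        join aio.ScorecardVariable sv on sv.ID = svl.scorecardVariable_ID
        where sv.country = %s
          and sv.subType != 'fsc'
          and sr.created between %s and %s
        group by ca.ID, sv.subType"""


# Usage sketch (needs a live connection, so it is left commented out):
# timings = {}
# start, end = last_7_days()
# for country, alias in zip(countries, score_dim):
#     with timed(f"{country} query", timings):
#         df = pd.read_sql_query(
#             build_query(alias), coon,
#             params=(country, f"{start} 00:00:00", f"{end} 23:59:59"))
# print(timings)
```

If the per-country timings are all similarly slow, as described above, the bottleneck is more likely the query plan on the server than anything in the Python loop.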

Answer 1

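Found the solution! I talked to a person at my company who has more experience with SQL and with our particular DWH engine, and he agreed to help and rewrote the SQL part. Instead of joining subqueries, I had to rewrite the query so that it contains no subqueries at all. Why? Because our particular engine does not create indexes for subqueries, whereas, as far as I can tell, tables joined directly do have indexes. This cut the run time of the whole script significantly, from roughly 40 minutes to under 1 minute.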
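To check whether a query is hitting this kind of problem, one option is to ask the server for its plan with `EXPLAIN` and look for steps that use no index. The sketch below is an assumption-laden illustration, not the method used in the answer: `explain_rows` and `rows_without_index` are hypothetical helpers, they rely only on the standard DB-API cursor interface, and `rows_without_index` assumes stock MySQL `EXPLAIN` output, where each row has a `key` column that is NULL when no index is used and derived tables (subqueries in the FROM clause) appear with names like `<derived2>`. A custom DWH engine may report its plan differently.

```python
def explain_rows(conn, query, params=None):
    """Run EXPLAIN on a query and return the plan as a list of dicts."""
    cursor = conn.cursor()
    try:
        if params is None:
            cursor.execute("EXPLAIN " + query)
        else:
            cursor.execute("EXPLAIN " + query, params)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


def rows_without_index(plan):
    """Return plan rows whose `key` column is empty (MySQL-style output)."""
    return [row for row in plan if row.get("key") is None]


# Usage sketch against the question's connection (left commented out):
# plan = explain_rows(coon, query)
# for row in rows_without_index(plan):
#     print(row.get("table"), row.get("type"), row.get("rows"))
```

Comparing the plan of the subquery version with the plan of the rewritten version should show whether the joined tables now use their indexes.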

Tags: python, mysql, eclipse, pydev, pandas