How to scrape a Tableau dashboard in which data is only displayed in a plot after clicking in a map?

I am trying to scrape data from this public Tableau dashboard. I am interested in the data plotted in the time series. If I click on a specific state in the map, the time series changes to that state. Following other posts on the subject, I got the time series aggregated at the country level (using the code provided below). But I am interested in the state-level data.

import requests
from bs4 import BeautifulSoup
import json
import re

# get the second tableau link
r = requests.get(
    "https://public.tableau.com/views/MKTScoredeisolamentosocial/VisoGeral",
    params= {
        ":showVizHome":"no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],

})

# raw string so that \d is passed to the regex engine, not treated as a string escape
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))


print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

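I have looked into Tableau categories and found that some parameters can be inserted into the URL to get the desired result, but I could not find such parameters. I noticed that the data I want is stored in a worksheet called "time_line_BR", where BR stands for Brazil. But I want to change this to a state, for example São Paulo (SP). I also noticed some parameters in tableauData, such as "current_view_id", which I suspect may be related to the data loaded in the time series.

Is it possible to post a request that pulls the same data I see in the plot when I manually select a specific state?

Edit

I've made a Python library to scrape Tableau dashboards (tableauscraper). The implementation is more straightforward: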

from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/MKTScoredeisolamentosocial/VisoGeral"

ts = TS()
ts.loads(url)
dashboard = ts.getDashboard()

for t in dashboard.worksheets:
    #show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    #show dataframe for this worksheet
    print(t.data)



Old answer

When you click on the map, it triggers a call to:

POST https://public.tableau.com/{vizql_root}/sessions/{session_id}/commands/tabdoc/select

with form data like the following:

worksheet: map_state_mobile
dashboard: Visão Geral
selection: {"objectIds":[17],"selectionType":"tuples"}
selectOptions: select-options-simple

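It contains the state index (17 here) and the worksheet name. I noticed that the worksheet name is either map_state_mobile or map_state (2) when you click on a state.

So, it is necessary to:

  • get the list of state names, to pick the correct index for the state you want to select
  • call the API above to select the state, and extract the data

Extracting the field values (state names)

The states are sorted alphabetically (in reverse order), so you don't need the method below if you are fine with hardcoding them in this order: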

['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']
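
With the hardcoded list, finding the number to send for a given state is a plain lookup. This is only a sketch: it assumes the select call expects the position in this list counted from 1 (which is what the full script further down sends), and `state_object_id` is a made-up helper name.

```python
STATE_NAMES = ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima',
               'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte',
               'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba',
               'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso',
               'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal',
               'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']


def state_object_id(name, state_names=STATE_NAMES):
    """Return the 1-based position of `name` in the list.

    Assumption: this position is the value to put in "objectIds".
    Raises ValueError if the name is not in the list.
    """
    return state_names.index(name) + 1


print(state_object_id("São Paulo"))
```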

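In other cases, when you don't want to hardcode them (for other Tableau use cases), use the method below.

Extracting the list of state names is not straightforward, since the data looks like this: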

{
    "secondaryInfo": {
        "presModelMap": {
            "dataDictionary": {
                "presModelHolder": {
                    "genDataDictionaryPresModel": {
                        "dataSegments": {
                            "0": {
                                "dataColumns": []
                            }
                        }
                    }
                }
            },
            "vizData": {
                "presModelHolder": {
                    "genPresModelMapPresModel": {
                        "presModelMap": {
                            "map_state (2)": {},
                            "map_state_mobile": {},
                            "time_line_BR": {},
                            "time_line_BR_mobile": {},
                            "total de casos": {},
                            "total de mortes": {}
                        }
                    }
                }
            }
        }
    }
}

My method is to go into "vizData" and then into a specific worksheet inside presModelMap, which has the following structure:

"presModelHolder": {
    "genVizDataPresModel": {
        "vizColumns": [],
        "paneColumnsData": {
            "vizDataColumns": [],
            "paneColumnsList": []
        }
    }
}
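
The path down to a worksheet is long, so it can help to wrap it in a small function. A minimal sketch, assuming `data` is the second JSON object parsed from the bootstrap response; `get_viz_data` and the stand-in `sample` dict are made up for illustration:

```python
def get_viz_data(data, worksheet):
    """Return the genVizDataPresModel object of one worksheet."""
    pres_model_map = (
        data["secondaryInfo"]["presModelMap"]["vizData"]
        ["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"]
    )
    return pres_model_map[worksheet]["presModelHolder"]["genVizDataPresModel"]


# Tiny stand-in with the same nesting as the real response (no real data).
sample = {"secondaryInfo": {"presModelMap": {"vizData": {"presModelHolder": {
    "genPresModelMapPresModel": {"presModelMap": {
        "map_state (2)": {"presModelHolder": {"genVizDataPresModel": {
            "vizColumns": [],
            "paneColumnsData": {"vizDataColumns": [], "paneColumnsList": []},
        }}},
    }},
}}}}}

viz = get_viz_data(sample, "map_state (2)")
print(sorted(viz["paneColumnsData"]))
```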

vizDataColumns holds a collection of objects with a localBaseColumnName property. Find the object whose localBaseColumnName is [state_name] and whose fieldRole is measure:

{
    "fn": "[federated.124ags61tmhyti14im1010h1elsu].[attr:state_name:nk]",
    "fnDisagg": "",
    "localBaseColumnName": "[state_name]", <============================= MATCH THIS
    "baseColumnName": "[federated.124ags61tmhyti14im1010h1elsu].[state_name]",
    "fieldCaption": "ATTR(State Name)",
    "formatStrings": [],
    "datasourceCaption": "federated.124ags61tmhyti14im1010h1elsu",
    "dataType": "cstring",
    "aggregation": "attr",
    "stringCollation": {
        "name": "LEN_RUS_S2",
        "charsetId": 0
    },
    "fieldRole": "measure", <=========================================== MATCH THIS
    "isAutoSelect": true,
    "paneIndices": [
        0  <=========================================== EXTRACT THIS
    ],
    "columnIndices": [
        7  <=========================================== EXTRACT THIS
    ]
} 
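
The match described above can be sketched as a small search function. The field role is left as a parameter on purpose: the sample object shows "measure", while the full script further down filters on "dimension", so check which one your response actually contains. `find_column` and `sample_columns` are made-up names with made-up data.

```python
def find_column(viz_data_columns, base_name, field_role):
    """Return pane/column positions of the first matching column, or None."""
    for col in viz_data_columns:
        # Some entries have no localBaseColumnName at all, hence .get()
        if (col.get("localBaseColumnName") == base_name
                and col.get("fieldRole") == field_role):
            return {
                "paneIndex": col["paneIndices"][0],
                "columnIndex": col["columnIndices"][0],
                "dataType": col["dataType"],
            }
    return None


sample_columns = [
    {"fieldRole": "measure", "dataType": "integer"},
    {"localBaseColumnName": "[state_name]", "fieldRole": "measure",
     "dataType": "cstring", "paneIndices": [0], "columnIndices": [7]},
]

match = find_column(sample_columns, "[state_name]", "measure")
print(match)
```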

paneIndices matches the index in the paneColumnsList array, and columnIndices matches the index in the vizPaneColumns array. The vizPaneColumns array is located inside the item selected in the paneColumnsList array.
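
In code, those two lookups are just two array accesses followed by reading valueIndices. A sketch with made-up sample data (`value_indices` is a hypothetical helper name):

```python
def value_indices(pane_columns_list, pane_index, column_index):
    """Pick the pane first, then the column inside that pane."""
    pane = pane_columns_list[pane_index]
    return pane["vizPaneColumns"][column_index]["valueIndices"]


# One pane with two columns; the second one stands in for the state names.
sample_panes = [
    {"vizPaneColumns": [
        {"valueIndices": [0, 1]},
        {"valueIndices": [222, 223, 224]},
    ]},
]

indices = value_indices(sample_panes, 0, 1)
print(indices)
```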

From there you get the indices to look up, like this:

[222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248]

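In the dataDictionary object, get the dataValues (as you already did in your question) and extract the state names at the indices above.

You then get the list of states: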

['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']
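
This last step is a lookup of each index in the dataValues of the column with the matching dataType. A minimal sketch with made-up values, assuming (as the full script below does) that the first column of that dataType is the right one; `lookup_values` is a hypothetical name:

```python
def lookup_values(data_columns, data_type, indices):
    """Map value indices to actual values for the first column of `data_type`."""
    values = next(
        col["dataValues"] for col in data_columns if col["dataType"] == data_type
    )
    return [values[i] for i in indices]


sample_data_columns = [
    {"dataType": "integer", "dataValues": [1, 2, 3]},
    {"dataType": "cstring",
     "dataValues": ["Brasil", "Tocantins", "Sergipe", "São Paulo"]},
]

names = lookup_values(sample_data_columns, "cstring", [1, 2, 3])
print(names)
```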

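Calling the select endpoint

You only need the worksheet name and the field index (the position of the state in the list above, counted from 1 as in the script below):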

r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
    data = {
    "worksheet": "map_state (2)",
    "dashboard": "Visão Geral",
    "selection": json.dumps({
        "objectIds":[int(selected_index)],
        "selectionType":"tuples"
    }),
    "selectOptions": "select-options-simple"
})
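
If you call this for several states, the form data can be built by a small helper so that the selection field is always JSON-encoded the same way. A sketch only; `build_select_payload` is a made-up name and the field values are the ones observed above:

```python
import json


def build_select_payload(worksheet, dashboard, object_id):
    """Build the form fields for the select command."""
    return {
        "worksheet": worksheet,
        "dashboard": dashboard,
        # "selection" is sent as a JSON string, not as nested form fields
        "selection": json.dumps({
            "objectIds": [int(object_id)],
            "selectionType": "tuples",
        }),
        "selectOptions": "select-options-simple",
    }


payload = build_select_payload("map_state (2)", "Visão Geral", "17")
print(payload["selection"])
```

The result can be passed as the data argument of requests.post, as in the call above.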

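The code below extracts the Tableau data, extracts the state names with the method above (not needed if you prefer the hardcoded list), prompts the user for a state index, calls the select endpoint and extracts the data for that state: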

import requests
from bs4 import BeautifulSoup
import json
import re

data_host = "https://public.tableau.com"

# get the second tableau link
r = requests.get(
    f"{data_host}/views/MKTScoredeisolamentosocial/VisoGeral",
    params= {
        ":showVizHome":"no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

# raw string so that \d is passed to the regex engine, not treated as a string escape
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

stateIndexInfo = [ 
    (t["fieldRole"], {
        "paneIndices": t["paneIndices"][0], 
        "columnIndices": t["columnIndices"][0], 
        "dataType": t["dataType"]
    }) 
    for t in data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"]["map_state (2)"]["presModelHolder"]["genVizDataPresModel"]["paneColumnsData"]["vizDataColumns"]
    if t.get("localBaseColumnName") and t["localBaseColumnName"] == "[state_name]"
]

stateNameIndexInfo = [t[1] for t in stateIndexInfo if t[0] == 'dimension'][0]

panelColumnList = data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"]["map_state (2)"]["presModelHolder"]["genVizDataPresModel"]["paneColumnsData"]["paneColumnsList"]
stateNameIndices = panelColumnList[stateNameIndexInfo["paneIndices"]]["vizPaneColumns"][stateNameIndexInfo["columnIndices"]]["valueIndices"]

# print [222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248]
#print(stateNameIndices)

dataValues = [
    t
    for t in data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"]
    if t["dataType"] == stateNameIndexInfo["dataType"]
][0]["dataValues"]

stateNames = [dataValues[t] for t in stateNameIndices]

# print ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']
#print(stateNames)

for idx, val in enumerate(stateNames):
    print(f"{val} - {idx+1}")

selected_index = input("Please select a state by indices : ")
print(f"selected : {stateNames[int(selected_index)-1]}")

r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
    data = {
    "worksheet": "map_state (2)",
    "dashboard": "Visão Geral",
    "selection": json.dumps({
        "objectIds":[int(selected_index)],
        "selectionType":"tuples"
    }),
    "selectOptions": "select-options-simple"
})

dataSegments = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
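# segment keys are numeric strings ("0", "1", ...): compare them as integers to get the latest one
print(dataSegments[max(dataSegments, key=int)]["dataColumns"])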
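
The printed dataColumns is a list of objects with a dataType and its dataValues, which is a bit raw to work with. One possible way to tidy it up, assuming each dataType appears only once in the segment (an assumption, not something the response guarantees); `columns_by_type` and the sample values are made up:

```python
def columns_by_type(data_columns):
    """Index the dataValues lists by their dataType."""
    return {col["dataType"]: col["dataValues"] for col in data_columns}


sample_data_columns = [
    {"dataType": "real", "dataValues": [0.41, 0.39, 0.44]},
    {"dataType": "cstring", "dataValues": ["São Paulo"]},
]

by_type = columns_by_type(sample_data_columns)
print(by_type["real"])
```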


The code with the hardcoded list of states is more straightforward:

import requests
from bs4 import BeautifulSoup
import json

data_host = "https://public.tableau.com"

r = requests.get(
    f"{data_host}/views/MKTScoredeisolamentosocial/VisoGeral",
    params= {
        ":showVizHome":"no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})
stateNames = ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']

for idx, val in enumerate(stateNames):
    print(f"{val} - {idx+1}")

selected_index = input("Please select a state by indices : ")
print(f"selected : {stateNames[int(selected_index)-1]}")

r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
    data = {
    "worksheet": "map_state (2)",
    "dashboard": "Visão Geral",
    "selection": json.dumps({
        "objectIds":[int(selected_index)],
        "selectionType":"tuples"
    }),
    "selectOptions": "select-options-simple"
})

dataSegments = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
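# segment keys are numeric strings ("0", "1", ...): compare them as integers to get the latest one
print(dataSegments[max(dataSegments, key=int)]["dataColumns"])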


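Note that in this case, even though we don't care about the output of the first call (/bootstrapSession/sessions/{tableauData["sessionid"]}), it is still needed: it validates the session_id so that select can be called afterwards (otherwise select doesn't return anything).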
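
To make that ordering hard to get wrong, both calls can live in one function. A sketch under assumptions: `http` is anything with a requests-style post method (the requests module itself, or a requests.Session), and `select_state`, `RecordingClient` and the dummy tableau_data values are made up so that the example runs without network access.

```python
import json


def select_state(http, host, tableau_data, worksheet, dashboard, object_id):
    """Bootstrap the session first, then send the select command."""
    base = f'{host}{tableau_data["vizql_root"]}'
    session_id = tableau_data["sessionid"]
    # 1) bootstrap: the response is ignored, but the call must happen first
    http.post(f"{base}/bootstrapSession/sessions/{session_id}",
              data={"sheet_id": tableau_data["sheetId"]})
    # 2) select: only returns data once the session has been bootstrapped
    return http.post(
        f"{base}/sessions/{session_id}/commands/tabdoc/select",
        data={
            "worksheet": worksheet,
            "dashboard": dashboard,
            "selection": json.dumps({"objectIds": [int(object_id)],
                                     "selectionType": "tuples"}),
            "selectOptions": "select-options-simple",
        },
    )


class RecordingClient:
    """Stand-in for requests that only records the URLs it is asked to post to."""

    def __init__(self):
        self.urls = []

    def post(self, url, data=None):
        self.urls.append(url)


client = RecordingClient()
dummy = {"vizql_root": "/vizql/w/demo/v/demo", "sessionid": "ABC", "sheetId": "demo"}
select_state(client, "https://public.tableau.com", dummy,
             "map_state (2)", "Visão Geral", 17)
print(client.urls)
```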