Is it possible to inspect all tables in a BigQuery dataset with one dlpJob?
I'm using Google Cloud DLP to inspect sensitive data in BigQuery, and I'd like to know whether a single dlpJob can inspect all the tables in a dataset. If so, how should I set up the config?
I tried omitting the BigQuery tableId field from the config, but the call returns an HTTP 400 error: "table_id must be set". Does that mean one dlpJob can only inspect a single table, so scanning multiple tables requires multiple dlpJobs? Or is there some regex trick for scanning several tables in the same dataset?
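For reference, this is roughly the request I'm sending (a sketch; the project and dataset names are placeholders). Leaving "tableId" out of tableReference is what produces the 400:

# Sketch of the failing request: tableReference has no "tableId",
# so the API responds with HTTP 400 "table_id must be set".
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dlp.googleapis.com/v2/projects/my-project/dlpJobs" \
  --data '{
    "inspectJob": {
      "storageConfig": {
        "bigQueryOptions": {
          "tableReference": {
            "projectId": "my-project",
            "datasetId": "my_dataset"
          }
        }
      },
      "inspectConfig": { "infoTypes": [ { "name": "ALL_BASIC" } ] }
    }
  }'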
Right now a job scans just one table. The team is working on that feature; in the meantime, you can create the jobs manually with a rough shell script like the one below, which combines gcloud with REST calls to the DLP API. You could probably do something smoother with Cloud Functions.
Prerequisites:
1. Install gcloud: https://cloud.google.com/sdk/install
2. Run the script with the following arguments:
   1. The project_id whose BigQuery tables you want to scan.
   2. The dataset id of the output table that stores the findings.
   3. The table id of the output table that stores the findings.
   4. A number giving the percentage of rows to scan.
# Example:
#   ./inspect_all_bq_tables.sh dlapi-test findings_dataset findings_table 10
# report
#
# Reports a status-of-execution message to the log file and serial port.
function report() {
  local tag="$1"
  local message="$2"
  local timestamp="$(date +%s)000"
  echo "${timestamp} ${tag} - ${message}"
}
readonly -f report

# report_status_update
#
# Reports a status-of-execution message to the log file and serial port.
function report_status_update() {
  report "${MSGTAG_STATUS_UPDATE}" "STATUS=$1"
}
readonly -f report_status_update
# create_dlp_job
#
# Creates a single dlp job for a given bigquery table.
function create_dlp_job {
  local dataset_id="$1"
  local table_id="$2"
  local create_job_response=$(curl -s -H \
    "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "X-Goog-User-Project: $PROJECT_ID" \
    -H "Content-Type: application/json" \
    "$API_PATH/v2/projects/$PROJECT_ID/dlpJobs" \
    --data '
    {
      "inspectJob":{
        "storageConfig":{
          "bigQueryOptions":{
            "tableReference":{
              "projectId":"'$PROJECT_ID'",
              "datasetId":"'$dataset_id'",
              "tableId":"'$table_id'"
            },
            "rowsLimitPercent": "'$PERCENTAGE'"
          }
        },
        "inspectConfig":{
          "infoTypes":[
            {
              "name":"ALL_BASIC"
            }
          ],
          "includeQuote":true,
          "minLikelihood":"LIKELY"
        },
        "actions":[
          {
            "saveFindings":{
              "outputConfig":{
                "table":{
                  "projectId":"'$PROJECT_ID'",
                  "datasetId":"'$FINDINGS_DATASET_ID'",
                  "tableId":"'$FINDINGS_TABLE_ID'"
                },
                "outputSchema": "BASIC_COLUMNS"
              }
            }
          },
          {
            "publishFindingsToCloudDataCatalog": {}
          }
        ]
      }
    }')
  if [[ $create_job_response != *"dlpJobs"* ]]; then
    report_status_update "Error creating dlp job: $create_job_response"
    exit 1
  fi
  # Pull the new job's name out of the JSON response.
  local new_dlpjob_name=$(echo "$create_job_response" \
    | head -5 | grep -Po '"name": *\K"[^"]*"' | tr -d '"' | head -1)
  report_status_update "DLP New Job: $new_dlpjob_name"
}
readonly -f create_dlp_job
# create_jobs
#
# Lists the datasets for the given project. Once we have these we can list
# the tables within each one and create one dlp job per table.
function create_jobs() {
  # The grep pulls out the dataset ids; the tr removes the quotation marks.
  local list_datasets_response=$(curl -s -H \
    "Authorization: Bearer $(gcloud auth print-access-token)" -H \
    "Content-Type: application/json" \
    "$BIGQUERY_PATH/projects/$PROJECT_ID/datasets")
  if [[ $list_datasets_response != *"kind"* ]]; then
    report_status_update "Error listing bigquery datasets: $list_datasets_response"
    exit 1
  fi
  local dataset_ids=$(echo $list_datasets_response \
    | grep -Po '"datasetId": *\K"[^"]*"' | tr -d '"')
  # Each row will look like "datasetId", with the quotation marks.
  for dataset_id in ${dataset_ids}; do
    report_status_update "Looking up tables for dataset $dataset_id"
    local list_tables_response=$(curl -s -H \
      "Authorization: Bearer $(gcloud auth print-access-token)" -H \
      "Content-Type: application/json" \
      "$BIGQUERY_PATH/projects/$PROJECT_ID/datasets/$dataset_id/tables")
    if [[ $list_tables_response != *"kind"* ]]; then
      report_status_update "Error listing bigquery tables: $list_tables_response"
      exit 1
    fi
    local table_ids=$(echo "$list_tables_response" \
      | grep -Po '"tableId": *\K"[^"]*"' | tr -d '"')
    for table_id in ${table_ids}; do
      report_status_update "Creating DLP job to inspect table $table_id"
      create_dlp_job "$dataset_id" "$table_id"
    done
  done
}
readonly -f create_jobs
# Positional arguments; see the prerequisites above.
PROJECT_ID="$1"
FINDINGS_DATASET_ID="$2"
FINDINGS_TABLE_ID="$3"
PERCENTAGE="$4"
# Tag prepended to status messages by report_status_update (value is arbitrary).
MSGTAG_STATUS_UPDATE="STATUS_UPDATE"
API_PATH="https://dlp.googleapis.com"
BIGQUERY_PATH="https://www.googleapis.com/bigquery/v2"

# Main
create_jobs
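The script only creates the jobs; they run asynchronously. A quick way to check on them (a sketch using the DLP jobs list endpoint, reusing the same PROJECT_ID) is:

# List the inspect jobs for the project and show each job's name and state.
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "X-Goog-User-Project: $PROJECT_ID" \
  "https://dlp.googleapis.com/v2/projects/$PROJECT_ID/dlpJobs?type=INSPECT_JOB" \
  | grep -Po '"(name|state)": *"[^"]*"'

Each job reports a state such as PENDING, RUNNING, or DONE; once a job is DONE, its findings land in the output table you passed as arguments 2 and 3.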