How can I exclude rows in a Copy Data Activity in Azure Data Factory?
I have built a pipeline with one Copy Data activity that copies data from Azure Data Lake and outputs it to Azure Blob Storage.

In the output I can see that some of my rows have no data, and I would like to exclude them from the copy. In the example below, the second row has no useful data:
{"TenantId":"qa","Timestamp":"2019-03-06T10:53:51.634Z","PrincipalId":2,"ControlId":"729c3b6e-0442-4884-936c-c36c9b466e9d","ZoneInternalId":0,"IsAuthorized":true,"PrincipalName":"John","StreetName":"Rue 1","ExemptionId":8}
{"TenantId":"qa","Timestamp":"2019-03-06T10:59:09.74Z","PrincipalId":null,"ControlId":null,"ZoneInternalId":null,"IsAuthorized":null,"PrincipalName":null,"StreetName":null,"ExemptionId":null}
Question

In the Copy Data activity, how can I set up a rule to exclude rows that are missing certain values?

Here is my pipeline code:
{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy from Data Lake to Blob",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [
                    {
                        "name": "Source",
                        "value": "tenantdata/events/"
                    },
                    {
                        "name": "Destination",
                        "value": "controls/"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "AzureDataLakeStoreSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "BlobSink",
                        "copyBehavior": "MergeFiles"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": {
                            "Body.TenantId": "TenantId",
                            "Timestamp": "Timestamp",
                            "Body.PrincipalId": "PrincipalId",
                            "Body.ControlId": "ControlId",
                            "Body.ZoneId": "ZoneInternalId",
                            "Body.IsAuthorized": "IsAuthorized",
                            "Body.PrincipalName": "PrincipalName",
                            "Body.StreetName": "StreetName",
                            "Body.Exemption.Kind": "ExemptionId"
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "qadl",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "datalakestaging",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    }
}
This is a great question (+1). I ran into the same issue a few months ago, and to my surprise I could not find anything in the Copy activity to handle it (I even tried the fault-tolerance feature, with no luck).
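For reference, the fault-tolerance attempt was the Copy activity's skip-incompatible-rows setting. A minimal sketch of that configuration is below (the redirect linked service name and path are placeholders I made up); it only skips rows that fail things like type conversion at the sink, so rows that are merely all null still get copied:

"typeProperties": {
    "source": {
        "type": "AzureDataLakeStoreSource",
        "recursive": true
    },
    "sink": {
        "type": "BlobSink",
        "copyBehavior": "MergeFiles"
    },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "PlaceholderBlobLinkedService",
            "type": "LinkedServiceReference"
        },
        "path": "controls/skippedrows"
    }
}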
Since I was already using U-SQL in my pipeline, I ended up using it to accomplish this. So, instead of a Copy activity I have a U-SQL activity in ADF that filters with the IS NOT NULL operator (along with some other transformations). It depends on your data, but you can adapt it; maybe your strings contain "NULL" or an empty string "". Here is what it looks like:
DECLARE @file_set_path string = "adl://myadl.azuredatalake.net/Samples/Data/{date_utc:yyyy}{date_utc:MM}{date_utc:dd}T{date_utc:HH}{date_utc:mm}{date_utc:ss}Z.txt";

// Read the raw files; the virtual column date_utc is populated from the file-set pattern above.
@data =
    EXTRACT
        [id] string,
        date_utc DateTime
    FROM @file_set_path
    USING Extractors.Text(delimiter : '\u0001', skipFirstNRows : 1, quoting : false);

// Keep only the rows that actually carry data.
@result =
    SELECT
        [id],
        date_utc.ToString("yyyy-MM-ddTHH:mm:ss") AS SourceExtractDateUTC
    FROM @data
    WHERE [id] IS NOT NULL; // you can also use WHERE [id] != "" or != "NULL", depending on your data

OUTPUT @result
TO "wasb://samples@mywasb/Samples/Data/searchlog.tsv"
USING Outputters.Text(delimiter : '\u0001', outputHeader : true);
Note: both ADLS and Blob Storage are supported for the INPUT/OUTPUT files.
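To wire this into the pipeline, the Copy activity is replaced by a U-SQL activity. A minimal sketch of that activity definition is below, assuming the script above is saved as FilterNullRows.usql in ADLS; the linked service names, script path, parallelism and priority are placeholders you would adjust:

{
    "name": "Filter null rows with U-SQL",
    "description": "Sketch only - linked service names and script path are placeholders",
    "type": "DataLakeAnalyticsU-SQL",
    "linkedServiceName": {
        "referenceName": "AzureDataLakeAnalyticsLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scriptPath": "scripts/FilterNullRows.usql",
        "scriptLinkedService": {
            "referenceName": "AzureDataLakeStoreLinkedService",
            "type": "LinkedServiceReference"
        },
        "degreeOfParallelism": 3,
        "priority": 100
    }
}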
Let me know if this helps, or if the example above does not work for your data.

Hopefully someone will post an answer that uses the Copy activity, that would be great, but for now this is one possibility.