KDB - Text parsing and cataloging text data
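I have data made up of strings with differing periodicity. Each string is effectively a list of time-series values followed by a periodicity flag. Unfortunately, each string can hold a different number of elements, though never more than 7.
In the example below, # and #/M at the end of a string mean the values are monthly, starting in August 2020, while #/Y marks annual figures, so we divide by 12, for example, to get a monthly value. A # at the start simply means the row continues from the previous period.
Copied from CSV: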
ID,seg,strField
AAA,1,8/2020 2333 2456 2544 2632 2678 #/M
AAA,2,# 3333 3456 3544 3632 3678 #
AAA,3,# 4333 4456 4544 4632 4678 #/M
AAA,4,11/2021 5333 5456 #/M
AAA,5,# 6333 6456 6544 6632 6678 #/Y
t:("SSS";enlist",") 0:`:./Data/src/strField.csv; // read in csv data above
t:update result:count[t]#enlist`float$() from t; // initiate empty result column
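I would normally tokenize and then pass each of the 7 columns to a function, but functions are limited to 8 parameters, and I want to send other metadata in addition to those 7.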
t:@[t;`tok1`tok2`tok3`tok4`tok5`tok6`tok7;:;flip .Q.fu[{" " vs'x}]t `strField];
t: ungroup t;
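As a rough sketch of the layout described above (not part of the original question), a single string can be split into its start token, its values and its periodicity flag. The names s, tok, start, flag and vals are illustrative only.

```q
// One sample string from the CSV above
s:"8/2020 2333 2456 2544 2632 2678 #/M"
tok:" " vs s              // split on spaces; this row yields 7 tokens
start:first tok           // "8/2020", or "#" on a continuation row
flag:last tok             // "#/M", "#/Y" or "#"
vals:"J"$ -1_1_tok        // drop first and last token, parse the rest as longs
```

Because the first and last tokens are picked by position, the same three lines work whether a row carries two values or five.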
//Desired result
ID seg iDate result
AAA 1 8/31/2020 2333
AAA 1 9/30/2020 2456
AAA 1 10/31/2020 2544
AAA 1 11/30/2020 2632
AAA 1 12/31/2020 2678
AAA 2 1/31/2021 3333
AAA 2 2/28/2021 3456
AAA 2 3/31/2021 3544
AAA 2 4/30/2021 3632
AAA 2 5/31/2021 3678
AAA 3 6/30/2021 4333
AAA 3 7/31/2021 4456
AAA 3 8/31/2021 4544
AAA 3 9/30/2021 4632
AAA 3 10/31/2021 4678
AAA 4 11/30/2021 5333
AAA 4 12/31/2021 5456
AAA 5 1/31/2022 527.75 <-- 6333/12
AAA 5 2/28/2022 527.75
AAA 5 3/31/2022 527.75
AAA 5 4/30/2022 527.75
AAA 5 5/31/2022 527.75
AAA 5 6/30/2022 527.75
AAA 5 7/31/2022 527.75
AAA 5 8/31/2022 527.75
AAA 5 9/30/2022 527.75
AAA 5 10/31/2022 527.75
AAA 5 11/30/2022 527.75
AAA 5 12/31/2022 527.75
AAA 5 1/31/2023 538.00 <--6456/12
AAA 5 2/28/2023 538.00
AAA 5 3/31/2023 538.00
AAA 5 4/30/2023 538.00
AAA 5 5/31/2023 538.00
AAA 5 6/30/2023 538.00
AAA 5 7/31/2023 538.00
AAA 5 8/31/2023 538.00
AAA 5 9/30/2023 538.00
AAA 5 10/31/2023 538.00
AAA 5 11/30/2023 538.00
AAA 5 12/31/2023 538.00
AAA 5 1/31/2024 etc..
AAA 5 2/29/2024
AAA 5 3/31/2024
AAA 5 4/30/2024
AAA 5 5/31/2024
AAA 5 6/30/2024
AAA 5 7/31/2024
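Could you pass the columns into a dictionary and then pass the dictionary to the function? That avoids the limit of 8 parameters, because a dictionary can be set up however you need.
ddonelly is correct that a dictionary or a list gets around the 8-parameter limit for functions, but I don't think that is the right approach here. The following achieves the desired output: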
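A minimal sketch of the dictionary idea, for illustration only. The function f and the keys tok1, tok2, tok3 and scale are hypothetical and do not come from the question's data.

```q
// Hypothetical function taking one dictionary instead of many positional arguments
f:{[d] d[`scale]*sum d`tok1`tok2`tok3}
f[`tok1`tok2`tok3`scale!1 2 3 10]       // 60

// Each row of a table is itself a dictionary, so the same function maps over rows
tt:([] tok1:1 2; tok2:3 4; tok3:5 6; scale:10 100)
f each tt                               // 90 1200
```

Extra metadata then becomes one more key in the dictionary (or one more column in the table) rather than one more parameter.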
t:("SSS";enlist",") 0:`:so.csv;
// This will process each distinct ID separately as the date logic I have here would break if you had a BBB entry that starts date over
{[t]
t:@[{[x;y] select from x where ID = y}[t;]';exec distinct ID from t];
raze {[t]
t:@[t;`strField;{" "vs string x}'];
t:ungroup update`$date from delete strField from @[t;`date`result`year;:;({first x}each t[`strField];"J"${-1_1_x}each t[`strField];
`Y =fills @[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
delete year from ungroup update date:`$'string date from update result:?[year;result%12;result],
date:{x+til count x} each {max($[z;12#(x+12-x mod 12);1#x+1];y)}\[0;"M"$/:raze each reverse each
"/" vs/: string date;year] from t
} each t
}[t]
ID seg date result
AAA 1 2020.08 2333
AAA 1 2020.09 2456
AAA 1 2020.10 2544
AAA 1 2020.11 2632
AAA 1 2020.12 2678
AAA 2 2021.01 3333
AAA 2 2021.02 3456
AAA 2 2021.03 3544
AAA 2 2021.04 3632
AAA 2 2021.05 3678
AAA 3 2021.06 4333
AAA 3 2021.07 4456
AAA 3 2021.08 4544
AAA 3 2021.09 4632
AAA 3 2021.10 4678
AAA 4 2021.11 5333
AAA 4 2021.12 5456
AAA 5 2022.01 527.75
AAA 5 2022.02 527.75
AAA 5 2022.03 527.75
...
AAA 5 2023.01 538
AAA 5 2023.02 538
AAA 5 2023.03 538
AAA 5 2023.04 538
...
AAA 5 2024.01 545.3333
AAA 5 2024.02 545.3333
...
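The output above carries months, whereas the question asked for month-end dates such as 8/31/2020. One possible way to bridge that gap is sketched below. The helper name eom is hypothetical, and the second call assumes that "M"$ parses text of the form "yyyy.mm".

```q
// Hypothetical helper: last calendar day of a q month value.
// Step to the next month, cast to date (first of that month), go back one day.
eom:{-1+"d"$1+x}
eom 2020.08 2021.02 2024.02m      // 2020.08.31 2021.02.28 2024.02.29

// For a symbol column like date above, parse back to month first
// (assumption: "M"$ accepts "yyyy.mm" strings)
eom "M"$string `2020.08`2021.12
```

Leap years need no special case here, because the calendar arithmetic is done by the month-to-date cast.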
Below is a full breakdown of what happens inside the nested function, in case you need it to understand the code.
// vs (vector from scalar) is useful for string manipulation to separate the strField column into a more manageable list of separate strings
t:@[t;`strField;{" "vs string x}'];
// split the strField out to more manageable columns
t:@[t;`date`result`year;:;
// date column from the first part of strField
({first x}each t[`strField];
// result for the actual value fields in the middle
"J"${-1_1_x}each t[`strField];
// year column which is a boolean to indicate special handling is needed.
// I also forward fill to account for rows which are continuation of
// the previous rows time period,
// e.g. if you had 2 or 3 lines in a row of continuous yearly data
`Y =fills @[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
// ungroup to split each result into individual rows
t:ungroup update`$date from delete strField from t;
t:update
// divide yearly rows where necessary with a vector conditional
result:?[year;result%12;result],
// change year into a progressive month list
date:{x+til count x} each
// check if a month exists, if not take previous month + 1.
// If a year, previous month + 12 and convert to Jan
// create a list of Jans for the year which I convert to Jan->Dec above
{max($[z;12#(x+12-x mod 12);1#x+1];y)}\
// reformat date to kdb month to feed with year into the scan iterator above
[0;"M"$/:raze each reverse each "/" vs/: string date;year] from t;
// finally convert date to symbol again to ungroup year rows into individual rows
delete year from ungroup update date:`$'string date from t
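The date expansion in the breakdown relies on two small month idioms, shown here in isolation as a sketch. The variable names m and jan are illustrative.

```q
// Adding til n to a month gives n consecutive months, rolling over the year end
m:2021.11m
m+til 3                // 2021.11 2021.12 2022.01m

// For a yearly value: take twelve copies of January, then offset each by 0..11
jan:12#2022.01m
jan+til count jan      // 2022.01m through 2022.12m
```

This is why the scan only has to produce a starting month (or twelve Januaries) per value; the `{x+til count x}` step turns that into the final run of months.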