KDB - 文本解析和编目文本数据

KDB - Text parsing and cataloging text data

我有由不同周期性字符串组成的数据,这些字符串实际上是一个时间值列表,其中包含一个周期性标志。不幸的是,每个字符串长度可以有不同数量的元素,但不超过 7 个。

以下示例 -(每个字符串末尾的 # 和 #/M 表示这些是月度值)从 2020 年 8 月开始,而 #/Y 是年度数字,因此我们除以 12 例如得到月度值价值。 # 开头简单表示从上一期继续。

从 CSV 复制

ID,seg,strField
AAA,1,8/2020 2333 2456 2544 2632 2678 #/M
AAA,2,# 3333 3456 3544 3632 3678 #
AAA,3,# 4333 4456 4544 4632 4678 #/M
AAA,4,11/2021 5333 5456 #/M
AAA,5,# 6333 6456 6544 6632 6678 #/Y
t:("SSS";enlist",") 0:`:./Data/src/strField.csv; // read in csv data above
t:update result:count[t]#enlist`float$() from t; // initiate empty result column

我通常会标记化,然后将 7 列中的每一列传递给一个函数,但限制是 8 个参数,除了这 7 个参数之外,我还想发送其他元数据。

t:@[t;`tok1`tok2`tok3`tok4`tok5`tok6`tok7;:;flip .Q.fu[{" " vs'x}]t `strField];  

t: ungroup t; 

//Desired result
ID   seg    iDate   result
AAA  1  8/31/2020   2333    
AAA  1  9/30/2020   2456    
AAA  1  10/31/2020  2544    
AAA  1  11/30/2020  2632    
AAA  1  12/31/2020  2678    
AAA  2  1/31/2021   3333    
AAA  2  2/28/2021   3456    
AAA  2  3/31/2021   3544    
AAA  2  4/30/2021   3632    
AAA  2  5/31/2021   3678    
AAA  3  6/30/2021   4333    
AAA  3  7/31/2021   4456    
AAA  3  8/31/2021   4544    
AAA  3  9/30/2021   4632    
AAA  3  10/31/2021  4678    
AAA  4  11/30/2021  5333    
AAA  4  12/31/2021  5456    
AAA  5  1/31/2022    527.75     <-- 6333/12
AAA  5  2/28/2022    527.75     
AAA  5  3/31/2022    527.75     
AAA  5  4/30/2022    527.75     
AAA  5  5/31/2022    527.75     
AAA  5  6/30/2022    527.75     
AAA  5  7/31/2022    527.75     
AAA  5  8/31/2022    527.75     
AAA  5  9/30/2022    527.75     
AAA  5  10/31/2022   527.75     
AAA  5  11/30/2022   527.75     
AAA  5  12/31/2022   527.75     
AAA  5  1/31/2023    538.00     <--6456/12
AAA  5  2/28/2023    538.00     
AAA  5  3/31/2023    538.00     
AAA  5  4/30/2023    538.00     
AAA  5  5/31/2023    538.00     
AAA  5  6/30/2023    538.00     
AAA  5  7/31/2023    538.00     
AAA  5  8/31/2023    538.00     
AAA  5  9/30/2023    538.00     
AAA  5  10/31/2023   538.00     
AAA  5  11/30/2023   538.00     
AAA  5  12/31/2023   538.00     
AAA  5  1/31/2024       etc..
AAA  5  2/29/2024       
AAA  5  3/31/2024       
AAA  5  4/30/2024       
AAA  5  5/31/2024       
AAA  5  6/30/2024       
AAA  5  7/31/2024       
        

能否将列传递到字典中,然后将字典传递到函数中?这避免了最多有 8 个参数的问题,因为字典可以根据需要设置。

ddonelly 是正确的,字典或列表绕过了函数 8 个参数的限制,但我认为这不是正确的方法。下面实现了所需的输出:

t:("SSS";enlist",") 0:`:so.csv;

// This will process each distinct ID separately as the date logic I have here would break if you had a BBB entry that starts date over
{[t]
    
    t:@[{[x;y] select from x where ID = y}[t;]';exec distinct ID from t];  

    raze {[t]
        t:@[t;`strField;{" "vs string x}'];
        t:ungroup update`$date from delete strField from @[t;`date`result`year;:;({first x}each t[`strField];"J"${-1_1_x}each t[`strField];
            `Y =fills @[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
        delete year from ungroup update date:`$'string date from update result:?[year;result%12;result],
            date:{x+til count x} each {max($[z;12#(x+12-x mod 12);1#x+1];y)}\[0;"M"$/:raze each reverse each 
                "/" vs/: string date;year] from t
     } each t
    
    }[t]

ID  seg date    result
AAA 1   2020.08 2333
AAA 1   2020.09 2456
AAA 1   2020.10 2544
AAA 1   2020.11 2632
AAA 1   2020.12 2678
AAA 2   2021.01 3333
AAA 2   2021.02 3456
AAA 2   2021.03 3544
AAA 2   2021.04 3632
AAA 2   2021.05 3678
AAA 3   2021.06 4333
AAA 3   2021.07 4456
AAA 3   2021.08 4544
AAA 3   2021.09 4632
AAA 3   2021.10 4678
AAA 4   2021.11 5333
AAA 4   2021.12 5456
AAA 5   2022.01 527.75
AAA 5   2022.02 527.75
AAA 5   2022.03 527.75
...
AAA 5   2023.01 538
AAA 5   2023.02 538
AAA 5   2023.03 538
AAA 5   2023.04 538
...
AAA 5   2024.01 545.3333
AAA 5   2024.02 545.3333
...

下面是嵌套函数内部发生的事情的完整分解,如果您需要它来理解。

// vs (vector from scalar) is useful for string manipulation to separate the strField column into a more manageable list of seperate strings 
t:@[t;`strField;{" "vs string x}'];

// split the strField out to more manageable columns
t:@[t;`date`result`year;:;
    
    // date column from the first part of strField 
    ({first x}each t[`strField];
    
    // result for the actual value fields in the middle
    "J"${-1_1_x}each t[`strField];
     
    // year column which is a boolean to indicate special handling is needed. 
    // I also forward fill to account for rows which are continuation of 
    // the previous rows time period, 
    // e.g. if you had 2 or 3 lines in a row of continuous yearly data 
    `Y =fills @[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];

// ungroup to split each result into individual rows
t:ungroup update`$date from delete strField from t;

t:update 
    // divide yearly rows where necessary with a vector conditional
    result:?[year;result%12;result],
    
    // change year into a progressive month list
    date:{x+til count x} each 
        
        // check if a month exists, if not take previous month + 1. 
        // If a year, previous month + 12 and convert to Jan
        // create a list of Jans for the year which I convert to Jan->Dec above
        {max($[z;12#(x+12-x mod 12);1#x+1];y)}\
             // reformat date to kdb month to feed with year into the scan iterator above
             [0;"M"$/:raze each reverse each "/" vs/: string date;year] from t;

// finally convert date to symbol again to ungroup year rows into individual rows
delete year from ungroup update date:`$'string date from t