Only take most recent line from CSV when a value appears twice

Question

I am processing a CSV file in Mule that may look something like this:

ID|LastUpdated
01|01/12/2016 09:00:00
01|01/12/2016 09:45:00
02|01/12/2016 09:00:00
02|01/12/2016 09:45:00
03|01/12/2016 09:00:00

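I am trying to find a way to remove all duplicate occurrences of an ID value, keeping only the most recent row as determined by the LastUpdated column. I have been trying to do this with DataWeave, but so far without success. I am open to writing the logic in a custom Java class, but my knowledge of how to do that is limited as well.

The output I want looks something like this: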

ID|LastUpdated
01|01/12/2016 09:45:00
02|01/12/2016 09:45:00
03|01/12/2016 09:00:00

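Any help or guidance would be greatly appreciated.

Edit: it is worth noting that I expect the inbound file to be very large (up to 000 rows), so I need to be aware of how my solution performs.

Edit: a solution using DataWeave can be found on the Mulesoft forum here.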

Answer 1

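If the dates/hours are always sorted within your CSV, as in the example you gave, you can store every ID as a key in a Map and simply overwrite the value for that ID each time it appears again: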

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeepLastLine {
    public static void main(String[] arg) {
        // This list stands in for all the CSV reading, for the sake of the example
        List<String> lines = new ArrayList<>();
        lines.add("01|01/12/2016 09:00:00");
        lines.add("01|01/12/2016 09:45:00");
        lines.add("02|01/12/2016 09:00:00");
        lines.add("02|01/12/2016 09:45:00");
        lines.add("03|01/12/2016 09:00:00");

        // LinkedHashMap keeps the IDs in the order they were first seen
        Map<String, String> lastLines = new LinkedHashMap<>();
        for (String s : lines) { // iterate over the CSV lines here
            int sep = s.indexOf('|');
            String id = s.substring(0, sep);
            String val = s.substring(sep + 1);
            // A later line with the same ID overwrites the earlier one
            lastLines.put(id, val);
        }
        for (Map.Entry<String, String> entry : lastLines.entrySet()) {
            System.out.println(entry.getKey() + "|" + entry.getValue());
        }
    }
}

This produces:

01|01/12/2016 09:45:00
02|01/12/2016 09:45:00
03|01/12/2016 09:00:00
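The question mentions that the inbound file may be very large, so it may help to apply the same "last line wins" idea while streaming the file instead of loading every line into a list first. The following is only a minimal sketch under the same sorted-input assumption; the class name `LastLinePerId` and the helper `lastLinePerId` are hypothetical names chosen for illustration, and the `StringReader` stands in for the real file (a `FileReader` or the Mule payload stream would take its place).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class LastLinePerId {

    // Hypothetical helper: reads pipe-delimited lines one at a time,
    // skips the header row, and keeps the last value seen for each ID.
    // Memory use grows with the number of distinct IDs, not with the number of lines.
    static Map<String, String> lastLinePerId(BufferedReader reader) {
        Map<String, String> lastLines = new LinkedHashMap<>();
        reader.lines()
              .skip(1) // skip the "ID|LastUpdated" header
              .forEach(line -> {
                  int sep = line.indexOf('|');
                  if (sep > 0) { // ignore blank or malformed lines
                      lastLines.put(line.substring(0, sep), line.substring(sep + 1));
                  }
              });
        return lastLines;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the CSV file, using the sample data from the question
        String csv = "ID|LastUpdated\n"
                + "01|01/12/2016 09:00:00\n"
                + "01|01/12/2016 09:45:00\n"
                + "02|01/12/2016 09:00:00\n"
                + "02|01/12/2016 09:45:00\n"
                + "03|01/12/2016 09:00:00\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            lastLinePerId(reader)
                    .forEach((id, val) -> System.out.println(id + "|" + val));
        }
    }
}
```

Each line costs one map lookup, so the run time should scale roughly linearly with the file size.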

If the CSV records can come in any order, then you need to add a date check so that only the most recent date is kept for each ID.

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeepLatestDate {
    // HH is the 24-hour clock; hh (12-hour) would misread times such as 12:30:00
    private static final SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy HH:mm:ss");

    public static void main(String... args) {
        // This list stands in for all the CSV reading, for the sake of the example
        List<String> lines = new ArrayList<>();
        lines.add("01|01/12/2016 09:45:00");
        lines.add("01|01/12/2016 09:00:00");
        lines.add("02|01/12/2016 09:00:00");
        lines.add("02|01/12/2016 09:45:00");
        lines.add("03|01/12/2016 09:00:00");

        // LinkedHashMap keeps the IDs in the order they were first seen
        Map<String, String> lastLines = new LinkedHashMap<>();
        for (String s : lines) { // iterate over the CSV lines here
            int sep = s.indexOf('|');
            String id = s.substring(0, sep);
            String val = s.substring(sep + 1);
            if (lastLines.containsKey(id)) {
                try {
                    Date storeDate = sdf.parse(lastLines.get(id));
                    Date readDate = sdf.parse(val);
                    // Replace the stored value only if this line is more recent
                    if (readDate.after(storeDate)) {
                        lastLines.put(id, val);
                    }
                } catch (ParseException pe) {
                    pe.printStackTrace();
                }
            } else {
                lastLines.put(id, val);
            }
        }
        for (Map.Entry<String, String> entry : lastLines.entrySet()) {
            System.out.println(entry.getKey() + "|" + entry.getValue());
        }
    }
}

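I'm not sure which date format you are actually using, so you may need to change the pattern passed to SimpleDateFormat (note that HH is the 24-hour clock, while hh is the 12-hour clock). You can find the documentation here.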
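For the unsorted case, the same comparison could also be written with the java.time API (Java 8 and later), which avoids sharing a non-thread-safe SimpleDateFormat. This is only a sketch and not part of the original answer: the class name `LatestPerId` and helper `latestPerId` are hypothetical, and the pattern `dd/MM/yyyy HH:mm:ss` is an assumption about the file, as noted above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestPerId {

    // Assumed pattern: day/month/year with a 24-hour clock
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss");

    // Hypothetical helper: keeps the most recent timestamp for each ID,
    // regardless of the order in which the lines arrive.
    static Map<String, LocalDateTime> latestPerId(List<String> lines) {
        Map<String, LocalDateTime> latest = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf('|');
            String id = line.substring(0, sep);
            LocalDateTime readDate = LocalDateTime.parse(line.substring(sep + 1), FORMAT);
            // merge() stores the value if the ID is new, otherwise keeps the later date
            latest.merge(id, readDate,
                    (stored, read) -> read.isAfter(stored) ? read : stored);
        }
        return latest;
    }

    public static void main(String[] args) {
        // Sample data lines (header already removed), deliberately out of order
        List<String> lines = Arrays.asList(
                "01|01/12/2016 09:45:00",
                "01|01/12/2016 09:00:00",
                "02|01/12/2016 09:00:00",
                "02|01/12/2016 09:45:00",
                "03|01/12/2016 09:00:00");
        latestPerId(lines)
                .forEach((id, date) -> System.out.println(id + "|" + FORMAT.format(date)));
    }
}
```

Storing the parsed `LocalDateTime` rather than the raw string means each line is parsed only once, which may matter for a large file.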

Answer 2

Just saw this; I believe @danw also asked this question on the Mule forum. There is a better way to achieve it using DataWeave. See my answer on the Mule forum: http://forums.mulesoft.com/questions/40897/only-take-most-recent-line-from-csv-when-a-value-a.html#answer-40975

Tags: java, csv, duplicates, mule, dataweave