Text Search based algorithm not behaving as intended

Question

Update

I have updated the question with newer code suggested by fellow SO users, and I will clarify any ambiguous text that was there before.

Update #2

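I only have access to the log files generated by the application in question. I can therefore only work within the content of the log file, and no solution outside that scope is possible. I will modify the sample data slightly. I would like to point out the following key variables.

Thread ID - Ranges from 0..19 - A thread is used multiple times. Thus ScriptExecThread(2) could show up multiple times within the log.

Script - Every thread runs a script on a particular file. The same script may run on the same thread more than once, but never on the same thread AND file.

File - Every Thread ID runs a Script on a File. If Thread(10) is running myscript.script on myfile.file, then that EXACT line will not be executed again. A successful run for that example would look like this: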

------START------

Thread(10) starting myscript.script on myfile.file

Thread(10) finished myscript.script on myfile.file

------END-------

An unsuccessful run for the same example would be:

------START------

Thread(10) starting myscript.script on myfile.file

------END------
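
The pairing rule described above (one execution is identified by thread ID, script, and file together) can be sketched as follows. This is only an illustration under assumptions: the line shape `Thread(N) starting|finished SCRIPT on FILE` is taken from the two examples above, and the class and method names are hypothetical.

```java
import java.util.*;
import java.util.regex.*;

final class ExecutionKeySketch {
    // Assumed line shape, copied from the illustration above; real log lines may differ.
    private static final Pattern LINE = Pattern.compile(
        "Thread\\((\\d+)\\) (starting|finished) (\\S+) on (\\S+)");

    /** Returns the "starting" lines that have no matching "finished" line. */
    static List<String> unfinished(List<String> lines) {
        // Key = thread ID + script + file; value = the "starting" line itself.
        Map<String, String> open = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) continue; // skip lines that do not match the assumed shape
            String key = m.group(1) + "|" + m.group(3) + "|" + m.group(4);
            if (m.group(2).equals("starting")) open.put(key, line);
            else open.remove(key);
        }
        return new ArrayList<>(open.values());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "Thread(10) starting myscript.script on myfile.file",
            "Thread(10) finished myscript.script on myfile.file",
            "Thread(11) starting myscript.script on other.file");
        // prints [Thread(11) starting myscript.script on other.file]
        System.out.println(unfinished(log));
    }
}
```

Because the key includes the script and the file, a reused thread ID does not hide a later execution that never finished.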

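Before getting to my query, I will give a rundown of the code used and the desired behavior.

Summary

I am currently parsing large log files (100k to 600k lines on average) and am trying to retrieve certain information in a certain order. I have worked out the boolean algebra behind my request, which seemed to work on paper but not so much in code (I must have missed something blatantly obvious). I would like to say in advance that the code is not optimized in any shape or form; right now I simply want to get it to work.

In this log file you can see that certain threads hang: they start but never finish. The possible thread IDs span a range (0..19, as noted above). Here is some pseudocode: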

    REGEX = "ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)" // in Java
    Set started, finished
    for (int i = log.size() - 1; i >= 0; i--) {
        if (group(2).contains("starting"))
            started.add(log.get(i))
        else if (group(2).contains("finished"))
            finished.add(log.get(i))
    }
    started.removeAll(finished);

Search Hung Threads

Set<String> started = new HashSet<String>(), finished = new HashSet<String>();
            
for(int i = JAnalyzer.csvlog.size()-1; i >= 0; i--) {
    if(JAnalyzer.csvlog.get(i).contains("ScriptExecThread")) 
        JUtility.hasThreadHung(JAnalyzer.csvlog.get(i), started, finished);     
}
started.removeAll(finished);
            
commonTextArea.append("Number of threads hung: " + noThreadsHung + "\n");
for(String s : started) { 
    JLogger.appendLineToConsole(s);
    commonTextArea.append(s+"\n");
}

Has Thread Hung

public static boolean hasThreadHung(final String str, Set<String> started, Set<String> finished) {      
    Pattern r = Pattern.compile("ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)");
    Matcher m = r.matcher(str);
    boolean hasHung = m.find();
    
        if(m.group(2).contains("starting"))
            started.add(str);
        else if (m.group(2).contains("finished"))
            finished.add(str);
        
        System.out.println("Started size: " + started.size());
        System.out.println("Finished size: " + finished.size());
        
    return hasHung;
}

Example Data

ScriptExecThread(1) started on afile.xyz

ScriptExecThread(2) started on bfile.abc

ScriptExecThread(3) started on cfile.zyx

ScriptExecThread(4) started on dfile.zxy

ScriptExecThread(5) started on efile.yzx

ScriptExecThread(1) finished on afile.xyz

ScriptExecThread(2) finished on bfile.abc

ScriptExecThread(3) finished on cfile.zyx

ScriptExecThread(4) finished on dfile.zxy

ScriptExecThread(5) finished on efile.yzy

ScriptExecThread(1) started on bfile.abc

ScriptExecThread(2) started on dfile.zxy

ScriptExecThread(3) started on afile.xyz

ScriptExecThread(1) finished on bfile.abc

END OF LOG

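If you take this as an example, you will notice that threads 2 and 3 started but failed to finish (the reason is not needed, I simply need to get the line).

Sample Data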

09.08 15:06.53, ScriptExecThread(7),Info,########### starting

09.08 15:06.54, ScriptExecThread(18),Info,###################### starting

09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########

09.08 15:06.54, ScriptExecThread(13),Info,########## starting

09.08 15:06.55, ScriptExecThread(9),Info,##### finished in ########

09.08 15:06.55, ScriptExecThread(0),Info,####finished in ###########

09.08 15:06.55, ScriptExecThread(19),Info,#### finished in ########

09.08 15:06.55, ScriptExecThread(8),Info,###### finished in 2777 #########

09.08 15:06.55, ScriptExecThread(19),Info,########## starting

09.08 15:06.55, ScriptExecThread(8),Info,####### starting

09.08 15:06.55, ScriptExecThread(0),Info,##########starting

09.08 15:06.55, ScriptExecThread(19),Info,Post ###### finished in #####

09.08 15:06.55, ScriptExecThread(0),Info,###### finished in #########

09.08 15:06.55, ScriptExecThread(19),Info,########## starting

09.08 15:06.55, ScriptExecThread(0),Info,########### starting

09.08 15:06.55, ScriptExecThread(9),Info,########## starting

09.08 15:06.56, ScriptExecThread(1),Info,####### finished in ########

09.08 15:06.56, ScriptExecThread(17),Info,###### finished in #######

09.08 15:06.56, ScriptExecThread(17),Info,###################### starting

09.08 15:06.56, ScriptExecThread(1),Info,########## starting

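Currently the code simply displays every "starting" line in the entire log file. That does somewhat make sense when I review the code.

I have removed any redundant information that I don't wish to display. If there is anything I might have left out, feel free to let me know and I will add it.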

Answer 1

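If I understand correctly, you have large files and are trying to find patterns of the form "X started (but no mention of X finished)" for all numerical values of X.

If I were to do this, I would use this pseudocode: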

Pattern p = Pattern.compile(
   "ScriptExecThread\\(([0-9]+).*?(finished|started)");
Set<Integer> started, finished;
Search for p; for each match m,
     int n = Integer.parseInt(m.group(1));
     if (m.group(2).equals("started")) started.add(n);
     else finished.add(n);
started.removeAll(finished); // found 'em: contains started-but-not-finished

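This requires a single regex pass over each file and an O(size-of-finished) set subtraction; it should be 20x faster than your current approach. The regex uses alternation (|) to look for both alternatives at once, which reduces the number of passes.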
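
To make the alternation concrete, here is a minimal sketch of what the two capture groups hold for a single line. It assumes the keyword "starting" that appears in the real log excerpt in the question (the pseudocode above says "started"), and the class and method names are hypothetical.

```java
import java.util.regex.*;

final class AlternationSketch {
    // Compiled once and reused for every line.
    static final Pattern P = Pattern.compile(
        "ScriptExecThread\\(([0-9]+).*?(finished|starting)");

    /** Returns "id:event" for a matching line, or null when the line does not match. */
    static String classify(String line) {
        Matcher m = P.matcher(line);
        // group(1) is the thread ID, group(2) is whichever alternative matched.
        return m.find() ? m.group(1) + ":" + m.group(2) : null;
    }

    public static void main(String[] args) {
        // prints 7:starting
        System.out.println(classify(
            "09.08 15:06.53, ScriptExecThread(7),Info,########### starting"));
        // prints 13:finished
        System.out.println(classify(
            "09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########"));
        // prints null
        System.out.println(classify("unrelated line"));
    }
}
```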

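Edit: made the regex explicit. Compiling the regex once instead of once per line should shave off some extra run-time.

Edit 2: implemented the pseudocode, works for me.

Edit 3: replaced the implementation to show file and line. This reduces memory requirements (the whole file is not loaded into memory), but printing the line does require all "start" lines to be stored.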

import java.io.*;
import java.util.*;
import java.util.regex.*;

public class T {

    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> started = new HashMap<Integer, String>();
        Set<Integer> finished = new HashSet<Integer>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting")) started.put(n, d);
                else finished.add(n);
            }                
        }
        for (int f : finished) {
            started.remove(f);
        }
        return started.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

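Input: the sample data above, passed as the first argument in a file called "data.txt". The same data in another file called "data2.txt" is passed as the second argument (javac T.java && java T data.txt data2.txt). Output: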

data.txt: '    09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data.txt: '    09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
data2.txt: '    09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data2.txt: '    09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished

Answer 2

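Why not solve the problem another way? If all you want is hung threads, you can take thread stacks programmatically. You can also use an external tool, but I assume doing it inside your own JVM is easiest. Then expose that as an API, or regularly save date-time-stamped files containing thread dumps. Another program then only needs to analyze the thread dumps. If the same thread is at the same spot (same stack trace, or the same top 3-5 functions) across thread dumps, there is a good chance it is hung.

There are tools that help you analyze the dumps: https://www.google.com/search?q=java+thread+dump+tool+open+source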
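
A minimal sketch of the snapshot comparison described in this answer. It has to run inside the JVM being monitored, which may not be possible when only log files are available. The class and method names are hypothetical, and "same top frames in two snapshots" is only a hint, since an idle thread waiting on a queue looks the same.

```java
import java.util.*;

final class StackSnapshotSketch {
    /** Captures the top `depth` frames of every live thread, keyed by thread ID. */
    static Map<Long, List<String>> snapshot(int depth) {
        Map<Long, List<String>> result = new HashMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            List<String> frames = new ArrayList<>();
            for (int i = 0; i < Math.min(depth, stack.length); i++) {
                frames.add(stack[i].toString());
            }
            result.put(e.getKey().getId(), frames);
        }
        return result;
    }

    /** IDs of threads present in both snapshots with identical, non-empty top frames. */
    static Set<Long> unchanged(Map<Long, List<String>> before,
                               Map<Long, List<String>> after) {
        Set<Long> same = new TreeSet<>();
        for (Map.Entry<Long, List<String>> e : before.entrySet()) {
            List<String> later = after.get(e.getKey());
            if (later != null && !later.isEmpty() && later.equals(e.getValue())) {
                same.add(e.getKey());
            }
        }
        return same;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Long, List<String>> first = snapshot(5);
        Thread.sleep(1000); // the interval between dumps is an arbitrary choice
        Map<Long, List<String>> second = snapshot(5);
        System.out.println("Possibly stuck thread IDs: " + unchanged(first, second));
    }
}
```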

Answer 3

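Keeping two separate sets of started and finished threads (as @tucuxi describes) can't work. If a thread with ID 5 starts, runs, and finishes, then 5 will stay in the finished set forever. If another thread with ID 5 then starts and hangs, it will not be reported.

Suppose for a moment, though, that thread IDs are not reused and every thread ever created receives a new ID. By keeping separate started and finished sets, you will have hundreds of thousands of elements in each by the time you are done. Performance of those data structures is proportional to what they hold at the time of the operation. Performance is unlikely to matter for your use case, but it might if you were performing more expensive operations or processing data 100 times larger.

With the preamble out of the way, here is a working version of @tucuxi's code: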
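
The failure mode can be reproduced with three lines of log. This is a small sketch of the two-set approach only; the log lines are made up in the shape of the question's excerpt, and the class and method names are hypothetical.

```java
import java.util.*;
import java.util.regex.*;

final class ReusedIdSketch {
    private static final Pattern P = Pattern.compile(
        "ScriptExecThread\\(([0-9]+).*?(finished|starting)");

    /** Two independent sets, subtracted at the end: the order of events is ignored. */
    static Set<Integer> hungByTwoSets(List<String> lines) {
        Set<Integer> started = new TreeSet<>();
        Set<Integer> finished = new TreeSet<>();
        for (String line : lines) {
            Matcher m = P.matcher(line);
            if (!m.find()) continue;
            int id = Integer.parseInt(m.group(1));
            if (m.group(2).equals("starting")) started.add(id);
            else finished.add(id);
        }
        started.removeAll(finished);
        return started;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "ScriptExecThread(5),Info,a starting",
            "ScriptExecThread(5),Info,a finished in 10",
            "ScriptExecThread(5),Info,b starting"); // this run never finishes
        // prints [] although thread 5 is hung on its second run
        System.out.println(hungByTwoSets(log));
    }
}
```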

import java.util.*;
import java.io.*;
import java.util.regex.*;

public class T {
    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> running = new HashMap<Integer, String>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting"))
                    running.put(n, d);
                else
                    running.remove(n);
            }
        }
        return running.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

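Notice that I have dropped the finished set, and the HashMap is now called running. When a new thread starts, it goes in; when a thread finishes, it is pulled out. This means the HashMap will always be the size of the number of currently running threads, which is always less than (or equal to) the total number of threads ever run. So the operations on it will be faster. (As a pleasant side effect, you can now track how many threads are running on a log-line-by-log-line basis, which might help determine why the threads are hanging.)

Here is the Python program I used to generate huge test cases: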
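
The side effect mentioned in the parentheses can be sketched like this: record the size of the running set after every log line. This is an illustration, not part of the answer's program; the class and method names are hypothetical.

```java
import java.util.*;
import java.util.regex.*;

final class RunningCountSketch {
    private static final Pattern P = Pattern.compile(
        "ScriptExecThread\\(([0-9]+).*?(finished|starting)");

    /** Returns, for each input line, how many threads are running after that line. */
    static List<Integer> runningCounts(List<String> lines) {
        Set<Integer> running = new HashSet<>();
        List<Integer> counts = new ArrayList<>();
        for (String line : lines) {
            Matcher m = P.matcher(line);
            if (m.find()) {
                int id = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting")) running.add(id);
                else running.remove(id);
            }
            // Lines that do not match leave the count unchanged.
            counts.add(running.size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "ScriptExecThread(1),Info,a starting",
            "ScriptExecThread(2),Info,b starting",
            "ScriptExecThread(1),Info,a finished in 3",
            "some unrelated line");
        System.out.println(runningCounts(log)); // prints [1, 2, 1, 1]
    }
}
```

A count that climbs and never comes back down around a given timestamp is a reasonable place to start looking for the hang.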

#!/usr/bin/python

from random import random, choice
from datetime import datetime
import tempfile

all_threads = set([])
running = []
hung = []
filenames = { }

target_thread_count = 16
hang_chance = 0.001

def log(id, msg):
    now = datetime.now().strftime("%m:%d %H:%M:%S")
    print("%s, ScriptExecThread(%i),Info,%s" % (now, id, msg))

def new_thread():
    if len(all_threads)>0:
        for t in range(0, 2+max(all_threads)):
            if t not in all_threads:
                all_threads.add(t)
                return t
    else:
        all_threads.add(0)
        return 0

for i in range(0, 100000):
    if len(running) > target_thread_count:
        new_thread_chance = 0.25
    else:
        new_thread_chance = 0.75
        pass

    if random() < new_thread_chance:
        t = new_thread()
        name = next(tempfile._get_candidate_names())+".txt"
        filenames[t] = name
        log(t, "%s starting" % (name,))
        if random() < hang_chance:
            hung.append(t)
        else:
            running.append(t)
    elif len(running)>0:
        victim = choice(running)
        all_threads.remove(victim)
        running.remove(victim)
        log(victim, "%s finished" % (filenames[victim],))

Answer 4

removeAll will never work.
hasThreadHung is storing the entire string.
So the values in started will never match the values in finished.

You want to do something like this:
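
A small sketch of why the subtraction removes nothing. It reuses two lines from the question's sample data for thread 13; the class and method names are hypothetical.

```java
import java.util.*;

final class WholeLineSketch {
    /** Returns what is left of `started` after removing every element of `finished`. */
    static Set<String> subtract(Set<String> started, Set<String> finished) {
        Set<String> left = new HashSet<>(started);
        left.removeAll(finished);
        return left;
    }

    public static void main(String[] args) {
        // Whole lines: the two strings for thread 13 differ, so nothing is removed.
        Set<String> startedLines = Collections.singleton(
            "09.08 15:06.54, ScriptExecThread(13),Info,########## starting");
        Set<String> finishedLines = Collections.singleton(
            "09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########");
        System.out.println(subtract(startedLines, finishedLines).size()); // prints 1

        // Thread IDs only: the keys are equal, so the entry is removed.
        System.out.println(subtract(Collections.singleton("13"),
                                    Collections.singleton("13")).size()); // prints 0
    }
}
```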

class ARecord {
    // Proper encapsulation of the members omitted for brevity
    String thread;
    String line;
    public ARecord (String thread, String line) {
        this.thread = thread;
        this.line = line;
    }
    @Override
    public int hashCode() {
        return thread.hashCode();
    }
    @Override
    public boolean equals(Object o) {
        return o instanceof ARecord && thread.equals(((ARecord) o).thread);
    }
}

Then in hasThreadHung, create an ARecord and add it to the Sets.
For example:

started.add(new ARecord(m.group(1), str)); // group(1) is the thread ID; started and finished become Set<ARecord>

In searchHungThreads you will retrieve each ARecord from started and output it as:

for(ARecord rec : started) { 
    JLogger.appendLineToConsole(rec.line);
    commonTextArea.append(rec.line+"\n");
}

