速度优化树数据解析器

Speed-optimizing tree data parser

我正在处理输入格式如下的作业,我必须尽快解析它:

5 (
 5 (
  3 (
  )
 )
 3 (
  3 (
  )
  3 (
  )
 )
 5 (
  2 (
  )
  4 (
  )
 )
)

是"Employees"的树形结构,编号为后续任务(语言索引)

每个员工可以有任意数量的下属和一个上级(根节点是"Boss")。

这是我的解析器:(最初我使用 Scanner,它简短而简单,但速度慢了大约两倍)

// Invocation
// Employee boss = collectEmployee(null, 0, reader);

private Employee collectEmployee(final Employee parent, int indent, final Reader r) throws IOException
{
    final StringBuilder sb = new StringBuilder();
    boolean nums = false;
    while (true) {
        char c = (char) r.read();
        if (c == 10 || c == 13) continue; // newline
        if (c == ' ') {
            if (nums) break;
        } else {
            nums = true;
            sb.append(c);
        }
    }
    final int lang = Integer.parseInt(sb.toString());
    final Employee self = new Employee(lang, parent);

    r.skip(1); // opening paren
    int spaces = 0;
    while (true) {
        r.mark(1);
        int i = r.read();
        char c = (char) i;
        if (c == 10 || c == 13) continue; // newline
        if (c == ' ') {
            spaces++;
        } else {
            if (spaces == indent) {
                break; // End of this employee
            } else {
                spaces = 0; // new line.
                r.reset();
                self.add(collectEmployee(self, indent + 1, r));
            }
        }
    }
    return self; // the root employee for this subtree
}

我需要再削减几个代码周期,这样它才能通过严格的要求。我已经对它进行了分析,这部分确实是降低应用程序速度的原因。输入文件最多可以有 30 MiB,所以任何一点改进都会有很大的不同。

任何想法表示赞赏。谢谢。

(为了完整起见,此处提供了 Scanner 实现 - 它可以让您了解我是如何解析它的)

private Employee collectEmployee(final Employee parent, final Scanner sc)
{
    final int lang = Integer.parseInt(sc.next());
    sc.nextLine(); // trash the opening parenthesis

    final Employee self = new Employee(lang, parent);

    while (sc.hasNextInt()) {
        Employee sub = collectEmployee(self, sc);
        self.add(sub);
    }

    sc.nextLine(); // trash the closing parenthesis

    return self;
}
  1. 您正在使用 StringBuilder 进行大量数据推送 — 保留您在遇到十进制字符时更新的 int 值可能是有益的 ('0'- '9') (num = num * 10 + (c - '0')) 和 storing/resetting 遇到非小数。这样你也可以摆脱 Integer.parseInt.

  2. 你似乎是层次结构的 using/checking 缩进,但你的输入格式包含大括号,这使得它成为基于 S-Expression 的语法——所以你的解析器做了比需要更多的工作(您可以忽略空格并使用一堆 Employees 来处理大括号)。

  3. 我会考虑使用 JMH 基准和 运行 以及 perf-asm(如果可用)来查看您的代码将时间花在哪里。真的,这是一个非常宝贵的工具。

正确的实现应该真正使用状态机和 Builder。不确定 more/less 这有多有效,但它肯定有助于以后的增强和一些真正的简单性。

static class Employee {

    final int language;
    final Employee parent;
    final List<Employee> children = new ArrayList<>();

    public Employee(int language, Employee parent) {
        this.language = language;
        this.parent = parent;
    }

    @Override
    public String toString() {
        StringBuilder s = new StringBuilder();
        s.append(language);
        if (!children.isEmpty()) {
            for (Employee child : children) {
                s.append("(").append(child.toString()).append(")");
            }
        } else {
            s.append("()");
        }
        return s.toString();
    }

    static class Builder {

        // Make a boss to wrap the data.
        Employee current = new Employee(0, null);
        // The number that is growing into the `language` field.
        StringBuilder number = new StringBuilder();
        // Bracket counter - not sure if this is necessary.
        int brackets = 0;
        // Current state.
        State state = State.Idle;

        enum State {

            Idle {

                        @Override
                        State next(Builder builder, char ch) {
                            // Any digits kick me into Number state.
                            if (Character.isDigit(ch)) {
                                return Number.next(builder, ch);
                            }
                            // Watch for brackets.
                            if ("()".indexOf(ch) != -1) {
                                return Bracket.next(builder, ch);
                            }
                            // No change - stay as I am.
                            return this;
                        }
                    },
            Number {

                        @Override
                        State next(Builder builder, char ch) {
                            // Any non-digits treated like an idle.
                            if (Character.isDigit(ch)) {
                                // Store it.
                                builder.number.append(ch);
                            } else {
                                // Now we have his number - make the new employee.
                                builder.current = new Employee(Integer.parseInt(builder.number.toString()), builder.current);
                                // Clear the number for next time around.
                                builder.number.setLength(0);
                                // Remember - could be an '('.
                                return Idle.next(builder, ch);
                            }
                            // No change - stay as I am.
                            return this;
                        }
                    },
            Bracket {

                        @Override
                        State next(Builder builder, char ch) {
                            // Open or close.
                            if (ch == '(') {
                                builder.brackets += 1;
                            } else {
                                builder.brackets -= 1;
                                // Keep that child.
                                Employee child = builder.current;
                                // Up to parent.
                                builder.current = builder.current.parent;
                                // Add the child.
                                builder.current.children.add(child);
                            }
                            // Always back to Idle after a bracket.
                            return Idle;
                        }
                    };

            abstract State next(Builder builder, char ch);
        }

        Builder data(String data) {
            for (int i = 0; i < data.length(); i++) {
                state = state.next(this, data.charAt(i));
            }
            return this;
        }

        Employee build() {
            // Current should hold the boss.
            return current;
        }
    }
}

static String testData = "5 (\n"
        + " 5 (\n"
        + "  3 (\n"
        + "  )\n"
        + " )\n"
        + " 3 (\n"
        + "  3 (\n"
        + "  )\n"
        + "  3 (\n"
        + "  )\n"
        + " )\n"
        + " 5 (\n"
        + "  2 (\n"
        + "  )\n"
        + "  4 (\n"
        + "  )\n"
        + " )\n"
        + ")";

public void test() throws IOException {
    Employee e = new Employee.Builder().data(testData).build();
    System.out.println(e.toString());
    File[] ins = Files.listFiles(new File("C:\Temp\datapub"),
            new FileFilter() {

                @Override
                public boolean accept(File file) {
                    return file.getName().endsWith(".in");
                }

            });
    for (File f : ins) {
        Employee.Builder builder = new Employee.Builder();
        String[] lines = Files.readLines(f);
        ProcessTimer timer = new ProcessTimer();
        for (String line : lines) {
            builder.data(line);
        }
        System.out.println("Read file " + f + " took " + timer);
    }
}

打印

0(5(5(3()))(3(3())(3()))(5(2())(4())))

请注意 0 第一个元素是您提到的 boss

嗯,基础知识是读取和解析,以及您如何处理数据。

通过递归下降的读取和解析应该完全是 IO 绑定的。 它应该 运行 只需阅读字符所需时间的一小部分。

你如何处理数据取决于你如何设计数据结构。 如果你不小心,你可能会在内存管理上花费比你想要的更多的时间。

无论如何,这是一个用 C++ 编写的非常简单的解析器。您可以将其转换为您喜欢的任何语言。

void scanWhite(const char* &pc){while(WHITE(*pc)) pc++;}

bool seeChar(const char* &pc, char c){
  scanWhite(pc);
  if (*pc != c) return False;
  pc++;
  return True;
}

bool seeNum((const char* &pc, int &n){
  scanWhite(pc);
  if (!DIGIT(*pc)) return False;
  n = 0; while(DIGIT(*pc)) n = n * 10 + (*pc++ - '0');
  return True;
}

// this sucks up strings of the form: either nothing or number ( ... )
bool readNumFollowedByList(const char* &pc){
  int n = 0;
  if (!seeNum(pc, n)) return False;
  // what you do with this number and what follows is up to you
  // if you hit the error, print a message and throw to the top level
  if (!seeChar(pc, LP)){ /* ERROR - NUMBER NOT FOLLOWED BY LEFT PAREN */ }
  // read any number of number ( ... )
  while(readNumFollowedByList(*pc)); // <<-- note the recursion
  if (!seeChar(pc, RP)){ /* ERROR - MISSING RIGHT PAREN */ }
  return True; 
}