Efficient way of crawling file system using threads Java

Question

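I am currently working on a Java project that performs OCR on PDFs in the file system so that their contents can be searched.

In this project, I search inside a folder specified by the user. I extract each PDF's content via OCR and check whether it contains the keywords the user provided.

I am trying to make sure that while OCR is running on a PDF, the crawling or traversal continues (on another thread or a few threads), so that the system's performance is not significantly reduced.

Is there a way to do this? I have included the traversal code I am using below.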

public void traverseDirectory(File[] files) {
    if (files != null) {
        for (File file : files) {
            if (file.isDirectory()) {
                traverseDirectory(file.listFiles());
            } else {
                // Split the name at its last dot to isolate the extension
                String[] type = file.getName().split("\\.(?=[^.]+$)");
                if (type.length > 1) {
                    if (type[1].equals("pdf")) {
                        //checking content goes here
                    }
                }
            }
        }
    }
}

Answer 1

You can simply use Files.walkFileTree:

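// Needs java.nio.file.*, java.nio.file.attribute.BasicFileAttributes and java.util.concurrent.*
ExecutorService executor = Executors.newFixedThreadPool(threadCount);
PdfOcrService service = ...
Path rootPath = Paths.get("/path/to/your/directory");
Files.walkFileTree(rootPath, new SimpleFileVisitor<Path>() {
    @Override
    public FileVisitResult visitFile(Path path, BasicFileAttributes attrs) {
        // Hand the OCR work to the pool so the walk continues immediately
        executor.submit(() -> {
            service.performOcrOnFile(path);
        });
        return FileVisitResult.CONTINUE;
    }
});
// Stop accepting new tasks; OCR tasks already queued still run to completion
executor.shutdown();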
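
For completeness, here is a minimal self-contained sketch of the same idea that also filters for PDFs and collects the matches. The class name PdfCrawler, the method crawl, and the containsKeyword predicate are hypothetical names made up for this illustration. The predicate stands in for whatever OCR and keyword check you actually run, which is not shown in the question. The walk itself stays on the calling thread, and only the per-file check is handed to the pool.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper, not part of any library.
public class PdfCrawler {

    /**
     * Walks root on the calling thread and runs containsKeyword (an assumed
     * stand-in for the OCR + keyword check) on a fixed-size thread pool.
     * Returns the PDF paths for which the predicate returned true.
     */
    public static List<Path> crawl(Path root, int threadCount, Predicate<Path> containsKeyword)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        List<Path> candidates = new ArrayList<>();
        List<Future<Boolean>> results = new ArrayList<>();
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path path, BasicFileAttributes attrs) {
                    String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
                    if (name.endsWith(".pdf")) {
                        // Submit and move on: the walk does not wait for the check
                        Callable<Boolean> task = () -> containsKeyword.test(path);
                        candidates.add(path);
                        results.add(executor.submit(task));
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
            // Only after the whole tree has been walked do we block on the results
            List<Path> matches = new ArrayList<>();
            for (int i = 0; i < results.size(); i++) {
                if (results.get(i).get()) {
                    matches.add(candidates.get(i));
                }
            }
            return matches;
        } finally {
            // Pool threads are non-daemon, so shut the pool down to let the JVM exit
            executor.shutdown();
        }
    }
}
```

Since OCR is usually CPU-bound, a threadCount close to Runtime.getRuntime().availableProcessors() is a reasonable starting point, but that is a guess you should measure against your own workload.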

java

depth-first-search