Extract GZIP content to a String for large byte data in Java

Question

I have a large string that is GZIP-compressed and stored in the database as a BLOB. When I fetch it from the database, I can recover the string like this:

        try (
             ByteArrayInputStream bis = new ByteArrayInputStream(bytes);
             BufferedInputStream bufis = new BufferedInputStream(new GZIPInputStream(bis));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()
        ) {
            byte[] buf = new byte[4096];
            int len;
            while ((len = bufis.read(buf)) > 0) {
                bos.write(buf, 0, len);
            }
            retval = bos.toString();
        }

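My problem is that for some input records the BLOB is very large, yet I only need to grep about 5-6 lines out of it. I also have to process these records in bulk, which increases the memory footprint.

Is there a way to extract the content from the GZIP data in chunks, so that if I find those lines in the initial portion I can discard all the remaining chunks?

Thanks in advance for your help.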

Answer 1

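Do not read all the bytes of the BLOB into memory at once. Read your BLOB as an InputStream.

Use a BufferedReader to read and check one line at a time.

A BufferedReader wraps another Reader. To turn the decompressed InputStream into a Reader, use an InputStreamReader. It is important to specify the charset of the text being decompressed; you do not want to rely on the default charset of whatever machine you happen to be running on, because it can differ from one environment to another.

So it would look like this: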

List<String> matchingLines = new ArrayList<>();
String targetToMatch = "pankaj";

try (BufferedReader lines = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                blob.getBinaryStream()),
            StandardCharsets.UTF_8))) {

    String line;
    while ((line = lines.readLine()) != null) {
        if (line.contains(targetToMatch)) {
            matchingLines.add(line);
        }
    }
}

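Since you mentioned grep, you could also use a regular expression to match lines, although for performance reasons I prefer String.contains over a regular expression unless you really need one.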

List<String> matchingLines = new ArrayList<>();
Matcher matcher = Pattern.compile("(?i)pankaj.*ar").matcher("");

try (BufferedReader lines = new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                blob.getBinaryStream()),
            StandardCharsets.UTF_8))) {

    String line;
    while ((line = lines.readLine()) != null) {
        if (matcher.reset(line).find()) {
            matchingLines.add(line);
        }
    }
}
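
The loops above read every line of the BLOB. Since the question asks about discarding the rest once the wanted lines have been found, here is a minimal sketch of stopping early. The class name `GzipLineSearch`, the method `findFirstMatches`, and the `maxMatches` parameter are hypothetical names introduced for illustration, and the `gzip` helper only exists to produce test data in place of a real BLOB. The sketch assumes the text is UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipLineSearch {

    // Hypothetical helper: collects at most maxMatches lines containing target.
    // The loop exits as soon as enough lines are found, so the remainder of the
    // compressed stream is never decompressed (apart from what is already buffered).
    static List<String> findFirstMatches(InputStream gzipped, String target, int maxMatches)
            throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader lines = new BufferedReader(
                new InputStreamReader(
                    new GZIPInputStream(gzipped),
                    StandardCharsets.UTF_8))) {
            String line;
            while (matches.size() < maxMatches && (line = lines.readLine()) != null) {
                if (line.contains(target)) {
                    matches.add(line);
                }
            }
        } // closing the reader also closes the underlying stream
        return matches;
    }

    // Stand-in for the database: compresses a string into GZIP bytes for testing.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }
}
```

With a real BLOB, the call would presumably look like `GzipLineSearch.findFirstMatches(blob.getBinaryStream(), "pankaj", 6)`. Memory use then depends on the reader's buffers and the matched lines rather than on the size of the decompressed content.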
