Bash: how to get the complete substring of a match in a string?

Question

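I have a TXT file that was sent from a Windows machine and is encoded in ISO-8859-1. My Qt application is supposed to read this file, but QString only supports UTF-8 (I want to avoid working with QByteArray). I have been struggling to find a way to do this in Qt, so I decided to write a small script that does the conversion for me. I could easily write it for my exact case, but I would like to make it more general, so that it works for all ISO-8859 encodings.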

So far I have the following:

#!/usr/bin/env bash

output=$(file -i "$1")

# If the output contains any sort of ISO-8859 substring
if echo "$output" | grep -qi "ISO-8859"; then
  # Retrieve actual encoding
  encoding=...
  # run iconv to convert
  iconv -f "$encoding" "$1" -t UTF-8 -o "$1"
else
  echo "Text file not encoded in ISO-8859"
fi

The part I am struggling with is how to get the complete substring that the grep command has successfully matched.

Suppose I have the file helloworld.txt and it is encoded in ISO-8859-15. In this case

$~: ./fixEncodingToUtf8 helloworld.txt
stations.txt: text/plain; charset=iso-8859-15

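would be the output in the terminal. Internally, grep finds iso-8859 (because I use the -i flag, it processes the input case-insensitively). At this point the script needs to "extract" the whole substring, that is, not just iso-8859 but iso-8859-15, and store it in the encoding variable for later use with iconv (which is case-insensitive when it comes to encoding names, phew!).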
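One possible way to pull out exactly the text that matched, sketched here on the assumption that a grep with the -o option (GNU or BSD) is available: -o prints only the matching part of a line rather than the whole line. The sample string is the output line quoted above. The optional numeric suffix in the pattern and the use of tail are illustrative choices, not part of the original script.

```shell
#!/usr/bin/env bash
# Sample line in the format printed by `file -i` (copied from the question).
output="stations.txt: text/plain; charset=iso-8859-15"

# -o prints only the matched text, -i ignores case, -E enables extended regex.
# The pattern covers the "iso-8859" prefix plus an optional "-<part number>".
# The charset field comes last on the line, so keep the last match in case
# the file name itself happens to contain "iso-8859".
encoding=$(printf '%s\n' "$output" | grep -oiE 'iso-8859(-[0-9]+)?' | tail -n 1)

echo "$encoding"   # prints iso-8859-15
```

If nothing matches, grep prints nothing and encoding ends up empty, which the script can test with [ -n "$encoding" ] before calling iconv.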

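Note: The script above could be extended further by simply retrieving the value after charset and using it as encoding. However, this has one huge flaw: what if the input file is encoded with a character set larger than UTF-8 (simple examples: UTF-16 and UTF-32)?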

Answer 1

You can use cut or awk to get it:

awk:

encoding=$(echo $output | awk -F"=" '{print $2}')

cut:

encoding=$(echo $output | cut -d"=" -f2)

I think you could feed this directly into your iconv command and reduce your script to:

iconv -f "$(file -i "$1" | cut -d"=" -f2)" -t UTF-8 "$1"
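A variation on the awk approach, offered as a sketch only: awk can both check for the ISO-8859 prefix and print the charset in one step, which would replace the separate grep test in the question's script. The helper name iso_charset_of and the second sample line are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the charset only when it is an ISO-8859 variant.
# Splitting on the literal text "charset=" leaves the encoding name in $2.
iso_charset_of() {
  printf '%s\n' "$1" |
    awk -F'charset=' 'tolower($2) ~ /^iso-8859/ { print $2 }'
}

encoding=$(iso_charset_of "stations.txt: text/plain; charset=iso-8859-15")
echo "$encoding"   # prints iso-8859-15

# A non-ISO charset yields an empty string, so the caller can skip iconv.
other=$(iso_charset_of "notes.txt: text/plain; charset=utf-16le")
[ -z "$other" ] && echo "not ISO-8859"
```

Splitting on "charset=" instead of a bare "=" also keeps the result correct when the file name itself contains an equals sign.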

Answer 2

Or use bash's built-in parameter expansion, as shown below:

$ str="stations.txt: text/plain; charset=iso-8859-15"
$ echo "${str#*=}"
iso-8859-15

To store it in a variable:

$ myvar="${str#*=}"
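As a small illustration of how the expansion behaves, the sketch below compares # (remove the shortest matching prefix) with ## (remove the longest). The file name containing "=" is a made-up edge case, chosen to show where the two differ.

```shell
#!/usr/bin/env bash
# Hypothetical file name that itself contains "=".
str="a=b.txt: text/plain; charset=iso-8859-15"

short="${str#*=}"          # removes the shortest prefix ending in "="
long="${str##*=}"          # removes the longest prefix ending in "="
named="${str##*charset=}"  # anchors on the field name itself

printf '%s\n' "$short" "$long" "$named"
# prints:
#   b.txt: text/plain; charset=iso-8859-15
#   iso-8859-15
#   iso-8859-15
```

For the plain stations.txt line all three forms give the same result. The ## forms are simply the more defensive choice when file names are not under your control.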

Answer 3

Well, in this case extracting the substring is rather pointless, because file can print just the encoding on its own:

$ file --brief --mime-encoding "$1"
iso-8859-15

From the file manual:

-b, --brief
        Do not prepend filenames to output lines (brief mode).
...
--mime-type, --mime-encoding
        Like -i, but print only the specified element(s).
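Putting this together, the whole script might look roughly like the sketch below. It is not the asker's final version: the function names is_iso8859 and convert_to_utf8 and the ".utf8" output suffix are assumptions made for illustration. Writing to a separate file avoids reading and overwriting the same file in one iconv call.

```shell
#!/usr/bin/env bash
# Sketch: convert a file to UTF-8 only if `file` reports an ISO-8859 encoding.

# Succeeds when the given encoding name starts with "iso-8859" (any case).
is_iso8859() {
  case "$1" in
    [Ii][Ss][Oo]-8859*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: writes the converted text to "<input>.utf8".
convert_to_utf8() {
  local encoding
  encoding=$(file --brief --mime-encoding "$1") || return 1
  if is_iso8859 "$encoding"; then
    iconv -f "$encoding" -t UTF-8 "$1" > "$1.utf8"
  else
    echo "Text file not encoded in ISO-8859 (detected: $encoding)" >&2
    return 1
  fi
}

# Run only when a file name was passed on the command line.
if [ "$#" -gt 0 ]; then
  convert_to_utf8 "$1"
fi
```

Because the check only lets names starting with iso-8859 through, files reported as utf-16le, utf-8 or binary are left untouched, which addresses the concern raised in the question's note.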
