Bash Script for Concatenating Broken Dashed Words

Question

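I've scraped a large amount (10GB) of PDFs and converted them to text files, but because of the format of the original PDFs, there is a problem:

Many words that break across lines contain a dash that artificially splits the word in two.

This happens because the original PDF files break lines in the middle of words.

What is the cleanest, fastest way to "join" every instance of a word matching this pattern in a .txt file?

Perhaps some kind of regex search, such as [a-z]\-\s \w (a word character followed by a dash followed by a space), would work? Or would some kind of sed substitution be better?

At the moment I'm trying a sed regex, but I'm not sure how to turn it into one that uses capture groups to replace the selected text: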

sed -n '\%\w\- [a-z]%p' Filename.txt

My input text looks like this:

The dog rolled down the st- eep hill and pl- ayed outside.

The output would be:

The dog rolled down the steep hill and played outside.

Ideally, the expression would also work for words split by a newline, like this:

The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a

turning it into this:

The rule which provided for the consideration 
of the resolution, was agreed to earlier by a

Answer 1

This is simple in sed:

sed -e ':a' -e '/-$/{N;s/-\n//;ba
}' -e 's/- //g' filename

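This translates roughly to: "If the line ends with a dash, read in the next line as well (so that you have one line with a newline in the middle), then remove the dash and the newline, and loop back to the start in case the new line also ends with a dash. Then remove all instances of "- " (a dash followed by a space)."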
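For readers who prefer to see the loop spelled out, here is a minimal Python sketch of the same logic. The function name join_dashed is hypothetical, and the handling of a dash on the very last line of input (left in place) is an assumption rather than something the answer specifies.

```python
def join_dashed(lines):
    """Merge each line ending in '-' with the next one, then drop every '- '."""
    out = []
    pending = None  # text held back from a line that ended in a dash
    for line in lines:
        line = line.rstrip("\n")
        if pending is not None:
            line = pending + line  # glue the held-back text onto this line
            pending = None
        if line.endswith("-"):
            pending = line[:-1]    # strip the dash and wait for the next line
            continue
        out.append(line.replace("- ", ""))
    if pending is not None:
        out.append(pending + "-")  # assumption: a dangling final dash is kept
    return out


if __name__ == "__main__":
    sample = [
        "The rule which provided for the consid-\n",
        "eration of the resolution, was agreed to earlier by a\n",
    ]
    print("\n".join(join_dashed(sample)))
```

Like the sed command, this removes every "- " it finds, so a dash used as punctuation and followed by a space is removed as well.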

Answer 2

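You simply add backslashed parentheses (or use the -r or -E option, if available, to do away with the requirement to put a backslash before capturing parentheses) and recall the matched text with \1 for the first capturing parenthesis, \2 for the second, and so on:

sed 's/\(\w\)\- \([a-z]\)/\1\2/g' Filename.txt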

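The \w escape is not standard sed, but if it works for you, feel free to use it. Otherwise, it is easy to replace with [A-Za-z0-9_@] or whatever you want to call a "word character".

I'm guessing that not all matches will be hyphenated words, so you may want to run the result through a spelling checker or some other tool to verify that each joined result is an English word. (For that, though, I would probably switch to a more capable scripting language such as Python.)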
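As a rough illustration of that validation idea, the Python sketch below joins the two halves only when the result appears in a word list. KNOWN_WORDS is a tiny hypothetical stand-in for a real dictionary or spell checker, and join_if_word is an illustrative name, not something from the answer.

```python
import re

# Hypothetical stand-in for a real dictionary or spell checker.
KNOWN_WORDS = {"steep", "played", "consideration"}

# Word characters, a dash, some whitespace (including a newline), more word characters.
PATTERN = re.compile(r"(\w+)-\s+(\w+)")


def join_if_word(text, known=KNOWN_WORDS):
    """Join 'st- eep' into 'steep' only if the joined form is a known word."""
    def repl(match):
        joined = match.group(1) + match.group(2)
        # Leave the original text alone when the joined form is not recognised.
        return joined if joined.lower() in known else match.group(0)
    return PATTERN.sub(repl, text)


if __name__ == "__main__":
    print(join_if_word("The dog rolled down the st- eep hill and pl- ayed outside."))
    print(join_if_word("pre- and post-processing"))
```

With this word list, "pre- and" is left alone because "preand" is not a known word. Note that a match across a line break is joined onto the first line, so the line break itself is dropped.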

Answer 3

You can use this gnu-awk code:

cat file

The dog rolled down the st- eep hill and pl- ayed outside.
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a

Then use awk like this:

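awk 'p != "" {
   # a pending prefix exists: move the first word of this line up to it
   w = $1
   $1 = ""
   sub(/^[[:blank:]]+/, ORS)
   $0 = p w $0
   p = ""
}
{
   # join words split by a dash and blanks within a line
   $0 = gensub(/([_[:alnum:]])-[[:blank:]]+([_[:alnum:]])/, "\\1\\2", "g")
}
/-$/ {
   # line ends in a dash: hold it back, minus the dash
   p = $0
   sub(/-$/, "", p)
}
p == ""' file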

The dog rolled down the steep hill and played outside.
The rule which provided for the consideration
of the resolution, was agreed to earlier by a

If you can consider perl, then this may also work for you:

perl -0777 -pe 's/(\w)-\h+(\w)/$1$2/g; s/(\w)-\R(\w+)\s+/$1$2\n/g' file
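The same two substitutions can be sketched in Python, which Answer 2 mentions as an alternative. This is an approximation rather than an exact port: dehyphenate is a hypothetical name, [ \t] stands in for perl's \h, and \r?\n stands in for \R.

```python
import re


def dehyphenate(text):
    """Join words split by a dash, both within a line and across a line break."""
    # Same-line case: "st- eep" becomes "steep".
    text = re.sub(r"(\w)-[ \t]+(\w)", r"\1\2", text)
    # Line-break case: pull the rest of the word up and move the break after it.
    text = re.sub(r"(\w)-\r?\n(\w+)\s+", r"\1\2\n", text)
    return text


if __name__ == "__main__":
    sample = (
        "The dog rolled down the st- eep hill and pl- ayed outside.\n"
        "The rule which provided for the consid-\n"
        "eration of the resolution, was agreed to earlier by a\n"
    )
    print(dehyphenate(sample), end="")
```

Like the perl one-liner with -0777, this works on the whole text at once, so each file has to fit in memory.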

Tags: regex, string, bash, sed, data-cleaning