Extracting text from a txt file
I have a txt file containing records. The records follow this pattern:
six lines, blank line, six lines,
and so on, like this example:
string line 1
string line 2
string line 3
string line 4
string line 5 (year format yyyy)
string line 6 (can use several lines)
<blank space> (always a blank space when a new txt block begins)
string line 1
string line 2
string line 3
string line 4
string line 5 (year format yyyy)
string line 6
Here is a real example. I need the title (line 2) and the year (line 5):
Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572

Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089X
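I want to extract the strings on line 2 and line 5 of every text block (blocks are separated by a blank line) and save them to another txt file in this format: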
string line2 , yyyy
I have no experience with the Linux shell, so I am asking here for some input to help me with this task.
Thanks
Something like:
perl -00 -nE 'my @ln = (split /,\n/)[1,4]; say join(",", @ln)' input.txt > output.txt
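should at least work as a starting point. With -00, perl reads one paragraph (blank-line-separated block) at a time; the script splits each block on the comma that ends a line, then prints the two fields you are after (the second and the fifth) on the same line, separated by a comma.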
If you don't mind the trailing commas on lines 2 and 5, just do:
awk '{print $2, $5}' RS= FS='\n' input > output
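This assumes that the blank lines separating records really are empty and contain no whitespace. If any of those lines contain whitespace, you will need to prefilter the data to remove it.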
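One possible way to do that prefiltering, as a minimal sketch rather than the only approach: let sed turn whitespace-only lines into empty lines before awk reads the data. The function name `extract_title_year` and the tiny demo records are made up for illustration; the awk part is the same command as above.

```shell
# Hypothetical helper: reads records on stdin and prints line 2 and line 5
# of each record, tolerating separator lines that contain spaces or tabs.
extract_title_year() {
    # sed empties every line that consists only of whitespace, so that
    # awk's paragraph mode (RS=) recognises it as a record separator.
    sed 's/^[[:space:]]*$//' | awk '{print $2, $5}' RS= FS='\n'
}

# Demo with made-up records: the separator line holds two spaces.
printf 'A1,\nTitle one,\nJ1,\nV1,\n2019,\nP1\n  \nA2,\nTitle two,\nJ2,\nV2,\n2021,\nP2\n' |
    extract_title_year
```

With a real file this would be run as `extract_title_year < input > output`.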
For example, with the awk command run on the sample input:
$ cat input
Hualong Yu, Geoffrey I. Webb,
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map,
Neurocomputing,
Volume 343,
2019,
Pages 141-153,
ISSN 0925-2312,
https://doi.org/10.1016/j.neucom.2018.11.098.
https://www.sciencedirect.com/science/article/pii/S0925231219301572

Antonino Feitosa Neto, Anne M.P. Canuto,
EOCD: An ensemble optimization approach for concept drift applications,
Information Sciences,
Volume 561,
2021,
Pages 81-100,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2021.01.051.
https://www.sciencedirect.com/science/article/pii/S002002552100089
$ awk '{print $2, $5}' RS= FS='\n' input
Adaptive online extreme learning machine by regulating forgetting factor by concept drift map, 2019,
EOCD: An ensemble optimization approach for concept drift applications, 2021,
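If the trailing commas are not wanted, awk's `sub` can strip them before printing. The following is a sketch under that assumption, producing the `title , yyyy` layout asked for in the question; the function name `title_and_year` and the demo records are hypothetical.

```shell
# Hypothetical helper: prints "title , year" for each blank-line-separated
# record, read from the files given as arguments or from stdin.
title_and_year() {
    awk '{
        title = $2; year = $5
        sub(/,[ \t\r]*$/, "", title)   # drop the comma that ends line 2
        sub(/,[ \t\r]*$/, "", year)    # drop the comma that ends line 5
        print title " , " year
    }' RS= FS='\n' "$@"
}

# Demo with made-up records separated by an empty line.
printf 'A1,\nTitle one,\nJ1,\nV1,\n2019,\nP1\n\nA2,\nTitle two,\nJ2,\nV2,\n2021,\nP2\n' |
    title_and_year
```

On the real data this would be `title_and_year input > output`. Only a comma at the very end of the line is removed, so commas inside a title are left alone.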