Is there a Bash function that allows me to separate/delete/isolate lines from a file when they have the same first word?

Question

I have a text file like this:

id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg

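If two lines have the same ID, I want to separate the lines whose ID is shared from the lines whose ID is unique.

uniquefile contains the lines with a unique ID. notuniquefile contains the lines whose ID is not unique.

I have already found a way that almost does it, but it only works on the first word. Basically it just isolates the ID and drops the rest of the line.

Command 1: isolate the unique IDs (but the rest of the line is missing):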

awk -F ";" '{!seen[$1]++};END{for(i in seen) if(seen[i]==1)print i }' originfile >> uniquefile

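Command 2: isolate the non-unique IDs (but the rest of the line is missing, and the "lorem ipsum" content, which may differ from line to line, is lost):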

awk -F ":" '{!seen[$1]++;!ligne[$0]};END{for(i in seen) if(seen[i]>1)print i  }' originfile >> notuniquefile

So in a perfect world, I would like you to help me get a result like this:

originfile:

1 ; toto
2 ; toto
3 ; toto
3 ; titi
4 ; titi

uniquefile:

1 ; toto
2 ; toto
4 ; titi

notuniquefile:

3 ; toto
3 ; titi

Have a nice day.

Answer 1

Here is a small Python script that does this:

#!/usr/bin/env python3

import sys

unique_markers = []
unique_lines = []
nonunique_markers = set()
for line in sys.stdin:
  marker = line.split(' ')[0]
  if marker in nonunique_markers:
    # found a line which is not unique
    print(line, end='', file=sys.stderr)
  elif marker in unique_markers:
    # found a double
    index = unique_markers.index(marker)
    print(unique_lines[index], end='', file=sys.stderr)
    print(line, end='', file=sys.stderr)
    del unique_markers[index]
    del unique_lines[index]
    nonunique_markers.add(marker)
  else:
    # marker not known yet
    unique_markers.append(marker)
    unique_lines.append(line)
for line in unique_lines:
  print(line, end='', file=sys.stdout)

This is not a pure shell solution (which IMHO would be cumbersome and hard to maintain), but maybe it helps you.

Call it like this:

separate_uniq.py < original.txt > uniq.txt 2> nonuniq.txt

Answer 2

Untested: process the file twice. The first pass counts the IDs; the second decides where to print each record:

awk -F';' '
    NR == FNR      {count[$1]++; next}
    count[$1] == 1 {print > "uniquefile"}
    count[$1]  > 1 {print > "nonuniquefile"}
' file file
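
The same two-pass idea can be sketched in Python for readers who prefer it to awk. This is only a minimal sketch, not the answerer's code: the helper name `split_by_id_count`, the `sep` parameter, and the decision to strip whitespace around the ID are assumptions made for illustration. It expects a list, because the lines are traversed more than once.

```python
from collections import Counter


def split_by_id_count(lines, sep=";"):
    """Return (unique, nonunique) lists of lines, keyed on the text before sep.

    Hypothetical helper: its name and the whitespace stripping are assumptions.
    `lines` must be a list, since it is iterated several times.
    """
    def key(line):
        # The ID is everything before the first separator, without padding.
        return line.split(sep, 1)[0].strip()

    # First pass: count how often each ID occurs.
    counts = Counter(key(line) for line in lines)
    # Second pass: route each line by its ID's count, keeping input order.
    unique = [line for line in lines if counts[key(line)] == 1]
    nonunique = [line for line in lines if counts[key(line)] > 1]
    return unique, nonunique


if __name__ == "__main__":
    sample = ["1 ; toto", "2 ; toto", "3 ; toto", "3 ; titi", "4 ; titi"]
    unique, nonunique = split_by_id_count(sample)
    print(unique)
    print(nonunique)
```

Like the awk version, this keeps the lines in their original order and does not require sorted input.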

Answer 3

With a pure bash script, you can do it like this:

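duplicate_file="duplicates.txt"
unique_file="unique.txt"
file="${unique_file}"
# -f: do not fail if the output files do not exist yet
rm -f "${duplicate_file}" "${unique_file}"
last_id=""
# sort so that lines with equal IDs end up next to each other
sort testfile.txt | (
    while IFS=";" read -r id line ; do
      echo "${id}"   # progress output: the ID just read
      if [[ "${last_id}" != "" ]] ; then
          if [[ "${last_id}" != "${id}" ]] ; then
             # ID changed: flush the previous line, then assume the new ID is unique
             echo "${last_id};${last_line}" >> "${file}"
             file="${unique_file}"
          else
             # same ID as the previous line: both belong to the duplicates
             file="${duplicate_file}"
             echo "${last_id};${last_line}" >> "${file}"
          fi
      fi
      last_line="${line}"
      last_id="${id}"
    done
    # flush the last line that was read
    echo "${last_id};${last_line}" >> "${file}"
)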

The input file is:

1;line A
2;line B
2;line C
3;line D
3;line E
3;line F
4;line G

It outputs:

$ cat duplicates.txt 
2;line B
2;line C
3;line D
3;line E
3;line F
$ cat unique.txt
1;line A
4;line G

Answer 4

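Another approach with only two Unix commands, which works if your ID field always has the same length (assuming a length of 1 as in my test data, but of course it also works for longer fields):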

# feed the sorted testfile.txt to uniq (-w and -D are GNU uniq options)
# -w 1 means: only compare the first 1 character of each line
# -D means: print all duplicate lines (every line of a group, not just one)
sort testfile.txt | uniq -w 1 -D > duplicates.txt

# then filter all duplicate lines out of the text file
# so that only the unique lines slip through
# -v means: negate the pattern
# -F means: use fixed strings instead of regex
# -f means: load the patterns from a file
grep -v -F -f duplicates.txt testfile.txt > unique.txt

And the output is (for the same input lines as used in my other post):

$ uniq -w 2 -D testfile.txt
2;line B
2;line C
3;line D
3;line E
3;line F

and:

$ grep -v -F -f duplicates.txt testfile.txt 
1;line A
4;line G

By the way, if you want to avoid grep, you can also store the sorted output (say, in sorted_file.txt) and replace the second command with

uniq -w 1 -u sorted_file.txt > unique.txt

The number after -w is again the character length of your ID field.
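
If the IDs do not all have the same length, a fixed -w width no longer fits. The sort-then-group idea can still be sketched in Python by keying on the text before the first ";" instead of on a character count. This is a sketch under assumptions, not part of the answer: the helper name `split_sorted_groups`, the `sep` parameter, and the whitespace stripping are all hypothetical choices.

```python
from itertools import groupby


def split_sorted_groups(lines, sep=";"):
    """Return (unique, duplicates), grouping adjacent lines with equal IDs.

    Hypothetical helper: the ID is the text before the first sep, stripped.
    """
    def key(line):
        return line.split(sep, 1)[0].strip()

    unique, duplicates = [], []
    # Sorting by ID makes equal IDs adjacent, as sort does before uniq.
    for _, group in groupby(sorted(lines, key=key), key=key):
        group = list(group)
        # A group of one line plays the role of uniq -u, larger groups of uniq -D.
        (unique if len(group) == 1 else duplicates).extend(group)
    return unique, duplicates


if __name__ == "__main__":
    sample = ["1;line A", "2;line B", "2;line C", "3;line D",
              "3;line E", "3;line F", "4;line G"]
    unique, duplicates = split_sorted_groups(sample)
    print("\n".join(duplicates))
    print("\n".join(unique))
```

Note that the output follows the sorted order of the IDs (compared as strings), not the original order of the file.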
