Bash Script to Read File, Sort and Print Duplicate Records, and their Identity Number

Question

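I have a file with thousands of records, grouped into subgroups by the first 6 digits of the identity number they share, but some records are duplicated. I am trying to write a bash script that reads the file, finds the duplicate records and the identity number they share, and prints each identity number with its duplicate records underneath.

Current script: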

#!/bin/bash
########## script to find duplicate records & their ID
INPUT="sourceFile.txt"
while read varName; do
  echo "$varName"
  if [ "$varName" = "NEXT" ]; then
    sort $INPUT | uniq -d
    echo "END OF ONE ID-NUMBER IN FILE"
  fi
done < "$INPUT"

Sample INPUT_FILE:

NEXT
123456-
# requesting: displayName
displayName: Alpha Beta
displayName: Charly Delta Echo
displayName: Xerox Yingyang Zenox
displayName: Xerox Yingyang Zenox

NEXT
123999-
# requesting: displayName
displayName: Golf Harvey Indigo
displayName: Jaguar Kingston Lambda
displayName: Alma Nano Matter
displayName: Oxygen Pascal Queen
displayName: Romeo Saint Tropez Unicorn
displayName: Vauxhall Wellignton Woolwhich
displayName: Rodrigo Compton Hilside
displayName: Vauxhall Wellignton Woolwhich
NEXT

Desired/expected output:

NEXT
123456-
displayName: Xerox Yingyang Zenox
displayName: Xerox Yingyang Zenox

END OF ONE ID-NUMBER IN FILE

NEXT
123999-
displayName: Vauxhall Wellignton Woolwhich
displayName: Vauxhall Wellignton Woolwhich

Thanks in advance for any ideas and leads.

Answer 1

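I have no idea why you want the duplicated lines printed twice, and I don't understand what the line "END OF ONE ID-NUMBER IN FILE" is doing in the middle of the output.

The following shows just the duplicates. It reads its input from standard input.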

#! /bin/bash

read -r next; unset next
while true; do
  read -r id || break
  read -r comment; unset comment
  dns=()
  while read -r dn; do
    if [[ $dn =~ ^NEXT$ ]]; then
      printf 'NEXT\n'
      printf '%s\n' "$id"
      printf '%s\n' "${dns[@]}" | sort | uniq -d
      break
    else
      dns+=("$dn")
    fi
  done
done
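
The core of the inner loop is the pipeline `printf '%s\n' "${dns[@]}" | sort | uniq -d`, which prints one copy of each value that occurs more than once in the current group. A minimal sketch that isolates it (the function name `dups_in_group` is made up for illustration and is not part of the script above):

```shell
#!/bin/bash
# dups_in_group: hypothetical helper, not part of the answer's script.
# Prints each argument that occurs more than once, one copy per value.
dups_in_group() {
  # uniq -d only compares adjacent lines, so the input must be sorted first
  printf '%s\n' "$@" | sort | uniq -d
}

dups_in_group \
  "displayName: Alpha Beta" \
  "displayName: Xerox Yingyang Zenox" \
  "displayName: Xerox Yingyang Zenox"
# prints: displayName: Xerox Yingyang Zenox
```

Because only one group's lines are ever fed to `sort`, duplicates are detected per identity number rather than across the whole file.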

If you really want to hard-code the name of the input file, you can add the following line at the beginning:

exec < sourceFile.txt
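
To illustrate what that line does: `exec` with only a redirection replaces standard input for the rest of the shell, so every later `read` takes its lines from the file. A small sketch, assuming a throwaway temp file and a made-up function name `read_first_two`; the subshell keeps the redirection from leaking into the caller:

```shell
#!/bin/bash
# read_first_two: hypothetical helper used only to demonstrate `exec < file`.
read_first_two() {
  (
    exec < "$1"        # from here on, stdin of this subshell is the file
    read -r first      # first line of the file
    read -r second     # second line of the file
    printf '%s,%s\n' "$first" "$second"
  )
}

tmp=$(mktemp)
printf 'NEXT\n123456-\n' > "$tmp"
read_first_two "$tmp"    # prints: NEXT,123456-
rm -f "$tmp"
```

Without the `exec` line, the same effect comes from redirecting when the script is invoked, which keeps the file name out of the script.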

Answer 2

Your sort obviously sorts the entire file every time, not just the current group. I would refactor this into a simple Awk script.


awk '/^NEXT/ { delete a;
      if(NR>1) { print ""; print "END OF ONE ID-NUMBER IN FILE"; print ""; }
      id=""; print; next }
    id == "" { id = $0; print; next }
    !/^displayName:/ { next }
    $0 in a { print; if (a[$0] == 1) print; }
    { a[$0]++ }' sourceFile.txt

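This should be reasonably straightforward once you are familiar with the basics of Awk. In brief, we keep an associative array a in which we remember the displayName: lines we have already seen; when we see a duplicate, we print the original line (if it has not been printed yet) and then the most recent occurrence.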
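
The associative-array idea can be tried in isolation on a few throwaway lines. This is only a sketch: the function name `print_dups` and the array name `seen` are chosen for illustration, and the NEXT/ID handling is left out on purpose.

```shell
#!/bin/bash
# print_dups: hypothetical helper showing the "already seen" idiom on stdin.
print_dups() {
  awk '$0 in seen { print; if (seen[$0] == 1) print }  # 2nd hit prints twice
       { seen[$0]++ }'                                  # count every line
}

printf '%s\n' a b a c a | print_dups
# prints "a" three times: twice at the second occurrence (original plus
# duplicate), once more at the third occurrence
```

Note that `$0 in seen` only tests for membership and does not create an entry, which is why the counter is incremented in a separate rule.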

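Some of this is slightly ugly because your requirements are not very attractive; perhaps a better design would be to print only the actual duplicates, each on the same line as its associated ID number.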

awk '/^NEXT/ { delete a; id=""; next }
    id == "" { id = $0; next }
    !/^displayName:/ { next }
    $0 in a { if(a[$0] == 1) print id ":" $0 }
    { a[$0]++ }' sourceFile.txt

The mere fact that a duplicate exists is enough here, so we print only on the second occurrence of a record within a group.
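
As a quick check of this second variant, here is a sketch that runs it on a shortened copy of the sample input. The wrapper name `dups_with_id` and the temp file are assumptions made for the demo; the awk program itself is the one from this answer.

```shell
#!/bin/bash
# dups_with_id: hypothetical wrapper around the awk program; $1 is the input file.
dups_with_id() {
  awk '/^NEXT/ { delete a; id=""; next }       # new group: forget seen lines
       id == "" { id = $0; next }              # first line after NEXT is the ID
       !/^displayName:/ { next }               # skip comments and blank lines
       $0 in a { if (a[$0] == 1) print id ":" $0 }
       { a[$0]++ }' "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
NEXT
123456-
# requesting: displayName
displayName: Alpha Beta
displayName: Xerox Yingyang Zenox
displayName: Xerox Yingyang Zenox
NEXT
123999-
# requesting: displayName
displayName: Vauxhall Wellignton Woolwhich
displayName: Rodrigo Compton Hilside
displayName: Vauxhall Wellignton Woolwhich
NEXT
EOF

dups_with_id "$sample"
# prints:
# 123456-:displayName: Xerox Yingyang Zenox
# 123999-:displayName: Vauxhall Wellignton Woolwhich
rm -f "$sample"
```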
