Extracting table data from unevenly spaced text file

         CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender   
Anna (+)            USA        A1          First (100)      Female
(04)                California V
ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender
Bob (-)             USA        A2          First (99)       Male
(07)                Florida    VI
Eva (+)             USA        A4          Second (96)      Female
(12)                Ohio       V           English (99)
                                           Maths(100)
Other records are not available currently.Some records may be present which can be given on request.

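I used pdftotext to get a text file from a PDF. Using the AWK command below I got the data above.
The table data is separated by an uneven number of spaces
Lines that are entirely uppercase should be removed
Everything after the table content (the last line) should be removed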

pdftotext -layout INPUTFILE.pdf INPUTFILE.txt
awk '/RESULTS/{flag=1;next}/OTHER DATA/{flag=0}flag' INPUTFILE.txt | column -ts $'\t' -n


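How can I get the table data in tab-delimited format (the format below)?
The code should be written in a generic way so that it also works for other kinds of tables.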

Name (Roll no) #    Location    Section     Rank (MARKS)    Gender  
Anna (+)            USA         A1          First (100)     Female
(04)                California  V
Bob (-)             USA         A2          First (99)      Male
(07)                Florida     VI
Eva (+)             USA         A4          Second (96)     Female
(12)                Ohio        V           English (99)
                                            Maths (100)

It looks like the extracted data is in fixed-width format once the unwanted lines are removed. You could try:

txt = """CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender   
Anna (+)            USA        A1          First (100)      Female
(04)                California V
ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender
Bob (-)             USA        A2          First (99)       Male
(07)                Florida    VI
Eva (+)             USA        A4          Second (96)      Female
(12)                Ohio       V           English (99)
                                           Maths(100)
Other records are not available currently.Some records may be present which can be given on request"""

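# Slice every line at the fixed column offsets 0, 20, 31, 43 and 60 read off the header.
# [1:-1] drops the title line and the trailing note; all-uppercase lines are skipped.
data = [[line[:20], line[20:31], line[31:43], line[43:60], line[60:]]
        for line in txt.split('\n')[1:-1] if line != line.upper()]    # add .strip() to each slice to remove the surrounding white space
del data[3]   # Remove the repeated header for the additional records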

>>> for line in data:
...     print(line)

# ['Name (Roll no) #    ', 'Location   ', 'Section     ', 'Rank (MARKS)     ', 'Gender   ']
# ['Anna (+)            ', 'USA        ', 'A1          ', 'First (100)      ', 'Female']
# ['(04)                ', 'California ', 'V', '', '']
# ['Bob (-)             ', 'USA        ', 'A2          ', 'First (99)       ', 'Male']
# ['(07)                ', 'Florida    ', 'VI', '', '']
# ['Eva (+)             ', 'USA        ', 'A4          ', 'Second (96)      ', 'Female']
# ['(12)                ', 'Ohio       ', 'V           ', 'English (99)', '']
# ['                    ', '           ', '            ', 'Maths(100)', '']
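The slice offsets above are hard-coded for this one table. A minimal Python sketch of a more generic variant follows, assuming that every header label contains at most single spaces and that labels are separated by two or more spaces. The function names `column_starts`, `split_fixed` and `table_to_tsv` are hypothetical helpers introduced for illustration; the trailing note is assumed to have been dropped by the caller, as in the code above.

```python
import re

def column_starts(header):
    """Start offset of each header label (a label may contain single spaces)."""
    return [m.start() for m in re.finditer(r"\S+(?: \S+)*", header)]

def split_fixed(line, starts):
    """Cut a line at the given offsets and strip the padding of each field."""
    ends = starts[1:] + [None]
    return [line[a:b].strip() for a, b in zip(starts, ends)]

def table_to_tsv(lines, header_start="Name"):
    """Join the fields of every table row with tabs.

    Offsets are recomputed at each header line; only the first header is kept.
    Lines before the first header and all-uppercase lines are skipped.
    """
    starts, out = None, []
    for line in lines:
        if line.startswith(header_start):
            first = starts is None
            starts = column_starts(line)
            if not first:
                continue
        elif starts is None or line == line.upper():
            continue
        out.append("\t".join(split_fixed(line, starts)))
    return "\n".join(out)

lines = [
    "CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED",
    "Name (Roll no) #    Location   Section     Rank (MARKS)     Gender",
    "Anna (+)            USA        A1          First (100)      Female",
    "(04)                California V",
    "ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST",
    "Name (Roll no) #    Location   Section     Rank (MARKS)     Gender",
    "Bob (-)             USA        A2          First (99)       Male",
    "(07)                Florida    VI",
]
print(table_to_tsv(lines))
```

Taking the offsets from the header rather than splitting each row on runs of spaces matters for rows such as the California one, where only a single space separates two fields.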

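The method I present here is an awk approach, in which I make the following assumptions:

  • The header line "Name (Roll no) ... Gender" can appear multiple times.
  • The rows under a header line have fixed field widths, but the widths are not known in advance. I infer this from the line containing California, since that word is followed by only a single space.
  • The field widths can change after each header line.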

In awk we can use the internal variable FIELDWIDTHS to set fixed field widths:

FIELDWIDTHS # A space-separated list of columns that tells gawk how to split input with fixed columnar boundaries. Starting in version 4.2, each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. Assigning a value to FIELDWIDTHS overrides the use of FS and FPAT for field splitting. See Constant Size for more information.

note: this is a gawk extension

To determine the value of FIELDWIDTHS, we use the match() function together with RSTART:

RSTART The start index in characters of the substring that is matched by the match() function (see String Functions). RSTART is set by invoking the match() function. Its value is the position of the string where the matched substring starts, or zero if no match was found.
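The width arithmetic can be illustrated outside awk with a small Python sketch. It assumes the header labels are known in advance and, like the awk code in this answer, fixes the last field at 6 characters; `field_widths` is a hypothetical helper, not part of gawk. Because `str.index` is 0-based while RSTART is 1-based, the sketch adds 1 to mimic RSTART.

```python
def field_widths(header, labels, last_width=6):
    """Build a FIELDWIDTHS-style string from the positions of header labels."""
    # 1-based start positions, analogous to RSTART after match($0, label)
    starts = [header.index(label) + 1 for label in labels]
    widths = [starts[0] - 1]                               # first field ends where the second starts
    widths += [b - a for a, b in zip(starts, starts[1:])]  # distance between neighbouring starts
    widths.append(last_width)                              # assumed width of the final field
    return " ".join(str(w) for w in widths)

header = "Name (Roll no) #    Location   Section     Rank (MARKS)     Gender"
print(field_widths(header, ["Location", "Section", "Rank", "Gender"]))
# -> 20 11 12 17 6
```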

Putting this together in awk gives the following (note that OFS is set to | to demonstrate that the fields are split correctly):

awk 'BEGIN{OFS="|"}
     /^[- A-Z]*$/{next}          # skip all-caps lines
     /^Other records/{next}      # skip the trailing note
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= (i2-1)" "(i3-i2)" "(i4-i3)" "(i5-i4)" 6"
       $0=$0                     # reprocess header line with the new widths
       # print header line only the first time
       if (v==0) {print $1,$2,$3,$4,$5}
       v++; next      
     }
     {print $1,$2,$3,$4,$5}' <file>

This already outputs:

Name (Roll no) #    |Location   |Section     |Rank (MARKS)     |Gender
Anna (+)            |USA        |A1          |First (100)      |Female
(04)                |California |V||
Bob (-)             |USA        |A2          |First (99)       |Male
(07)                |Florida    |VI||
Eva (+)             |USA        |A4          |Second (96)      |Female
(12)                |Ohio       |V           |English (99)|
                    |           |            |Maths(100)|

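Remark: at this point it already looks "OK", but take into account that the columns after each header line do not need to have the same width (assumption 3).

You want a tab-delimited column system, but tabs are evil: everything depends on how your system interprets the width of a tab. Is it 4, 8, or 17? What I show here is a space-separated system instead. The best approach is to strip all trailing spaces from each field and then use the column command. This leads to: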

awk 'BEGIN{OFS="|"}
     /^[- A-Z]*$/{next}          # skip all-caps lines
     /^Other records/{next}      # skip the trailing note
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= (i2-1)" "(i3-i2)" "(i4-i3)" "(i5-i4)" 6"
       $0=$0                     # reprocess header line with the new widths
       # strip trailing spaces from every field
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       # print header line only the first time
       if (v==0) {print $1,$2,$3,$4,$5}
       v++; next      
     }
     {
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       print $1,$2,$3,$4,$5
     }' <file> | column -t -s '|'

This outputs:

Name (Roll no) #  Location    Section  Rank (MARKS)  Gender  
Anna (+)          USA         A1       First (100)   Female  
(04)              California  V                              
Bob (-)           USA         A2       First (99)    Male    
(07)              Florida     VI                             
Eva (+)           USA         A4       Second (96)   Female  
(12)              Ohio        V        English (99)          
                                       Maths(100)          
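What the final alignment step does can be approximated with a short Python sketch: measure the widest entry in each column, then pad every cell to that width. This is only an illustration of the idea under the assumption of a two-space gap between columns, not a reimplementation of the column utility; `align` is a hypothetical helper.

```python
def align(rows, gap=2):
    """Pad each column to its widest cell and join columns with `gap` spaces."""
    ncols = max(len(r) for r in rows)
    rows = [r + [""] * (ncols - len(r)) for r in rows]       # fill short rows
    widths = [max(len(r[i]) for r in rows) for i in range(ncols)]
    sep = " " * gap
    return [sep.join(c.ljust(w) for c, w in zip(r, widths)).rstrip() for r in rows]

rows = [
    ["Name (Roll no) #", "Location", "Section", "Rank (MARKS)", "Gender"],
    ["Anna (+)", "USA", "A1", "First (100)", "Female"],
    ["(04)", "California", "V", "", ""],
]
for line in align(rows):
    print(line)
```

Because the widths are measured from the data itself, a second table with different field widths would still come out aligned.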

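Note that column adjusts the columns as needed, so they do not have to have the same width every time. If you know the column widths, I would suggest using a printf statement in awk instead, which gives: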

awk 'BEGIN{format="%-18s%-12s%-9s%-14s%-6s\n"}
     /^[- A-Z]*$/{next}          # skip all-caps lines
     /^Other records/{next}      # skip the trailing note
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= (i2-1)" "(i3-i2)" "(i4-i3)" "(i5-i4)" 6"
       $0=$0                     # reprocess header line with the new widths
       # strip trailing spaces so that printf controls the padding
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       # print header line only the first time
       if (v==0) {printf format,$1,$2,$3,$4,$5}
       v++; next      
     }
     {
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       printf format,$1,$2,$3,$4,$5
     }' <file>

This gives as output:

Name (Roll no) #  Location    Section  Rank (MARKS)  Gender
Anna (+)          USA         A1       First (100)   Female
(04)              California  V                            
Bob (-)           USA         A2       First (99)    Male  
(07)              Florida     VI                           
Eva (+)           USA         A4       Second (96)   Female
(12)              Ohio        V        English (99)        
                                       Maths(100)
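As a footnote to the remark about tabs: if tab-delimited output is still required, the visual alignment depends on the tab stops of whatever displays it. A minimal Python sketch of that effect, using the standard `str.expandtabs` method; `second_field_offset` is a hypothetical helper, and the two rows are taken from the Location and Section columns above.

```python
def second_field_offset(row, tabsize):
    """Screen column where the field after the first tab starts."""
    first = row.split("\t")[0]
    # expandtabs pads the text up to the next multiple of tabsize
    return len((first + "\t").expandtabs(tabsize))

rows = ["USA\tA1", "California\tV"]
for tabsize in (4, 8, 17):
    offsets = [second_field_offset(r, tabsize) for r in rows]
    print(tabsize, offsets)
```

With tab stops every 4 or 8 characters the second field starts at different screen columns in the two rows, because "California" runs past the first tab stop; only a tab width larger than the longest field lines them up.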