如何单元测试是否已正确生成 PDF 文件？

Question

我写了一个小的 python 库，它使用 matplotlib 和 seaborn 来绘制图表，我想知道如何测试图表是否像我真正想要的那样。

因此，给定一个我声明为正确的参考 pdf 文件，我将如何自动检查它是否等于一个动态生成的具有虚拟数据的文件？

我认为由于时间戳等原因，对文件进行哈希处理是不可靠的。

Answer 1

一些想法：

使用diff-pdf
将其转换为图像（例如使用 ImageMagick）并使用 PerceptualDiff
以某种方式从 PDF 中获取数据（PyPDF2 也许吧？）并进行比较
使用某些东西（PyPDF2？pdftk？）将 header 信息（如时间戳）修补到文件相等的点并比较哈希值

Answer 2

为了与回归测试一起使用，我编写了 diffpdf.sh 来执行 PDF 的逐页视觉差异。它使用 ImageMagick 和 Poppler PDF 实用程序 pdftoppm 和 pdfinfo.

如果 PDF 显示不同，

diffpdf.sh 将输出非零 return 代码，并打印不同页面的页码，以及反映页面数量的数字不同。每个页面的视觉差异图像也保存到 pdfdiff 目录。

#!/bin/bash

# usage: diffpdf.sh fidle_1.pdf file_2.pdf

# requirements:
# - ImageMagick
# - Poppler's pdftoppm and pdfinfo tools (works with 0.18.4 and 0.41.0,
#                                         fails with 0.42.0)

DIFFDIR="pdfdiff"                        # directory to place diff images in
MAXPROCS=$(getconf _NPROCESSORS_ONLN)    # number of parallel processes

pdf_file1=
pdf_file2=

function diff_page {
    # based on 
    pdf_file1=
    pdf_file2=
    page_number=
    page_index=$(($page_number - 1))

    (cat $pdf_file1 | pdftoppm -f $page_number -singlefile -gray - | convert - miff:- ; \
     cat $pdf_file2 | pdftoppm -f $page_number -singlefile -gray - | convert - miff:- ) | \
    convert - \( -clone 0-1 -compose darken -composite \) \
            -channel RGB -combine $DIFFDIR/$page_number.jpg

    if (($? > 0)); then
        echo "Problem running pdftoppm or convert!"
        exit 1
    fi
    grayscale=$(convert pdfdiff/$page_number.jpg -colorspace HSL -channel g -separate +channel -format "%[fx:mean]" info:)
    if [ "$grayscale" != "0" ]; then
        echo "page $page_number ($grayscale)"
        return 1
    fi
    return 0
}

function num_pages {
    pdf_file=

    pdfinfo $pdf_file | grep "Pages:" | awk '{print }'
}

function minimum {
    echo $((  <  ?  :  ))
}

# guard agains accidental deletion of files in the root directory
if [ -z "$DIFFDIR" ]; then
    echo "DIFFDIR needs to be set!"
    exit 1
fi

echo "Running $MAXPROCS processes in parallel"

pdf1_num_pages=$(num_pages $pdf_file1)
pdf2_num_pages=$(num_pages $pdf_file2)

min_pages=$(minimum $pdf1_num_pages $pdf2_num_pages)

if [ "$pdf1_num_pages" -ne "$pdf2_num_pages" ]; then
    echo "PDF files have different lengths ($pdf1_num_pages and $pdf2_num_pages)"
    rc=1
fi

if [ -d "$DIFFDIR" ]; then
    rm -f $DIFFDIR/*
else
    mkdir $DIFFDIR
fi


# get exit status from subshells (
function wait_for_processes {
    local rc=0

    while (( "$#" )); do
        # wait returns the exit status for the process
        if ! wait ""; then
            rc=1
        fi
        shift
    done
    return $rc
}

function howmany() {
    echo $#
}

rc=0
pids=""
for page_number in `seq 1 $min_pages`;
do
    diff_page $pdf_file1 $pdf_file2 $page_number &
    pids+=" $!"
    if [ $(howmany $pids) -eq "$MAXPROCS" ]; then
        if ! wait_for_processes $pids; then
            rc=1
        fi
        pids=""
    fi
done

if ! wait_for_processes $pids; then
    rc=1
fi

exit $rc

编辑： 这个脚本的改进版本，写在 Python 可以找到 here.

如何单元测试是否已正确生成 PDF 文件？

How can I unittest whether PDF files have been generated correctly?

python

pdf

unit-testing

pytest