Bits and Pieces: Extract Pages From a PDF

https://www.linuxjournal.com/content/tech-tip-extract-pages-pdf

There are a number of ways to extract a range of pages from a PDF file: you can use one of the PDF-related toolkits, or you can call Ghostscript directly.

For example, to extract pages 22-36 from a 100-page PDF file using pdftk:

  $ pdftk A=100p-inputfile.pdf cat A22-36 output outfile_p22-p36.pdf

Or use a combination of xpdf-utils (or poppler-tools) with psutils and the ps2pdf command (which ships as part of Ghostscript):

  $ pdftops 100p-inputfile.pdf - | psselect -p22-36 | \
         ps2pdf14 - outfile_p22-p36.pdf

Or just use Ghostscript, which, unlike pdftk, is installed nearly everywhere (and which the previous command already relies on anyway):

  $ gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage=22 -dLastPage=36 \
       -sOutputFile=outfile_p22-p36.pdf 100p-inputfile.pdf
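
Whichever of the three commands you run, the result is easy to sanity-check by comparing its page count with the size of the requested range. The following is only a sketch: it assumes pdfinfo (from the same xpdf-utils or poppler-tools package as pdftops) is installed, and the helper names expected_pages and pages_from_pdfinfo are made up for this illustration.

```shell
# Hypothetical helper: number of pages in the inclusive range FIRST..LAST.
#     $1 is the first page, $2 is the last page
expected_pages() {
    echo $(( $2 - $1 + 1 ))
}

# Hypothetical helper: read pdfinfo output on stdin and print the value
# of its "Pages:" line.
pages_from_pdfinfo() {
    awk '/^Pages:/ { print $2 }'
}

# Usage (assumes pdfinfo is installed and the output file exists):
#   [ "$(pdfinfo outfile_p22-p36.pdf | pages_from_pdfinfo)" -eq "$(expected_pages 22 36)" ] \
#       && echo "page count OK"
```

For pages 22-36 the expected count is 15, since both ends of the range are included.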

In terms of processing speed and efficiency and, more importantly, the quality of the output file, the 2nd method above is clearly the worst of the 3. Converting the original PDF to PostScript and back to PDF (also known as "refrying" the PDF) is very unlikely to completely preserve advanced PDF features (such as transparency information, font hinting, overprinting information, color profiles, trapping instructions, etc.).

The 3rd method uses only Ghostscript (which the 2nd one uses anyway, because ps2pdf14 is nothing more than a wrapper script around a more or less complicated Ghostscript command line). The 3rd method also preserves all the important PDF objects on your pages as they are, without any "roundtrip" conversions.

The only drawback of the 3rd method is that it's a longer, more complicated command line to type. But you can overcome that drawback if you save it as a bash function. Just put these lines in your ~/.bashrc file:

function pdfpextr()
{
    # this function uses 3 arguments:
    #     $1 is the first page of the range to extract
    #     $2 is the last page of the range to extract
    #     $3 is the input file
    #     output file will be named "inputfile_pXX-pYY.pdf"
    # (the arguments are quoted so that file names containing spaces work)
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage="${1}" \
       -dLastPage="${2}" \
       -sOutputFile="${3%.pdf}_p${1}-p${2}.pdf" \
       "${3}"
}

Now you only need to type the following (after starting a new copy of bash or sourcing .bashrc):

  $ pdfpextr 22 36 inputfile.pdf

which will result in the file inputfile_p22-p36.pdf in the same directory as the input file.
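
If you want the function to catch typing mistakes before Ghostscript runs, a few argument checks can be put in front of the same command. This is a minimal sketch, not part of the original tip: the names is_page_number and pdfpextr_checked and the chosen return codes are assumptions made for illustration.

```shell
# Hypothetical helper: succeed only if $1 is an integer >= 1.
is_page_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -ge 1 ]
}

# Hypothetical variant of pdfpextr that checks its arguments first.
#     $1 is the first page, $2 is the last page, $3 is the input file
pdfpextr_checked() {
    if [ "$#" -ne 3 ]; then
        echo "usage: pdfpextr_checked FIRST LAST INPUT.pdf" >&2
        return 2
    fi
    if ! is_page_number "$1" || ! is_page_number "$2" || [ "$1" -gt "$2" ]; then
        echo "pdfpextr_checked: invalid page range ${1}-${2}" >&2
        return 2
    fi
    if [ ! -f "$3" ]; then
        echo "pdfpextr_checked: no such file: ${3}" >&2
        return 1
    fi
    # Same Ghostscript call as in pdfpextr.
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage="${1}" \
       -dLastPage="${2}" \
       -sOutputFile="${3%.pdf}_p${1}-p${2}.pdf" \
       "${3}"
}
```

With this variant, a reversed range such as "pdfpextr_checked 36 22 inputfile.pdf" is rejected with an error message instead of being passed on to gs.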

Thursday, 17 May 2018
