PDF: The Portable Document Format

The Portable Document Format (PDF) is a file format created by Adobe Systems in 1993 for document exchange. It represents two-dimensional documents in a fixed-layout format that is independent of the device and the display resolution. Each PDF file encapsulates a complete description of a 2-D document (and, with Acrobat 3-D, embedded 3-D documents), including the text, fonts, images, and 2-D vector graphics that compose it.

PDF is an open standard, published by the ISO as ISO 32000-1 in 2008.

PDF Types

PDF files fall into two main types: native and scanned. A native PDF file is superior to a scanned PDF file in capabilities, flexibility, and efficiency, because it contains true text rather than only images of text.

PDF Types:

  • Native
  • Scanned

Native PDF

A native PDF file contains literal text as part of its structure, including information about the text. This is not to say that there are no images. It is to say that the text itself is actual text and not just part of an image. A native PDF has an internal structure that can be read and interpreted. Only a native PDF can use all of the capabilities that the format offers to reader software.

Scanned PDF

PDF files created by scanning hard-copy documents that contain primarily text do not have the same structure as a PDF file of the same document created directly. Internally, the scanned document contains a picture of each page, with no information about the text. To the user it is just another PDF file, with a name and extension indistinguishable from any other. A good scan may look exactly like a native PDF file, although a visually poor-quality file, often with skewed pages, gives away its nature. However, the file size will be different, and it will not be possible to search for text. For a scan of adequate quality, suitable software can regenerate the text of the document with optical character recognition (OCR) and embed it in the file to make it searchable, subject to the accuracy of the OCR.
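
A quick way to tell the two types apart from the command line is to try extracting text from the file. The sketch below assumes the pdftotext tool from the poppler-utils package is installed (it is not otherwise covered on this page); the function names pdf_has_text and pdf_kind are made up for illustration.

```shell
# Sketch: report whether a PDF contains any extractable text.
# Assumes pdftotext (poppler-utils) is installed.
# pdf_has_text and pdf_kind are hypothetical helper names.
pdf_has_text() {
    # "-" sends the extracted text to standard output; whitespace and
    # page-break characters are stripped before counting what is left.
    chars=$(( $(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c) ))
    [ "$chars" -gt 0 ]
}

pdf_kind() {
    if pdf_has_text "$1"; then
        echo "text layer found"
    else
        echo "no text layer (probably a scan)"
    fi
}
```

Calling pdf_kind report.pdf prints one of the two messages. A scanned file that has already been through OCR will also report a text layer, so this only shows whether the file is searchable, not how it was made.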

Conversion

Converting a scanned PDF into a searchable one requires optical character recognition (OCR) technology. OCR analyzes the image of each character and matches it to the corresponding electronic character. The level of accuracy depends on the quality of the scan and the font used. OCR works primarily on typeset characters and not on handwritten text.

PDF Document Viewers

Evince PDF

Rating: 3.5 stars

Windows, FreeBSD, Linux

Evince is a document viewer for multiple document formats. The goal of Evince is to replace the multiple document viewers that exist on the GNOME Desktop with a single simple application.

Evince currently supports PDF, PostScript, DjVu, TIFF, DVI, XPS, SyncTeX with gedit, comic books (cbr, cbz, cb7 and cbt), and many more.

Review: Evince opens PDF files in a well laid out reader. The DRM flag is ignored, making Evince far more useful than Sumatra PDF or Adobe Reader. Loading speed was similar to Sumatra. One notable glitch is that selected text becomes distorted, which can somewhat hinder text selection. It has been reported that the Windows version will only open PDF files. In our test on Microsoft Windows we confirmed that Evince was unable to open .epub files (an e-book format).

The fact that Evince PDF is not handicapped by DRM restrictions makes it far more useful as a PDF reader when compared to Sumatra PDF. For this reason Evince is our choice for a Windows PDF reader.

An annoying flaw in Evince costs it half a star. With some PDF documents, selecting Print makes the printer output only blank paper. This is a recurring problem. Ultimately it is a serious issue that leaves Evince inadequate for printing.

PDFlite

Rating: 0.5 stars

PDFlite can be used to read any PDF file. It has a simple design and offers all the common viewing features, such as search, print, and zoom. The PDFlite printer can convert any document to a PDF file.

PUP alert: the installer contains malware. Even if you uncheck the toolbar and other bundled software, it still installs a PUP (potentially unwanted program) in the background. Avoid it unless you want to take the time to build it yourself from the source code they provide.

Sumatra PDF

Rating: 2 stars

Microsoft Windows Only

A minimalistic PDF reader. Sumatra PDF's simplicity comes at the expense of many other features. As is characteristic of portable applications, Sumatra takes up little disk space (a 1 MB setup file, compared to Adobe Reader's 27.5 MB setup file) and starts up rapidly. It was designed for portable use, in the sense that it is a single file with no external dependencies, so you can easily run it from an external USB drive[1].

One interesting feature of Sumatra PDF is that it remembers the last opened page of each PDF file. This makes it a very useful PDF e-book reader.

Review: Sumatra PDF contains anti-features: it enforces DRM restrictions. As stated in a SourceForge review, "it supports DRM of 'protected' PDF files, and the author stubbornly refuses to make it optional. So you can't print PDFs for offline reading, and you can't copy text to the clipboard for pasting into Google translate, saving to your notes, quoting in a paper, etc."

The Sumatra PDF developers have published a complaint titled "PDFLite is a SumatraPDF ripoff". In our opinion the complaint shows a misunderstanding of open source.

GhostScript

Rating: 4 stars

Windows, FreeBSD, Linux

Command line. Ghostscript is a suite of software built around an interpreter for PostScript and Portable Document Format (PDF) files. You can use it to view, convert, and manipulate PDF files. Ghostscript can be picky and inconsistent about the PDF files it will open.

Example: view a PDF on Windows XP

gswin32c.exe -dSAFER -dBATCH "C:\Program Files\GPLGS\test3.pdf"

The example opens the PDF document in a GUI window for viewing.

PDF Authoring

PDF Utilities

The GUI Way: Using Gimp and LibreOffice Draw

Creating PDF documents from scanned images and other data sources this way is fast and simple, and can be done entirely without dropping to the console.

This method is for people who wish to scan documents to images, make any modifications to the images, put the images in order, and generate a custom multi-page PDF document.

Learn how to Create PDF Documents with Gimp and LibreOffice Draw.

The GUI Way: Using Simple Scan and PDF Chain

If all you are looking to do is scan some documents page by page, then combine them as a single ordered PDF without the need to make any edits or do any fancy OCR, compression, or other modification related activity, you can accomplish this quite quickly and easily using two programs:

  • Simple Scan
  • PDF Chain

With Simple Scan you can scan each page and save each page as a PDF. You can even skip PDF Chain entirely and scan a number of pages into a single PDF. However, if you need to re-order pages, you can load each saved PDF into PDF Chain and change the order, annotate, or make other basic modifications.

Linux PDF Tools: tiff2ps and ps2pdf

On Linux the tiff2ps command is part of libtiff-tools. The command line tools in libtiff-tools include tiffcp, tiff2ps, tiffdump and tiffsplit. Windows executables for libtiff-tools can be found at stillhq.com, e.g. http://www.stillhq.com/libtiff/win32/3.5.4/tiffcp.exe and http://www.stillhq.com/libtiff/win32/3.5.4/tiff2ps.exe

The ps2pdf command is part of Ghostscript, whose command line tools include ps2pdf and gs (gswin32 in the Win32 version). The Ghostscript installer for Windows is gs651w32.exe.

Netpbm for Windows is netpbm-9.19-bin.zip and requires Cygwin.

To make a PDF from TIFF files, first convert the TIFF files to PostScript (on Linux):

tiff2ps *.tiff > tiffs.ps

Then convert the PostScript file to PDF:

ps2pdf tiffs.ps

You can compress an existing PDF (like one made with Gimp) into a smaller file size (ref: Compress PDF File In Linux)

ps2pdf big.pdf smaller.pdf
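
ps2pdf is a wrapper script around Ghostscript's pdfwrite device, so the same recompression can be requested from gs directly with an explicit quality preset. The following is only a sketch: compress_pdf is a hypothetical helper name, the file names are placeholders, and the /ebook default is just one of Ghostscript's documented presets (/screen, /ebook, /printer, /prepress). How much a given file shrinks depends on its contents.

```shell
# Sketch: recompress a PDF with Ghostscript's pdfwrite device.
# compress_pdf is a hypothetical helper name.
# Arguments: input file, output file, optional preset (default /ebook).
compress_pdf() {
    gs -sDEVICE=pdfwrite -dPDFSETTINGS="${3:-/ebook}" \
       -dNOPAUSE -dBATCH -dQUIET \
       -sOutputFile="$2" "$1"
}
```

For example, compress_pdf big.pdf smaller.pdf /screen asks for the most aggressive preset, at the cost of image quality.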

Linux PDF Tools: imagemagick

I highly recommend this particular method if you are comfortable with the Linux shell. I found it to yield the best results with the least amount of labor.

From the ImageMagick package, use the convert command to perform tasks such as taking a folder of jpg images and creating a single PDF document. If the images are numbered with leading zeros, such as 01 02 03 04 05, then the page order will match the file order.

convert *.jpg document.pdf

It also works with png files:

convert *.png document.pdf
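
The shell expands *.jpg in lexical order, so 10.jpg sorts before 2.jpg unless the numbers are zero-padded. Below is a minimal sketch of renaming purely numeric file names to three digits before running convert. The helper name pad_names is hypothetical, and the three-digit width is an arbitrary choice that suits documents of up to 999 pages.

```shell
# Sketch: zero-pad purely numeric file names (1.jpg -> 001.jpg) so that
# glob order matches page order. pad_names is a hypothetical helper name.
# Arguments: directory, file extension without the dot.
pad_names() {
    dir=$1
    ext=$2
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue              # the glob matched nothing
        base=$(basename "$f" ".$ext")
        case $base in
            ''|*[!0-9]*) continue ;;         # skip names that are not all digits
        esac
        # Strip leading zeros so that printf does not read the number as octal.
        num=$(printf '%s' "$base" | sed 's/^0*//')
        target="$dir/$(printf '%03d' "${num:-0}").$ext"
        # Rename only when the name changes and nothing would be overwritten.
        if [ "$f" != "$target" ] && [ ! -e "$target" ]; then
            mv -- "$f" "$target"
        fi
    done
}
```

After pad_names . jpg, the files 1.jpg, 2.jpg and 10.jpg become 001.jpg, 002.jpg and 010.jpg, and convert *.jpg document.pdf picks them up in page order. Files with non-numeric names are left alone.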

The PDF contracts.pdf is black and white and contains multiple pages. We can generate a tiff image for each page, adding parameters to avoid unnecessary quality loss.

convert -colorspace rgb -density 300 contracts.pdf -monochrome  contracts-%03d.tiff

You can install imagemagick with apt

sudo apt install imagemagick

See also: Create PDF Documents with ImageMagick and Ghostscript

Linux PDF Tools: qpdf PDF transformation software

The qpdf program is used to convert one PDF file to another equivalent PDF file. It is capable of performing a variety of transformations such as linearization (also known as web optimization or fast web viewing), encryption, and decryption of PDF files. It also has many options for inspecting or checking PDF files, some of which are useful primarily to PDF developers.

For example, I have a password-protected PDF whose password I know, and I simply wish to remove the password protection:

qpdf --password=password --decrypt /home/nicole/Documents/resume.pdf /home/nicole/Documents/resume2.pdf

Replace "password" with the actual password of the document. qpdf was installed by default on my Linux Mint 18 system. If it is not installed on yours:

sudo apt install qpdf
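
The inspection and linearization features of qpdf can be sketched in the same way. The options --show-encryption, --linearize and --check belong to qpdf's command line; the wrapper name pdf_web_optimize and the file names are placeholders chosen for this example.

```shell
# Sketch: inspect a PDF and write a linearized ("fast web view") copy.
# pdf_web_optimize is a hypothetical helper name.
# Arguments: input file, output file.
pdf_web_optimize() {
    qpdf --show-encryption "$1" &&   # print encryption details, if any
    qpdf --linearize "$1" "$2" &&    # write the web-optimized copy
    qpdf --check "$2"                # verify the structure of the result
}
```

Run it as pdf_web_optimize resume.pdf resume-web.pdf. For a password-protected input, the --password option would have to be added to each call.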

Linux PDF Tools: tiff2pdf and tiffcp

The tiff2pdf utility can convert a single tiff file into a pdf document. For multiple pages it will be necessary to create a multi-page tiff file. Yes, a single tiff file can contain multiple pages.

A 12-page black and white document was scanned into jpeg images. Although jpeg is not the best choice for black and white documents, this is how it was presented, and it needed to be converted to a pdf. ImageMagick's convert produced a large pdf of over 6 MB that was not optimized for black and white. This is not a matter of compression: applying jpeg compression or changing the dpi is not the correct way to optimize black and white scanned images.

Our fat pdf, created from jpeg and not optimized for black and white, is called document.pdf. It will be deconstructed back into images, this time into tiff images optimized for black and white. A larger multi-page tiff file will then be created from the individual tiff images. Finally, the single multi-page tiff file will be converted back into a much smaller, optimized pdf document.

convert -colorspace rgb -density 300 document.pdf -monochrome document-%03d.tiff
tiffcp document-???.tiff multipage.tiff
tiff2pdf -o documentfinal.pdf multipage.tiff

While the original document.pdf is over 6 MB, documentfinal.pdf is less than 1 MB.
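
To check the saving on your own files, the two sizes can be compared with standard tools. This is a small sketch; size_report is a hypothetical helper name, and it reports bytes rather than megabytes.

```shell
# Sketch: compare the sizes of two files in bytes.
# size_report is a hypothetical helper name.
# Arguments: original file, converted file.
size_report() {
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    if [ "$before" -eq 0 ]; then
        echo "original file is empty" >&2
        return 1
    fi
    # Integer arithmetic, so the percentage is rounded down.
    echo "$before -> $after bytes ($(( after * 100 / before ))% of original)"
}
```

Usage: size_report document.pdf documentfinal.pdf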

Linux PDF Tools: pdfcrack

pdfcrack can unlock a password-protected PDF file when you do NOT know the password. PDFCrack is a GNU/Linux tool for recovering passwords and content from PDF files. It is small and command line driven, with no external dependencies.

pdfcrack -f 2020CrackMe.pdf

If you see the error

The specific version is not supported (Standard - 6)

then your version of pdfcrack does not support 256-bit encryption.

As another resource, look into John the Ripper to brute-force a protected PDF. John the Ripper is a fast password cracker whose primary purpose is to detect weak Unix passwords.

Print to PDF in Windows

CutePDF Writer

There is a free version and a more feature-rich paid version on their web site, http://www.cutepdf.com/Products/CutePDF/writer.asp

Print to PDF in Linux

One simple option that works in Debian-based distributions such as the popular Ubuntu Linux is to use cups-pdf.

See: Install and Use cups-pdf in Ubuntu for a detailed guide.

Convert Images to PDF in Windows

Free Image to PDF Converter supports the BMP, DIB, GIF, JPEG, JPG, JPE, JFIF, PNG, TIFF, and TIF formats. It can combine multiple images, including whole directories of images, into one multi-page PDF.

Installer: PDFdu_Image_To_PDF_setup.exe
Developer Web Site: http://pdfdu.com/app/image-to-pdf-converter.aspx

Convert PDF to Images in Windows

Windows Print Driver: PDF to TIFF

The Virtual Image Printer driver by tariel allows you to convert a PDF to multi-page image files in several image formats. This is not all that the Virtual Image Printer can do, and it is not exclusively for converting PDFs to images. However, it is very handy for performing this task under the Windows XP operating system.

GhostScript

The installer "gs915w32.exe" is the Win32 installer as of Dec 2014 for Microsoft Windows 32-bit Operating Systems such as Windows XP. Using GhostScript a PDF can be converted to PNG for example.

 gswin32c.exe -dNOPAUSE -dBATCH -sDEVICE=pnggray -sOutputFile="test.png" "test.pdf"

GhostScript requires a proper PDF. Some PDF files are broken, in that they will open in some viewers, but are not completely compliant with the standard. In short, GhostScript is picky.
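
For a multi-page PDF, the output file name needs a page counter, and it is worth setting the resolution explicitly. The sketch below uses gs, the name of the Ghostscript binary on Linux (on 32-bit Windows the equivalent is gswin32c.exe). The %03d page counter in -sOutputFile and the -r resolution switch are standard Ghostscript options; the helper name pdf_to_png, the png16m color device, and the 150 dpi default are choices made for this example.

```shell
# Sketch: render every page of a PDF to its own PNG file.
# pdf_to_png is a hypothetical helper name.
# Arguments: input PDF, output name prefix, optional resolution in dpi.
pdf_to_png() {
    gs -dSAFER -dNOPAUSE -dBATCH -dQUIET \
       -sDEVICE=png16m -r"${3:-150}" \
       -sOutputFile="$2-%03d.png" "$1"
}
```

For example, pdf_to_png test.pdf page 300 writes page-001.png, page-002.png and so on at 300 dpi. Replacing png16m with pnggray gives grayscale output, as in the Windows example.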

OCR Scanned Images for your PDF Pages

Tesseract is an optical character recognition utility that will work in Linux and Microsoft Windows as well as other operating systems.

Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as input. Since version 3.00, Tesseract has supported output text formatting and, besides TIFF, a number of other image formats. Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks, including layout analysis, through a frontend such as OCRopus or gImageReader.

Before using Tesseract, it is very important to preprocess all the images so that Tesseract can read them as accurately as possible:

  • text x-height is at least 20 pixels
  • reduce or eliminate rotation or skew of the text
  • high contrast is recommended
  • eliminate any border or dark boxes around text
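
Most of this preparation can be scripted with ImageMagick before the image is handed to Tesseract. The following is a sketch under some assumptions: ocr_page is a hypothetical helper name, the English language pack is installed, and the Tesseract version is recent enough to offer the pdf output option. The 40% deskew threshold is the value suggested in ImageMagick's own documentation. The x-height requirement is not handled here, since it depends mainly on the scan resolution.

```shell
# Sketch: clean up a scanned page and OCR it into a searchable PDF.
# ocr_page is a hypothetical helper name.
# Arguments: input image, output base name (without extension).
ocr_page() {
    # -deskew straightens the text, -trim removes a uniform border,
    # +repage resets the canvas after trimming, -normalize stretches contrast.
    convert "$1" -colorspace Gray -deskew 40% -trim +repage -normalize "$2.tif" &&
    # Writes <base>.pdf: the page image with a hidden, searchable text layer.
    tesseract "$2.tif" "$2" -l eng pdf
}
```

For example, ocr_page scan.jpg page1 produces page1.tif and page1.pdf. Dark boxes around the text are not removed by -trim and may still need manual cropping.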

See: Tesseract for usage and examples of this powerful OCR tool, which beats many expensive commercial software products, including Adobe's. It is pretty impressive!

References

Related Pages