Pdf Source Code

For years, the only name in the game for working with PDF documents was Adobe Acrobat, whether in the form of their free reader edition or one of their paid editions for PDF creation and editing. But today, there are numerous open source PDF applications which have chipped away at this market dominance.

21 Jun 2004
Source code that shows how to decompress and extract text from PDF documents.

Introduction

PDF documents are commonly used and their content is usually compressed. This article shows simple C code that can be used to extract plain text from a PDF file.

Why?

Adobe does allow you to submit PDF files and will extract the text or HTML and mail it back to you. But there are times when you need to extract the text yourself or do it inside an application. You may also want to apply special formatting (e.g., add tabs) so that the text can be easily imported into Excel, for example when your PDF document mostly contains tables that you need to port to Excel (which is how this code got developed).

There are several projects on 'The Code Project' that show how to create PDF documents, but none that provide free code that shows how to extract text without using a commercial library. In the reader comments, a need was expressed for code just like what is being supplied here.

There are several libraries out there that read or create PDF files, but you have to register them for commercial use or sign various agreements. The code supplied here is very simple and basic, but it is entirely free. It only uses the ZLIB library, which is also free.

Basics

Documents such as PDFReference15_v5.pdf explain some of the internals of PDF files. In short, each PDF file contains a number of objects. Each object may provide a stream of data and may require one or more filters to decode it. Text streams are usually compressed using the FlateDecode filter and may be uncompressed using code from the ZLIB (http://www.zlib.org/) library.

The data for each object can be found between the 'stream' and 'endstream' keywords. Once inflated, the data needs to be processed to extract the text. The data usually contains one or more text objects (starting with BT and ending with ET) with formatting instructions inside. You can learn a lot about the structure of a PDF file by stepping through this application.

About Code

This single source code file contains very simple, very basic C code. It first reads the entire PDF file into one buffer and then repeatedly scans for 'stream' and 'endstream' sections. It does not check which filter should be applied and always assumes FlateDecode. (If it gets it wrong, usually no output is generated for that section of the file, so it is not a big issue.) Once the data stream is inflated (uncompressed), it is processed. During the processing, the code searches for the BT and ET tokens that delimit text objects. The contents of each text object are processed to extract the text, and a guess is made as to whether tabs or new line characters are needed.
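The first step described above, reading the whole file into one buffer, might look roughly like the following. This is a minimal sketch and not the author's code; the function name read_whole_file is a hypothetical choice made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything from an open binary stream into a
 * malloc'd buffer. Returns NULL on failure; on success *out_len holds the
 * byte count. One extra '\0' byte is appended for convenience, but PDF
 * data may contain zero bytes, so rely on *out_len rather than strlen. */
unsigned char *read_whole_file(FILE *fp, size_t *out_len)
{
    if (fseek(fp, 0, SEEK_END) != 0) return NULL;
    long size = ftell(fp);
    if (size < 0) return NULL;
    rewind(fp);

    unsigned char *buf = malloc((size_t)size + 1);
    if (buf == NULL) return NULL;

    size_t got = fread(buf, 1, (size_t)size, fp);
    if (got != (size_t)size) { free(buf); return NULL; }
    buf[got] = '\0';
    *out_len = got;
    return buf;
}
```

The file should be opened in binary mode ("rb"), since compressed stream data is not text.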

The code is far from complete and is not any sort of general utility class, but it does demonstrate how you can extract the text yourself. It is enough to show you how and get you going.

The code is however fully functional, so when it is applied to a PDF document, it generally does a fair job of extracting the text. It has been tested on several PDF files.

This code is supplied as is, no warranties. Use at your own risk.

Using The Code

The download contains one C file. To use it, create a simple Windows 32 Console project and add the pdf.c file to the project. You also need to download the free 'zlib compiled DLL' zip file from the zlib project (bless them!). Extract zdll.lib to your project directory and add it as a project dependency (link against it). Put zlib1.dll, zconf.h and zlib.h in your project directory as well, and add the two header files to the project.

Now, step through the application and note that the input PDF and output text file names are hardwired at the start of the main method.

Future Enhancements

If there is enough interest, the author may consider uploading a release version with a Windows interface. The code is quite good for extracting data from tables in a form that can be readily imported into Excel, with the columns preserved (because of the tabs that get added).

Code Snippets

Stream sections are initially located using:
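The original snippet is missing from this copy of the article. As a stand-in, here is a minimal sketch of how the 'stream' and 'endstream' keywords could be located in the file buffer. The names find_bytes and next_stream are hypothetical, and like the article's code it ignores the /Length entry, so it can be fooled if the compressed data happens to contain the bytes "endstream".

```c
#include <stddef.h>
#include <string.h>

/* Find the first occurrence of needle in hay[0..hay_len); NULL if absent.
 * Uses memcmp so that zero bytes in binary PDF data are handled. */
static const unsigned char *find_bytes(const unsigned char *hay, size_t hay_len,
                                       const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || hay_len < n) return NULL;
    for (size_t i = 0; i + n <= hay_len; i++)
        if (memcmp(hay + i, needle, n) == 0)
            return hay + i;
    return NULL;
}

/* Locate the next stream payload at or after offset `from`.
 * On success returns 1, sets *start to the first data byte, *end to one
 * past the last data byte, and *next to the offset just after the
 * "endstream" keyword (pass it as `from` to continue scanning).
 * Returns 0 when no further section is found. */
int next_stream(const unsigned char *buf, size_t len, size_t from,
                size_t *start, size_t *end, size_t *next)
{
    if (from >= len) return 0;
    const unsigned char *s = find_bytes(buf + from, len - from, "stream");
    if (s == NULL) return 0;
    size_t b = (size_t)(s - buf) + 6;        /* skip the keyword itself */
    /* The keyword is followed by CRLF or LF before the data begins. */
    if (b < len && buf[b] == '\r') b++;
    if (b < len && buf[b] == '\n') b++;

    const unsigned char *e = find_bytes(buf + b, len - b, "endstream");
    if (e == NULL) return 0;
    size_t stop = (size_t)(e - buf);
    *next = stop + 9;                        /* just past "endstream" */
    /* Trim the end-of-line that precedes "endstream". */
    if (stop > b && buf[stop - 1] == '\n') stop--;
    if (stop > b && buf[stop - 1] == '\r') stop--;

    *start = b;
    *end = stop;
    return 1;
}
```

Calling next_stream in a loop, feeding *next back in as from each time, visits every section in turn. Continuing from *next rather than *end matters because the word "endstream" itself contains "stream".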

And then once the data portion is identified, it is inflated as follows:

The main work gets done in the ProcessOutput method, which processes the uncompressed stream to extract the text portion of each text object. It looks as follows:
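The listing for ProcessOutput is not included in this copy. The sketch below is a much simplified stand-in, not the original function: it copies the literal strings found between BT and ET and emits a newline at each ET, without the tab and new line guessing the article describes. The names is_op and extract_text are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* True when the two-letter operator op stands alone at position i. */
static int is_op(const char *s, size_t len, size_t i, const char *op)
{
    if (i + 2 > len || s[i] != op[0] || s[i + 1] != op[1]) return 0;
    if (i > 0 && !isspace((unsigned char)s[i - 1])) return 0;
    return i + 2 == len || isspace((unsigned char)s[i + 2]);
}

/* Copy the literal strings "( ... )" found inside BT ... ET text objects
 * of an inflated content stream into out (capacity out_cap, always
 * NUL-terminated). A newline is written at each ET.
 * Returns the number of bytes written, not counting the terminator. */
size_t extract_text(const char *s, size_t len, char *out, size_t out_cap)
{
    size_t n = 0;
    int in_text = 0;   /* currently between BT and ET */
    int depth = 0;     /* parenthesis nesting inside a literal string */
    if (out_cap == 0) return 0;

    for (size_t i = 0; i < len; i++) {
        char c = s[i];
        if (depth > 0) {                       /* inside "( ... )" */
            /* Escaped character is copied as-is: \( \) and \\ work,
             * but \n and octal escapes are not decoded in this sketch. */
            if (c == '\\' && i + 1 < len) c = s[++i];
            else if (c == '(') depth++;
            else if (c == ')' && --depth == 0) continue;
            if (n + 1 < out_cap) out[n++] = c;
        } else if (in_text && c == '(') {
            depth = 1;                         /* string starts */
        } else if (!in_text && is_op(s, len, i, "BT")) {
            in_text = 1; i++;                  /* skip the 'T' */
        } else if (in_text && is_op(s, len, i, "ET")) {
            in_text = 0; i++;
            if (n + 1 < out_cap) out[n++] = '\n';
        }
    }
    out[n] = '\0';
    return n;
}
```

For a stream such as `BT /F1 12 Tf (Hello) Tj ET` this yields "Hello" followed by a newline. Hex strings in angle brackets and fonts with custom encodings are ignored here, so the output is only readable for documents that store text as plain literal strings.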

OpenPDF is a Java library for creating and editing PDF files, released under the LGPL and MPL open source licenses. OpenPDF is the LGPL/MPL open source successor of iText, and is based on a fork of a fork of the iText 4 svn tag. We welcome contributions from other developers. Please feel free to submit pull requests and bug reports to this GitHub repository.

OpenPDF version 1.3.11 released 2019-09-19

Get version 1.3.11 here - https://github.com/LibrePDF/OpenPDF/releases/tag/1.3.11

Use OpenPDF as Maven dependency

Add this to your pom.xml file to use the latest version of OpenPDF:
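The dependency snippet itself is missing from this copy. Going by the Maven coordinates in the mvnrepository usage link listed under Projects using OpenPDF and the 1.3.11 release announced above, it would presumably look like this (treat the exact version as an assumption and check the releases page):

```xml
<dependency>
    <groupId>com.github.librepdf</groupId>
    <artifactId>openpdf</artifactId>
    <version>1.3.11</version>
</dependency>
```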

License

GNU Lesser General Public License (LGPL) version 3.0 - http://www.gnu.org/licenses/lgpl.html

Mozilla Public License Version 2.0 - http://www.mozilla.org/MPL/2.0/

We want OpenPDF to consist of source code which is consistently licensed with the LGPL and MPL licences only. This also means that any new contributions to the project must have a dual LGPL and MPL license only.

Documentation

  • Tutorial (wiki, work in progress)

Background

OpenPDF is open source software with an LGPL and MPL license. It is a fork of iText version 4, more specifically iText svn tag 4.2.0, which was hosted publicly on SourceForge with LGPL and MPL license headers in the source code, and LGPL and MPL license documents in the svn repository. Beginning with version 5.0 of iText, the developers have moved to the AGPL to improve their ability to sell commercial licenses.

Projects using OpenPDF

  • Spring Framework https://github.com/spring-projects/spring-framework
  • flyingsaucer https://github.com/flyingsaucerproject/flyingsaucer
  • Confluence PDF Export
  • Digital Signature Service - https://github.com/esig/dss
  • OpenCMS, Nuxeo Web Framework, QR Invoice Library and many closed source commercial applications as well.
  • Full list here: https://mvnrepository.com/artifact/com.github.librepdf/openpdf/usages

Android support

OpenPDF now has Android support, more info here: Android-support

Contributing

Release the hounds! Please send all pull requests. Make sure that your contributions can be released with a dual LGPL and MPL license. In particular, pull requests to the OpenPDF project must only contain code that you have written yourself. GPL or AGPL licensed code will not be accepted.

Coding Style

  • Code indentation style is 4 spaces.
  • Generally try to preserve the coding style in the file you are modifying.

Dependencies

Required Dependencies:

  • Java 8 or later is required to use OpenPDF. All versions Java 8 to Java 12 have been tested to work.

Optional:

  • BouncyCastle (BouncyCastle is used to sign PDF files, so it's a recommended dependency)
    • Provider
    • PKIX/CMS
  • TwelveMonkeys imageio-tiff - optional by default, but required if TIFF image support is needed.
  • JUnit 5 - for unit testing
  • JFreeChart - for testing graphical examples
    • JFreeChart
    • JCommon
    • Servlet
  • DOM4j is required for the pdf-swing submodule.

Credits

Significant Contributors to OpenPDF on GitHub:

@andreasrosdal - Andreas Røsdal - Maintainer of OpenPDF from 1.0 to 1.3.11, now retired from OpenPDF development.
@daviddurand - David G. Durand
@tlxtellef - Tellef
@asturio
@ymasory
@albfernandez - Alberto Fernández
@noavarice
@bengolder - Benjamin Golder
@glarfs
@Kindrat
@syakovyn
@ubermichael - Michael Joyce
@weiyeh
@SuperPat45
@lapo-luchini
@MartinKocour - Martin Kocour
@jokimaki
@sullis
@jeffrey-easyesi
@V-F
@sixdouglas
@razilein - Sita Geßner
@PalAditya - Aditya Pal

Also, a very special thanks to the iText developers ;)