Extract data from PDF files with Python

We have to deal with PDF, one of the most horrible formats around but also one of the most widely used.

The main issue is compatibility: several versions of the standard have piled up since the format was created, so you can never be sure that everything will work when you need to manipulate those files.

A good example is how Mozilla has improved the PDF reader in Firefox in recent releases.
Anyway, you can do it with Python; of course you need to sanitize the extracted data. Let me show it with an example.

If you open this script you can follow what it does even without access to the PDF files.

At lines 14 and 15 we look for two different files:

cedolinipdf = [filename for filename in os.listdir(sys.argv[1]) if filename.startswith("CEDOLINI")]
modf24pdf = [filename for filename in os.listdir(sys.argv[1]) if filename.startswith("MOD.F24")]

This code automatically finds the files whose names start with those terms, which speeds things up.
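The two list comprehensions above can be folded into one small helper. This is only a sketch: the name find_by_prefix is hypothetical and not part of the original script, and unlike the original lines it returns full paths (sorted) so they can be passed straight to open().

```python
import os


def find_by_prefix(folder, prefix):
    """Return the sorted full paths of the files in `folder` whose name starts with `prefix`."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.startswith(prefix)
    )


# Possible usage, with the folder taken from the command line as in the script:
#   cedolinipdf = find_by_prefix(sys.argv[1], "CEDOLINI")
#   modf24pdf = find_by_prefix(sys.argv[1], "MOD.F24")
```

Returning full paths avoids having to join the folder and the file name again later, which is easy to forget when the script is run from another directory.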

At line 30 we open the first file in binary mode, loop over its pages (if there is more than one) and build a single string. As far as I understand, PDFMiner converts the whole document to plain text line by line, so the rest of the code for both files just loops over the lines and finds the ones with the content we need.

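from io import StringIO  # the pdfminer imports are omitted in this excerpt
# Buffer that collects the text extracted from every page
output_string = StringIO()
# modf24pdf must be the path of a single file here, not a list of names
with open(modf24pdf, 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        # Render the page; its text is appended to output_string
        interpreter.process_page(page)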
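If you do not need control over each step, pdfminer.six also ships a high-level extract_text function that returns the whole document as one string. The sketch below is not what the original script does: pdf_to_lines and clean_lines are hypothetical helper names, and the cleanup (trimming whitespace and dropping empty lines) is just one assumed way to sanitize the output.

```python
def clean_lines(text):
    """Split extracted text into lines, trim whitespace and drop the empty ones."""
    # str.splitlines() also splits on form feed characters,
    # which can show up between pages in the extracted text.
    return [line.strip() for line in text.splitlines() if line.strip()]


def pdf_to_lines(path):
    """Extract the text of the PDF at `path` and return it as a list of clean lines."""
    # Imported here so that clean_lines() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return clean_lines(extract_text(path))
```

clean_lines works on any string, so you can try the sanitizing step on pasted text before pointing pdf_to_lines at a real file.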

Once we have those lines, a bit of logic does the rest, for example: if the previous line contains a specific text, the current one has what I need. We can save that value in a variable, convert it to an integer and do whatever we need with it.

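# Turn the collected text into a list of lines
output_string = output_string.getvalue().splitlines()

for line in output_string:
    # 'Hi' is only a placeholder for the text you want to skip
    if line != 'Hi':
        print(line)  # replace with your own handling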
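The "previous line" trick described above can be sketched like this. Everything here is an assumption made for illustration: the helper names value_after and to_cents, the label "TOTALE" and the sample amount are invented, and the amount format (dot for thousands, comma for decimals) is the usual Italian one, which your PDF may or may not follow. Negative amounts are not handled.

```python
def value_after(lines, label):
    """Return the first non-empty line that follows a line containing `label`."""
    found = False
    for line in lines:
        text = line.strip()
        if found and text:
            return text
        if label in text:
            found = True
    return None  # label missing, or nothing after it


def to_cents(amount):
    """Convert an amount such as '1.234,56' to an integer number of cents."""
    cleaned = amount.replace(".", "").replace(" ", "")
    euros, _, cents = cleaned.partition(",")
    # Pad or cut the decimal part to exactly two digits
    return int(euros) * 100 + int((cents + "00")[:2])


# Hypothetical lines, shaped like the text extracted from a PDF
sample = ["TOTALE", "", "1.234,56", "SALDO"]
total = to_cents(value_after(sample, "TOTALE"))
```

Working in integer cents avoids floating point rounding when you sum several amounts. Skipping empty lines is a deliberate choice, since the extracted text often has blank lines between a label and its value.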

I would have liked to attach a sample PDF too, but this script was written to extract information from Italian tax PDFs and run some calculations for tax purposes, so I cannot share one.
The only way is trial and error, but as you can see it is not that complicated once you understand where the text is.
