Get Underlined Text from Any PDF with Python – Towards Data Science

A step-by-step guide to get underlined text as an array from PDF files.

If you want to see the code for this project, check out my repository: https://github.com/sasha-korovkina/pdfUnderlinedExtractor

PDF data extraction can be a real headache, and it gets even trickier when you're trying to snag underlined text. Believe it or not, there aren't any go-to solutions or libraries that handle this out of the box. But don't worry, I'm here to show you how to tackle this.

Extracting underlined text from PDFs can take a few different paths. You might consider using OCR to detect text components with bottom lines, or delve into PyMuPDF's markup capabilities. However, I've found that OCR tends to falter, suffering from inconsistency and low accuracy. PyMuPDF isn't my favorite either: it demands finicky parameter tuning, which is time-consuming. Plus, one wrong setting and you could lose a bunch of data.

It is important to remember that PDFs are:

But fear not, as we have a strategy to resolve this.

We will use the pdfquery library, the most comprehensive PDF-to-XML converter I have come across.
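As a minimal sketch of the conversion step, the snippet below loads a PDF with pdfquery and writes its element tree to an XML file. The function names `default_xml_path` and `pdf_to_xml` are hypothetical helpers introduced here for illustration; the pdfquery import sits inside the function so the path helper can be used without the library installed.

```python
from pathlib import Path
from typing import Optional


def default_xml_path(pdf_path: str) -> str:
    """Return an XML path next to the PDF, e.g. 'example.pdf' -> 'example.xml'."""
    return str(Path(pdf_path).with_suffix(".xml"))


def pdf_to_xml(pdf_path: str, xml_path: Optional[str] = None) -> str:
    """Parse a PDF with pdfquery and write its element tree to an XML file."""
    # Imported here so the path helper above works without pdfquery installed.
    import pdfquery

    xml_path = xml_path or default_xml_path(pdf_path)
    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load()  # parses all pages; pass page numbers to load only some of them
    pdf.tree.write(xml_path, pretty_print=True, encoding="utf-8")
    return xml_path
```

Calling `pdf_to_xml("example.pdf")` should leave an `example.xml` file next to the input, which you can open in any text editor to inspect the components.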

2. Studying the XML

The XML has a few key components which we are interested in:

LTRect component example:

Therefore, by converting the whole document into XML format, we can replicate its structure as XML components. Let's do just that!
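To make this concrete, here is a small sketch that pulls the bounding box of every LTRect element out of the XML using only the standard library. The sample XML is hand-written for illustration, with made-up coordinates, and it assumes the `bbox` attribute is stored as a string of the form `[x0, y0, x1, y1]`; check your own XML output before relying on that.

```python
import json
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of the XML; all values are made up.
SAMPLE_XML = """
<pdfxml>
  <LTPage page_index="0" bbox="[0, 0, 612, 792]">
    <LTTextLineHorizontal bbox="[72.0, 700.0, 180.0, 712.0]">Some underlined text</LTTextLineHorizontal>
    <LTRect bbox="[72.0, 698.2, 180.0, 698.9]" />
    <LTRect bbox="[72.0, 650.2, 130.5, 650.9]" />
  </LTPage>
</pdfxml>
"""


def parse_bbox(element):
    """Turn a bbox attribute such as '[72.0, 698.2, 180.0, 698.9]' into a tuple of floats."""
    x0, y0, x1, y1 = json.loads(element.get("bbox"))
    return (float(x0), float(y0), float(x1), float(y1))


def find_rect_boxes(xml_text):
    """Return the bounding box of every LTRect element, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [parse_bbox(rect) for rect in root.iter("LTRect")]


rect_boxes = find_rect_boxes(SAMPLE_XML)
print(rect_boxes)
```

Each tuple is `(x0, y0, x1, y1)` in PDF points, with the origin at the bottom-left corner of the page.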

Now, we will re-create the structure of our document as bounding box coordinates. To do this, we will parse the XML to define the page, component boxes, lines and rectangles, and then draw them all on our canvas in 3 different colors.
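One way to sketch this drawing step without any plotting library is to emit an SVG file. This is a swapped-in technique chosen to keep the example dependency-free, not necessarily the canvas used in the repository. Black for text components and blue for LTRect follow the description in this article; red for LTLine, the tag names chosen, and the function name `boxes_to_svg` are assumptions. Only the first page is drawn.

```python
import json
import xml.etree.ElementTree as ET

# Black and blue follow the article; red for LTLine is an assumption.
COLORS = {"LTTextLineHorizontal": "black", "LTRect": "blue", "LTLine": "red"}

# Hand-written sample with made-up coordinates.
SAMPLE_XML = """
<pdfxml>
  <LTPage bbox="[0, 0, 600, 800]">
    <LTTextLineHorizontal bbox="[50, 700, 250, 712]">Underlined words</LTTextLineHorizontal>
    <LTRect bbox="[50, 698, 250, 699]" />
  </LTPage>
</pdfxml>
"""


def boxes_to_svg(xml_text):
    """Draw the bounding boxes of the first page's elements as SVG rectangles."""
    root = ET.fromstring(xml_text.strip())
    page = next(root.iter("LTPage"))
    _, _, page_w, page_h = json.loads(page.get("bbox"))
    shapes = []
    for tag, color in COLORS.items():
        for el in page.iter(tag):
            x0, y0, x1, y1 = json.loads(el.get("bbox"))
            # PDF y grows upward from the bottom edge; SVG y grows downward.
            shapes.append(
                f'<rect x="{round(x0, 2)}" y="{round(page_h - y1, 2)}" '
                f'width="{round(x1 - x0, 2)}" height="{round(y1 - y0, 2)}" '
                f'fill="none" stroke="{color}"/>'
            )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{page_w}" height="{page_h}">'
        + "".join(shapes)
        + "</svg>"
    )


svg = boxes_to_svg(SAMPLE_XML)
```

Writing `svg` to a file with an `.svg` extension and opening it in a browser gives a quick look at the page layout.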

Here is our initial PDF. It was generated in Microsoft Word by exporting a document with some underlines to the PDF file format:

After applying the algorithm above, here is the visual representation we get:

This image represents the structure of our document: the black boxes outline all components on the page, and the blue boxes outline the LTRect elements, which correspond to the underlined text.

Now, let's visualize all of the text within the PDF in its respective positions, with the following line of code:
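As a rough illustration of placing each string at its coordinates, the sketch below reads the text of every LTTextLineHorizontal element and emits it as an SVG text label. It assumes the line text is stored as the element's text content and that `bbox` is a string of the form `[x0, y0, x1, y1]`. The function name `text_to_svg`, the sample XML, and the font size are illustrative choices.

```python
import json
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hand-written sample with made-up coordinates.
SAMPLE_XML = """
<pdfxml>
  <LTPage bbox="[0, 0, 600, 800]">
    <LTTextLineHorizontal bbox="[50, 700, 250, 712]">Underlined words</LTTextLineHorizontal>
    <LTTextLineHorizontal bbox="[50, 680, 300, 692]">Plain words</LTTextLineHorizontal>
  </LTPage>
</pdfxml>
"""


def text_to_svg(xml_text, font_size=10):
    """Place each text line of the first page at its bounding-box position."""
    root = ET.fromstring(xml_text.strip())
    page = next(root.iter("LTPage"))
    _, _, page_w, page_h = json.loads(page.get("bbox"))
    labels = []
    for line in page.iter("LTTextLineHorizontal"):
        text = (line.text or "").strip()
        if not text:
            continue  # skip empty lines
        x0, y0, _, _ = json.loads(line.get("bbox"))
        # Anchor the label at the bottom-left corner of its box, flipping y for SVG.
        labels.append(
            f'<text x="{round(x0, 2)}" y="{round(page_h - y0, 2)}" '
            f'font-size="{font_size}">{escape(text)}</text>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{page_w}" height="{page_h}">'
        + "".join(labels)
        + "</svg>"
    )


text_svg = text_to_svg(SAMPLE_XML)
```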

Here is the output:

Note that the text is not exactly where it was in the original document, because the font and font size used to draw it differ from those of the original PDF.

From the XML, we get an array of coordinates of the underlined regions. In my case, I have called it underline_text.
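In documents that also contain tables or frames, not every rectangle is an underline, so it may help to filter by shape when building this array. The sketch below is one possible heuristic, not a step from the original method: the function name `select_underlines` and both thresholds are assumptions you would tune for your own files.

```python
def select_underlines(rect_boxes, max_height=2.0, min_width=5.0):
    """Keep rectangles that are thin and wide enough to plausibly be underlines.

    Boxes are (x0, y0, x1, y1) tuples in PDF points. Both thresholds are
    illustrative guesses, not values taken from the article.
    """
    underline_text = []
    for x0, y0, x1, y1 in rect_boxes:
        if (y1 - y0) <= max_height and (x1 - x0) >= min_width:
            underline_text.append((x0, y0, x1, y1))
    return underline_text


rects = [
    (72.0, 698.0, 180.0, 699.0),  # thin and wide: likely an underline
    (72.0, 400.0, 300.0, 520.0),  # tall box: probably a table cell or frame
    (90.0, 650.0, 91.0, 651.0),   # tiny square: probably a bullet or artifact
]
underline_text = select_underlines(rects)
print(underline_text)
```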

Here's the process:

This method of extracting text from PDFs using coordinate rectangles and Tesseract OCR is effective for several reasons:

And this is the code:
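A minimal sketch of the crop-and-OCR step is shown here; it is not necessarily how the repository implements it. It assumes the page is rendered with pdf2image (which needs Poppler installed) and read with pytesseract. The function names and the `text_height` and `pad` defaults are hypothetical. Because the underline rectangle sits just below the words, the crop box is extended upward by `text_height` points to cover the text itself.

```python
def pdf_box_to_pixels(bbox, page_height, dpi=200, text_height=14.0, pad=2.0):
    """Map an underline box in PDF points to a pixel crop box around the text above it.

    PDF points are 1/72 inch with the origin at the bottom-left; image pixels
    have the origin at the top-left, so the y axis is flipped.
    """
    x0, y0, x1, y1 = bbox
    scale = dpi / 72.0
    left = max(0.0, (x0 - pad) * scale)
    right = (x1 + pad) * scale
    # Extend upward by text_height so the crop covers the words, not just the line.
    upper = max(0.0, (page_height - (y1 + text_height)) * scale)
    lower = (page_height - (y0 - pad)) * scale
    return tuple(int(round(v)) for v in (left, upper, right, lower))


def ocr_underlined(pdf_path, underline_text, page_height, page_index=0, dpi=200):
    """Render one page, crop each underlined region, and OCR it with Tesseract."""
    # Third-party imports are kept local so the geometry helper has no dependencies.
    from pdf2image import convert_from_path
    import pytesseract

    page_image = convert_from_path(pdf_path, dpi=dpi)[page_index]
    words = []
    for bbox in underline_text:
        crop = page_image.crop(pdf_box_to_pixels(bbox, page_height, dpi))
        text = pytesseract.image_to_string(crop).strip()
        if text:
            words.append(text)
    return words
```

Here `page_height` is the page height in points, which can be read from the page element's bounding box in the XML. If the crops cut off tall letters or include the line above, adjust `text_height` to match the document's font size.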

Make sure that you have tesseract installed on your system before running this function. For in-depth instructions, check out their official installation guide here: https://github.com/tesseract-ocr/tessdoc/blob/main/Installation.md or in my GitHub repository here: https://github.com/sasha-korovkina/pdfUnderlinedExtractor.

Now, if we take any PDF file, like this example file:

We have some underlined words in this file:

After running the code described above, here is what we get:

After getting this array, you can use these words for further processing!

Enjoy using this script! I'd love to hear about any creative applications you come up with, or if you'd like to contribute. Let me know!
