How to extract data from PDF files with Python — nice and easy
My challenge this week was to extract MSCI’s Environmental, Social, and Governance (ESG) IVA scores from multiple PDF files. I only had access to the PDF versions available on MSCI’s website, so I had to figure out a way to extract all those numbers and put them in a data frame I could analyze.
Sounds easy, right? Just copy and paste a bunch of numbers from a PDF file? Well, yes, except that there were more than 100 files. It took me a while to figure out how to do it, but I managed to complete the task, and here is the code.
The report
Each report follows the same fixed structure, so all I needed to do was browse the file and extract the ESG scores.
The packages
Let’s start by listing the packages we are going to use in this case.
import pandas as pd
import pdfplumber
import re
pdfplumber: Extract text and tables from PDF files.
pandas: The good old pandas package will help us build and transform data frames.
re: Will help us find and match strings to patterns.
Preparing the file
To prep the file, we use the pdfplumber package. One thing I must mention is that there is more than one way, and several packages, that can help you approach this problem with Python. The package I decided to use, pdfplumber, is the one that worked best for the particular files I was using. However, I have used other packages on other files, like PyPDF2, that have also helped me extract data from PDF files.
It’s worth trying a few packages to find the one that works best for your file, because each one parses data differently. Here is a good article that covers the different packages available and how you can use them.
pdf = pdfplumber.open('your-file-folder/your-file-name.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
Set up
Since the objective is to extract ESG scores from the document, I used a regex function to define the pattern that I was looking for.
MSCI IVA ESG scores are decimal numbers between 0 and 10 with one decimal place. We want to capture the ones digit and the tenths digit, so we are looking for numbers with the pattern X.X.
A number such as 34.6 could appear in the same line we are extracting, and we would end up capturing the 4.6 part of it. To avoid matching pieces of other numbers, I require a whitespace character (\s) before and after the number. The dot also needs to be escaped (\.), since an unescaped dot matches any character. I define this with a regex function that finds all the strings with this pattern within a given piece of text. If you want to learn more about writing regular expressions, check this page out.
def getNumbers(s):
    # Match a one-decimal number bounded by whitespace, e.g. " 5.2 "
    return re.findall(r'\s\d\.\d\s', s)
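As a quick sanity check of the pattern (the line below is an invented example, not actual report content), note how the surrounding \s keeps us from grabbing part of a larger number:

```python
import re

line = " ENVIRONMENTAL 34.6 5.2 "       # invented example line
print(re.findall(r'\d\.\d', line))      # without \s, this also grabs part of 34.6
print(re.findall(r'\s\d\.\d\s', line))  # with \s, only the standalone score matches
```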
Extracting data
Once we have the data in text format, we can extract the lines where the pieces of information are.
Since we are looking to extract the scores for environmental, social, and governance, we can locate the pieces of the text where these strings are.
In this case, we are looking for lines that contain the words ENVIRONMENTAL, SOCIAL, and GOVERNANCE.
We are also going to look for the ticker, which appears on the second line of the report. To extract that line, we create another string, “TICKER”, and pull out the lines that contain this word.
# Create an empty dataframe
df = pd.DataFrame()

# Compile the strings we are looking for
e_score = re.compile("ENVIRONMENTAL")
s_score = re.compile("SOCIAL")
g_score = re.compile("GOVERNANCE")
ticker_data = re.compile("TICKER")
The next step is to split the text into lines. We then locate the specific lines that contain the strings we are looking for and apply the getNumbers function to extract the numbers that match the pattern we defined. Each result is saved as a column in the df data frame.
The last part of this code creates a column for the ticker. In this case, we are looking for Facebook’s ticker, which is FB. Since it only has two letters, we create another regex that looks for a colon (:), followed by a blank space (\s), two capital letters ([A-Z]{2}), and another blank space (\s).
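To see the ticker pattern in action (again on a made-up line, since I can’t reproduce the real report text here):

```python
import re

line = "TICKER: FB INDUSTRY: SOCIAL MEDIA"  # made-up line for illustration
print(re.findall(r':\s[A-Z]{2}\s', line))   # captures the colon, ticker, and spaces
```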
for line in text.split('\n'):
    if e_score.search(line):
        df['e'] = getNumbers(line)
    if s_score.search(line):
        df['s'] = getNumbers(line)
    if g_score.search(line):
        df['g'] = getNumbers(line)
    if ticker_data.search(line):
        # Capture the two-letter ticker, e.g. ': FB '
        df['ticker'] = re.findall(r':\s[A-Z]{2}\s', line)
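Putting the pieces together, here is a self-contained run of the loop on placeholder text. These lines are invented stand-ins for what pdfplumber extracts; the real report text will differ:

```python
import re
import pandas as pd

def getNumbers(s):
    # Match a one-decimal number bounded by whitespace, e.g. " 5.2 "
    return re.findall(r'\s\d\.\d\s', s)

# Invented stand-in for the text pdfplumber extracts from a report
text = ("MSCI ESG RATING REPORT\n"
        "TICKER: FB \n"
        "ENVIRONMENTAL 5.2 \n"
        "SOCIAL 3.4 \n"
        "GOVERNANCE 6.1 \n")

df = pd.DataFrame()
e_score = re.compile("ENVIRONMENTAL")
s_score = re.compile("SOCIAL")
g_score = re.compile("GOVERNANCE")
ticker_data = re.compile("TICKER")

for line in text.split('\n'):
    if e_score.search(line):
        df['e'] = getNumbers(line)
    if s_score.search(line):
        df['s'] = getNumbers(line)
    if g_score.search(line):
        df['g'] = getNumbers(line)
    if ticker_data.search(line):
        df['ticker'] = re.findall(r':\s[A-Z]{2}\s', line)

print(df)
```

The scores come out with the surrounding whitespace still attached, and the ticker still carries its colon, which is exactly what the cleaning step below takes care of.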
Cleaning data
I ended up having an extra colon (plus the surrounding spaces) in the ticker column, so I used the pandas string functions to clean it.
df['ticker'] = df['ticker'].str.replace(':', '', regex=False).str.strip()
print(df)
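On a made-up value shaped like what the ticker regex captures, the cleanup looks like this (note that I also strip the leftover spaces):

```python
import pandas as pd

tickers = pd.Series([': FB '])  # made-up value shaped like the regex capture
cleaned = tickers.str.replace(':', '', regex=False).str.strip()
print(cleaned)
```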
If you enjoyed this article, please follow me here on Medium for more stories about data processing.