Python: Read an English-language PDF into a text file


Source: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

Preparation, mainly for textract's system dependencies

sudo apt remove --purge python-pip
sudo apt -y install python3-pip libpulse-dev build-essential autoconf libtool \
pkg-config python-opengl python-pyrex python-pyside.qtopengl python-pdfminer \
qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test \
libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 \
python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext \
tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig \
python-dev libssl-dev python-imaging-doc-html python-imaging-doc-pdf
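
A quick way to confirm that the external tools textract relies on ended up on the PATH is to check from Python. This is a minimal sketch; the file name check_deps.py and the particular tools checked are suggestions, not part of the original article.

# check_deps.py -- verify that the external tools textract calls are on the PATH
import shutil

for tool in ("tesseract", "pdftotext", "antiword"):
    path = shutil.which(tool)
    print(tool, "->", path if path else "NOT FOUND")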

Check the pip version

pip3 --version

Exit sudo / the superuser shell. As a regular user, install the required packages via pip:

pip3 install greenlet
pip3 install gevent
pip3 install PyPDF2
pip3 install nltk
pip3 install textract
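
NLTK's word_tokenize() and stopwords need their data files downloaded once before the script below will run. A minimal sketch, to be run once from a Python shell or saved as its own small script:

# Download the NLTK data used by readpdf.py (run once)
import nltk

nltk.download('punkt')      # tokenizer models used by word_tokenize()
nltk.download('stopwords')  # English stopword list used for keyword filtering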

Source code (for example, readpdf.py)

# Load Library
import PyPDF2 
import textract

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#Name of the PDF file to read; wrap this in a for-loop to process many files
filename = 'enter the name of the file here'

#open allows you to read the file
pdfFileObj = open(filename,'rb')

#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

#Knowing the number of pages lets us parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""

#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText() 

#This if statement checks whether the library above returned any words. It is needed because PyPDF2 cannot read scanned (image-based) files.
if text != "":
   text = text

#If the above returned nothing, we run the OCR library textract to convert scanned/image-based PDF files into text
else:
   text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')  # textract returns bytes, so decode to str

# Now we have a text variable that contains all the text extracted from our PDF file. Type print(text) to see what it contains. It likely contains a lot of whitespace and possibly junk such as '\n'.

# Now we will clean the text variable and return it as a list of keywords.



#The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)

#we'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']

#We initialize the stopwords variable, which is a list of words like "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')

#We create a list comprehension that only returns words that are NOT in stop_words and NOT in punctuations
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
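
The script above only builds the keywords list in memory. To actually end up with a text file, as the title suggests, something like the following can be appended to the end of readpdf.py; the output name keywords.txt is an assumption, not taken from the original article.

# Print the keywords and also save them to a plain text file
print(keywords)

with open('keywords.txt', 'w', encoding='utf-8') as out:
    out.write('\n'.join(keywords))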


Running it

python3 readpdf.py



References

Interesting Links