Python: Reading an English-language PDF into a text file
Source: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
Preparation, mainly for textract
sudo apt remove --purge python-pip
sudo apt -y install python3-pip libpulse-dev build-essential autoconf libtool \
  pkg-config python-opengl python-pyrex python-pyside.qtopengl \
  qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test \
  libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 \
  python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext \
  tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig \
  python-dev libssl-dev python-imaging-doc-html python-imaging-doc-pdf
Check the pip version
pip3 --version
Exit sudo and the superuser shell. As a regular user, install the required packages via pip:
pip3 install greenlet
pip3 install gevent
pip3 install PyPDF2
pip3 install nltk
pip3 install textract
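The word_tokenize() and stopwords calls used in the script below also need NLTK's data packages (punkt and stopwords). If they have never been downloaded on this machine, one way to fetch them, still as a regular user, is:

python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"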
Source code (for example, readpdf.py)
# Load libraries
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Write a for-loop here if you want to process many files
filename = 'enter the name of the file here'

# open() gives us a binary file object to read from
pdfFileObj = open(filename, 'rb')

# The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Knowing the number of pages lets us loop through all of them
num_pages = pdfReader.numPages
count = 0
text = ""

# The while loop reads each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

# PyPDF2 cannot read scanned (image-based) PDFs, so if it returned
# no words we fall back to the OCR library textract, which drives
# tesseract to convert scanned PDFs into text.
# textract.process() returns bytes, so decode them to a string.
if text == "":
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

# The text variable now holds all the text extracted from the PDF.
# print(text) will show it; expect plenty of whitespace and junk
# such as '\n'. Next we clean it up into a list of keywords.
# word_tokenize() breaks the text into individual words
tokens = word_tokenize(text)

# A list of punctuation characters we wish to clean out
punctuations = ['(', ')', ';', ':', '[', ']', ',']

# stopwords gives a list of words like "the", "I", "and", etc.
# that don't hold much value as keywords
stop_words = stopwords.words('english')

# Keep only the words that are NOT in stop_words and NOT in punctuations
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
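As written, the script only builds the text and keywords variables in memory. To end up with an actual text file, as the title promises, one possible closing step is sketched below; the output name output.txt and the choice to save the raw text rather than the keywords are assumptions, not part of the original tutorial.

# Hypothetical closing step: save the extracted text to a file
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)

# Optionally inspect the cleaned keyword list
print(keywords)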
Running it
python3 readpdf.py
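If you added the hypothetical save-to-file step above, the result can be checked afterwards, for example with:

head output.txt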