Python: Read an English-language PDF into a text file
Revision as of 09:04, 30 October 2018 by Onnowpurbo
Source: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
Preparation, mainly for textract
sudo su
apt remove --purge python-pip
apt -y install python3-pip
apt -y install libpulse-dev
apt -y install build-essential autoconf libtool pkg-config python-opengl \
  python-pyrex python-pyside.qtopengl \
  qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test \
  libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 \
  python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext \
  tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig \
  python-dev libssl-dev python-imaging-doc-html python-imaging-doc-pdf
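textract relies on external programs for several formats, so it can help to confirm they are on the PATH before running any Python code. A minimal sketch, assuming the command names `tesseract` (from tesseract-ocr), `pdftotext` (from poppler-utils), `antiword` and `unrtf`; the helper `missing_tools` is hypothetical and not part of textract:

```python
import shutil

# Command names assumed to be provided by the apt packages installed above.
REQUIRED_TOOLS = ["tesseract", "pdftotext", "antiword", "unrtf"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the commands from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All external tools found.")
```

An empty result suggests the apt step succeeded; any name that is reported points at the package to reinstall.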
Check the pip version
pip3 --version
Install the packages via pip
pip3 install greenlet
pip3 install gevent
pip3 install PyPDF2
pip3 install textract
pip3 install nltk
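After the pip step, a quick check that each package can be found by the interpreter may catch a failed install early. This is only a sketch: the import names in `MODULES` are assumed to match the pip package names above, and `missing_modules` is a hypothetical helper:

```python
import importlib.util

# Import names assumed for the packages installed with pip3 above.
MODULES = ["greenlet", "gevent", "PyPDF2", "textract", "nltk"]


def missing_modules(names=MODULES):
    """Return the module names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing modules:", missing_modules() or "none")
```

`find_spec` only locates a module without importing it, so the check stays fast and does not trigger any side effects of the packages themselves.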
Code
# Load the libraries
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Write a for-loop here to process many files
filename = 'enter the name of the file here'

# Open the file in binary mode so PyPDF2 can read it
pdfFileObj = open(filename, 'rb')

# pdfReader is a readable object that will be parsed
# (PdfFileReader is the older PyPDF2 name; PyPDF2 3.x replaces it with PdfReader)
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Knowing the number of pages lets us parse through all of them
# (numPages, getPage() and extractText() are the older PyPDF2 names;
# PyPDF2 3.x uses len(reader.pages), reader.pages[i] and extract_text())
num_pages = pdfReader.numPages
count = 0
text = ""

# The while loop reads each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

# PyPDF2 cannot read scanned files, so an empty result means no words
# were returned. In that case we run the OCR library textract to
# convert the scanned/image-based PDF file into text.
if text == "":
    # textract.process() returns bytes, so decode it to str for NLTK
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

# Now we have a text variable which contains all the text derived
# from our PDF file. Type print(text) to see what it contains. It
# likely contains a lot of spaces, possibly junk such as '\n' etc.

# Next, we clean the text variable and turn it into a list of keywords.
# word_tokenize() breaks our text into individual words
# (NLTK needs its data once; in a Python shell run:
#  import nltk; nltk.download('punkt'); nltk.download('stopwords'))
tokens = word_tokenize(text)

# A list of the punctuation we wish to remove
punctuations = ['(', ')', ';', ':', '[', ']', ',']

# stop_words is a list of words like "the", "i", "and", etc.
# that don't hold much value as keywords
stop_words = stopwords.words('english')

# A list comprehension that keeps only the words that are
# NOT in stop_words and NOT in punctuations. NLTK's stop words are
# lowercase, so the token is lowercased before the comparison.
keywords = [word for word in tokens if word.lower() not in stop_words and word not in punctuations]
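The script handles one file and leaves the result in memory, while its first comment suggests looping over many files and the page title promises a text file. A possible sketch of that last step follows. It assumes the extraction steps above have been wrapped in a function that takes a PDF path and returns a string; `extract_text`, `normalize_whitespace` and `pdfs_to_text_files` are hypothetical names, not part of PyPDF2 or textract:

```python
from pathlib import Path


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return " ".join(text.split())


def pdfs_to_text_files(folder, extract_text):
    """Write one .txt file next to every .pdf file in `folder`.

    `extract_text` is a hypothetical callable that takes a PDF path (str)
    and returns the extracted text as str, for example the PyPDF2/textract
    steps above wrapped in a function.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # Clean up stray newlines and repeated spaces before saving
        text = normalize_whitespace(extract_text(str(pdf_path)))
        txt_path = pdf_path.with_suffix(".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling `pdfs_to_text_files('papers', my_extract)`, where `'papers'` and `my_extract` are placeholders, would leave `report.txt` beside `report.pdf` and return the list of files written. Passing the extractor in as an argument keeps the loop independent of which PyPDF2 version or OCR method is used.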