Python: Baca PDF bahasa Inggris untuk jadi text file (Python: Reading an English PDF into a Text File)

From OnnoWiki

Latest revision as of 09:50, 30 October 2018

Source: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

Preparation, mainly for textract

sudo apt remove --purge python-pip
sudo apt -y install python3-pip libpulse-dev build-essential autoconf libtool \
pkg-config python-opengl python-pyrex python-pyside.qtopengl python-pdfminer \
qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test \
libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 \
python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext \
tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig \
libssl-dev python-imaging-doc-html python-imaging-doc-pdf

Check the pip version

pip3 --version

Leave the superuser shell if you are in one. As a regular user, install the Python packages via pip

pip3 install greenlet
pip3 install gevent
pip3 install PyPDF2
pip3 install nltk
pip3 install textract
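
Before running the script, it can help to confirm that the packages installed above are importable. The following is a minimal sketch and not part of the source article; the helper name `missing_packages` is hypothetical, and it only checks that each module can be located, not that it works correctly.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the packages installed above (PyPDF2 is case-sensitive)
    required = ["greenlet", "gevent", "PyPDF2", "nltk", "textract"]
    missing = missing_packages(required)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All packages found")
```

Any name reported as missing can be installed again with `pip3 install`.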

Source code (for example readpdf.py)

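# Load libraries
import PyPDF2
import textract

# word_tokenize and stopwords need NLTK data, downloaded once with
# nltk.download('punkt') and nltk.download('stopwords'); newer NLTK
# releases may also ask for 'punkt_tab'.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords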

# Path of the PDF to read (wrap the steps below in a for-loop to open many files)
filename = 'enter the name of the file here'

# open() in binary mode lets PyPDF2 read the file
pdfFileObj = open(filename, 'rb')

# The pdfReader object parses the opened PDF.
# PdfReader replaces PdfFileReader, which no longer works in PyPDF2 3.0.
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# Read every page and collect its text
text = ""
for pageObj in pdfReader.pages:
    # extract_text() can come back empty for image-only pages
    text += pageObj.extract_text() or ""

pdfFileObj.close()

# PyPDF2 cannot read scanned files, so an empty result means the PDF
# has no text layer. In that case, fall back to the OCR support in
# textract to convert the scanned/image-based PDF into text.
if text.strip() == "":
    # textract.process() returns bytes, so decode it to str
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

# The text variable now holds all the text extracted from the PDF.
# Use print(text) to see what it contains. It likely contains a lot of
# whitespace and possibly junk such as '\n'.

# Next, clean the text and turn it into a list of keywords.

# word_tokenize() breaks the text into individual words
tokens = word_tokenize(text)

# A list of the punctuation we wish to remove
punctuations = ['(',')',';',':','[',']',',']

# stop_words holds words like "the", "i", "and", etc. that don't hold
# much value as keywords. NLTK stores them in lower case.
stop_words = stopwords.words('english')

# Keep only the words that are NOT in stop_words and NOT in punctuations.
# Comparing word.lower() lets capitalized forms such as "The" match too.
keywords = [word for word in tokens if word.lower() not in stop_words and word not in punctuations]
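
The filtering step can be tried on its own, without downloading NLTK data. The sketch below is an illustration only: the function name `filter_keywords` and the sample lists are hypothetical stand-ins for the output of `word_tokenize()` and `stopwords.words('english')`. It compares tokens in lower case, on the assumption that the stop word list is stored in lower case.

```python
def filter_keywords(tokens, stop_words, punctuations):
    """Keep tokens that are neither stop words nor punctuation."""
    stop_set = {word.lower() for word in stop_words}
    punct_set = set(punctuations)
    return [
        token for token in tokens
        if token.lower() not in stop_set and token not in punct_set
    ]


# Hypothetical sample data standing in for word_tokenize() and stopwords.words()
sample_tokens = ["The", "reader", "parses", "(", "scanned", ")", "files", "and", "text"]
sample_stop_words = ["the", "and", "i"]
sample_punctuations = ["(", ")", ";", ":", "[", "]", ","]

keywords = filter_keywords(sample_tokens, sample_stop_words, sample_punctuations)
print(keywords)
```

Using sets for the lookups keeps the filter fast on long documents; the order of the surviving tokens is unchanged.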


Running

python3 readpdf.py



References

Interesting Links