Python: NLTK download corpus

From OnnoWiki

NLTK corpora can be downloaded with a short script, for example download-corpus.py:

import nltk
nltk.download()

Run it:

python download-corpus.py

The text-mode downloader menu appears:

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Enter d to start a download. The downloader shows the available packages and collections, then asks which one to fetch:

Packages:
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] mwa_ppdb............ The monolingual word aligner (Sultan et al.
                           2015) subset of the Paraphrase Database.
  [ ] nonbreaking_prefixes Non-Breaking Prefixes (Moses Decoder)
  [-] panlex_lite......... PanLex Lite Corpus
  [ ] pe08................ Cross-Framework and Cross-Domain Parser
                           Evaluation Shared Task
  [-] perluniprops........ perluniprops: Index of Unicode Version 7.0.0
                           character properties in Perl
  [ ] porter_test......... Porter Stemmer Test Files
  [-] stopwords........... Stopwords Corpus
  [ ] vader_lexicon....... VADER Sentiment Lexicon
  [ ] wmt15_eval.......... Evaluation data from WMT15

Collections:
  [-] all-corpora......... All the corpora
  [-] all................. All packages
  [-] book................ Everything used in the NLTK Book

([*] marks installed packages; [-] marks out-of-date or corrupt packages)

Download which package (l=list; x=cancel)?
  Identifier>

At the Identifier> prompt, enter

all

to fetch everything without having to pick individual packages. Note that this uses a lot of bandwidth. The output looks like this:

   Downloading collection u'all'
      | 
      | Downloading package abc to /home/onno/nltk_data...
      |   Package abc is already up-to-date!
      | Downloading package alpino to /home/onno/nltk_data...
      |   Package alpino is already up-to-date!
      | Downloading package biocreative_ppi to
      |     /home/onno/nltk_data...
      |   Package biocreative_ppi is already up-to-date!
...
...
and so on ...
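
If bandwidth is a concern, the same download can be done without the interactive menu and for selected packages only. The following is a minimal sketch, assuming that nltk.download() accepts a package identifier plus the download_dir and quiet keyword arguments and returns True on success. The helper name download_packages and its fetch parameter are hypothetical and exist only for illustration.

```python
def download_packages(package_ids, download_dir=None, fetch=None):
    """Download each NLTK package ID without the interactive menu.

    `fetch` is a hypothetical hook for testing; by default it is
    nltk.download, which is assumed to return True on success.
    """
    if fetch is None:
        import nltk  # imported lazily so the helper can be tested without NLTK
        fetch = nltk.download

    results = {}
    for package_id in package_ids:
        # quiet=True suppresses the per-package progress lines;
        # download_dir=None lets NLTK pick its default data directory.
        ok = fetch(package_id, download_dir=download_dir, quiet=True)
        results[package_id] = bool(ok)
    return results
```

Calling download_packages(["stopwords", "vader_lexicon"]) would fetch only those two packages from the listing above and return a dictionary that maps each identifier to True or False.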

The NLTK corpora are stored in

~/nltk_data/

The directory is fairly large.
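
To see how much disk space the downloaded data actually takes, the directory can be measured with the standard library alone. This is a rough sketch, assuming the data lives in ~/nltk_data as stated above (NLTK can be configured to use other locations). The function names directory_size_bytes and format_size are made up for this example.

```python
from pathlib import Path


def directory_size_bytes(root):
    """Sum the sizes of all regular files below `root` (0 if it is not a directory)."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def format_size(num_bytes):
    """Render a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    # Path taken from the text above; adjust if the data is stored elsewhere.
    print(format_size(directory_size_bytes("~/nltk_data")))
```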