Difference between revisions of "Python: NLTK download corpus"

From OnnoWiki

Revision as of 05:17, 2 February 2017

The corpora for NLTK can be downloaded with a script, for example download-corpus.py

import nltk
nltk.download()  # no arguments: starts the interactive NLTK Downloader

Run it:

python download-corpus.py

The following menu appears:

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------

Choose d (Download) so that all of the available corpora can be downloaded in one go, which saves trouble later. The following is displayed:

Packages:
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] mwa_ppdb............ The monolingual word aligner (Sultan et al.
                           2015) subset of the Paraphrase Database.
  [ ] nonbreaking_prefixes Non-Breaking Prefixes (Moses Decoder)
  [-] panlex_lite......... PanLex Lite Corpus
  [ ] pe08................ Cross-Framework and Cross-Domain Parser
                           Evaluation Shared Task
  [-] perluniprops........ perluniprops: Index of Unicode Version 7.0.0
                           character properties in Perl
  [ ] porter_test......... Porter Stemmer Test Files
  [-] stopwords........... Stopwords Corpus
  [ ] vader_lexicon....... VADER Sentiment Lexicon
  [ ] wmt15_eval.......... Evaluation data from WMT15

Collections:
  [-] all-corpora......... All the corpora
  [-] all................. All packages
  [-] book................ Everything used in the NLTK Book

([*] marks installed packages; [-] marks out-of-date or corrupt packages)

Download which package (l=list; x=cancel)?
  Identifier>

At the Identifier> prompt, enter

all

to keep things simple, but note that this will use a lot of bandwidth.
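
When bandwidth is limited, one alternative is to skip the interactive menu and fetch only the packages that are needed. The following is a minimal sketch, not part of the original script: the helper name download_packages and its download parameter are hypothetical, and it assumes that nltk.download() accepts a package identifier and returns True on success.

```python
def download_packages(names, download=None):
    """Download each named NLTK package and return {name: succeeded}.

    `download` is a hypothetical hook for illustration; by default it is
    nltk.download, which is assumed to return True when a package is fetched.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be tried without NLTK
        download = nltk.download
    # One call per identifier, e.g. "stopwords" from the package listing.
    return {name: bool(download(name)) for name in names}
```

For example, download_packages(["stopwords", "vader_lexicon"]) would request just those two identifiers from the listing instead of the whole "all" collection.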


The downloaded data is saved in

~/nltk_data/

It is fairly large.
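
To see how much disk space the download actually takes, the directory can be measured with the standard library. This is a rough sketch: the function name dir_size_bytes is made up for illustration, and it simply adds up the sizes of all regular files under the given directory.

```python
from pathlib import Path


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    root = Path(root).expanduser()  # expands "~" to the home directory
    if not root.exists():
        return 0  # nothing downloaded yet
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
```

Calling dir_size_bytes("~/nltk_data") and dividing the result by 1e6 gives the approximate size in megabytes.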