Scraping: Word Cloud, Word Frequency


✅ Analysis Features:

1. Word frequency (word count) (a minimal sketch of this counting step follows the list)
2. Visual word cloud
3. Top N most frequent words
4. Optional: remove Indonesian stopwords so the results are cleaner
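
The core of points 1, 3, and 4 is simply tokenizing the text, dropping stopwords, and counting what remains with collections.Counter. A minimal sketch, using a made-up sentence and a tiny hand-written stopword set instead of the NLTK corpus that the full script below relies on:

from collections import Counter

# Illustrative text and a tiny stopword set (assumptions for this sketch only)
teks = "data mining dan analisis data untuk laporan data"
stopwords_contoh = {"dan", "untuk"}

kata = [k for k in teks.lower().split() if k not in stopwords_contoh]
print(Counter(kata).most_common(3))  # [('data', 3), ('mining', 1), ('analisis', 1)]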

First, install the additional libraries:

pip install nltk matplotlib wordcloud

Then run this once to download the stopword corpus:

import nltk
nltk.download('stopwords')
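
To confirm the download worked and that the Indonesian list is available, a quick check like the following should print the number of stopwords and a small sample (the slice of 10 entries is arbitrary):

from nltk.corpus import stopwords

# Show how many Indonesian stopwords NLTK provides, plus a small sample
print(len(stopwords.words('indonesian')))
print(stopwords.words('indonesian')[:10])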

Additional analysis script (continuing from the previous script):

import os
import string
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

# Build the Indonesian stopword set
stop_words = set(stopwords.words('indonesian'))

def bersihkan_teks(teks):
    # Lowercase + remove punctuation
    teks = teks.lower()
    teks = teks.translate(str.maketrans('', '', string.punctuation))
    return teks

def analisis_folder(folder='outputs', top_n=20):
    semua_konten = ''

    for filename in os.listdir(folder):
        if filename.endswith('.txt'):
            filepath = os.path.join(folder, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                isi = f.read()
                # Keep only the body, after the header
                if '\n\n' in isi:
                    isi = isi.split('\n\n', 1)[1]
                semua_konten += isi + ' '

    # Clean the text and tokenize
    teks_bersih = bersihkan_teks(semua_konten)
    kata_kata = teks_bersih.split()

    # Remove stopwords and non-alphabetic tokens
    kata_kata = [kata for kata in kata_kata if kata not in stop_words and kata.isalpha()]

    # Count word frequencies
    counter = Counter(kata_kata)
    top_kata = counter.most_common(top_n)

    print(f"\n🔍 Top {top_n} Most Frequent Words:")
    for kata, jumlah in top_kata:
        print(f"{kata}: {jumlah}")

    # Build the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(counter)

    # Display and save it
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud of All Content')
    plt.tight_layout()
    plt.savefig('wordcloud.png')
    plt.show()

    print("\n✅ Word cloud saved as 'wordcloud.png'")

# Run the analysis
if __name__ == '__main__':
    analisis_folder('outputs', top_n=30)

Output:

  • The console displays the 30 most frequent words
  • The word cloud is saved as: `wordcloud.png`

The results can be used to:

  • Identify the dominant topics
  • Add insight to a report or journal article
  • Provide material for visualizations in a presentation (a CSV export sketch follows this list)
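
For the report and presentation use cases, the frequency counts can be exported to a CSV file that spreadsheet and charting tools read directly. The sketch below is an illustration only: the function name simpan_frekuensi_csv and the file name frekuensi_kata.csv are made up here, and analisis_folder() above would need to return its counter (or the counts would have to be recomputed) before this is applied to real data.

import csv
from collections import Counter

def simpan_frekuensi_csv(counter, path='frekuensi_kata.csv', top_n=30):
    # Write the top-N word frequencies to a CSV file for reports or charts
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['kata', 'jumlah'])
        writer.writerows(counter.most_common(top_n))

# Example with a dummy Counter; in practice, pass the counter from analisis_folder()
simpan_frekuensi_csv(Counter({'data': 12, 'internet': 9, 'jaringan': 7}))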


Interesting Links