Scraping: Save the Text of Each Search Result

Save the content of each URL to its own `.txt` file, one file per page. Every search result is stored as a separate `.txt` file so it can be read or analyzed easily.

FEATURES:

  • Read `keywords.txt` → search Google for each keyword
  • Take the top-N search results per keyword
  • Visit each URL and grab the title + article body (first 5 paragraphs)
  • Save everything to the `outputs/` folder in `.txt` format:
    • Filename: `keyword_rank_title.txt` (see the sketch after this list)
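
To make the naming scheme concrete, here is a minimal standalone sketch of how one filename is assembled (the `clean_filename` helper mirrors the one in the full script below):

import re

def clean_filename(text):
    # Strip characters that are illegal in filenames, turn spaces
    # into underscores, and cap the length at 50 characters
    return re.sub(r'[\\/*?:"<>|]', '', text).strip().replace(' ', '_')[:50]

keyword, rank, title = 'berita teknologi Indonesia', 1, 'Tekno Terbaru dari Tempo'
print(f"{clean_filename(keyword)}_{rank}_{clean_filename(title)}.txt")
# -> berita_teknologi_Indonesia_1_Tekno_Terbaru_dari_Tempo.txt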

SETUP:

pip install googlesearch-python requests beautifulsoup4
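
To verify that all three packages installed correctly, a quick import check from the shell:

python -c "import googlesearch, requests, bs4; print('OK')"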

Create a `keywords.txt` file:

berita teknologi Indonesia
politik 2025
game PS5 terbaru
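
Each non-empty line becomes one search query; the `load_keywords` function in the script below trims whitespace and skips blank lines.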

THE COMPLETE PYTHON SCRIPT:

import os
import requests
from googlesearch import search
from bs4 import BeautifulSoup
import re
import time

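# Read search keywords from a text file: one keyword per line,
# trimming whitespace and skipping blank lines.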
def load_keywords(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def clean_filename(text):
    # Remove characters that are illegal in filenames, turn spaces
    # into underscores, and cap the length at 50 characters
    return re.sub(r'[\\/*?:"<>|]', '', text).strip().replace(' ', '_')[:50]

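# Fetch a URL and extract its title plus the first five <p>
# paragraphs; on any failure, return ('Error', <message>) instead.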
def get_page_content(url):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat HTTP errors (4xx/5xx) as failures
        soup = BeautifulSoup(response.content, 'html.parser')
        title = soup.title.string if soup.title and soup.title.string else 'No Title'
        paragraphs = soup.find_all('p')
        content = '\n\n'.join([p.get_text() for p in paragraphs[:5]])
        return title.strip(), content.strip()
    except Exception as e:
        return 'Error', f"Failed to fetch content: {e}"

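# Write one search result to outputs/ as keyword_rank_title.txt,
# with a small metadata header followed by the extracted content.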
def save_to_txt(keyword, rank, title, url, content, folder='outputs'):
    os.makedirs(folder, exist_ok=True)
    filename = f"{clean_filename(keyword)}_{rank}_{clean_filename(title)}.txt"
    filepath = os.path.join(folder, filename)

    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(f"Keyword   : {keyword}\n")
        f.write(f"Peringkat : {rank}\n")
        f.write(f"Judul     : {title}\n")
        f.write(f"URL       : {url}\n\n")
        f.write(content)

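# For each keyword: search Google, fetch the top-N result pages,
# and save each page to its own .txt file. The 2-second sleep is
# a simple politeness delay between requests.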
def scrape_and_save_txt(keywords, num_results=5):
    for keyword in keywords:
        print(f"\n🔍 Searching: {keyword}")
        try:
            results = search(keyword, num_results=num_results)
            for i, url in enumerate(results):
                print(f"  → ({i+1}) Fetching: {url}")
                title, content = get_page_content(url)
                save_to_txt(keyword, i+1, title, url, content)
                time.sleep(2)
        except Exception as e:
            print(f"❌ Error saat mencari '{keyword}': {e}")

    print("\n✅ Semua konten telah disimpan di folder 'outputs/'")

# Main
if __name__ == '__main__':
    keywords = load_keywords('keywords.txt')
    scrape_and_save_txt(keywords, num_results=5)
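
Assuming the script is saved as, say, `scrape.py` (the filename is arbitrary), run it in the same directory as `keywords.txt`:

python scrape.py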

Output:

The `outputs/` folder will contain files such as:

berita_teknologi_Indonesia_1_Tekno_Terbaru_dari_Tempo.txt
politik_2025_2_Media_Indonesia_Politik.txt

Inside each file:

Keyword : berita teknologi Indonesia
Rank    : 1
Title   : Berita Teknologi Terbaru Hari Ini - Tempo
URL     : https://tekno.tempo.co/...

[ARTICLE PARAGRAPHS]
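
Because every result lives in its own `.txt` file, follow-up analysis is straightforward. As a minimal sketch, this loop reads each saved file back and prints a rough word count per page:

import os

folder = 'outputs'
for name in sorted(os.listdir(folder)):
    if not name.endswith('.txt'):
        continue
    with open(os.path.join(folder, name), encoding='utf-8') as f:
        text = f.read()
    print(f"{name}: {len(text.split())} words")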


Interesting Links