Scraping: Save the text of each URL

FEATURES:

  • Input from `keywords.txt` (see the example file after this list)
  • Search Google for each keyword (take the top-N URLs)
  • Visit each URL and fetch its content (title + paragraphs)
  • Save everything to `scraped_results.csv`
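
For illustration, `keywords.txt` holds one search phrase per line. The entries below are only sample keywords:

berita teknologi Indonesia
open source software
linux server tutorial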

Requirements:

pip install googlesearch-python requests beautifulsoup4

FULL SCRIPT:

from googlesearch import search
import requests
from bs4 import BeautifulSoup
import csv
import time

def load_keywords(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def get_page_content(url):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat HTTP error codes (4xx/5xx) as failures
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title (guard against an empty <title> tag)
        title = soup.title.string if soup.title and soup.title.string else 'No Title'

        # Get the main paragraph content
        paragraphs = soup.find_all('p')
        text_content = ' '.join([p.get_text() for p in paragraphs[:5]])  # limit to the first 5 paragraphs
        return title.strip(), text_content.strip()

    except Exception as e:
        return 'Error', f"Failed to fetch content: {e}"

def google_scrape_with_content(keywords, num_results=5, output_file='scraped_results.csv'):
    with open(output_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Keyword', 'Rank', 'Title', 'URL', 'Content']) 

        for keyword in keywords:
            print(f"\n🔍 Searching for: {keyword}")
            try:
                results = search(keyword, num_results=num_results)
                for i, url in enumerate(results):
                    print(f"  → Fetching: {url}")
                    title, content = get_page_content(url)
                    writer.writerow([keyword, i+1, title, url, content])
                    time.sleep(2)  # delay between requests to avoid being blocked
            except Exception as e:
                print(f"❌ Error while searching '{keyword}': {e}") 

    print(f"\n✅ All results + content saved to '{output_file}'")

# Main
if __name__ == '__main__':
    keywords = load_keywords('keywords.txt')
    google_scrape_with_content(keywords, num_results=5)

Output (`scraped_results.csv`):

| Keyword | Rank | Title | URL | Content |
|--------|------|-------|-----|---------|
| berita teknologi Indonesia | 1 | Title taken from the page | https://... | First few paragraphs |
| ... | ... | ... | ... | ... |
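
A minimal sketch for inspecting `scraped_results.csv` afterwards, using only Python's standard library; the column names match the header row written by the script above:

import csv

with open('scraped_results.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        # Print one summary line per scraped page
        print(f"[{row['Keyword']}] #{row['Rank']}: {row['Title']} -> {row['URL']}")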

Tips:

  • Do not use `num_results > 10` unless you also add a large delay between requests.
  • The script can be adapted to save to `.txt` or `.json` as well (see the sketch after this list).
  • Want to keep only news pages? Add a regex or a simple check such as `if "news" in url` (the sketch below applies this filter).
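
A minimal sketch combining the last two tips, assuming the `get_page_content` helper and the `search`/`time` imports from the full script above; the `.json` filename and the `"news" in url` filter are illustrative choices, not part of the original:

import json

def scrape_to_json(keywords, num_results=5, output_file='scraped_results.json'):
    results = []
    for keyword in keywords:
        for i, url in enumerate(search(keyword, num_results=num_results)):
            # Crude filter: keep only URLs that look like news pages
            if "news" not in url:
                continue
            title, content = get_page_content(url)  # reuse the helper from the full script
            results.append({'keyword': keyword, 'rank': i + 1,
                            'title': title, 'url': url, 'content': content})
            time.sleep(2)  # same polite delay as in the CSV version
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)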

Interesting Links