Scraping: Save the text of every URL
FEATURES:
- Reads input from `keywords.txt`
- Searches each keyword on Google (takes the `top-N` URLs)
- Visits each URL and extracts its content (title + paragraphs)
- Saves everything to `scraped_results.csv`
Requirements:
pip install googlesearch-python requests beautifulsoup4
FULL SCRIPT:
from googlesearch import search
import requests
from bs4 import BeautifulSoup
import csv
import time
def load_keywords(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]
def get_page_content(url):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Get the page title (soup.title.string can be None, e.g. for an empty <title>)
        title = soup.title.string if soup.title and soup.title.string else 'No Title'

        # Get the main paragraph content
        paragraphs = soup.find_all('p')
        text_content = ' '.join([p.get_text() for p in paragraphs[:5]])  # Limit to the first 5 paragraphs
        return title.strip(), text_content.strip()
    except Exception as e:
        return 'Error', f"Failed to fetch content: {e}"
def google_scrape_with_content(keywords, num_results=5, output_file='scraped_results.csv'):
    with open(output_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Keyword', 'Rank', 'Title', 'URL', 'Content']) 
        for keyword in keywords:
            print(f"\n🔍 Searching for: {keyword}")
            try:
                results = search(keyword, num_results=num_results)
                for i, url in enumerate(results):
                    print(f"  → Fetching: {url}")
                    title, content = get_page_content(url)
                    writer.writerow([keyword, i+1, title, url, content])
                    time.sleep(2)  # Delay between requests to reduce the risk of being blocked
            except Exception as e:
                print(f"❌ Error while searching '{keyword}': {e}") 
    print(f"\n✅ All results + content saved to '{output_file}'")
# Main
if __name__ == '__main__':
    keywords = load_keywords('keywords.txt')
    google_scrape_with_content(keywords, num_results=5)
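The fixed `time.sleep(2)` in the script could be swapped for a slightly randomized pause, so requests do not arrive at perfectly regular intervals. This is only a sketch: the helper name `polite_sleep` and its parameters `base`, `jitter`, and `sleep` are assumptions for illustration and are not part of the original script.

```python
import random
import time


def polite_sleep(base=2.0, jitter=1.0, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    `sleep` is injectable (hypothetical parameter) so the helper can be
    tested without actually waiting.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Inside the loop, `polite_sleep()` would replace `time.sleep(2)`; a larger `base` may be sensible when `num_results` is increased.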
Output (`scraped_results.csv`):
| Keyword | Rank | Title | URL | Content |
|--------|------|-------|-----|---------|
| berita teknologi Indonesia | 1 | Title of the page | https://... | The first paragraphs |
| ... | ... | ... | ... | ... |
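To check the output, or to reuse it in another format, the CSV can be read back with the standard library. The sketch below assumes the file has the five header columns shown above; the function name `csv_to_json` and the default file name `scraped_results.json` are hypothetical and not part of the original script.

```python
import csv
import json


def csv_to_json(csv_path="scraped_results.csv", json_path="scraped_results.json"):
    """Read the scraper's CSV and write the same rows as a JSON array."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # DictReader uses the header row (Keyword, Rank, ...) as dict keys.
        rows = list(csv.DictReader(f))

    for row in rows:
        # CSV stores every value as text; restore Rank to an integer.
        row["Rank"] = int(row["Rank"])

    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        json.dump(rows, f, ensure_ascii=False, indent=2)
    return rows
```

Calling `csv_to_json()` after the scraper has finished would produce a JSON file next to the CSV, one object per scraped URL.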
Tips:
- Don't use `num_results > 10` unless you also use a longer delay.
- The script can also be changed to save to `.txt` or `.json`.
- Want to filter out pages that aren't news? You can add a regex or `if "news" in url`.
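The last tip could look roughly like this. It is a minimal sketch: the pattern (the words "news" and "berita") and the names `NEWS_PATTERN`, `looks_like_news`, and `filter_news` are assumptions chosen for illustration, and a real filter would likely need tuning per site.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: words that often appear in news URLs. Adjust as needed.
NEWS_PATTERN = re.compile(r"news|berita", re.IGNORECASE)


def looks_like_news(url):
    """Return True if the host or path of the URL contains a news-like word."""
    parsed = urlparse(url)
    # Only host and path are checked, so query strings such as ?ref=news are ignored.
    return bool(NEWS_PATTERN.search(parsed.netloc + parsed.path))


def filter_news(urls):
    """Keep only the URLs that look like news pages."""
    return [u for u in urls if looks_like_news(u)]
```

In the main loop, a check such as `if not looks_like_news(url): continue` before fetching would skip the other pages. Note that skipped URLs would still count toward the rank unless the numbering is adjusted.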