Scraping: Save the text of each URL
- Input comes from `keywords.txt`
- Search Google for each keyword (take the `top-N` URLs)
- Visit each URL and extract its content (title + paragraphs)
- Save everything to `scraped_results.csv`
```bash
pip install googlesearch-python requests beautifulsoup4
```
```python
from googlesearch import search
import requests
from bs4 import BeautifulSoup
import csv
import time


def load_keywords(filename):
    """Read one keyword per line, skipping blank lines."""
    with open(filename, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def get_page_content(url):
    """Return (title, text) for a page, or ('Error', message) on failure."""
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title; .string is None when <title> has nested tags
        title = soup.title.string if soup.title and soup.title.string else 'No Title'

        # Get the main paragraph content, limited to the first 5 paragraphs
        paragraphs = soup.find_all('p')
        text_content = ' '.join([p.get_text() for p in paragraphs[:5]])

        return title.strip(), text_content.strip()
    except Exception as e:
        return 'Error', f"Failed to fetch content: {e}"


def google_scrape_with_content(keywords, num_results=5, output_file='scraped_results.csv'):
    # newline='' lets the csv module control line endings itself
    with open(output_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Keyword', 'Rank', 'Title', 'URL', 'Content'])

        for keyword in keywords:
            print(f"\nSearching for: {keyword}")
            try:
                results = search(keyword, num_results=num_results)
                for i, url in enumerate(results):
                    print(f"  -> Fetching: {url}")
                    title, content = get_page_content(url)
                    writer.writerow([keyword, i + 1, title, url, content])
                    time.sleep(2)  # Delay between requests to stay safe
            except Exception as e:
                print(f"Error while searching '{keyword}': {e}")

    print(f"\nAll results + content saved to '{output_file}'")


# Main
if __name__ == '__main__':
    keywords = load_keywords('keywords.txt')
    google_scrape_with_content(keywords, num_results=5)
```
Output (`scraped_results.csv`):
| Keyword | Rank | Title | URL | Content |
|---------|------|-------|-----|---------|
| berita teknologi Indonesia | 1 | Page title | https://... | First few paragraphs |
| ... | ... | ... | ... | ... |
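Since the CSV has the five columns shown above, turning it into JSON afterwards is fairly mechanical. Here is a minimal sketch, assuming the file was written with the header row `Keyword, Rank, Title, URL, Content`; the function name `csv_to_json` and the output filename `scraped_results.json` are made up for this example and are not part of the script.

```python
import csv
import json


def csv_to_json(csv_path="scraped_results.csv", json_path="scraped_results.json"):
    """Convert the scraper's CSV output into a JSON list of row objects.

    Hypothetical helper: assumes the CSV has a header row that includes 'Rank'.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # the header row supplies the keys

    for row in rows:
        row["Rank"] = int(row["Rank"])  # CSV stores every value as text

    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        json.dump(rows, f, ensure_ascii=False, indent=2)
    return rows
```

Calling `csv_to_json()` after a scrape run would leave the CSV untouched and write a second file next to it, so both formats stay available.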
- Avoid setting `num_results` above 10 unless you also use a much longer delay between requests.
- The script can be changed to save to `.txt` or `.json` as well.
- Want to filter out pages that aren't news? Add a regex or a check such as `if "news" in url`.
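The news filter from the last note could look roughly like this. It is only a sketch: the pattern `news|berita` is an assumed heuristic, `is_news_url` is a hypothetical helper name, and the URLs are placeholders. It checks the host and path only, so a stray `news` in a query string does not count as a match.

```python
import re
from urllib.parse import urlparse

# Assumed heuristic: URLs whose host or path contains these words are news pages.
NEWS_PATTERN = re.compile(r"news|berita", re.IGNORECASE)


def is_news_url(url):
    """Return True if the host or path of `url` matches NEWS_PATTERN."""
    parsed = urlparse(url)
    return bool(NEWS_PATTERN.search(parsed.netloc + parsed.path))


urls = [
    "https://example.com/news/ai-chip",
    "https://example.com/shop/laptop",
    "https://berita.example.org/teknologi",
]
# Keeps the first and third URL; the shop page is dropped.
print([u for u in urls if is_news_url(u)])
```

In the script, one place to apply it would be right before `get_page_content(url)` is called, skipping any URL for which the check returns `False`.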