Scraping: Save the Text of Each URL
FEATURES:
- Input from `keywords.txt`
- Search Google for each keyword (take the `top-N` URLs)
- Visit each URL and grab its content (title + paragraphs)
- Save everything to `scraped_results.csv`
Requirements:
pip install googlesearch-python requests beautifulsoup4
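The script expects `keywords.txt` to contain one keyword per line (blank lines are skipped). For example, with illustrative entries:

```text
berita teknologi Indonesia
python web scraping tutorial
```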
FULL SCRIPT:
```python
from googlesearch import search
import requests
from bs4 import BeautifulSoup
import csv
import time


def load_keywords(filename):
    """Read one keyword per line, skipping blank lines."""
    with open(filename, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def get_page_content(url):
    """Fetch a URL and return (title, text of the first paragraphs)."""
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title
        title = soup.title.string if soup.title and soup.title.string else 'No Title'

        # Get the main paragraph content
        paragraphs = soup.find_all('p')
        text_content = ' '.join([p.get_text() for p in paragraphs[:5]])  # Limit to the first 5 paragraphs

        return title.strip(), text_content.strip()
    except Exception as e:
        return 'Error', f"Failed to fetch content: {e}"


def google_scrape_with_content(keywords, num_results=5, output_file='scraped_results.csv'):
    with open(output_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Keyword', 'Rank', 'Title', 'URL', 'Content'])

        for keyword in keywords:
            print(f"\n🔍 Searching for: {keyword}")
            try:
                results = search(keyword, num_results=num_results)
                for i, url in enumerate(results):
                    print(f"  → Fetching: {url}")
                    title, content = get_page_content(url)
                    writer.writerow([keyword, i + 1, title, url, content])
                    time.sleep(2)  # Delay to avoid hammering Google and the target sites
            except Exception as e:
                print(f"❌ Error while searching '{keyword}': {e}")

    print(f"\n✅ All results + content saved to '{output_file}'")


# Main
if __name__ == '__main__':
    keywords = load_keywords('keywords.txt')
    google_scrape_with_content(keywords, num_results=5)
```
Output (`scraped_results.csv`):
| Keyword | Rank | Title | URL | Content |
|---------|------|-------|-----|---------|
| berita teknologi Indonesia | 1 | Title taken from the page | https://... | First few paragraphs |
| ... | ... | ... | ... | ... |
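If you want to inspect the results programmatically, here is a minimal sketch using only the standard library, assuming the CSV was produced by the script above:

```python
import csv

# Read scraped_results.csv back in and print a short summary per row
with open('scraped_results.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        print(f"[{row['Keyword']}] #{row['Rank']}: {row['Title']} ({row['URL']})")
        print(f"    {row['Content'][:100]}...")  # first 100 characters of the content
```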
Tips:
- Don't use `num_results > 10` unless you also use a longer delay.
- The script can be adapted to save to `.txt` or `.json` as well (see the sketch below).
- Want to keep only news pages? Add a regex or a simple `if "news" in url` check (also shown below).
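As a rough illustration of those last two tips, here is a sketch; `is_news_url` and `save_as_json` are hypothetical helpers, not part of the original script:

```python
import json
import re


def is_news_url(url):
    # Hypothetical filter: keep only URLs that look like news pages
    return re.search(r'news|berita', url, re.IGNORECASE) is not None


def save_as_json(records, output_file='scraped_results.json'):
    # records: a list of dicts with the same fields as the CSV columns
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Example usage inside the scraping loop:
# if is_news_url(url):
#     records.append({'Keyword': keyword, 'Rank': i + 1, 'Title': title,
#                     'URL': url, 'Content': content})
# ...and after the loop:
# save_as_json(records)
```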