This Python application scrapes sitemap data from websites and stores it in a PostgreSQL database. It is developed with Python 3.10 and uses Peewee as the ORM for database operations. Given a website, it collects the sitemap URLs, downloads the XML and gzipped sitemap files, processes nested sitemaps, and saves the resulting data to the database.
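The download-and-parse step described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the names `decode_sitemap` and `parse_sitemap` are hypothetical, and the sketch assumes the sitemaps follow the sitemaps.org 0.9 schema.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_sitemap(payload: bytes) -> bytes:
    """Return raw XML bytes, decompressing when the payload is gzipped."""
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload


def parse_sitemap(payload: bytes) -> tuple[str, list[str]]:
    """Return ("index" or "urlset", list of <loc> values) for one sitemap file."""
    root = ET.fromstring(decode_sitemap(payload))
    # A <sitemapindex> root lists nested sitemaps; a <urlset> root lists pages.
    kind = "index" if root.tag == f"{SITEMAP_NS}sitemapindex" else "urlset"
    locs = [
        el.text.strip()
        for el in root.iter(f"{SITEMAP_NS}loc")
        if el.text and el.text.strip()
    ]
    return kind, locs
```

Checking the gzip magic bytes rather than the file extension means a `.xml.gz` file that the server has already decompressed in transit is still handled correctly.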
- Python 3.10: Make sure you have Python 3.10 installed. You can download it from the official Python website.
- Virtual Environment (venv): Create a virtual environment to manage project dependencies.
    python3 -m venv venv
    source venv/bin/activate # On Windows, use venv\Scripts\activate
- PostgreSQL Database: Create a .env file in the project root to store your PostgreSQL database connection details.
    DATABASE_URL=postgresql://username:password@localhost:5432/database_name
- Change Website URL: Open pipeline.py and change the EXAMPLE_URL variable to the desired website.
- Install Dependencies: Install project dependencies using requirements.txt.
    pip install -r requirements.txt
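To show how a DATABASE_URL entry like the one above breaks down into connection settings, here is a small standard-library sketch. The helper names `read_env` and `database_settings` are hypothetical, and how the project itself reads the .env file is not stated here; it may well rely on a package from requirements.txt instead.

```python
from urllib.parse import unquote, urlsplit


def read_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def database_settings(url: str) -> dict:
    """Split a postgresql:// URL into separate connection settings."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "database": parts.path.lstrip("/"),
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "host": parts.hostname or "localhost",
        "port": parts.port or 5432,
    }


# Example with the placeholder URL from the .env template.
env = read_env(
    "DATABASE_URL=postgresql://username:password@localhost:5432/database_name\n"
)
settings = database_settings(env["DATABASE_URL"])
```

The resulting keys line up with the arguments Peewee's `PostgresqlDatabase` takes (database name plus `user`, `password`, `host`, `port`). Peewee's `playhouse.db_url.connect` helper also accepts a URL of this form directly, which may be the simpler route.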
- config.py: Configuration file containing constants and settings for the application.
- models/: Directory containing the Peewee models for the database schema.
- sitemap_files/: Directory where downloaded sitemap files are stored.
- pipeline.py: Python script that handles scraping and processing sitemaps. Change the EXAMPLE_URL variable to the desired website.
- main.py: Main Python script that orchestrates the scraping process.
- Activate Virtual Environment:
    source venv/bin/activate # On Windows, use venv\Scripts\activate
- Database Setup:
  - Create a PostgreSQL database and update the .env file with your database connection details.
- Change Website URL: Open pipeline.py and change the EXAMPLE_URL variable to the desired website.
- Run the Scraper:
    python main.py
  The scraper will start processing sitemap URLs from the specified website. If the website has no sitemap URLs, this is logged with Loguru.
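The run described above, including the nested-sitemap handling and the "nothing found" log message, might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in main.py or pipeline.py: `collect_page_urls` is a hypothetical name, the standard `logging` module stands in for Loguru, gzip handling is left out, and the download function is passed in so the example runs without network access.

```python
import logging
import xml.etree.ElementTree as ET
from typing import Callable

logger = logging.getLogger("sitemap_sketch")  # stand-in for the Loguru logger
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(sitemap_url: str, fetch: Callable[[str], bytes]) -> list[str]:
    """Walk a sitemap and any nested sitemaps, returning page URLs in order."""
    pages: list[str] = []
    seen: set[str] = set()
    pending = [sitemap_url]
    while pending:
        url = pending.pop(0)
        if url in seen:  # guard against sitemaps that reference each other
            continue
        seen.add(url)
        root = ET.fromstring(fetch(url))
        locs = [
            el.text.strip()
            for el in root.iter(f"{NS}loc")
            if el.text and el.text.strip()
        ]
        if root.tag == f"{NS}sitemapindex":
            pending.extend(locs)  # nested sitemaps: queue them for processing
        else:
            pages.extend(locs)
    if not pages:
        logger.warning("No sitemap URLs found under %s", sitemap_url)
    return pages


# In-memory stand-in for a website: one index pointing at one URL set.
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b"<sitemap><loc>https://example.com/posts.xml</loc></sitemap>"
        b"</sitemapindex>"
    ),
    "https://example.com/posts.xml": (
        b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        b"<url><loc>https://example.com/a</loc></url>"
        b"<url><loc>https://example.com/b</loc></url>"
        b"</urlset>"
    ),
}
pages = collect_page_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
```

Using a queue instead of recursion keeps the traversal safe for deeply nested indexes, and the `seen` set prevents an endless loop if two sitemaps point at each other.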
- Python Version: 3.10
- Database: PostgreSQL
- Dependencies: Check requirements.txt for the list of Python packages used in this project.
- Notes: Ensure proper network connectivity and permission to access the website's sitemaps. If you encounter a 403 Forbidden error, check the website's robots.txt file for sitemap access restrictions.
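One way to carry out the robots.txt check mentioned in the notes is with Python's built-in `urllib.robotparser`. The sketch below is only an illustration: `inspect_robots` is a hypothetical helper, and it works on robots.txt text you have already downloaded rather than fetching anything itself.

```python
from urllib.robotparser import RobotFileParser


def inspect_robots(
    robots_txt: str, sitemap_url: str, user_agent: str = "*"
) -> tuple[list[str], bool]:
    """Return (sitemaps declared in robots.txt, whether sitemap_url may be fetched)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps() or []  # site_maps() returns None when none are listed
    allowed = parser.can_fetch(user_agent, sitemap_url)
    return declared, allowed


# Example robots.txt content (made up for illustration).
ROBOTS = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""
declared, allowed = inspect_robots(ROBOTS, "https://example.com/sitemap.xml")
```

If `allowed` is false for the sitemap you are requesting, the site has asked crawlers to stay away from that path. A 403 can also have other causes, so an allowed path does not by itself rule out server-side blocking.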
Last updated on: 2024-02-13
Last updated on: 2024-02-14
Last updated on: 2024-02-15
Last updated on: 2024-02-16
Last updated on: 2024-02-21
Last updated on: 2024-02-25
Last updated on: 2024-02-27
Last updated on: 2024-03-02
Last updated on: 2024-03-05
Last updated on: 2024-03-14
Last updated on: 2024-03-16
Last updated on: 2024-03-20
Last updated on: 2024-03-21
Last updated on: 2024-03-26
Last updated on: 2024-03-29
Last updated on: 2024-04-01
Last updated on: 2024-04-06
Last updated on: 2024-04-13
Last updated on: 2024-04-14
Last updated on: 2024-04-15
Last updated on: 2024-04-17
Last updated on: 2024-05-01
Last updated on: 2024-05-05
Last updated on: 2024-05-13
Last updated on: 2024-12-10