Sitemap Scraper

This Python application scrapes sitemap data from websites and stores it in a PostgreSQL database. It is developed with Python 3.10 and uses Peewee as the ORM for database operations. Given a website, it collects the sitemap URLs, downloads XML and gzipped files, processes nested sitemaps, and saves the results to the database.
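
To illustrate what "processing nested sitemaps" involves, here is a minimal sketch that tells a sitemap index (which points at further sitemaps) apart from a plain URL set. It uses only the standard library; the function name `parse_sitemap` is hypothetical and is not taken from pipeline.py.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_bytes: bytes) -> tuple[str, list[str]]:
    """Return ("index" or "urlset", list of <loc> values) for one sitemap document.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_bytes)
    # Tags look like "{namespace}sitemapindex"; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    kind = "index" if local_name == "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iterfind(".//sm:loc", SITEMAP_NS) if el.text]
    return kind, locs


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml.gz</loc></sitemap>
</sitemapindex>"""

kind, locs = parse_sitemap(sample)
```

When `kind` is "index", each entry in `locs` is itself a sitemap that would need to be downloaded and parsed in turn; when it is "urlset", the entries are page URLs.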

Prerequisites

Python 3.10: Make sure you have Python 3.10 installed. You can download it from the official Python website.

Virtual Environment (venv): Create a virtual environment to manage project dependencies.

python3 -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate

PostgreSQL Database: Create a .env file in the project root to store your PostgreSQL database connection details.
```
DATABASE_URL=postgresql://username:password@localhost:5432/database_name
```
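
As a rough illustration of how a connection string in this form breaks down, the sketch below splits it into separate connection settings using the standard library. The helper name `parse_database_url` is hypothetical, and how config.py actually reads the value is not shown in this README. It also assumes the variable has already been loaded into the process environment.

```python
import os
from urllib.parse import urlparse


def parse_database_url(url: str) -> dict:
    """Split a postgresql:// URL into connection keyword arguments.

    Hypothetical helper for illustration only.
    """
    parts = urlparse(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "database": parts.path.lstrip("/"),
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port or 5432,  # PostgreSQL's default port
    }


# Falls back to the placeholder value from the example above when the variable is unset.
settings = parse_database_url(
    os.environ.get(
        "DATABASE_URL",
        "postgresql://username:password@localhost:5432/database_name",
    )
)
```

Peewee's `playhouse.db_url.connect` can also accept a URL of this shape directly; whether this project uses it is an assumption.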
Change Website URL: Open pipeline.py and change the EXAMPLE_URL variable to the desired website.
Install Dependencies: Install project dependencies using requirements.txt.
```
pip install -r requirements.txt
```

Project Structure

config.py: Configuration file containing constants and settings for the application.
models/: Directory containing Peewee models for database schema.
sitemap_files/: Directory where downloaded sitemap files are stored.
pipeline.py: Python script that handles scraping and processing sitemaps. Change the EXAMPLE_URL variable to the desired website.
main.py: Main Python script that orchestrates the scraping process.

How to Use

Activate Virtual Environment:

source venv/bin/activate  # On Windows, use venv\Scripts\activate

Database Setup: Create a PostgreSQL database and update the .env file with your database connection details.
Change Website URL: Open pipeline.py and change the EXAMPLE_URL variable to the desired website.
Run the Scraper:
```
python main.py
```
The scraper then starts processing sitemap URLs from the specified website. If the website exposes no sitemap URLs, this is logged with Loguru.
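
Since the scraper handles both plain XML and gzipped sitemap files, one possible way to treat them uniformly is to check the gzip magic bytes before parsing. This is a sketch under that assumption, not the project's actual code; `decode_sitemap_payload` is a hypothetical name.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_sitemap_payload(payload: bytes) -> bytes:
    """Return raw XML bytes, decompressing first if the payload is gzip data.

    Hypothetical helper for illustration only.
    """
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload


xml = b"<urlset/>"
compressed = gzip.compress(xml)
restored = decode_sitemap_payload(compressed)
```

Checking the content rather than the `.gz` file extension also covers servers that return compressed data under a plain `.xml` URL.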

Additional Information

Python Version: 3.10
Database: PostgreSQL
Dependencies: Check requirements.txt for the list of Python packages used in this project.
Notes: Ensure you have network connectivity and permission to access the website's sitemaps. If you encounter a 403 Forbidden error, check the website's robots.txt file for sitemap access restrictions.
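
When checking robots.txt by hand, it can help to list the sitemaps it declares. The sketch below uses the standard library's `urllib.robotparser` on text that has already been downloaded. The function name `sitemaps_from_robots` and the sample robots.txt content are made up for illustration, and whether the scraper itself reads robots.txt is not stated here.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed in "Sitemap:" lines of a robots.txt document.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


example_robots = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

declared = sitemaps_from_robots(example_robots)
```

An empty result means robots.txt declares no sitemaps, in which case conventional locations such as /sitemap.xml would have to be tried directly.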

Last updated on: 2024-12-10
