Tor Onion Site Scraper

This repository contains a Python-based scraper designed to collect HTML files from URLs accessible through the Tor network. The scraper was developed by Joel Hägvall and Giancarlo Valverde, with assistance from ChatGPT.

📝 Features

Collects HTML data from URLs using Tor as a proxy.
Retrieves titles and descriptions from the HTML files located in a specified directory.
Gathers titles and stats to provide additional details about the listings.
Stores the extracted data in CSV files.
Plots the frequency of keywords.

📦 Prerequisites

Before utilizing this scraper, ensure the following dependencies are installed on your system:

Python 3.6 or higher
Tor Browser or Tor service running in the background
Required Python libraries: requests, pysocks, beautifulsoup4, pandas, matplotlib

🛠️ Installation

Clone the Repository

git clone https://github.com/**yourusername**/tor-onion-scraper.git
cd tor-onion-scraper

Install Dependencies
- Install the required libraries using pip:
```
pip install requests pysocks beautifulsoup4 pandas matplotlib
```
- Install Tor from https://www.torproject.org/download/
Ensure Tor is Running
- Ensure Tor is active in the background. If you are using the Tor Browser, keep it open while the scraper runs.

🚀 Usage

Steps for Data Scraping and Content Analysis

Prepare Category Scraping

Navigate to the scraping folder.
Edit the scrapeCategorySiteHTML.py script, updating the category array to match the first category to be scraped.
Execute the script.
A new directory "named new_onion_sites_html" is created, along with a subdirectory named after the category, containing its HTML files.
Repeat this process for all 24 selected categories.

Scrape Index Page

In the scraping folder, execute the scrapeIndexSiteHTML.py script.
A new directory named faq is created within the new_onion_sites_html directory, containing all HTML files found on the "Help & Infos" page.

Extract Category Elements

Open the extractElementsHTMLCategory.py script in the scraping directory.
Ensure all 24 categories are present in the categories array.
Execute the script to generate CSV files in the "categoryPageTitleStats" directory. These files will contain the columns Title and Stats for each product in all categories.

Extract Product Elements

Open the extractElementsHTMLProduct.py script in the scraping folder.
Update the categories array to include all 24 categories.
Execute the script to create CSV files for all categories in the newly created "productPageResultsHTML" directory.

Compile Product Data

Navigate to the "productPageResultsHTML" folder.
Ensure all CSV files are created for each category, containing elements such as Title, Description, Refund Policy, and Comments.

Merge CSV Files

In the main directory, edit the mergeCsv.py script.
Update the "results_directory" array to "categoryPageTitleStats" and set the CSV filename to categoryTitleStats.csv.
Execute the script to merge all CSV files in the directory.
Repeat the process by updating the array to "productPageResultsHTML" and the CSV filename to merged_data.csv.

Create Coding Scheme

Develop a table with categories representing the coding scheme for content analysis.
Create subcategories and develop keywords for each subcategory.
Add a Frequency column to track keyword frequencies and a Total column for the total count.

Keyword Frequency Analysis

Extract keywords from the first subcategory of the first crime category and paste them into the visualBarKeywordFreq.py script located in the visual folder, within the keywords array.
Use the generated plot to record total frequencies in the table.
Repeat this process for every subcategory of each category.

Visualize Subcategory Frequencies

Open the visualBarSubcategories.py script.
Update the categories array to match all subcategories of the first crime category.
Edit the frequencies array to match their respective frequencies.

In-Depth Analysis

For each crime category, perform an in-depth analysis of listings related to the research question.
Choose examples based on observations and the highest frequency subcategory.
Use the site’s search function to find examples and extract statistics from the categoryTitleStats.csv file.

Authors

Giancarlo Valverde - Developer
Joel Hägvall - Developer
ChatGPT (OpenAI) - Large Language Model

Disclaimer

Please note that this program is primarily developed for research purposes, particularly for our thesis. We do not support or endorse any illegal or unethical activities.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Tor Onion Site Scraper

📝 Features

📦 Prerequisites

🛠️ Installation

🚀 Usage

Authors

Disclaimer

Files

README.md

Latest commit

History

README.md

File metadata and controls

Tor Onion Site Scraper

📝 Features

📦 Prerequisites

🛠️ Installation

🚀 Usage

Authors

Disclaimer