A program that reads the contents of an HTML file from standard input and writes the contents of the table(s) in that file to standard output. The relevant data is extracted using regular expressions, and the extracted table contents are written in CSV format.
- Clone the repository using `git clone https://github.com/YOURUSERNAME/TableToCSV.git`
- Run the program from the command line using the following format: `python table_to_csv.py < input.html > output.txt`
The code for the following tables was taken from https://www.w3schools.com/html/html_tables.asp. You can view this code in the example file called index.html which is located in the repository.
Company | Contact | Country |
---|---|---|
Alfreds Futterkiste | Maria Anders | Germany |
Centro comercial Moctezuma | Francisco Chang | Mexico |
Ernst Handel | Roland Mendel | Austria |
Island Trading | Helen Bennett | UK |
Laughing Bacchus Winecellars | Yoshi Tannamuri | Canada |
Magazzini Alimentari Riuniti | Giovanni Rovelli | Italy |

Firstname | Lastname | Age |
---|---|---|
Jill | Smith | 50 |
Eve | Jackson | 94 |
$ python table_to_csv.py < input.html > output.txt
The contents of output.txt are shown below. Notice that tables are labeled in the same order as they appear in the document.
TABLE 1:
Company,Contact,Country
Alfreds Futterkiste,Maria Anders,Germany
Centro comercial Moctezuma,Francisco Chang,Mexico
Ernst Handel,Roland Mendel,Austria
Island Trading,Helen Bennett,UK
Laughing Bacchus Winecellars,Yoshi Tannamuri,Canada
Magazzini Alimentari Riuniti,Giovanni Rovelli,Italy
TABLE 2:
Firstname,Lastname,Age
Jill,Smith,50
Eve,Jackson,94
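For readers curious how output in this shape could be produced with regular expressions, here is a minimal sketch. It is not the code in table_to_csv.py: the patterns and the function name `tables_to_csv` are hypothetical, and it assumes tables are not nested and that cell text contains no commas.

```python
import re

# Hypothetical patterns; the actual table_to_csv.py may use different ones.
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", re.IGNORECASE | re.DOTALL)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<t[hd]\b[^>]*>(.*?)</t[hd]>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def tables_to_csv(html):
    """Return labeled CSV text for every (non-nested) table found in html."""
    lines = []
    for number, table in enumerate(TABLE_RE.findall(html), start=1):
        lines.append(f"TABLE {number}:")
        for row in ROW_RE.findall(table):
            cells = []
            for cell in CELL_RE.findall(row):
                text = TAG_RE.sub("", cell)           # drop tags nested in the cell
                cells.append(" ".join(text.split()))  # collapse runs of whitespace
            if cells:
                # Plain join: cell text containing commas is not quoted here.
                lines.append(",".join(cells))
    return "\n".join(lines)


sample = """
<table>
  <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
</table>
"""
print(tables_to_csv(sample))
# TABLE 1:
# Firstname,Lastname,Age
# Jill,Smith,50
```

The non-greedy `(.*?)` groups keep each match inside a single table, row, or cell, which is why nested tables are outside the scope of this sketch.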
The purpose of this program is to extract the contents of HTML tables and put them into CSV format. Once the data is in CSV format, the user can analyze it with other programs. One such program is the online analytical processing program (OLAP.py) located in the OnlineAnalyticalProcessing repository.
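Because the output file holds several tables separated by `TABLE n:` labels, a downstream script may need to split it before analysis. The sketch below shows one way to do that with Python's standard `csv` module. The function name `read_labeled_tables` is hypothetical, and the sketch assumes the label lines look exactly like those in the example output; blank lines, if any, are skipped.

```python
import csv
import re

# Matches label lines such as "TABLE 1:" from the example output.
LABEL_RE = re.compile(r"^TABLE (\d+):$")


def read_labeled_tables(text):
    """Map each table number to its rows, each row a list of strings."""
    blocks = {}
    current = None
    for line in text.splitlines():
        match = LABEL_RE.match(line.strip())
        if match:
            current = int(match.group(1))
            blocks[current] = []
        elif line.strip() and current is not None:
            blocks[current].append(line)
    # csv.reader accepts any iterable of strings, one record per item.
    return {number: list(csv.reader(rows)) for number, rows in blocks.items()}


output = "TABLE 1:\nFirstname,Lastname,Age\nJill,Smith,50\nEve,Jackson,94\n"
tables = read_labeled_tables(output)
print(tables[1][0])  # ['Firstname', 'Lastname', 'Age']
```

Each table can then be written to its own file or passed to another tool as a list of rows.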
This program will not work if the tables use the `rowspan` or `colspan` HTML attributes.
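If you are unsure whether a file is affected, you can check it before converting. The following is a rough sketch rather than part of the program: the helper name `uses_spans` is hypothetical, and the pattern only looks for `rowspan` or `colspan` inside `<td>` and `<th>` opening tags, so unusual markup may be misclassified.

```python
import re

# Looks for a rowspan/colspan attribute inside a <td ...> or <th ...> opening tag.
SPAN_RE = re.compile(r"<t[hd]\b[^>]*\b(rowspan|colspan)\s*=", re.IGNORECASE)


def uses_spans(html):
    """Return True if any table cell in html declares rowspan or colspan."""
    return SPAN_RE.search(html) is not None


print(uses_spans('<table><tr><td colspan="2">Total</td></tr></table>'))  # True
print(uses_spans("<table><tr><td>Total</td></tr></table>"))              # False
```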