2 Download images
Please download the folder code/download_imgs/.
Set up the environment with requirements-non_cv.txt.
An access token is required to download data from Mapillary. You can register for one free of charge with Mapillary. Update your Mapillary token in the following files:
code/download_imgs/download_jpegs.py
code/download_imgs/download_jpegs_mapillary.py
download_jpegs.py downloads every image listed in an input csv file from Mapillary or KartaView, saving each one in .jpeg format to a specified output folder.
Input
The input csv should have one row per image and contain at least three columns:
- uuid: the universally unique identifier (UUID) assigned to each image in the dataset. Downloaded image files are named by their UUIDs, i.e. {uuid}.jpeg.
- source: indicates whether the image was obtained from Mapillary or KartaView. The script uses this information to select the appropriate download function and API.
- orig_id: the original image ID given by Mapillary or KartaView in the metadata. This ID is used to query the Mapillary / KartaView API to download the image.
All three columns appear in every csv file we provide in the dataset, so any of those csv files can be used as input for download_jpegs.py.
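Before starting a long download, it can help to confirm that a csv has the three required columns. The following is a minimal sketch using only the Python standard library; the helper name missing_columns and the sample values are illustrative and are not part of download_jpegs.py.

```python
import csv
import io

# The three columns the README lists as the minimum for download_jpegs.py.
REQUIRED_COLUMNS = {"uuid", "source", "orig_id"}

def missing_columns(csv_file):
    """Return the required columns absent from the csv header (empty set if none)."""
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file, hence the fallback to an empty list.
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

# In-memory stand-in for a real csv file; the values are made up.
sample = io.StringIO("uuid,source,orig_id,lat\nabc-123,Mapillary,111,1.3\n")
print(missing_columns(sample))  # an empty set means the file is usable as input
```

For a file on disk, pass an open file object instead, e.g. open(in_csvPath, newline="").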
Output
Images are downloaded into subfolders, each holding at most 10,000 images by default.
Each image file is named by its UUID: {uuid}.jpeg.
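To make the grouping concrete, here is a small sketch of how an image's position in the input could map to a subfolder. The "{first}_{last}" naming is an assumption inferred from the sample path code/download_imgs/sample_output/all/1_50; the actual script may name its subfolders differently, and subfolder_name is a hypothetical helper.

```python
def subfolder_name(position, chunk_size=10000, total=None):
    """Name of the subfolder for the image at 1-based `position`.

    Assumed scheme: "{first}_{last}", inferred from the sample folder 1_50.
    """
    first = ((position - 1) // chunk_size) * chunk_size + 1
    last = first + chunk_size - 1
    if total is not None:
        # The final chunk may be shorter than chunk_size.
        last = min(last, total)
    return f"{first}_{last}"

print(subfolder_name(7, total=50))  # the 50-image sample fits in a single subfolder
print(subfolder_name(10001))        # first image of the second chunk
```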
Adjustable variables
Users can adjust the following variables in download_jpegs.py to suit their needs:
- access_token (str): your Mapillary access token.
- in_csvPath (str): the path to your input csv.
- out_mainFolder (str): the path to the main output folder. The script automatically creates subfolders under it to group the downloaded images, so that each subfolder holds at most 10,000 images by default.
- chunk_size (int): maximum number of images per output subfolder. Defaults to 10000.
- num_thread (int): number of threads (download tasks) to run concurrently. Defaults to 100.
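As an illustration of what num_thread controls, the sketch below runs a per-image task over a thread pool. It assumes a concurrent.futures thread pool, which may not be how download_jpegs.py is implemented; fetch_image and download_all are hypothetical stand-ins, and no network request is made.

```python
from concurrent.futures import ThreadPoolExecutor

num_thread = 100  # same default as stated above

def fetch_image(row):
    """Hypothetical stand-in for a per-image download; returns the target file name."""
    return f"{row['uuid']}.jpeg"

def download_all(rows, worker=fetch_image, max_workers=num_thread):
    """Run `worker` over all rows concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, rows))

rows = [{"uuid": f"img-{i}"} for i in range(3)]  # made-up rows
print(download_all(rows, max_workers=2))
```

A larger num_thread allows more downloads in flight at once, at the cost of more simultaneous requests to the Mapillary and KartaView APIs.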
To reproduce sample_output
Insert your access_token.
Modify out_mainFolder to point to your output folder.
Uncomment the line:
data_l = pd.concat([data_l[data_l['source']=='Mapillary'].sample(n=25, random_state=0), data_l[data_l['source']=='KartaView'].sample(n=25, random_state=0)], ignore_index=True) # sample 50 images to download just for illustration purpose
Then run:
python3 download_jpegs.py
About sample_output
We sampled 50 images from code/raw_download/sample_output/points.csv and downloaded the image files into code/download_imgs/sample_output/all/1_50.
These 50 images are also used as input to demonstrate the subsequent CV (computer vision) processing.
Use any of the csv files provided in our dataset as input.
Modify the adjustable variables to suit your needs.
Ensure that more than 6 TB of disk space is available, since the full imagery takes up at least 6 TB.
Run
python3 download_jpegs.py
The whole download might take days to complete.
You may only need the imagery for a subset of the dataset.
You can produce a subset by filtering on the appropriate metadata.
See info.csv for a list of the available features and their meanings.
The notebook sample_subset_download.ipynb contains an example that filters for images from Singapore taken during the daytime.
As shown in the notebook, make sure the filtered csv file contains at least the three columns uuid, source, and orig_id described above.
Once you have saved the csv, change in_csvPath and out_mainFolder accordingly.
Run
python3 download_jpegs.py
- download_jpegs.py imports functions from download_jpegs_mapillary.py and download_jpegs_kartaview.py to download imagery from Mapillary and KartaView respectively, by sending requests to the respective APIs. For this reason, keep these three .py files in the same folder so that download_jpegs.py works smoothly.
- Run the script repeatedly until the total number of downloaded images stops changing, or until the script reports that all images have been downloaded. Because of network issues, not all images can be downloaded in a single run.
- Even after repeated runs, some images may still be missing. An image file can be unavailable despite the presence of its metadata, for reasons we do not know (e.g. the contributor deleted the image, or the image failed an internal check by Mapillary/KartaView). As a result, you may see some error messages during the download, but the download should continue on its own.
- If your download is interrupted partway, re-run python3 download_jpegs.py to resume it. The script checks the contents of out_mainFolder against your input csv (in_csvPath) and only attempts to download the images that do not yet exist in the output folder.
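The resume behaviour described in the last note can be pictured with the following sketch. It is one possible way to compare an output folder against a list of UUIDs, not the script's actual code; pending_uuids is a hypothetical helper, and it assumes files are named {uuid}.jpeg somewhere under the main output folder.

```python
from pathlib import Path

def pending_uuids(all_uuids, out_main_folder):
    """Return the UUIDs that have no {uuid}.jpeg anywhere under out_main_folder."""
    # rglob searches every subfolder, so the chunked layout does not matter here.
    done = {p.stem for p in Path(out_main_folder).rglob("*.jpeg")}
    return [u for u in all_uuids if u not in done]
```

Calling it again after each run also gives a simple way to see whether the number of missing images has stopped shrinking.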