BioEncoder root dir

BioEncoder stores all output related to a project in a single working directory. Each step creates its own sub-folders inside it: split_dataset creates the data folder, train creates the logs, runs and weights folders, and interactive_plots creates plots. In the end, your working directory will look like this:

bioencoder_wd/
    data
        <run-name>
            train
                class_1/
                    image_1.jpg
                    image_2.jpg
                    ...
                class_2/
                    image_1.jpg
                    image_2.jpg
                    ...
                ...
            val
                ...
    logs
        <run-name>
            first
                <run-name>_first.log
            second
                <run-name>_second.log
    plots
        <run-name>.html
    runs
        <run-name>
            first
                events.out.tfevents...machine-name.15832.0
            second
                events.out.tfevents...machine-name.15832.1
    weights
        <run-name>
            first
                epoch0
                epoch1
                ...
                swa
            second
                epoch0
                epoch1
                ...
                swa
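
To see which of these sub-folders already exist at a given point in a project, a small helper like the one below can be used. This is a minimal sketch that is not part of BioEncoder: the function name `existing_subfolders` is hypothetical, and the folder names are simply taken from the layout above.

```python
from pathlib import Path

# Top-level sub-folders named in the layout above.
EXPECTED_SUBFOLDERS = ("data", "logs", "plots", "runs", "weights")


def existing_subfolders(root_dir):
    """Return the expected sub-folders that already exist under root_dir."""
    root = Path(root_dir)
    return [name for name in EXPECTED_SUBFOLDERS if (root / name).is_dir()]


if __name__ == "__main__":
    # Lists only the folders created so far, e.g. just "data" after split_dataset.
    print(existing_subfolders("bioencoder_wd"))
```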

Global configuration

Run configure to set the BioEncoder working directory and the run name. The run name is used to name all output generated by BioEncoder (plots, logs, etc.):

import bioencoder

bioencoder.configure(root_dir=r"bioencoder_wd", run_name="v1")

You will get something like this:

BioEncoder config:
- root_dir: bioencoder_wd
- root_dir_abs: /home/mlurig/temp/bioencoder_wd
- run_name: v1
Given your Python WD (/home/mlurig/temp), the current BioEncoder run directory will be:
- /home/mlurig/temp/bioencoder_wd/v1
/home/mlurig/temp/bioencoder_wd does not exist but will be created when adding data!

This sets the root folder for your project, where all relevant BioEncoder data, logs, etc. will be stored. As the last line of the output shows, the folder itself is only created once you add data.
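
A relative root_dir is interpreted relative to your Python working directory, as the root_dir_abs line in the output indicates. The sketch below mirrors that resolution with pathlib so you can check where output will end up before running anything. It is an illustration only and not BioEncoder's internal code: the function name `describe_root` is hypothetical, and the per-run sub-folders are assumed from the directory layout shown at the top of this page.

```python
from pathlib import Path


def describe_root(root_dir, run_name):
    """Resolve root_dir against the current working directory (illustrative)."""
    # An absolute root_dir is kept as-is; a relative one is joined to the CWD.
    root_abs = Path.cwd() / Path(root_dir).expanduser()
    return {
        "root_dir_abs": root_abs,
        "exists": root_abs.is_dir(),
        # Assumed from the layout above: these folders hold one sub-folder per run.
        "run_subfolders": {
            name: root_abs / name / run_name
            for name in ("data", "logs", "runs", "weights")
        },
    }


if __name__ == "__main__":
    info = describe_root("bioencoder_wd", "v1")
    print(info["root_dir_abs"], "(exists)" if info["exists"] else "(not created yet)")
```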

Split dataset

Run split_dataset on your input folder to create the data folder containing the training and validation images. Your input folder should contain one subdirectory per class, and all images belonging to the same class must be stored in the same subdirectory. You do not need to worry about image resolution at this stage: the images will be resized during training using the parameters defined in the YAML configuration files. If a single class contains an overwhelming share of the images, consider undersampling it using max_ratio. Use random_seed to reproduce the randomized selection. There are a few more options; use help(bioencoder.split_dataset) to read about them.

bioencoder.split_dataset(image_dir=r"~/Downloads/damselflies-aligned-trai_val", max_ratio=6, random_seed=42)

You will get something like this:

Number of images per class prior to balancing: [2240, 147, 999, 5000] (8386 total)
Minimum number of images per class: 147 * max ratio 6 = 882 max per class
Number of images per class after balancing: [882, 147, 882, 882] (2793 total)
Mode "flat": 279 validation images in total, min. 69 per class - processing:

Processing class androchrome...
Processing class infuscans-obsoleta...
Processing class infuscans...
Processing class male...
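
The balancing numbers in the output above can be reproduced by hand: the smallest class size multiplied by max_ratio gives the cap, and every larger class is reduced to that cap. The sketch below only redoes this arithmetic with the counts from the example output; `cap_class_counts` is a hypothetical helper, not a BioEncoder function, and it does not select any images.

```python
def cap_class_counts(counts, max_ratio):
    """Cap each class size at (smallest class size * max_ratio)."""
    cap = min(counts) * max_ratio
    return cap, [min(count, cap) for count in counts]


# Class sizes taken from the example output above.
counts = [2240, 147, 999, 5000]
cap, balanced = cap_class_counts(counts, max_ratio=6)

print(cap)            # 882
print(balanced)       # [882, 147, 882, 882]
print(sum(balanced))  # 2793
```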

Use the option dry_run=True if you want to experiment with different split modes or max ratios without actually doing the split.
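
To check the input folder layout described above (one subdirectory per class) independently of BioEncoder, you can count the images per class yourself. This is a minimal sketch: `images_per_class` is a hypothetical helper, and the set of file extensions is an assumption that may not match what BioEncoder itself accepts.

```python
from pathlib import Path

# Assumption: extensions treated as images here; adjust to your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def images_per_class(image_dir):
    """Map each class subdirectory name to its number of image files."""
    root = Path(image_dir).expanduser()
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling `images_per_class` on your input folder returns a dictionary of class names and image counts, which makes a strongly overrepresented class easy to spot before choosing max_ratio.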