BioEncoder uses a single working directory to store all output related to a project. Each function creates its own sub-folders inside it: split_dataset creates the data folder, train creates the logs, runs and weights folders, and interactive_plots creates the plots folder. In the end, your working directory will look like this:
bioencoder_wd/
    data
        <run-name>
            train
                class_1/
                    image_1.jpg
                    image_2.jpg
                    ...
                class_2/
                    image_1.jpg
                    image_2.jpg
                    ...
                ...
            val
                ...
    logs
        <run-name>
            first
                <run-name>_first.log
            second
                <run-name>_second.log
    plots
        <run-name>.html
    runs
        <run-name>
            first
                events.out.tfevents...machine-name.15832.0
            second
                events.out.tfevents...machine-name.15832.1
    weights
        <run-name>
            first
                epoch0
                epoch1
                ...
                swa
            second
                epoch0
                epoch1
                ...
                swa
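If you want to check which steps have already produced output in a working directory, a minimal sketch like the following may help. It relies only on the top-level folder names listed above. The helper completed_steps and the FOLDER_TO_STEP mapping are made up for illustration and are not part of BioEncoder.

```python
from pathlib import Path

# Top-level folders and the step that creates them, as described above.
# This mapping is an illustration, not something exported by BioEncoder.
FOLDER_TO_STEP = {
    "data": "split_dataset",
    "logs": "train",
    "runs": "train",
    "weights": "train",
    "plots": "interactive_plots",
}


def completed_steps(root_dir):
    """Return {folder_name: True/False} for the expected top-level folders."""
    root = Path(root_dir)
    return {name: (root / name).is_dir() for name in FOLDER_TO_STEP}


if __name__ == "__main__":
    # Report which folders exist under the working directory used in this guide.
    for folder, present in completed_steps("bioencoder_wd").items():
        status = "found" if present else "missing"
        print(f"{folder:<8} ({FOLDER_TO_STEP[folder]}): {status}")
```

A missing folder simply suggests that the corresponding step has not been run yet for this working directory.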
Run configure to set the BioEncoder working directory and the run name. The run name is used to name all output generated by BioEncoder (plots, logs, etc.):
import bioencoder
bioencoder.configure(root_dir=r"bioencoder_wd", run_name="v1")
You will get something like this:
BioEncoder config:
- root_dir: bioencoder_wd
- root_dir_abs: /home/mlurig/temp/bioencoder_wd
- run_name: v1
Given your Python WD (/home/mlurig/temp), the current BioEncoder run directory will be:
- /home/mlurig/temp/bioencoder_wd/v1
/home/mlurig/temp/bioencoder_wd does not exist but will be created when adding data!
This sets the root folder inside your project, where all relevant BioEncoder data, logs, etc. will be stored. As the message above notes, the folder itself is only created once data is added.
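The relationship between root_dir, root_dir_abs and the run directory in the message above can be reproduced with plain pathlib. This is only a sketch of the path arithmetic implied by the printed output, not BioEncoder's actual implementation. The function name resolve_run_dir and its cwd parameter are hypothetical.

```python
from pathlib import Path


def resolve_run_dir(root_dir, run_name, cwd=None):
    """Mimic the paths printed by configure (an assumption, not its real code).

    A relative root_dir is interpreted relative to the Python working
    directory; an absolute root_dir is used unchanged.
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    root_dir_abs = base / root_dir  # pathlib keeps root_dir if it is absolute
    return root_dir_abs, root_dir_abs / run_name


# Same values as in the example output above.
root_abs, run_dir = resolve_run_dir("bioencoder_wd", "v1", cwd="/home/mlurig/temp")
print(root_abs)
print(run_dir)
print(root_abs.exists())  # False until the folder has been created
```

Because a relative root_dir depends on the Python working directory, passing an absolute path is one way to make sure the same folder is used regardless of where the script is started.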
Run split_dataset on your input folder to create the data folder containing training and validation images. Your input folder should have one subdirectory per class, and all images belonging to the same class must be stored in the same subdirectory. You do not need to worry about image resolution at this stage: the images will be resized during training using the parameters defined in the YAML configuration files. If a single class contains an overwhelming share of the images, consider undersampling it using max_ratio, which caps every class at max_ratio times the size of the smallest class. Use random_seed to reproduce the randomized selection. There are a few more options; use help(bioencoder.split_dataset) to read about them.
bioencoder.split_dataset(image_dir=r"~/Downloads/damselflies-aligned-trai_val", max_ratio=6, random_seed=42)
You will get something like this:
Number of images per class prior to balancing: [2240, 147, 999, 5000] (8386 total)
Minimum number of images per class: 147 * max ratio 6 = 882 max per class
Number of images per class after balancing: [882, 147, 882, 882] (2793 total)
Mode "flat": 279 validation images in total, min. 69 per class - processing:
Processing class androchrome...
Processing class infuscans-obsoleta...
Processing class infuscans...
Processing class male...
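The balancing numbers in this output follow from simple arithmetic: the smallest class size times max_ratio gives the cap, and every larger class is cut down to that cap. The sketch below reproduces only the counts. The helper balance_counts is hypothetical, and which images are actually kept is decided by BioEncoder's randomized selection, which is not modeled here.

```python
def balance_counts(counts, max_ratio):
    """Cap every class at max_ratio times the size of the smallest class."""
    cap = min(counts) * max_ratio
    balanced = [min(count, cap) for count in counts]
    return cap, balanced


counts = [2240, 147, 999, 5000]  # per-class counts from the output above
cap, balanced = balance_counts(counts, max_ratio=6)
print(sum(counts))    # 8386
print(cap)            # 882
print(balanced)       # [882, 147, 882, 882]
print(sum(balanced))  # 2793
```

Trying a few values of max_ratio this way shows how many images each setting would discard from the larger classes.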
Use the option dry_run=True if you want to experiment with different split modes or max ratios without actually doing the split.
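Before splitting, it can also be useful to confirm that the input folder really has one subdirectory per class and to see how many images each one holds. The following is a minimal sketch using only the standard library. The function count_images_per_class and the list of image extensions are assumptions for illustration, and BioEncoder may recognize a different set of file types.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def count_images_per_class(image_dir, extensions=IMAGE_EXTENSIONS):
    """Count image files in each immediate subdirectory (one per class)."""
    root = Path(image_dir).expanduser()  # allow paths starting with "~"
    counts = {}
    # Files lying directly in the root folder belong to no class and are ignored.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in extensions
        )
    return counts
```

Calling count_images_per_class on the folder you pass as image_dir returns a dictionary of class names and image counts, which you can compare with the "prior to balancing" line that split_dataset prints.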