Python package for generating synthetic data.
Python 3.7.1 or later is required. Once you have a working Python (virtual) environment, you can install the package. All the methods below install the necessary dependencies too. A fresh default installation of Anaconda already contains those dependencies.
$ pip install git+https://github.com/cursorinsight/biometricblender.git
After cloning the repo, you can install from the folder containing setup.py. The -e . (or --editable .) switch makes sure that sources are not copied but always imported from the local repo. This is ideal for modifying the code in-place:
$ pip install -e .
Alternatively, call setup.py directly, where develop (used instead of install) is the equivalent of -e:
$ python setup.py develop
In the example, we referenced the local repo by the relative path . but you can use an absolute path as well.
You can install for the current user with the --user switch in the above command, without requiring admin (root) privileges, e.g.
$ pip install --user -e .
but expect combined use of --user and -e . to fail due to the presence of the pyproject.toml file; for details see pypa/pip#7953.
If you wish to separate this install from your global Python configuration, consider creating a virtual environment for it. For details read https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/. On Unix systems this looks like:
$ python3 -m venv data_synthesis
$ source data_synthesis/bin/activate
$ pip install -e .
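Before installing, it can help to confirm that the active interpreter meets the version requirement and that the virtual environment is really in use. The following is a minimal sketch using only the standard library; the helper names are illustrative and not part of the package.

```python
import sys

MIN_PYTHON = (3, 7, 1)  # minimum version stated in this README


def python_is_new_enough(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) >= minimum


def in_virtual_environment():
    """Return True if the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Python OK:", python_is_new_enough())
print("Inside venv:", in_virtual_environment())
```

Run it with the same python executable that you intend to use for pip install.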
The package contains a single module called biometric_blender.
The purpose of the data generated by this package is to establish a test benchmark for
- multiclass classification
- with a huge number of features
- with nontrivial feature correlations
- with approximate a-priori knowledge about the usefulness of features
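To make these properties concrete, here is a toy sketch of the general idea: a few hidden, class-dependent features are mixed into a larger number of visible features, which therefore become correlated with each other. This is an illustration under simplifying assumptions, not the algorithm implemented by the package; the function toy_blend and all of its parameters are hypothetical, and only the standard library is used.

```python
import random


def toy_blend(n_classes=3, n_samples_per_class=4, n_hidden=5,
              n_features_out=20, min_count=2, max_count=3,
              noise=0.1, seed=137):
    """Toy generator (hypothetical): blend hidden features into many outputs."""
    rng = random.Random(seed)
    # One location per (class, hidden feature): the trait of each class.
    locations = [[rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
                 for _ in range(n_classes)]
    # Each output feature is a weighted sum of a few hidden features.
    recipes = []
    for _ in range(n_features_out):
        count = rng.randint(min_count, max_count)
        chosen = rng.sample(range(n_hidden), count)
        recipes.append([(j, rng.uniform(-1.0, 1.0)) for j in chosen])
    X, y = [], []
    for label in range(n_classes):
        for _ in range(n_samples_per_class):
            # Sampling noise around the class location in hidden space.
            hidden = [loc + rng.gauss(0.0, 0.3) for loc in locations[label]]
            # Linear blending plus independent measurement noise.
            row = [sum(w * hidden[j] for j, w in recipe) + rng.gauss(0.0, noise)
                   for recipe in recipes]
            X.append(row)
            y.append(label)
    return X, y


X, y = toy_blend()
print(len(X), len(X[0]))  # 12 20
```

Because several output columns share the same hidden features, they carry overlapping information about the class label, which is the situation feature screening methods have to cope with.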
Run as
$ python -m biometric_blender
or load by
>>> import biometric_blender
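The command line form can also be driven from a script. The sketch below assembles the argument list using only flags from the option list in this README and leaves the actual run switched off by default; the helper names build_command and generate are illustrative, not part of the package.

```python
import subprocess
import sys


def build_command(output="out_data.hdf5", n_classes=100,
                  n_samples_per_class=16, n_features_out=10000,
                  random_state=137):
    """Assemble the argument list for `python -m biometric_blender`."""
    return [
        sys.executable, "-m", "biometric_blender",
        "--n-classes", str(n_classes),
        "--n-samples-per-class", str(n_samples_per_class),
        "--n-features-out", str(n_features_out),
        "--random-state", str(random_state),
        "--output", output,
    ]


def generate(dry_run=True, **kwargs):
    """Build the command; run it only if dry_run is False."""
    cmd = build_command(**kwargs)
    if not dry_run:
        # Requires the package to be installed in this environment.
        subprocess.run(cmd, check=True)
    return cmd


# A configuration smaller than the defaults, for a quick trial run.
cmd = generate(output="small_data.hdf5", n_classes=10, n_features_out=500)
print(" ".join(cmd))
```

Call generate(dry_run=False, ...) once the package is installed to actually produce the file.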
Command line options are:
--n-classes N_CLASSES
number of classes (or labels) of the classification
problem to simulate (default: 100)
--n-samples-per-class N_SAMPLES_PER_CLASS
number of samples per class (default: 16)
--n-true-features N_TRUE_FEATURES
number of underlying true hidden features, they are
meant to be useful features (default: 40)
--n-fake-features N_FAKE_FEATURES
number of underlying fake hidden features, they are
meant to be fixed random noise (default: 0)
--min-usefulness MIN_USEFULNESS
minimum usefulness of true hidden features (default:
0.5)
--max-usefulness MAX_USEFULNESS
maximum usefulness of true hidden features (default:
0.95)
--usefulness-scheme {linear,exponential,longtailed}
distribution of usefulness in true hidden features
(default: linear)
--tail-power TAIL_POWER
exponent for longtailed usefulness-scheme (default:
1.5)
--location-distribution {norm,uniform}
distribution type of the characteristic trait of
classes, i.e., the envelope of locations for true
features (default: norm)
--sampling-distribution {norm,uniform}
distribution type of the uncertainty of
reproduction, i.e., the noise for different samples
from the same class (or label) in hidden features
(default: norm)
--location-ordering-extent LOCATION_ORDERING_EXTENT
keep segments of locations of given block size
together in each feature independently, use -1 to use
exactly the same location order (default: 0)
--location-sharing-extent LOCATION_SHARING_EXTENT
make locations shared by multiple classes in each
feature independently, use 0 to make all locations
unique (default: 0)
--polynomial use polynomial mixing of features (default: False)
--n-features-out N_FEATURES_OUT
number of visible features to be simulated (default:
10000)
--blending-mode {linear,logarithmic}
how to simulate measured features (default: linear)
--min-count MIN_COUNT
minimum number of hidden features taking part in one
specific output feature (default: 5)
--max-count MAX_COUNT
maximum number of hidden features taking part in one
specific output feature (default: 10)
--min-noise MIN_NOISE
minimum noise of output features (default: 0.0)
--max-noise MAX_NOISE
maximum noise of output features (default: 1.0)
--store-hidden store the hidden feature space for later analysis
(default: False)
--random-state RANDOM_STATE
integer random seed (default: 137)
--output OUTPUT output file name (default: out_data.hdf5)
For more details, run python -m biometric_blender --help.
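As a rough guide to what the default settings produce, the following sketch computes the number of samples and an approximate in-memory size of the visible feature matrix. It assumes one row per sample, one column per visible feature, and dense 8-byte floating point values; the actual on-disk size of the HDF5 file may differ, and the helper names are illustrative.

```python
def dataset_shape(n_classes=100, n_samples_per_class=16, n_features_out=10000):
    """Return (number of samples, number of visible features)."""
    n_samples = n_classes * n_samples_per_class
    return n_samples, n_features_out


def dense_size_bytes(shape, bytes_per_value=8):
    """Size of a dense matrix of the given shape (assumes 8-byte values)."""
    rows, cols = shape
    return rows * cols * bytes_per_value


shape = dataset_shape()
size = dense_size_bytes(shape)
print(shape)                       # (1600, 10000)
print(f"{size / 2**20:.1f} MiB")   # 122.1 MiB
```

Lowering --n-features-out or --n-classes is the quickest way to shrink a trial dataset.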
If you use BiometricBlender in scientific research, please cite our publication as follows:
Marcell Stippinger, Dávid Hanák, Marcell T. Kurbucz, Gergely Hanczár, Olivér M. Törteli, Zoltán Somogyvári,
BiometricBlender: Ultra-high dimensional, multi-class synthetic data generator to imitate biometric feature space,
SoftwareX,
Volume 22,
2023,
101366,
ISSN 2352-7110,
https://doi.org/10.1016/j.softx.2023.101366.
Or, using BibTeX:
@article{STIPPINGER2023101366,
title = {BiometricBlender: Ultra-high dimensional, multi-class synthetic data generator to imitate biometric feature space},
journal = {SoftwareX},
volume = {22},
pages = {101366},
year = {2023},
issn = {2352-7110},
doi = {https://doi.org/10.1016/j.softx.2023.101366},
url = {https://www.sciencedirect.com/science/article/pii/S2352711023000626},
author = {Marcell Stippinger and Dávid Hanák and Marcell T. Kurbucz and Gergely Hanczár and Olivér M. Törteli and Zoltán Somogyvári},
keywords = {Dataset generator, Biometrics, Feature screening, Ultra-high dimensionality, Multi-class classification},
abstract = {The lack of freely available (real-life or synthetic) high or ultra-high dimensional, multi-class datasets may hamper the rapidly growing research on feature screening, especially in the field of biometrics, where the usage of such datasets is common. This paper reports a Python package called BiometricBlender, which is an ultra-high dimensional, multi-class synthetic data generator to benchmark a wide range of feature screening methods. During the data generation process, the overall usefulness and the intercorrelations of blended features can be controlled by the user, thus the synthetic feature space is able to imitate the key properties of a real biometric dataset.}
}