Skip to content

Latest commit

 

History

History
233 lines (187 loc) · 9.54 KB

getting-started.md

File metadata and controls

233 lines (187 loc) · 9.54 KB

Getting Started with Stamp

This guide is designed to help you with your first steps using the stamp pipeline to predict biomarkers and other attributes from whole slide images (WSIs). To follow along, you will need some WSIs, a table mapping each of these slides to a patient as well as some ground truth we will eventually train a neural network on.

Whole Slide Images

The whole slide images have to be in any of the formats supported by OpenSlide. For the next steps we assume that all these WSIs are stored in the same directory. We will call this directory the WSI directory.

Creating a Configuration File

Stamp is configured using configuration files. We recommend creating one configuration file per experiment and storing in the same folder as the eventual results, as this makes it easier to reconstruct which data and parameters a model was trained with later.

The stamp init command creates a new configuration file with dummy values. By default, it is created in $PWD/config.yaml, but we can use the --config option to specify its location:

# Create a directory to save our experiment results to
mkdir stamp-test-experiment
# Create a new config file in said directory
stamp --config stamp-test-experiment/config.yaml init

Feature Extraction

To do any kind of training on our data, we first have to convert it into a form more easily usable by neural networks. We do this using a feature extractor. A feature extractor is a neural network has been trained on a large amount of WSIs to extract extract the information relevant for our domain from images. This way, we can compress WSIs into a more compact representation, which in turn allows us to efficiently train machine learning models with them.

Stamp currently supports the following feature extractors:

As some of the above require you to request access to the model on huggingface, we will stick with ctranspath for this example.

In order to use a feature extractor, you also have to install their respective dependencies. You can do so by specifying the feature extractor you want to use when installing stamp:

# Install stamp including the dependencies for all feature extractors
pip install "git+https://github.com/KatherLab/stamp@v2[all]"

Open the stamp-test-experiment/config.yaml we created in the last step and modify the output_dir, wsi_dir and cache_dir entries in the preprocessing section to contain the absolute paths of the directory the configuration file resides in. wsi_dir Needs to point to a path containing the WSIs you want to extract features from.

The cache_dir will be used to save intermediate data. Should you decide to try another feature extractor later, using the same cache dir again will significantly speed up the extraction process. If you will only extract features once, it can be set to none.

# stamp-test-experiment/config.yaml

preprocessing:
  output_dir: "/absolute/path/to/stamp-test-experiment"
  wsi_dir: "/absolute/path/to/wsi_dir"

  # Other possible values are "mahmood-uni" and "mahmood-conch"
  extractor: "ctranspath"

  # Having a cache dir will speed up extracting features multiple times,
  # e.g. with different feature extractors.
  # Optional.
  cache_dir: "/absolute/path/to/stamp-test-experiment/../cache"
  # If you do not want to use a cache,
  # change the cache dir to the following:
  # cache_dir: null

  # Device to run feature extraction on.
  # Set this to "cpu" if you do not have a CUDA-capable GPU.
  device: "cuda"

  # How many workers to use for tile extraction.  Should be less or equal to
  # the number of cores of your system.
  max_workers: 8

Extracting the features is then as easy as running

stamp --config stamp-test-experiment/config.yaml preprocess

Depending on the size of your dataset and your hardware, this process may take anything between a few hours and days.

You can interrupt this process at any time. It will continue where you stopped it the next time you run stamp preprocess.

As the preprocessing is running, you can see the output directory fill up with the features, saved in .h5 files, as well as .jpgs showing from which parts of the slide features are extracted. Most of the background should be marked in red, meaning ignored that it was ignored during feature extraction.

If you are using the UNI or CONCH models and working in an environment where your home directory storage is limited, you may want to also specify your huggingface storage directory by setting the HF_HOME environment variable:

export HF_HOME=/path/to/directory/to/store/huggingface/data/in
huggingface-cli login   # only needs to be done once per $HF_HOME
stamp -c stamp-test-experiment/config.yaml preprocess

Doing Cross-Validation on the Data Set

One way to quickly ascertain if a neural network can be trained to recognize a specific pattern without the need to source a separate testing set is to perform a cross-validation on it. During a cross validation, we train multiple models on a subset of the data, testing its effectiveness on the held-out part of the data not used during training. To perform a cross-validation, add the following lines to your stamp-test-experiment/config.yaml, with feature_dir adapted to match the directory the .h5 files were output to in the last step. clini_table and slide_table both need to point to tables, either in excel or .csv format, with contents as described below. Finally, ground_truth_label needs to contain the column name of the data we want to train our model on. Stamp only can be used to train neural networks for categorical targets. We recommend explicitly setting the possible classes using the categories field.

# stamp-test-experiment/config.yaml

crossval:
  output_dir: "/absolute/path/to/stamp-test-experiment"

  # An excel (.xlsx) or CSV (.csv) table containing the clinical information of
  # patients.  Patients not present in this file will be ignored during training.
  # Has to contain at least two columns, one titled "PATIENT", containing a patient ID,
  # and a second column containing the categorical ground truths for that patient.
  clini_table: "metadata-CRC/TCGA-CRC-DX_CLINI.xlsx"

  # Directory the extracted features are saved in.
  feature_dir: "/absolute/path/to/stamp-test-experiment/xiyuewang-ctranspath-7c998680-112fc79c"

  # A table (.xlsx or .csv) relating every patient to their feature files.
  # The table must contain at least two columns, one titled "PATIENT",
  # containing the patient ID (matching those in the `clini_table`), and one
  # called "FILENAME", containing the feature file path relative to `feature_dir`.
  # Patient IDs not present in the clini table as well as non-existent feature
  # paths are ignored.
  slide_table: "slide.csv"

  # Name of the column from the clini table to train on.
  ground_truth_label: "isMSIH"

  # Optional settings:

  # The categories occurring in the target label column of the clini table.
  # If unspecified, they will be inferred from the table itself.
  categories: ["yes", "no"]

  # Number of folds to split the data into for cross-validation
  #n_splits: 5

After specifying all the parameters of our cross-validation, we can run it by invoking:

stamp --config stamp-test-experiment/config.yaml crossval

Generating Statistics

After training and validating your model, you may want to generate statistics to evaluate its performance. This can be done by adding a statistics section to your stamp-test-experiment/config.yaml file. The configuration should look like this:

# stamp-test-experiment/config.yaml

statistics:
  output_dir: "/absolute/path/to/stamp-test-experiment/statistics"

  # Name of the target label.
  ground_truth_label: "isMSIH"

  # A lot of the statistics are computed "one-vs-all", i.e. there needs to be
  # a positive class to calculate the statistics for.
  true_class: "yes"

  pred_csvs:
  - "/absolute/path/to/stamp-test-experiment/split-0/patient-preds.csv"
  - "/absolute/path/to/stamp-test-experiment/split-1/patient-preds.csv"
  - "/absolute/path/to/stamp-test-experiment/split-2/patient-preds.csv"
  - "/absolute/path/to/stamp-test-experiment/split-3/patient-preds.csv"
  - "/absolute/path/to/stamp-test-experiment/split-4/patient-preds.csv"

To generate the statistics, run the following command:

stamp --config stamp-test-experiment/config.yaml statistics

Afterwards, the output_dir should contain the following files:

  • isMSIH-categorical-stats-individual.csv contains statistical scores for each individual split.
  • isMSIH-categorical-stats-aggregated.csv contains the mean as well as the 95% confidence interval for the statistical scores for the splits.
  • roc-curve_isMSIH=yes.svg and pr-curve_isMSIH=yes.svg contain the ROC and precision recall curves of the splits.