Bug Fix: Fixed a bug in `pipeline.py` that prevented integer-type `sample_id` values from being recognized, which caused file-saving failures.
New Feature: Added `pipeline_example.py`. This new file demonstrates how to use `pipeline.py` for data preprocessing with the `metadata.csv` file obtained from `LLM_metadata.py`. A small bug remains, but it does not hinder functionality.
New Feature: You can now automatically generate `metadata.csv` using an LLM!

This update introduces a fully automated process that uses a Large Language Model (LLM) to generate the `metadata.csv` file for medical image datasets. This simplifies data preprocessing and significantly reduces manual effort and errors. The core steps are:
First, the `analyze_directory` function analyzes and samples the dataset's directory structure. It traverses the root directory, identifies all Level A folders (sample folders), and randomly samples files from these folders and their subfolders. The results are saved as a JSON file named `directory_analysis.json`.

- Folder Structure Analysis: Recursively traverses the folder structure to create a directory tree.
- Random Sampling: Randomly samples a defined number of files from each Level A folder.
- Result Saving: Saves the analysis as `directory_analysis.json` for subsequent steps.

```python
analyze_directory(root_directory, sample_folder_count=5, sample_file_count=10)
```

- `root_directory`: The path to the root directory of the dataset.
- `sample_folder_count`: The number of Level A folders to randomly sample.
- `sample_file_count`: The number of files to randomly sample from each folder.
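The actual implementation lives in `LLM_metadata.py`; the sketch below shows one way the analysis-and-sampling step could work. The JSON layout and the exact traversal logic here are assumptions for illustration, not the script's guaranteed output.

```python
import json
import os
import random

def analyze_directory(root_directory, sample_folder_count=5, sample_file_count=10):
    """Sample the dataset layout and save it as directory_analysis.json (illustrative sketch)."""
    # Level A folders are the immediate subdirectories of the root (one per sample).
    level_a = sorted(
        d for d in os.listdir(root_directory)
        if os.path.isdir(os.path.join(root_directory, d))
    )
    sampled_folders = random.sample(level_a, min(sample_folder_count, len(level_a)))

    analysis = {"root": root_directory, "folders": {}}
    for folder in sampled_folders:
        # Walk the whole sample folder, then keep a random subset of file paths.
        files = []
        for dirpath, _, filenames in os.walk(os.path.join(root_directory, folder)):
            files.extend(
                os.path.relpath(os.path.join(dirpath, f), root_directory)
                for f in filenames
            )
        analysis["folders"][folder] = random.sample(files, min(sample_file_count, len(files)))

    out_path = os.path.join(root_directory, "directory_analysis.json")
    with open(out_path, "w") as fh:
        json.dump(analysis, fh, indent=2)
    return out_path
```

The sampled listing is deliberately small: it gives the LLM enough file names to infer naming patterns without sending the whole directory tree.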
Next, the `generate_metadata` function takes `directory_analysis.json` as input and uses the DeepSeek API to generate Python code for building `metadata.csv`. The LLM automatically identifies multimodal files and mask files based on file naming patterns and generates the appropriate code.

- File Naming Pattern Analysis: The LLM analyzes file names to identify multimodal and mask files.
- Code Generation: Generates Python code for building the `metadata.csv` file.
- Result Saving: Saves the generated code as `generate_metadata.py`.
```python
generate_metadata(root_directory, your_api_key=None)
```

- `root_directory`: The path to the root directory of the dataset.
- `your_api_key`: Your DeepSeek API key (you can get one for free from DeepSeek's official website).
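`generate_metadata` itself calls the DeepSeek API; the request-building half of that step can be sketched as follows. The prompt wording, payload fields, and model name are illustrative assumptions (DeepSeek's chat API is OpenAI-compatible, but see `LLM_metadata.py` for the exact call).

```python
import json
import os

def build_metadata_prompt(root_directory):
    """Turn directory_analysis.json into a chat payload for the LLM (illustrative sketch)."""
    with open(os.path.join(root_directory, "directory_analysis.json")) as fh:
        analysis = json.load(fh)

    prompt = (
        "Below is a sampled directory listing of a medical image dataset.\n"
        "Identify the multimodal image files and mask files from the naming patterns, "
        "and write Python code that builds a metadata.csv indexing them.\n\n"
        + json.dumps(analysis, indent=2)
    )
    # OpenAI-compatible chat-completions payload; the model name is an assumption.
    return {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
    }
```

The returned payload would then be POSTed to the DeepSeek chat-completions endpoint with the API key, and the code in the response saved as `generate_metadata.py`.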
Finally, the `execute_metadata_script` function executes the generated `generate_metadata.py` script to automatically create `metadata.csv`. The function checks whether the CSV file was created correctly and prints its first 5 rows for verification.

- Code Execution: Executes the `generate_metadata.py` script to create the `metadata.csv` file.
- Result Verification: Checks that the CSV file exists and prints its first 5 rows.
```python
execute_metadata_script(root_directory)
```

- `root_directory`: The path to the root directory of the dataset.
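A minimal sketch of this execute-and-verify step is shown below; the function body is an assumption about how the real implementation in `LLM_metadata.py` might work, not a copy of it.

```python
import csv
import os
import subprocess
import sys

def execute_metadata_script(root_directory):
    """Run the generated script and verify metadata.csv (illustrative sketch)."""
    script = os.path.join(root_directory, "generate_metadata.py")
    # Run the generated code inside the dataset root so relative paths resolve.
    subprocess.run([sys.executable, script], cwd=root_directory, check=True)

    csv_path = os.path.join(root_directory, "metadata.csv")
    if not os.path.exists(csv_path):
        raise FileNotFoundError("metadata.csv was not created")

    # Print the first 5 rows for a quick sanity check.
    with open(csv_path, newline="") as fh:
        for i, row in enumerate(csv.reader(fh)):
            if i >= 5:
                break
            print(row)
    return csv_path
```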
Through these three steps, you can easily generate `metadata.csv` using an LLM, without manually writing code or analyzing file naming patterns. This fully automated process is suitable for a wide range of medical image datasets.
```python
if __name__ == "__main__":
    # 1. Analyze the directory structure
    root_directory = "/teamspace/studios/this_studio/kaggle_3m"  # Fill in the root directory of the dataset (must be an absolute path)
    analyze_directory(root_directory=root_directory, sample_folder_count=5, sample_file_count=10)

    # 2. Generate Python code for metadata.csv
    generate_metadata(root_directory=root_directory, your_api_key=None)

    # 3. Execute the code and check metadata.csv
    execute_metadata_script(root_directory=root_directory)
```
- Ensure your dataset files have consistent naming patterns so the LLM can correctly identify multimodal and mask files.
- For larger datasets, increase `sample_folder_count` and `sample_file_count` for better LLM analysis accuracy.
- If you use the DeepSeek API, make sure you have obtained your API key and included it in `LLM_metadata.py`.
This project uses the BraTS2021 dataset for preprocessing examples. The primary files and directories are:
`PreProcPipe/BraTS2021_Training_Data/`
Contains the original BraTS2021 training data, with each sample stored in its own folder named by its ID.
`PreProcPipe/BraTS2021_Training_Data/BraTS2021_00000/`

- `BraTS2021_00000_flair/` - Contains the FLAIR modality file, e.g., `00000057_brain_flair.nii`
- `BraTS2021_00000_seg/` - Contains the segmentation file.
- `BraTS2021_00000_t1/` - Contains the T1 modality file.
- `BraTS2021_00000_t1ce/` - Contains the T1CE modality file.
- `BraTS2021_00000_t2/` - Contains the T2 modality file.

Other samples follow a similar structure, e.g.:

- `PreProcPipe/BraTS2021_Training_Data/BraTS2021_00002/`
- `PreProcPipe/BraTS2021_Training_Data/BraTS2021_00003/`
- `PreProcPipe/tutorial.ipynb` - A detailed tutorial on using `pipeline.py`, demonstrating how to load, preprocess, and save data. Recommended for new users.
- `PreProcPipe/pipeline.py` - The main preprocessing script, containing the logic for cropping, normalizing, and resampling BraTS2021 data.
- `PreProcPipe/LLM_metadata.py` - A script that uses an LLM to generate `metadata.csv`.
- `PreProcPipe/How_I_Use_LLM_to_DIY_metadata.ipynb` - A notebook documenting the steps for using an LLM to generate `metadata.csv`.
The `SimplePreprocessor` is the core class for multimodal image preprocessing, designed to handle multimodal MRI or CT data. It performs cropping, normalization, resampling, and resizing.
The `__init__` method configures the preprocessing parameters:

- `target_spacing`: The target voxel spacing (default `[1.0, 1.0, 1.0]`).
- `normalization_scheme`: The normalization method (`z-score` or `min-max`).
- `target_size`: The target image size (e.g., `[256, 256]`; defaults to `None` for no resizing).
- `read_images(image_paths)`: Loads multimodal image data, returning a list of NumPy arrays and the voxel spacing.
- `read_seg(seg_path)`: Loads segmentation data, returning a NumPy array.
- `crop(data_list, seg)`:
  - Crops the image and segmentation data along the Z-axis only, removing all-zero regions.
  - Returns the cropped image data, segmentation data, and cropping properties (bounding box and shape changes).
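The Z-axis-only crop can be sketched with NumPy as below. This is a simplified stand-in for the class method, assuming volumes are indexed `[X, Y, Z]` with Z as the last axis.

```python
import numpy as np

def crop_z(data_list, seg=None):
    """Remove leading/trailing all-zero Z-slices (simplified sketch; Z is the last axis)."""
    # A Z-slice is kept if any modality has nonzero data in it.
    nonzero = np.zeros(data_list[0].shape[-1], dtype=bool)
    for data in data_list:
        nonzero |= np.any(data != 0, axis=(0, 1))
    z_indices = np.where(nonzero)[0]
    z_min, z_max = int(z_indices[0]), int(z_indices[-1])

    properties = {
        "shape_before_cropping": [d.shape for d in data_list],
        "z_bbox": [z_min, z_max],
    }
    cropped = [d[..., z_min:z_max + 1] for d in data_list]
    if seg is not None:
        seg = seg[..., z_min:z_max + 1]
    properties["shape_after_cropping"] = [d.shape for d in cropped]
    return cropped, seg, properties
```

Note that every modality and the segmentation are cropped with the same bounding box, so they stay aligned.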
- `_normalize_single_modality(data)`:
  - Normalizes single-modality data.
  - Supports `z-score` and `min-max` normalization methods.
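Both schemes are a few lines of NumPy; a simplified sketch of the two supported methods (the epsilon guards are an assumption added to avoid division by zero on constant images):

```python
import numpy as np

def normalize_single_modality(data, scheme="z-score"):
    """Normalize one modality (simplified sketch of the two supported schemes)."""
    data = data.astype(np.float32)
    if scheme == "z-score":
        # Zero mean, unit variance.
        std = data.std()
        return (data - data.mean()) / max(std, 1e-8)
    elif scheme == "min-max":
        # Rescale intensities to [0, 1].
        lo, hi = data.min(), data.max()
        return (data - lo) / max(hi - lo, 1e-8)
    raise ValueError(f"unknown scheme: {scheme}")
```

z-score is the usual choice for MRI, where absolute intensities are not comparable across scanners; min-max is useful when a fixed output range is required.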
- `compute_new_shape(old_shape, old_spacing, new_spacing)`:
  - Calculates the target shape based on the original shape and voxel spacing.
  - Outputs the resampling factor and the new shape.
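The shape arithmetic is just the per-axis spacing ratio: halving the spacing doubles the number of voxels along that axis. A sketch (the exact rounding behavior of the real method is an assumption):

```python
import numpy as np

def compute_new_shape(old_shape, old_spacing, new_spacing):
    """Target shape when resampling from old_spacing to new_spacing (sketch)."""
    # factor > 1 means upsampling along that axis.
    factor = np.array(old_spacing, dtype=float) / np.array(new_spacing, dtype=float)
    new_shape = np.round(np.array(old_shape) * factor).astype(int)
    return factor, new_shape
```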
- `resample_data(data, new_shape, order=3)`:
  - Resamples image data to the target shape, using cubic interpolation by default.
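Resampling to the computed shape is typically a single call to `scipy.ndimage.zoom`; whether `pipeline.py` uses exactly this call is an assumption, but the sketch captures the idea:

```python
import numpy as np
from scipy.ndimage import zoom

def resample_data(data, new_shape, order=3):
    """Resample a 3D volume to new_shape via spline interpolation of the given order."""
    # zoom expects per-axis scale factors, not a target shape.
    zoom_factors = [n / o for n, o in zip(new_shape, data.shape)]
    return zoom(data, zoom_factors, order=order)
```

For segmentation masks, `order=0` (nearest neighbor) should be used instead of cubic interpolation so label values are not blended.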
- `resize_to_target_size(data, target_size, order=3)`:
  - Resizes image data to the specified target size (e.g., `[256, 256]`).
  - Keeps the Z-axis depth unchanged by default.
- `run_case(image_paths, seg_path=None)`: Executes the following steps:
  - Data Loading: Loads all modality images and the corresponding segmentation data.
  - Z-axis Cropping: Calls the `crop` method to crop only Z-axis all-zero regions, preserving the other dimensions.
  - Normalization: Normalizes each modality independently using `_normalize_single_modality`.
  - Resampling: Uses `compute_new_shape` and `resample_data` to adjust the voxel resolution.
  - Resizing: Adjusts the data to the target size using `resize_to_target_size`.
  - Return Results: Outputs the cropped data, segmentation data, original spacing information, and cropping properties.
`SimplePreprocessor` is designed for preprocessing medical image data with multimodal images and segmentation. Each step is modular for easy extension and reuse, and it supports most common preprocessing needs for high-dimensional data.
`SimplePreprocessor` provides a flexible interface for preprocessing multimodal and single-modality image data. Here's a detailed guide on how to use it:
For multimodal data (e.g., FLAIR, T1, T1CE, T2), the input should be a list of file paths, each pointing to a `.nii` file for a specific modality. For example:
```python
image_paths = [
    "BraTS2021_00000/BraTS2021_00000_flair/00000057_brain_flair.nii",
    "BraTS2021_00000/BraTS2021_00000_t1/00000057_brain_t1.nii",
    "BraTS2021_00000/BraTS2021_00000_t1ce/00000057_brain_t1ce.nii",
    "BraTS2021_00000/BraTS2021_00000_t2/00000057_brain_t2.nii"
]
```
For single-modality data, the input should be a list containing only one `.nii` file path:
```python
image_paths = [
    "BraTS2021_00000/BraTS2021_00000_flair/00000057_brain_flair.nii"
]
```
Segmentation data is input as a single file path to a `.nii` segmentation file. For example:
```python
seg_path = "BraTS2021_00000/BraTS2021_00000_seg/00000057_seg.nii"
```
Segmentation data is optional. Set `seg_path` to `None` if no segmentation data is available.
Create an instance of `SimplePreprocessor`, specifying the following parameters:

- `target_spacing`: The target voxel spacing (default is `[1.0, 1.0, 1.0]`).
- `normalization_scheme`: The normalization method (default is `"z-score"`).
- `target_size`: The target size (default is `None`, which means no resizing).
Example:
```python
from pipeline import SimplePreprocessor

preprocessor = SimplePreprocessor(
    target_spacing=[1.0, 1.0, 1.0],
    normalization_scheme="z-score",
    target_size=[256, 256]
)
```
Use the `run_case` method to preprocess a single sample:
```python
# Input data
image_paths = [
    "BraTS2021_00000/BraTS2021_00000_flair/00000057_brain_flair.nii",
    "BraTS2021_00000/BraTS2021_00000_t1/00000057_brain_t1.nii",
    "BraTS2021_00000/BraTS2021_00000_t1ce/00000057_brain_t1ce.nii",
    "BraTS2021_00000/BraTS2021_00000_t2/00000057_brain_t2.nii"
]
seg_path = "BraTS2021_00000/BraTS2021_00000_seg/00000057_seg.nii"

# Run the preprocessing
data_list, seg, spacing, properties = preprocessor.run_case(image_paths, seg_path)
```
- `data_list`: Preprocessed multimodal image data (after cropping, normalization, resampling, and resizing).
- `seg`: Preprocessed segmentation data (if available).
- `spacing`: Voxel spacing information from the original image.
- `properties`: Attributes related to cropping and preprocessing (e.g., shape before and after cropping, cropping boundaries).
For single-modality data, the input list contains only one file path:
```python
image_paths = [
    "BraTS2021_00000/BraTS2021_00000_flair/00000057_brain_flair.nii"
]
seg_path = None  # If no segmentation data

data_list, seg, spacing, properties = preprocessor.run_case(image_paths, seg_path)
```
- `data_list` will contain the preprocessing results for the single modality.
- `seg` will be `None`.
To process multiple samples, store the inputs (`image_paths` and `seg_path`) for each sample in a list, and use a multiprocessing tool (e.g., `run_in_parallel`).
```python
from pipeline import run_in_parallel

cases = [
    {
        "image_paths": [
            "BraTS2021_00000/BraTS2021_00000_flair/00000057_brain_flair.nii",
            "BraTS2021_00000/BraTS2021_00000_t1/00000057_brain_t1.nii"
        ],
        "seg_path": "BraTS2021_00000/BraTS2021_00000_seg/00000057_seg.nii"
    },
    {
        "image_paths": [
            "BraTS2021_00001/BraTS2021_00001_flair/00000058_brain_flair.nii"
        ],
        "seg_path": None
    }
]

# Batch process
results = run_in_parallel(preprocessor, cases, num_workers=4, output_root="preprocessed_data")
```
`results` contains the preprocessing results for each sample. The data is also saved in the directory specified by `output_root="preprocessed_data"`.
After processing, each sample's return value includes:
- `data_list`: A list storing the preprocessed data for each modality.
- `seg`: Preprocessed segmentation data (if available).
- `spacing`: The original voxel spacing.
- `properties`: Information about cropping, normalization, and resampling, for example:

```python
{
    "shape_before_cropping": [(240, 240, 155), ...],
    "shape_after_cropping": [(240, 240, 120), ...],
    "z_bbox": [10, 130]
}
```
This structured return allows you to easily save or analyze the results in your next steps.
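For instance, a common pattern for persisting one case's return values is a `.npy` array per data item plus a JSON sidecar for `properties`. The file names and layout below are assumptions for illustration, not what `run_in_parallel` actually writes:

```python
import json
import os
import numpy as np

def save_case(output_dir, case_id, data_list, seg, properties):
    """Save one preprocessed case as .npy arrays plus a JSON sidecar (illustrative sketch)."""
    case_dir = os.path.join(output_dir, case_id)
    os.makedirs(case_dir, exist_ok=True)

    # One stacked array for all modalities: shape (num_modalities, X, Y, Z).
    np.save(os.path.join(case_dir, "data.npy"), np.stack(data_list))
    if seg is not None:
        np.save(os.path.join(case_dir, "seg.npy"), seg)

    # default=str handles non-JSON-native values such as tuples or NumPy scalars.
    with open(os.path.join(case_dir, "properties.json"), "w") as fh:
        json.dump(properties, fh, indent=2, default=str)
```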
This detailed breakdown should make it easy to understand and use the provided code.