[Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency #95

LeiWang1999 · 2025-02-19T12:52:13Z

This pull request adds a ctypes-based kernel adapter as an alternative to the dlpack-based wrapper, with the aim of reducing runtime overhead. It also updates the documentation, adds a distribution script, and refactors the codebase to improve modularity and support for different backends.

Documentation Updates:

Updated README.md to correct the capitalization of "WebGPU Codegen" in the latest news section.

New Scripts:

Added maint/scripts/pypi_distribution_tox.sh to automate the process of installing multiple Python versions and building wheels for distribution.

Refactoring and Enhancements:

Refactored tilelang/engine/lower.py to introduce helper functions for device and host call checks, and updated the lower function to use these helpers for better readability and maintainability.
Updated tilelang/jit/adapter to include a new CtypesKernelAdapter and refactored the BaseKernelAdapter to use abstract methods and improve initialization.
Modified tilelang/jit/kernel.py to support the new CtypesKernelAdapter and added a target_host parameter for cross-compilation.
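
For readers unfamiliar with the pattern, here is a minimal sketch of an abstract adapter with a ctypes-backed subclass. It is not the code from this PR: `BaseAdapterSketch`, `CtypesAdapterSketch`, `fake_kernel`, and the C signature are all hypothetical, and a Python callback stands in for a function that would normally be loaded from a compiled shared library.

```python
import ctypes
from abc import ABC, abstractmethod


class BaseAdapterSketch(ABC):
    """Hypothetical base class: subclasses decide how the kernel is invoked."""

    def __init__(self):
        # Build the callable once and reuse it on every call.
        self.func = self._convert_to_callable()

    @abstractmethod
    def _convert_to_callable(self):
        """Return a Python callable that launches the kernel."""

    def __call__(self, *args):
        return self.func(*args)


# Assumed C signature, for illustration only:
#   int kernel(float* a, float* out, int n)
KERNEL_SIG = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.c_int,
)


@KERNEL_SIG
def fake_kernel(a, out, n):
    # Stand-in for a compiled kernel: doubles every element.
    for i in range(n):
        out[i] = a[i] * 2.0
    return 0


class CtypesAdapterSketch(BaseAdapterSketch):
    """Hypothetical adapter that passes raw buffers to a C function pointer."""

    def __init__(self, c_func):
        self._c_func = c_func
        super().__init__()

    def _convert_to_callable(self):
        def call(values):
            n = len(values)
            a = (ctypes.c_float * n)(*values)   # input buffer
            out = (ctypes.c_float * n)()        # output buffer, zero-initialized
            status = self._c_func(a, out, n)
            if status != 0:
                raise RuntimeError(f"kernel returned status {status}")
            return list(out)

        return call


adapter = CtypesAdapterSketch(fake_kernel)
result = adapter([1.0, 2.0, 3.5])
```

In a real adapter the function pointer would presumably come from a generated shared library (for example via `ctypes.CDLL`), and the buffers would be device pointers rather than host arrays.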

- Rename ThreadSync and TileLangThreadSync functions in C++ code
- Update Python docstring for ThreadSync with more detailed description
- Reorder library path detection in tilelang environment setup
- Minor comment and code cleanup in CUDA and warp specialization modules

- Standardize pointer type spacing in storage_access.h and storage_access.cc
- Update whitespace and indentation in thread_storage_sync.cc
- Reorder include statements in thread_partial_sync.cc
- Minor code formatting improvements across thread synchronization files

- Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
- Add WebGPU target support in lower.py and target.py
- Update CMakeLists.txt to include WebGPU codegen source files
- Introduce WebGPU-specific code generation for WGSL shader language

- Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
- Standardize pointer type spacing and indentation
- Improve line breaks and reduce line length for better readability
- Minor code style improvements in WebGPU code generation

- Introduce `is_cpu_device_backend` function to detect CPU backend with C code generation
- Modify `lower` function to handle special case of CPU device backend
- Update host and device call filtering for CPU backend
- Add conditional source code generation for C host target
- Extend JITKernel to support optional target_host parameter
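
The host/device filtering described in this commit can be sketched roughly as follows. Only the name `is_cpu_device_backend` comes from the commit message; its body, `FuncSketch`, `DEVICE_KERNEL_LAUNCH`, `is_device_call`, and `split_host_device` are assumptions made for illustration, and the actual logic in lower.py may differ.

```python
from dataclasses import dataclass, field

# Assumed marker value for functions launched as device kernels.
DEVICE_KERNEL_LAUNCH = 2


@dataclass
class FuncSketch:
    """Hypothetical stand-in for a lowered function with attributes."""
    name: str
    attrs: dict = field(default_factory=dict)


def is_cpu_device_backend(target_kind: str) -> bool:
    # Assumption: a target kind of "c" means plain C code generation on CPU.
    return target_kind == "c"


def is_device_call(func: FuncSketch) -> bool:
    return func.attrs.get("calling_conv") == DEVICE_KERNEL_LAUNCH


def split_host_device(funcs, target_kind):
    """Return (host_funcs, device_funcs) for the given target kind."""
    if is_cpu_device_backend(target_kind):
        # Assumed special case: with a C backend there is no separate device,
        # so every function passes both filters.
        return list(funcs), list(funcs)
    host = [f for f in funcs if not is_device_call(f)]
    device = [f for f in funcs if is_device_call(f)]
    return host, device


funcs = [
    FuncSketch("main"),
    FuncSketch("main_kernel", {"calling_conv": DEVICE_KERNEL_LAUNCH}),
]
host, device = split_host_device(funcs, "cuda")
host_c, device_c = split_host_device(funcs, "c")
```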

- Add CtypesKernelAdapter with dynamic library generation and kernel wrapping
- Implement TorchCPPKernelAdapter for CUDA kernel compilation
- Refactor BaseKernelAdapter to support more flexible initialization
- Improve error handling and argument processing in kernel adapters
- Update adapter initialization to support various execution backends

- Apply consistent code formatting and whitespace in CTypes adapter files
- Remove unused imports and improve import organization
- Enhance readability of code in adapter, libgen, and wrapper modules
- Add missing whitespace and improve line breaks
- Minor linting and code style improvements across CTypes adapter files

- Implement comprehensive test for matrix multiplication using CTypes execution backend
- Create test functions for GEMM with float16 data type
- Add kernel source verification with custom callback
- Implement reference implementation using PyTorch for result validation
- Support various matrix multiplication configurations (transposition, block sizes)

LeiWang1999 added 30 commits February 11, 2025 15:53

bump version into v0.1.0

c16c0c5

[Enhancement] Add custom develop command for editable installs and update .gitignore

94e05a1

[Documentation] Update README to include system dependencies installation instructions

200bc52

[Build] Update setup.py to support library file copying for both release and develop modes

1b302ea

[Build] Refactor library file copying logic in setup.py

927795f

Merge branch 'main' of https://github.com/tile-ai/tilelang into feat_develop_install

8d406a0

[Documentation] Remove unnecessary install section header in Installation.md

0b3cf60

[Build] Add tox configuration and local distribution script for multi-Python version support

e56befa

[Build] Improve git submodule update function with better error handling

02d80df

Merge branch 'main' of https://github.com/tile-ai/tilelang into feat_develop_install

c8ab95f

[Build] Update LLVM configuration path in ROCm installation script

212cb48

[Build] Add .tox/ to .gitignore for tox testing environment

b78efb0

[Build] Add support for TVM prebuild path configuration in CMakeLists.txt

4d82c6c

[Cleanup] Remove unused TVM runtime error codes header

545e7c5

[Cleanup] Fix TVM grid constant type reference in CUDA module

ce92489

[Cleanup] Remove unused customized_code function from IR module

6b550d8

[Feature] Add TileLang thread synchronization and storage access analysis passes

6306f3a

[Build] Reorder DLL search path directories for more flexible library loading

ed6a322

[Refactor] Fix global function registration for ThreadSync

893d19e

- Correct global function registration to use ThreadSync instead of TileLangThreadSync
- Update TVM global registration to match recent refactoring efforts

[Refactor] Simplify ThreadSync global function registration

912b675

- Remove unnecessary whitespace in global function registration
- Compact the TVM global registration line for ThreadSync

[Test] Add WebGPU matrix multiplication code generation test

975d6fd

- Implement test_webgpu_codegen.py for WebGPU matrix multiplication
- Add assert_gemm_codegen function to validate WebGPU code generation
- Include basic matrix multiplication kernel test case

Merge branch 'main' of https://github.com/tile-ai/tilelang into webgpu

41e2823

Update README with WebGPU codegen support announcement

72fff29

Support multi version pypi package build via tox

470fef7

Merge branch 'main' of https://github.com/tile-ai/tilelang into feat_multi_pypi

c0d432d

LeiWang1999 added 6 commits February 18, 2025 07:24

lint fix

3bdd039

test fix

1a62699

Update TileLang JIT callback registration with override parameter

0f3c512

- Modify tilelang_callback_cuda_postproc to use @tvm.register_func(override=True)
- Ensure proper function registration with ability to replace existing implementations

LeiWang1999 merged commit fca18c4 into tile-ai:main Feb 19, 2025
2 of 3 checks passed
