-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency #95
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…tion instructions
…ase and develop modes
…develop_install
…-Python version support
…develop_install
- Rename ThreadSync and TileLangThreadSync functions in C++ code - Update Python docstring for ThreadSync with more detailed description - Reorder library path detection in tilelang environment setup - Minor comment and code cleanup in CUDA and warp specialization modules
- Standardize pointer type spacing in storage_access.h and storage_access.cc - Update whitespace and indentation in thread_storage_sync.cc - Reorder include statements in thread_partial_sync.cc - Minor code formatting improvements across thread synchronization files
- Correct global function registration to use ThreadSync instead of TileLangThreadSync - Update TVM global registration to match recent refactoring efforts
- Remove unnecessary whitespace in global function registration - Compact the TVM global registration line for ThreadSync
- Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h) - Add WebGPU target support in lower.py and target.py - Update CMakeLists.txt to include WebGPU codegen source files - Introduce WebGPU-specific code generation for WGSL shader language
- Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h - Standardize pointer type spacing and indentation - Improve line breaks and reduce line length for better readability - Minor code style improvements in WebGPU code generation
- Implement test_webgpu_codegen.py for WebGPU matrix multiplication - Add assert_gemm_codegen function to validate WebGPU code generation - Include basic matrix multiplication kernel test case
- Introduce `is_cpu_device_backend` function to detect CPU backend with C code generation - Modify `lower` function to handle special case of CPU device backend - Update host and device call filtering for CPU backend - Add conditional source code generation for C host target - Extend JITKernel to support optional target_host parameter
- Add CtypesKernelAdapter with dynamic library generation and kernel wrapping - Implement TorchCPPKernelAdapter for CUDA kernel compilation - Refactor BaseKernelAdapter to support more flexible initialization - Improve error handling and argument processing in kernel adapters - Update adapter initialization to support various execution backends
- Apply consistent code formatting and whitespace in CTypes adapter files - Remove unused imports and improve import organization - Enhance readability of code in adapter, libgen, and wrapper modules - Add missing whitespace and improve line breaks - Minor linting and code style improvements across CTypes adapter files
- Implement comprehensive test for matrix multiplication using CTypes execution backend - Create test functions for GEMM with float16 data type - Add kernel source verification with custom callback - Implement reference implementation using PyTorch for result validation - Support various matrix multiplication configurations (transposition, block sizes)
- Modify tilelang_callback_cuda_postproc to use @tvm.register_func(override=True) - Ensure proper function registration with ability to replace existing implementations
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This pull request includes several changes to enhance the functionality and maintainability of the
tilelang
project. The changes include updates to the documentation, addition of new scripts, and significant refactoring of the codebase to improve modularity and support for different backends.Documentation Updates:
README.md
to correct the capitalization of "WebGPU Codegen" in the latest news section.New Scripts:
maint/scripts/pypi_distribution_tox.sh
to automate the process of installing multiple Python versions and building wheels for distribution.Refactoring and Enhancements:
tilelang/engine/lower.py
to introduce helper functions for device and host call checks, and updated thelower
function to use these helpers for better readability and maintainability. [1] [2] [3] [4] [5] [6]tilelang/jit/adapter
to include a newCtypesKernelAdapter
and refactored theBaseKernelAdapter
to use abstract methods and improve initialization. [1] [2] [3] [4] [5] [6] [7]tilelang/jit/kernel.py
to support the newCtypesKernelAdapter
and added atarget_host
parameter for cross-compilation. [1] [2] [3]