[Wrap] Use a ctypes-based kernel wrapper instead of dlpack for runtime efficiency #95

Merged: 36 commits, Feb 19, 2025

LeiWang1999
Contributor

This pull request includes several changes to enhance the functionality and maintainability of the tilelang project. The changes include updates to the documentation, addition of new scripts, and significant refactoring of the codebase to improve modularity and support for different backends.

Documentation Updates:

  • Updated README.md to correct the capitalization of "WebGPU Codegen" in the latest news section.

New Scripts:

  • Added maint/scripts/pypi_distribution_tox.sh to automate the process of installing multiple Python versions and building wheels for distribution.

Refactoring and Enhancements:

  • Refactored tilelang/engine/lower.py to introduce helper functions for device and host call checks, and updated the lower function to use these helpers for better readability and maintainability.
  • Updated tilelang/jit/adapter to include a new CtypesKernelAdapter and refactored the BaseKernelAdapter to use abstract methods and improve initialization.
  • Modified tilelang/jit/kernel.py to support the new CtypesKernelAdapter and added a target_host parameter for cross-compilation.
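
The adapter split described above (an abstract base plus a ctypes-backed implementation) can be illustrated with a minimal sketch. It is not the code from this PR: `BaseAdapterSketch`, `CtypesAdapterSketch`, `_build_callable`, and `_fake_kernel` are hypothetical names, and a Python callback wrapped in `ctypes.CFUNCTYPE` stands in for the entry point of a compiled shared library so that the example runs without a compiler or GPU.

```python
import ctypes
from abc import ABC, abstractmethod


class BaseAdapterSketch(ABC):
    """Hypothetical base adapter: subclasses decide how the callable is built."""

    def __init__(self):
        self.func = self._build_callable()

    @abstractmethod
    def _build_callable(self):
        """Return a Python callable that launches the kernel."""

    def __call__(self, *args):
        return self.func(*args)


# C-style signature of the stand-in kernel:
#   void add(float* a, float* b, float* c, int n)
KERNEL_SIG = ctypes.CFUNCTYPE(
    None,
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.c_int,
)


def _fake_kernel(a, b, c, n):
    # Stand-in for compiled code: element-wise add through raw pointers.
    for i in range(n):
        c[i] = a[i] + b[i]


class CtypesAdapterSketch(BaseAdapterSketch):
    """Hypothetical ctypes-backed adapter that passes buffers as raw pointers."""

    def _build_callable(self):
        # Keep a reference to the callback so it is not garbage collected.
        self._entry = KERNEL_SIG(_fake_kernel)

        def call(a, b):
            n = len(a)
            buf_a = (ctypes.c_float * n)(*a)
            buf_b = (ctypes.c_float * n)(*b)
            buf_c = (ctypes.c_float * n)()
            # ctypes converts each array to a float* at the call boundary.
            self._entry(buf_a, buf_b, buf_c, n)
            return list(buf_c)

        return call
```

In a real adapter the entry point would presumably come from a generated shared library loaded with `ctypes.CDLL`, and the buffers would be tensor data pointers rather than Python lists; both of those details are assumptions here.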

- Rename ThreadSync and TileLangThreadSync functions in C++ code
- Update Python docstring for ThreadSync with more detailed description
- Reorder library path detection in tilelang environment setup
- Minor comment and code cleanup in CUDA and warp specialization modules
- Standardize pointer type spacing in storage_access.h and storage_access.cc
- Update whitespace and indentation in thread_storage_sync.cc
- Reorder include statements in thread_partial_sync.cc
- Minor code formatting improvements across thread synchronization files
- Correct global function registration to use ThreadSync instead of TileLangThreadSync
- Update TVM global registration to match recent refactoring efforts
- Remove unnecessary whitespace in global function registration
- Compact the TVM global registration line for ThreadSync
- Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h)
- Add WebGPU target support in lower.py and target.py
- Update CMakeLists.txt to include WebGPU codegen source files
- Introduce WebGPU-specific code generation for WGSL shader language
- Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h
- Standardize pointer type spacing and indentation
- Improve line breaks and reduce line length for better readability
- Minor code style improvements in WebGPU code generation
- Implement test_webgpu_codegen.py for WebGPU matrix multiplication
- Add assert_gemm_codegen function to validate WebGPU code generation
- Include basic matrix multiplication kernel test case
- Introduce `is_cpu_device_backend` function to detect CPU backend with C code generation
- Modify `lower` function to handle special case of CPU device backend
- Update host and device call filtering for CPU backend
- Add conditional source code generation for C host target
- Extend JITKernel to support optional target_host parameter
- Add CtypesKernelAdapter with dynamic library generation and kernel wrapping
- Implement TorchCPPKernelAdapter for CUDA kernel compilation
- Refactor BaseKernelAdapter to support more flexible initialization
- Improve error handling and argument processing in kernel adapters
- Update adapter initialization to support various execution backends
- Apply consistent code formatting and whitespace in CTypes adapter files
- Remove unused imports and improve import organization
- Enhance readability of code in adapter, libgen, and wrapper modules
- Add missing whitespace and improve line breaks
- Minor linting and code style improvements across CTypes adapter files
- Implement comprehensive test for matrix multiplication using CTypes execution backend
- Create test functions for GEMM with float16 data type
- Add kernel source verification with custom callback
- Implement reference implementation using PyTorch for result validation
- Support various matrix multiplication configurations (transposition, block sizes)
- Modify tilelang_callback_cuda_postproc to use @tvm.register_func(override=True)
- Ensure proper function registration with ability to replace existing implementations
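
The override behaviour mentioned in the last two items can be sketched with a pure-Python stand-in for a global function registry. This is not TVM's implementation: `_REGISTRY`, `register_func`, and `get_global_func` below are local, hypothetical helpers that only mirror the idea that re-registering a name fails unless `override=True` is passed. The callback name is taken from the commit message; the two post-processing functions are made up for illustration.

```python
# Pure-Python stand-in for a global function registry (not TVM's code).
_REGISTRY = {}


def register_func(name, override=False):
    """Return a decorator that stores a function under `name`."""

    def decorator(func):
        if name in _REGISTRY and not override:
            raise ValueError(f"global function {name!r} is already registered")
        _REGISTRY[name] = func
        return func

    return decorator


def get_global_func(name):
    """Look up a previously registered function by name."""
    return _REGISTRY[name]


@register_func("tilelang_callback_cuda_postproc")
def first_postproc(code):
    # Initial registration: returns the generated source unchanged.
    return code


@register_func("tilelang_callback_cuda_postproc", override=True)
def second_postproc(code):
    # Replaces the earlier callback because override=True was passed.
    return "// patched\n" + code
```

Without `override=True`, the second registration in this sketch would raise, which is the situation the change is meant to avoid when a test installs its own post-processing callback.
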
@LeiWang1999 merged commit fca18c4 into tile-ai:main on Feb 19, 2025
2 of 3 checks passed