I just released v0.2.0 of Triton-Augment, a PyTorch library to eliminate the GPU data augmentation bottleneck.
The core issue is the "Global Memory Tax": sequential transforms (Crop, Jitter, Normalize) force the GPU to repeatedly read and write intermediate tensors to VRAM, and that extra memory traffic kills performance.
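For context, an unfused GPU pipeline typically looks roughly like the sketch below (built on torchvision's v2 transforms; the specific transforms, jitter strengths, and tensor sizes are just illustrative assumptions, not a benchmark setup). Each transform materializes its own intermediate tensor in VRAM before the next one runs.

import torch
from torchvision.transforms import v2

# A typical unfused GPU pipeline: every transform writes a full
# intermediate tensor to VRAM and the next transform reads it back.
pipeline = v2.Compose([
    v2.RandomCrop(224),
    v2.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

images = torch.rand(64, 3, 256, 256, device="cuda")  # batch already on GPU
out = pipeline(images)  # several kernel launches, several VRAM round trips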
The Solution: I use Triton to fuse the entire augmentation pipeline into a single, highly optimized GPU kernel, which eliminates all intermediate memory I/O.
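To make the fusion idea concrete, here is a heavily simplified sketch, not Triton-Augment's actual kernel: it fuses a brightness adjustment and per-channel normalization into one elementwise Triton kernel, so each pixel is read from and written to global memory exactly once. Kernel name, block size, and the NCHW-contiguous assumption are mine.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_brightness_normalize_kernel(
    x_ptr, out_ptr, mean_ptr, std_ptr,
    brightness, C, HW, n_elements,
    BLOCK: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements

    # For a contiguous NCHW tensor, the channel of flat index i is
    # (i // (H*W)) % C, so we can gather the right mean/std directly
    # instead of materializing broadcast temporaries.
    c = (offs // HW) % C

    x = tl.load(x_ptr + offs, mask=mask)
    mean = tl.load(mean_ptr + c, mask=mask)
    std = tl.load(std_ptr + c, mask=mask)

    # Brightness jitter + normalize fused: one global-memory read, one write.
    y = (x * brightness - mean) / std
    tl.store(out_ptr + offs, y, mask=mask)

def fused_brightness_normalize(x, mean, std, brightness=1.0, block=1024):
    # x: contiguous NCHW float CUDA tensor; mean/std: 1-D CUDA tensors of length C.
    out = torch.empty_like(x)
    n = x.numel()
    C, HW = x.shape[1], x.shape[2] * x.shape[3]
    grid = (triton.cdiv(n, block),)
    fused_brightness_normalize_kernel[grid](
        x, out, mean, std, brightness, C, HW, n, BLOCK=block
    )
    return out

The real pipeline fuses more ops (including the crop), but the principle is the same: the chain of arithmetic happens in registers, and VRAM is touched only at the ends.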
The Results:
Video: Up to 73.7x faster than Kornia on 5D video tensors.
Image: 8.1x average speedup (up to 12x) over Torchvision v2.
It's designed as a drop-in replacement for your existing Compose pipeline. Check out the GitHub repository for the full API and detailed benchmarks.
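As a rough illustration of what "drop-in" means here: the snippet below uses hypothetical class and module names that I made up for this sketch, so please check the repo for the real API.

import torch
import triton_augment as ta  # hypothetical import name, for illustration only

# Hypothetical drop-in pipeline mirroring a torchvision-style Compose;
# actual class names and arguments are documented in the GitHub repo.
pipeline = ta.Compose([
    ta.RandomCrop(224),
    ta.ColorJitter(brightness=0.4),
    ta.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

images = torch.rand(64, 3, 256, 256, device="cuda")
out = pipeline(images)  # ideally a single fused kernel launch instead of several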
I'm focused on developing the next phase (Resize, Rotation, etc.) and welcome any feedback on the kernels or usage patterns!
GitHub: https://github.com/yuhezhang-ai/triton-augment
Hi everyone, I wanted to share a small library that I've been working on: Triton-Augment. Full technical details are in the comments.