I just released v0.2.0 of Triton-Augment, a PyTorch library to eliminate the GPU data augmentation bottleneck.
The core issue is the "Global Memory Tax": sequential transforms (Crop, Jitter, Normalize) force the GPU to repeatedly read and write intermediate tensors to VRAM, and that extra memory traffic kills performance.
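For context, an unfused GPU pipeline typically looks roughly like the sketch below (built on torchvision's v2 transforms; the specific transforms, jitter strengths, and tensor sizes are just illustrative assumptions, not a benchmark setup). Each transform materializes its own intermediate tensor in VRAM before the next one runs.

import torch
from torchvision.transforms import v2

# A typical unfused GPU pipeline: every transform writes a full
# intermediate tensor to VRAM and the next transform reads it back.
pipeline = v2.Compose([
    v2.RandomCrop(224),
    v2.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

images = torch.rand(64, 3, 256, 256, device="cuda")  # batch already on GPU
out = pipeline(images)  # several kernel launches, several VRAM round trips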
The Solution: I use Triton to fuse the entire augmentation pipeline into a single, highly optimized GPU kernel, which eliminates all intermediate memory I/O.
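To make the fusion idea concrete, here is a heavily simplified sketch, not Triton-Augment's actual kernel: it fuses a brightness adjustment and per-channel normalization into one elementwise Triton kernel, so each pixel is read from and written to global memory exactly once. Kernel name, block size, and the NCHW-contiguous assumption are mine.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_brightness_normalize_kernel(
    x_ptr, out_ptr, mean_ptr, std_ptr,
    brightness, C, HW, n_elements,
    BLOCK: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements

    # For a contiguous NCHW tensor, the channel of flat index i is
    # (i // (H*W)) % C, so we can gather the right mean/std directly
    # instead of materializing broadcast temporaries.
    c = (offs // HW) % C

    x = tl.load(x_ptr + offs, mask=mask)
    mean = tl.load(mean_ptr + c, mask=mask)
    std = tl.load(std_ptr + c, mask=mask)

    # Brightness jitter + normalize fused: one global-memory read, one write.
    y = (x * brightness - mean) / std
    tl.store(out_ptr + offs, y, mask=mask)

def fused_brightness_normalize(x, mean, std, brightness=1.0, block=1024):
    # x: contiguous NCHW float CUDA tensor; mean/std: 1-D CUDA tensors of length C.
    out = torch.empty_like(x)
    n = x.numel()
    C, HW = x.shape[1], x.shape[2] * x.shape[3]
    grid = (triton.cdiv(n, block),)
    fused_brightness_normalize_kernel[grid](
        x, out, mean, std, brightness, C, HW, n, BLOCK=block
    )
    return out

The real pipeline fuses more ops (including the crop), but the principle is the same: the chain of arithmetic happens in registers, and VRAM is touched only at the ends.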
The Results:
Video: Up to 73.7x faster than Kornia on 5D video tensors.
Image: 8.1x average speedup (up to 12x) over Torchvision v2.
It's designed as a drop-in replacement for your existing Compose pipeline. Check out the GitHub repository for the full API and detailed benchmarks.
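As a rough illustration of what "drop-in" means here: the snippet below uses hypothetical class and module names that I made up for this sketch, so please check the repo for the real API.

import torch
import triton_augment as ta  # hypothetical import name, for illustration only

# Hypothetical drop-in pipeline mirroring a torchvision-style Compose;
# actual class names and arguments are documented in the GitHub repo.
pipeline = ta.Compose([
    ta.RandomCrop(224),
    ta.ColorJitter(brightness=0.4),
    ta.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

images = torch.rand(64, 3, 256, 256, device="cuda")
out = pipeline(images)  # ideally a single fused kernel launch instead of several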
I'm focused on developing the next phase (Resize, Rotation, etc.) and welcome any feedback on the kernels or usage patterns!
GitHub: https://github.com/yuhezhang-ai/triton-augment
Hi everyone, I wanted to share a small library that I've been working on: Triton-Augment. Full technical details are in the comments.