Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

(aarushgupta.io)

227 points | by ag2718 15 hours ago

33 comments

potato-peeler 27 minutes ago
Bit off topic, but I have always wondered how it is decided whose name comes first in a paper. You mentioned you and Duc Hoang having equal contribution, so how did you both decide this? Was it that person's idea first, or were you his roommate and owed him a beer? Coin toss? I never had a traditional college life. Always wondered about all this.
- Epa095 9 minutes ago
  This differs between fields, sometimes even down to the niche subject. In my subfield (of computer science) it was strictly alphabetical by surname, and the idea was that either you contributed or not, and there is no gradient. In other fields it's 'main author' and everyone else, with the expectation that the main author does more. Some have the group leader as the first name, or the 'big shot' is always first. My impression is that in medicine it is often a kind of ranking from 'most to least' main author.
Lerc 9 hours ago
Has there been much exploration on how much benefit comes from precision in activation functions in KANs? There's a little niggle in the back of my head that maybe 90% of the benefit of KANs can be gained from a quite small variety of function shapes. Combined with input weighting, I almost feel you could have a representation that scales from a standard relu perceptron through KANs to something with weighted inputs and fancy weighted activation functions.
Mark that out in 2d with axes of input weight precision and activation weight precision, you could perhaps do sweeps to find the best accuracy per parameter bit, or accuracy/speed, or some sweet spot that has a nice balance of operating speed, accuracy, and model size.
- ag2718 8 hours ago
  There is definitely a precision-performance tradeoff to consider. We explored this through ablation studies on bitwidth precision / resource usage in our work (Figure 6a in https://arxiv.org/pdf/2512.12850, Figure 4 in https://arxiv.org/pdf/2602.02056). Further exploration into the mechanics here would definitely be useful.
  Regarding your point that "90% of the benefit of KANs can be gained from a small variety of function shapes": even within the B-spline basis, the shapes are quite uniform. Much of the actual benefit of scaling up the basis size comes from learning more complex, piecewise-polynomial activation functions. Scaling up the number of basis functions (i.e. more granular intervals) also increases locality and allows the activation function's value across different parts of the domain to be learned semi-independently. (There obviously is a tradeoff here with overfitting.)
  The number of basis functions (G+S) is largely what determines how expressive the activation is, as it relates to your point: "you could have a representation that scales from a standard relu perceptron through KANs to something with weighted inputs and fancy weighted activation functions."
  - zipy124 9 minutes ago
    Can I just say that this is extremely impressive work for a master's level thesis. Incredible work and I hope you manage to continue fulfilling your fantastic potential in your career!
- hodgehog11 9 hours ago
  The benefit in KANs is interpretability, not expressivity. It's a structure that lends itself well to performing symbolic regression or other interpretable downstream tasks. This can make it better suited for scientific tasks, for example. You can easily replicate the practical performance of any KAN with an MLP, and it will train and run faster on modern architectures. This work proposes a method by which a KAN might be faster, but it's early days in my view.
  Precision in the activation function is targeting a part of neural networks that you don't want. There are many other methods that work with high precision. You use neural networks because of their implicit bias toward regular solutions. That means there is a sweet spot at low precision that you're targeting.
  - ag2718 5 hours ago
    A key benefit of KANs is expressivity, as each layer is significantly more expressive than an MLP layer. This can be seen in our benchmarks: KAN networks need fewer layers than MLPs to match or beat their performance, even in software.
    However, on GPUs, KAN implementations are far less efficient than MLPs, since B-spline locality is hard to exploit and lookup operations aren't as efficient. This is your original point about MLPs training and running faster on modern architectures: each KAN layer is more expressive, but its poor hardware efficiency makes it a net negative (at least for current approaches).
    On FPGAs, LUT lookups are cheap, so KANs' expressive layers map to very hardware-efficient implementations, and the resulting networks are thus much more compact and efficient than equivalent MLPs.
    On your second point: low precision is certainly viable for both inference and learning (as shown in our work), and quantization can even have a mild regularizing effect. However, task performance generally worsens with lower precision (here and across the literature): the use of low precision is fundamentally a result of the efficiency-performance tradeoff.
    - hodgehog11 3 hours ago
      I generally agree with this rebuttal. Each KAN layer is more expressive on a per-layer basis, although there is a mapping to an MLP with more layers. With the current hardware implementations, yes, MLPs have an advantage overall. I can certainly respect the intention to make KANs faster, since it is a serious issue for more widespread adoption, and KANs certainly have their value.
      I'm still very skeptical of arguing for KANs as an eventual replacement, like I've seen some papers on the subject argue. The reduced depth may not be an advantage. For example, higher depth for standard neural networks doesn't just add to expressivity, it actually induces spectral sparsity bias. KANs have a bias of their own, but it is different, and is sometimes better, sometimes worse, depending on the task. If increasing depth turns out to be important, KANs might remain less efficient overall.
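For readers following the exchange above, here is a minimal sketch of the two ideas being discussed: a learned activation built from G+S B-spline basis functions, and that activation tabulated into a small lookup table at reduced precision. It is an illustration only, not the implementation from the linked work. All function names, the coefficient values, and the quantization scheme (uniform input codes, signed fixed-point outputs) are assumptions made for the example.

```python
def uniform_knots(lo, hi, grid_size, order):
    """Uniform knot vector, extended by `order` knots past each end of [lo, hi]."""
    h = (hi - lo) / grid_size
    return [lo + (j - order) * h for j in range(grid_size + 2 * order + 1)]


def bspline_basis(x, knots, i, degree):
    """Cox-de Boor recursion: value at x of the i-th B-spline of the given degree."""
    if degree == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + degree] - knots[i]
    right_den = knots[i + degree + 1] - knots[i + 1]
    left = 0.0
    if left_den != 0:
        left = (x - knots[i]) / left_den * bspline_basis(x, knots, i, degree - 1)
    right = 0.0
    if right_den != 0:
        right = ((knots[i + degree + 1] - x) / right_den
                 * bspline_basis(x, knots, i + 1, degree - 1))
    return left + right


def activation(x, coeffs, knots, order):
    """Learned activation: a weighted sum of the B-spline basis functions."""
    assert len(coeffs) == len(knots) - order - 1
    return sum(c * bspline_basis(x, knots, i, order) for i, c in enumerate(coeffs))


def build_lut(coeffs, knots, order, lo, hi, in_bits, out_bits, out_scale):
    """Tabulate the activation for every input code (hypothetical quantization scheme)."""
    n = 2 ** in_bits
    step = (hi - lo) / n
    qmax = 2 ** (out_bits - 1) - 1
    lut = []
    for code in range(n):
        y = activation(lo + code * step, coeffs, knots, order)
        q = round(y / out_scale)
        lut.append(max(-qmax - 1, min(qmax, q)))  # saturate to the signed output range
    return lut


G, S = 4, 3  # grid intervals and spline order, giving G + S = 7 basis functions
knots = uniform_knots(-1.0, 1.0, G, S)
coeffs = [0.0, 0.5, -0.25, 1.0, 0.75, -0.5, 0.25]  # made-up "learned" values
lut = build_lut(coeffs, knots, S, -1.0, 1.0, in_bits=6, out_bits=8, out_scale=1 / 64)
```

At inference time the activation is then a single table read, `lut[code]`. Sweeping `in_bits` and `out_bits` in a sketch like this is one way to picture the precision tradeoff raised earlier in the thread. The locality point also shows up here: at any input, at most S+1 of the basis functions are nonzero.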
mikeayles 13 hours ago
So for people wondering if it can be used to accelerate LLM inference, sadly not.
I've been trying to hit 100,000 tokens/s with a 3.28m dumb model, and even this is an order of magnitude too large to benefit.
It appears to be focussed more on latency than throughput. Happy to be corrected?
- ssivark 8 hours ago
  When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that?
  EDIT: Oh, on second read, do you mean you're running the model on an FPGA?
  - taneq 7 hours ago
    You might be conflating throughput with latency. 100k tok/s is very different to 1 tok/10us.
- ag2718 13 hours ago
  You're correct that this work is not very applicable for LLMs and that the focus here is primarily on latency.
- ai_fry_ur_brain 10 hours ago
  Was anyone thinking this?
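The throughput-versus-latency distinction raised in this subthread can be made concrete with a little arithmetic. The sketch below assumes a steady-state pipeline, where throughput equals the number of tokens in flight divided by per-token latency (Little's law). The function names are hypothetical, and the 10 ms figure is illustrative rather than a measurement from the thread.

```python
def throughput_per_s(latency_us, in_flight):
    """Steady-state tokens per second with `in_flight` tokens processed concurrently."""
    return in_flight * 1_000_000 / latency_us


def required_in_flight(target_per_s, latency_us):
    """Smallest concurrency that reaches the target rate (integer ceiling division)."""
    return -(-target_per_s * latency_us // 1_000_000)


# One token at a time at 10 us each already gives 100k tok/s.
serial = required_in_flight(100_000, 10)

# At 10 ms per token, the same rate needs 1000 tokens in flight (batching or pipelining).
batched = required_in_flight(100_000, 10_000)
```

So a throughput target says nothing by itself about per-token latency: 100k tok/s can come from very low latency with no concurrency, or from high latency hidden behind a lot of concurrency.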
RantyDave 14 hours ago
Right. But ... this would limit you to either extremely small models or extremely large FPGAs, yes? If there's a simple machine learning task that requires sub-microsecond latency I can see the point, but otherwise??
- ag2718 13 hours ago
  Yes, this work is focused on accelerating very small models, typically for real-time systems that require extremely low power or low latency.
  One primary application of this work is in high-energy physics (https://home.cern/smarter-decisions-at-the-speed-of-collisio...). Ultrafast and real-time learning is also very applicable for problems in quantum computing, plasma control, etc. (https://arxiv.org/pdf/2602.02005).
  - laughing_man 9 hours ago
    Drone target recognition?
  - poly2it 13 hours ago
    I'm not in HFT, but I assume that is also an interesting domain where this applies?
    - UltraSane 12 hours ago
      The author actually works at Jane Street.
    - ag2718 13 hours ago
      Yes, definitely: this type of work is applicable in domains where software run on general-purpose processors cannot meet latency or power requirements.
Cadwhisker 9 hours ago
If you want to experiment with KANs yourself in a non-FPGA environment, there's a GitHub repo here: https://github.com/KindXiaoming/pykan
HN comments page on that is here: https://news.ycombinator.com/item?id=40219205
tomrod 13 hours ago
Happy to hear that KANs continue to find solid footing.
Animats 14 hours ago
This guy will be hired by a high-frequency trading firm, and the next time we hear about him, he will have a net worth in 9 figures.
- throwaw12 14 hours ago
  he is already at Jane Street
  - Animats 13 hours ago
    Of course.
- ai_fry_ur_brain 10 hours ago
  Sure, if they worked for 100 years maybe.. FPGA guy at jane st probably makes 600k to low seven figures... Maybe.
  Not everyone in quant is a centi-millionaire, probably almost none of them in r&d actually.
woggy 5 hours ago
I love the name 'Kolmogorov'
- boulos 5 hours ago
  Because it's complex?
semessier 6 hours ago
and where is the Transformer library ;)
DeathArrow 5 hours ago
I know enough to understand this is interesting but sadly I don't know enough to understand how it works.
cwmoore 11 hours ago
took long enough
babelfish 13 hours ago
Archive link, as it looks like the original post was taken down: https://web.archive.org/web/20260609200156/https://aarushgup...
- ag2718 13 hours ago
  Hmm the post is still up for me?
  - dang 13 hours ago
    For us too, but we'll put the archive link in the toptext since these things seem to vary a lot by region.
    p.s. Thanks for posting this and welcome to HN!