> Barely amortising at the bottom. At small batch each new token added to the batch tends to activate fresh experts
Whether this is true depends on what you mean by small. In general, AIUI you don't need more than a handful of tokens in the batch to get a meaningful probability of overlap. DeepSeek V4 Pro is an exceptionally sparse model, and even there you start to get meaningful overlap at a batch size of 5 or more. Moreover, if you treat routing as roughly uniform and independent across tokens, the average number of activated experts for a batch of size b is n(1 - (1 - k/n)^b), where k is the number of active experts and n the total. For DeepSeek V4, k=6 and n is 256 for Flash, 384 for Pro. (The sampling is repeated per layer, not just per token.)
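For anyone who wants to plug numbers into that formula, here is a minimal sketch. It assumes each token independently picks k distinct experts uniformly at random in a given layer, which real routers only approximate; the function name `expected_active_experts` is made up for illustration, and the k and n values are the ones quoted above.

```python
def expected_active_experts(n: int, k: int, b: int) -> float:
    """Expected number of distinct experts touched in one layer by b tokens.

    Assumption: every token picks k distinct experts uniformly at random,
    independently of the other tokens. A given expert is then missed by one
    token with probability 1 - k/n, and by all b tokens with (1 - k/n)**b.
    """
    return n * (1.0 - (1.0 - k / n) ** b)


if __name__ == "__main__":
    k = 6
    for name, n in (("Flash", 256), ("Pro", 384)):
        for b in (1, 5, 16, 64):
            active = expected_active_experts(n, k, b)
            no_overlap = min(n, k * b)  # what you'd load if no two tokens shared an expert
            print(f"{name}: b={b:3d} expected={active:7.2f} no-overlap bound={no_overlap}")
```

The gap between the two printed columns is the expected overlap per layer under the uniform-routing assumption; a shared expert or a skewed router would only widen it.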
https://fergusfinn.com/blog/economics-of-speculative-decodin...
good point tho - plus for DeepSeek the shared expert increases the overlap slightly
The article includes that formula too and takes the overlap into account in its calculations.
True, but OP says there is a meaningful "knee" at b=n/k (about 43 for DeepSeek V4 Flash), and I'm not sure that's all that relevant. If anything, it might be more meaningful to highlight the point where on average half the experts are covered, which is coincidentally around 43 for Pro and 30 for Flash, since that ought to be approximately where the variance around that expectation is maximized.
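A quick way to check both numbers is to solve n(1 - (1 - k/n)^b) = n/2 for b, which gives b = ln 2 / -ln(1 - k/n). The sketch below does that under the same uniform, independent routing assumption; `knee_batch` and `half_coverage_batch` are names invented here, not anything from the article.

```python
import math


def knee_batch(n: int, k: int) -> float:
    """The b = n/k point: the batch size at which k*b equals n."""
    return n / k


def half_coverage_batch(n: int, k: int) -> float:
    """Batch size at which half the experts are covered in expectation.

    Solves (1 - k/n)**b = 1/2 for b, assuming uniform independent routing.
    log1p(-x) computes ln(1 - x) accurately for small x.
    """
    return math.log(2.0) / -math.log1p(-k / n)


if __name__ == "__main__":
    k = 6
    for name, n in (("Flash", 256), ("Pro", 384)):
        print(f"{name}: n/k={knee_batch(n, k):.1f} "
              f"half-coverage b={half_coverage_batch(n, k):.1f}")
```

Under these assumptions the half-coverage point lands at roughly 29 for Flash and roughly 44 for Pro, in line with the rounded figures above, while n/k is about 43 for Flash and 64 for Pro.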
I wonder if new models will be trained with speculative decoding as a core feature allowing fewer experts to be needed for a pass.
The way I read OP, it's ultimately highlighting the expense of verifying a possibly-wrong speculated token (in fact, the first wrong token invalidates all subsequent tokens too). That cost also applies to things like MTP that are a core feature of the model. You can decide you just don't care about matching the accuracy of the original model and skip the verification part altogether, but then you're moving closer to something like a text-diffusion model, with very different tradeoffs involved.
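To make the "first wrong token invalidates everything after it" point concrete, here is a toy sketch of the acceptance step. It is a simplification that assumes greedy decoding with exact-match verification; production systems typically use a probabilistic accept/reject rule instead. `accept_speculated` is a hypothetical helper, and the token ids are made up.

```python
from typing import List


def accept_speculated(draft: List[int], target: List[int]) -> List[int]:
    """Return the tokens actually emitted after one verification pass.

    draft:  tokens proposed by the drafter (or MTP heads).
    target: tokens the full model would emit at each of those positions,
            plus one extra token for the position after the last draft token.
    Assumption: greedy decoding, so verification is an exact-match check.
    """
    emitted: List[int] = []
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            # First mismatch: keep the verifier's token and discard every
            # later draft token, since they were conditioned on a wrong prefix.
            emitted.append(verified)
            return emitted
        emitted.append(proposed)
    # Every draft token matched, so the extra verifier token comes for free.
    emitted.extend(target[len(draft):len(draft) + 1])
    return emitted


if __name__ == "__main__":
    # Third draft token is wrong: tokens 4 and 5 of the pass are wasted work.
    print(accept_speculated([11, 12, 13, 14], [11, 12, 99, 14, 15]))
    # All draft tokens correct: four tokens verified plus one bonus token.
    print(accept_speculated([11, 12, 13, 14], [11, 12, 13, 14, 15]))
```

The verifier still ran over all draft positions in both cases, which is the expense being discussed: the rejected positions paid for expert activations without producing any output.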