For anyone who is relatively new to the field and interested in trying this: Benchmark performance with and without it.
Optical compression is one of those things that pops up across different subfields of ML at different times, and it's always an interesting research direction, but its applied utility is very uneven. You absolutely can reduce the number of input tokens, but often at a real cost in output quality.
In Gemini at least, if you look at how they process PDFs, they do an OCR and then feed the text + image to the model, without charging you for the text tokens (I believe).
My guess is that Claude’s backend is doing the same, so this hack is probably more of a loophole in token accounting, one that might get closed if Claude is doing what Gemini does.
I tried the same thing last year with OpenAI models. Back then it did reduce prompt tokens, but you needed way more completion tokens, so it was ultimately more expensive (and slower): https://pagewatch.ai/blog/post/llm-text-as-image-tokens/
Ahhh my eyes the vibe coded readme
What, you don't like your caveats to be honest?
Related: https://blog.can.ac/2026/06/10/snapcompact/
This seems like a pricing hack that burns resources. When the loophole gets closed, won't the price of OCR have to rise?
It’s not a loophole, it just happens that encoding information as optical tokens is much more efficient than text.
Step back and think about it another way - ask which scenario is more likely:
Some random person discovered a 60% across-the-board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years. That trick being to rasterize 8-bit characters into 8x8 pixels in a big image. 60% in a market worth trillions of dollars.
or
Anthropic's marketing team arbitrarily prices tokens to drive growth, according to vibes and feelings, and didn't think they needed to price images on par with text in their rush to burn cash & drive growth. Some folks take advantage of the trick during the first few days of the model's availability, before Anthropic corrects their pricing to align more proportionally with actual compute costs.
> Some random person discovered a 60% across the board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years of multi-trillion dollar growth
DeepSeek published a pretty well circulated paper on exactly this many months ago. It just hasn’t been attempted and shared publicly, AFAIK.
Also, it’s no free lunch, the readme indicates that this “use images” hack is lossy and reduces success rates alongside the reduced cost.
Truly a picture is worth a thousand words.
Of course it isn't
A text encoding uses 8 bits per character on average, and tokenization compresses that further.
An image font would be 25 bits per character at 5x5 pixels (1 bit per pixel), and most fonts are 12 pixels high.
Of course it isn't efficient, this is a pricing inefficiency and a hack to exploit it (even the author describes it as an exploit)
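To put rough numbers on the raw-bits argument above, here is a minimal sketch. It assumes 1 bit per pixel and a 12x12 glyph cell for the "12 pixels high" case (the width is my assumption), and the function names are made up for illustration. It only counts raw bits, not what a model's tokenizer or vision encoder actually does with them.

```python
def text_bits(n_chars: int, bits_per_char: int = 8) -> int:
    """Raw size of plain text, assuming a fixed number of bits per character."""
    return n_chars * bits_per_char


def glyph_bits(n_chars: int, width: int, height: int, bits_per_pixel: int = 1) -> int:
    """Raw size of the same text rasterized into fixed-size glyph cells."""
    return n_chars * width * height * bits_per_pixel


n = 1000
raw_text = text_bits(n)               # 8 bits per character
tiny_font = glyph_bits(n, 5, 5)       # 5x5 monochrome glyphs
normal_font = glyph_bits(n, 12, 12)   # assumed 12x12 cell for a 12-pixel-high font

print(raw_text, tiny_font, normal_font)
print(normal_font / raw_text)         # ratio of rasterized bits to text bits
```

On raw bits alone the image is strictly bigger, which is the point being made here. Whether that says anything about token counts is exactly what the replies dispute.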
huh, what if the image encoding uses the 8 bits each of the R, G, B values of a pixel? Then one could encode the same amount of text in far fewer pixels (3 letters would need 1 pixel instead of three 12x12 glyphs).
The top line can be the OCR-able instruction on how to decode the rest of the image, and the rest of the image would be a random-looking colourful palette. It might not even need 8 bits per character, since ASCII is 7 bits/character.
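The packing idea above could be sketched like this, as a thought experiment only. `pack_rgb` and `unpack_rgb` are hypothetical helpers, the text is assumed to be pure ASCII, and nothing in this thread shows that a vision model can actually decode pixels packed this way.

```python
def pack_rgb(text: str) -> list[tuple[int, ...]]:
    """Pack ASCII text into RGB triples, one character per colour channel."""
    data = text.encode("ascii")
    # Pad with NUL bytes so the length is a multiple of 3 (one pixel = 3 bytes).
    padded = data + b"\x00" * (-len(data) % 3)
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]


def unpack_rgb(pixels: list[tuple[int, ...]]) -> str:
    """Reverse pack_rgb: flatten the channels and strip the NUL padding."""
    data = bytes(value for pixel in pixels for value in pixel)
    return data.rstrip(b"\x00").decode("ascii")


pixels = pack_rgb("hello")
print(pixels)              # two pixels, the second padded with one NUL byte
print(unpack_rgb(pixels))  # round-trips back to the original text
```

The round trip is exact only as long as every channel value survives untouched, so any resizing or lossy compression of the image would corrupt the text.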
You are wrong.
Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.
DeepSeek-OCR proved 10x+ compression from visual embedding of text, which was a groundbreaking result. [1]
Very cool to see OP's project hacking on this principle. It's still not lossless, as noted in the github, but is a promising research direction.
[1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...
A token is probably not a single char, and an image is probably decomposed into tokens as well (and god knows how many tokens an image is decomposed into) which probably map to similar float-hungry vectors. Your counterargument could use a bit more flesh.
And we're talking about images of texts, not images that represent complex imagery such as a very detailed scene or what have you.
Not really. They aren't actually using more resources this way either. This might be a fundamental inefficiency that's being removed.
It kinda makes sense too. While people can read code word by word, we often "glance over" it and do rough pattern recognition to know what it does, only homing in on something when we need to answer a specific question. I think humans kinda naturally do this exploit anyway.
seems really dumb and like it would need to violate basic information theory to work?
input tokens are cheaper than output tokens. seems like it would maybe reduce input tokens at the expense of many more output tokens if you're actually triggering OCR via thinking?
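The input-versus-output trade-off above is easy to check with arithmetic. A minimal sketch, assuming made-up prices of 3 and 15 per million input and output tokens and made-up token counts (none of these are real figures for any provider):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost of one request, with prices given per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def break_even_extra_output(saved_input_tokens: int,
                            in_price: float, out_price: float) -> float:
    """Extra output tokens that exactly cancel out the input-token savings."""
    return saved_input_tokens * in_price / out_price


IN_PRICE, OUT_PRICE = 3.0, 15.0   # hypothetical prices per million tokens

text_cost = request_cost(10_000, 500, IN_PRICE, OUT_PRICE)
# Assume the image version cuts input tokens by 60% (10,000 -> 4,000).
extra = break_even_extra_output(6_000, IN_PRICE, OUT_PRICE)
image_cost = request_cost(4_000, 500 + int(extra), IN_PRICE, OUT_PRICE)

print(extra)                  # extra output tokens at break-even
print(text_cost, image_cost)  # equal at the break-even point
```

With a 5x price gap between output and input, every extra output token eats the savings from five input tokens, so the trick only pays off if the image version does not trigger much additional thinking or transcription.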
there's also a DeepSeek whitepaper on this technique https://www.seangoedecke.com/text-tokens-as-image-tokens
That is hilarious and an amazing find.
I want to see more text-free foundation models