Interesting, the pacing seemed very slow when conversing in English, but when I spoke to it in Spanish, it sounded much faster. It's really impressive that these models are going to be able to do real-time translation and much more.
The Chinese are going to end up owning the AI market if the American labs don't start competing on open weights. Americans may end up in a situation where they have some $1000-2000 device at home with an open Chinese model running on it, if they care about privacy or owning their data. What a turn of events!
sitting here in the US, reading that China is strongly urging the adoption of Linux and pushing for open CPU architectures like RISC-V and also self-hosted open models
are we the baddies??
If there is a walled garden, and you aren't in it, you'll probably push for the walls to come down. No moral basis needed.
Could you elaborate on what you mean by "moral basis" in your comment?
It is in their selfish interest to push for open weights.
That's not to say they are being selfish, or to judge in any way the morality of their actions. But because of that incentive, you can't logically infer moral agency in their decision to release open-weights, IP-free CPUs, etc.
By selfish interests you mean the public good?
Leaving China aside, it's arguably immoral that our leading AI models are closed and concentrated in the hands of billionaires with questionable ethical histories (at best).
I mean China's push for open weights/source/architecture probably has more to do with them wanting legal access to markets than it does with those things being morally superior.
Of course, but that translates into a benefit for most people, even for Americans. In my case (European), I cannot but support the Chinese companies in this respect, as we would be especially in trouble if the closed commercial models were the norm.
If by being selfish they end up doing the morally superior thing, then I much prefer to go with the Chinese.
Even more so now that Trump is in command.
Depends on whether you want to be in it. A ladder might be enough to peek over the top and rip it off. Do it better. Which seems to be what is happening.
That only works for China's domestic market. As long as the IP they are "taking inspiration from" is protected in the target markets, they effectively lock themselves out by doing that.
In the case of technology like RISC-V, pretty much all the value-add is unprotected, so they can sell those products in the US/EU without issue.
silent majority of HN
there have been countless times where the US has been the baddie and you guys are blind to it all.
open your eyes.
Winner takes all monopoly big tech capitalism is bad. Literally textbook bad.
I know right!
This is exactly what I do. I have two 3090s at home, with Qwen3 on it. This is tied into my Home Assistant install, and I use esp32 devices as voice satellites. It works shockingly well.
I run Home Assistant on an RPi4 and have an ESP32-based Core2 with mic (https://shop.m5stack.com/products/m5stack-core2-esp32-iot-de...), along with a 16GB 4070 Ti Super in an always-on Windows system I only use for occasional gaming and serving media. I'd love to set up something like you have. Can you recommend a starting place, or ideally, a step-by-step tutorial?
I've never set up any AI system. Would you say setting up such a self-hosted AI is at a point now where an AI novice can get an AI system installed and integrated with an existing Home Assistant install in a couple hours?
I mean - the AI itself will help you get all that set up.
Claude code is your friend.
I run Proxmox on an old Dell R710 in my closet that hosts my Home Assistant VM (amongst others), and I've set up my "gaming" PC (which hasn't done any gaming in quite some time) to dual boot (Windows or Deb/Proxmox) and just keep it booted into Deb as another Proxmox node. That PC also has a 4070 Super set up for passthrough to a VM, and on that VM I've got various services utilizing the GPU. These include some that are used by my Hetzner bare metal servers for things like image/text embeddings, local LLM use (though rather minimal due to VRAM constraints), and some image/video object detection with my security cameras (slowly working on a remote water gun turret to keep the raccoons from trying to eat the kittens that stray cats keep having in my driveway/workshop).
Install Claude Code (or opencode, it's also good), use Opus (get the Max plan), give it a directory that it can use as its working directory (don't open it in ~/Documents and just start doing things), and prompt it with something as simple as this:
"I have an existing home assistant setup at home and I'd like to determine what sort of self-hosted AI I could setup and integrate with that home assistant install - can you help me get started? Please also maintain some notes in .md files in this working directory with those note files named and organized as you see appropriate so that we can share relevant context and information with future sessions. (example: Hardware information, local urls, network layout, etc) If you're unsure of something, ask me questions. Do not perform any destructive actions without first confirming with me."
Plan mode. _ALWAYS_ use plan mode to get the task set up. If there's something about the plan you don't like, say no and give it notes - it will return with a new plan. Eventually agree to the plan when it's right, then work through that plan outside of plan mode; if it gets off the plan, get back into plan mode to get the/a plan set, and then again let it go and just steer it in regular mode.
> I mean - the AI itself will help you get all that setup.
Or, ask somebody who already has it set up working.
That way you can get certain results, without guessing around why it works for them and not for you.
(I, too, am interested in the grandparent poster's setup.)
>use opus (get the max plan)
I don't have the Max plan, but on the Pro plan I tried for a month, I was able to blow through my 5-hour limit with a single prompt (with a 70k-context codebase attached). The idea of paying so much money to get a few questions per "workday" seems insane to me.
Sonnet blows through the limit much slower, and is often great tbh
That's great to hear. I was mostly impressed with Qwen3 coder on my 4090, but am hobbled by the small memory footprint of the single card. What motherboard are you using with your 3090s? Like the others, I too am curious about those esp32s and what software you run on them.
Keep up the good hacking - it's been fun to play with this stuff!
I actually am not using the 3090s as one unit. I have Qwen3-30B-A3B as my primary model and it fits on a single GPU, then I have all the TTS/STT on the other GPU.
Ooo interesting, I'd love to hear more about the esp32's as voice satellites!
For the physical hardware I use the esp32-s3-box[1]. The esphome[2] suite has firmware you can flash to make the device work with HomeAssistant automatically. I have an esphome profile[3] I use, but I'm considering switching to this[4] profile instead.
For the actual AI, I basically set up three docker containers: one for speech to text[5], one for text to speech[6], and then ollama[7] for the actual AI. After that it's just a matter of pointing HomeAssistant at the various services, as it has built-in support for all of these things. (There's a quick sanity-check sketch after the links below.)
1. https://www.adafruit.com/product/5835
2. https://esphome.io/
3. https://gist.github.com/tedivm/2217cead94cb41edb2b50792a8bea...
4. https://github.com/BigBobbas/ESP32-S3-Box3-Custom-ESPHome/
5. https://github.com/rhasspy/wyoming-faster-whisper
6. https://github.com/rhasspy/wyoming-piper
7. https://ollama.com/
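If you want to sanity-check the ollama container [7] before wiring HomeAssistant to it, a minimal Python sketch like the one below works; the model tag is an assumption, so substitute whatever `ollama list` shows on your box.

```python
import requests

# Minimal sketch: send one chat turn to a local ollama instance.
# Assumes ollama is listening on its default port (11434) and that a
# Qwen3 model has already been pulled (adjust the tag as needed).
OLLAMA_URL = "http://localhost:11434/api/chat"

payload = {
    "model": "qwen3:30b-a3b",  # hypothetical tag; use whatever `ollama list` shows
    "messages": [
        {"role": "user", "content": "Turn on the living room lights."}
    ],
    "stream": False,           # single JSON response instead of a stream
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

As far as I understand, HomeAssistant's Ollama integration is making essentially the same kind of HTTP call under the hood, just with your exposed entities injected into the prompt.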
> 1. https://www.adafruit.com/product/5835
The nails in the video made me laugh
I assume it's very similar to what Home Assistant's backing commercial entity Nabu Casa sells with the "Home Assistant Voice PE" device, which is also esp32-based. The code is open and uses the esphome framework so it's fairly easy to recreate on custom HW you have laying around.
Seems like an interesting setup, do you have it documented anywhere? Thinking of building one!
Can you tell me about these voice satellites?
He is referring to the M5 Atoms, I believe. I strongly recommend the ESP32 S3 Box now; you can fire up BigBobbas' custom firmware for it (search on GitHub), and it's a blast with Home Assistant.
I'm actually using the esp32 s3 boxes myself!
omg, this is something I've had in mind for quite some time, I even bought some i2s devices to test it out. Do you have some pointers on how to do it?
Do you also add custom tools to turn on/off the lights?
When has the average American ever been willing to spend a $1,000-2,000 premium for privacy-respecting tech? They already save $20-200 to buy IoT cameras which provide all audio and video from inside their home directly to the government without a warrant (Ring vs Reolink/etc).
To be fair, it isn't $1000-2000 extra, it's the new laptop/pc you just bought that is powerful enough (now, or in the near future) to run these open weight models.
Ease of use is a major issue.
What percentage of the people that you know are able to install python and dependencies plus the correct open weights models?
I'd wager most of your parents can't do it.
Most "normies" wouldn't even know what a local model even is, let alone how to install a GPU.
Wiredpancake got flagged to death but they’re right. MacWhisper provides a great example of good value for dead-simple user-friendly on-device processing.
installing LM Studio is easy and it walks you through choosing a model as well. It is actually well within many people's abilities.
That sounds like all software. We can make better software.
You mean like a home with a yard large enough to keep the neighbors out of sight?
Granted, based on how annoyingly chill we are with advertisements and government surveillance, I suppose this desire for privacy never extended beyond the neighbors.
There is some irony about buying Chinese hardware to run American software on it for the past decade(s), and now the exact reverse.
Is that irony? Regardless, it's hilarious. All of the upvotes to you!
> Americans may end up in a situation where they have some $1000-2000 device at home with an open Chinese model running on it
Wouldn't worry about that, I'm pretty sure the government is going to ban running Chinese tech in this space sooner or later. And we won't even be able to download it.
Not saying any of the bans will make any kind of sense, but I'm pretty sure they're gonna say this is a "strategic" space. And everything else will follow from there.
Download Chinese models while you can.
When DeepSeek first hit the news, an American senator proposed adding it to ITAR so they could send people to prison for using it. Didn't pass, thankfully.
If it does in the future, do we just hope it won’t be retroactive? Is this water boiling yet?
ex post facto law is explicitly banned in the US Constitution
For criminal concerns regarding retroactive ITAR additions, yes. However, significant civil financial penalties, if Congress so wished, could still be constitutional, as the ex post facto clause has been held to apply exclusively to criminal matters since Calder v. Bull [1].
[1] https://www.oyez.org/cases/1789-1850/3us386
Dogs can't play basketball, either, but we've sure been getting dunked on a lot lately.
History is littered with unconstitutional, enforced laws, as well. Watched a lot of Ken Burns docs this weekend while sick. “The West” has quite a few examples.
> explicitly banned in the US Constitution
There are a lot of things in the US Constitution. But the Supreme Court is the final arbiter, and they're moving closer and closer to "whatever you say, big daddy."
government hardly has the capacity to ban foreign weights
The danger is that lawmakers, confused about the difference between foreign weights and foreign APIs, accidentally ban both.
There will be no confusion whatsoever, and no accidents. They don't write those bills themselves.
Whatever a given bill does is precisely what its authors, who are almost never elected by any constituency on Earth, intend for it to do.
Eh this is the internet. There's always a way. They couldn't ban piracy either.
Correction: they couldn't enforce a ban
True sorry, good point.
The US is probably ahead but they're so obsessed with moats, IP and safety that their lagginess is self imposed.
China has nothing to lose and everything to gain by releasing stuff openly.
Once China figures out how to make high-performance FPGA chips really cheap, it's game over for the US. The only power the US has is over GPU supply... and even then it's pretty weak.
Not to mention NVIDIA crippling its own country with low-VRAM cards. China is taking older cards, stripping the RAM, and using it to upgrade other older cards.
> Americans may end up in a situation where they have some $1000-2000 device at home with an open Chinese model running on it, if they care about privacy or owning their data.
I think HN vastly overestimates the market for something like this. Yes, there are some people who would spend $2,000 to avoid having prompts go to any cloud service.
However, most people don’t care. Paying $20 per month for a ChatGPT subscription is a bargain and they automatically get access to new versions as they come.
I think the at-home self hosting hobby is interesting, but it’s never going to be a mainstream thing.
There is going to be a big market for private AI appliances, in my estimation at least.
Case in point: I give Gmail OAuth access to nobody. I nearly got burned once and I really don’t want my entire domain nuked. But I want to be able to have an LLM do things only LLMs can do with my email.
“Find all emails with ‘autopay’ in the subject from my utility company for the past 12 months, then compare it to the prior year’s data.” GPT-OSS-20b tried its best but got the math obviously wrong. Qwen happily made the tool calls and spat out an accurate report, and even offered to make a CSV for me.
Surely if you can’t trust npm packages or MS to not hand out god tokens to any who asks nicely, you shouldn’t trust a random MCP server with your credentials or your model. So I had Kilocode build my own. For that use case, local models just don’t quite cut it. I loaded $10 into OpenRouter, told it what I wanted, and selected GPT5 because it’s half off this week. 45 minutes, $0.78, and a few manual interventions later I had a working Gmail MCP that is my very own. It gave me some great instructions on how to configure an OAuth app in GCP, and I was able to get it running queries within minutes from my local models.
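To give a sense of what that homemade Gmail MCP actually does under the hood, here's a rough sketch of the kind of read-only Gmail API call it makes; the token path, scope, and query string are illustrative, not the exact code it generated.

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Rough sketch of the kind of read-only Gmail query such a tool runs.
# Assumes the OAuth flow is already done and a token with the
# gmail.readonly scope has been saved (token path is illustrative).
creds = Credentials.from_authorized_user_file(
    "token.json", scopes=["https://www.googleapis.com/auth/gmail.readonly"]
)
gmail = build("gmail", "v1", credentials=creds)

# Gmail search syntax works here just like in the web UI.
query = "subject:autopay newer_than:1y"
resp = gmail.users().messages().list(userId="me", q=query, maxResults=100).execute()

for msg in resp.get("messages", []):
    detail = gmail.users().messages().get(
        userId="me", id=msg["id"], format="metadata",
        metadataHeaders=["Subject", "From", "Date"],
    ).execute()
    headers = {h["name"]: h["value"] for h in detail["payload"]["headers"]}
    print(headers.get("Date"), "-", headers.get("Subject"))
```

Keeping the scope read-only is most of the safety story; the local model only ever sees message metadata and bodies, never a token that can send or delete.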
There is a consumer play for a ~$2499-$5000 box that can run your personal staff of agents on the horizon. We need about one more generation of models and another generation of low-mid inference hardware to make it commercially feasible to turn a profit. It would need to pay for itself easily in the lives of its adopters. Then the mass market could open up. A more obvious path goes through SMBs who care about control and data sovereignty.
If you're curious, my power bill is up YoY, but there was a rate hike, definitely not my 4090 ;)
Totally agree on the consumer and SMB play (which is why we're stealthily working on it :). I'm curious what capabilities the next generation of models (and HW) will provide that doesn't exist now. Considering Ryzen 395 / Digits / etc. can achieve 40-50+ T/s on capable mid-size models (e.g., OSS120B/Qwen-Next/GLM Air) with some headroom for STT and a lean TTS, I think now is the time to enter, but it seems to me the 2 key things that are lacking are 1) reliable low-latency multi-modal streaming voice frameworks for STT+TTS and 2) reliable, fast, and secure UI computer use (without relying on optional accessibility tags/meta).
My greatest concern for local AI solutions like this is the centrality of email and the obvious security concerns surrounding email auth.
How would using OAuth through Google nuke your domain?
Depends on the setup, but programmatic access to a Gmail account that's used for admin purposes would allow for hijacking via key/password exfiltration of anything in the mailbox, sending unattended approvals, and autonomous conversations with third parties that aren't on the lookout for impersonation. In the average case, the address book would probably get scraped and the account would be used to blast spam to the rest of the internet.
Moving further, if the OAuth Token confers access to the rest of a user's Google suite, any information in Drive can be compromised. If the token has broader access to a Google Workspace account, there's room for inspecting, modifying, and destroying important information belonging to multiple users. If it's got admin privileges, a third party can start making changes to the org's configuration at large, sending spam from the domain to tank its reputation while earning a quick buck, or engage in phishing on internal users.
The next step would be racking up bills in Google's Cloud, but that's hopefully locked behind a different token. All the same, a bit of lateral movement goes a long way ;)
I agree the market is niche atm, but I can't help but disagree with your outlook long term. Self hosted models don't have the problems ChatGPT subscribers are facing with models seemingly performing worse over time, they don't need to worry about usage quotas, they don't need to worry about getting locked out of their services, etc.
All of these things have a dark side, though; but it's likely unnecessary for me to elaborate on that.
The reason people will pay $2,000 for a private at home AI is porn.
Given that $2000 might only buy you about 10 date nights with dinner and drinks, the value proposition might actually be pretty good if posterity is not a feature requirement.
The sales case for having LLMs at the edge is to run inference everywhere on everything. Video games won't go to the cloud for every AI call, but they will use on-device models that will run on the next iteration of hardware.
Is there an AI market for open weights? For companies like Alibaba, Tencent, Meta, or Microsoft it makes a lot of sense: they can build on open weights without losing value, which is potentially beneficial for share prices. The only winners are application and cloud providers; I don't see how they can make money from the weights themselves, to be honest.
I don't know if there is a market for it, but I know that open weights pressure the closed-model companies into releasing their weights and losing their privileged positions.
The only money to be made is in compute, not the open weights themselves. What's the point of a market when there's a commons like Hugging Face or ModelScope? Alibaba made ModelScope to compete with HF, and that's a commons, not a market either, if that tells you anything.
By analogy, you can legally charge for copies of your custom Linux distribution, but what's the point when all the others are free?
It promotes an open research environment where external researchers have the opportunity to learn, improve, and build. And it keeps the big companies in check: they can't become monopolies or duopolies and increase API prices (as is usually the playbook) if you can get the same-quality responses from a smaller provider on OpenRouter.
>Interesting, the pacing seemed very slow when conversing in english, but when I spoke to it in spanish, it sounded much faster
So did you run the model offline on your own computer and get realtime audio?
Can you tell me the GPU or specifications you used?
I inquired with ChatGPT:
https://chatgpt.com/share/68d23c2c-2928-800b-bdde-040d8cb40b...
It seems it needs around a $2,500 GPU, do you have one?
I tried Qwen online via its website interface a few months ago, and found it to be very good.
I've run some offline models including Deepseek-R1 70B on CPU (pretty slow, my server has 128 GB of RAM but no GPU) and I'm looking into what kind of setup I would need to run an offline model on GPU myself.
> So did you run the model offline on your own computer and get realtime audio?
At the top of the README of the GitHub repository, there are a few links to demos where you can try the model.
> It seems it needs around a $2,500 GPU
You can get a used RTX 3090 for about $700, which has the same amount of VRAM as the RTX 4090 in your ChatGPT response.
But as far as I can tell, quantized inference implementations for this model do not exist yet.
You can try it out on https://chat.qwen.ai/ - sign in with Google or GitHub (signed out users can't use the voice mode) and then click on the voice icon.
It has an entertaining selection of different voices, including:
*Dylan* - A teenager who grew up in Beijing's hutongs
*Peter* - Tianjin crosstalk, professionally supporting others
*Cherry* - A sunny, positive, friendly, and natural young lady
*Ethan* - A sunny, warm, energetic, and vigorous boy
*Eric* - A Sichuan Chengdu man who stands out from the crowd
*Jada* - The fiery older sister from Shanghai
Many of these voices are especially hilarious when you switch the language.
In Russian, Ryan sounds like a westerner who started reading Russian words a month ago.
Dylan sounds somewhat authentic, while everyone else has a different degree of heavily Asian-accented Russian.
The voices are really fun, thanks for the laughs :)
I only see Omni Flash, is that the one?
same, did you figure it out?
I think so, you need to click the big jagged audio icon to start a voice session.
Is the Qwen3-Omni-Flash the same as Qwen3-Omni-30B-A3B, or is the Omni-Flash a different closed-source model?
My question too
The model weights are 70GB (Hugging Face recently added a file size indicator - see https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct/tree... ) so this one is reasonably accessible to run locally.
I wonder if we'll see a macOS port soon - currently it very much needs an NVIDIA GPU as far as I can tell.
That's at BF16, so it should fit fairly well on 24GB GPUs after quantization to Q4, I'd think. (Much like the other 30B-A3B models in the family.)
I'm pretty happy about that - I was worried it'd be another 200B+.
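Back-of-the-envelope, for anyone wondering where those numbers come from; this is a rough sketch that only counts the weights (KV cache, activations, and the audio/vision towers add more on top):

```python
# Rough weight-memory estimate for a ~30B-parameter model at different precisions.
# This ignores KV cache, activations, and the extra towers, so treat the
# numbers as a floor rather than an exact requirement.
params = 30e9

for name, bits in [("BF16", 16), ("FP8", 8), ("Q4 (~4.5 bpw)", 4.5)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name:>14}: ~{gib:.0f} GiB just for the weights")

# Prints roughly: BF16 ~56 GiB, FP8 ~28 GiB, Q4 ~16 GiB - which is why the
# BF16 checkpoint is ~70 GB on disk (weights plus the other components)
# and a Q4 quant lands in the neighbourhood of a 24 GB card.
```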
So like, 1x32GB is all you need for quite a while? Scrolling through the Web makes me feel like I'm out unless I have minimum 128GB of VRAM.
are there any that would run on 16GB Apple M1?
Not quite. The smallest Qwen3 A3B quants are ~12gb and use more like ~14gb depending on your context settings. You'll thrash the SSD pretty hard swapping it on a 16gb machine.
A fun project for somebody who has more time than I do would be to see if they can get it working with the new Mojo stuff from yesterday for Apple. I don't know if the functionality would be fully baked enough yet to actually do the port successfully, but it would be an interesting try.
New Mojo stuff from Apple?
Nvm found it https://news.ycombinator.com/item?id=45326388
Would it run on 5090? Or is it possible to link multiple GPUs or has NVIDIA locked it down?
It'd run on a 5090 with 32GB of VRAM at fp8 quantization, which is generally a very acceptable size/quality trade-off. (I run GLM-4.5-Air at 3-bit quantization!) The transformer architecture also lends itself quite well to having different layers of the model running in different places, so you can 'shard' the model across different compute nodes.
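On the 'shard it across compute nodes' point: within a single multi-GPU box, Hugging Face transformers plus accelerate will already split layers across devices via device_map. Below is a minimal sketch of that general mechanism, using the generic causal-LM auto class and a made-up max_memory split; the Omni model itself may need the specific classes listed on its model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# General sketch of splitting a large model's layers across two GPUs.
# device_map="auto" lets accelerate place layers wherever they fit, and
# max_memory caps how much each device is allowed to take (values assumed).
model_id = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # check the model card for the exact classes it needs

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "64GiB"},
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(model.hf_device_map)  # shows which layer landed on which device
```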
is there an inference engine for this on macos?
Not yet as far as I can tell - might take a while for someone to pull that together given the complexity involved in handling audio and image and text and video at once.
Here is the demo video on it. The video w/ sound input -> sound output while doing translation from the video to another language was the most impressive display I've seen yet.
https://www.youtube.com/watch?v=_zdOrPju4_g
Speech input + speech output is a big deal. In theory you can talk to it using voice, and it can respond in your language, or translate for someone else, without intermediary technologies. Right now you need wakeword, speech to text, and then text to speech, in addition to your core LLM. A couple can input speech, or output speech, but not both. It looks like they have at least 3 variants in the ~32b range.
Depending on the architecture this is something you could feasibly have in your house in a couple of years or in an expensive "ai toaster"
The opportunities of plugging this into your home automation through tool calls is huge.
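To make the 'intermediary technologies' point concrete, today's stack is roughly the chained pipeline sketched below; every stage is a separate model and a separate hop, which is exactly what a speech-in/speech-out model collapses into one call. The stage functions are hypothetical placeholders, not a real library.

```python
# Illustrative pseudo-pipeline for today's voice assistants: each stage is a
# separate component, so latency and errors accumulate at every hop.
# All four stage functions are hypothetical placeholders.

def wait_for_wakeword(mic_stream):        # e.g. an openWakeWord-style detector
    ...

def speech_to_text(audio_chunk) -> str:   # e.g. a Whisper-class STT model
    ...

def llm_reply(prompt: str) -> str:        # the core LLM (local or remote)
    ...

def text_to_speech(text: str) -> bytes:   # e.g. a Piper-class TTS voice
    ...

def handle_turn(mic_stream, speaker):
    audio = wait_for_wakeword(mic_stream)
    prompt = speech_to_text(audio)        # hop 1: audio -> text
    answer = llm_reply(prompt)            # hop 2: text -> text
    speaker.play(text_to_speech(answer))  # hop 3: text -> audio

# A speech-native model replaces hops 1-3 with a single audio-in/audio-out
# call, which is why it matters for latency, prosody, and translation.
```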
Ever since ChatGPT added this feature I've been waiting for anyone else to catch up.
There are tons of hands-free situations, like cooking, where this would be amazing ("read the next step please, my hands are covered in raw pork", "how much flour for the roux", "crap, I don't have any lemons, what can I substitute").
Seems like a big win for language learning, if nothing else. Also seems possible to run locally, especially once the unsloth guys get their hands on it.
The real point of leverage here is performance per size. Getting traction in the open-weights space kind of forces the models to innovate on efficiency. This means the open-weight models may get leverage that the closed-weight ones don't think about.
If we had some aggregated cluster-reasoning mechanism, when would 8x 30B models running on an H100 server outperform, in terms of accuracy, a single 240B model on the same server?
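One crude version of 'aggregated cluster reasoning' is self-consistency voting across several small-model servers. Here's a hedged sketch: the endpoint addresses and model tag are made up, and it assumes each host exposes an OpenAI-compatible /v1/chat/completions route (as vLLM and llama.cpp's server do).

```python
import collections
import requests

# Hedged sketch of self-consistency voting across several small-model servers.
# Endpoint addresses and model tag are assumptions.
ENDPOINTS = [f"http://10.0.0.{i}:8000/v1/chat/completions" for i in range(1, 9)]

def ask(endpoint: str, question: str) -> str:
    resp = requests.post(endpoint, json={
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,   # some diversity so the votes aren't identical
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

def vote(question: str) -> str:
    answers = [ask(ep, question) for ep in ENDPOINTS]
    return collections.Counter(answers).most_common(1)[0][0]

# Whether 8 voting 30B models beat one 240B model is an empirical question;
# voting tends to help most on exact-answer tasks like math or classification.
```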
Neat. I threw a couple simple audio clips at it and it was able to at least recognize the instrumentation (piano, drums, etc). I haven't seen a lot of multimodal LLM focus around recognizing audio outside of speech, so I'd love to see a deep dive of what the SOTA is.
The Qwen thinker/talker architecture is really fascinating and is more in line with how I imagine human multimodality works - i.e., a picture of an apple, the text "a p p l e", and the sound all map to the same concept without going to text first.
Isn’t that how all LLMs work?
The existing vision LLMs all work like this, which is most of the major models these days.
Multi-modal audio models are a lot less common. GPT-4o was meant to be able to do this natively from the start but they ended up shipping separate custom models based on it for their audio features. As far as I can tell GPT-5 doesn't have audio input/output at all - the OpenAI features for that still use GPT-4o-audio.
I don't know if Gemini 2.5 (which is multi-modal for vision and audio) shares the same embedding space for all three, but I expect it probably does.
What I mean is that all processing in an LLM occurs in state space. The next-token prediction is the very last step.
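A toy sketch of that idea; the shapes and module names are illustrative only, not Qwen's actual architecture. Each modality gets its own encoder, but they all project into the same hidden dimension, so everything downstream operates on shared states and text is just one of the output heads.

```python
import torch
import torch.nn as nn

# Toy illustration only - not Qwen's actual architecture. Each modality has
# its own encoder, but all of them project into the same d_model-sized space,
# so the transformer body never "sees" which modality a token came from.
D_MODEL = 1024

class ToyMultimodalLM(nn.Module):
    def __init__(self, vocab_size=32000, n_mel=128, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, D_MODEL)
        self.audio_proj = nn.Linear(n_mel, D_MODEL)      # mel frames -> hidden
        self.image_proj = nn.Linear(patch_dim, D_MODEL)  # ViT patches -> hidden
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(D_MODEL, vocab_size)    # text is just one output head

    def forward(self, text_ids, mel_frames, image_patches):
        states = torch.cat([
            self.text_embed(text_ids),
            self.audio_proj(mel_frames),
            self.image_proj(image_patches),
        ], dim=1)                      # one shared sequence of hidden states
        return self.lm_head(self.body(states))

logits = ToyMultimodalLM()(
    torch.randint(0, 32000, (1, 8)),   # 8 text tokens
    torch.randn(1, 20, 128),           # 20 mel frames of audio
    torch.randn(1, 16, 768),           # 16 image patches
)
print(logits.shape)  # (1, 44, 32000): every position predicts from the shared space
```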
There are many more weird and complex architectures in models for video understanding.
For example, beyond video->text->llm and video->embedding in llm, you can also have an llm controlling/guiding a separate video extractor.
See this paper for a pretty thorough overview.
Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., & Xu, C. (2025). Video Understanding with Large Language Models: A Survey (No. arXiv:2312.17432). arXiv. https://doi.org/10.48550/arXiv.2312.17432
Sure but all of these find some way of mapping inputs (any medium) to state space concepts. That's the core of the transformer architecture.
The user you originally replied to specifically mentioned > without going to text first
Yeah, and that's my understanding. Nothing goes video -> text, or audio -> text, or even text -> text without first going through state space. That's where the core of the transformer architecture is.
Any insights into what "native video support" actually means? Is it just good at interpreting consecutive full frame images taken at intervals (thus missing out on fast events) or is there something more elaborate to it?
I recently needed to scan hundreds of low quality invoices and run them through OCR for invoice numbers and dates. I really took for granted how seamless this is in some applications, and was shocked how much work went into producing decent results.
I was obviously really naive. Either way, it gets me excited any time I see progress with OCR. I should give this a try against my (small) dataset.
I don't think I understand your comment. What were your results for Qwen? Or is that what you meant for how much work was needed?
I just ran Qwen against some of the invoices that my gnarly algorithm (with OpenAI fallback) really struggled with, and Qwen was able to extract all the relevant data without any issues. I'm pretty damn impressed, to be honest.
What is your point about Qwen? Or is it just a general statement regarding LLM?
I did it the old-school way, with OpenAI as the fallback.
All I'm saying is I'm excited to try Qwen to see if it outperforms my gnarly algorithm.
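If anyone wants to reproduce that kind of test, here's roughly how I'd point an invoice at a vision-capable Qwen endpoint using the OpenAI-compatible chat format; the base_url, model name, and prompt are assumptions, and a real pipeline would want stricter JSON validation.

```python
import base64
from openai import OpenAI

# Sketch: ask a vision-capable model (local or hosted) to pull structured
# fields out of a scanned invoice. base_url and model name are assumptions -
# point them at whatever OpenAI-compatible endpoint is serving Qwen for you.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

with open("invoice_0042.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-omni-30b-a3b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract the invoice number and invoice date. "
                     "Reply as JSON: {\"invoice_number\": ..., \"invoice_date\": ...}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    temperature=0,  # extraction, not creativity
)
print(response.choices[0].message.content)
```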
I usually ask these models to tell me a short story, and most times the prose is stiff and the story reads like a mass-market, straight-to-KDP kids' book. But wow, the first shot generated something light, mildly funny, and chill. Quite a surprise.
Pasted here for your own judgement:
*Title: The Last Lightbulb*
The power had been out for three days. Rain drummed against the windows of the old cabin, and the only light came from a flickering candle on the kitchen table.
Maggie, wrapped in a wool blanket, squinted at the last working flashlight. “We’ve got one bulb left, Jack. One.”
Jack, hunched over a board game he’d dug out of the closet, didn’t look up. “Then don’t turn it on unless you’re reading Shakespeare or delivering a baby.”
She rolled her eyes. “I need it to find the can opener. I’m not eating cold beans with my fingers again.”
Jack finally glanced up, grinning. “You did that yesterday and called it ‘rustic dining.’”
“Desperate times,” she muttered, clicking the flashlight on. The beam cut through the gloom—and immediately began to dim.
“No—!” Jack lunged, but too late. The light sputtered… then died.
Silence. Then Maggie sighed. “Well. There goes civilization.”
Jack leaned back, chuckling. “Guess we’re officially cavemen now.”
“Cavewoman,” she corrected, fumbling in the dark. “And I’m going to bed. Wake me when the grid remembers we exist.”
As she shuffled off, Jack called after her, “Hey—if you find the can opener in the dark, you’re officially magic.”
A pause. Then, from down the hall: “I found socks that match. That’s basically witchcraft.”
Jack smiled into the dark. “Goodnight, witch.”
“Goodnight, caveman.”
Outside, the rain kept falling. Inside, the dark didn’t feel so heavy anymore.
— The End —
That's not bad actually.
I tried using it for planning a trip that we're taking soon. We've actually already planned the trip, but I thought it'd be fun.
It asked me follow up questions, and then took forever. I finally clicked the button to see what was happening and it had started outputting Chinese in the middle.
I gave up on it.
Does this support realtime speech to speech via API? If so, where is this hosted/documented? I wasn’t able to see any info. I’d love to use this in lieu of OAIs (expensive) real time speech to speech offering.
https://www.alibabacloud.com/help/en/model-studio/realtime?s...
Thank you. I'm looking for this as well. The realtime model is a closed-source model and it's different than the open Qwen3-Omni-30B-A3B, right?
I wonder how hard is it to turn the open-source model to be a realtime model.
Why do you say they are different models? I've been looking at this today and haven't seen anything explicitly state that.
This is just my assumption given that they listed a lot of different models here: https://modelstudio.console.alibabacloud.com/?spm=a3c0i.2876...
This is an older link, but they listed two different sections here, commercial and open source models: https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
For the realtime multimodal, I'm not seeing the open source models tab: https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
Next steps for AI in general:
- Additional modalities
- Faster FPS (inferences per second)
- Reaction time tuning (latency vs quality tradeoff) for visual and audio inputs/outputs
- Built-in planning modules in the architecture (think premotor frontal lobe)
- Time awareness during inference (towards an always-inferring / always-learning architecture)
I'm running Q3-Next on my MBP and seeing ~ GPT4.1 performance from it.
Impressive what these local models are now capable of.
Has anyone figured out how to ask a question with text and have it speak the answer in the app? I can generate text or talk, but not jump between the two.
I was led to believe that it was possible by the first image here: https://qwen.ai/blog?id=1f04779964b26eacd0025e68698258faacc7... which shows a voice output (top left) next to the written-out detail of the thinking mode.
https://x.com/whowillrickwill/status/1920723985311903767
All, what's the best model right now to bring a photo to life (create a short video from a photo etc) ?
Open source? Wan 2.2 i2v
It will be interesting to see how its pricing for the audio modality compares to Gemini 2.0 Flash once many providers offer it.
Even though Gemini 2.0 Flash is quite old, I still like it. Very cheap (each second of audio is just 32 tokens), supports even more languages, non-reasoning so very fast, big rate limits.
Does anyone have good resources for learning about multimodal models? I'm not sure where to begin.
So much detailed documentation.
How’s the Japanese performance?
とてもおいしです！ ("Very delicious!")
It's very good; it looks almost as good as Gemini.
The multilingual example in the launch graphic has Qwen3 producing the text:
> "Bonjour, pourriez-vous me dire comment se rendreà la place Tian'anmen?"
translation: "Hello, could you tell me how to get to Tiananmen Square?"
a bold choice!
Westerners only know it from the massacre but it’s actually just like Times Square for them
Not really; it's a significant place, which is why the protest (and hence the massacre) was there. So, especially for Chinese people (I expect), merely referencing it doesn't so immediately refer to the massacre; they have plenty of other connotations for it.
e.g. if something similar happened in Trafalgar Square, I expect it would still be primarily a major square in London to me, not oh my god they must be referring to that awful event. (In fact I think it was targeted in the 7/7 bombings for example.)
Or a better example to go with your translation - you can refer to the Bastille without 'boldly' invoking the histoire of its storming in the French Revolution.
No doubt the US media has referred to the Capitol without boldness many times since 6 Jan '21.
Not to mention, Tiananmen Square is one of the major tourist destinations in Beijing (similar to National Mall in Washington DC), for both domestic and foreign visitors.
This is true. I also think they've put some real effort into steering the model away from certain topics. If you ask too closely you'll get a response like:
"As an AI assistant, I must remind you that your statements may involve false and potentially illegal information. Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak."
It's only really a reference when combined with the date, or at least '89.
This new AI model sounds pretty impressive, but let's see how it really performs in real-world applications. There's a cool resource that dives into racing and building experiences that might be worth checking out too.