OK, here's my quick critique of the article (having built a similar AM4-based system in 2023 for 2300€):
1) [I thought] The page is blocking cut & paste. Super annoying!
2) The exact mainboard is not specified. There are 4 different boards called "ASUS ROG Strix X670E Gaming", and some of them only have one PCIe x16 slot. None of them can run two GPUs at x8 each.
3) The shopping link for the mainboard leads to the "ASUS ROG Strix X670E-E Gaming" model. This model can use the 2nd PCIe 5.0 port at only x4 speeds. The RTX 3090 can only do PCIe 4.0 of course so it will run at PCIe 4.0 x4. If you choose a desktop mainboard for having two GPUs, make sure it can run at PCIe x8 speeds when using both GPU slots! Having NVLink between the GPUs is not a replacement for having a fast connection between the CPU+RAM and the GPU and its VRAM.
4) Despite having a last-modified date of September 22nd, he is using his rig mostly with rather outdated or small LLMs and his benchmarks do not mention their quantization, which makes them useless. Also they seem not to be benchmarks at all, but "estimates". Perhaps the headline should be changed to reflect this?
Yeah, this page seems to be not great for beginners and also useless for people with experience.
A 2x 3090 build is okay for inference, but even with NVLink you're a bit handicapped for training. You're much better off getting a 4090 48GB from China for $2.5k and just using that. Example: https://www.alibaba.com/trade/search?keywords=4090+48gb&pric...
Also, this phrasing is concerning:
> WARNING - these components don't fit if you try to copy this build. The bottom GPU is resting on the Arctic p12 slim fans at the bottom of the case and pushing up on the GPU. Also the top arctic p14 Max fans don't have mounting points for half of their screw holes, and are in place by being very tightly wedged against the motherboard, case, and PSU. Also, there's probably way too much pressure on the pcie cables coming off the gpus when you close the glass.
What an indictment of NVidia market segmentation that there's an industry doing aftermarket VRAM upgrades on gaming cards due to their intentionally hobbled VRAM.
I wish AMD and Intel Arc would step up their game.
Intel Arc Pro B60 will come in a 48GB dual-GPU model. So yeah, hardware is gonna be there, and the 24GB model will be $599 from Sparkle. I assume 48GB will be cheaper than a hacked RTX 4090.
Look at this: https://www.maxsun.com/products/intel-arc-pro-b60-dual-48g-t... https://www.sparkle.com.tw/files/20250618145718157.pdf
Keep in mind that the dual-GPU is done via PCIe bifurcation, so that if you use two B60's on a similar motherboard to what's in the article, you'll only see two GPUs, not the full four. Hence just 48GB VRAM not 96GB.
Yeah, but the B60 is basically half the speed of a 3090... in 2025. I'd rather buy 5yr old nVidia hardware for $100 more on eBay than an intel product with horrendous software support that's half the speed effectively. This build is so cool because the 2x 3090 setup is still maybe the best option 5yrs+ after the GPU was released by nVidia.
$2.5k is about $1k more than you'd spend on a pair of 3090s, and people I know who've bought blower 4090s say they sound like hair driers.
Blowers are loud, but they're easier to pack together, particularly given how most motherboards don't seem to space their two slots sufficiently to accommodate the massive coolers on recent GPUs.
I can't wait for blower 3090s from China / MSI to get cheap (although I fear this may never happen)
Simply replacing the 3090s with 4090s would provide a major performance uplift, assuming your model fits. (I have rented both 3090 and 4090 systems online for research; this comment is based on my personal experience. The inference speed you get is well worth the price increase and the hourly rate.)
I am not a lawyer, but shouldn't 4090s be worse since they don't have NVLink?
There are patched drivers for enabling P2P, but if I remember correctly they are still slower than having NVLink.
Don’t those modified cards require hacked drivers? I would not want my expensive video card to depend on hacked drivers that may or may not continue to be available with new updates.
Are the Alibaba 4090s modded to reach 48GB VRAM? (I ask only to figure how why they're that cheap...)
Yes, they are modded by replacing the individual VRAM modules. https://www.tomshardware.com/pc-components/gpus/usd142-upgra...
I've also learned the hard way to Google "AM4 main board tier list" before buying.
Some boards can run a 5950X in name only, while others can comfortably run it close to double its spec power all day. VRMs are a real differentiator for this tier of hardware.
(If anyone can comment on the airflow required for 400-500W Epyc CPUs with the tiny VRM heatsinks that Supermicro uses, I'm all ears.)
> The page is blocking cut & paste. Super annoying!
I've been running Don't F* With Paste for years for this
https://chromewebstore.google.com/detail/dont-f-with-paste/n...
Hmm, I can copy paste just fine from the build page?
I don't know if the page actually f's with copy/paste or not since I already have the extension. It's usually most useful on forms where they force you to type in stuff.
Interesting. I guess our content-based marketing pages need to move to canvas-based rendering. That's probably bum too. Straight to serving up jpgs.
> Straight to serving up jpgs.
Back in my Amiga days we had PowerSnap[1], which did the bargain-basement version of OCR: check the font settings of the window you wanted to cut and paste from, and try to match the font to the bitmap, to let you copy and paste from apps that didn't support it, or from UI elements you normally couldn't.
These days, just throwing the image at an AI model would be far more resilient...
I think we've gotten to the point where it would be hard to compose an image that humans can read but an AI model can't, and easy to compose an image an AI can read but humans can't, so I suspect the only option for your marketing department will be to try to prompt inject the AI into buying your product.
(Oh, look, I have written nearly this same comment once before, 11 years ago, on HN[2] - I was wrong about how it worked, and Orgre was right, and my follow up reply appears to be closer to what it actually does)
[1] https://aminet.net/package/util/cdity/PowerSnap22a
[2] https://news.ycombinator.com/item?id=7631161
thankfully most web browsing will be done by LLMs soon and that won't stop them, good riddance to the mess of a web that google has created
dead Internet for realz
> 3) The shopping link for the mainboard leads to the "ASUS ROG Strix X670E-E Gaming" model. This model can use the 2nd PCIe 5.0 port at only x4 speeds. The RTX 3090 can only do PCIe 4.0 of course so it will run at PCIe 4.0 x4. If you choose a desktop mainboard for having two GPUs, make sure it can run at PCIe x8 speeds when using both GPU slots! Having NVLink between the GPUs is not a replacement for having a fast connection between the CPU+RAM and the GPU and its VRAM.
Forgive a noob question: I thought the connection to the GPU was actually fairly unimportant once the model was loaded, because sending input to the model and getting a response is low bandwidth? So it might matter if you're changing models a lot or doing a model that can work on video, but otherwise I thought it didn't really matter.
In general, if all you do is inference with a model that's in VRAM, you're right. OTOH it's simply a matter of picking the right mainboard. If you have one of those sweet new MoE models that won't completely fit in your VRAM, offloading means you want PCIe bandwidth, because it will be a bottleneck. Also, swapping between LLMs will be faster.
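To put rough numbers on it (a back-of-the-envelope, assuming the weights are already cached in system RAM): PCIe 4.0 moves roughly 2 GB/s per lane, so an x4 link tops out near 8 GB/s while x8 gives about 16 GB/s, and swapping a ~20 GB quantized model into VRAM is then on the order of 2.5 s versus ~1.3 s.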
> None of them can do PCIe x8 when using two GPUs.
Is that important for this workload? I thought most of the effort was spent processing data on the card rather than moving data on or off of it?
Sorry for going off topic, but your insight will be helpful for my build.
I'm thinking about a low-budget system, which will use:
1. X99 D8 MAX LGA2011-3 motherboard - it has 4 PCIe 3.0 x16 slots and dual CPU sockets. They are priced around $260 including both CPUs.
2. 4x AMD MI50 32GB cards - they are old now, but they have 32 gigs of VRAM and can be sourced at $110 each.
The whole setup would not cost more than $1000. Is it the right build, or can something more performant be built within this budget?
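(Rough budget check, using the numbers above: ~$260 for the board with CPUs plus 4 x $110 for the MI50s is about $700, which leaves roughly $300 for RAM, PSU, storage, and cooling.)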
I'd use caution with the Mi50s. I bought a 16GB one on eBay a while back and it's been completely unusable.
It seems to be a Radeon VII on an Mi50 board, which should technically work. It immediately hangs the first time an OpenCL kernel is run, and doesn't come back up until I reboot. It's possible my issues are due to Mesa or driver config, but I'd strongly recommend buying one to test before going all in.
There are a lot of cheap SXM2 V100s and adapter boards out now, which should perform very well. The adapters unfortunately weren't available when I bought my hardware, or I would have scooped up several.
I've seen the SXM2 (x2) to PCIe adapter cards out on eBay for like $350.
The 32GB V100s with heatsink are like $600 each, so that would be $1500 or so for a one-off 64GB GPU setup that is less performant overall than a single 3090.
Better to buy one used 3090 than those old cards. VRAM isn't everything. Or rather: you can do nothing without VRAM, but you can't do anything with just VRAM either.
To use the second pair of pcie slots, you _must_ have two cpus installed. Just saying in case someone finds a board with just one cpu socket populated.
Any reason you wouldn't opt for the 4090 or 5090?
3090 second hand can be found at something like $600.
[flagged]
I have js enabled and I can copy text on this page.
In general I can too, but try copying items from the "Key Specifications" section. Or perhaps I just got that impression because I can't tell which text is selected and which isn't when selecting text there. Mea culpa.
yeah the selection is dark grey over black so it is not super visible but you can copy text.
Horrible comment and attitude. People are trying to quote you for legitimate comment and criticism. This alone was enough for me to close the tab with your blog and ignore anything else you're going to say.
That's not the author (I don't think?), just a random troll.
I'm a huge fan of OpenRouter and their interface for solid LLMs, but I recently jumped into fine-tuning / modifying my own vision models for FPV drone detection (just for fun), and my daily workstation with its 2080 just wasn't good enough.
Even in 2025 it's cool how solid a setup dual 3090s still is. NVLink is an absolute must, but it's incredibly powerful. I'm able to run the latest Mistral thinking models and relatively powerful YOLO-based VLMs like the ones RoboFlow is based on.
Curious if anyone else is still using 3090s or has feedback for scaling up to 4-6 3090s.
Thanks everyone ;)
I am exploring options just for fun.
A used 3090 is around $900 on eBay. A used RTX 6000 Ada is around $5k.
4 3090s are slower at inference and worse at training than 1 RTX 6000.
4x3090 would consume 1400W at load.
RTX 6000 would consume 300W at load.
If you, god forbid, live in California and your power averages 45 cents per kWh, 4x3090 would be $1500+ more per year to operate than a single RTX 6000[0]
[0] Back of the napkin/ChatGPT calculation of running the GPU at load for 8 hours per day.
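Spelling that out: (1.4 kW - 0.3 kW) x 8 h/day x 365 days is roughly 3,200 kWh per year, which at $0.45/kWh comes to roughly $1,450 per year, in the ballpark of that figure.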
Note: I own a PC with a 3090, but if I had to build an AI training workstation, I would seriously consider cost to operate and resale value (per component).
To make matters worse, the RTX 3090 was released during the crypto craze, so a decent amount of the second-hand market could contain overused GPUs that won't last long. Even though the 3xxx-to-4xxx performance difference is not that high, I would avoid the 3xxx series entirely for resale value.
I bought 2 ex-mining 3090s ~3 years ago. They're in an always-on PC that I remote into. Haven't had a problem. If there were mass failures of GPUs due to mining, I would expect to have heard more about it.
I have a rig of 7 3090s that I bought from crypto bros. They are lasting quite alright and have been chugging along fine for the last 2 years. GPUs are electronic devices, not mechanical devices; they rarely blow up.
How do you have a rig that fits that many cards?? those things take 3 slots apiece.
Pictures, or it never happened! :D
You get a motherboard designed for the purpose (many PCIe slots) and a case (usually an open frame) that holds that many cards. Riser cables are used so every card doesn't plug directly into the motherboard.
I've noticed on eBay there are a lot of 3090s for sale that seem to have rusted or corroded heatsinks. I actually can't recall seeing this with used GPUs before, but maybe I just haven't been paying attention. Does this have to do with running them flat out in a basement or something?
Run near a saltwater source without AC and that will happen.
I guess it depends on what you want to do: You get half the RAM in the 6000 (48 @ $104/GB) vs 4x3090 (96 @ $37.5/GB).
I have an A6000 and the main advantage over a 3090 cluster is the build simplicity and relative silence of the machine (it is also used as my main dev workstation).
... and this is why the napkin calculation is terrible. Even running a GPU at load doesn't mean you are going to use the full wattage. 4 3090s running inference on a large model barely use 350 watts combined.
Can you clarify? Even if you down clock the card to 300W, why would running it at load not consume 4x300W?
>I am exploring options just for fun.
Since you're exploring options just for fun, out of curiosity, would you rent it out whenever you're not using it yourself, so it's not just sitting idle? (Could be noisy and loud). You'd be able to use your computer for other work at the same time and stop whenever you wanted to use it yourself.
It depends. At my electricity cost, 1 hour of 3090 or 1 hour of Rtx 6000 would cost the same 0.45
Just checked vast.ai. I will be losing money with 3090 at my electricity cost and making a tiny bit with rtx 6000.
Like with boats, it's probably better to rent GPUs than buy them.
Would a solar panel setup be an option for fixing that? :)
(you should also be compensated for the noise and inconvenience from it, not only electricity.) It sounds like you might rent it out if the rental price were higher.
I've built a rig with 14 of them. NVLink is not 'an absolute must', it can be useful depending on the model and the application software you use and whether you're training or inferring.
The most important figure is the power consumed per token generated. You can optimize for that and get to a reasonably efficient system, or you can maximize token generation speed and end up with two times the power consumption for very little gain. You also will likely need to have a way to get rid of excess heat and all those fans get loud. I stuck the system in my garage, that made the noise much more manageable.
I am curious about the setup of 14 GPUs - what kind of platform (motherboard) do you use to support so many PCIe lanes? And do you even have a chassis? Is it rack-mounted? Thanks!
I used a large Supermicro server chassis, a dual-Xeon motherboard with seven 8-lane PCI Express slots, all the RAM it would take (bought second hand), splitters, and four massive power supplies. I extended the server chassis with aluminum angle riveted onto the base. It could be rack mounted, but I'd hate to be the person lifting it in. The 3090s were a mix: 10 of the same type (small, and with blower-style fans on them) and 4 much larger ones that were kind of hard to accommodate (much wider and longer). I've linked to the splitter board manufacturer in another comment in this thread. That's the 'hard to get' component, but once you have those and good cables to go with them, the remaining setup problems are mostly power and heat management.
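(For the lane math: seven x8 slots split into two x4 links each comes out to exactly 14 cards at PCIe x4 apiece, which lines up with the "split down to 4 lanes and add more cards" approach mentioned elsewhere in this thread.)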
Thanks, that is very inspiring. I thought there were no blower-type consumer GPUs, but apparently they exist!
I got them second hand off some bitcoin mining guy.
https://www.tomshardware.com/news/asus-blower-rtx3090
Is the model that I have.
You really don't need NVLink; you won't saturate the PCIe lanes on a modern motherboard with dual 3090s.
Tim Dettmers' amazing GPU blog post posits NVLink doesn't start to become useful until you are at 128+ GPUs:
https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
The 3090 is a sweet spot for training. It's the first generation with seriously fast VRAM, and it's the last generation before Nvidia blocked NVLink. If you need to copy parameters between GPUs during training, the 3090 can be up to 70% faster than a 4090 or 5090, because the latter two are limited by PCI Express bandwidth.
To be fair though, the 4090 and 5090 are much more capable of saturating PCI Express than the 3090 is. Even at 4 lanes per card the 3090 rarely manages to saturate the links, so it still handsomely pays off to split down to 4 lanes and add more cards.
I used:
https://c-payne.com/
Very high quality and manageable prices.
I've purchased 16 of these - cpayne is great! Hope he finds a US distributor to help with tariffs a bit!
What blew me away is the quality and price point of what obviously can't be a very high volume product. This guy makes amazing stuff.
I bought a 2nd 3090 2 years ago for like 800eur, still a good price even today I think.
It's in my main workstation, and my idea was to always have Ollama running locally. The problem is that once I have a (large-ish) model running, all my VRAM is almost full and the GPU struggles to do things like playing back a YouTube video.
Lately I haven't used local AI much, also because I stopped using any coding AIs (as they wasted more time than they saved), I stopped doing local image generations (the AI image generation hype is going down), and for quick questions I just ask ChatGPT, mostly because I also often use web search and other tools, which are quicker on their platform.
I run my desktop environment on the iGPU and the AI stuff on the dGPUs.
That's a real good point!
Unfortunately, my CPU (5900X) doesn't have an iGPU.
Over the last 5 years, iGPUs fell a bit out of fashion. Now maybe they actually make a lot of sense, as there is a clear use case that involves having the dedicated GPU always in use for something that is not gaming (and gaming is different, because you don't often multi-task while gaming).
I do expect to see a surge in iGPU popularity, or maybe a software improvement to allow having a model always available without constantly hogging the VRAM.
PS: I thought Ollama had a way to use RAM instead of VRAM (?) to keep the model active when not in use, but in my experience that didn't solve the problem.
if it's just for detection would audio not be cheaper to process?
I'm imagining a cluster of directional microphones, and then i don't know if it's better to perform some sort of band pass filtering first since it's so computationally cheap or whether it's better to just feed everything into the model directly. No idea.
I guess my first thought was just that sound from a drone is likely detectable reliably at a greater distance than visuals; they're so small, and a 180-degree by 180-degree hemisphere of pixels is a lot to process.
Fun problem either way.
I built a similar system; meanwhile I've sold one of the RTX 3090s. Local inference is fun and feels liberating, but it's also slow, and once I was used to the immense power of the giant hosted models, the fun quickly disappeared.
I've kept a single GPU to still be able to play a bit with light local models, but not anymore for serious use.
If you have a 24 GB 3090, try out qwen:30b-a3b-instruct-2507-q4_K_M (Ollama).
It's pretty good.
TBF I also run that on a 16GB 5070 Ti at 25 T/s; it's amazing how fast it runs on consumer-grade hardware. I think you could push up to a bigger model, but I don't know enough about local llama.
Don't need a 3090, it runs really fast on an RTX 2080 too.
Graphics cards are so expensive (at list price) that they are cheap to own (no depreciation, liquid market).
Did you really claim GPUs have zero depreciation? That’s obviously false.
I have a similar setup as the author with 2x 3090s.
The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.
The issue is that the quality of the models that I'm able to self-host pales in comparison to that of SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate overall worse quality content. These are issues that plague all "AI" models, but they are particularly evident on open weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.
I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.
On my 2x 3090s I am running GLM 4.5 Air Q1, and it runs at ~300 tk/s prompt processing and 20-30 tk/s generation. It works pretty well with Roo Code on VS Code, rarely misses tool calls, and produces decent-quality code.
I also tried to use it with Claude Code via Claude Code Router, and it's pretty fast. Roo Code uses bigger contexts, so it's quite a bit slower than Claude Code in general, but I like the workflow better.
This is my snippet for llama-swap:
```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server
        -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
        --split-mode layer --tensor-split 0.48,0.52
        --flash-attn on
        -c 82000 --ubatch-size 512
        --cache-type-k q4_1 --cache-type-v q4_1
        -ngl 99 --threads -1
        --port ${PORT} --host 0.0.0.0
        --no-mmap
        -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K
        -ngld 99
        --kv-unified
```
What is llama-swap?
Been looking for more details about software configs on https://llamabuilds.ai
https://github.com/mostlygeek/llama-swap
It's a transparent proxy that automatically launches your selected model with your preferred inference server, so you don't need to manually start/stop the server when you want to switch models.
So, let's say I have configured Roo Code to use qwen3 30ba3b as the orchestrator and GLM 4.5 Air as the coder. Roo Code calls the proxy with model "qwen3" when in orchestrator mode, and when coder mode asks for "glm4.5air", the proxy kills the qwen3 llama.cpp instance and restarts llama.cpp with GLM 4.5 Air.
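For anyone curious what that looks like, here's a minimal llama-swap sketch with two entries; the repo names and flags are illustrative, not the exact config from above:
```
models:
  "qwen3":
    cmd: |
      llama.cpp/build/bin/llama-server
        -hf unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M
        -ngl 99 -c 32768
        --port ${PORT} --host 0.0.0.0
  "glm4.5air":
    cmd: |
      llama.cpp/build/bin/llama-server
        -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
        -ngl 99 -c 82000
        --port ${PORT} --host 0.0.0.0
```
A request for model "qwen3" is routed to the first entry; when a request for "glm4.5air" comes in, llama-swap stops the running llama-server and starts the GLM one before proxying the request.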
Thanks, but I find it hard to believe that a Q1 model would produce decent results.
I see that the Q2 version is around 42GB, which might be doable on 2x 3090s, even if some of it spills over to CPU/RAM. Have you tried Q2?
well, I tried it and it works for me. llm output is hard to properly evaluate without actually using it.
I read a lot of good comments on r/localllama, with most people suggesting qwen3 coder 30ba3b, but I never got it to work as well as GLM 4.5 air Q1.
As for Q2, it will fit in VRAM but only with a very small context, or it will spill over to RAM with quite an impact on speed depending on your setup. I have slow DDR4 RAM, and going for Q1 has been a good compromise for me, but YMMV.
> behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.
Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload MoE layers to the CPU? I can run gpt-oss-120b (5.1B active) on my 4080 and get a usable ~20 tk/s. Had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running
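For reference, a minimal sketch of the kind of invocation that discussion describes; the HF repo name and the layer count here are illustrative and should be tuned to your VRAM:
```
llama.cpp/build/bin/llama-server \
  -hf ggml-org/gpt-oss-120b-GGUF \
  --n-cpu-moe 28 \
  -ngl 99 -c 16384 \
  --port 8080 --host 0.0.0.0
```
The idea is that the attention layers and KV cache stay on the GPU while most of the expert weights sit in system RAM, which is why faster RAM helps so much.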
I use Ollama which offloads to the CPU automatically IIRC. IME the performance drops dramatically when that happens, and it hogs the CPU making the system unresponsive for other tasks, so I try to avoid it.
I don't believe that's the same thing. That should be the generic offloading that Ollama will do to any model that's too big, while this feature requires MoE models. https://github.com/ollama/ollama/issues/11772 is the feature request for something similar in Ollama.
One comment in that thread mentions getting almost 30tk/s from gpt-oss-120b on a 3090 with llama.cpp compared to 8tk/s with ollama.
This feature is limited to MoE models, but those seem to be gaining traction with gpt-oss, glm-4.5, and qwen3
Ah, I was not aware of that, thanks. I'll give it a try.
> 20-30 tk/s
or ~2.2M tk/day. This is how we should be thinking about it imho.
Is it? If you're the only user then you care about latency more than throughput.
Not if you have a queue of work that isn't a high priority, like edge compute to review changes in security cam footage or prepare my next day's tasks (calendar, commitments, needs, etc)
I get a 403 error.
There was an interesting post to r/LocalLLaMA yesterday from someone running inference mostly on CPU: https://carteakey.dev/optimizing%20gpt-oss-120b-local%20infe...
One of the observations is how much difference memory speed and bandwidth makes, even for CPU inference. Obviously a CPU isn't going to match a GPU for inference speed, but it's an affordable way to run much larger models than you can fit in 24GB or even 48GB of VRAM. If you do run inference on a CPU, you might benefit from some of the same memory optimizations made by gamers: favoring low-latency overclocked RAM.
Outside of prompt processing, the only reason GPUs are better than CPUs for inference is memory bandwidth; the performance of Apple M* devices at inference is a consequence of this, not of their UMA.
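A rough back-of-the-envelope makes the point: gpt-oss-120b has ~5.1B active parameters at roughly 4 bits each, so each token needs on the order of 3 GB of weights read from memory. Dual-channel DDR5 at ~100 GB/s therefore caps you at roughly 30 tk/s no matter how fast the cores are, while a 3090's ~936 GB/s of GDDR6X gives an order of magnitude more headroom.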
> WARNING - these components don't fit if you try to copy this build. The bottom GPU is resting on the Arctic p12 slim fans at the bottom of the case and pushing up on the GPU.
I built a dual 3090 rig, and this point was why I spent a long time looking for a case where the GPUs could fit side by side with a little gap for airflow.
I eventually went with a SilverStone GD11 HTPC case, which is meant for building a media centre, but it's huge inside, has a front fan that takes up 75% of the width of the case, and also allows the GPUs to stand upright so they don't sag and pull on their thin metal supports.
Highly recommend for a dual GPU build! If you can get dual 5090s instead of 3090s (good luck!) you'd even be able to get "good" airflow in this case.
I love how the prices for various Llama builds are all over the map on this site.
Oh look, here's one for $43K: https://www.llamabuilds.ai/build/a16zs-personal-ai-workstati...
> The workplace of the coworker I built this for is truly offline, with no potential for LAN or wifi, so to download new models and update the system periodically I need to go pick it up from him and take it home.
I'm surprised that a "truly offline" workplace allows servers to be taken home and being connected to the internet.
I worked in the Arctic for the better part of a decade. There's Starlink now, but I've been TRULY OFFLINE for weeks (with plenty of diesel-generated power) as recently as 2018. Technically we could use Iridium at like $10 per MB, but my full Wikipedia mirror (+ Debian/Ubuntu packages, PyPI, etc.) did come in handy more than once.
I know some Antarctic research stations (like McMurdo, for example) still have connectivity restrictions depending on time of day, and I wouldn't be surprised if they also had mirrors of this sort of thing, and/or dual-3090 rigs for llama.cpp in the off hours.
I'm really interested in this space from an AI sovereignty POV. Is it feasible for an SMB/SME to use a box like the one in the article to get offline analysis of their data? It avoids the worry of sending it off to the cloud.
I wanted to speak with businesses in my local area but no one took me up on it.
Yes, this is absolutely doable, and many companies are rolling their own ML models (I work with a MedTech company that does, in fact). LLMs are a little more involved, and you'd probably want something beefier than this (maybe a Framework Desktop cluster, if you're not wanting to get into rackmount stuff), but it's definitely feasible for companies to have their own offline LLMs and ML models.
I was going to say you need an extension cable. On my first dual 3090 build I had three issues. First, the PCIe extension wouldn't support Gen4, so I had to change to Gen3 in the BIOS. Second, depending on which slot you used, you couldn't get x16/x16 and it would drop to x16/x8 unless you had it configured right. Third, I finally gave up and just had the card resting first inside the case and then outside, where it'll jiggle around if a fan kicks up, so I had to make a makeshift holder to keep the card sitting there.
I built pretty much this exact rig myself, but now it's gathering dust. Any other uses for this besides local LLMs?
Sell it? There are people who want a rig like this.
The 3090 I have in my server (Ollama on it is only used occasionally nowadays since I have dual 5080s on my work desktop) also handles accelerated transcoding in Plex, and is in the process of being set up to monitor my 3D printers for failures via camera.
Am also considering setting up Home Assistant with LLM support again.
Play DnD by yourself with Llama as a DM
Heating
I use an older machine/GPU for wintertime heating, mining Monero (xmrig).
Should one get lucky and guess the next valid block, that pays the entire month's electricity — since an electric space heater would already be consuming the exact same amount of kWH as this GPU, there is no "negative cost" to operate.
This machine/GPU used to be my main workhorse, and still has ollama3.2 available — but even with HBM, 8GB of VRAM isn't really relevant in LLM-land.
vidya
3D rendering and fluid simulation stuff could be interesting.
Playing games, it has a good graphics card
I just don't get why the RTX 4090 is still so expensive on the used market. New Rtx 5090s are almost as expensive!
“Easy” to mod to 48gb
They're dropping. I'm trying to offload 8x 4090s and I'll average $1500 I think.
Are these just for ai now? Or are games pushing video cards that much?
4090 is a great gaming card, the spiritual successor to the 1080. It will be viable for years and years.
Those GPUs are so close to each other, doesn’t the heat cause instability?
Anybody else getting 403 Forbidden error?
The link is down with 403 error.
is it that easy to get started?
total cost?
It says $3090 (maybe easy to miss since it also talks about RTX 3090s?)
It's written quite large on the page, just over $3K.
I'm failing to see the point of this article? I mean, people have been building dual GPU workstations for a long long time.
What's so special about this one?