My very own AI server

Oh, buddy. Here we go.

Long story short, I’m gonna build out an AI model on a home-hosted server. The inspiration came from @amal, by way of the following threads. I’m starting a new thread because I don’t want to clutter up the existing ones with my random attempts / thoughts / musings, and because you can often learn more watching an idiot fail than an expert succeed (and I AM an AI idiot).

THIS is Amal’s AI thread that got me started, and THIS is how Amal stuffed 3060 GPUs into his R720 server to accommodate the AI build.

So first thing, I bought an R720 Dell server for $219.99 (eBay), which ended up at $241.99 with tax. It’s sitting here in front of me in all its enterprise-level goodness.

First impressions, bigger than I thought, followed by damn, that’s heavy.

I have 12 of 24 RAM slots filled at 8GB apiece, for 96GB of RAM total. I seriously thought about filling the rest up, but frankly, I don’t think I can really use what I’ve got now without running multiple VMs, which I don’t intend to do.

I have the dual E5-2630 processors at 2.6GHz. Not the best, not the worst, and probably more than I need anyway.

These servers use riser cards to connect GPUs to the PCIe slots. The first riser (#2) always supports x16 speed; the second (#3) supports multiple x8 slots, or a single x16 on an optional card. I have the first option, so if I ever want a second GPU, I’ll need to swap that card out. But not today.

I have the dual 750W power supplies. They’re plug-and-play hot-swappable, and as such you’re not going to get anything other than factory units. Speaking of which, Dell did offer 1100W power supplies, which are definitely recommended for GPU usage, so I’m going to pick up a pair of those next.

For storage I’ve got 8x 600GB hard drives (overkill) and don’t have the optional SD card (no biggie). There’s also a DVD drive, which kinda surprised me. Wasn’t expecting it, but not complaining.

Other than that, per the seller, it’s supposed to be running Windows 10, but I’ve not booted up yet to check and see.

3 Likes

GPU(s): just another name for graphics cards, the same thing your gamer buddy obsesses over.

What I’ve learned so far.

The AI model runs (hopefully / sorta / kinda) completely on the GPU. Each GPU has its own RAM called VRAM, and each AI model has a size measured in parameters; a 70B model has 70 billion parameters. The total VRAM available determines how big of a model you can run. Bigger = more accurate.
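
To make that concrete, here’s a back-of-the-napkin VRAM estimate. The bytes-per-parameter and overhead figures are my own rough assumptions, not anything official:

```python
# Rough VRAM estimate: parameters x bytes-per-parameter, plus ~20% headroom
# for context / KV cache. Ballpark assumptions, not exact figures.

def vram_gb(params_billion: float, bits_per_param: int, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

for label, params, bits in [("8B @ 4-bit", 8, 4),
                            ("12B @ 4-bit", 12, 4),
                            ("70B @ 4-bit", 70, 4),
                            ("70B @ 16-bit", 70, 16)]:
    print(f"{label}: ~{vram_gb(params, bits):.0f} GB VRAM")
```

Run that and an 8B model at 4-bit comes out around 5GB, while a 70B model at 4-bit wants around 42GB, which is why the VRAM total, not enthusiasm, picks your model.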

The R720 will support a max of 2 GPUs. Adding a second provides almost double the punch. You lose a little when the two boards have to talk to each other, so two 12GB cards are not quite as good as a single 24GB card. But two 24GB cards would be better yet, and expensive as hell. Also note, the R720 is space limited, and will only accept a GPU two slots wide, max.
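
For what it’s worth, the actual splitting across two cards is handled by the runtime. Here’s a sketch using llama-cpp-python (the model path is hypothetical, and the 50/50 split assumes two identical cards):

```python
# Sketch: loading one model across two GPUs with llama-cpp-python.
# tensor_split controls what fraction of the layers each GPU receives.
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-13b-q4.gguf",  # hypothetical file
    n_gpu_layers=-1,          # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],  # even split across two identical cards
)
print(llm("Q: What is a Dell R720? A:", max_tokens=64)["choices"][0]["text"])
```

The inter-card chatter mentioned above is exactly the overhead this split introduces.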

So which GPU(s)???
I’ve considered the following:

P40, an enterprise-level (factory option for the server) GPU with 24GB VRAM. Long story short, it’s an older unit that may soon fall out of technological usability (thanks Amal!). So, workable, but maybe not long term. It’s available for around 300 bucks, and 24GB is not to be sneezed at.

RTX 3060, a consumer-level (for your PC) GPU with 12GB VRAM. It’s a newer-ish card, so better tech, but less VRAM. Amal is using 2 of these for his rig. Price is similar to the P40.

RTX 3090, also a consumer-level GPU, but with 24GB VRAM, and even newer than the 3060. BUT, the price is… ouch. Still, if I can figure out a way to finance it, I’d like to go with a single 3090, with a possible second way way way down the line. Might be obsolete by then. Note, this choice is just based on my tech-greedy, grab-as-much-as-you-can kind of outlook. Bang is great; bang for the buck is gonna suck.

Last thought on GPUs. The R720 is designed to let the case fans blow through the GPU for cooling (passive cooling: no fan on the GPU itself). Consumer GPUs have fans designed to blow across the card, dumping the hot air back into the case. You can also get consumer GPUs with blower-style cooling, which use a fan to push air out the back. The blower style seems like a hybrid, best-of-both-worlds kind of thing to me, and I’ll probably aim for that if I can.

2 Likes

get 3… always have a spare handy :slight_smile:

you can also config them through the iDRAC controller (separate ethernet cable, IP, web interface)

The thing to know is that even though you’ll have 24GB of vram to explore larger models, the 3090 is still somewhat slow processing those models due to the lesser specs on the CUDA and Tensor core counts. Basically if you say “Hey I have 24GB of VRAM to play with here!” and load up a huge model that takes up 20GB of VRAM, your token generation speed is still going to be dogshit because the graphics cores are not going to be able to do the calcs fast enough to connect all those parameters quickly and produce a nice quick output stream for you.

My 3060s are great with models up to about 8B parameters (4 bit), but larger models like 12B and 70B fit in VRAM just fine and they do work; they’re just noticeably slow. Where speed really matters is voice assistant interactions, especially if there is reasoning going on where there is a “thinking phase” before the official output begins. Typing with these models is fine, but with a voice assistant you want it to be as instant as possible… and I believe the text-to-speech engine requires the complete output before it starts talking… so if your model is slow to start outputting and slow to finish outputting, you will be waiting there quite a long, uncomfortable time… potentially while your S.O. is staring daggers at you while holding the Echo Dot (or Google Home Mini) you unplugged.
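
If you want hard numbers instead of vibes, here’s a quick way to measure tokens per second against a local Ollama server (assuming Ollama is already running with a model pulled; the model name is just an example):

```python
# Measure generation speed from the timing fields Ollama returns.
import json, urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3.1:8b",  # substitute whatever model you pulled
        "prompt": "Explain what a PCIe riser does, briefly.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    out = json.load(resp)

# Ollama reports eval_count (generated tokens) and eval_duration (nanoseconds).
tok_per_s = out["eval_count"] / (out["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tokens/sec")
```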

1 Like

I find it won’t matter for the short spurts of high power processing your GPU will be doing. Where it matters is if you are actually gaming or training a model … training takes 100% GPU power for extended periods, but holding a model in memory requires next to no power… power draw only spikes while the model is processing a request.

The air flow design of the R720 (and most rack systems) is that a long line of fans just behind the hard drives push air through the case. The fans on the GPU will draw air through the GPU cooling fins in a direction that is perpendicular to the natural air flow, but the overall current of air is front to back through the case, so for these short spurts of high power usage, it won’t matter at all.

See… two models loaded, one in each GPU, and peak is just 41W for one of the GPUs… and that’s because it’s doing ffmpeg work for Frigate NVR.
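
You can watch that idle-versus-spike behavior yourself by polling nvidia-smi (these query flags are standard) while a model sits loaded:

```python
# Print per-GPU power draw and memory use once a second for ten seconds.
import subprocess, time

for _ in range(10):
    print(subprocess.run(
        ["nvidia-smi", "--query-gpu=index,power.draw,memory.used",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip())
    time.sleep(1)
```

With a model loaded but idle you should see low double digits like the 41W above; it only jumps while a request is actually being processed.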

1 Like

Man, hard to find reliable benchmarks. The 3090 is roughly twice as fast as a 3060 according to this post:
https://www.reddit.com/r/LocalLLaMA/comments/1augktf/rtx_3090_vs_rtx_3060_inference_comparison/

Which is in line with TechPowerUp’s table showing the 3060 has roughly 50% of the 3090’s performance.

Also depends on your use case. I have a 3090 running quantized Qwen2.5 Coder 32B, which I use for dumb coding questions (mostly about C++'s stupid approach to polymorphism). I wouldn’t want to go to a smaller model; you can definitely notice the quality degrade. But for Home Assistant? An 8B model would be perfectly fine.

I will add that I’m extremely impressed with how fast the responses generate on the 3090. Difficult to benchmark, but it can do 2-3 lines of text a second when generating the response to my coding questions (20-ish seconds to finish the whole example and explanation).

In the Voice Chapter 9 blog post, Home Assistant announced support for AI response streaming in version 2025.3 for their text interface, so it should be improving rapidly over the next few months.

1 Like

Agree, but I imagine shoving a 3090 into the R720 case would be… difficult. Given the size, thermal requirements, and power draw (350W), it’s probably best to avoid this route.

Speaking of power draw:
1100W being pulled from 120V implies 9.167 amps.
Double that for both power supplies and you’re pulling 18.334 amps, while most residential circuit breakers trip at 15 amps.
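
Spelled out, since I had to double-check my own math:

```python
# Worst-case breaker math: amps = watts / volts.
psu_watts, line_volts = 1100, 120
per_psu = psu_watts / line_volts  # ~9.17 A if one PSU ran flat out
both = 2 * per_psu                # ~18.33 A if both somehow maxed together
print(f"{per_psu:.2f} A per PSU, {both:.2f} A total vs. a 15 A breaker")
```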

Now I’m not saying you’re gonna actually pull those numbers, but do you have a guesstimate of what you are pulling?
My house wiring is ancient.

Yeah, you won’t get 350W from the 75W PCIe slot + 150W additional power cable.

Yeah, this depends on how you config your power supplies through iDRAC… you can simply config one to run in hot-spare mode and limit your total output to 1100W, which is well more than enough for the server + 3060 GPU.

Depends on your wiring… lighting circuits are usually 15A while sockets are usually wired for 20A (though I’ve seen plenty of 15A sockets wired to 20A rated Romex connected to a 20A breaker). You should probably have a dedicated circuit for your server stuff.

Once you config your iDRAC and get it updated and logged in, you’re gonna shit your pants. You can manage your server entirely out of band… power it up… shut it down… configure internal components… get realtime data… etc. Here is my current power screen;

1 Like

Just for frame of reference.


That is my ENTIRE electrical circuit for the whole house.

2 Likes

that’s probably due for an upgrade :grin:

I’d recommend checking out one of the online computer power draw calculators. That will tell you your realistic power requirements; I’d expect something like 600W max, realistically. The server power supplies are rated up to their wattage, so they can supply that much, but the server only draws what it actually needs.

As Amal stated, the second power supply is generally used as a backup in case the first one shits the bed.

With the R720 the behavior depends on configuration:

In this configuration both PSUs contribute to the server at the same time and split the load. However, you can change both the input power redundancy setting and Hot Spare function independently…

This gets a bit confusing, though, so if you just want one PSU to run primary, put it in redundant mode with a hot spare configured, and one PSU will run the whole server with the other in standby mode.

So I concede that the 3060 is the only viable choice. With that said, which one? What am I looking to get, or more importantly, avoid?

I got to thinking that I would prefer to avoid the power cable interference that Amal had, so I went looking for a card with a plug on the end.

That’s the HP GeForce RTX 3060, the only end-plug card I’ve found so far. But are all 3060s comparable? This one seems to lack the gaudy (and possibly hard-to-fit) gigantic plastic shroud, but it only has one fan.

Don’t even get me started on the little bitty 3060s that look like about 1/3 standard length.

I think there are the 3060 and the 3060 Ti, which has more cores but less VRAM; 12GB seems to be the nicest balance for the model sizes that run on the 3060.

Don’t worry about the cable. If you’re only using one GPU, put it in riser card 3 (I think) and you’ve got plenty of space. You’ll have a harder time finding or modding the cable to go from the power port to the card, as these seem to be two different worlds (server riser power sockets vs. consumer GPU power sockets).

This doesn’t help much with the cable issue but:

You can always remove the shroud (not heatsink) entirely and set your own fan(s) on it with your preferred combination of zip ties and 3D printed parts.

This saves space, makes things quieter, and can improve cooling overall.

Also keep in mind, with that HP model above, if you look at the fins, the natural airflow of the case will contribute to the cooling of that card so one fan may be more than enough.

Now I just need a spare 300 or so dollars.

On the bright side, I plugged it in and she fired right up. Had to go back to the eBay seller to get the password, but everything’s ticking over in apple-pie order.

So I figure I need a game plan.
Stage 1, Hardware. Physically assembling the basic system.
Stage 2a, Software. I need to get a conversational AI up and running on text input (a rough sketch of what that could look like follows this list).
Stage 2b, Software continued. Time to bring together Home Assistant, Voice control, and generally smooth it out.
Stage 3, Voice. I want to give it a unique voice / sound. I have ideas here. Will need to learn about voice training, but that’s part of stage 3 anyways.
Stage 4, Hardware. Time to take voice to multiple rooms, and start controlling lights. I’d also like to work in some sensors. I’m thinking an outdoor weather station, and indoor temp / humidity.
Stage 5, Unrealistic ideas. I might need a robo dog, one that can fetch a cold soda for me, on voice command. Maybe video input as well. By this point, if any of it will actually work, I’ll probably have to do a major upgrade over the R720. But ya gotta have dreams to pursue. :star_struck:
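
For Stage 2a, the text-input milestone can be surprisingly small. A minimal chat loop, assuming an Ollama server running on the R720 (the model name is only an example):

```python
# Bare-bones conversational loop against Ollama's /api/chat endpoint.
import json, urllib.request

history = []  # the growing message list is what makes it conversational
while True:
    history.append({"role": "user", "content": input("you> ")})
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps({"model": "llama3.1:8b",
                         "messages": history,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]
    history.append(reply)  # keep the assistant's turn for context
    print("ai>", reply["content"])
```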

1 Like

Check out Rhasspy Piper studio
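
Synthesis itself is just a CLI call once you have a voice model. A sketch, assuming the piper binary is installed and one of the stock voice models is downloaded:

```python
# Feed text to the piper CLI on stdin and get a WAV file back.
import subprocess

subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "howdy.wav"],
    input="Howdy! Your soda is in the fridge.",
    text=True, check=True,
)
```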

1 Like

I’ve tried searching, and I’m nowhere near ready for voice yet, but…

Can Piper merge / synthesize two or more voices?
What if I had two different people read the prompts (alternating) for voice training?
What if I had two different people read the SAME prompts, and then edited the files together?

Long story short, I don’t want to use a specific person’s voice. But I live in the South, and there are a LOT of ladies in the area with significant twang when they speak. If I could somehow merge multiple voices to produce a synthetic one, that would be ideal.

But, yeah. This is so far down the road yet.

I seriously doubt it. That’s some pretty advanced commercial stuff right there. Piper just focuses on training a voice that can synthesize any text with minimal training input.

If you want to do that, I suggest you use a commercial voice service that can generate the prompt text in the generic synthesized twang you’re looking for, then use those outputs to train your own voice.

2 Likes
