My very own AI server

ODaily · July 23, 2025, 7:32pm

Would you mind trying this?

ollama run wizard-vicuna-uncensored:13b

I puttered around alot of models, and for some reason, this really was the fastest. I’ve got 22gb on a 2080ti, and those look like dual P40s, I’m curious as to the performance changes.

If that runs fine, then maybe up the quantization and see if you can get the extra performance without loosing speed?

ollama run wizard-vicuna-uncensored:13b:q6_K

turbo2ltr · July 23, 2025, 8:05pm

have any particular prompt you want me to give it?

turbo2ltr · July 23, 2025, 8:13pm

The first one runs at 28tok/s

Reply wasn’t great.

~~As for the q6 one, Ollama doesn’t seem to have quants other than q4.~~
Disregard, I found it: wizard-vicuna-uncensored:13b-q6_K

I never saw the “View all” link at the top until now

turbo2ltr · July 23, 2025, 8:30pm

Compare the response with qwen2.5-coder:32b… I mean grant it it’s a bigger slower model.

chat-Proxmox Storage Pools - flat3.pdf (6.8 MB)

ODaily · July 23, 2025, 9:26pm

That’s what I was really curious about. I’m running in the 50-60 range. My curiosity was about superior architecture vs Bigger Vram. I got a feeling if we went up in size to a point where I exceeded my vram and had to offload to the cpu, but you were still inside your bigger vram envelope, then you’d smoke me hardcore.

Well, it is an older model. And I was looking for more conversational than technical. Still, good info there.

amal · July 23, 2025, 9:38pm

Anyone hook this up to home assistant yet? I’m looking for models that don’t go right off the fucking handle… so they need to be able to interpret the controls they have available to them and apply voice commands appropriately.

Also the most basic things I can do with alexa or google home, home assistant trips over… stuff like asking what time it is or setting an alarm or timer… like a countdown timer is basic stuff but because the model has no resources for doing things like this, I suppose I would need to write a custom HA process for alarms and timers and then need to indicate which HA voice box made the request and play an alert to that voice box specifically… I can see how this would get complicated pretty quickly.

Just wondering if anyone else has played with HA integration yet.

turbo2ltr · July 23, 2025, 9:48pm

Wow nice. Yeah I didn’t realize these P40s weren’t that great. I’m still trying to understand, yeah it’s great to have 48gb vram, but it seems there isn’t enough compute power to actually utilize that much memory at a reasonable speed so probably would be better to have a better GPU with less memory…there seems there is some balance of GPU and vram that maximizes the efficiency so you aren’t wasting vram with underpowered GPU or have more compute power than you can utilize at 100% ram utilization.

The q6 model ran at 16t/sec. slightly better response.
I’m not good at conversation. lol

turbo2ltr · July 23, 2025, 9:58pm

I haven’t even started looking into this but I believe HA has a system prompt and that is the key to a lot of this. You tell it it’s mission and expose the entities it can control. All those things are sent every time you make a request.

Slightly related, I came up with a system prompt for chatGPT because it wants to jump 16 steps ahead when troubleshooting something:

For this conversation, we will be troubleshooting some computer issues. Don’t try to jump ahead provide additional steps if those steps could be different depending on the result of the first step… Only answer the question at hand, short and concise.

Man it was so much nicer and it doesn’t pollute the chat with unrelated answers.

amal · July 23, 2025, 10:08pm

from what i understand;

vram is needed to hold the entire model in memory. without enough memory, tokens/sec gets utterly decapitated as the CPU and traditional RAM gets involved. Not worth bothering. Get yourself more vram or a smaller model.
your gpu will have different “architectures” depending on the make / model / year (“generation”) and this directly impacts the types of accelerators available (nvidia tensor & cuda cores / amd matrix & stream processors) and the version of library that works with the accelerators you have.
does your hardware have tensor core mixed precision support (FP16/BF16/FP8)
Does your hardware have the memory bandwidth? Without enough GB of HBM2(e)/HBM3 and ≥1 TB/s+ of bandwidth, even the fastest tensor cores will starve.

Order of importance seems to be;

Look at the tensor‑core TFLOPS (in your chosen precision).
Check the total CUDA‑core TFLOPS (for everything else).
Verify the GPU’s memory bandwidth is ≥1 TB/s

Of course, you also need to verify your chosen LLMs will fit within your vram capacity.

ODaily · July 23, 2025, 10:19pm

I’ve run into something similar. Most models fall into groups around 13b or 32b parameters. The smaller groups fit comfortably in my 22gb, with lot of wasted space left over, and the 70b - ish models overflow. I really wish there was a 20-24b parameter size. I think that would be my sweet spot for performance vs speed. Then again, not everybody has my exact specs, so hoping the world will change to accomadate my whims is likely to be fruitless.

I hurt for money right now, which has stalled ALL projects. But I do wonder if I could push a 70B model with dual modded 2080ti cards, or if I’d loose too much speed in cross card communication.

Just for the record…
Parameter size isn’t a direct indicator of memory size, but they do tend to trend together. The Wizard Vicuna Uncensored model I used before is:
7B at 3.8 gb Vram
13B at 7.4 gb Vram
30B at 18 gb Vram
(next model not wizard vicuna)
70B at 43gb Vram

turbo2ltr · July 23, 2025, 10:52pm

Yeah the P40 has no tensor cores and it’s not particularly high on the TFLOPS. It’s just one of the cheaper 24gb cards. And there’s a reason it’s cheaper than a 3090.

It’s tough to use smaller models when you get used to how powerful chatGPT is, but then you also realize the ridiculous amount of compute power it takes to run it.

Plato · July 24, 2025, 2:40am

P40s are fine, they are 1080ti’s with 24GB of VRAM and are basically as fast as a 3060.

You are running into a few issues here.

You aren’t actually benchmarking you are running a prompt and seeing what speed you get as a response. That doesn’t tell you anything because you could be caching or not caching that prompt which could mean processing it or no, your prompt could be large or no, you could be generating a large or small amount of tokens, etc etc. It is like trying to tell how fast it takes to drive somewhere by getting on a bus. It could be just as fast or it could be 10 times slower, all depending on a bunch of things you aren’t accounting for.
You are comparing with someone else doing the exact same thing. This squares the unknowns
You are using the worlds worst inference backend for speed optimization and benchmarking. It doesn’t tell you what it is doing at all, and it doesn’t let you change any thing. Ollama is fine for getting started I guess, but you should really just dump it now and jump into the deep end. It will do more harm to your learning what you are doing in the long run and give you terrible habits like thinking you need to containerize everything because installing things properly is hard or something

This is what you want for a benchmark:

Processing Prompt [BLAS] (924 / 924 tokens)
Generating (100 / 100 tokens)
[22:32:36] CtxLimit:1024/1024, Amt:100/100, Init:0.10s, Process:1.56s (592.31T/s), Generate:4.58s (21.81T/s), Total:6.14s
Benchmark Completed - v1.96.1 Results:
======
Flags: NoAVX2=False Threads=9 HighPriority=False Cuda_Args=['rowsplit'] Tensor_Split=None BlasThreads=9 BlasBatchSize=512 FlashAttention=True KvCache=0
Timestamp: 2025-07-24 02:32:36.294608+00:00
Backend: koboldcpp_cublas.so
Layers: 999
Model: Wizard-Vicuna-13B-Uncensored.Q6_K
MaxCtx: 1024
GenAmount: 100
-----
ProcessingTime: 1.560s
ProcessingSpeed: 592.31T/s
GenerationTime: 4.585s
GenerationSpeed: 21.81T/s
TotalTime: 6.145s
Output:  1 1 1 1
-----

amal · July 25, 2025, 12:35am

Plato · July 27, 2025, 6:58pm

Home assistant has an MCP server.

Sounds like that would be perfect for you.

Plato · August 11, 2025, 2:34pm

If any of you Ollama users want to know what’s going on

Ollama and GPT-OSS

ggerganov is the creator of the inference engine that Ollama based itself off of, and the original local AI gangsta.

amal · August 11, 2025, 4:29pm

Ugh… that is so lame to pull a PR stunt and fuck everyone at the same time, then fuck them again by not supporting or explaining what’s going on.

amal · September 2, 2025, 6:38pm

Training and detecting AI sleeper agents… so interesting..

bretbernhoft · September 2, 2025, 11:46pm

It’s great to see other technologists building their own AI server for their homelab. Congratulations on starting so early on.

I am still a newbie when it comes to deploying and programming with an LLM, or any artificial intelligence model that is locally operated.

Along these lines, today I received the last core components for a dedicated AI machine of my own. Here is a photo of those boxed parts:

My goal (with this build) is to become AI-literate as a developer, in the hopes of working with more sophisticated technologies in the future. Making this both a passion project and professional interest.

Does the OP (@ODaily) of this thread have any advice for those just entering into the world of working with open source AI models?

ODaily · September 3, 2025, 12:59am

Sure, but first you should know I’m a total newb myself. I only got this far following what others did.

I’d say first, don’t release the magic smoke. But that’ s more a me problem cause I was combining components that weren’t ever meant to be together. Looks like you’ve got a much more coherent system build.

I’m seeing a 16gb graphics card. Assuming you go ollama, it actually lists the vram necessary in GB for each model. Don’t go over or it seriously bogs everything.

So you could run the three in red, but not the blue. With that said, I favor the larger models as being more capable. It’s also weird that often you can’t find a model that fully fills your vram (like these 3) without going over (like the blue).

Experiment. Try different models, poke and prod at them. What worked best for me was an older, almost outdated model.

Chatgpt. Weirdly it works to hold your hand and walk you through stuff. Alot of what I did was just copy paste back and forth between my system and chatgpt. This will vary with your level of linux skill. Mine was nearly non-existent when I started, but improving. BUT NOTE THIS, when chat gpt says you’ve gone over your limit and you’ll get downgraded until sometime later, STOP. The downgraded model will lead you down so many false paths your head will spin.

Most importantly, it’s a learning experience. Be o.k. with breaking stuff.

bretbernhoft · September 3, 2025, 1:10am

From one newbie to another, thank you for the gentle, insightful advice.

I am quite excited about playing around with these tools and the hardware running it all. Including the 5060 GPU with 16 GB of VRAM. I was originally only going to invest in an 8 GB card. But was recommended going bigger, for upgrades and more performant AI models.

As I experiment and make progress, if you’re cool with it, I will share periodic updates here. But this is your thread, and I don’t mean to interrupt the flow of conversation already happening here.