My very own AI server

From one newbie to another, thank you for the gentle, insightful advice.

I am quite excited about playing around with these tools and the hardware running it all, including the 5060 GPU with 16 GB of VRAM. I was originally only going to invest in an 8 GB card, but was advised to go bigger for upgrade headroom and more capable AI models.

As I experiment and make progress, if you’re cool with it, I will share periodic updates here. But this is your thread, and I don’t mean to interrupt the flow of conversation already happening here.

No worries, I’d originally planned this thread to be a mono subject build log, but…

Life happens.

I’m currently juggling about three times the amount of stuff I should be, and I’m not gonna get back in the groove for a while. I say let 'er rip!

While I’m thinking about it, don’t just share the good. When everything goes sideways you learn more, we learn more, and more help can be given. So if you faceplant Wile E. Coyote style, share the results so we can all learn and help.

Aaaaaaaa! Grrrrrrrrr! Why!?

You can take those apart, and use compressed air to get the water out from under the chips…

And I thought that Pilgrim was looking for ways to boost his intelligence instead of trying to torture me at first…

:emoji_die:

psst, hey @enginerd

You are evil… And now you have to cable manage that!

With hedge trimmers???

Go ahead, it has to look clean and work well once you’re finished.

If a hedge trimmer makes your job easier when removing the old cables, use it. Of course, I’m expecting you to run new patch cords and leave it looking nice!

:heart_eyes:

Well, I successfully assembled my AI computer and have been chatting with Llama 3 and Mixtral 8x7B for about 4 days now. It’s quite an interesting experience to have multiple AIs locally deployed in my homelab.

My thoughts are to integrate n8n for automating certain tasks, as a next step.

So my AI project has completely stalled out. For… reasons. Plus I’ve got more projects, and… I’m kinda scattershot lazy sometimes.

Long story short, I had to move a big chunk of data over to the R720 because it was killing my PC, and I wondered how everyone else’s rig was progressing, or not.

Not trying to call anybody out, but has anyone gotten anywhere?

I ended up deploying n8n on another computer in my homelab, then successfully connected two of its workflows to my AI computer via the Ollama API. I have also made significant progress building applications in Python and JS that talk to (for example) Mixtral 8x7B over my LAN.
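For anyone curious what "talking to Mixtral over the LAN" looks like in code, here's a minimal sketch using Ollama's HTTP API (`/api/generate` on the default port 11434). The LAN address is hypothetical, and the `ask` helper is my own name for it, not anything from Ollama:

```python
import json
import urllib.request

# Hypothetical LAN address of the AI box; Ollama listens on 11434 by default
OLLAMA_URL = "http://192.168.1.50:11434"

def build_request(model: str, prompt: str) -> dict:
    # Payload for Ollama's /api/generate; stream=False returns one JSON object
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# e.g. print(ask("mixtral:8x7b", "Summarize my day in one sentence."))
```

n8n's HTTP Request node can POST the same JSON body to the same endpoint, which is essentially what a workflow-to-Ollama connection boils down to.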

Instead of sending the data to Anthropic or Perplexity, I am using artificial intelligence locally, in my home. Which is pretty cool in my opinion.

Building a dedicated machine to house numerous AI services has been a great decision. I encourage anyone interested in experimenting with these tools to go for it.

6 Months Later…

I’m re-engaging the project. Basically, I’ve gotten my life a little (a lot, really) more focused, and I’m gonna go at this at a more sensible, well-planned pace.

I reformatted and reinstalled Debian 12, Ollama, and chunks of whisper / piper / etc. Currently I can run Ollama with Qwen2.5 and do the normal typed interaction in the terminal window. I can also (unrelated to the LLM) interact with voice commands. But it’s dumb. Like brick dumb. It does exactly two things, in fact.

  1. Using the wake word “Sweetie”, it will parrot back whatever you said to it. “Sweetie, How do you feel?” = “I heard you say, How do you feel?”
  2. It will tell you the time on the system clock. “Sweetie, what time is it?” = “It is five oh seven.”
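The two behaviors above boil down to a tiny dispatch function. Here's a rough sketch of that logic, assuming the wake word is stripped from the front of the transcript; the function and variable names are mine, not from any actual voice framework:

```python
import datetime
from typing import Optional

WAKE_WORD = "sweetie"

def speak_time(now: datetime.time) -> str:
    # Render 5:07 as "five oh seven" so the TTS doesn't read raw digits
    words = ["zero", "one", "two", "three", "four", "five", "six",
             "seven", "eight", "nine", "ten", "eleven", "twelve"]
    hour = now.hour % 12 or 12
    if now.minute == 0:
        return f"It is {words[hour]} o'clock."
    if now.minute < 10:
        return f"It is {words[hour]} oh {words[now.minute]}."
    return f"It is {words[hour]} {now.minute}."

def handle(utterance: str, now: Optional[datetime.time] = None) -> Optional[str]:
    # Ignore anything that doesn't start with the wake word
    text = utterance.strip()
    if not text.lower().startswith(WAKE_WORD):
        return None
    command = text[len(WAKE_WORD):].lstrip(" ,")
    if "what time is it" in command.lower():
        return speak_time(now or datetime.datetime.now().time())
    # Default: parrot the command back
    return f"I heard you say, {command}"
```

Brick dumb, as described, but it's the skeleton that a later LLM hookup can replace: swap the parrot branch for an Ollama call and the rest stays the same.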

A great deal of time was spent trying to keep it from locking up whenever the server would sleep or the screen turned off, and on stopping it from reading out extraneous b.s. like punctuation, plus getting it to speak numbers in word format. Still dumb, but much better spoken.

Now, where I’m going:
I need to (A) connect the voice interface to the LLM so it can do more conversationally, and (B) revisit the hardware to try to make replies more “snappy”. Right now I’m focusing on the hardware.

I currently have the R720, a 2080 Ti (modded to 22 GB VRAM), 24 x 8 GB RAM at 1333 MHz (192 GB), and dual E5-2630 v2 processors. After a long chat with ChatGPT, I’ve decided to make the following changes.

  1. E5-2630 v2 to E5-2667 v2.
  2. 24 x 8 GB RAM @ 1333 MHz to 8 x 16 GB RAM @ 1600 MHz. ChatGPT also strongly recommended 2Rx4 RAM.

The processor change bumps my clock speed way up, but with fewer cores. My AI setup simply doesn’t need or use all those extra cores, and CPU speed looks to be a bottleneck.

The faster RAM should also be obvious, and just like the CPU, I simply don’t need that many GB, but there’s a quirk specific to the R720 that ought to be mentioned. While it will accept and run 1600 MHz RAM, if you install too much, it draws more power than the chassis provides, and a power-saving strategy kicks in that underclocks the RAM to 1333 MHz. As it stands, I probably don’t even need half of what I’m going to put in there, but as long as it doesn’t drag the clock speed down, why not?

One potential further mod:
I’m considering (mostly a money issue) putting in a second GPU, but not to expand the LLM VRAM pool. I’m thinking of getting an unmodded 2080 Ti or a 3060 and assigning specific duties to each GPU. Basically, I’d keep the LLM on my 22 GB GPU and segregate everything else that benefits from GPU usage to the second card. Since that work needs speed but not much VRAM, I wouldn’t bother getting anything with more than 8-12 GB.
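One common way to pin services to specific cards like this is the `CUDA_VISIBLE_DEVICES` environment variable, which hides all but the listed GPU from a process. A sketch, with the service-to-GPU mapping entirely hypothetical (device indices depend on how the cards enumerate on the actual box):

```python
import os
import subprocess

# Hypothetical split: GPU 0 = modded 2080 Ti for the LLM,
# GPU 1 = the second card for everything else.
SERVICES = {
    "ollama":  "0",  # LLM inference stays on the 22 GB card
    "whisper": "1",  # speech-to-text on the second GPU
    "piper":   "1",  # TTS alongside it
}

def launch_env(service: str) -> dict:
    # CUDA_VISIBLE_DEVICES restricts which GPUs CUDA exposes to the process,
    # so each service sees "its" card as device 0.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = SERVICES[service]
    return env

# e.g. subprocess.Popen(["ollama", "serve"], env=launch_env("ollama"))
```

The same variable can go in a systemd unit's `Environment=` line, which avoids wrapper scripts entirely.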

I guess it’s worth saying that I don’t have Home Assistant, my USB Zigbee module, or my Home Assistant Voice Preview module installed at all yet. I’m currently using a RodeNT mic and a set of computer speakers. They have to be used together because the R720 does not have an audio jack, and the only way to hook it up was to pass through the RodeNT.

Processors get here at the end of the week, and RAM should be next week. Then I’m gonna have to get back down into the terminal and start getting the software put together. First smart voice, then Home Assistant, then build out Home Assistant peripherals.

Is there no way to convince you not to use Ollama? You are going to be at a huge disadvantage using those processors since they lack AVX2. I think your best bet for GPUs would be P40s, if they can fit. You want to max out VRAM so that you don’t have to offload anything to the CPU, or else you will be crawling. I just upgraded from P40s to 5060 Tis, and I have three of the P40s, complete with coolers and power connectors, that I was going to put up on the for-sale forum soon.

The benefit of having more than one CPU is more PCIe lanes to run more GPUs. The RAM and CPU upgrades are going to do absolutely nothing for speed. Without AVX2 your CPU is going to be the bottleneck, and no amount of added MHz or memory bandwidth will help with inference.

Here are some benchmarks showing the impact of a CPU upgrade from dual E5-2630 v2s to E5-2680 v2s.

Before upgrade

MaxCtx: 2048
ProcessingTime: 15.03s
ProcessingSpeed: 129.62T/s
GenerationTime: 15.76s
GenerationSpeed: 6.35T/s
TotalTime: 30.79s

After upgrade

MaxCtx: 2048
ProcessingTime: 14.78s
ProcessingSpeed: 131.82T/s
GenerationTime: 15.77s
GenerationSpeed: 6.34T/s
TotalTime: 30.55s

I eventually relented and got an R640 that can handle Gold CPUs with all the bells and whistles. I also got myself 3x low-profile Tesla T4 GPUs, which draw far less power and are on par with or better than my two power-hog 3060s. I still run the R720 with 2x 3060s for Frigate, faster-whisper, and other random stuff, but big-brain LLMs run exclusively on my R640.

I run the linux-server distro of faster-whisper because the rhasspy version isn’t GPU-capable. However, I had to patch in VAD for noise-reduction processing to handle noisier environments. I put this up so you can build a Docker container with it to deploy. Specify the large-v3 model if you want the best performance on your GPUs.

Add a service in the Wyoming integration for it, and point your voice assistant at faster-whisper instead of your local Whisper STT processor to make voice more responsive and accurate.

As for the dumb LLMs reading off punctuation and the like: what’s needed is either stricter instructions on how to output text for the Piper TTS, or somehow making Piper smart enough to read things properly. Right now the former is all you can really do.

For now I’ve given up on actually linking a good LLM to my voice assistant. The problem is that the models trained for Home Assistant (the ones that know how to respond so the HA engine can accurately feed the model devices and entities, and the model can reply with proper control commands) are kinda fucking dumb themselves… and they still fuck up the time for TTS all the time.

The reasons I still have Google Home Mini speakers everywhere are time, weather, and alarm setting. Home Assistant voice assistants still suck total ass at these basic things I took for granted from Alexa and Google.

A quick note: it’s hard for me to make an argument (in the respectful way) without sounding argumentative (that means bad), so if this reads like that, ignore it. I both appreciate and enjoy the finer points of the discussion.

For now, it works. In fact, it’s THE most reliable, stable part of the rig. Eventually, I’ll build something wildly different, but by that point my skillset will be wildly different. So for now, I’m sticking with it. Put another way, this is one of those “Better is the enemy of Good” philosophical moments.

Moving on to the meat of the matter. As I understand it:

2080 Ti (modded)      P40
22 GB                 24 GB
Tensor cores          No Tensor cores
FP16 support          Nope
GDDR6                 GDDR5
4352 CUDA cores       3840 CUDA cores

So I get 2 GB more VRAM, but lose the Tensor cores, and have slower performance on just about everything else?

There’s no way for you to know this in advance, but it’s pretty relevant for the rest of this reply: I’ve been unemployed for about 14 months now. I’m not starving, but dropping 160 bucks on a couple of processors and RAM is a pretty big deal for me.

I know nothing about AVX, so I looked it up. I have AVX, but not AVX2. To get AVX2, I’d have to completely replace the server to get a motherboard compatible with the next generation of processors that support it, an R730 at a minimum. Can’t do it.
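For anyone else checking their own box: on Linux, the CPU advertises its instruction-set extensions in the `flags` line of `/proc/cpuinfo`. A small sketch of parsing it (the helper names are made up for illustration):

```python
def cpu_flags(cpuinfo_text: str) -> set:
    # Pull the space-separated flag list out of /proc/cpuinfo output
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def avx_level(flags: set) -> str:
    if "avx2" in flags:
        return "AVX2"
    if "avx" in flags:
        return "AVX"
    return "none"

# On the server itself:
#   flags = cpu_flags(open("/proc/cpuinfo").read())
#   print(avx_level(flags))
# An Ivy Bridge Xeon like the E5-2667 v2 reports "avx" but not "avx2".
```

The one-liner equivalent is `grep -o avx2 /proc/cpuinfo | head -1`, which prints nothing on these Ivy Bridge chips.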

I think this leads to a divergence of our strategies, if I’m understanding you correctly. Basically, you’re advocating using dual GPUs that connect via faster links through the CPU to push fast inference on a large model. I would grossly paraphrase this as “pooling the VRAM”. If I had the hardware, great. Doubly great if I were only running the LLM and didn’t have other stuff trying to run on the GPUs too.

My strategy is based on the fact that I can’t do that. I have an R720, and its limitations just won’t go there. PCIe 3.0 bottlenecks GPU-to-GPU communication through the CPUs, and the server won’t support NVLink. So, I want to let my “good” GPU do inference for the LLM while the second GPU takes care of all other GPU tasks (whisper mostly, a little piper). Basically an attempt to keep the cross talk to a minimum. Granted, I don’t have a second GPU, but it’s in the plan for someday… unless I get to a position where I can afford a whole new rig all at once, in which case all bets are off.

As for the CPU / RAM: I’m a newb, got it. But as I understand it, for my application, Piper is much more CPU-oriented than GPU-oriented. Going from a 2.6 GHz processor to a 3.2 GHz one is not intended to help the LLM inference at all. I’m trying to cut down on that unnatural pause in the moment between query and reply in the voice system, right where Piper resides.

Don’t tease me, it’s not nice. :star_struck:

aww sorry :slight_smile:

So Piper generates voice from text, and that doesn’t really take any time at all. The biggest time sucks are 1) getting text from audio input (Whisper), and 2) doing something with that text (an LLM or the local Home Assistant text processor).

Try setting your voice assistant to just use the normal HA text processor, not connected to an LLM for now. Set it to use the normal local non-GPU slow Whisper service. Run some simple tests with this setup and get a feel for how long it takes to process specific voice commands.

Then deploy the faster-whisper docker container from the repo I posted. You can use this docker-compose.yml:

services:
  wyoming-faster-whisper:
    container_name: wyoming-faster-whisper
    # Custom fork with VAD support: https://github.com/amalg/docker-faster-whisper-vad
    image: ghcr.io/amalg/docker-faster-whisper-vad:gpu
    restart: unless-stopped
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Los_Angeles
      - WHISPER_MODEL=large-v3
      - WHISPER_BEAM=5
      - WHISPER_LANG=en
      # VAD (Voice Activity Detection) - Silero VAD
      - WHISPER_VAD=true
      - WHISPER_VAD_THRESHOLD=0.5
      - WHISPER_VAD_MIN_SPEECH=250
      - WHISPER_VAD_MIN_SILENCE=2000
    volumes:
      - ./data:/config
    ports:
      - "10300:10300"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]

Once it’s up and running, go to your HA integrations and click (or add) Wyoming Protocol.

This is the protocol that moves audio data between HA and the audio processor service (Whisper) and brings the text back. Click the Add Service button to add your new faster-whisper service running in the Docker container. I run HAOS, which has its own IP address, with faster-whisper on my server, so I put in my server IP and the default port (10300):

Once it’s connected, you’ll see an “entity” under faster-whisper:

Then go to your Voice Assistant’s page, click your voice assistant, and change your speech-to-text service to faster-whisper:

Once you get it working, try your specific voice commands again and see if the timing of that pause has changed. Then try different models by updating WHISPER_MODEL= in the docker-compose.yml file to any of the following:

Model             Parameters  English-only       Multilingual    VRAM required
tiny              39M         tiny.en            tiny            ~1 GB
base              74M         base.en            base            ~1 GB
small             244M        small.en           small           ~2 GB
medium            769M        medium.en          medium          ~5 GB
large-v1          1550M       -                  large-v1        ~10 GB
large-v2          1550M       -                  large-v2        ~10 GB
large-v3          1550M       -                  large-v3        ~10 GB
large-v3-turbo    809M        -                  large-v3-turbo  ~6 GB
distil-large-v2   756M        distil-large-v2    -               ~4 GB
distil-large-v3   756M        distil-large-v3    -               ~4 GB
distil-medium.en  394M        distil-medium.en   -               ~3 GB
distil-small.en   166M        distil-small.en    -               ~2 GB
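If you want to reason about which model fits alongside everything else on a card, the table above reduces to a small lookup. A sketch of a picker that chooses the most demanding model fitting a VRAM budget (the VRAM figures are the approximate ones from the table, and the function is purely illustrative):

```python
# Approximate VRAM needs in GB, taken from the model table above
VRAM_GB = {
    "tiny": 1, "base": 1, "small": 2, "medium": 5,
    "distil-small.en": 2, "distil-medium.en": 3,
    "distil-large-v3": 4, "large-v3-turbo": 6, "large-v3": 10,
}

def pick_model(free_vram_gb: float, headroom_gb: float = 1.0) -> str:
    # Choose the most demanding model that still fits, keeping some headroom
    # free for the CUDA context and whatever else shares the card
    fitting = [m for m, need in VRAM_GB.items()
               if need + headroom_gb <= free_vram_gb]
    return max(fitting, key=VRAM_GB.get, default="tiny")
```

For example, a card with 7 GB free lands on large-v3-turbo, while a 12 GB budget can take full large-v3.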

Smaller models will get you faster response times, at the cost of accuracy, of course. Try some different models: down the container, change your model, bring it up, and let it download the new model. You might have to restart the container afterward; sometimes it’s picky. Use lazydocker (or whatever Docker manager you use) to monitor your containers and watch logs.

Once you get a feel for how the Whisper model affects the “unnatural pause”, you can start trying LLM responses and see how those change response times. Without documenting every little change and observing its effects, it will be hard to quantify what makes the most impact on that pause.

This is one of those moments where I’m irrationally going to dig my heels in, all sense to the contrary.

I’m avoiding docker for all I’m worth.
It was one of the main frustration points that led me to fade out on my last attempt. I know, I get it, I’m trying to swim against the current, but I know myself well enough to say that until I’m absolutely forced to, no Docker.

The beauty of doing all this (the wrong way) is that as much as I want everything to “just work”, I also want to explore, learn, and break stuff.

That’s kinda what I’m doing now; well, it was, pre-hardware-update. I have the LLM, and I have the voice inputting and outputting (dumbly). It’s currently just a matter of dicking around with it to see what it feels like when I do this vs. that.