Local AI Server Guide

Without AVX2 you will need to compile it yourself. Don’t worry, it is super easy:

git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make -j 8 LLAMA_CUBLAS=1 

Change the 8 to the number of CPU cores you have. It should only take a few minutes to compile.
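If you don’t want to hard-code the job count, `nproc` (from GNU coreutils, so Linux only) reports the number of available cores; a small sketch:

```shell
# Detect the core count instead of hard-coding the 8 (nproc is GNU coreutils).
CORES="$(nproc)"
echo "building with $CORES parallel jobs"
# The actual build line would then be:
#   make -j"$CORES" LLAMA_CUBLAS=1
```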

The reason you need to do this is that with your CPU limitations, it is going to be difficult to get a precompiled engine that will be optimal for you. This isn’t specific to any particular engine; I would advise doing it regardless, and it is pretty painless to do with koboldcpp because it uses make.

By compiling it on your machine, the build will automatically detect the appropriate instruction sets and optimizations for your CPU and bake them into the binary libraries.

There is no reason to run it in Docker on a local server, because it doesn’t install anything and it doesn’t require an environment or any specific runtimes except what you would need to run your system anyway.

If you run into issues with the CUDA libraries and compiling, go through the steps in the guide in the last post to install the CUDA toolkit.

When it is done compiling, just do

python koboldcpp.py https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF/blob/main/google_gemma-3-12b-it-Q4_K_M.gguf --mmproj https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF/blob/main/mmproj-google_gemma-3-12b-it-f16.gguf --contextsize 8192 --gpulayers 999

or whatever your parameters are.

If you want some scripts to automate the run process a bit easier let me know.
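As a taste of what those scripts look like, here is a sketch of a tiny launch helper. The function name, defaults, and model filename are my own illustration (the defaults just mirror the example command above), not part of koboldcpp:

```shell
#!/usr/bin/env bash
# Hypothetical launch helper: builds the koboldcpp command line so one script
# can start different models. Defaults (8192 context, 999 GPU layers) match
# the example command above; adjust for your setup.
build_cmd() {
  local model="$1" ctx="${2:-8192}" layers="${3:-999}"
  printf 'python koboldcpp.py %s --contextsize %s --gpulayers %s\n' \
    "$model" "$ctx" "$layers"
}

# Print the command it would run; replace the printf with an exec to launch for real.
build_cmd google_gemma-3-12b-it-Q4_K_M.gguf
```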

finally got around to re-attempting my frigate setup with my 2 GPUs and 2 Coral TPU PCIe modules

What are you using the Corals for?

They are good at object detection and inference for very low power cost.

Just wondering what you are detecting.

I have some DepthAI cameras from a previous project, which have a built-in TPU similar to the Coral’s. They do a decent job of image detection but are kind of a pain in the ass to program.

Ahh, people mostly. The GPUs then do face rec on top of that, but the timeline people detection is what the TPUs are up to.

I’ll probably try to make a kind of “presence” filter for Home Assistant operations based on people detection, at least for indoor cameras. Outdoor cameras seem to be too unpredictable/chaotic to get it working reliably. Spiders crawling across the camera FOV trigger person events, etc.

I’d react the same way as the AI if I had spiders crawling over my eyes…

:winking_face_with_tongue: :scream:

I have reflected a bit on your comments regarding Ollama. It would be useful for me to branch out and try different software.

What is your suggested AI runtime on Linux for someone with programming chops?

Are you on Linux or Windows? Is this a desktop or a server?

I am on Linux Mint for my primary/developer PC. The AI computer is a Linux desktop I access via SSH.

I recommend llama-server, from llama.cpp. You can find it on GitHub. Koboldcpp has some real quality-of-life additions, but that’s personal preference. Most people are happy with llama.cpp. It has a built-in web GUI, but if you want to do things like MCP I recommend witsy as a client.
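If it helps, a rough build-and-run sequence (the repo URL and flags reflect recent llama.cpp; the model path is a placeholder, and `-DGGML_CUDA=ON` assumes an NVIDIA GPU — drop it for a CPU-only build):

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON      # native CPU optimizations are enabled by default
cmake --build build -j"$(nproc)"
# -m: model file, -c: context size, -ngl: layers to offload to the GPU
./build/bin/llama-server -m /path/to/model.gguf -c 8192 -ngl 999 --port 8080
```

The built-in web GUI is then at http://localhost:8080, and the same port serves an OpenAI-compatible API that clients like witsy can point at.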

Thank you for the recommendation. I will look into it.

For anyone looking to grab a GPU, now is the time to do it.

Ugh, this is the absolute worst time to have to RMA my Gigabyte RTX 4090. I sent it in a couple of weeks ago, but I’m not holding out much hope. Even if they do actually agree there is an issue I doubt they have any stock to replace it, even refurbished.

My guess is they plug it in, power it up, run it for 30 seconds, then return it to me as “working”.

Given the current market, you’re lucky they don’t say:

  • rma rep: What? A 4090? Hey George, you see a 4090 around here?
  • repair tech George: Nope. I ain’t seen nuffin.
  • rma rep: Seems we don’t have any idea what you’re talking about.

Lol, I insured it for $3K when I sent it in. I was half-hoping UPS lost it. Crazy that a used two-year-old card is selling for double what I paid new.

I heard back from Gigabyte today. They hooked it up and ran 3DMark for an hour with no crashes and want to send it back to me. The problem is that the card only crashes when playing games. Multiple games crash with a graphics card fault within ~10-20 minutes of playing. It happens in multiple systems, with multiple sets of drivers, on Win 10 and 11, and the crashes immediately resolve when I swap the card out for an RTX 5070 with no other system or software changes.

I was able to convince the CSR on the phone to add some notes to try and get their tech support department to actually test the card in some modern games. The rep told me that the repair/RMA department techs aren’t allowed/able to install games on their rigs … rigs built to test high-end cards that are 90% used for gaming :enraged_face:.

I wonder what the difference is between a game and 3DMark? Different types of cores used? More memory used in a game?

What power supply do you have?

My hunch is that it’s a memory or memory controller issue coupled with Nvidia-specific features like DLSS. I was never able to find a benchmark that could stress the card in the same way that a modern game will, especially at high resolutions (I have a 5120 x 2160 monitor). Even Furmark’s stress test ran stable.

It’s also not an overheating issue (unless the card’s temperature sensors are faulty). Power use, fan speeds, and temperatures all had normal-looking curves, and max reported temps were in the low 70s.

A 1200W Be Quiet Pure Power 12. I also tested with a friend’s 1000W power supply (I don’t remember the brand, but it was powering a 5080 without issue) and the crashes persisted.