Local AI Server Guide

I am going to use this space to write up a comprehensive guide to getting a local AI server running.

With the field evolving constantly, it is incredibly difficult to keep your head above water when trying to learn all this yourself; the only practical way to accomplish it is to be embedded in it. This is impractical for most people who are not working in the field professionally or academically, and the problem is compounded by those who do populate the professional space being either advanced-degree holders with FAANG employers throwing recruiters at them (who have no time to write how-to’s), or hype-spewing grifters freshly demobed from the crypto battlefields with all the SEO know-how to populate every Google search with bullshit.

As someone with a CV full of credentials reading ‘just good enough to convince people I know what I am doing, but not good enough to be an actual expert’, and with extensive experience being dedicated to learning every new, highly complicated technical thing in front of me until I know enough about it to get bored and move on, I think I am quite suited to write such a guide.

I have a rough structure in my head, but I would like to field questions from people about what topics generally interest them in this field so that I can address them. But, please don’t get into very specific use cases as I’d rather focus on broad areas that would be applicable to many people who can use this as a foundation, rather than creating a comprehensive but narrow solution.

4 Likes

I literally just ordered everything I need to make a similar unit. Just waiting for the server to arrive.

Dell R730, 128GB DDR4 RAM, 24TB of raw mass storage, 1TB boot SSD, dual Xeon E5-2680v3, dual 24GB P40s.

What’s your software stack?

1 Like

Ubuntu for the OS, KoboldCpp for inference and Cherry for the front end.

Part 1: Why Run AI Locally?

This is fundamentally not just a practical decision but a philosophical one. It is undeniable that our lives are dependent on technology. We cannot live in the modern world without it. What technology we choose to use and how we engage with it are the only choices we really have.

The technology market is dominated by data gathering and processing. Google and Meta make their money by collecting large amounts of data about their users and transforming it into information that advertisers pay for. They cannot exist without gathering this data, and everything they do is built around that. This trickles down to the smallest players in the technology markets: almost everything that can access the internet will collect data and attempt to send that data back home. This is not paranoia. This is established fact.

Unfortunately, this ‘surveillance capitalism’ cannot be meaningfully opposed by an individual. All we can do is make small decisions that try to break our dependence on the services that are fueling it.

The inevitable end-game of corporate-controlled, consumer-facing AI (AI is nothing new to corporations, which have been using it for over a decade to run their systems – it is just consumers being able to interact directly with it that is new) is that once a dominant player emerges and they are able to figure out a strategy to extract value from it, then it will become integrated into the information collection and exchange system. Whether or not this is a bad thing is something that people are likely to disagree on. I think it is.

Right now we are in about Phase 1 of the cycle where the public is offered things of real utility for free or for much less than the production cost, while the providers try to consolidate user bases and figure out monetization strategies. This is the time we need to start putting that utility to use, because regardless of whether you agree that the end-game is a bad one, it is still free, and it is still useful, and it is still only for a limited time.

Any person who values this utility will find it in their interest to take as much of it as they can and learn how to integrate it into their lives to maximize its usefulness. However, how one does this does depend on their view of that end-game. If you would like to be as independent as possible from the centrally controlled information gathering market, then it is critical that you learn how to manage it yourself and that you run as much as possible locally. This guide is an attempt to help you get started doing that.

2 Likes

Did you use AI to write this or modify your original text?

I use it to copy edit and prune, since I can be overly verbose. I don’t use it to generate any new text.

1 Like

Part 2: Hardware

This section is about hardware needs for running a local LLM. The goals are assumed to be as follows:

  1. Effective dollars per unit of performance is the target
  2. A server or always-on system is desired
  3. Access to the server from inside and outside the network is desired
  4. Electricity is a factor in the costs
  5. You are proficient in hardware and software
  6. The desired function is inference, not training

Hardware Factors in LLM Performance

LLM inference has two distinct phases with different hardware requirements:

1. Prompt Processing (Compute-Bound)

Prompt processing is bound by compute speed - FLOPS (floating point operations per second) for float math or TOPS (tera operations per second, usually quoted for integer math). Since we are dealing with quantized model weights, the most important specs are 8-bit integer throughput (INT8 TOPS) and 16-bit float throughput (FP16 FLOPS). GPUs excel at the parallel processing required, making them ideal for prompt processing.

  • Bottleneck is raw compute power (FLOPS/TOPS)
  • GPUs excel due to parallel processing while CPUs are terrible
  • Look for INT8 TOPS, FP16 FLOPS

This is why you should definitely have a GPU in your system for prompt processing, regardless of how much memory bandwidth your system has.

Example
Context window of 8192 tokens:

  • Prompt: 200 tokens
  • Chat history: 3000 tokens
  • Document: 1000 tokens
  • Image: 500 tokens
  • Total used: 4700 tokens for processing (if not cached)
  • Remaining: 3492 possible tokens for generation

2. Token Generation (Memory-Bound)

Token generation is bound by memory bandwidth. To generate each token, the entire model must be sent to the processor layer by layer. This means whatever the size of the model, that amount of data must be transferred for every generated token. This data movement almost always takes longer than the actual processing. GPUs typically have higher memory bandwidth than CPUs, but consumer motherboards often have limited RAM bandwidth due to having only two memory channels. Each additional populated channel adds bandwidth proportionally, which you get by spreading your RAM across enough smaller sticks to fill every channel (up to two per channel).

  • Bottleneck is memory bandwidth
  • More memory channels > faster memory speed
  • Dual CPUs double memory channels

Because of this, much older servers or professional workstations often outperform new consumer hardware at token generation because they have more memory channels.

Calculating Memory Bandwidth

Formula: bandwidth_gbps = (memory_speed_mhz × 2 for DDR × 64 bits per channel × channels) ÷ 8000 to convert millions of bits per second into GB/s

Example

DDR4-3200:

  • The base clock is 1600 MHz, or 1600 million clocks per second
  • DDR transfers data twice per clock, for 3200 million transfers per second
  • Multiply by 64 bits per transfer per channel for 204,800 million bits per second
  • Divide by 8000 to get 25.6 billion bytes per second (25.6 GB/s) per channel

Therefore, at DDR4-3200 we get:

  • Single channel: 25.6 GB/s
  • Dual channel: 51.2 GB/s
  • Quad channel: 102.4 GB/s
  • Dual CPUs, each quad channel: 204.8GB/s
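If you want to play with these numbers yourself, here is a minimal Python sketch of the same formula, plus the theoretical ceiling it puts on token generation. The 18GB model size is just the Q6 24B example used later in this guide, and real speeds will come in below the ceiling:

# Peak memory bandwidth and the token-generation ceiling it implies.
def bandwidth_gbps(memory_speed_mhz: float, channels: int) -> float:
    # clock x 2 (DDR) x 64 bits per channel x channels, divided by 8000 to get GB/s
    return memory_speed_mhz * 2 * 64 * channels / 8000

def tokens_per_second_ceiling(bandwidth: float, model_size_gb: float) -> float:
    # every generated token moves roughly the whole model through memory once,
    # so bandwidth / model size is a hard upper bound on tokens per second
    return bandwidth / model_size_gb

for channels in (1, 2, 4, 8):
    bw = bandwidth_gbps(1600, channels)  # DDR4-3200 has a 1600 MHz base clock
    print(f"{channels} channel(s): {bw:6.1f} GB/s, "
          f"at most {tokens_per_second_ceiling(bw, 18):4.1f} tok/s on an 18GB model")

That ceiling is why a big model sitting in dual-channel desktop RAM feels slow no matter how fast the CPU is.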

For a specific example, this is what I wrote and submitted, asking for it to be more readable and to add concrete examples and format it for skimming, resulting in the above comment:

Part 2: Hardware

This section is about hardware needs for running a local LLM. The goals are assumed to be as follows:

  1. Effective dollars per unit of performance is the target, when that is the highest spot on the curve when the (performance, cost) function is plotted on a graph
  2. A server or always-on system is desired which will be configured and maintained by you
  3. Access to the server from inside and outside the network is desired
  4. Electricity is a factor in the costs
  5. You are proficient in hardware and software such that you can assemble complicated combinations of hardware along with software and
  6. You are proficient enough to be able to troubleshoot novel problems arising from such combinations
  7. The desired function is inference, not training

Hardware Factors in LLM Performance

With LLM inference there are two parts which require extensive compute resources:

  1. Prompt processing
  2. Token generation

Prompt processing is self-explanatory and is the processing of tokens before generating new ones. LLMs are text completion engines, and every generated token is reliant on all of the previous tokens, with each previous token adding to the compute cost quadratically. The context window is the best way to understand this, as this is the maximum number of tokens that the LLM is allowed to process or generate. This includes all of the previous tokens in the current chat, including tokens used for images, documents, your inputs, and the prompts from the beginning of the chat until the end. The tokens generated by the LLM must then fit into the remaining window or it will become incoherent. Example: context window of 8192 tokens; prompt uses 200 tokens, chat history uses 3000 tokens, document uploaded uses 1000 tokens, image uploaded uses 500 tokens:

remaining_context = context_window - used_context 
used_context = 200 + 3000 + 1000 + 500 
remaining_context = 8192 - 4700 
remaining_context = 3492 tokens

Those 4700 tokens in the used context are what would be contained in the prompt processing part of the compute. In practice, this is not really a big deal with context that rolls over since it is computed and cached, but when starting new chats or adding large amounts of tokens to chats, or modifying parts of chats from before, then you will have to deal with the time needed for prompt processing.

Prompt processing is bound by compute speed. That is, FLOPS (floating point operations per second) or TOPS (integer operations per second). You will can find these numbers in the spec pages for CPUs and GPUs, but the precision does matter. Precision is the amount of bits used for the numbers in the operations, such as 32bits for floats or 8bits for integers. This is essentially the number of decimal places in float numbers or the number of digits in integer numbers. Since we are dealing with quantized model weights the most important specs in this regard are going to be 8bit integers or INT8 TOPS and 16bit FLOPS or FP16 FLOPS. Maximizing these specs will ensure fast prompt processing. Since GPUs excel at parallel processing of the type of operation required by this, they are prized for this aspect of inference.

Token generation is the creation of new text by the LLM when given text to complete (a prompt, context, etc). This is bound by memory bandwidth. In order to generate a token, the entirety of the model weights must be sent to the processor to be computer, layer by layer. This means that whatever the size of the model that resides on your drive, that amount of data must be sent to the processor for every generated token. As you can imagine, this is a lot of data to move around and it almost always takes longer to get that data to the processor than it takes for the processor to process it. Because GPUs tend to have a lot of memory bandwidth, they are good at token generation. The fact that they are better than the CPU, however, is more a factor of most consumer motherboards and processors have abysmal RAM bandwidth because they lack memory channels. By adding an extra channel you add bandwidth equal to the base bandwidth of the system memory. This is essentially free if you are willing to split your RAM sticks into small enough units of two per channel. This sounds confusing but is actually really simple. [equation for computing ram bandwidth by PCxxx or DDRx specs] [simple example of adding channel to system and bandwidth increase]

1 Like

If you prefer the non-AI formatted version I am happy to submit that instead; I figured people would appreciate it being easier to read, even if it does look AI generated.

I was just curious if you “eat your own dog food” so to speak :slight_smile: I have no problem with the undulations the output went through prior to posting.

2 Likes

Part 2: Hardware

Continued


GPU selection infographic.

EDIT:

A Note on Prices

The pros and cons in the graphic were made with the current USA (as of July 2025) eBay prices in mind.

MSRP MEANS NOTHING FOR NVIDIA GPUS ANYMORE

You cannot base anything on what NVIDIA or reviews claim the MSRP to be. It just has no relevance in the current GPU market. This is unfortunate, but it is the reality.

3 Likes

Part 3: Base Knowledge

There are things that will be required or extremely helpful to know going forward. I recommend attempting to commit as much as you can to memory:

Terms and Jargon

Models are the whole of the AI unit: what a person is to humans, a model is to AI.

Weights are the files the AI is stored on. They come in different forms, but the kinds of weights we will be dealing with are quantized.

Quantization removes precision from model weights and makes them smaller. Using different methods, the numbers in the weights are converted to smaller representations, making them easier to store, easier to fit into memory, and faster. This can be thought of as a type of lossy compression. There are different methods of quantization, but the one we are going to focus on is GGML, specifically GGUF.

GGUF is a file format (and extension) for quantized model weights used by llama.cpp.

llama.cpp is an inference backend, and is the basis for a lot of wrappers like Ollama, llama-box, LM Studio, KoboldCpp, Jan, and many others. It can also be used by itself without a wrapper, though the wrappers usually have some features making it much easier to use.

Image projectors are additional model weights, stored in their own GGUF file, which give a language model vision capability.

Huggingface is a website that offers model weight storage and downloading, using a git system. It is the home of foundational models and finetunes.

Finetunes are models which have been trained further for a specific use case. Many finetunes are attempts to uncensor foundation models, with varying degrees of effectiveness.

Abliteration is the programmatic removal of a model’s ability to refuse requests, as opposed to removing it through additional training.

MCP is a protocol which allows models to call tools offered to them which can provide information or capabilities they do not have, like web search or file access.

Instruct training is used to turn a base model into an instruct (or “it”) model, one which can follow directions and chat. It requires an instruct template, which is a specific way of wrapping a prompt in tags that the model recognizes, informing it when turns begin and end, which parts of the chat came from the user and which from the model, how to call tools, and so on. Without instruct training a model will only complete the text you send it, since it does not know how turns work.
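To make the instruct template idea concrete, here is roughly what one common template family (ChatML) wraps around a single user turn before the model sees it. The exact tags differ from model to model, so treat this as illustrative rather than something to copy verbatim:

# A ChatML-style wrapping of one chat turn. Other model families use different tags,
# which is exactly why the template has to match the model.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "What is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"  # the model completes text from this point
)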

OpenAI is not just a company which makes a shitty chat model helmed by a giant psychotic asshole, but is a standard for sending requests to models serving an API. Think of it like USB: there is a standard which lets you plug pretty much anything into a computer, but what happens once the device talks to the computer is not the concern of USB; as long as the plug fits and the computer can talk to the device, the job of the USB layer is finished. The OpenAI standard is the USB of LLMs and lets you abstract away chat history, instruct templates, tool calling, and all that annoying crap and just send messages and get a response back.

Offloading is putting model layers onto the video card to be processed instead of the CPU.

Samplers are settings that control how the next token is chosen from the model’s output probability distribution. They directly affect the quality and character of the responses the model is able to give.

Context is the conversation log that fits into the model’s working memory. It contains the prompts, outputs, user inputs, and any media or documents given to the model to perform the task asked of it. There is a limited number of tokens that can fit here, and there is a growing performance penalty as it fills. In addition, there must be enough room left over for the model to generate an output.

GGUF Quants

Let’s get into some specifics about GGUFs. They have a specific naming convention which is more or less followed: FineTune-ModelName-ModelVersion-ParameterSize-TrainingType-QuantType_QuantSubType.gguf with the image projector having an mmproj prefix.

The parameter size is usually a number followed by a ‘b’ and indicates how many numbers are in the weights, with the ‘b’ meaning billions. Those parameters are each composed of a number of bits specified by the quant type.

Example:

Mistral-Small-3.1-24B-Instruct-2503-Q6_K.gguf

  • Mistral is the name of the AI company that created the model.
  • Small is the name of the model. Very creative, I know.
  • 3.1 is the version.
  • It has 24 billion parameters in it
  • It is instruct tuned
  • It was released in month 03 of year 25
  • Each parameter is a 6-bit integer (Q6), with some layers being slightly larger or smaller (K)

It is not important to know what the letters mean specifically after the Q, just that they are ways of adding more or less precision to certain parts of the model, and that K_M is generally the one you want if you can fit it.

How do you know if you can fit a model in your VRAM?

A quick rule of thumb is to multiply the parameter size by the quant precision and divide by 8 to get a number in GB. For instance 24b at Q6 would be (24 * 6) / 8 = 18GB. Hey, would you look at that, it is also the file size!

But of course you will need room for the context cache, so you need to add about 20% to that to get the VRAM required to offload it completely onto GPU.
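Here is that rule of thumb as a quick Python sketch (the 20% context overhead is the rough figure from above, not an exact number):

def fits_in_vram(params_billions: float, quant_bits: int, vram_gb: float) -> bool:
    # weights ~= parameters x bits per parameter / 8, in GB; add ~20% for the context cache
    weights_gb = params_billions * quant_bits / 8
    needed_gb = weights_gb * 1.20
    print(f"{params_billions}B at Q{quant_bits}: ~{weights_gb:.1f}GB weights, "
          f"~{needed_gb:.1f}GB with context")
    return needed_gb <= vram_gb

fits_in_vram(24, 6, 24)  # the Q6 24B example: ~18GB weights, ~21.6GB total, fits on a 24GB card
fits_in_vram(24, 6, 16)  # the same model will not fully offload onto a 16GB card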

GGUFs are sometimes composed of multiple files due to huggingface having a 50GB single file limit.

2 Likes

Getting there! About halfway done. Hopefully I can finish this by the end of August. No promises though.

Building a Server

We have established that you need a GPU, but you have to put a GPU into something.

The Ideal Server

The ideal server would be a new top-of-the-line workstation with AMD EPYC or Intel Xeon with maxed-out DDR5 channels and a 2KW power supply, but I don’t know anything about top-of-the-line hardware and only care about getting the cheapest, yet most capable box for the money. I will set an ideal all-encompassing target price of $1000, including the GPU(s), using current (as of July 2025) eBay prices for the completed system. The system will be composed of:

  1. At least one CPU
  2. 128GB or more of system RAM, in such distribution as to max out the channels
  3. Cooling
  4. Power supply capable of handling at least two LLM capable GPUs
  5. Solid state drive for the OS
  6. Solid state or regular drive for the model weights
  7. Wired networking
  8. Case

We could build it, but with used parts, sourcing everything and fitting it all together is a nightmare, and very rarely cost effective. We are left with older professional workstations made by Dell, HP, and Lenovo.

Caveats

These things are HEAVY. They are very picky about what hardware is inside them, including and especially RAM. They have proprietary base components like motherboards, coolers, PSUs, cases, and cabling. They go through extensive self-tests and have a long boot-up sequence. They are large and ugly. They are generally not used by hobbyists, and because they spend years under enterprise warranties you will find little in the way of community help for them; you will often be told to contact manufacturer support for any issue. They have a lot of strange foot-guns in the BIOS options.

BUT


They are much easier to work on than servers. They are much quieter than servers. They are much more reliable than consumer machines. They generally contain everything needed for a great LLM box. They are generally cheap and available in all sorts of configurations once they pass the enterprise warranty period and get dumped on the secondary market (though this can take quite a long time).

What To Avoid

These are the things I know about; I am sure there are many more. If you have ANY information that should be added here, especially if you had to figure it out yourself, then please reply to this thread with it. There is probably nowhere else someone would be able to find it!

Avoid the Precision 7920/7820/5820/5720 models from 2017.

Avoid any system that:

  • doesn’t have AVX2
  • doesn’t support DDR4 or DDR5
  • contains Intel efficiency cores (12th gen and higher may have them)
  • is marketed for regular business or internet tasks
  • is a laptop, is compact, mini, or all-in-one

Build

Let’s build an imaginary server with real parts. I will screenshot eBay listings to illustrate what I look for and for posterity.

Note: This is just an exercise, you will have to decide your own specifications and needs and what is available on the market, and plan accordingly.

Let’s start with the HP G4 Z series.

Here is one to check out:

Let’s search for the CPU.

SKIP.

Here is another one.

Passed the memory bandwidth check.

Let’s check for features. First, I see if I can find a picture with the serial number.

Then I put it into the HP support page.

Good for PSU. Good for PCIe slots.

See if we have enough room in the case for GPUs.

I do a search on the web for “HP z6 g4 rebar above 4g decode” to make sure it can support datacenter GPUs. A few Reddit posts confirm that it should be able to with a BIOS update. If it doesn’t, we can flash the ‘desktop’ mode in P40s or P100s to set their rebar low, but that is a last resort.

Now we need to scope out some memory. I do another web search and find this info. This is really good news. It means with six DIMM slots we get 6 channels!

I see on one of the photos that there are 3 filled slots, so we will take one out and add 4x32GB Registered DIMMs.

Now we need a GPU or two. What to do? Might as well go with the old, reliable, and boring 3060 12GB.

Let’s grab a cheap P100 for the extra VRAM.

Time to total her up:

Component                             Price (USD)
HP Z6 G4                              309.98
128GB RAM (4x32GB)                    140.00
3060 12GB                             230.00
P100 16GB                             138.00
NVMe 2TB                              100.00
NVMe->PCIe adapter                    15.00
Power adapter, fan, etc. for P100     50.00
Unforeseen expenses                   75.00
Total                                 1,057.98

By a nose!

2 Likes

P100 for half of what I paid for my P40s. :angry:

Great thread! Wish it came before I bought everything lol.

2 Likes

It’s the weekend, which means another section of the guide.


Building the Server

I will go through and build an AI server out of parts that I have and document the process.

Note: this part of the guide is going to assume that you are able to troubleshoot hardware issues yourself. It will not contain any troubleshooting information. The guide assumes you can handle finding out if hardware is incompatible or broken and to replace it as needed. It is not meant to be comprehensive but merely informative of the process.

All the parts have arrived! Now we get to do the real fun stuff.

The first step is to make sure it boots as-is. The first boot takes a while because it has to go through its checks and self-training again, and this will happen every time I add or remove a part. Once that is confirmed, the RAM goes in and then we set the BIOS. Reset to defaults first in the menu and reboot again. The most important settings here: CSM gets turned OFF and UEFI boot ON. Second is MMIO, or Above 4G Decode, which must be ENABLED. Set the rest of the options to your liking. After another reboot proves we didn’t break anything, shut it down, pull the GPU that came with it, and put the new (used) one in. If you have more than one GPU, pick one to be the primary, install it in the slot the original GPU occupied, hook up any power connectors needed, and leave the others out for now.

About cooling and power for P40/P100 style datacenter cards

If you have a P40 or P100 you will need to attach a fan to it. There are different ways of doing this. I 3D printed an adapter, attached a 75mm blower to it, and hooked that up to a temperature-sensing fan controller. Another 3D printed part holds the NTC temp probe on the card. The best place to mount the temp probe is at the front on top: it is metal, it is the hottest spot you can reach on the outside of the card, and it already has screw holes and screws you can use. On top of that you will need 8-pin PCIe to 8-pin EPS power adapters and a way to power the fan. You will have to find the adapters yourself; they are about $10-$15, and if you have limited PCIe power connectors you will need a splitter for them. You will need two 6-pins, an 8-pin split to two 6-pins, or two 8-pins for each P100 or P40 card in your system. For the fan, some people just pull the pins out of the fan connector and shove them into the 12V and GND inside a PCIe or molex power connector, but I cut apart a SATA power connector and soldered on headers for this one.

OS Install

Another note: this guide is not a comprehensive guide on Linux use. I assume you will understand things like command line and how to solve software problems.

Now we will install Ubuntu Server. Why Ubuntu? Because it works and I am used to it. I couldn’t give two shits about Linux distros, so if you are big on Arch or whatever, use that, but any instructions in the guide are for Ubuntu. Also, I am not going to install a desktop environment, because I don’t need one. Once it is working ssh will be the primary mode of administration and it won’t even have a monitor.

Download the latest point release of Ubuntu Server LTS and put it on a USB stick, using Rufus or something else to make it bootable.

When you install the OS, you will need to enable a few things, the rest is at your discretion:

  • OpenSSH
  • Third party drivers

Once the OS has installed and booted, make sure to update:

sudo apt update
sudo apt upgrade

At this point we can just SSH from another machine and do things there. Find the IP address of your LLM box—it will be given upon login or by typing

ip addr show

and you can ssh into it. I suggest setting it as a static IP on your router.

Now we will get the nvidia ecosystem sorted out. Shut down and put any additional GPUs in your box if you have them. When booted type:

sudo ubuntu-drivers devices

Find the recommended driver in the list and install it.

sudo apt install nvidia-driver-575

Reboot.

Next we install the CUDA toolkit. Go to NVIDIA’s CUDA Toolkit downloads page and find the instructions for Ubuntu, then type what they tell you into your terminal:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9
nano ~/.bashrc

Paste this at the end of .bashrc

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Now check that it’s all good:

source ~/.bashrc
nvidia-smi
nvcc -V

And it’s ready to go!

If you want to get an LLM running right now:

wget -O koboldcpp https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64
chmod +x koboldcpp

Note: If you use a P40 or P100 add this to the following commands:

--usecublas rowsplit

If you have 24GB VRAM:

./koboldcpp https://huggingface.co/bartowski/google_gemma-3-27b-it-GGUF/blob/main/google_gemma-3-27b-it-Q4_K_M.gguf --mmproj https://huggingface.co/bartowski/google_gemma-3-27b-it-GGUF/blob/main/mmproj-google_gemma-3-27b-it-f16.gguf --contextsize 8192 --gpulayers 999

If you have 12GB or 16GB VRAM:

./koboldcpp https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF/blob/main/google_gemma-3-12b-it-Q4_K_M.gguf --mmproj https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF/blob/main/mmproj-google_gemma-3-12b-it-f16.gguf --contextsize 8192 --gpulayers 999

When that loads, go to the server’s ip address with :5001 at the end in a web browser, and talk to your LLM!

EDIT:

Koboldcpp will work with any OpenAI compatible client by using the URL + port with /v1 at the end as the endpoint. Example: http://192.168.1.44:5001/v1
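For example, a minimal Python client hitting that endpoint could look like the sketch below. The IP, port, and model name are just the ones from this example setup, and I am assuming the standard OpenAI-style chat completions route; swap in your own values:

import requests

BASE_URL = "http://192.168.1.44:5001/v1"  # your KoboldCpp box; /v1 is the OpenAI-compatible base path

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "gemma-3-27b-it",  # KoboldCpp serves whatever it loaded; the name is mostly informational
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Give me a one-line summary of what a GGUF file is."},
        ],
        "max_tokens": 128,
        "temperature": 0.7,  # sampler settings ride along in the same request
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

Anything else that speaks the OpenAI standard (Cherry, Openwebui, your own scripts) points at the same base URL.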

EDIT 2: DO NOT INSTALL OLLAMA. It is a giant pain in the ass to get off of your system. You have been warned.

3 Likes

My AI plans have been on hiatus for a long time, but I’m circling back around to this. I set up ollama in a docker so removing it was easy.. but now I’m ready to start again. I’d like to keep things sequestered in docker containers as getting GPU pass-through was easy enough and I like the segmentation of docker.

I would love a follow-up to this guide on setting up an LLM platform with a hard bent toward acting as a decent intelligence for Home Assistant, particularly the voice assistant hardware modules they came out with. Basic things like being able to get the current time, weather, set alarms, action devices for set amounts of time like “Turn on bathroom light for 5 minutes”.. these are the top 99% of use cases we have for our Alexa and Google Home spyware, which I would love to replace outright.

2 Likes

Chiming in because I’ve been playing around with cheapo self hosted AI for a while and recently landed on a ghetto setup I’m very happy with.

software:

  • Gpt-oss 20b (ablated with heretic, so it doesn’t push back if the conversation brings us in silly places)
  • llama.cpp : I keep the server running 24/7 to have web UI and API available for whatever reason
  • Tailscale: this is personal use, so I don’t publish the server on the wider internet

hardware

  • 4 Ampere Altra cores
  • 24 GB RAM

Not super fancy, but free (as in beer) because it’s part of the OCI free tier :slight_smile:

performance

~6 tokens/sec. Not super fast, but usable and completely free, not even electricity bills.

The quality of gpt-oss 20b also seems to be VERY good in my experience so far.

All in all the best LLM is the one that you always have in your pocket and this setup, for me, meant using my own model as the default strategy for most daily things.

I have been using Ollama for about a month now and really enjoy the software. It makes communicating with (programming for) Mixtral-8x7B and other LLM models quite easy.

Is Ollama really that difficult to remove from a Linux PC? I would have assumed the opposite.

Either way, self-hosting different AI models is a lot of fun and has opened a new world of computer engineering for me. From intelligent automation to data analysis.

My dedicated AI hardware is as follows:

  • AMD Ryzen 5 5500
  • ASUS 5060 Ti 16GB
  • 64 GB of DDR4 3200 RAM
  • 1TB NVMe SSD

Nothing fancy or top-of-the-line, but it allows me to experiment and tinker with artificial intelligence in my own homelab. Which I think is priceless, especially long-term.

I haven’t used Ollama in a long time, but when I did (about a year or so ago), if you didn’t isolate it in a docker it set itself up as a service and ran in the background as soon as you installed it.

I get why people use(d) Ollama. It is easy; it abstracts away things like instruct templates and samplers and quants. It handles file management and has its own model library. It integrates with other easy solutions like Openwebui.

But it is a terrible solution for anyone who actually cares about running a local inference service as anything but a toy, for exactly those reasons.

The field is absolutely not mature enough, and the tooling not developed enough, for us to abstract away basic aspects of inference at this point. We don’t even have standards! That’s why there are so many damn instruct templates and samplers and such.

If you want to run a local inference server you have to learn these things whether you want to or not. If you don’t, you won’t know what you are doing.

You have a 5060ti 16GB, which is a current generation card with plenty of VRAM, along with 64GB of system RAM. Why are you running a 2 year old Llama 2 era model meant for memory and compute starved systems? It’s probably because having Ollama set it all up for you via its model library gave you no incentive to look at the current state of models.

I’m not trying to make things personal, but I want to really press the point that the paradigm Ollama is pushing is not appropriate for the state of this field.

And I want there to be full disclosure: I think it is a terrible company, and I dislike the people who run it and the way they get their funding. Even if it actually were a good product, I would still tell people not to use it, but I would be honest that the reason was my dislike rather than the product being bad. In this case, I dislike it, and it is a bad product.

1 Like

I’m trying to figure something out here.. I set up KoboldCPP in a docker with gpu passthrough and it’s working fine.. the model is small since it’s just one quantized and tuned for use with home assistant. it only takes up like 4GB of vram, but loading it takes 10 minutes at 100% CPU. The theory is that it takes this long to load because I have older CPUs that only support AVX1 not AVX2.. but ollama was loading much larger 10gb models into vram nearly instantaneously. What gives?

1 Like