Local AI Server Guide

I am going to use this space to write up a comprehensive guide to getting a local AI server running.

With the field evolving constantly, it is incredibly difficult to keep your head above water when trying to learn all this yourself; the only practical way to do it is to be embedded in it. That is impractical for most people who are not working in the field professionally or academically, and the problem is compounded by who does populate the professional space: either advanced-degree holders with FAANG employers throwing recruiters at them (and no time to write how-tos), or hype-spewing grifters freshly demobbed from the crypto battlefields with all the SEO know-how needed to fill every Google search with bullshit.

As someone with a CV full of credentials reading ‘just good enough to convince people I know what I am doing, but not good enough to be an actual expert’, and with a long track record of dedicating myself to every new, highly complicated technical thing in front of me until I know enough about it to get bored and move on, I think I am quite suited to write such a guide.

I have a rough structure in my head, but I would like to field questions from people about what topics generally interest them in this field so that I can address them. But please don’t get into very specific use cases; I’d rather focus on broad areas that are applicable to many people and can serve as a foundation, rather than creating a comprehensive but narrow solution.

Now, for those of you interested in some hardware pr0n, here are some shots of my LLM inference rig.

Dell Precision T7610 with 3x P40 24GB GPUs.

16x 32GB sticks of PC3-14900L ECC RAM (512GB total) across 8 channels, for ~120GB/s of memory bandwidth. Not bad for a system older than a middle schooler.

I currently use it to run a completely open, unlogged, full service LLM on a google indexed domain. I am not going to tell you what it is here, that would be stupid. But if you want to send me a private message then, who knows?

4 Likes

I literally just ordered everything I need to make a similar unit. Just waiting for the server to arrive.

Dell R730, 128GB DDR4 RAM, 24TB of raw mass storage, 1TB boot SSD, dual Xeon E5-2680 v3, dual 24GB P40s.

What’s your software stack?

1 Like

Ubuntu for the OS, KoboldCpp for inference and Cherry for the front end.
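
For reference, here is a minimal sketch of talking to that kind of stack from a script. It assumes KoboldCpp’s default port (5001) and the OpenAI-compatible endpoint it exposes; adjust the host and port to match your own launch settings.

import requests

# Ask the locally hosted model for a completion via KoboldCpp's
# OpenAI-compatible chat endpoint (default port assumed to be 5001).
resp = requests.post(
    "http://localhost:5001/v1/chat/completions",
    json={
        "model": "local",  # KoboldCpp serves whichever model it was launched with
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])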

Part 1: Why Run AI Locally?

This is fundamentally not just a practical decision but a philosophical one. It is undeniable that our lives are dependent on technology. We cannot live in the modern world without it. What technology we choose to use and how we engage with it are the only choices we really have.

The technology market is dominated by data gathering and processing. Google and Meta make their money by collecting large amounts of data about their users and transforming it into information that advertisers pay for. They cannot exist without gathering this data, and everything they do is built around that. This trickles down to the smallest players in the technology markets: almost everything that can access the internet will collect data and attempt to send that data back home. This is not paranoia. This is established fact.

Unfortunately, this ‘surveillance capitalism’ cannot be meaningfully opposed by an individual. All we can do is make small decisions that try to break our dependence on the services that are fueling it.

The inevitable end-game of corporate-controlled, consumer-facing AI (AI is nothing new to corporations, which have been using it for over a decade to run their systems; what is new is consumers being able to interact with it directly) is that once a dominant player emerges and figures out a strategy to extract value from it, it will become integrated into the information collection and exchange system. Whether or not this is a bad thing is something people are likely to disagree on. I think it is.

Right now we are in about Phase 1 of the cycle where the public is offered things of real utility for free or for much less than the production cost, while the providers try to consolidate user bases and figure out monetization strategies. This is the time we need to start putting that utility to use, because regardless of whether you agree that the end-game is a bad one, it is still free, and it is still useful, and it is still only for a limited time.

Any person who values this utility will find it in their interest to take as much of it as they can and learn how to integrate it into their lives to maximize its usefulness. How one does this, however, depends on one’s view of that end-game. If you would like to be as independent as possible from the centrally controlled information-gathering market, then it is critical that you learn how to manage it yourself and run as much as possible locally. This guide is an attempt to help you get started doing that.

2 Likes

Did you use AI to write this or modify your original text?

I use it to copy edit and prune, since I can be overly verbose. I don’t use it to generate any new text.

1 Like

Part 2: Hardware

This section is about hardware needs for running a local LLM. The goals are assumed to be as follows:

  1. Effective dollars per unit of performance is the target
  2. A server or always-on system is desired
  3. Access to the server from inside and outside the network is desired
  4. Electricity is a factor in the costs
  5. You are proficient in hardware and software
  6. The desired function is inference, not training

Hardware Factors in LLM Performance

LLM inference has two distinct phases with different hardware requirements:

1. Prompt Processing (Compute-Bound)

Prompt processing is bound by compute speed: FLOPS (floating-point operations per second) for float math and TOPS (tera operations per second, usually quoted for integer math). Since we are dealing with quantized model weights, the most important specs are 8-bit integer throughput (INT8 TOPS) and 16-bit float throughput (FP16 FLOPS). GPUs excel at the parallel processing required, making them ideal for prompt processing.

  • Bottleneck is raw compute power (FLOPS/TOPS)
  • GPUs excel due to parallel processing while CPUs are terrible
  • Look for INT8 TOPS, FP16 FLOPS

This is why you should definitely have a GPU in your system for prompt processing, regardless of how much memory bandwidth your system has.

Example
Context window of 8192 tokens:

  • Prompt: 200 tokens
  • Chat history: 3000 tokens
  • Document: 1000 tokens
  • Image: 500 tokens
  • Total used: 4700 tokens for processing (if not cached)
  • Remaining: 3492 possible tokens for generation
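
Here is the example above as a quick Python calculation, plus a rough prompt-processing time estimate using the common rule of thumb of ~2 FLOPs per model parameter per prompt token. The model size and GPU throughput figures below are illustrative assumptions, not measurements.

def remaining_context(context_window, used_tokens):
    # Tokens left for generation after prompt, history, and attachments.
    return context_window - sum(used_tokens.values())

def prompt_processing_seconds(prompt_tokens, params_billions, effective_tflops):
    # Rough estimate: ~2 FLOPs per parameter per prompt token.
    flops_needed = 2 * params_billions * 1e9 * prompt_tokens
    return flops_needed / (effective_tflops * 1e12)

used = {"prompt": 200, "chat_history": 3000, "document": 1000, "image": 500}
print(remaining_context(8192, used))           # 3492 tokens left for generation
print(prompt_processing_seconds(4700, 8, 40))  # hypothetical 8B model on a ~40 TFLOPS GPU: ~1.9 s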

2. Token Generation (Memory-Bound)

Token generation is bound by memory bandwidth. To generate each token, the entire model must be sent to the processor layer by layer. This means that whatever the size of the model, that amount of data must be transferred for every generated token, and this data movement almost always takes longer than the actual processing. GPUs typically have far higher memory bandwidth than system RAM, while consumer motherboards are limited by having only two memory channels. Each additional channel adds bandwidth proportionally; on platforms that support more channels, populating them is cheap if you use more, smaller RAM sticks (up to two per channel) rather than a few large ones.

  • Bottleneck is memory bandwidth
  • More memory channels > faster memory speed
  • Dual CPUs double memory channels

Because of this, much older servers or professional workstations often outperform new consumer hardware at token generation because they have more memory channels.
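
To see why channels matter so much, here is a back-of-the-envelope sketch: since roughly the whole model has to be read once per generated token, memory bandwidth divided by model size gives an upper bound on generation speed. The model size and the P40 bandwidth figure are approximate assumptions.

def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    # Rough ceiling on generation speed from memory bandwidth alone.
    return bandwidth_gb_s / model_size_gb

model_gb = 8.0  # e.g. a ~13B model at 4-bit quantization (approximate)
print(max_tokens_per_second(51.2, model_gb))   # dual-channel DDR4-3200: ~6.4 tok/s
print(max_tokens_per_second(204.8, model_gb))  # dual CPUs, quad channel each: ~25.6 tok/s
print(max_tokens_per_second(347.0, model_gb))  # single P40's VRAM (~347 GB/s): ~43 tok/s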

Calculating Memory Bandwidth

Formula: bandwidth_gbps = (memory_speed_mhz × 2 for DDR × 64 bits per channel × channels) ÷ 8000, where the ÷ 8000 converts megabits per second into gigabytes per second

Example

DDR4-3200:

  • Base clock is 1600 MHz, or 1600 million clocks per second
  • DDR transfers data twice per clock, for 3200 million transfers per second
  • Multiply by 64 bits per transfer for 204,800 million bits per second
  • Divide by 8000 to get 25.6 gigabytes per second (GB/s)

Therefore, at DDR4-3200 we get:

  • Single channel: 25.6 GB/s
  • Dual channel: 51.2 GB/s
  • Quad channel: 102.4 GB/s
  • Dual CPUs, each quad channel: 204.8GB/s
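
Here is the same formula as a small Python sketch; it takes the rated transfer rate directly (e.g. 3200 for DDR4-3200), so the DDR doubling from the formula above is already baked in.

def bandwidth_gb_s(transfers_mt_s, channels, bus_bits=64):
    # Peak bandwidth in GB/s: transfers/s * bits per transfer * channels,
    # divided by 8000 to convert megabits per second into gigabytes per second.
    return transfers_mt_s * bus_bits * channels / 8000

for channels in (1, 2, 4, 8):
    print(f"DDR4-3200 x {channels} channel(s): {bandwidth_gb_s(3200, channels):.1f} GB/s")
# Prints 25.6, 51.2, 102.4, and 204.8 GB/s respectively.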

For a specific example, this is what I wrote and submitted, asking for it to be made more readable, given concrete examples, and formatted for skimming; the result is the comment above:

Part 2: Hardware

This section is about hardware needs for running a local LLM. The goals are assumed to be as follows:

  1. Effective dollars per unit of performance is the target, when that is the highest spot on the curve when the (performance, cost) function is plotted on a graph
  2. A server or always-on system is desired which will be configured and maintained by you
  3. Access to the server from inside and outside the network is desired
  4. Electricity is a factor in the costs
  5. You are proficient in hardware and software such that you can assemble complicated combinations of hardware along with software and
  6. You are proficient enough to be able to troubleshoot novel problems arising from such combinations
  7. The desired function is inference, not training

Hardware Factors in LLM Performance

With LLM inference there are two parts which require extensive compute resources:

  1. Prompt processing
  2. Token generation

Prompt processing is self-explanatory and is the processing of tokens before generating new ones. LLMs are text completion engines, and every generated token is reliant on all of the previous tokens, with each previous token adding to the compute cost quadratically. The context window is the best way to understand this, as this is the maximum number of tokens that the LLM is allowed to process or generate. This includes all of the previous tokens in the current chat, including tokens used for images, documents, your inputs, and the prompts from the beginning of the chat until the end. The tokens generated by the LLM must then fit into the remaining window or it will become incoherent. Example: context window of 8192 tokens; prompt uses 200 tokens, chat history uses 3000 tokens, document uploaded uses 1000 tokens, image uploaded uses 500 tokens:

remaining_context = context_window - used_context 
used_context = 200 + 3000 + 1000 + 500 
remaining_context = 8192 - 4700 
remaining_context = 3492 tokens

Those 4700 tokens in the used context are what would be contained in the prompt processing part of the compute. In practice, this is not really a big deal with context that rolls over since it is computed and cached, but when starting new chats or adding large amounts of tokens to chats, or modifying parts of chats from before, then you will have to deal with the time needed for prompt processing.

Prompt processing is bound by compute speed. That is, FLOPS (floating point operations per second) or TOPS (integer operations per second). You can find these numbers in the spec pages for CPUs and GPUs, but the precision does matter. Precision is the amount of bits used for the numbers in the operations, such as 32bits for floats or 8bits for integers. This is essentially the number of decimal places in float numbers or the number of digits in integer numbers. Since we are dealing with quantized model weights the most important specs in this regard are going to be 8bit integers or INT8 TOPS and 16bit FLOPS or FP16 FLOPS. Maximizing these specs will ensure fast prompt processing. Since GPUs excel at parallel processing of the type of operation required by this, they are prized for this aspect of inference.

Token generation is the creation of new text by the LLM when given text to complete (a prompt, context, etc). This is bound by memory bandwidth. In order to generate a token, the entirety of the model weights must be sent to the processor to be computed, layer by layer. This means that whatever the size of the model that resides on your drive, that amount of data must be sent to the processor for every generated token. As you can imagine, this is a lot of data to move around and it almost always takes longer to get that data to the processor than it takes for the processor to process it. Because GPUs tend to have a lot of memory bandwidth, they are good at token generation. The fact that they are better than the CPU, however, is more a factor of most consumer motherboards and processors having abysmal RAM bandwidth because they lack memory channels. By adding an extra channel you add bandwidth equal to the base bandwidth of the system memory. This is essentially free if you are willing to split your RAM sticks into small enough units of two per channel. This sounds confusing but is actually really simple. [equation for computing ram bandwidth by PCxxx or DDRx specs] [simple example of adding channel to system and bandwidth increase]

1 Like

If you prefer the non-AI formatted version I am happy to submit that instead; I figured people would appreciate it being easier to read, even if it does look AI generated.

I was just curious if you “eat your own dog food” so to speak :slight_smile: I have no problem with the undulations the output went through prior to posting.

2 Likes

Part 2: Hardware

Continued…

GPU selection infographic.

EDIT:

A Note on Prices

The pros and cons in the graphic were made with current US eBay prices (as of July 2025) in mind.

MSRP MEANS NOTHING FOR NVIDIA GPUS ANYMORE

You cannot base anything on what NVIDIA or reviews claim the MSRP to be. It just has no relevance in the current GPU market. This is unfortunate, but it is the reality.

3 Likes