For a specific example, this is what I wrote and submitted, asking for it to be made more readable, to have concrete examples added, and to be formatted for skimming, which resulted in the comment above:
Part 2: Hardware
This section is about hardware needs for running a local LLM. The goals are assumed to be as follows:
- Effective dollars per unit of performance is the target, i.e. the highest point on the curve when performance is plotted against cost
- A server or always-on system is desired which will be configured and maintained by you
- Access to the server from inside and outside the network is desired
- Electricity is a factor in the costs
- You are proficient enough with hardware and software to assemble complicated combinations of both, and
- You are proficient enough to troubleshoot the novel problems that arise from such combinations
- The desired function is inference, not training
Hardware Factors in LLM Performance
With LLM inference there are two parts which require extensive compute resources:
- Prompt processing
- Token generation
Prompt processing is self-explanatory: it is the processing of the existing tokens before new ones are generated. LLMs are text completion engines, and every generated token depends on all of the previous tokens, with the compute cost growing quadratically with the length of the context. The context window is the best way to understand this, as it is the maximum number of tokens that the LLM is allowed to process or generate. This includes all of the previous tokens in the current chat: tokens used for images, documents, your inputs, and the prompts, from the beginning of the chat until the end. The tokens generated by the LLM must then fit into the remaining window, or the output becomes incoherent.

Example: context window of 8192 tokens; the prompt uses 200 tokens, chat history uses 3000 tokens, an uploaded document uses 1000 tokens, and an uploaded image uses 500 tokens:
remaining_context = context_window - used_context
used_context      = 200 + 3000 + 1000 + 500 = 4700
remaining_context = 8192 - 4700 = 3492 tokens
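If it helps to see that bookkeeping as code, here is a minimal sketch of the same arithmetic. The 8192 window and the token counts are just the numbers from the example above, and the function name is made up for illustration:

```python
# Minimal sketch of the context budget arithmetic from the example above.
CONTEXT_WINDOW = 8192  # maximum tokens the model may process or generate

def remaining_context(*used_token_counts: int) -> int:
    """Tokens left for generation after everything else is counted."""
    return CONTEXT_WINDOW - sum(used_token_counts)

# prompt, chat history, uploaded document, uploaded image
print(remaining_context(200, 3000, 1000, 500))  # -> 3492
```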
Those 4700 tokens of used context are what the prompt processing part of the compute has to work through. In practice this is not a big deal for context that simply rolls forward, since it is computed once and cached, but when you start a new chat, add a large amount of tokens to a chat, or modify earlier parts of a chat, you will have to wait for prompt processing.
Prompt processing is bound by compute speed, that is, FLOPS (floating point operations per second) or TOPS (tera operations per second, usually quoted for integer math). You can find these numbers on the spec pages for CPUs and GPUs, but precision matters. Precision is the number of bits used to represent the numbers in the operations, such as 32 bits for floats or 8 bits for integers; it is essentially the number of decimal places in a float or the number of digits in an integer. Since we are dealing with quantized model weights, the most important specs here are 8-bit integer throughput (INT8 TOPS) and 16-bit float throughput (FP16 FLOPS). Maximizing these specs will ensure fast prompt processing. Since GPUs excel at the massively parallel math this requires, they are prized for this aspect of inference.
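For a rough sense of scale, a dense transformer needs on the order of 2 × (parameter count) floating point operations per prompt token, so a back-of-the-envelope estimate looks like the sketch below. The 2N rule of thumb, the 50 TFLOPS figure, and the 50% efficiency factor are assumptions for illustration, not measurements:

```python
# Back-of-the-envelope estimate of compute-bound prompt processing time.
# Assumes the common ~2 * parameters FLOPs-per-token approximation for a
# dense transformer forward pass; real hardware rarely hits peak throughput.

def prompt_processing_seconds(prompt_tokens: int,
                              model_params: float,
                              peak_flops: float,
                              efficiency: float = 0.5) -> float:
    flops_needed = 2 * model_params * prompt_tokens
    return flops_needed / (peak_flops * efficiency)

# 4700-token prompt, 7B-parameter model, GPU rated at ~50 TFLOPS FP16
print(prompt_processing_seconds(4700, 7e9, 50e12))  # ~2.6 seconds
```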
Token generation is the creation of new text by the LLM when given text to complete (a prompt, context, etc.). This is bound by memory bandwidth. In order to generate a token, the entirety of the model weights must be sent to the processor to be computed, layer by layer. This means that whatever the size of the model sitting on your drive, that amount of data must be moved to the processor for every generated token. As you can imagine, that is a lot of data to move around, and it almost always takes longer to get the data to the processor than it takes the processor to process it. Because GPUs tend to have a lot of memory bandwidth, they are good at token generation. The fact that they beat the CPU, however, is largely because most consumer motherboards and processors have abysmal RAM bandwidth due to having too few memory channels. Every extra channel adds bandwidth equal to the base bandwidth of the memory, essentially for free, as long as you are willing to split your total RAM across enough smaller sticks to populate every channel. This sounds confusing but is actually really simple; the formula and a quick worked example follow below.
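Here is that formula and example as a small sketch. Per channel, DDR bandwidth is the transfer rate in MT/s times 8 bytes (a 64-bit bus); a module name like PC4-25600 is just that per-channel figure in MB/s. The DDR4-3200 speed and the ~4.7 GB quantized model size are assumptions chosen for illustration:

```python
# RAM bandwidth from spec-sheet numbers, and the effect of adding channels.
# Per channel: transfer rate (MT/s) * 8 bytes (64-bit bus). A module name
# like PC4-25600 is simply that per-channel bandwidth in MB/s (3200 * 8).

def ram_bandwidth_gbs(transfer_rate_mts: int, channels: int) -> float:
    return transfer_rate_mts * 8 * channels / 1000  # GB/s

print(ram_bandwidth_gbs(3200, 2))  # DDR4-3200, dual channel  -> 51.2 GB/s
print(ram_bandwidth_gbs(3200, 4))  # same speed, quad channel -> 102.4 GB/s

# Rough ceiling on token generation: every token streams the whole model
# past the processor, so tokens/s is at most bandwidth / model size.
model_size_gb = 4.7  # e.g. a ~7B model at 4-5 bit quantization (assumption)
print(ram_bandwidth_gbs(3200, 2) / model_size_gb)  # ~10.9 tokens/s
```

Doubling the channel count doubles the bandwidth from the same memory speed, which is why workstation and server platforms with 4, 8, or 12 channels can be respectable at token generation even without a GPU.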