For those not keeping up with the latest in local LLM developments, a new model was released last week called GLM-4.7 Flash. It has llama.cpp support, and a Q4_K_M quant fits on a single 24GB card or two 12GB cards. It is a Mixture-of-Experts (MoE) model, which makes it fast: only a few experts are active per token, so the active parameter count is much smaller than the full model. That said, if you have to offload to CPU and your CPU is slow, it may not be great.
How to get it running
Get and build llama.cpp
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-server
cp llama.cpp/build/bin/llama-server ./
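You also need the quantized model file itself. A minimal sketch, assuming you pull it from a Hugging Face repo with huggingface-cli (the repo name below is a placeholder; substitute whichever upload of the Q4_K_M quant you actually use):
pip install -U "huggingface_hub[cli]"
# placeholder repo id -- point this at the real GGUF upload
huggingface-cli download <repo-with-the-gguf> GLM-4.7-Flash-Q4_K_M.gguf --local-dir ./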
Run
./llama-server --model GLM-4.7-Flash-Q4_K_M.gguf --threads -1 --fit on --temp 1.0 --top-p 0.95 --min-p 0.01 --ctx-size 16384 --port 8001 --jinja
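If the model does not fully fit in VRAM, you can keep only some layers on the GPU and run the rest on CPU. A minimal sketch, assuming a hypothetical count of 20 GPU layers (tune this for your card; lower it until you stop running out of memory):
# --n-gpu-layers caps how many layers are offloaded to the GPU; 20 is just an example
./llama-server --model GLM-4.7-Flash-Q4_K_M.gguf --n-gpu-layers 20 --threads -1 --temp 1.0 --top-p 0.95 --min-p 0.01 --ctx-size 16384 --port 8001 --jinja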
Open it up
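llama-server ships with a small built-in web UI, so you can point a browser at http://localhost:8001 and chat there. It also exposes an OpenAI-compatible API on the same port; a minimal sketch of a chat request (the prompt is just an example):
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a haiku about local LLMs."}], "temperature": 1.0, "top_p": 0.95}'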
Example generation:
