New, fast, local LLM kicks ass

For those not keeping up with the latest in local LLM developments, a new model was released last week called GLM-4.7 Flash. It has llama.cpp support, and a Q4_K_M quant can fit in a 24GB card or two 12GB cards. It is an MoE, which makes it fast: only a small set of experts is active per token, so far fewer parameters are read per step than the full model size suggests. That said, if you have to offload it to CPU and your CPU is slow, it may not be great.

How to get it running

Get and build llama.cpp

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-server
cp llama.cpp/build/bin/llama-server ./
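
Optional sanity check before moving on: make sure the binary runs and your GPUs are visible (exact output varies with the llama.cpp build and driver version).

./llama-server --version   # prints the build number/commit it was compiled from
nvidia-smi                 # the card(s) you plan to load onto should show up here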

Download the weights.
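
Any GGUF host works. As one sketch, huggingface-cli can pull a single file; the repo path below is a placeholder for whichever repo actually hosts the Q4_K_M quant you want.

pip install -U "huggingface_hub[cli]"
# <org>/GLM-4.7-Flash-GGUF is a placeholder -- point it at the repo hosting the quant you want
huggingface-cli download <org>/GLM-4.7-Flash-GGUF \
    GLM-4.7-Flash-Q4_K_M.gguf --local-dir .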

Run

./llama-server --model GLM-4.7-Flash-Q4_K_M.gguf \
    --threads -1 --fit on \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    --ctx-size 16384 --port 8001 --jinja
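
The server also speaks the OpenAI-style chat API on the same port, so you can check it from the shell before touching the browser. The prompt below is just an example, and since llama-server answers with whatever model it has loaded, a model field isn't strictly needed here.

curl http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Give me a one-line summary of what an MoE model is."}]}'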

Open it up

Point a browser at http://localhost:8001 and you get llama-server's built-in chat UI with the model loaded.

(Screenshot: example generation from GLM-4.7 Flash in the web UI.)


Awesome!

It’s been a while since I walked away from my setup, and ya know what, I’m ready to try again.

Also… Snow day.

Anywho, thanks for the bump. I’m looking at wiping and restarting my whole project, and this model looks almost ideal for me. (Also found an abliterated version 👍)


If you get stuck, spend $20 on an Anthropic Pro subscription and install Claude Code, then fire it up and tell Claude to fix it.
