For those not keeping up with the latest in local LLM developments, a new model was released last week called GLM-4.7 Flash. It has llama.cpp support, and a Q4_K_M quant fits on a single 24GB card or two 12GB cards. It is a Mixture-of-Experts (MoE) model, which makes it fast: only a few experts are active per token, so the active parameter count is much smaller than the full model. That said, if you have to offload to CPU and your CPU is slow, it may not be great.
How to get it running
Get and build llama.cpp
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-server
cp llama.cpp/build/bin/llama-server ./
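You also need the quantized model file itself. A minimal sketch, assuming you pull it from a Hugging Face repo with huggingface-cli (the repo name below is a placeholder; substitute whichever upload of the Q4_K_M quant you actually use):
pip install -U "huggingface_hub[cli]"
# placeholder repo id -- point this at the real GGUF upload
huggingface-cli download <repo-with-the-gguf> GLM-4.7-Flash-Q4_K_M.gguf --local-dir ./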
Run
./llama-server --model GLM-4.7-Flash-Q4_K_M.gguf --threads -1 --fit on --temp 1.0 --top-p 0.95 --min-p 0.01 --ctx-size 16384 --port 8001 --jinja
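If the model does not fully fit in VRAM, you can keep only some layers on the GPU and run the rest on CPU. A minimal sketch, assuming a hypothetical count of 20 GPU layers (tune this for your card; lower it until you stop running out of memory):
# --n-gpu-layers caps how many layers are offloaded to the GPU; 20 is just an example
./llama-server --model GLM-4.7-Flash-Q4_K_M.gguf --n-gpu-layers 20 --threads -1 --temp 1.0 --top-p 0.95 --min-p 0.01 --ctx-size 16384 --port 8001 --jinja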
Open it up
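llama-server ships with a small built-in web UI, so you can point a browser at http://localhost:8001 and chat there. It also exposes an OpenAI-compatible API on the same port; a minimal sketch of a chat request (the prompt is just an example):
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a haiku about local LLMs."}], "temperature": 1.0, "top_p": 0.95}'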
Example generation:
