New AI Release Thread

It has been proposed that I collect AI releases in a single thread for easy reference. I shall do that here.

2026, March: Qwen 3.5 Series

This week's hot new release is a big one: the smaller-parameter Qwen 3.5 series, which can run on consumer devices.

Background

This is a Chinese-developed vision-language model series with open licensing. The broader Qwen family also includes models for image generation, text-to-speech, embeddings, and more, but those are not part of this specific line.

Plato’s Opinions

Negatives:

  1. The Qwen line doesn’t have a great reputation for its writing or chat ability. It tends to write English like something that learned it from a textbook, and is overly formal.
  2. Has a very dry personality.
  3. Notorious for repetition.
  4. Tends toward overly long or useless thinking generations.
  5. Can sometimes think itself away from the right answer after having already gotten it.
  6. Overly censored.

Positives:

  1. Permissive licensing. I won’t comment further since I am not a legal expert, but if you use AI for anything beyond personal use, check the licenses.
  2. Large range of sizes allows running on anything from a Raspberry Pi to a server.
  3. Very good image recognition allows solving problems involving documents or images without first converting them to text.
  4. Breadth of support amongst inference engines.
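
To make the image-recognition point concrete, here is a hedged sketch of asking an image question to a locally hosted Qwen 3.5 VL instance, assuming it is served behind llama.cpp’s llama-server (which exposes an OpenAI-compatible API). The port, image path, and prompt are illustrative placeholders, not from the release notes.

```shell
# Placeholder "image" so the snippet is self-contained; swap in a real file.
printf 'not-a-real-jpeg' > photo.jpg

# Build an OpenAI-style chat request with the image inlined as base64.
IMG_B64=$(base64 < photo.jpg | tr -d '\n')
cat > request.json <<EOF
{
  "messages": [{
    "role": "user",
    "content": [
      { "type": "text", "text": "What does this document say?" },
      { "type": "image_url",
        "image_url": { "url": "data:image/jpeg;base64,${IMG_B64}" } }
    ]
  }]
}
EOF

# Send it (only does anything useful if a server is listening on 8001):
curl -s http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @request.json || echo "no server running on 8001"
```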

Conclusion:

Qwen can be a great choice when you need an answer to a solvable question, when you are performance-constrained, when working with images, for use with tools or as an agent, or when you need an AI to perform a task and don’t care about its quirks.

Links

[Main page]( Qwen3.5 - a Qwen Collection )

Since there are so many of them I won’t link to specific quants.

EDIT: forum software wants to convert my markdown links to its own link style for some reason.


2026, March: Nemo 3 Family

Another March release here, this time from Nvidia. Nvidia releases models often to demo possibilities and to destroy moats so they can sell more cards (maybe; I’m just guessing based on what seems completely obvious).

The new Nemo 3 models are a novel type of mixture of experts, where they used some clever tricks to fit more experts while running just as fast. They also increased the context window to 1M tokens and trained the model natively in NVFP4. Training at 4 bits instead of the usual 32 or 16 means quantization effects are minimized, so you get close to the same results at a huge speed, size, and memory savings. In fact, the accuracy appears almost identical to running at 16-bit precision.

Nemo 3 was trained to excel at agentic tasks, at following directions, and at reasoning its way through structured instructions. The 4-bit quant of the 120B ‘Super’ model fits completely into 64GB of system RAM and runs pretty snappily if you have a GPU to help with prompt processing. There is also a 30B model for people with less RAM, or who have to share it with other workloads.
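
A quick back-of-the-envelope check on those sizes (a rough sketch: real GGUF files carry metadata and keep some tensors at higher precision, so actual files run a bit larger):

```shell
# Rough weight-file size: parameters (billions) x bits-per-weight / 8 = GB.
size_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b / 8 }'; }

size_gb 120 16   # 120B weights at 16-bit -> 240 GB, hopeless on consumer RAM
size_gb 120 4    # 120B weights at 4-bit  -> 60 GB, squeezes into 64GB
size_gb 30  4    # 30B Nano at 4-bit      -> 15 GB
```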

Plato’s Opinion

The Nemo series has a reputation for pulling far more weight than other models in its size class, and is revered not only for tasks but for conversation and even role play. The ability to run at 4 bit with minimal loss should make this interesting to many people. Even if you don’t use this model, you might appreciate that the training data, the training code, the process, and the weights are all open and available for anyone to use.

Instructions

Open up a terminal and clone llama.cpp

git clone https://github.com/ggml-org/llama.cpp

Update or install the dev tools and build it

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Install uv if you don’t have it

curl -LsSf https://astral.sh/uv/install.sh | sh

Install the Hugging Face Hub CLI if you don’t have it

uv tool install huggingface_hub --with hf_transfer

Download the 120B model

hf download unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF \
    --local-dir ~/models/gguf/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF \
    --include "*UD-Q4_K_XL*"

Or the 30B

hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
    --local-dir ~/models/gguf/Nemotron-3-Nano-30B-A3B-GGUF \
    --include "*UD-Q4_K_XL*"

Deploy

./llama.cpp/llama-server \
    --model ~/models/gguf/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf \
    --alias "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B" \
    --prio 3 \
    --min-p 0.01 \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --port 8001
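
Once it is up, you can sanity-check it: llama-server exposes a /health endpoint plus an OpenAI-compatible chat endpoint. The port and alias below match the deploy command above; the prompt is just an example.

```shell
# Health check (reports failure instead of erroring if the server isn't up yet):
curl -s http://localhost:8001/health || echo "server not up yet"

# The server speaks the OpenAI chat API, so any OpenAI-style client works too:
cat > probe.json <<'EOF'
{
  "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
  "messages": [{ "role": "user", "content": "Say hello in one word." }],
  "temperature": 0.6
}
EOF
curl -s http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @probe.json || echo "chat endpoint not reachable"
```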