don’t. just do it. you will be happier once you figure out how to use it. i avoided it for the longest time, but it’s really worth learning how to do it.
Spend that exploring-and-learning effort on Docker instead. You will be far better off for having done so. I updated the post above with more info. If you need further guidance, use this thread.
You don’t sound argumentative. Quite the opposite. And of course you need to optimize for your budget. But please believe me when I say that I am trying to help you do that. I will try and be specific about why I am making the hardware recommendations that I am for you.
First of all, what is AVX2?
AVX2 is an instruction set built into CPUs which allows parallel calculation of the matrix multiplications. It is so integral to getting inference performance out of a CPU that it is a core optimization in the llama.cpp kernels. Now, llama.cpp can work without it, but you are then running those calculations sequentially. The difference going from non-AVX2 to AVX2 when running inference on the same CPU is a 4x to 8x speed increase.
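To make the parallel-vs-sequential point concrete: an AVX2 register is 256 bits wide, so it holds 8 float32 values, and one fused multiply-add instruction does 8 of the multiply-accumulates a scalar loop would do one at a time. A toy back-of-envelope sketch (the matrix size here is made up for illustration):

```python
def scalar_macs(n):
    # scalar code: one multiply-accumulate instruction per element
    return n

def avx2_macs(n):
    # AVX2: 256-bit registers = 8 float32 lanes, so one FMA
    # instruction covers 8 multiply-accumulates (ceiling division)
    return -(-n // 8)

n = 4096 * 4096  # elements in one illustrative matmul
print(scalar_macs(n) / avx2_macs(n))  # 8.0 -> up to 8x fewer instructions
```

Real-world gains land in the 4x–8x range rather than a flat 8x, since memory bandwidth and the non-vectorized parts of the pipeline eat into the ideal ratio.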
Now, let’s look at your RAM upgrade. You are going from 24 8GB sticks at 1333 MT/s to 8 16GB sticks at 1600 MT/s.
24 sticks is 12 per processor, which populates all 8 channels (3 DIMMs per channel), since the R720 has 4 channels per CPU.
1333 MT/s x 64 bits per channel / 8 bits per byte x 4 channels per CPU x 2 CPUs / 1024 for gigabytes = 83.3125 GB/s
Now let’s do the upgrade. You are going to have 8 sticks at 1600 MT/s, which leaves you with 2 populated channels per CPU. So we get
1600 MT/s x 64 bits / 8 x 2 channels x 2 CPUs / 1024 = 50 GB/s
You will have 60% of the original memory bandwidth.
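The same arithmetic as a quick script, using the channel counts from the two configurations above:

```python
def bandwidth_gibs(mt_per_s, channels_total):
    # peak bandwidth: transfers/s x 8 bytes per 64-bit channel,
    # summed over populated channels, divided by 1024 for gigabytes
    return mt_per_s * 8 * channels_total / 1024

before = bandwidth_gibs(1333, 8)  # 24 sticks: 4 channels/CPU x 2 CPUs
after = bandwidth_gibs(1600, 4)   # 8 sticks: 2 channels/CPU x 2 CPUs
print(before, after, after / before)  # 83.3125, 50.0, ~0.60
```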
So you should save your money on the RAM. The CPUs are basically free so upgrade them if you can, I guess.
As far as the recommendation to max your VRAM: this is just a practical way to fit larger models without resorting to running them on the CPU, where the speed will be abysmal. The PCIe bandwidth is irrelevant – you aren’t transferring anything between GPUs except the final inference results, which are trivial in size, and you aren’t going to be transferring to and from the CPU because you have enough VRAM for the whole model. This will optimize your setup and is exactly what I ended up doing when I was running an E5 v2. I went down the same path you did and realized that doing anything on the CPU was just not worth it. I have since moved to a newer Xeon and hence don’t need the VRAM of the P40s, and can trade off for newer CUDA support and cooler, lower-power cards instead.
I don’t know why you are using Piper. It is two years old. You should run Kokoro – it should be faster.
Regarding Piper: I think it is a safe assumption that any documented process for local AI is going to be out of date unless it came out along with, or soon after, the tools and models it references. This is a major problem, obviously, but I don’t think anyone is going to do anything about it for a while at least. Generally, if the GitHub repo hasn’t been updated in the past month or two at the most, it is probably a good idea to look for something else. The reasoning isn’t so much ‘it’s old and doesn’t have the latest features’ as ‘it isn’t being maintained and is liable to break at any moment’. There isn’t much point spending time setting up, learning, and troubleshooting something in that case – best to bite the bullet and figure out how to do it with whatever has taken its place. That’s been my experience, at least.
But it’s had months of people banging on it at a minimum. Keep in mind my skill level; it’s critical to understanding some of the choices I make. I don’t want cutting edge – I can’t even comprehend bleeding edge. I want and need something people have banged on for long enough that ChatGPT can source enough relevant knowledge to push me through.
This is true, but it kinda misses the point. If I asked whether you wanted 12% of the money in Amal’s wallet or 53% of the money in Pilgrimsmaster’s wallet, you would reasonably need to know just how much money is in the wallets, right?
I had to run the numbers through ChatGPT to be precise-ish. But, given that my model sits entirely on the GPU, and with my other system specs,
For your exact model and stack:
Idle: ~2.1 GB
Typical use: ~3.7 GB
Hard peak: ~5 GB
128GB is just stupidly overkill, so losing a third of my original 192GB is a non-issue.
Now speed,
Per socket bandwidth:
1333 → ~42.6 GB/s
1600 → ~51.2 GB/s
+20% peak bandwidth.
Whisper observed effect:
1333 → 1600: ~6–10% faster transcription
Latency improvement, not just throughput
You will notice slightly quicker response time.
LLM Inference Effect:
~2–5% improvement only when context is large
Zero difference for short prompts
GPU-resident inference (Home Assistant, Piper TTS):
0–1% difference
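For what it’s worth, those per-socket figures fall out of the standard formula if you assume all 4 channels per socket stay populated and count in decimal GB – a quick check:

```python
def per_socket_gbs(mt_per_s, channels=4):
    # transfers/s x 8 bytes per 64-bit channel x channels, decimal GB/s
    return mt_per_s * 8 * channels / 1000

old, new = per_socket_gbs(1333), per_socket_gbs(1600)
print(old, new, new / old - 1)  # 42.656, 51.2, ~0.20 (+20% peak)
```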
In short, this is one of the very few things I can do (physically) to get an improvement, especially pairing CPU and RAM upgrades. Not exactly thrilling changes per dollar spent, but given that I’m on a fixed hardware platform, it’s all I can do really.
On dual GPUs: as long as I can fit the model onto a single GPU (no offload), my fastest arrangement is to keep it there. Splitting my model will always be slower. Granted, I could run bigger models, but I don’t need bigger, I need fast. With that said, I should be able to get a little speed bump if I can offload just the smaller workloads to a secondary GPU. The idea is basically to keep them out of the way of the LLM so it runs faster: CPU0 and GPU0 do nothing but LLM, CPU1 and GPU1 handle all other tasks, and their loads never interfere. Given my server constraints, I think that’s the fastest possible strategy. Realistically, a 2080, 2080 Super, or 2080 Ti would be about as much bang per buck as possible, and the 8–11GB of VRAM is actually overkill if it’s not handling any LLM inference at all.
But it’s abandoned. The repo has been closed. It is a dead project. I’m not going to try to talk you out of how to run your own server, but I will say that you might have a greater chance at success if you don’t repeat other people’s mistakes. I have made them, other people have made them, and there is knowledge being offered on how to avoid them. A repo which is in ‘archive mode’ is not ‘tested in battle to victory’ – the people who created it decided it wasn’t worth putting more work into it. It is ‘tested in battle to defeat’.
I honestly have no idea what you are trying to say with this.
You are spending money on hardware that is going to make your system demonstrably slower – by 40% in a specific area – with no gain. The memory is faster in MT/s, but removing half your RAM channels means you lose 50% of the bandwidth you have, and the higher clock only makes up about 10 points of that, leaving a 40% deficit. You are throwing money away and making your system slower. I don’t know how to make that more clear.
On dual GPUs, as long as I can fit the model onto a single GPU (no offload) then my fastest arrangement is to keep it there. Splitting my model will always be slower. Granted I can run bigger models, but I don’t need bigger, I need fast.
Splitting them only very slightly decreases speeds but not nearly as much as putting any of it on the CPU.
Realistically, a 2080, 2080 super, or 2080ti would be about as bang per buck as possible and the 8-11gb of vram is actually overkill if it’s not handling any LLM inference at all.
We have all sorts of models that want space on your GPU. We have LLM, embedding, vision, image generation, text to speech, speech to text… and another one just came out that can make songs for you. Who wouldn’t want that?
Me, me exactly. If it doesn’t contribute to the intended goal, it’s bloatware, a.k.a. deadweight. Awesome in a broad context, but for what I’m trying to accomplish, an unnecessary burden.
But it is, in fact, tested, understood, known. I need that. All software will be tested to defeat, then revised / updated / extended / replaced. ALL of it. I’m working on older stuff specifically because it’s known, and my skill level is currently incompatible with stuff that’s being actively explored / tested.
I get that. I kinda feel like we’re talking past each other – probably just from different frames of reference? Right or wrong, I think this is the path I’m going to pursue. However, I’m going to have both the old and new stuff. Might be fun to try and benchmark it? Maybe. I’ll have to think on it a bit to figure it out.
That’s the rub, isn’t it? I’m NOT putting anything on the CPU, so splitting the model WILL slow me down.
It seems we just have different ways of solving problems, and even though I don’t understand your reasoning, it isn’t my job to tell people what to do with their own equipment. If you need help with anything specific, let me know.
I really do want to make it clear though – the money you are spending on the RAM upgrade is being completely wasted. Not just in the sense that ‘it won’t help to have faster RAM’, but in the sense that you are making your system slower. Imagine that a RAM channel is a carpool lane: you can’t use it with one person in the car, but if you have a passenger you can use the lane AND you get to take twice as many people in one trip. So 2 RAM sticks = a carpool lane = twice as much data per trip. By reducing the number of RAM modules, instead of 8 carpool lanes you can only use 4. You have reduced the amount of data you can transfer per trip by half. This has nothing to do with how much total RAM you have.
deployment (with docker) was a snap.. it’s very fast using my GPU.. and the available voices sound really, really nice.. and it has a fun web interface for testing voices and generating sample audio (which you can download too!)
the voices that come with it are really nice.. but it looks like there is a training studio too. planning on messing with that later.
regarding what you said about the pause and Piper.. i guess if you are planning on interacting conversationally with an LLM, then yes, long text responses could create a long pause just from speech generation. in that case Kokoro with GPU support would be much faster.
Went to help a friend move stuff that he got from an estate sale today. Saw a box for a 4070 gpu and was told it had been put in a beefy PC that they had spent 5K on, but would be selling. Just taking a chance, I asked if there were any leftover parts, cause gamers tend to upgrade and that can leave extra stuff behind.
Anywho, I came home with a 3060 ASUS (in a weird shorty dual fan configuration). For 150 dollars.
I put in the new CPUs, RAM, and GPU. It would not start – it hangs at the Dell splash screen. (It does configure memory and initialize iDRAC first, and I have full iDRAC access.)
I’ve tried updating the BIOS and the iDRAC/Lifecycle Controller (it’s a package deal), resetting iDRAC, clearing NVRAM with the jumper, and putting the original CPU/RAM back in. I removed both GPUs, took it down to one CPU and one stick of RAM (the originals that I know work), and even yanked all the riser cards, disconnected the hard drives, and unplugged the CD drive.
It always gets to the splash screen, and depending on what I’ve tried lately, sometimes it tries to go into the Lifecycle Controller (and hangs up), or I can manually try to enter Setup (F2, and it still hangs up), or it just hangs on the splash screen.
I’ve tried everything I can think of, short of first law of engineering.
In the depths of my despair last night I walked away with it still powered up. A couple of hours later, it had gotten itself to the point where it might have been trying to boot from the disks (which were unplugged).
So, I’m trying an experiment. I put all the new stuff back in (the proverbial 10lbs in a 5lb bag) and I’m gonna let it percolate and see if it can just “work its shit out”.