Home Assistant with local Ollama AI

Yep, that’s the whole point (for me anyway) of running local LLMs on those GPUs. The Home Assistant guys really push HACS (home assistant cloud service) a lot, but you don’t need to give in.

  • I would explore deploying Home Assistant OS either locally on a Pi or as a virtual machine guest. I went the VirtualBox guest OS route on my big server.

  • Then deploy Open WebUI with Docker on your GPU server, and it will deploy Ollama and the web UI as two services via Docker. Update your Docker config so the NVIDIA GPUs can be accessed from inside the containers (see the sketch after this list).

  • Set up Home Assistant “Devices & integrations” and add the official Ollama integration. Point it at your server’s IP and port and it will pull in the available models… select the model you want to use and it will create a service.

  • Go to Voice assistants and change your conversation agent to the Ollama model you configured.
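For the Docker GPU piece, something like the Compose file below is a reasonable starting point. It’s only a minimal sketch: it assumes the NVIDIA Container Toolkit is already installed on the host, and the image tags, ports, and volume names are illustrative rather than the exact bundle you’ll end up with.

```yaml
services:
  ollama:
    image: ollama/ollama              # official Ollama image
    ports:
      - "11434:11434"                 # Ollama API; this is what the HA integration points at
    volumes:
      - ollama:/root/.ollama          # persist downloaded models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all              # expose all NVIDIA GPUs to the container
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"                   # web UI at http://<gpu-server-ip>:3000
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama:
```

Once it’s up, a quick `curl http://<gpu-server-ip>:11434/api/tags` from the Home Assistant box should list the models you’ve pulled; that’s the same address and port you give the Ollama integration.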

1 Like

Alright … now I need to buy a server :roll_eyes:

2 Likes

I FEEL YOU.

Well … I bought a ProLiant DL385p Gen8 … :roll_eyes: … we’ll see how much damage I can do :stuck_out_tongue_winking_eye:

2 Likes

So, after owning my rack for almost a year now, I wish I had gone with a less rack-oriented solution. Desktop servers seem to be quieter and have more options for expandability. But I’m not unhappy with my choice.

3 Likes

I get that, and I’m sure that someday I’ll feel the same, but right now I’ve got a real big-boy server.

And one day, when it’s long obsolete, I’ll bolt legs to it and use it as a coffee table.

3 Likes

I’m sure I’ll end up in the same place, but for what I paid for mine I couldn’t even buy the RAM that’s inside … it’s gonna be a good learner and donor when I figure out what I need/want and how to put it together :sweat_smile:

Noise is a worry, but I have a few ideas (and it’s part of the reason I went with a 2U :roll_eyes:)

I like the table idea :+1:
Put a clear top on it to see all the goodness.

Now I want it running like that.
(OK, I don’t have the time / inclination, but still…)
Oh no. What if you spilled a Coke on it? Bad thoughts, bad thoughts.

2 Likes

Oh wow, I didn’t expect to run into other HA users here. It seems I’ve found my people! :joy::joy:

2 Likes

:roll_eyes::sweat_smile:

5 Likes

shiny and chrome

5 Likes

Looks so much cooler than mine. All the good parts are covered by air baffles.

So I have a 1B model running under Ollama in a Windows VM on a three-year-old Dell laptop with decent performance. It spits out tokens faster than it could read them back.

Unless you need your home assistant to explain quantum mechanics, you likely don’t need any crazy hardware, or a GPU at all; you just aren’t going to be running the 40B models. I don’t want to spend $30 a month powering a giant server so HA can tell me the time.
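For a sense of how little is involved, querying a small local model is a single HTTP call to Ollama’s API. Here’s a minimal Python sketch; the 1B Llama 3.2 tag is just an example of a small model and is assumed to be already pulled, and localhost/11434 assume Ollama’s defaults.

```python
import requests

# Ask a small local model a simple question via Ollama's /api/generate endpoint.
# "llama3.2:1b" is only an example tag; substitute whatever 1B-class model you pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",
        "prompt": "What's a friendly way for a home assistant to announce the time?",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])  # the generated text
```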

The company that makes the satellite device I posted a picture of is coming out with an NVIDIA Jetson-based AI computer.

1 Like

Another option

What’s interesting is that this unit uses unified memory, so you aren’t limited by the RAM on your video card: you tell it how much memory to allocate to the GPU and how much to leave for the CPU. Of course you’re still at the mercy of the processor, so having enough RAM to load a 30B model isn’t really useful if you don’t have the processing power to do something with it.

It has OCuLink as well, so technically you could run an external GPU.

2 Likes

FYI, HACS is the Home Assistant Community Store (it lets you install homebrewed plugins), which can be really useful. It’s unrelated to Home Assistant Cloud (which I agree there’s no need to pay for, especially if you can handle things like port forwarding).

Really? If I’m running Home Assistant OS, how do I connect to HACS?

1 Like

You install it as an integration, after which it will appear in the sidebar (see here).

Please don’t use Ollama. They are a for-profit, venture-funded company that relies on open source for its engine while actively practicing Embrace, Extend, Extinguish: they add features to the engine without giving them back (they write them in a different language). They recently partnered with OpenAI, if that tells you anything. They’re just waiting for market dominance before swapping in their own closed-source engine and capturing all the value with a subscription service (this is conjecture, but it seems obvious that’s the plan).

There are many inference engines you can use in place of Ollama: llama.cpp, KoboldCpp (which has OpenAI- and Ollama-compatible APIs), or llama-box. Let me know if you want any specific advice.

My ultimate goal is to run decent local AI that can:

  • Have self-managed persistent memory. Open WebUI doesn’t have the self-managed part yet, but it did introduce persistent memory: I can put information about myself and the projects I’m working on into that memory, and every conversation I start will have a reference to those things that extends beyond the context window. Being able to tell it to remember something, the way you can with OpenAI, and have it update its own memory would be ideal.

  • Have a decent document-processing framework so I can include PDFs and emails in a knowledge store for it to reference. I’ve set this up many times with Ollama … it’s never worked properly, because it’s still very esoteric, complicated, and nuanced to do right, but the framework is there through Open WebUI.

I admit I don’t know whether these features are part of Ollama and merely exposed by Open WebUI, or whether they live on the Open WebUI side and Ollama can be swapped right out… but if that’s the case, then what proprietary aspects of the engine are being developed by Ollama and not shared back to the community?

  1. No engine will do what you need out of the box. This is a brand-new technology and isn’t consumer-facing yet. We are still figuring out whether RAG is good enough for stored memory (probably not on its own), and engines are engines. Open WebUI is not Ollama; it is a web UI for Ollama. An inference engine is what loads the model, takes input, and gives output, usually through an API on a TCP/IP port. That said, for a drop-in perpetual info database, KoboldCpp’s embedded web UI has this feature: you can dump text into it, and if it finds anything relevant in there it will add it to the context. If you want all of this to be transparent, I don’t think that exists at the moment without hacking it together yourself.

  2. Sounds like you were using RAG. That is always going to be a pain in the ass and will never work perfectly. But all you need for it is embedding capability and a database. At the end of the day, these are all just APIs, and any program that speaks that API will work. The good news is that the OpenAI API is pretty standard across backends, so with a little configuration, pretty much any solution that works with OpenAI will work with llama.cpp derivatives (llama.cpp, KoboldCpp, Ollama, llama-box, etc.); there’s a sketch of this just below.
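To make that concrete, here’s a minimal sketch using the official `openai` Python client pointed at a local backend. The base URL, model tag, and embedding model are assumptions for illustration; Ollama exposes an OpenAI-compatible API under `/v1` on its usual port, and llama.cpp’s `llama-server` and KoboldCpp do the same on their own default ports.

```python
from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible backend.
# Example base URLs: Ollama http://localhost:11434/v1,
# llama-server http://localhost:8080/v1, KoboldCpp http://localhost:5001/v1.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # local backends generally ignore the key, but the client requires one
)

# Chat completion against whatever model the backend has loaded (tag is an example).
reply = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Why is local inference useful for Home Assistant?"}],
)
print(reply.choices[0].message.content)

# Embeddings for a RAG database work the same way, if the backend supports the endpoint.
emb = client.embeddings.create(model="nomic-embed-text", input="text to embed")
print(len(emb.data[0].embedding))
```

Because it’s all the same API, you can swap the backend out later without changing the client code, which is the point being made above.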

The proprietary things Ollama develops are generally new-model support. Every new model (Qwen -> Qwen 2 -> Qwen 2 VL, Gemma, Gemma 2, Gemma 3, anything brand new) needs support written for it. Ollama, because it has employees and sometimes works directly with the model creators, adds support for new models and things like vision capability but doesn’t submit the changes back to llama.cpp, which it is a fork of. Their model library is also proprietary.

I know this whole ecosystem is complicated, and constantly changing. It can be a real burden to figure out. But I am absolutely sure anyone on this forum can do it if they want.

It really helps to know Python, by the way. At least a little.

1 Like