What you're proposing is essentially a combination of deciding how to package and cache the data, plus some prompt engineering. I admit I don't really know what you mean by 'NLP' in that context; I'm self-taught and sometimes don't know the proper terms for things.
If this is something you want to pursue, the code I wrote so far should be a good foundation. If you want, you can use an LLM to break down what it does and help with the unfamiliar syntax, but with your background I don’t see it being a huge challenge for you.
This is where my ignorance really shines, but what I mean is: the tool that's looking for "Home Depot receipts that amount to $300" might look at the first chunk and see that it has something to do with Home Depot, maybe even that it looks like it might be a receipt. So when it's done processing that chunk, it makes a note off to the side like "this chunk appears somewhat relevant to the query because it looks like it may be the start of a Home Depot receipt." For chunk two, the note generated might say "I didn't find any reference to Home Depot, but I did find an amount of $300, which is exactly what the query is looking for." Now you have two notes about the chunks and their relevance to the original query, and the tool can evaluate those notes to decide whether this particular document fits the query. I think a tool evaluating two notes about a document that say "this is a Home Depot receipt" and "I found an amount of $300" would clearly flag that document as extremely relevant.
Again, I'm ignorant about how this stuff works, but to me that seems like a really good approach for documents that are so large they need to be chunked.
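To make the idea above concrete, here is a minimal sketch of the two-step "note per chunk, then judge the notes" flow. All function names and prompt wording here are my own illustration, not the actual tool's code; `llm` stands in for whatever LLM API call the tool would make.

```python
# Step 1: ask the model for a short relevance note about each chunk.
# Step 2: judge the whole document from the notes, not the raw chunks.
# `llm` is a hypothetical callable: prompt string in, response string out.

def note_for_chunk(llm, query: str, chunk: str) -> str:
    """Generate a one- or two-sentence relevance note for a single chunk."""
    return llm(
        f"Query: {query}\n\nChunk:\n{chunk}\n\n"
        "In a sentence or two, note anything in this chunk "
        "that seems relevant to the query."
    )

def evaluate_document(llm, query: str, chunks: list[str]) -> str:
    """Decide relevance for the whole document using only the per-chunk notes."""
    notes = [note_for_chunk(llm, query, c) for c in chunks]
    summary = "\n".join(f"Note {i}: {n}" for i, n in enumerate(notes))
    return llm(
        f"Query: {query}\n\nNotes about each chunk:\n{summary}\n\n"
        "Based only on these notes, is this document relevant? Answer yes or no."
    )
```

The nice property is that the final judgment call only ever sees the short notes, so a document of any length fits in one evaluation prompt.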
If you can't read JSON: it's just a text format for passing structured data between APIs and services. Each chunk index is a separate page in this case, because the tools had to convert the pages to images in order for the LLM to read them (it's easier/faster/arguably better to send an image of the text to a modern vision-capable LLM than to run OCR on it and pass the extracted text along). The 'score' is assigned by the tool after evaluating each document/image and is a relevance score between 0 and 10. If the score is below 5, the document is marked as irrelevant and is not included in the data object, as seen in the link.
Sorry, I just realized that when I say 'document' I actually mean 'chunk of document', because the tool treats each chunk as its own data object, just like a document; so if any chunk in a document is deemed relevant for any reason, the relevant information gets passed along. As you can see in the data linked in my previous comment, it tends to cast a decently wide net when looking for relevant data. Sorry for the confusion; having worked on the thing, I tend to forget that other people don't know what I know and just throw around a bunch of assumed knowledge.
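Here's a small sketch of the scoring/filtering behavior described above. The field names (`chunks`, `chunk_index`, `score`, `note`) and the sample values are my guesses at the shape of the linked data, not a copy of it; the only detail taken from the description is the below-5 cutoff.

```python
import json

# Hypothetical example of what the tool's output JSON might look like.
raw = json.dumps({
    "chunks": [
        {"chunk_index": 0, "score": 8, "note": "looks like a Home Depot receipt"},
        {"chunk_index": 1, "score": 3, "note": "no relevant content found"},
        {"chunk_index": 2, "score": 6, "note": "total of $300 mentioned"},
    ]
})

THRESHOLD = 5  # per the description: scores below 5 are marked irrelevant

data = json.loads(raw)
relevant = [c for c in data["chunks"] if c["score"] >= THRESHOLD]
# Chunks 0 and 2 survive the filter; chunk 1 (score 3) is dropped,
# but the document as a whole still passes because at least one chunk did.
```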
Since discovering Frigate's face recognition and semantic search features (I tried Frigate a while ago but never got it working), Immich's image search, and a couple of other Home Assistant image recognition integrations, I decided to build out a "new" dedicated AI server to handle some of the workload.
Compute is on par with the RTX 3060s in my other server, but at half the wattage and with more VRAM per GPU. With three cards I get 48GB of VRAM total. Also, the T4 cards are Dell-compatible, meaning they work directly with the iDRAC for reporting, IPMI events, thermal fan control, etc.
Total cost (so far) has been around $3500, but that's because both RAM and GPUs are absolutely disgustingly overpriced right now. I probably paid at least $1000 over what the market rate was just a couple of years ago.
Now I'll be able to start testing and exploring more of the LLM integrations I've wanted to try, like RAG, document deep dives, etc., without fretting over VRAM being taken away from the "real" workloads that run on a daily basis.