I’m thinking we could use llama-box as the backend with the Cohere 7B model, and write a custom tool that searches through directories without a database, using extractous as the document parser. This is an unfamiliar workload for me, so I will need some time to experiment first. Does this sound OK to you?
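For the parsing piece, I’d expect the extractous side to be roughly this simple. A minimal sketch based on my reading of its docs (the exact return shape differs between versions, and the file path is just a made-up example):

```python
# Minimal sketch of the extraction step, assuming extractous's Python bindings.
from extractous import Extractor

extractor = Extractor()
# Depending on the extractous version this returns either the text itself or
# a (text, metadata) tuple, so check the release you install.
result = extractor.extract_file_to_string("receipts/example.pdf")  # hypothetical path
print(result)
```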
Awesome. I’m watching Python vids and I’ll dig into your links tomorrow. Excited
A quick review of the links looks promising. Using llama-box as the engine, the Command R7B agent concept (which looks closest to what I’m wanting in terms of LLMs using NLP to complete a task)… extractous to actually get the data from PDFs, emails, text files, etc.… this is looking very promising.
I made a little proof-of-concept script. It uses koboldcpp because I already had some tooling made for it, but it doesn’t install or modify anything on your system – it is a single executable you can just delete when you don’t want it anymore.
ooo shit that’s awesome… I’ll take a stab at it soon. The whole fam is sick except me so I’m playing both doctor and sexy nurse until it passes.
What do you think?
Note: This is in real time, and I am using a 35 billion parameter model, so it’s kinda slow.
(open gif in a new tab)
Repo:
Instructions:
Download files.
Install requirements.
Install uvicorn.
Edit config.py to point to your documents directory (see the sketch after these steps).
Run:
uvicorn main:app --host localhost --reload
Go into OpenWebUI admin panel, settings, tools. Add server http://localhost:8000
Click on + in new chat, select the tool.
Ask the model a question which has an answer in one of the documents in your documents directory.
Wait a really long time.
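For step 4, the config edit is just pointing a path at your docs. A minimal sketch of what that looks like (the variable name here is illustrative; check the actual config.py in the repo):

```python
# config.py (illustrative sketch; the real names in the repo may differ)
DOCUMENTS_DIR = "/home/you/Documents"  # directory the tool will search
```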
A little bit better…
I am continually updating the github repo. It is slowly going from ‘proof of concept made with scraps of things I had from other projects’ to ‘this might actually just work’.
Awesome. I’m now bedridden for the next few days; I’ll check it out asap though.
Get well soon!
It literally feels that way today… muscles so painful from trying to suppress coughs, like I’m in A Quiet Place.
COVID?
Might be… tests are negative but the tests are old… maybe expired.
Ok had a chance to re-watch the video and that is awesome. Questions…
- Would it handle a large number of documents such that each one is checked and dumped from memory to preserve the context window, or is that “not how any of this works!”? (Found in the GitHub repo readme: “The only data added to the ongoing context is the question, the tool response, and the model’s answer.”)
- Would it be faster per document if it was raw text data in a txt file vs a PDF, or does it matter?
- Can it create a collection of relevant documents, say “Show me all receipts that have amounts of $341.21”, and find the 3 documents with that total in them? Or possibly find documents that have totals within them that sum to $341.21 - though that seems like black magic to me.
- Can these workflows be queued… torn off into another thread that runs separately while I continue to converse in the main thread, giving more tasks and jobs? They don’t have to run in parallel, they can queue up and run sequentially… but getting answers back asynchronously would be amazing.
- How are cross-chunk relevance issues handled? For example, if my query is “Find a home depot receipt for $350.31” and the top of the document has “home depot” in chunk 1, and the very bottom has $350.31 which is in chunk 2… will the tool find it? Does relevance / weight carry across chunks?
- Answered in the GitHub readme
- Not really, the PDF-to-text parsing is trivial performance-wise
- Yes, that’s exactly what it is doing – it finds documents with the relevant information, collects all of that information together, and then gives it to the conversation model to answer. So as long as any document contains any relevant text, that text will get used in the answer
- No. You would be asking a question and then not getting an answer to it, then conversing about other things and getting an answer back sometime later. LLMs would not be able to deal with that
- This comes down to a fundamental problem with LLMs and context windows: if the information isn’t in the context window, it doesn’t exist. I set it to chunk at about 75% of the context window size because the output is included in the context and it needs some space for that as well. So if the document is significantly larger than the context window and the chunk doesn’t mention the thing you are asking about, it will not be correlated to the question. There is nothing that can be done about this, except to ask a thorough question and not have 20-page-long home depot receipts, I guess
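To make the 75% figure concrete, the chunking is basically a token-budget calculation. A rough sketch of the idea (the numbers and names are illustrative, not the repo’s exact code):

```python
# Rough sketch of the chunking budget described above (illustrative only).
CONTEXT_TOKENS = 8192    # example context window size
CHUNK_FRACTION = 0.75    # leave ~25% of the window for the model's output

def chunk_text(text: str, tokens_per_char: float = 0.25) -> list[str]:
    # tokens_per_char is a crude tokens-per-character estimate; a real
    # implementation might count tokens properly.
    max_chars = int(CONTEXT_TOKENS * CHUNK_FRACTION / tokens_per_char)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```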
Currently working on:
- Image capability (is this image relevant, get text from images)
- Restrict document types
- Sending information found in previous documents along with chunks to allow model to make connections between chunks
Suggestions welcome.
Now with image processing and non-text PDF processing. For instance, the implanting guide for professionals PDF is made of non-searchable/non-selectable text, so it will convert each page to an image and send it to the model to be processed as an image. It will also skip non-document formats, and the code has been cleaned up and polished.
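For anyone curious, the non-searchable-PDF path is conceptually just “render each page, send it as an image”. A rough sketch of that idea, assuming pdf2image for rendering and an OpenAI-compatible vision endpoint (the URL, prompt, and function name are made up for illustration; this is not the repo’s code verbatim):

```python
# Sketch of the page-as-image path (illustrative; assumes pdf2image is
# installed and the backend model accepts image input).
import base64
import io

import requests
from pdf2image import convert_from_path

def read_scanned_pdf(pdf_path: str,
                     api_url: str = "http://localhost:8080/v1/chat/completions"):
    pages = convert_from_path(pdf_path)  # one PIL image per page
    for i, page in enumerate(pages, start=1):
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        payload = {
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Extract the text from page {i}."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        }
        resp = requests.post(api_url, json=payload, timeout=300)
        yield resp.json()["choices"][0]["message"]["content"]
```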
I consider this a working tool at this point. Without any specific direction or other indications I am going to consider this a fulfillment of your original specifications.
It was built on llama-box and open-webui. It should work with any OpenAI-compatible backend, but I haven’t tested that.
Happy to walk you through getting it working, but as far as general system administration goes, I hesitate to give advice.
I’m wondering if it’s possible to create a sub-tool that could handle this… basically like calling a subroutine. The idea is that if the tool evaluates chunk one and sees there is partial relevance (something relates to the query but not perfectly, or not all the query elements are represented, e.g. “home depot” but not the amount), it stores this result as a kind of weight until all the chunks are processed. For each chunk that is relevant, it adds to the stored weight, and once every chunk is processed the tool consolidates these weights and says: okay, across all the chunks the relevance is quite high, so this document will be accepted as relevant. Is there some way to effectively cache or store these chunk results to be reconsolidated by the tool?
What if the weights are actually not numeric weights but NLP outputs, like “chunk 1 of this document contains the term ‘home depot’, which is relevant to the query”? Then the model could evaluate these stored notes after the whole document is chunked and processed, to see whether all the relevant aspects of the query are represented across the entire document’s chunks.
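To make the idea concrete, roughly what I’m picturing (purely a sketch; ask_model is a placeholder for however your tool actually queries the LLM):

```python
# Sketch of the proposed sub-tool: store per-chunk NLP findings ("weights"),
# then consolidate them once every chunk of the document has been processed.
def evaluate_document(chunks: list[str], query: str, ask_model) -> str:
    findings = []
    for i, chunk in enumerate(chunks, start=1):
        note = ask_model(
            f"Query: {query}\nChunk {i}: {chunk}\n"
            "State which parts of the query, if any, this chunk addresses."
        )
        findings.append(f"Chunk {i}: {note}")

    # Consolidation pass: judge relevance across the whole document.
    return ask_model(
        f"Query: {query}\nPer-chunk notes:\n" + "\n".join(findings) +
        "\nConsidering all chunks together, is this document relevant? Explain."
    )
```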