this post was submitted on 04 Oct 2024

34 points (87.0% liked)

Selfhosted

40133 readers

1045 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.
No spam posting.
Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it's not obvious why your post topic revolves around selfhosting, please include details to make it clear.
Don't duplicate the full text of your blog or github here. Just post the link for folks to click.
Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).
No trolling.

Resources:

selfh.st Newsletter and index of selfhosted software and apps
awesome-selfhosted software
awesome-sysadmin resources
Self-Hosted Podcast from Jupiter Broadcasting

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 1 year ago

MODERATORS

HybridSarcasm@lemmy.world

HybridSarcasm@lemmy.hybridsarcasm.xyz

Please suggest some good self-hostable RAG for my LLM. (lemmy.world)

submitted 1 month ago* (last edited 1 month ago) by Maroon@lemmy.world to c/selfhosted@lemmy.world

15 comments fedilink hide all child comments

A while ago, I had requested help with using LLMs to manage all my teaching notes. I have since installed Ollama and been playing with it to get a feel for the setup.

I was also suggested the use of RAG (Retrieval Augmented Generation ) and CA (cognitive architecture). However, I am unclear on good self hosted options for these two tasks. Could you please suggest a few?

For example, I tried ragflow.io and installed it on my system, but it seems I need to setup an account with a username and password to use it. It remains unclear if I can use the system offline like the base ollama model, and that information won't be sent from my computer system.

top 15 comments

sorted by: hot top controversial new old

[–] exu@feditown.com 6 points 1 month ago

I'm aware of these options to do RAG, though I'm not using any yet. Only SillyTavern for chat stuff

[–] Zelyios@lemmy.world 5 points 1 month ago

You can use h2ogpt which allows you to build a RAG choosing your documents without coding anything

[–] BaroqueInMind@lemmy.one 4 points 1 month ago* (last edited 1 month ago)

Why not use this and select whatever LLM to leverage as a RAG? It literally allows you to self host the model and select any model for both chat and RAG analysis. I have it set to Hermes3 8B for chat and a 1.3B Llama3 as the RAG.

[–] chagall@lemmy.world 3 points 1 month ago (1 children)

You should ask @brucethemoose@lemmy.world. He seems to know all about this stuff.

[–] brucethemoose@lemmy.world 16 points 1 month ago* (last edited 1 month ago) (1 children)

I have an old Lenovo laptop with an NVIDIA graphics card.

@Maroon@lemmy.world The biggest question I have for you is what graphics card, but generally speaking this is... less than ideal.

To answer your question, Open Web UI is the new hotness: https://github.com/open-webui/open-webui

I personally use exui for a lot of my LLM work, but that's because I'm an uber minimalist.

And on your setup, I would host the best model you can on kobold.cpp or the built-in llama.cpp server (just not Ollama) and use Open Web UI as your front end. You can also use llama.cpp to host an embeddings model for RAG, if you wish.

This is a general ranking of the "best" models for document answering and summarization: https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard

...But generally, I prefer to not mess with RAG retrieval and just slap the context I want into the LLM myself, and for this, the performance of your machine is kind of critical (depending on just how much "context" you want it to cover). I know this is !selfhosted, but once you get your setup dialed in, you may consider making calls to an API like Groq, Cerebras or whatever, or even renting a Runpod GPU instance if that's in your time/money budget.

[–] kwa@lemmy.zip 4 points 1 month ago* (last edited 1 month ago) (1 children)

I’m new to this and I was wondering why you don’t recommend ollama? This is the first one I managed to run and it seemed decent but if there are better alternatives I’m interested

Edit: it seems the two others don’t have an API. What would you recommend if you need an API?

[–] brucethemoose@lemmy.world 3 points 1 month ago* (last edited 1 month ago)

Pretty much everything has an API :P

ollama is OK because its easy and automated, but you can get higher performance, better vram efficiency, and better samplers from either kobold.cpp or tabbyAPI, with the catch being that more manual configuration is required. But this is good, as it "forces" you to pick and test an optimal config for your system.

I'd recommend kobold.cpp for very short context (like 6K or less) or if you need to partially offload the model to CPU because your GPU is relatively low VRAM. Use a good IQ quantization (like IQ4_M, for instance).

Otherwise use TabbyAPI with an exl2 quantization, as it's generally faster (but GPU only) and much better at long context through its great k/v cache quantization.

They all have OpenAI APIs, though kobold.cpp also has its own web ui.

[–] scrubbles@poptalk.scrubbles.tech 2 points 1 month ago (1 children)

I'm not 100% what you're asking for, but I use text-generation-webui for all of my local generation needs.

https://github.com/oobabooga/text-generation-webui

[–] brucethemoose@lemmy.world 1 points 1 month ago (1 children)

Text-generation-webui is cool, but also kinda crufty. Honestly a lot of the stuff is holdovers from what's now ancient history in LLM land, and it has (for me) major performance issues at longer context.

[–] scrubbles@poptalk.scrubbles.tech 1 points 1 month ago (1 children)

Anything better you know of? Most of my usage now with it is through its api

[–] brucethemoose@lemmy.world 1 points 1 month ago (1 children)

Uh, depends on your hardware and model, but probably TabbyAPI?

[–] scrubbles@poptalk.scrubbles.tech 1 points 1 month ago

Neat! I'll check it out!

[–] BitSound@lemmy.world 1 points 1 month ago

Not sure how ollama integration works in general, but these are two good libraries for RAG:

https://github.com/facebookresearch/faiss

https://pypi.org/project/chromadb/

[–] Antiochus@lemmy.one 1 points 1 month ago

I'm not sure how well it would work in a self-hosted or server-type context, but GPT4all has built in RAG functionality. There's also a flatpak in addition to the Windows, Mac and .deb installs.

[–] filister@lemmy.world -4 points 1 month ago

Why don't you build your own?