32
TheHobbyist
I see, so there is indeed a broader context beyond the burning alone: it also came with verbal hatred, and then possibly the location and the overall intention. I think this makes it clearer. Thanks
Not familiar with the guy himself, who maybe does deserve criticism and prison, but about the Quran burning: is it genuinely fair to sentence someone to prison for that? Is it equivalent to burning the cross? The Swedish flag? I might be missing a broader context, but I don't feel like someone burning my symbol or flag should be punished with prison. Am I alone? I would hate it, don't get me wrong, but I still feel it falls under freedom of expression.
The proud dad's name ends with Unis and the kid remembers the X first digits, hence Unix, hence Linux!
They do mention compatibility a lot. If they mean hardware, I agree with you. But perhaps they mean something else?
This. I will resume my recommendation of Bitwarden.
I didn't say it can't. But I'm not sure how well it is optimized for it. From my initial testing, it queues queries and submits them one after another to the model. I have not seen it batch compute the queries, but maybe it's a setup thing on my side. vLLM, on the other hand, is designed specifically for the concurrent multi-user use case and has multiple optimizations for it.
I run Mistral-Nemo (12B) and Mistral-Small (22B) on my GPU and they are pretty good. As others have said, GPU memory is one of the most limiting factors. 8B models are decent, 15-25B models are good and 70B+ models are excellent (solely based on my own experience). Go for q4_K models, as they will run many times faster than higher quantizations with little degradation in output quality. They typically come in S (Small), M (Medium) and L (Large); take the largest which fits in your GPU memory. If you go below q4, you may see more severe and noticeable degradation.
If you need to serve only one user at a time, ollama + Open WebUI works great. If you need multiple users at the same time, check out vLLM.
Edit: I'm simplifying it very much, but hopefully it is simple and actionable as a starting point. I've also seen great stuff from Gemma2-27B
Edit2: added links
Edit3: a decent GPU regarding bang for buck IMO is the RTX 3060 with 12GB. It may be available on the used market for a decent price and offers a good amount of VRAM and GPU performance for the cost. I would like to recommend AMD GPUs, as they offer much more GPU memory for their price, but not all of them are equally well supported by ROCm and I'm not sure about their compatibility with these tools, so perhaps others can chime in.
Edit4: you can also use Open WebUI with VS Code through the continue.dev extension, so that you can have a Copilot-type LLM in your editor.
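To make the "take the largest which fits in your GPU memory" rule a bit more concrete, here is a rough back-of-the-envelope sketch. The bits-per-weight numbers, the option names and the 1.5 GB of headroom for context/KV cache are all my own illustrative assumptions, not figures from any tool, so check the actual file sizes of the model you download.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def largest_fit(params_billion, vram_gb, options, headroom_gb=1.5):
    """Return the name of the highest-bit option whose weights fit, or None.

    headroom_gb is an assumed reserve for context and runtime overhead.
    """
    budget = vram_gb - headroom_gb
    fitting = [
        (bits, name)
        for name, bits in options.items()
        if weights_gb(params_billion, bits) <= budget
    ]
    return max(fitting)[1] if fitting else None


# Hypothetical option names with assumed, approximate bits per weight.
options = {"q4_small": 4.5, "q4_medium": 4.8, "q8": 8.5}

print(largest_fit(12, 12, options))  # 12B model on a 12 GB card
print(largest_fit(22, 12, options))  # 22B model on a 12 GB card
```

With these assumed numbers the 12B model fits at the medium q4 option, while the 22B model would not fit entirely on a 12 GB card.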
As you probably know, an LLM works iteratively: you give it instructions and it "auto-completes", one token at a time. Every time you want to generate the next token, you have to perform the whole inference task, which is expensive.
However, verifying whether a next token is the correct one can be cheap, because you can do it in parallel. For instance, take the sentence "The answer to your query is that the sky is blue due to some physical concept". If you wanted to check whether your model would output each one of those tokens, you would split the sentence after every token, batch verify the next token for every split, and see whether each predicted token matches the sentence.
Speculative decoding is the process where a cheap and efficient draft model is used to generate a tentative output, which is then verified in parallel by the expensive model. Because the cheap draft model is many times quicker, you can get a sample output very fast and batch verify the output with the expensive model. This saves a lot of computational time because all the parallel verifications require a single forward pass. And the best part is that it has zero effect on the output quality of the expensive model. The cost is that you now have to run two models, but the smaller one may be a tenth of the size, so it may run roughly 10x faster. The more closely the draft model's output matches the expensive model's output, the higher the potential inference speed gain.
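Here is a toy sketch of the idea, in case it helps. Everything in it is made up for illustration: the two "models" are plain Python functions (`target_model`, `draft_model`) that return a greedy next token, and the batched verification is simulated with a loop and counted as one pass. It only covers greedy decoding, not sampling.

```python
TARGET_TEXT = "the sky is blue due to some physical concept".split()


def target_model(context):
    """Stand-in for the expensive model: always continues TARGET_TEXT."""
    i = len(context)
    return TARGET_TEXT[i] if i < len(TARGET_TEXT) else "<eos>"


def draft_model(context):
    """Stand-in for the cheap model: agrees with the target except on one word."""
    token = target_model(context)
    return "green" if token == "blue" else token


def greedy_decode(model, max_tokens=32):
    """Baseline: one model call per generated token."""
    out = []
    while len(out) < max_tokens and (not out or out[-1] != "<eos>"):
        out.append(model(out))
    return out


def speculative_decode(target, draft, k=4, max_tokens=32):
    out, target_passes = [], 0
    while len(out) < max_tokens and (not out or out[-1] != "<eos>"):
        # 1. The draft model proposes k tokens, one after another (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every split. A real implementation would do
        #    this in one batched forward pass; the loop here simulates it.
        target_passes += 1
        verified = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the agreeing prefix, then take one token from the target:
        #    its correction at the first mismatch, or a bonus token.
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        out.extend(proposal[:n])
        out.append(verified[n])
        # Drop anything generated past the end-of-sequence marker.
        if "<eos>" in out:
            out = out[: out.index("<eos>") + 1]
    return out[:max_tokens], target_passes


tokens, passes = speculative_decode(target_model, draft_model, k=4)
print(" ".join(tokens))
print("target passes:", passes, "vs", len(greedy_decode(target_model)), "without a draft model")
```

The output is exactly what the expensive model would have produced on its own; the draft model only changes how many expensive passes are needed. When the draft guesses wrong (here "green" instead of "blue"), the tokens after the mismatch are thrown away and the expensive model's token is used instead.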
This is interesting. Need to check if this is implemented in Open-WebUI.
But I think the thing I'm hoping for most (in open-webui) is support for draft models for speculative decoding. This would be really nice!
Edit: it seems it's not implemented in ollama yet
Can't you return the laptop within 30 days if you don't like it? If that's the case, why don't you just go ahead, buy it and give it a reasonable shot? Nobody else's opinion will change how the laptop works for you :)
Indeed, totally an Apple approach to modularity: it is a proprietary Apple SSD...