With quantized LLMs now available on Hugging Face, and AI ecosystems such as H2O, Text Generation WebUI, and GPT4All letting you load LLM weights on your own computer, you now have an option for free, flexible, and secure AI. Here are nine of the best local/offline LLMs you can try right now!
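If you want to see how simple local inference can be, here is a minimal sketch using the GPT4All Python bindings; the model filename is just an example from the GPT4All catalog, and any listed model works the same way:

```python
# Minimal local-inference sketch with the gpt4all package (pip install gpt4all).
# The filename is illustrative; the first call downloads the weights if needed.
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
with model.chat_session():
    reply = model.generate("Explain 4-bit quantization in one sentence.", max_tokens=100)
    print(reply)
```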
1. Hermes 2 Pro GPTQ
Hermes 2 Pro is a state-of-the-art language model fine-tuned by Nous Research. It uses an updated and cleaned version of the OpenHermes 2.5 dataset, along with a newly introduced Function Calling and JSON Mode dataset developed in-house. This model is based on the Mistral 7B architecture and has been trained on 1,000,000 instructions/chats of GPT-4 quality or better, primarily synthetic data.
| Model | Hermes 2 Pro GPTQ |
|---|---|
| Model Size | 7.26 GB |
| Parameters | 7 billion |
| Quantization | 4-bit |
| Type | Mistral |
| License | Apache 2.0 |
Hermes 2 Pro on Mistral 7B is the new flagship 7B Hermes model, offering improved performance across various benchmarks, including AGIEval, BigBench Reasoning, GPT4All, and TruthfulQA. Its enhanced capabilities make it suitable for a wide range of natural language processing (NLP) tasks, such as code generation, content creation, and conversational AI applications.
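As a rough sketch of how a GPTQ build like this is loaded, the standard transformers API works once the optimum and auto-gptq packages are installed; the repo id below is an assumption, so substitute the exact Hermes 2 Pro GPTQ repo you download:

```python
# Hedged sketch: loading a 4-bit GPTQ checkpoint with transformers.
# Requires: pip install transformers optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Hermes-2-Pro-Mistral-7B-GPTQ"  # assumed repo id; verify on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Return a JSON object describing a book.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```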
Download: Hermes 2 Pro GPTQ via Hugging Face
2. Zephyr 7B Beta
Zephyr is a series of language models trained to act as helpful assistants. Zephyr-7B-Beta is the second model in the series, fine-tuned from Mistral-7B-v0.1 using Direct Preference Optimization (DPO) on a mix of publicly available, synthetic datasets.
| Model | Zephyr 7B Beta |
|---|---|
| Model Size | 7.26 GB |
| Parameters | 7 billion |
| Quantization | 4-bit |
| Type | Mistral |
| License | MIT |
By removing the in-built alignment of the training datasets, Zephyr-7B-Beta demonstrates improved performance on benchmarks like MT-Bench, enhancing its helpfulness in various tasks. However, this adjustment may lead to the generation of problematic text when prompted in certain ways.
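Zephyr expects its own <|system|>/<|user|>/<|assistant|> chat format, which the tokenizer's chat template can build for you; a quick sketch using the unquantized checkpoint's tokenizer:

```python
# Sketch: building a Zephyr-format prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "system", "content": "You are a concise, helpful assistant."},
    {"role": "user", "content": "Summarize Direct Preference Optimization in two sentences."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the formatted string to feed to your local inference backend
```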
Download: Zephyr 7B Beta via Hugging Face
3. Falcon Instruct GPTQ
This quantized version of Falcon is based on the decoder-only architecture fine-tuned on top of TII's raw Falcon-7B model. The base Falcon model was trained on an impressive 1.5 trillion tokens sourced from the public internet. As an instruction-tuned, decoder-only model licensed under Apache 2.0, Falcon Instruct is perfect for small businesses looking for a model to use for language translation and data entry.
| Model | Falcon-7B-Instruct |
|---|---|
| Model Size | 7.58 GB |
| Parameters | 7 billion |
| Quantization | 4-bit |
| Type | Falcon |
| License | Apache 2.0 |
However, this version of Falcon is not ideal for fine-tuning and is meant for inference only. If you want to fine-tune Falcon, you will have to use the raw model, which can require access to enterprise-grade training hardware such as NVIDIA DGX or AMD Instinct AI accelerators.
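For inference, though, a plain text-generation pipeline is enough; a minimal sketch against the base instruct repo (older transformers releases needed trust_remote_code=True for Falcon; recent ones support the architecture natively):

```python
# Inference-only sketch for Falcon-7B-Instruct with a transformers pipeline.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    device_map="auto",  # place weights on GPU automatically if available
)
result = generate("Translate to French: The meeting is at noon.", max_new_tokens=40)
print(result[0]["generated_text"])
```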
Download: Falcon-7B-Instruct via Hugging Face
4. GPT4All-J Groovy
GPT4All-J Groovy is a decoder-only model fine-tuned by Nomic AI and licensed under Apache 2.0. It is based on the original GPT-J model, which is known to be great at text generation from prompts, and has been fine-tuned as a chat model, which makes it great for fast and creative text generation applications. This makes GPT4All-J Groovy ideal for content creators, assisting them with writing and creative work, whether it be poetry, music, or stories.
| Model | GPT4All-J Groovy |
|---|---|
| Model Size | 3.53 GB |
| Parameters | 6 billion |
| Quantization | 4-bit |
| Type | GPT-J |
| License | Apache 2.0 |
Unfortunately, the base GPT-J model was trained on an English-only dataset, which means even this fine-tuned GPT4All-J model can only chat and generate text in English.
Download: GPT4ALL-J Groovy via Hugging Face
5. DeepSeek Coder V2 Instruct
DeepSeek Coder V2 is an advanced language model that enhances coding and mathematical reasoning capabilities. It supports an extensive range of programming languages and offers an extended context length, making it a versatile tool for developers.
| Model | DeepSeek Coder V2 Instruct |
|---|---|
| Model Size | 13 GB |
| Parameters | 33 billion |
| Quantization | 4-bit |
| Type | DeepSeek |
| License | DeepSeek Model License |
Compared to its predecessor, DeepSeek Coder V2 shows significant advancements in code-related tasks, reasoning, and general capabilities. It expands support for programming languages from 86 to 338 and extends the context length from 16K to 128K tokens. In standard benchmark evaluations, it outperforms models like GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.
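Here is a hedged sketch of running a local build for code generation with llama-cpp-python; the GGUF filename is an assumption (quantized conversions vary by uploader), and n_ctx can be raised toward the 128K limit if your memory allows:

```python
# Sketch: local code generation with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-coder-v2-lite-instruct.Q4_K_M.gguf",  # assumed filename
    n_ctx=16384,  # raise toward 128K if memory permits
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a singly linked list."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```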
Download: DeepSeek Coder V2 Instruct via Hugging Face
6. Mixtral-8x7B
Mixtral-8x7B is a sparse mixture-of-experts (MoE) model developed by Mistral AI. It features eight experts per feedforward layer, totaling around 46.7 billion parameters. However, only two experts are activated per token during inference, making it computationally efficient and comparable in speed and cost to a dense model of roughly 13 billion parameters.
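A toy illustration of why that works (not Mixtral's actual implementation): a router scores all eight experts for each token, but only the top two are evaluated, so per-token compute stays close to that of a much smaller dense model:

```python
# Toy top-2 mixture-of-experts routing in PyTorch; shapes are simplified.
import torch

def moe_layer(x, router, experts, k=2):
    # x: (tokens, dim); router: nn.Linear(dim, num_experts); experts: list of MLPs
    weights, idx = torch.topk(router(x).softmax(dim=-1), k)  # top-2 experts per token
    out = torch.zeros_like(x)
    for t in range(x.size(0)):
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])  # only k of the 8 experts ever run
    return out
```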
| Model | Mixtral-8x7B |
|---|---|
| Model Size | 12 GB |
| Parameters | 46.7 billion (8 experts) |
| Quantization | 4-bit |
| Type | Mistral MoE |
| License | Apache 2.0 |
Mixtral supports a context length of 32k tokens and outperforms Llama 2 70B on most benchmarks, matching or exceeding GPT-3.5 performance. It is proficient in multiple languages, including English, French, German, Spanish, and Italian, making it a versatile choice for various NLP tasks.
Download: Mixtral-8x7B via Hugging Face
7. Wizard Vicuna Uncensored-GPTQ
Wizard-Vicuna GPTQ is a quantized version of Wizard Vicuna, based on the LLaMA model. Unlike most LLMs released to the public, Wizard-Vicuna is an uncensored model with its alignment removed. This means the model doesn't have the same safety and moral constraints as most models.
| Model | Wizard-Vicuna-30B-Uncensored-GPTQ |
|---|---|
| Model Size | 16.94 GB |
| Parameters | 30 billion |
| Quantization | 4-bit |
| Type | LLaMA |
| License | GPL 3 |
Although an uncensored LLM poses an alignment and control problem, it also brings out the best in the model by allowing it to answer without constraints. It also lets users add their own custom alignment, defining how the AI should act or answer for a given prompt.
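In practice, that custom alignment usually lives in the prompt itself. Vicuna-style models expect a system preamble followed by USER:/ASSISTANT: turns, so the preamble is where your own rules go; a sketch (exact formatting varies by release, so check the model card):

```python
# Sketch: imposing your own behavioral rules on an uncensored Vicuna-style model.
system = (
    "A chat between a user and an assistant. "
    "The assistant answers directly but always refuses to write malware."
)
user_msg = "How do I harden an SSH server?"
prompt = f"{system}\nUSER: {user_msg}\nASSISTANT:"
print(prompt)  # feed this string to your local inference runtime
```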
Download: Wizard-Vicuna-30B-Uncensored-GPTQ via Hugging Face
8. Orca Mini-GPTQ
Looking to experiment with a model trained on a unique learning method? Orca Mini is an unofficial implementation of Microsoft's Orca research paper. It was trained using the teacher-student learning method, where the dataset is full of explanations instead of just prompts and responses. In theory, this should produce a smarter student: a model that can understand the problem rather than just matching input and output pairs, as typical LLMs do.
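To see the idea in action, compare a bare instruction with an Orca-style prompt that asks for the reasoning the model was tuned on; the header format below is illustrative, so check the model card for the exact template:

```python
# Sketch: an explanation-seeking, Orca-style prompt versus a bare instruction.
system = "You are a teacher. Think step by step and justify your answer."
user = "Why does ice float on water?"

bare_prompt = user
orca_prompt = f"### System:\n{system}\n\n### User:\n{user}\n\n### Response:\n"
print(orca_prompt)
```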
| Model | Orca Mini-GPTQ |
|---|---|
| Model Size | 8.11 GB |
| Parameters | 3 billion |
| Quantization | 4-bit |
| Type | LLaMA |
| License | MIT |
With only three billion parameters, Orca Mini GPTQ is easy to run even on less powerful systems. However, this model should not be used for anything professional, as it can generate false information and biased or offensive responses. It is best used for learning about and experimenting with Orca and its methods.
Download: Orca Mini-GPTQ via Hugging Face
9. Llama 2 13B Chat GPTQ
Llama 2 is the successor to the original Llama LLM, offering improved performance and versatility. The 13B Chat GPTQ variant is fine-tuned for conversational AI and optimized for English dialogue.
| Model | Llama 2 13B Chat GPTQ |
|---|---|
| Model Size | 7.26 GB |
| Parameters | 13 billion |
| Quantization | 4-bit |
| Type | Llama 2 |
| License | Meta License |
Llama 2 is intended for commercial and research use. Its licensing terms allow companies with fewer than 700 million monthly active users to use it without additional fees. This model is ideal for organizations seeking a robust chatbot solution that requires minimal additional training.
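Llama 2's chat variants were trained on the [INST] format with an optional <<SYS>> system block; here is a sketch of assembling a single turn by hand, although most local runtimes apply this template for you:

```python
# Sketch: the Llama 2 chat prompt format for one user turn.
system = "You are a support bot for a small business. Keep answers brief."
user_msg = "What are your opening hours?"
prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_msg} [/INST]"
print(prompt)
```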
Download: Llama 2 13B Chat GPTQ via Hugging Face
Some of the models listed above come in several versions with different parameter counts. In general, higher-parameter versions yield better results but require more powerful hardware, while lower-parameter versions generate lower-quality results but can run on lower-end hardware. If you're unsure whether your PC can run a model, start with the lowest-parameter version, then work your way up until the slowdown is no longer acceptable.
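As a rough rule of thumb, a quantized model needs about (parameters x bits per weight / 8) bytes of memory, plus 10 to 20 percent overhead for activations and the KV cache; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope memory estimate for quantized models; approximate
# figures that ignore backend-specific overhead.
def est_gb(params_billion: float, bits: int = 4, overhead: float = 1.15) -> float:
    return params_billion * bits / 8 * overhead  # billions of params -> GB

for p in (3, 7, 13, 30):
    print(f"{p}B @ 4-bit ~ {est_gb(p):.1f} GB")
# e.g. 30B at 4-bit lands near the 16.94 GB listed for Wizard-Vicuna above
```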