How To Run LLMs Locally: Tools and How-To Guide

Large language models (LLMs) have quickly moved from being a talking point to something the average person uses in everyday life. For developers, this has meant exploring different options: some look for more customization, while many seek cost-effective alternatives. One key trend that has emerged is running LLMs on local devices, which offers three primary advantages: privacy, cost savings, and offline capability.
There are several user-friendly methods for deploying large language models on Windows, macOS, and Linux, ranging from GUI-based tools to command-line interfaces. Let’s dive into a few.

Popular Tools for Local LLM Deployment
The landscape of local LLM deployment has evolved rapidly, with several powerful tools emerging to meet different user needs and technical requirements. Each tool offers unique advantages, whether you're prioritizing ease of use, performance, or customization capabilities.
Some leading tools that have gained significant traction in the local LLM space include:
- Ollama: It offers a user-friendly command-line interface that allows users to quickly pull and run optimized models without complex setup procedures.
- GPT4All: This platform provides a graphical user interface for easy interaction with local LLMs, making it accessible to users with varying technical expertise.
- LLaMa.cpp: Known for its efficient C++ implementation, LLaMa.cpp is favored by developers seeking high-performance inference on consumer hardware.
- Llamafile: This self-contained executable bundles model weights with llama.cpp, offering a portable solution for running LLMs across different systems.
- LocalAI: An open-source alternative that focuses on providing a local API similar to OpenAI's, enabling seamless integration with existing AI workflows.
- LM Studio: A convenient desktop application with a built-in API server, combining easy model management with programmatic access.
- Jan AI: A desktop application with an OpenAI-compatible API server, enabling Python integration alongside its chat interface.
System Requirements and Performance Considerations
Minimum Requirements:
- CPU: 64-bit processor (x86 or ARM)
- RAM: 8GB (minimum for smaller models), 16GB+ recommended for larger models
- Storage: 10GB+ free space for model files
- Operating System: Windows 10/11, macOS 10.15+, or Linux with kernel 4.x+
Recommended Specifications:
- CPU: Modern multi-core processor (8+ cores)
- RAM: 32GB+
- Storage: NVMe SSD with 50GB+ free space
- GPU: Optional but recommended (8GB+ VRAM for optimal performance)
Performance varies significantly based on your hardware. In our testing on a MacBook Pro M1 (16GB RAM), we observed generation speeds of roughly 7-10 tokens per second.
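Before committing to a particular tool, it can be worth confirming that your machine clears the minimums above. The sketch below is a minimal check using the third-party psutil package (an assumption; install it with pip install psutil); the thresholds simply mirror the guidance in this section.
{{qq-border-start}}
import os
import platform
import shutil

import psutil  # assumption: installed via `pip install psutil`

MIN_RAM_GB = 8         # minimum for smaller models
MIN_FREE_DISK_GB = 10  # space for model files

ram_gb = psutil.virtual_memory().total / 1024**3
free_disk_gb = shutil.disk_usage(os.path.expanduser("~")).free / 1024**3

print(f"OS: {platform.system()} ({platform.machine()})")
print(f"CPU cores: {os.cpu_count()}")
print(f"RAM: {ram_gb:.1f} GB ({'OK' if ram_gb >= MIN_RAM_GB else 'below minimum'})")
print(f"Free disk: {free_disk_gb:.1f} GB ({'OK' if free_disk_gb >= MIN_FREE_DISK_GB else 'below minimum'})")
{{qq-border-end}}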
1. Ollama Deployment Guide
Ollama is a powerful tool for running large language models locally, offering a balance between advanced AI capabilities and ease of use [1]. It supports various LLMs, including Llama and other open-source models, and provides a command-line interface for quick deployment and interaction. One of Ollama's key advantages is its ability to run models offline, enhancing privacy and reducing latency.
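Before building anything larger, a quick sketch like the one below can confirm that the Ollama service is reachable from Python and show which models have already been pulled (this assumes the ollama Python package is installed and the Ollama service is running):
{{qq-border-start}}
import ollama

try:
    # Ask the local Ollama service which models are already available
    print(ollama.list())
    # Pull a model if it is not present yet (uncomment as needed):
    # ollama.pull('llama3.3')
except Exception as e:
    print(f"Could not reach Ollama -- is the service running? {e}")
{{qq-border-end}}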
Here's a detailed Python code example demonstrating how to create a simple question-answering system using Ollama:
{{qq-border-start}}
import ollama
import json

# Define a knowledge base
knowledge_base = {
    "capital_france": "The capital of France is Paris.",
    "largest_planet": "Jupiter is the largest planet in our solar system.",
    "python_creator": "Python was created by Guido van Rossum."
}

def get_ollama_response(prompt, system_prompt):
    try:
        response = ollama.chat(model='llama3.3', messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': prompt}
        ])
        return response['message']['content']
    except Exception as e:
        return f"An error occurred: {str(e)}"

def question_answering_system():
    system_prompt = """You are a helpful assistant. Use the provided knowledge base to answer user questions.
    If the question can't be answered using the knowledge base, politely say you don't have that information."""
    print("Question Answering System (type 'exit' to quit)")
    while True:
        user_question = input("User: ")
        if user_question.lower() == 'exit':
            break
        prompt = f"Knowledge Base: {json.dumps(knowledge_base)}\n\nUser Question: {user_question}\n\nPlease provide a concise answer based on the knowledge base."
        response = get_ollama_response(prompt, system_prompt)
        print(f"Assistant: {response}")

if __name__ == "__main__":
    question_answering_system()
{{qq-border-end}}
To run this script, ensure Ollama is installed and running on your system, install the Ollama Python library (pip install ollama), and then execute the script. This example shows how Ollama can be used to create a local, privacy-preserving question-answering system that leverages a custom knowledge base, making it suitable for various applications where data control and offline capabilities are crucial.
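If you would rather see the answer appear token by token instead of waiting for the full response, the ollama package also supports streaming. The following is a minimal sketch that assumes the same llama3.3 model used above:
{{qq-border-start}}
import ollama

# Stream the reply token by token instead of waiting for the full message
stream = ollama.chat(
    model='llama3.3',
    messages=[{'role': 'user', 'content': 'Summarize the benefits of local LLMs.'}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a partial message; print it without a newline
    print(chunk['message']['content'], end='', flush=True)
print()
{{qq-border-end}}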
2. Setting Up GPT4All Locally
GPT4All offers a user-friendly approach to running powerful language models locally on personal devices.
GUI Installation
- Download the installer for your operating system (Windows, macOS, or Ubuntu) from the official GPT4All website.
- Install the application and launch it from the created desktop shortcut.
- Upon first launch, you'll need to download a language model. Select one based on your system's RAM capacity.
- Once a model is installed, you can start interacting with the AI by typing prompts in the chat interface.
GPT4All runs entirely on your CPU, requiring no internet connection or GPU. This ensures privacy and allows for offline use, making it an accessible option for those wanting to explore AI language models without relying on cloud-based services.
Programmatic Usage
To run an LLM locally using GPT4All, you can use the following Python code snippet:
{{qq-border-start}}
from gpt4all import GPT4All

# Initialize the model
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

# Create a chat session
with model.chat_session():
    # Generate a response
    response = model.generate("How can I run LLMs efficiently on my laptop?", max_tokens=1024)
    print(response)
{{qq-border-end}}
This code demonstrates how to:
- Import the GPT4All library
- Initialize a model (in this case, Meta-Llama-3-8B-Instruct)
- Create a chat session
- Generate a response to a given prompt
GPT4All automatically handles model downloading and loading, making it easy to run LLMs locally without requiring API calls or GPUs. This approach ensures privacy and allows for offline use, as all processing occurs on your device.
Remember to install the gpt4all library using pip install gpt4all before running the code.
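For longer answers, streaming the output can make the experience feel more responsive. The sketch below is a small variation of the example above, assuming the streaming option exposed by the gpt4all Python bindings and the same Meta-Llama-3-8B-Instruct model:
{{qq-border-start}}
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

with model.chat_session():
    # streaming=True yields tokens one at a time instead of a single string
    for token in model.generate(
        "List three tips for running LLMs on a laptop.",
        max_tokens=512,
        streaming=True,
    ):
        print(token, end="", flush=True)
print()
{{qq-border-end}}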
3. Using LLaMa.cpp for Inference
LLaMa.cpp provides an efficient framework for running LLM inference locally. To use it for inference, follow these steps:
- Clone and build the LLaMa.cpp repository.
- Download a compatible GGUF model file.
- Run inference from the command line (newer llama.cpp builds name the binary llama-cli rather than main):
{{qq-border-start}}
./main -m /path/to/model.gguf -p "Your prompt here"
{{qq-border-end}}
For more control, you can use the C++ API to integrate LLaMa.cpp into your projects. Key steps include:
- Initialize the model using llama_load_model_from_file
- Create a context with llama_new_context_with_model
- Set up a sampler chain for token generation
- Tokenize input, perform inference, and detokenize output
LLaMa.cpp optimizes performance through efficient memory management and support for quantization, enabling inference on consumer hardware [1][2]. Its low-level approach provides flexibility for custom applications while maintaining high efficiency.
Llama Cpp with Python
To run an LLM locally with llama.cpp from Python, you can use the llama-cpp-python package, which provides Python bindings for llama.cpp. Here's a code snippet demonstrating how to set up and use a local LLM:
{{qq-border-start}}
from llama_cpp import Llama

# Initialize the model
llm = Llama(model_path="/path/to/your/model.gguf", n_ctx=2048, n_threads=4)

# Generate a response
prompt = "Explain the benefits of running LLMs locally:"
output = llm(prompt, max_tokens=100, stop=[".", "\n"], echo=True)

# Print the generated text
print(output['choices'][0]['text'])

# For a more interactive experience, you can create a simple chat loop
while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        break
    response = llm(user_input, max_tokens=200, stop=["\n", "You:"], echo=False)
    print("AI:", response['choices'][0]['text'].strip())
{{qq-border-end}}
This code demonstrates how to initialize the model, generate responses, and create a simple chat interface. Before running this script, ensure you have installed the llama-cpp-python library using pip:
{{qq-border-start}}
pip install llama-cpp-python
{{qq-border-end}}
Remember to replace /path/to/your/model.gguf with the actual path to your downloaded GGUF model file. This approach allows you to leverage the efficiency of llama.cpp while working in a Python environment, making it easier to integrate local LLM capabilities into your projects.
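llama-cpp-python also exposes a higher-level, OpenAI-style chat method, which is often more convenient than raw prompt strings when working with instruction-tuned models. A minimal sketch, assuming the same placeholder model path as above:
{{qq-border-start}}
from llama_cpp import Llama

llm = Llama(model_path="/path/to/your/model.gguf", n_ctx=2048, n_threads=4)

# create_chat_completion applies the model's chat template and returns an
# OpenAI-style response dictionary
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Why run LLMs locally?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(result["choices"][0]["message"]["content"])
{{qq-border-end}}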
4. LlamaFile Local Deployment
Another option is Llamafile, which combines model weights and a compiled version of llama.cpp into a single executable file, making local deployment straightforward. To use Llamafile:
1. Download a llamafile from HuggingFace, for example:
{{qq-border-start}}
wget https://huggingface.co/Mozilla/TinyLlama-1.1B-Chat-v1.0-llamafile/resolve/main/TinyLlama-1.1B-Chat-v1.0.F16.llamafile
{{qq-border-end}}
2. Make the file executable:
{{qq-border-start}}
chmod +x TinyLlama-1.1B-Chat-v1.0.F16.llamafile
{{qq-border-end}}
3. Run the llamafile in server mode:
{{qq-border-start}}
./TinyLlama-1.1B-Chat-v1.0.F16.llamafile --server --nobrowser --embedding --port 8080
{{qq-border-end}}
To interact with the LLM using Python and LlamaIndex:
{{qq-border-start}}
# Requires the llamafile integrations for LlamaIndex:
#   pip install llama-index-llms-llamafile llama-index-embeddings-llamafile
from llama_index.llms.llamafile import Llamafile
from llama_index.embeddings.llamafile import LlamafileEmbedding

# Initialize LLM and embedding model against the local llamafile server
llm = Llamafile(base_url="http://localhost:8080")
embed_model = LlamafileEmbedding(base_url="http://localhost:8080")

# Generate text
response = llm.complete("What is the capital of France?")
print(response.text)

# Generate embeddings
embedding = embed_model.get_text_embedding("Hello, world!")
print(embedding)
{{qq-border-end}}
This approach allows for easy local deployment of LLMs, ensuring privacy and offline capabilities.
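Because llamafile's server mode builds on the llama.cpp server, it also exposes an OpenAI-compatible HTTP endpoint, so you can talk to it with plain requests instead of LlamaIndex if you prefer. A minimal sketch, assuming the server from step 3 is listening on port 8080 and serving the standard /v1/chat/completions route:
{{qq-border-start}}
import requests

# Chat with the llamafile server via its OpenAI-compatible endpoint
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "TinyLlama-1.1B-Chat-v1.0",  # informational; the server uses its loaded model
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 100,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
{{qq-border-end}}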
5. LocalAI Use Case
LocalAI provides an open-source alternative to the OpenAI API, allowing developers to run AI models locally with minimal setup. Here are some concise code examples demonstrating how to use LocalAI:
1. Setting up LocalAI:
{{qq-border-start}}
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
{{qq-border-end}}
2. Using LocalAI with Python:
{{qq-border-start}}
import openai

openai.api_key = "sk-xxx"  # Any string works as the API key
openai.api_base = "http://localhost:8080/v1"

response = openai.Completion.create(
    model="gpt-4o-mini",
    prompt="Translate 'Hello, world!' to French:",
    max_tokens=50
)
print(response.choices[0].text.strip())
{{qq-border-end}}
This script demonstrates how to use the OpenAI Python library with LocalAI by simply changing the API base URL [1].
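Note that the snippet above uses the pre-1.0 interface of the openai library. With openai 1.x, the same idea looks like the sketch below; only base_url changes compared with talking to OpenAI itself, and the model name should match one configured in your LocalAI models directory:
{{qq-border-start}}
from openai import OpenAI

# Point the official client at the LocalAI server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-xxx")

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # must match a model name configured in LocalAI
    messages=[{"role": "user", "content": "Translate 'Hello, world!' to French:"}],
    max_tokens=50,
)
print(completion.choices[0].message.content)
{{qq-border-end}}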
3. Integrating LocalAI with a chatbot UI:
{{qq-border-start}}
import gradio as gr
import openai

openai.api_key = "sk-xxx"
openai.api_base = "http://localhost:8080/v1"

def chat(message, history):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message.content

interface = gr.ChatInterface(chat)
interface.launch()
{{qq-border-end}}
This example shows how to create a simple chatbot interface using Gradio and LocalAI [2].
These code snippets demonstrate the ease of integrating LocalAI into existing projects, allowing developers to leverage open-source AI models while maintaining control over their data and infrastructure.
6. LM Studio Application
LM Studio provides a convenient way to run local LLMs, including the Llama 3.3 model, through its built-in API server. Here's a Python code example demonstrating how to interact with the Llama 3.3 model using LM Studio:
{{qq-border-start}}
import requests
import json

# LM Studio server URL (default port is 1234)
url = "http://localhost:1234/v1/chat/completions"

# Headers for the API request
headers = {
    "Content-Type": "application/json"
}

# Function to generate a response from the model
def generate_response(prompt):
    data = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "max_tokens": 150
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response.json()['choices'][0]['message']['content']

# Example usage
prompt = "Explain the benefits of running LLMs locally using LM Studio."
response = generate_response(prompt)
print(response)
{{qq-border-end}}
To use this script:
- Download and install LM Studio from the official website.
- Load the Llama 3.3 model in LM Studio.
- Start the local inference server in LM Studio.
- Run the Python script.
This approach allows developers to leverage LM Studio's user-friendly interface for model management while integrating local LLM capabilities into their Python projects. The script uses the OpenAI-compatible API provided by LM Studio, making it easy to switch between different models or even cloud-based services if needed.
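Because the endpoint is OpenAI-compatible, you can also use the official openai client and stream tokens as they arrive, which feels more responsive for longer answers. A minimal sketch, again assuming the server on its default port 1234 (the model identifier is a placeholder; use whatever LM Studio shows for the model you loaded):
{{qq-border-start}}
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key just needs to be non-empty
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

stream = client.chat.completions.create(
    model="llama-3.3",  # placeholder; use the identifier LM Studio shows for your loaded model
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
{{qq-border-end}}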
7. Jan AI Use Case
Jan AI, primarily known for its user-friendly desktop interface, also offers Python integration capabilities for developers looking to incorporate local LLMs into their projects. To use Jan AI with the Llama 3.3 model in Python, you can leverage its API server functionality. Here's a concise example of how to integrate Jan AI with Python:
Before running the script below, ensure that Jan AI is installed and running on your system with the Llama 3.3 model loaded [1]. Jan AI provides an OpenAI-compatible API, making it easy to integrate with existing codebases or switch between different models. This approach combines the privacy benefits of local LLM deployment with the flexibility of programmatic access, allowing developers to build AI-powered applications that run entirely on their own hardware.
{{qq-border-start}}
import requests

# Jan AI server URL (default port is 1337)
url = "http://localhost:1337/v1/chat/completions"

# Headers for the API request
headers = {
    "Content-Type": "application/json"
}

def generate_response(prompt):
    data = {
        "model": "llama-3.3",  # Specify the Llama 3.3 model
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "max_tokens": 150
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json()['choices'][0]['message']['content']

# Example usage
prompt = "Explain the advantages of using Jan AI for local LLM deployment."
response = generate_response(prompt)
print(response)
{{qq-border-end}}
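Before sending chat requests, it can help to confirm the server is up and see which model identifiers it reports. Since Jan exposes an OpenAI-compatible API, the standard model-listing endpoint should work; the sketch below assumes the default port 1337:
{{qq-border-start}}
import requests

# Query Jan AI's OpenAI-compatible models endpoint to verify the server is up
# and discover the exact model identifiers it expects
try:
    resp = requests.get("http://localhost:1337/v1/models", timeout=5)
    resp.raise_for_status()
    for model in resp.json().get("data", []):
        print(model.get("id"))
except requests.RequestException as e:
    print(f"Jan AI server not reachable: {e}")
{{qq-border-end}}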
Common Implementation Challenges and Solutions
During our testing of these various approaches, we encountered several important considerations and potential roadblocks:
- Memory Management
- Challenge: Large models can cause out-of-memory errors
- Solution: Use quantized models (4-bit or 8-bit) or implement batch processing for large requests
- CPU Utilization
- Challenge: High CPU usage can slow down other system processes
- Solution: Configure the thread count appropriately (the n_threads parameter) and consider process nice values on Unix systems; see the sketch after this list
- Model Loading Time
- Challenge: Initial model loading can take several minutes
- Solution: Implement model preloading at system startup or use model caching
- Temperature Management
- Challenge: Extended inference can cause thermal throttling
- Solution: Monitor system temperatures and implement cool-down periods for intensive workloads
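As a concrete illustration of the CPU-utilization point above, the sketch below caps llama-cpp-python's thread count to leave headroom for other processes and raises the process nice value on Unix-like systems (the model path is a placeholder):
{{qq-border-start}}
import os

from llama_cpp import Llama

# Leave a couple of cores free for the rest of the system
threads = max(1, (os.cpu_count() or 4) - 2)

# On Unix-like systems, raise the nice value so the OS deprioritizes this
# process when other work competes for the CPU
if hasattr(os, "nice"):
    os.nice(10)

llm = Llama(
    model_path="/path/to/your/model.gguf",  # placeholder path
    n_ctx=2048,
    n_threads=threads,
)
print(llm("Briefly explain thermal throttling.", max_tokens=64)["choices"][0]["text"])
{{qq-border-end}}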
Choosing the Right Solution
Each tool has its optimal use cases:
- Ollama
- Best for: Development environments and quick prototyping
- When to use: Need simple CLI interface and quick setup
- System requirements: Moderate (8GB RAM minimum)
- Advantages: Easy to use, good documentation
- Limitations: Limited model customization options
- GPT4All
- Best for: Non-technical users and desktop applications
- When to use: Need a GUI interface and simple deployment
- System requirements: Light (8GB RAM minimum)
- Advantages: User-friendly interface, good for beginners
- Limitations: Less flexible for custom implementations
- LLaMa.cpp
- Best for: Production environments and performance-critical applications
- When to use: Need maximum performance and control
- System requirements: Varies (8GB-32GB RAM depending on model)
- Advantages: Highly optimized, great performance
- Limitations: Requires technical expertise
- Llamafile
- Best for: Cross-platform deployment and portability
- When to use: Need to distribute self-contained applications
- System requirements: Moderate (8GB RAM minimum)
- Advantages: Excellent portability, simple deployment
- Limitations: Limited to supported model formats
- LocalAI
- Best for: API-first applications and OpenAI API compatibility
- When to use: Need to replace OpenAI API with local alternative
- System requirements: Moderate to High (16GB RAM recommended)
- Advantages: API compatibility, flexible deployment
- Limitations: More complex setup, requires Docker knowledge
- LM Studio
- Best for: Model management and API-driven applications
- When to use: Need built-in API server with easy model loading
- System requirements: 8GB RAM minimum
- Advantages: User-friendly interface, Llama 3.3 support
- Limitations: Requires manual model download and server start
- Jan AI
- Best for: Desktop AI interaction with Python integration
- When to use: Need flexible local LLM with programmatic access
- System requirements: 8GB RAM minimum
- Advantages: Desktop GUI, OpenAI-compatible API
- Limitations: Emerging ecosystem, fewer model options
Conclusion
While local deployment requires initial hardware investments and ongoing maintenance, the long-term benefits often outweigh these challenges. As the technology continues to evolve, tools like Ollama, GPT4All, and Llamafile are making local LLM deployment increasingly accessible to businesses and individuals alike.
