Practical Guide to Running Large Language Models Locally

Tech May 18 4

Understanding Large Language Models

Large Language Models (LLMs) have become a focal point in artificial intelligence since the emergence of ChatGPT. These models can process and generate human-like text, enabling capabilities such as text generation, question answering, conversation, and document summarization.

To understand LLMs, let's use an educational analogy:

  1. Institution Selection - Training LLMs requires substantial computational resources, typically necessitating powerful GPUs that only well-funded organizations can afford.
  2. Curriculum Development - LLMs require vast amounts of training data, often measured in hundreds of billions of tokens.
  3. Teaching Methodology - Algorithms determine how the model learns relationships between tokens in the training data.
  4. Specialization - Fine-tuning adapts the model to specific domains or tasks.
  5. Application - The model performs tasks like translation or Q&A through a process called inference.

In LLMs, tokens represent the basic units of text processing. These can be characters, words, or subwords, depending on the tokenization method. Tokens are converted to numerical IDs that form a vocabulary table:

Token   ID
The     345
cat     1256
sat     1726
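
The lookup in the table above can be sketched as a pair of dictionaries. This is only a toy illustration: it reuses the three IDs from the table and splits on whitespace, whereas real tokenizers usually work on subwords and hold tens of thousands of entries. The names `VOCAB`, `encode`, and `decode` are made up for this example.

```python
# Toy vocabulary reusing the IDs from the table above (illustrative only).
VOCAB = {"The": 345, "cat": 1256, "sat": 1726}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Split on whitespace and look up each token's ID."""
    return [VOCAB[token] for token in text.split()]


def decode(ids: list[int]) -> str:
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)


print(encode("The cat sat"))        # [345, 1256, 1726]
print(decode([345, 1256, 1726]))    # The cat sat
```

A token that is missing from the vocabulary raises a KeyError here; production tokenizers avoid this by falling back to smaller subword or byte-level pieces.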

To help computers understand relationships between tokens, they are converted into dense vector representations through embedding. Common embedding algorithms include:

  • Static word embeddings
    • Word2Vec - trains a shallow neural network to predict words from their context
    • GloVe - learns word vectors from global co-occurrence statistics
  • Convolutional and recurrent encoders
    • CNN - obtains vectors using convolutional networks
    • RNN/LSTM - uses sequence models to build text vectors
  • Pre-trained and document-level models
    • BERT - pre-trains contextual word vectors using Transformers and masked language modeling
    • Doc2Vec - extends Word2Vec to generate vectors for whole text sequences
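
Whatever algorithm produces them, embeddings end up as a table that maps each token ID to a vector, and related tokens tend to have vectors that point in similar directions. The sketch below assumes hypothetical 4-dimensional vectors chosen by hand; real models learn these values during training and use hundreds or thousands of dimensions.

```python
import math

# Hypothetical hand-picked vectors, keyed by token ID (illustrative only).
EMBEDDINGS = {
    345: [0.1, 0.0, 0.2, 0.1],    # "The"
    1256: [0.9, 0.8, 0.1, 0.3],   # "cat"
    1726: [0.2, 0.1, 0.9, 0.7],   # "sat"
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Compare the vectors for "cat" and "sat".
print(cosine_similarity(EMBEDDINGS[1256], EMBEDDINGS[1726]))
```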

Transformer-based models employ self-attention mechanisms to learn dependencies between tokens, generating high-quality embeddings. The "large" in Large Language Models refers to the numerous parameters (weights and biases) that express token relationships. For instance, GPT-3 contains 175 billion parameters, while its vocabulary has only about 50,000 tokens.
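
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It makes a large simplifying assumption: the query, key, and value projections are left out (treated as identity), whereas a real Transformer multiplies each vector by learned weight matrices first, and those matrices are where many of the parameters live. The function names are made up for this example.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: list[list[float]]) -> list[list[float]]:
    """Scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score every token against the current one.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


print(self_attention([[1.0, 0.0], [0.0, 1.0]]))
```

Each output vector mixes information from every token in the sequence, with more weight given to tokens whose vectors score higher against the current one.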

Evolution of Large Language Models

The foundation of modern LLMs began with the 2017 paper "Attention Is All You Need." Since then, numerous pre-trained models have emerged:

  • BERT (Bidirectional Encoder Representations from Transformers) - Introduced by Google in 2018, it pioneered bidirectional pre-training to capture contextual semantics and used masked language modeling for better semantic inference. Parameter scale: 110M to 340M.
  • GPT (Generative Pre-trained Transformer) - First released by OpenAI in 2018, it demonstrated the power of autoregressive language modeling without additional supervision signals. Parameter scale: 117M for the original GPT, growing to 175 billion with GPT-3 in 2020.
  • LLaMA (Large Language Model Meta AI) - Released by Meta in February 2023, it was one of the first major models with openly available weights, providing a systematic approach to building larger, more general language models. Parameter scale: 7B to 65B.

Model Deployment Techniques

LLMs require significant memory resources. For example, a GPT-2 model with 1.5B parameters using float32 representation requires approximately 6GB of memory (4 bytes × 1,500,000,000). More advanced models like LLaMA with 65B parameters would need 260GB, not including vocabulary storage. Therefore, model compression is essential for practical deployment.
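
The arithmetic above generalizes to a one-line estimate: parameters times bytes per parameter. The sketch below is a rough calculator under stated assumptions: it counts weights only (no activations or runtime buffers), uses decimal gigabytes, and the `int8` and `int4` entries ignore any per-block metadata that real quantized formats store. The function name is made up for this example.

```python
# Bytes needed to store one parameter in each representation.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "float32") -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


print(weight_memory_gb(1.5e9))              # 6.0   (GPT-2, float32)
print(weight_memory_gb(65e9))               # 260.0 (LLaMA 65B, float32)
print(weight_memory_gb(65e9, "float16"))    # 130.0
print(weight_memory_gb(65e9, "int4"))       # 32.5
```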

Since data transfer between CPU and memory is often a bottleneck, reducing memory usage is a primary optimization goal. Using smaller data types is the most direct approach: 16-bit floating-point numbers halve memory requirements compared with float32. Several 16-bit standards exist, with NVIDIA supporting bfloat16 in its latest hardware:

Format    Significand  Exponent
bfloat16  8 bits       8 bits
float16   11 bits      5 bits
float32   24 bits      8 bits

Quantization

Reducing precision further, from 16-bit to 8-bit or 4-bit, requires a different approach because most commodity hardware offers no accelerated floating-point operations for such small types. This is where quantization comes in: weights are converted to small integers, which can use hardware-accelerated integer instructions such as those in Intel's AVX extensions.

A simple quantization method finds the maximum and minimum weight values, then divides this range into the number of buckets available for the integer type (256 buckets for 8-bit, 16 for 4-bit). Because it is applied to an already trained model without any retraining, it is the simplest form of post-training quantization. Two quantization methods are popular today:

  • GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers) - Primarily optimized for NVIDIA GPUs.
  • GGML - A C tensor library and quantized model format created by Georgi Gerganov. It focuses on CPU-optimized inference and runs particularly well on Apple M1 and M2 chips.

Community contributor "TheBloke" has applied these quantization methods to many popular LLMs and published the results on the Hugging Face Hub, significantly improving accessibility.
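
The min/max bucketing described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: one scale for the whole weight list and plain Python floats. It is not how GPTQ or GGML quantize, since both use more elaborate schemes. The function names are made up for this example.

```python
def quantize(weights: list[float], bits: int = 8):
    """Map floats onto integer buckets spanning [min, max]."""
    levels = 2 ** bits - 1            # 255 steps for 8-bit, 15 for 4-bit
    low, high = min(weights), max(weights)
    scale = (high - low) / levels
    if scale == 0:                    # all weights identical
        scale = 1.0
    quantized = [round((w - low) / scale) for w in weights]
    return quantized, scale, low


def dequantize(quantized: list[int], scale: float, low: float) -> list[float]:
    """Recover approximate floats from the integer buckets."""
    return [low + q * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q8, scale8, low8 = quantize(weights, bits=8)
print(q8)                             # integers between 0 and 255
print(dequantize(q8, scale8, low8))   # close to the original weights
```

The reconstruction error for each weight is at most half a bucket width, so going from 8-bit to 4-bit (16 buckets instead of 256) makes the worst-case error roughly 17 times larger for the same weight range.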

Practical Implementation

For macOS users, GGML-quantized models offer an efficient solution. While official implementations are typically Python-based with limited efficiency, several community projects provide optimized alternatives:

  • ggerganov/llama.cpp - A C/C++ implementation of Facebook's LLaMA model
  • ggerganov/whisper.cpp - A C/C++ implementation of OpenAI's Whisper model

Running LLaMA

To compile with Metal GPU support:

LLAMA_METAL=1 make

Download a quantized model such as Llama-2-7B-Chat-GGML (sizes range from 3GB to 7GB depending on quantization level), then run it with a prompt:

./main -m ~/Downloads/llama-2-7b-chat.ggmlv3.q4_1.bin \
       -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 10

Example output:

 Building a website can be done in 10 simple steps:

 planning, domain name registration, hosting choice, selecting a CMS or building from scratch, developing the site content, designing the user interface, coding and debugging, testing for usability, and launching it.

 Planning involves setting goals and objectives for your website, identifying your target audience, and determining its purpose.

 Domain name registration is choosing a unique web address that reflects your brand or business identity.

 Choosing the right hosting service ensures that your website loads quickly and efficiently. Popular choices include Bluehost, SiteGround, and
 HostGator.

 Selecting a Content Management System (CMS) or building one from scratch allows you to easily manage and update content without needing technical knowledge. Options include WordPress, Joomla, and Drupal.

 Developing website content involves creating text, images, videos, and other media that convey your message and provide value to users.

 Designing the user interface (UI) focuses on visual aspects of your site such as layout, color scheme, typography, and navigation.

 Coding ensures your site functions correctly by writing clean HTML, CSS, and JavaScript code. Debugging involves identifying and fixing any errors or bugs that arise during testing.

 Testing for usability means checking how easy it is for users to navigate through your site and find the information they need.

 Launching involves making your website live for all visitors to access, and promoting it through marketing channels such as social media and search engines. [end of text]

llama_print_timings:        load time =  1267.46 ms
llama_print_timings:      sample time =   204.14 ms /   313 runs   (    0.65 ms per token,  1533.23 tokens per second)
llama_print_timings: prompt eval time =   397.22 ms /    14 tokens (   28.37 ms per token,    35.25 tokens per second)
llama_print_timings:        eval time =  9504.40 ms /   312 runs   (   30.46 ms per token,    32.83 tokens per second)
llama_print_timings:       total time = 10132.02 ms
ggml_metal_free: deallocating

llama.cpp also provides a web interface. Start the server with:

./server -m ~/Downloads/llama-2-7b-chat.ggmlv3.q4_1.bin -ngl 512

The server listens on port 8080 by default, which you can access through a web browser.
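
Besides the browser UI, the server can be called from code over HTTP. The sketch below assumes the server is running locally on the default port and exposes the `/completion` endpoint with `prompt` and `n_predict` fields, returning the generated text under `content`; check the server README for your llama.cpp version, since the API may differ. The helper names are made up for this example.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    n_predict: int = 128,
    base_url: str = "http://127.0.0.1:8080",  # assumed default address
) -> urllib.request.Request:
    """Build a POST request for the server's /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict})
    return urllib.request.Request(
        f"{base_url}/completion",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the prompt to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]


# Usage (requires the server from the command above to be running):
#   print(complete("Building a website can be done in 10 simple steps:"))
```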

Running Whisper

As with llama.cpp, compile using the make command. Then download a converted model as described in ggerganov/whisper.cpp and convert your audio to 16 kHz mono 16-bit WAV (currently the only supported input format):

ffmpeg -loglevel 0 -y -i "$INPUT" -ar 16000 -ac 1 -c:a pcm_s16le "${INPUT}.wav"

Then transcribe the converted file:

./main -m models/ggml-small.bin -f "${INPUT}.wav" \
        -osrt -t 8 -p 4
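
Before starting a long transcription, it can help to confirm that the file really has the expected sample rate, channel count, and sample width. The check below is a small sketch using Python's standard `wave` module; the function name is made up for this example, and it only inspects the WAV header rather than the audio itself.

```python
import wave


def is_whisper_ready(path: str) -> bool:
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000      # matches -ar 16000
            and wav.getnchannels() == 1      # matches -ac 1
            and wav.getsampwidth() == 2      # 2 bytes = 16-bit (pcm_s16le)
        )


# Usage, with a hypothetical file name:
#   print(is_whisper_ready("episode.mp3.wav"))
```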

Example SRT output:

1
00:00:00,000 --> 00:00:05,520
 Hello everyone and welcome to another episode of the Form 3 TET podcast.

2
00:00:05,520 --> 00:00:08,800
 My name is Kevin Holtich, head of Pat from Engineering at Form 3.

3
00:00:08,800 --> 00:00:12,560
 Today I'm really excited that I've been joined by Torsten Ball.

4
00:00:12,560 --> 00:00:13,920
 How's it going, State Torsten?
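
If you want to post-process the subtitles, for example to find what was said at a given moment, the timestamp lines are easy to parse. The sketch below converts the `HH:MM:SS,mmm --> HH:MM:SS,mmm` lines shown above into integer milliseconds; the function names are made up for this example, and it handles only the timing line, not a full SRT file.

```python
import re

# Matches timestamps such as 00:00:05,520
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_ms(stamp: str) -> int:
    """Convert an SRT timestamp to milliseconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_cue_line(line: str) -> tuple[int, int]:
    """Split a 'start --> end' line into (start_ms, end_ms)."""
    start, end = line.split("-->")
    return srt_time_to_ms(start), srt_time_to_ms(end)


print(parse_cue_line("00:00:05,520 --> 00:00:08,800"))   # (5520, 8800)
```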

Available Whisper models:

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                ~1 GB          ~32x
base    74 M        base.en             base                ~1 GB          ~16x
small   244 M       small.en            small               ~2 GB          ~6x
medium  769 M       medium.en           medium              ~5 GB          ~2x
large   1550 M      N/A                 large               ~10 GB         1x

For English audio, the small model is generally sufficient. For Chinese or other languages, the large model is recommended.

Free LLM Products and Alternatives

While the most capable ChatGPT models require a paid subscription and the service has limited availability in some regions, numerous alternatives exist. Google's Bard is one option, and for code-specific tasks, several alternatives to GitHub Copilot are available:

  • Tabnine - AI assistant that accelerates development while maintaining code security
  • Codeium - Free AI code completion and chat
  • Amazon CodeWhisperer - AI coding companion for faster, more secure application development
  • Sourcegraph Cody - AI assistant with knowledge of your entire codebase
  • Tabby - Open-source, self-hosted AI coding assistant
  • fauxpilot - Open-source alternative to GitHub Copilot server

The Future of LLMs

Despite its name, OpenAI has not made ChatGPT's source code or model weights publicly available. However, Meta's release of LLaMA in February 2023, followed in July 2023 by LLaMA 2 under a license that permits commercial use, has significantly accelerated LLM development. A leaked Google document titled "We Have No Moat, And Neither Does OpenAI" argued that open-source models were catching up quickly and that Meta stood to benefit most, since much of that community work builds on its models.

As large language models continue to evolve, they raise important questions about the future of work and human-computer interaction.
