A Beginner's Guide to Running Llama 3 on Linux (Ubuntu, Linux Mint)

Introduction

Llama 3, Meta’s latest openly available large language model, is built for flexibility and performance and designed to handle a wide range of AI tasks, from natural language processing to interactive chat, making it a powerful tool for developers, researchers, and AI enthusiasts alike. In this guide, we’ll walk you through the steps to set up and run Llama 3 on a Linux system. Whether you’re an experienced AI developer or new to machine learning, this step-by-step guide will help you get up and running, even on modest hardware.

Prerequisites

Before we get started, let’s ensure your system meets the necessary hardware and software requirements to run Llama 3 efficiently. Here’s what you’ll need:

  • GPU: An Nvidia GPU with at least 8GB of VRAM (12GB or more is recommended for better performance, especially with larger models).
  • Operating System: Ubuntu 20.04 or a similar Linux distribution.
  • Python: Version 3.8 or newer.
  • CUDA drivers: Ensure that Nvidia’s CUDA toolkit is properly installed and configured on your machine. I have a separate guide dedicated to that setup.

To verify your GPU setup, you can run the following command:

nvidia-smi

This will display your GPU's available VRAM and other relevant specs. While it’s possible to run smaller Llama 3 models with 8GB or 12GB of VRAM, more VRAM will allow you to work with larger models and process data more efficiently.
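
If you want a more focused readout, nvidia-smi also supports query flags. This is purely a convenience; the full nvidia-smi output above contains the same information:

# Print just the GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv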

Step 1: Setting Up Your Environment

Before installing Llama 3, it’s essential to set up a dedicated environment to keep things organized. Here’s how to get started:

Create a project directory and clone the official Llama 3 repository from GitHub:

mkdir llama3-project
cd llama3-project
git clone https://github.com/meta-llama/llama3.git

Next, install the Linux packages the download script relies on. Note that md5sum is not a standalone package; it is part of coreutils, which is preinstalled on most Ubuntu systems:

sudo apt-get install wget coreutils

Finally, install the required Python dependencies. It’s a good idea to create a virtual environment for Python to avoid conflicts with other packages on your system. Note that pip install -e . must be run from inside the cloned llama3 directory:

python3 -m venv llama3-venv
source llama3-venv/bin/activate
cd llama3
pip install -e .

By isolating your Python dependencies in a virtual environment, you can easily manage different versions and packages without affecting other projects.
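
As a quick sanity check before moving on, confirm that the install succeeded and that PyTorch can see your GPU. This is a minimal sketch, assuming the editable install pulled in a CUDA-enabled build of PyTorch:

# Run inside the activated virtual environment
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If this prints True, you’re ready to download the model weights.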

Step 2: Downloading Llama 3 Model Weights

Meta provides model weights upon request, and these are crucial for running Llama 3. To obtain the model weights, you’ll need to visit the official Llama 3 website and submit a request. Once your request is approved, Meta will send you a download link via email, which remains active for 24 hours.

Once you have the link, you can use the download script included in the llama3 repository. Run it from the repository directory and paste the download URL from your email when prompted:

./download.sh

During the download process, you’ll be prompted to choose a model version that fits your hardware. If you have a GPU with 8GB or 12GB of VRAM, the 8B model is a good starting point. If you have more VRAM, feel free to opt for a larger model.
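
Once the download finishes, it’s worth confirming the files arrived intact. The sketch below assumes you chose the 8B model, which the script places in a Meta-Llama-3-8B/ directory along with a checksum file; the exact directory name depends on the model you selected:

# List the downloaded files: weights, params.json, and the tokenizer
ls Meta-Llama-3-8B/
# Verify checksums against the bundled checklist
cd Meta-Llama-3-8B && md5sum -c checklist.chk && cd ..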

Step 3: Running Your First Inference

With the model weights downloaded, it’s time to run your first inference using a text generation example. This example demonstrates Llama 3’s ability to generate coherent text based on input prompts.

Here’s how to run the example:

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir Meta-Llama-3-8B/ \
    --tokenizer_path Meta-Llama-3-8B/tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

In this command:

  • --ckpt_dir specifies the directory containing the model weights.
  • --tokenizer_path points to the tokenizer required for processing text.
  • --max_seq_len sets the maximum total sequence length, prompt plus generated text (128 is a good starting point).
  • --max_batch_size defines the batch size for processing. If you encounter memory issues, try reducing this value.

Make sure to replace the paths with the correct directories for your downloaded weights and tokenizer. Adjusting max_seq_len and max_batch_size based on your GPU memory will help optimize performance.
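
For example, a more conservative starting point for an 8GB card might look like the following; treat these values as a floor to scale up from, not a tuned configuration:

# Lower-memory variant of the same run (increase values if they fit)
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir Meta-Llama-3-8B/ \
    --tokenizer_path Meta-Llama-3-8B/tokenizer.model \
    --max_seq_len 64 --max_batch_size 1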

Step 4: Running the Chat Model

Llama 3 also includes an interactive chat model that’s fine-tuned for instruction-following tasks, making it ideal for creating conversational AI systems. Here’s how to run the chat model:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 512 --max_batch_size 4

Again, you’ll need to adjust the paths to match the directories where your model weights and tokenizer are stored. For an 8GB GPU, you may need to reduce max_batch_size or max_seq_len to avoid running out of memory.
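
For instance, a reduced-memory invocation for a smaller GPU might look like this; the exact values that fit depend on your card:

# Chat example with conservative memory settings
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 256 --max_batch_size 1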

Troubleshooting Common Issues

If you run into issues during the setup or inference process, here are a few common troubleshooting steps:

  • Out of memory errors: If you receive an out-of-memory error while running the models, try lowering the max_batch_size or max_seq_len parameters in the command. These adjustments help reduce the load on your GPU.

  • CUDA errors: Ensure your CUDA toolkit and Nvidia drivers are correctly installed. If your system does not recognize your GPU, run nvidia-smi to verify the GPU is available and properly configured; a two-level check is sketched after this list.

  • Missing dependencies: If you encounter errors related to missing Python libraries or version conflicts, ensure all required packages are installed. You can re-run pip install -e . to install the necessary dependencies.

  • File path issues: Double-check that your model and tokenizer file paths are correct, especially if you’ve downloaded models into non-default directories.
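
For the CUDA case above, it helps to check at two levels: the driver (via nvidia-smi) and the framework (via PyTorch inside your virtual environment). A minimal sketch:

# Driver-level check: the GPU and a CUDA version should appear here
nvidia-smi
# Framework-level check: PyTorch must also report the GPU
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

If nvidia-smi works but PyTorch reports False, the likely culprit is a CPU-only PyTorch build or a CUDA version mismatch.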

For more detailed troubleshooting, refer to the official Llama 3 documentation or join the community discussion forums where many common problems are addressed.

Conclusion

Congratulations! You’ve successfully set up Llama 3 on your Linux system and run both a text generation model and an interactive chat model. Even if you’re working with a GPU that has just 8GB or 12GB of VRAM, you can still get Llama 3 running efficiently by managing batch sizes and sequence lengths. While higher VRAM allows for more complex tasks, this guide helps you get started with smaller models.

Now that you’ve mastered the basics, you can dive deeper into using Llama 3 for your own projects, whether it’s generating text, building AI-powered chatbots, or experimenting with new ideas in machine learning. For further exploration, consult the official Llama 3 documentation and stay engaged with the community for tips, updates, and advanced features.

Happy experimenting!

Wei-Ming Thor
