How to install Meta Llama2 on Linux (Ubuntu / Linux Mint)

In this guide, I'll demonstrate how to set up Meta's Llama2 open-source large language model to run on your desktop computer. To show some customization, we'll also build a basic game NPC AI that you can chat with.

Prerequisites

Several prerequisites are necessary to run and customize the Llama2 language model.

Hardware

We'll be configuring the 7B parameter model. Despite being the smallest of the Llama2 models, it still demands significant hardware resources to run smoothly. Note that there are no definitive or official hardware requirements for Llama2; the following are general recommendations for running 7B-size language models, based on feedback from the community and other testers.

  • RAM: 8GB or 16GB; the more, the better
  • GPU VRAM: Minimum of 8GB, recommended at least 12GB; the more, the better
  • Storage: Strongly recommend SSD. Approximately 11GB of storage for the 7B model

Keep in mind that GPU memory (VRAM) is crucial. You might be able to manage with lower-spec hardware; I've successfully run it on my M1 MacBook Air, though performance was extremely slow.
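If you're unsure what your machine has, you can check from a terminal. These are standard Linux tools; nvidia-smi assumes an NVIDIA GPU with drivers installed.

# Check available RAM
free -h

# Check GPU model and VRAM usage (NVIDIA only)
nvidia-smi

# Check free disk space
df -h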

Python Language

You'll need some basic knowledge of Python programming to create an interactive program that utilizes the model.
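If you want to confirm your setup first, check your Python version from the terminal. A reasonably recent Python 3 (3.8 or newer is a safe assumption) should work with the Llama2 code.

# Check Python and pip versions
python3 --version
python3 -m pip --version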

Downloading the model

Request access

The initial step is requesting access to the model on Meta AI's website and agreeing to their terms and conditions.

Complete your details and submit. Typically, you'll promptly receive an email with download instructions.

Download the model

Follow the instructions in the email. The exact steps may change over time, so treat the email as your primary guide; the steps below reflect the process at the time of writing.

Navigate to the Llama2 repository and download the code:

# Clone the code
git clone git@github.com:facebookresearch/llama.git

# Or, if you don't have SSH keys set up with GitHub, clone over HTTPS
git clone https://github.com/facebookresearch/llama.git

Access the directory and execute the download script:

cd llama

# Make the download script executable
chmod +x ./download.sh

# Run the download script
./download.sh

The download script will prompt you to enter the link from the email, resembling https://download.llamameta.net/*?Policy=eyJTdGF0ZW1lbnQiOlt7InUuaXF1ZV9oYXNoIjoidWRuMGljOGhmNGh2eXo0e....

Subsequently, it will prompt you to choose from available model weights:

  • Llama-2-7b
  • Llama-2-7b-chat
  • Llama-2-13b
  • Llama-2-13b-chat
  • Llama-2-70b
  • Llama-2-70b-chat

Start with the Llama-2-7b-chat weights for the chat application. Select and download them. Once the download completes, you'll have the model in the ./llama-2-7b-chat directory.
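After the script finishes, the files should look roughly like this (the listing below is from my download and may vary slightly between releases):

# Model weights
ls llama-2-7b-chat/
# checklist.chk  consolidated.00.pth  params.json

# Tokenizer files, downloaded into the repository root
ls tokenizer.model tokenizer_checklist.chk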

Install

Installing the library dependencies is essential. Optionally (but recommended), set up a Python virtual environment to isolate your project's dependencies.

# Using virtualenv...
virtualenv env

# Or, using venv
python3 -m venv env

# Then, activate the environment
source env/bin/activate

Then, install the project and its dependencies.

# Install the project
pip install -e .

# Install the project's dependencies as specified in the requirements.txt file.
pip install -r requirements.txt
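To confirm the installation worked, check that Torch imports cleanly and can see your GPU:

# Verify Torch is installed and CUDA is available
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"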

Test the model

After installation, it's time to test and run the model. The code includes example application scripts for testing. Within the code, you'll find:

  • example_chat_completion.py
  • example_text_completion.py

Torch should already be installed from the requirements.txt step, which also provides the torchrun launcher. The following command loads the example_chat_completion.py script to run the downloaded llama-2-7b-chat model:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

It should output a sample conversation like this:

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 10.51 seconds

User: what is the recipe of mayonnaise?

> Assistant:  Mayonnaise is a thick, creamy condiment made from a mixture 
of egg yolks, oil, and an acid, such as vinegar or lemon juice. Here is a 
basic recipe for homemade mayonnaise:

Troubleshoot

Unfortunately, it doesn't always go smoothly, and you might not see the output above. The most common error you'll see is:

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

However, this is a very generic PyTorch error, making it challenging to pinpoint the issue precisely. One fix that worked for me was adjusting the values of the --max_seq_len and --max_batch_size parameters. You can experiment with values like 128 or 256 for --max_seq_len, and 4, 6, or 8 for --max_batch_size.
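For example, here's a more conservative invocation of the same command with smaller values:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4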

As of now, no definitive solution exists for this problem. If the above doesn't work, I suggest looking at the issues section of the PyTorch repo for further troubleshooting.
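If you suspect VRAM is the bottleneck, you can watch GPU memory usage from a second terminal while the model loads (again, assuming an NVIDIA GPU):

# Refresh GPU memory usage every second
watch -n 1 nvidia-smi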

Write the chat AI

Let's build a straightforward game NPC AI to get some hands-on knowledge of how the code operates and how you can construct your own chat AI. Begin by creating a file ./my_chat.py for your chat script. You can copy the code from ./example_chat_completion.py as a base and clear out the example dialogs.

from typing import Optional

import fire

from llama import Llama

def main(
    ckpt_dir: str,
    tokenizer_path: str,
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 512,
    max_batch_size: int = 8,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

    # Can support multiple dialogs at once.
    # A dialog is essentially a conversation.
    # Just create one empty dialog to keep it simple for now.
    dialogs = [
        []
    ]

    results = generator.chat_completion(
        dialogs,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    # Remove everything else below


if __name__ == "__main__":
    fire.Fire(main)

To create a continuous prompting loop, insert the following into your chat script. Each dialog consists of the message exchanges within a conversation; you can refer to the example code for the structure of the dialog object.

# ...

dialogs = [
    []
]

# Note: I've prompted it to keep answers short to prevent resource issues from occurring.
dialogs[0].append({ "role": "system", "content": "Provide short answers like a game NPC named George"})

while True:
    user_input = input("Say something: ")

    if user_input == 'exit':
        print("Exit the conversation.")
        break

    dialogs[0].append({ "role": "user", "content": user_input})

    results = generator.chat_completion(
        dialogs, # type: ignore
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    dialogs[0].append(results[0]['generation'])
    print("George:" + results[0]['generation']['content'])

Ensure that each message is assigned a role of user, assistant, or system. Messages should alternate between the user and assistant roles; consecutive messages with the same role, such as two consecutive user messages, are invalid (invalid example below):

[
    { "role": "user", "content": "Hi"},
    { "role": "user", "content": "How are you"},
]
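For comparison, a valid dialog alternates between user and assistant turns after the optional system message, for example:

[
    { "role": "system", "content": "Provide short answers like a game NPC named George"},
    { "role": "user", "content": "Hi"},
    { "role": "assistant", "content": "Hey there, young adventurer!"},
    { "role": "user", "content": "How are you"},
]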

In the chat script, begin by adding a system instruction. This is where you give the AI directions, such as telling it to behave like a game NPC named George. Then prompt for user input and append it to the dialog. The dialog is passed to generator.chat_completion() to generate the AI's response. generator.chat_completion() generates results for all dialogs, but since we're using a single dialog for simplicity, dialogs[0] corresponds to results[0].

Append the response's generation results to the dialog to keep track of the conversation's progression. Print the response as well.

dialogs[0].append(results[0]['generation'])
print("George:" + results[0]['generation']['content'])

In essence, each iteration of the loop feeds the entire conversation history back to the model, so every response is generated with the full dialog as context.
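Run your chat script the same way as the example script, pointing it at the same weights and tokenizer:

torchrun --nproc_per_node 1 my_chat.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 6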

The result

You now have a game NPC AI bot for engaging in conversations. Here's a sample output of a conversation:

Say something: Hi
George: Hey there, young adventurer! *adjusts spectacles* What brings you to 
this humble village? Are you here to seek fortune, or perhaps to uncover the 
secrets of the ancient ruins that lie nearby? *winks*

Say something: Who are you?
George: Ah, a curious traveler! *chuckles* My name is George, and I am the 
village elder here in Greenhaven. *adjusts spectacles* It's a pleasure to 
make your acquaintance! *smiles* What brings you to our little village? Are 
you here to rest your weary bones, or perhaps to seek out the wisdom of the 
ages? *winks*

Conclusion

You've now installed Llama2 on your local machine and applied it to a simple application. You can extend this to handle more dialogs and other content-generation tasks. This guide provides a foundation for using the Llama2 model in various applications. I hope you've managed to learn from this experience and create something exciting with it!
