Deploy the Mistral 7b Generative Model on an A10 GPU on AWS

Summary

This NLP Cloud course shows how to deploy and use the Mistral 7b generative AI model on an NVIDIA A10 GPU on AWS.

The Mistral 7b model beats LLaMA 2 7b on all benchmarks and LLaMA 2 13b on many benchmarks. It is even on par with the LLaMA 1 34b model.

Deploying and using it requires at least 15GB of VRAM, which is why we need an A10 GPU with its 24GB of VRAM.

Transcript

Hello, this is Julien Salinas from NLP Cloud.

Today, we are going to see how to deploy the Mistral 7b generative model on an A10 GPU on AWS. Here we go.

Mistral 7b is a state-of-the-art generative model released by a French company called Mistral AI.

This model was released in September 2023 and beats Llama 2 7b on all the official benchmarks.

Even more interestingly, it also beats Llama 2 13b on many benchmarks, and it is on par with Llama 1 34b.

Mistral AI released this model under the Apache 2.0 license, which allows you to use it however you want.

The team released both a foundational model and a fine-tuned chat version.

We are going to deploy the chat version in this video today.

Mistral 7b requires at least 14 gigs of video memory (VRAM), and more in the case of a large context size.
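As a rough sanity check on that number: the model has about 7.2 billion parameters, and each FP16 weight takes 2 bytes, so the weights alone occupy roughly 7.2 billion × 2 bytes ≈ 14.5 GB, before counting the activations and KV cache that grow with the context length.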

So we are going to deploy it on an NVIDIA A10 GPU on AWS, as this GPU has 24 gigs of VRAM and is quite cost-effective.

The easiest way to deploy Mistral 7b is to use the Hugging Face framework and follow Mistral AI's official guidelines.

As a first step, we will need to select the right AWS machine.

There are tons of machines on AWS, so the best advice I can give you is to start with this instance types page and then go to Accelerated Computing on the left.

Here, you have a list of all the accelerated hardware instances that AWS provides, and the one we want today is the G5.

As you can see here, G5 embeds an A10 GPU, which is what we want.

There are several flavors of G5 instances.

Some only have one GPU, some have four or eight GPUs.

One GPU is enough for us because a single GPU has enough VRAM, but we have to be very careful about the amount of RAM the instance has, because when we start the Mistral 7b model, we will temporarily need some memory to load the weights.

That is why we will select a g5.4xlarge instance today: its 64 gigs of RAM should be enough.

Now, I'm switching to my AWS console, and I click Launch Instance.

Let's call it Test A10 Mistral.

We will select the Ubuntu OS, but there is a trick.

We do not want to select the standard Ubuntu OS because we will have to manually install the NVIDIA drivers on it, which is very painful.

What we will do is select the Deep Learning AMI GPU PyTorch image here, which is much better because this AMI comes with Ubuntu plus the NVIDIA drivers, the CUDA toolkit, PyTorch, and other things we will need for our tests today.

Here, we select the g5.4xlarge instance.

If you don't have a key pair, you need to create one.

If this is the first time you do this and you're not exactly sure how to connect VS Code to your AWS instance, I recommend that you watch our dedicated video about remote development environment with VS Code on AWS.

No need to open other ports, and I recommend that you add maybe 100 gigs of disk.

In theory, the model should only take 20 gigs of disk, but it's always best to have more because we may need to install extra libraries, so here we will be safe.

Let's click Launch Instance.
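By the way, if you prefer to script this step instead of clicking through the console, here is a minimal sketch with boto3. The AMI ID, key pair name, and region below are placeholders you would replace with your own values:

```python
# Hypothetical alternative to the console: launching the same machine with boto3.
# The AMI ID, key pair name, and region are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: the Deep Learning AMI GPU PyTorch ID in your region
    InstanceType="g5.4xlarge",
    KeyName="my-key-pair",            # placeholder: an existing key pair
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        # 100 gigs of disk, as discussed above
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 100}},
    ],
    TagSpecifications=[
        {
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "Test A10 Mistral"}],
        },
    ],
)

print(response["Instances"][0]["InstanceId"])
```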

Good, it's created.

If you have a quota problem, maybe because it is your first time launching an A10 GPU, I recommend that you reach out to AWS support.

I'm now taking the public IP here, and now I'm switching to VS Code.

At the bottom left, you need to connect the current window to a host, and first, you need to configure your hosts.

Here, this is the IP address I just retrieved from AWS, and this is my SSH key.
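For reference, the host entry in the SSH config file looks roughly like this; the alias, IP address, and key path are placeholders:

```
Host mistral-7b
    HostName 203.0.113.10            # placeholder: the public IP copied from the AWS console
    User ubuntu                      # default user on Ubuntu AMIs
    IdentityFile ~/.ssh/my-key.pem   # placeholder: the key pair created earlier
```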

I'm saving the file, then I do the same thing again, and this time I select the Mistral 7B host.

I want to accept the new fingerprint.

Perfect.

Now we are on our A10 GPU machine.

Let's check first if the GPU is available with the right drivers.

With nvidia-smi... perfect.

I can see that I have an A10 GPU here and that it is empty, so I have almost 24 gigs of VRAM available for my model today.
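If you prefer checking from Python, a quick sketch like this confirms the same thing (run it inside the AMI's preinstalled PyTorch environment):

```python
# Quick sanity check that PyTorch sees the GPU with working drivers.
import torch

print(torch.cuda.is_available())        # should print True
print(torch.cuda.get_device_name(0))    # should mention the A10
total_vram = torch.cuda.get_device_properties(0).total_memory
print(f"{total_vram / 1024**3:.1f} GB") # close to 24 GB on an A10
```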

I'm creating a test directory that I will open with VS Code, and now I'm creating a test file.

Maybe let's call it infer.py.

So what should we put in this infer.py file? Easy.

Let's go to our Mistral AI model on Hugging Face.

If this is the first time you are downloading a model on Hugging Face, basically you can go here to models, and you have tons of models available.

You can click here and type Mistral 7B.

As you can see, Mistral was already at the top of the list because it is very trendy these days.

I'm going to select the Instruct model because it is more fun to play with today, and here I'm just following the guidelines from the Mistral AI team.

So I simply copy-paste the code in VS Code.

It will not work as is because before this, we will need to install the Transformers library.

So as this Mistral 7B model has just been added to Transformers, it is not yet available in the PyPI package, but that's not a problem.

We will install Transformers from the GitHub repository directly.
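Concretely, installing Transformers straight from GitHub is a single command (note that later stable releases added Mistral support, so a plain pip install transformers may be enough today):

```bash
pip install git+https://github.com/huggingface/transformers.git
```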

Good.

Now, Transformers is correctly installed.

The last thing we need to do is use the floating-point 16 (FP16) version of the model. If we use the default FP32 version, it will be too big for our A10 GPU (roughly 29 gigs for the weights alone), and most of the time the difference between FP16 and FP32 is absolutely not noticeable for this kind of model.

So what we need to do today is import torch and add the torch_dtype parameter when loading the model.
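Putting it all together, infer.py looks roughly like this. It is a sketch that follows the model card's example from that time; helper names like apply_chat_template may differ slightly across Transformers versions:

```python
# infer.py: a sketch following Mistral AI's guidelines at the time,
# with torch_dtype added so the model fits in 24 gigs of VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# FP16 halves the memory footprint compared to the default FP32 weights.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.to("cuda")

messages = [
    {"role": "user", "content": "Give me a recipe for mayonnaise."},
]

# apply_chat_template wraps the conversation in the [INST] ... [/INST]
# tags that the Instruct model was fine-tuned on.
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

generated_ids = model.generate(model_inputs, max_new_tokens=500, do_sample=True)
print(tokenizer.batch_decode(generated_ids)[0])
```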

Good.

Now, let's try to run the inference script.

Good.

So we have a proper recipe about mayonnaise.

Maybe we can try something else.

Let's ask the model how to install Transformers on a Linux server.

We can remove this.
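In the script, that just means replacing the messages list, along these lines:

```python
messages = [
    {"role": "user", "content": "How do I install the Transformers library on a Linux server?"},
]
```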

Good.

So the model labeled the code block as CSS, I'm not sure why.

The answer sounds correct, except for that CSS markup.

I think that's enough to show you that this is a nice 7B model, and now you know how to use it, so it's your turn now.

You now know how to deploy the Mistral 7B model on your own server.

As you can see, it is not necessarily complex, especially because we are only using one single GPU today.

If you do not have an A10 GPU with enough VRAM, you might need several smaller GPUs.

In that case, you will need to split your model across several smaller GPUs.

It will be a bit more complex, and we will need another dedicated video for this.
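In the meantime, here is a quick teaser sketch of the usual approach, assuming you have the accelerate library installed: device_map="auto" lets Transformers shard the weights across all visible GPUs automatically.

```python
# Teaser sketch, assuming `pip install accelerate` has been run:
# device_map="auto" shards the FP16 weights across every visible GPU.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)
```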

Have a nice day.