+++
title = "Running Vicuna-13B in Google Cloud"
date = 2023-05-06T00:00:00
lastmod = 2023-05-06T00:00:00
draft = false

# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Carl Pearson"]

tags = []

categories = []

summary = "How to experiment with hosting Vicuna-13B on a cloud VM"

# Projects (optional).
#   Associate this post with one or more of your projects.
#   Simply enter your project's folder or file name without extension.
#   E.g. `projects = ["deep-learning"]` references
#   `content/project/deep-learning/index.md`.
#   Otherwise, set `projects = []`.
projects = []

# Featured image
# To use, add an image named `featured.jpg/png` to your project's folder.
[image]
  # Caption (optional)
  caption = ""

  # Focal point (optional)
  # Options: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight
  focal_point = "Center"

  # Show image only in page previews?
  preview_only = false

# Set captions for image gallery.
+++

[Vicuna-13B](https://lmsys.org/blog/2023-03-30-vicuna/) is an LLM chatbot based on the LLaMA model.
Its authors claim it achieves 90% of the quality of ChatGPT in a "fun and non-scientific" evaluation.

You can rent some cloud hardware and experiment with Vicuna-13B yourself!
Running CPU-only is slow (a couple of tokens per second), but fast enough for you to get an idea of what to expect.

## Set up your Cloud Instance

Create a cloud VM with
* 150 GB of disk space
* 64 GB of CPU memory

I used a Google Compute Engine `e2-standard-16`, which costs around $0.70/hour, so it may not be something you want to leave running. You can stop the instance when you're not using it.

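If you'd rather create the instance from the CLI than the web console, something like this should work. The instance name, zone, and image here are placeholders I chose; the machine type and disk size match the numbers above.

```
# instance name, zone, and image are placeholders; pick your own
gcloud compute instances create vicuna-test \
  --machine-type=e2-standard-16 \
  --boot-disk-size=150GB \
  --image-family=debian-11 \
  --image-project=debian-cloud \
  --zone=us-central1-a
```
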
When everything was done, my VM had 132 GB of disk space used.

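You can watch the boot disk fill up as you work through the steps below:

```
df -h /
```
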
Ordinarily I wouldn't recommend setting up Python like this (see the sketch just below for what I'd normally do), but since we're just experimenting:

```
apt-get install python3-pip
```

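For the record, the more hygienic setup would be a virtual environment, so everything `pip` installs stays isolated from the system Python. A quick sketch (the directory name is arbitrary):

```
apt-get install python3-venv
python3 -m venv $HOME/vicuna-env    # create an isolated environment
. $HOME/vicuna-env/bin/activate     # pip/python3 now refer to the venv
```
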
## Acquire the LLaMA-13B model

For licensing reasons, Vicuna-13B is distributed as a delta on top of the LLaMA model, so the first step is to acquire the LLaMA weights.
The official way is to request them from Meta by filling out this [Google Docs form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform?usp=send_form).

You can also use leaked weights from a torrent with the following magnet link:

`magnet:?xt=urn:btih:b8287ebfa04f<HASH>cf3e8014352&dn=LLaMA`

> NOTE
> replace `<HASH>` above with this: `879b048d4d4404108`

Or, someone has made the leaked weights available on IPFS, which you can access through a helpful mirror:

https://ipfs.io/ipfs/Qmb9y5GCkTG7ZzbBWMu2BXwMkzyCKcUjtEKPpgdZ7GEFKm/

I couldn't figure out how to get a torrent client working on Google's VMs (perhaps a firewall issue), so I ended up using [aria2c](https://aria2.github.io/) to download the LLaMA weights from the IPFS mirror above.

```
apt-get install aria2

mkdir -p $HOME/llama/13B
cd $HOME/llama/13B
aria2c https://ipfs.io/ipfs/QmPCfCEERStStjg4kfj3cmCUu1TP7pVQbxdFMwnhpuJtxk/consolidated.00.pth
aria2c https://ipfs.io/ipfs/QmPCfCEERStStjg4kfj3cmCUu1TP7pVQbxdFMwnhpuJtxk/consolidated.01.pth
aria2c https://ipfs.io/ipfs/QmPCfCEERStStjg4kfj3cmCUu1TP7pVQbxdFMwnhpuJtxk/checklist.chk
aria2c https://ipfs.io/ipfs/QmPCfCEERStStjg4kfj3cmCUu1TP7pVQbxdFMwnhpuJtxk/params.json

# the conversion script later on expects tokenizer.model one level up from 13B/
cd $HOME/llama
aria2c https://ipfs.io/ipfs/Qmb9y5GCkTG7ZzbBWMu2BXwMkzyCKcUjtEKPpgdZ7GEFKm/tokenizer.model
```

The `consolidated` files are the weights.
`checklist.chk` has the md5 sums for the files, which you should check after they're downloaded (see below).
`params.json` holds the model's hyperparameters (dimensions, layer and head counts, and so on).
Finally, `tokenizer.model` is needed to convert the weights to HuggingFace format.

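`checklist.chk` uses the standard `md5sum` format, so verifying the downloads is one command, run from the directory that holds the files it lists. If any file is missing or corrupted, `md5sum` will say so.

```
cd $HOME/llama/13B
md5sum -c checklist.chk   # prints "OK" next to each file that matches
```
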
## Convert weights to HuggingFace Format

I used [rev d2ffc3fc4 of the conversion script](https://github.com/huggingface/transformers/blob/d2ffc3fc48430f629c38c36fa8f308b045d1f715/src/transformers/models/llama/convert_llama_weights_to_hf.py) from the `transformers` repository. Download it, install the dependencies, and run it on the downloaded weights (note that `wget` needs the raw file URL, not the GitHub page):

```
apt-get install wget

wget https://raw.githubusercontent.com/huggingface/transformers/d2ffc3fc48430f629c38c36fa8f308b045d1f715/src/transformers/models/llama/convert_llama_weights_to_hf.py

pip install torch transformers accelerate sentencepiece protobuf==3.20

python3 convert_llama_weights_to_hf.py --input_dir $HOME/llama --output_dir $HOME/llama-hf --model_size 13B
```

These are the package versions that worked for me (note `protobuf==3.20` in the pip install command):

| package         | version |
|-----------------|---------|
| `torch`         | 2.0.0   |
| `transformers`  | 4.28.1   |
| `accelerate`    | 0.18.0  |
| `sentencepiece` | 0.1.99  |
| `protobuf`      | 3.20.0  |

I got an error about regenerating protobuf functions if I used protobuf > 3.20.

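Before moving on, a cheap sanity check that the conversion produced something loadable is to round-trip the tokenizer; this catches most conversion mistakes without the tens of GB of RAM that loading the full model would need. A sketch, assuming the `$HOME/llama-hf` output directory from above:

```
python3 -c "from transformers import AutoTokenizer; \
print(AutoTokenizer.from_pretrained('$HOME/llama-hf').tokenize('Hello, Vicuna!'))"
```
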
## Apply the Vicuna deltas

[FastChat](https://github.com/lm-sys/FastChat) has done the work of setting up a little chat interface.
We'll use their package to download the deltas and apply them as well.

```
pip install fschat

python3 -m fastchat.model.apply_delta \
    --base-model-path $HOME/llama-hf \
    --target-model-path $HOME/vicuna-13b \
    --delta-path lmsys/vicuna-13b-delta-v1.1
```

I had `fschat 0.2.5`.

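At this point the disk holds several full copies of the 13B weights: the original download, the converted `llama-hf`, the downloaded delta (in the HuggingFace cache), and the merged `vicuna-13b`. Only the last one is needed for chatting, so once the merge succeeds you can reclaim space if you like:

```
# optional cleanup: only $HOME/vicuna-13b is needed from here on
rm -rf $HOME/llama $HOME/llama-hf
rm -rf $HOME/.cache/huggingface   # the downloaded vicuna delta lives here
```
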
## Start Chatting

This will open up a little ChatGPT-style interface in your terminal. Expect the couple-of-tokens-per-second pace mentioned at the top.

```
python3 -m fastchat.serve.cli --device cpu --model-path $HOME/vicuna-13b
```