I can run the Wizard 30B ggml model in CPU mode using a Ryzen 5700 and 16GB of system RAM, not GPU VRAM. It's slow, but if I ask it to write a Haiku it's slow on the order of “go brew some coffee and come back in 10 minutes”, and it does it very well. Running it overnight on something like “summarize an analysis of topic X”, it does a reasonable job, and it can produce answers to questions only slightly less well than ChatGPT (3.5). The Wizard 13B model runs much faster, maybe 2-3 tokens per second.

A little more than a month ago that wasn't possible, at least not with my level of knowledge of the tooling involved. Now it requires little more than running an executable and some minor troubleshooting of Python dependencies (on another machine it “just worked”). It is free, private, and runs on a midrange laptop. So: don't think of these posts as “doing it just because you can and it's fun to tinker.”

The next step is GPU acceleration: building llama.cpp with CUBLAS support and running it with some of the model's layers offloaded to VRAM. The detailed steps:

Certainly! Here are the detailed steps for the tasks you mentioned:

0. Make sure you have a PC with an NVidia GPU running Ubuntu, and you have already set up the NVidia drivers and CUDA Toolkit.
1. Download the weights for the model you want to use, such as "5_1.bin". You can use a web browser or a command-line tool like `wget` to download the file.
2. Clone the llama.cpp repository using `git`. Make sure you have `git` installed on your system.
3. Open a terminal and navigate to the directory where you want to clone the repository, then execute the following command:
4. Navigate into the cloned `llama.cpp` directory:
5. Compile the code with support for CUBLAS (BLAS on GPU) by running the following command:

After the compilation is successful, you can run the resulting `main` executable with the `-ngl` option set to 18. This will attempt to load 18 layers of the model into the GPU's VRAM instead of the system's RAM. Use the following command to run the executable:

Make sure to replace the example URLs and filenames with the actual ones you are using. Please note that the steps provided are based on the information you provided, and it assumes you have the necessary dependencies and permissions to perform these actions. If you encounter any issues or errors during the process, please provide more details, and I'll be happy to assist you further.

Certainly! Here are the detailed steps for the tasks you mentioned in a table format:

| Step | Description |
| --- | --- |
| 0 | Make sure you have a PC with an NVidia GPU running Ubuntu, and you have already set up the NVidia drivers and CUDA Toolkit. |
| 1 | Download the weights for the model you want to use, such as "5_1.bin". |
| 2 | Clone the llama.cpp repository using `git`. |
| 3 | Compile the code with support for CUBLAS (BLAS on GPU) using `make LLAMA_CUBLAS=1`. |
| 4 | Run the resulting `main` executable with the `-ngl` option set to 18. |

Here's how the commands can be represented in a table format:

(A sketch of that command sequence appears at the end of this post.)

I am testing it on an AWS instance and the speedup effect is not as consistent as I hope:

    llama_print_timings:        load time =   3725.08 ms
    llama_print_timings:      sample time =    612.06 ms /  536 runs   (    1.14 ms per token)
    llama_print_timings: prompt eval time =  13876.81 ms /  259 tokens (   53.58 ms per token)
    llama_print_timings:       total time = 239423.46 ms

    llama_print_timings:        load time =   7638.95 ms
    llama_print_timings:      sample time =    280.81 ms /  294 runs   (    0.96 ms per token)
    llama_print_timings: prompt eval time =   2197.82 ms /    2 tokens ( 1098.91 ms per token)
    llama_print_timings:       total time = 120788.82 ms
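To make the steps concrete, here is a rough sketch of the command sequence they describe. The repository, the `make LLAMA_CUBLAS=1` build flag, and the `-ngl 18` option come from the post itself; the weights URL, the model filename (`wizard-30b.ggmlv3.q5_1.bin`), and the prompt in the last line are placeholders I made up, not the author's exact commands.

```bash
# 1. Download the quantized weights (URL and filename are placeholders).
wget https://example.com/models/wizard-30b.ggmlv3.q5_1.bin

# 2-3. Clone the llama.cpp repository and enter it.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# 4. Build with CUBLAS (BLAS on GPU) support.
make LLAMA_CUBLAS=1

# 5. Run the resulting `main` binary, offloading 18 layers to GPU VRAM.
./main -m ../wizard-30b.ggmlv3.q5_1.bin -ngl 18 -p "Write a haiku about coffee."
```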
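And since the speedup was inconsistent, one way to sanity-check it (my suggestion, not something from the post) is to run the same prompt with and without offloading and compare the timing lines. The filename and prompt are again placeholders.

```bash
# Hypothetical A/B check: same model and prompt, CPU-only vs. 18 layers offloaded.
# Compare the per-token timings printed at the end of each run.
for layers in 0 18; do
  echo "=== -ngl $layers ==="
  ./main -m ../wizard-30b.ggmlv3.q5_1.bin -ngl "$layers" -n 128 \
         -p "Summarize the plot of Moby-Dick in three sentences." 2>&1 \
    | grep "llama_print_timings"
done
```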