How to Set Up a Simple Command Line Interface Chat With Ollama

In this project, I will show you how to install Ollama, download models, and use the API to integrate them into your app.

The main purpose of this project is to show examples of how streaming and non-streaming API requests work within the Ollama environment.

If you just want to get some examples, here is the GitHub Repo Link.

Before we start, there are some requirements:

Step 1 - Pre-Requisites

Ollama Installation:

macOS

If you're running macOS, use the official download link:
Ollama Download for macOS

Windows Preview

For Windows, either use the official .exe below or use WSL and the Linux command.
Ollama Download for Windows

Linux

On Linux, just paste the following into your terminal:

curl -fsSL https://ollama.com/install.sh | sh

Python Environment

You'll need Python 3.12 or above.

Setting up a local Python environment is quite simple. Start by creating your project directory and moving into it:

mkdir my-project && cd my-project

Now, in your directory, create a virtual environment. You can name the virtual environment directory anything, but to keep it consistent with the guide, use .venv:

python3 -m venv .venv

Now, to activate the virtual environment:

source .venv/bin/activate

To confirm that the virtual environment is active, use this command to show which Python interpreter is being used; it should print a path inside .venv:

which python
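You can also check from inside Python itself. Here is a tiny sketch (the path in the comment is only an illustrative example):

import sys

# Path of the interpreter running this script; with the virtual environment
# active, it should point inside .venv, e.g. /path/to/my-project/.venv/bin/python
print(sys.executable)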

Step 2 - Ollama Setup

Now that you have Ollama set up, I will list some useful commands that will help you navigate the CLI for Ollama. I will also list some of my favourite models for you to test.

Important Commands

  1. Use ollama serve to start your Ollama API instance. If you get an error saying that port :11434 is already in use, just navigate to localhost:11434 and you should see a message saying Ollama is running. No further activation is necessary.
ollama serve
  2. Use ollama pull <model name> to pull a model from the Ollama repository. In some instances, you might need to specify which version you want to download. For example, 8B, 70B, 405B, etc.
ollama pull llama3.1 

Or, to specify a larger model (you will need more RAM and compute power to run these models on your PC):

ollama pull llama3.1:70b 
  3. Use ollama list (this one is the most important in my opinion), as it will list all the models you have installed locally (a Python/API equivalent is sketched after this list):
ollama list
  4. Use ollama rm <model name> to remove a model from your workspace:
ollama rm <model-name>
  5. If you are learning about Modelfiles and how to set a system prompt that is applied before the model generates its response, you'll need to know how to "create" a model. This essentially defines a response pattern for your LLM to apply before it replies to the user's text.

This example is from the Ollama GitHub:

  • First, create a modelfile:
FROM llama3.1

# Set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1

# Set the system message
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""

Then you'll need to create the model:

ollama create <name-of-new-model> -f ./Modelfile

Then just run the new model as shown below:

ollama run <name-of-new-model>
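As a side note, the same information that ollama list shows is also available over the API. Here is a minimal Python sketch; it assumes the server started by ollama serve is running on the default port, uses the /api/tags endpoint, and relies on the requests library that we install in Step 3:

import requests

# Ask the local Ollama server which models are installed.
# GET /api/tags is the API counterpart of `ollama list`.
response = requests.get("http://localhost:11434/api/tags")
response.raise_for_status()

for model in response.json().get("models", []):
    print(model["name"])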

My Personal Favourite Models

Model | Parameters | Size | Download
Llama 3.1 | 8B | 4.7GB | ollama run llama3.1
Mistral-Nemo | 12B | 7.1GB | ollama run mistral-nemo
CodeLlama | 7B | 3.8GB | ollama run codellama
Snowflake-Arctic-Embed | 335M | 669MB | ollama run snowflake-arctic-embed
Phi 3 Medium | 14B | 7.9GB | ollama run phi3:medium
Gemma 2 | 9B | 5.5GB | ollama run gemma2
CodeGemma | 7B | 5.0GB | ollama run codegemma

Step 3 - Creating a Custom Command Line Interface for Your Models

In this step, you can either git clone the repo with my examples or code along with this guide to learn more about the model environment.

git clone https://github.com/LargeLanguageMan/python-ollama-cli

Ollama API Requests

Ollama automatically exposes a REST API (on localhost port 11434) for generating model responses.

This is a POST request with streaming enabled (the default), meaning each response token is sent back as an individual JSON chunk:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt":"Why is the sky blue?"
}'

This is a POST request with streaming disabled, meaning the result is a single JSON object with the entire LLM answer in its response field:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

You will primarily need two libraries to handle these requests: the built-in json library and the requests library.

While your virtual environment (.venv) is active, install requests with:

pip install requests

Within the requests library, the post() function can send our payload as long as we serialize it to JSON, as shown below for streaming:

response = requests.post(url, headers=headers, data=json.dumps(data), stream=True)

And below for non-streaming:

response = requests.post(url, headers=headers, data=json.dumps(data))

Now we need to define those parameters and the payload.

  1. The URL is our localhost with open port 11434:
url = "http://localhost:11434/api/generate"
  2. Our headers are of type application/json:
headers = {"Content-Type": "application/json"}
  3. Lastly, our payload in the POST request is:
data = {
    "model": model,
    "prompt": prompt
}

Or, we can set the stream flag to false, so the response is a single JSON object instead of a sequence of chunks:

data = {
    "model": model,
    "prompt": prompt,
    "stream": False
}

(Option 1) Streaming - Decoding the Data

Now, with some simple JSON handling, we can decode each chunk as it arrives, collect it, and later print the full answer back to the console:

all_chunks = []
# Each non-empty line of the streamed response is one JSON chunk
for chunk in response.iter_lines():
    if chunk:
        decoded_data = json.loads(chunk.decode('utf-8'))
        all_chunks.append(decoded_data)
return all_chunks  # returned from inside the request-handling function

This will create a list of objects. To produce the final output, we loop over the chunks and join each partial response, like so:

obj = ""
for chunk in result:
    obj = obj + chunk["response"]
print(obj)
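To tie the streaming pieces together, here is a minimal, self-contained sketch (the function name generate_streaming is my own, not taken from the repo):

import json
import requests

def generate_streaming(model, prompt):
    """Send a streaming request and collect every JSON chunk it returns."""
    url = "http://localhost:11434/api/generate"
    headers = {"Content-Type": "application/json"}
    data = {"model": model, "prompt": prompt}

    response = requests.post(url, headers=headers, data=json.dumps(data), stream=True)

    all_chunks = []
    for chunk in response.iter_lines():
        if chunk:
            all_chunks.append(json.loads(chunk.decode('utf-8')))
    return all_chunks

result = generate_streaming("llama3.1", "Why is the sky blue?")

# Join the partial "response" fields from each chunk into the full answer
obj = ""
for chunk in result:
    obj = obj + chunk["response"]
print(obj)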

(Option 2) For Non-Streaming

This option is much easier to handle, as the response is a single object that we can return as parsed JSON:

response = requests.post(url, headers=headers, data=json.dumps(data))
return response.json()

Then we can print:

print(result['response'])
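Put together, the non-streaming path can look like this minimal sketch (again, the function name is my own):

import json
import requests

def generate(model, prompt):
    """Send a single non-streaming request and return the parsed JSON body."""
    url = "http://localhost:11434/api/generate"
    headers = {"Content-Type": "application/json"}
    data = {"model": model, "prompt": prompt, "stream": False}

    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response.json()

result = generate("llama3.1", "Why is the sky blue?")
print(result['response'])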

In my example programs, I have set up an input to choose the model and type your prompt into the CLI. The script then generates a response from the LLM you chose. See the picture below:

This is my example with no streaming
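For reference, here is a rough sketch of that kind of CLI loop, assuming the non-streaming endpoint and an "exit" keyword to quit (both are my own choices, not necessarily what the repo does):

import json
import requests

URL = "http://localhost:11434/api/generate"
HEADERS = {"Content-Type": "application/json"}

# Ask once which installed model to talk to, then loop over user prompts
model = input("Which model do you want to use? (e.g. llama3.1): ").strip()

while True:
    prompt = input("You: ").strip()
    if prompt.lower() in ("exit", "quit"):
        break
    payload = {"model": model, "prompt": prompt, "stream": False}
    response = requests.post(URL, headers=HEADERS, data=json.dumps(payload))
    print("LLM:", response.json()["response"])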