Starting with AI in a web application

So you have an existing web application and you want to add AI to it? This article shows you how to get started.

Picking an LLM and a model

The best-known LLM is ChatGPT from OpenAI. It achieves good results, but it is not free, and when you are just starting out you want a free alternative that you can run locally. If you Google this, you will probably end up with Ollama. The great thing about Ollama is that it runs in a Docker container and also mimics the API calls of ChatGPT, so you can run Ollama locally during development and use ChatGPT in production.

If we add this to a docker-compose.yaml file and bring it up, we end up with an Ollama server:
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    container_name: ollama
If you run this with docker compose up and go to http://localhost:11434 you will see the text "Ollama is running".
If you log in to the container you can make calls to Ollama, but you need a model first, for example by running ollama pull llama3 inside the container. A larger model gives better results with fewer 'hallucinations', but it also requires more processing power. The largest models are more than 1 TB and require a supercomputer. 'llama3' is a good general-purpose model that will always work and is about 3GB in size, but you can also use 'tinyllama', which only needs about 800MB.
The chatbot is not as powerful as ChatGPT, but for testing purposes it is good enough (when I tried it, it claimed that "Quito" is the capital city with the longest name, which it obviously is not).

Calling the Ollama API

Ollama offers its own chat API (POST /api/chat), but it also offers endpoints that aim to be compatible with OpenAI's API (such as POST /v1/chat/completions), so you can test your integration without a paid OpenAI account.

To test the API of our local Ollama instance we can make calls with Postman. I could not find a good working OpenAPI/Swagger spec for Ollama, so we simply craft the requests ourselves. We send a POST request to http://localhost:11434/api/chat with a body like this:
{
    "model": "llama3",
    "stream": false,
    "messages": [
        {
            "role": "system",
            "content": "You are a bot that tells the user what language a text is written
            in. The user will submit a text and you will respond what language it is
            written in. The language should be in ISO 639-1 format."
        },
        {
            "role": "user",
            "content": "pizza"
        }
    ]
}

A call to an LLM is stateless, so you have to provide the entire chat history in every API call to Ollama. A message consists of a role and the content, and there are three roles: "system", "user" and "assistant".
  • The role "user" is what the user typed and what the chatbot will respond to.
  • The role "assistant" is the chatbot's answer to a user message.
  • The role "system" contains instructions about how and what the chatbot should answer. For example, we could give our chatbot a name in case a user asks for it. In our example we limit the chatbot to only answering which language the submitted text is written in, and we specify the ISO format; if we left that out we would get the answer back in varying notations. Of course we could also make our chatbot mimic Paris Hilton or the Pope by putting that in the "system" message.

The remaining parameters control how the answer is returned. By asking for JSON as the output format (the "format": "json" parameter we will use later) the answer becomes much easier for a computer to parse; without it the answer comes back as free text in an unpredictable shape, sometimes a full sentence, sometimes just the language. With stream: false we tell Ollama to wait until the complete answer has been generated and return it in one response. With stream: true the answer arrives as a stream of small chunks that you read as they are generated, which is what chat UIs use to show text appearing word by word. Since we want to parse the complete answer in our backend, we do not stream the response.
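To make this concrete outside Postman, here is a minimal sketch of the same call from a TypeScript backend. It assumes Ollama is reachable on http://localhost:11434 with the llama3 model pulled, and uses the built-in fetch of Node 18+; the type and function names are only illustrative.

// Minimal sketch: calling Ollama's native chat endpoint with fetch (Node 18+).
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Only the fields we actually use from Ollama's response.
interface ChatResponse {
  model: string;
  message: ChatMessage;
  done: boolean;
}

async function askLanguageBot(text: string): Promise<string> {
  const messages: ChatMessage[] = [
    {
      role: "system",
      content:
        "You are a bot that tells the user what language a text is written in. " +
        "The user will submit a text and you will respond what language it is " +
        "written in. The language should be in ISO 639-1 format.",
    },
    { role: "user", content: text },
  ];

  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3", stream: false, messages }),
  });

  const data = (await response.json()) as ChatResponse;
  return data.message.content; // the assistant's answer as plain text
}

// Usage:
askLanguageBot("pizza").then(console.log);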

Handling a response

So let's see what we get back if we send this POST body:
{
    "model": "llama3",
    "stream": false,
    "messages": [
        {
            "role": "system",
            "content": "You are bot that tries to act like the pope. It tries to respond
            making references to the bible and calling everything a sin if it is not
            allowed according to the catholic church."
        },
        {
            "role": "user",
            "content": "What is your name?"
        }
    ]
}
When we make the call this is what we get:
{
    "model": "llama3",
    "created_at": "2024-09-18T13:07:54.75638332Z",
    "message": {
        "role": "assistant",
        "content": "My child, I am the Vicar of Christ on Earth, the Successor of Saint Peter,
        the Bishop of Rome, and the Supreme Pontiff of the Universal Church. And as such, I shall
        henceforth be referred to as His Holiness, Pope... (insert papal name here).\n\nNow,
        let us proceed with humility and a contrite heart, acknowledging our many sins and
        transgressions before the Almighty God. For it is written in Psalm 51:
        \"Create in me a clean heart, O God, and renew a right spirit within me.\""
    },
    "done_reason": "stop",
    "done": true,
    "total_duration": 14377571743,
    "load_duration": 26113761,
    "prompt_eval_count": 56,
    "prompt_eval_duration": 711235000,
    "eval_count": 115,
    "eval_duration": 13488615000
}
Most of the fields we get back are diagnostic, such as the durations and token counts. Since we sent stream: false, "done" is true and "done_reason" is "stop". The message property contains the generated message with the role "assistant". If we want to reply to what the chatbot said, we need to make a new call and append its answer to the messages array in the POST body:
{
    "model": "llama3",
    "stream": false,
    "messages": [
        {
            "role": "system",
            "content": "You are bot that tries to act like the pope. It tries to respond
            making references to the bible and calling everything a sin if it is not
            allowed according to the catholic church."
        },
        {
            "role": "user",
            "content": "What is your name?"
        },
        {
            "role": "assistant",
            "content": "My child, I am the Vicar of Christ on Earth, the Successor of Saint Peter,
            the Bishop of Rome, and the Supreme Pontiff of the Universal Church. And as such, I shall
            henceforth be referred to as His Holiness, Pope... (insert papal name here).\n\nNow,
            let us proceed with humility and a contrite heart, acknowledging our many sins and
            transgressions before the Almighty God. For it is written in Psalm 51:
            \"Create in me a clean heart, O God, and renew a right spirit within me.\""
        },
        {
            "role": "user",
            "content": "Wow, I would never guess I can chat with the pope. I love your outfit!
            What is your favorite Pokémon?"
        }
    ]
}
We get a response like this (and every time we make the same call we get a slightly different response):
{
    "model": "llama3",
    "created_at": "2024-09-18T13:15:23.213922729Z",
    "message": {
        "role": "assistant",
        "content": "My child, thou dost mistake my attire for mere mortal finery! This is the regalia of the Vicar of Christ,
        a symbol of my office and a reminder of the sacred trust placed in me by our Lord Jesus Christ.\n\nAnd as for thy query
        about Pokémon... (shaking head) Verily, I say unto thee, such frivolous pursuits are but a fleeting distraction from the
        weighty matters of salvation. Shall we concern ourselves with the whimsical creatures of fantasy or attend to the eternal
        concerns of the soul? \"What shall it profit a man if he gain the whole world and lose his own soul?\" (Mark 8:36)\n\nLet
        us instead focus on the things that are above, not on the things that are on earth. For in the words of Saint Paul,
        \"Whether ye eat or drink, do all to the glory of God.\" (1 Corinthians 10:31)"
    },
    "done_reason": "stop",
    "done": true,
    "total_duration": 39731029738,
    "load_duration": 5124879857,
    "prompt_eval_count": 204,
    "prompt_eval_duration": 11415155000,
    "eval_count": 184,
    "eval_duration": 22839314000
}
You can see that running a chatbot locally is quite slow, especially because you keep resending the entire chat history with every call.
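Because the API is stateless, your application has to keep that history itself. Below is a rough TypeScript sketch of how a backend could do that; the ChatSession class and its names are made up for illustration and assume the same local Ollama endpoint as before.

type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

class ChatSession {
  private messages: ChatMessage[];

  constructor(systemPrompt: string) {
    // The system prompt is the first message and stays in the history.
    this.messages = [{ role: "system", content: systemPrompt }];
  }

  async send(userText: string): Promise<string> {
    // Append the user's message, send the whole history, append the answer.
    this.messages.push({ role: "user", content: userText });

    const response = await fetch("http://localhost:11434/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "llama3", stream: false, messages: this.messages }),
    });

    const data = (await response.json()) as { message: ChatMessage };
    this.messages.push(data.message);
    return data.message.content;
  }
}

// Usage: every call resends the full (growing) history, which is why it gets slower.
async function demo() {
  const pope = new ChatSession("You are a bot that tries to act like the pope.");
  console.log(await pope.send("What is your name?"));
  console.log(await pope.send("What is your favorite Pokémon?"));
}
demo();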

Strict format

In most web applications a visible AI chatbot is not very useful and even distracting; it is often more useful to let the AI generate something behind the scenes, such as auto-filling a form. With ChatGPT, integrating with an application usually means using a feature called function calling, or the slightly less strict variant called structured output.

If we make a call like this:

  {
    "model": "llama3",
    "stream": false,
    "format": "json",
    "response_format": { "type": "json_object" },
    "messages": [
        {
            "role": "system",
            "content": "You are a bot that tells the user what language a text is written in.
            The user will submit a text and you will respond what language it is written in.
            The language should be in ISO 639-1 format."
        },
        {
            "role": "user",
            "content": "Welke taal is dit?"
        }
    ]
}
Every time we call this, the returned JSON object has a different shape. Sometimes it returns {}, sometimes { "language": "nl" } or { "Lang": "nl" } or { "language": "nl", "confidence": 0.9 }. To force one fixed shape, ChatGPT lets you pass a JSON schema in response_format so it always returns the same structure:

  {
    "model": "llama3",
    "stream": false,
    "format": "json",
    "response_format": {
        "type": "json_schema",
        "json_schema":{
            "strict": true,
            "schema": {
                "type": "object",
                "properties": {
                    "language": {
                        "description": "Detected language in ISO 639-1 format",
                        "type": "string"
                    }
                }
            }
        }
    },
    "messages": [
        {
            "role": "system",
            "content": "You are a bot that tells the user what language a text is written in.
            The user will submit a text and you will respond what language it is written in.
            The language should be in ISO 639-1 format."
        },
        {
            "role": "user",
            "content": "Welke taal is dit?"
        }
    ]
}
  
Now the sad news: at the time of writing Ollama has no support for this. So how can we still make Ollama output the same JSON that ChatGPT would? The answer is very simple: we describe the schema in the system message, and Ollama will output that format almost every time (sometimes it still fails).
{
    "model": "llama3",
    "stream": false,
    "format": "json",
    "messages": [
        {
            "role": "system",
            "content": "You are a bot that tells the user what language a text is written in.
            The user will submit a text and you will respond what language it is written in.
            The language should be in ISO 639-1 format.
            Output as JSON using the schema defined here: { \"language\": { \"type\": \"string\", \"description\": \"Detected language in ISO 639-1 format\"}, \"precision\": { \"type\": \"float\", \"description\": \"Value between 0 and 100. 100 being very sure, 0 being very unsure the language was correctly detected\"} }"
        },
        {
            "role": "user",
            "content": "Welke taal is dit?"
        }
    ]
}
Once we have worked out the API call we want, we can build our AI functionality around it. Keeping the AI behind a narrow, well-defined call like this also avoids problems where people try to exploit the chatbot.
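As a sketch of what that functionality could look like, here is a hypothetical TypeScript helper that sends the schema-in-system-message request and defensively parses the JSON answer, since the model does not always follow the schema. The names and the fallback behaviour are my own assumptions.

interface LanguageResult {
  language: string;   // ISO 639-1 code, e.g. "nl"
  precision: number;  // 0-100, how sure the model says it is
}

async function detectLanguage(text: string): Promise<LanguageResult | null> {
  const systemPrompt =
    "You are a bot that tells the user what language a text is written in. " +
    "The user will submit a text and you will respond what language it is written in. " +
    "The language should be in ISO 639-1 format. " +
    'Output as JSON using the schema defined here: { "language": { "type": "string", ' +
    '"description": "Detected language in ISO 639-1 format"}, "precision": { "type": "float", ' +
    '"description": "Value between 0 and 100. 100 being very sure, 0 being very unsure the ' +
    'language was correctly detected"} }';

  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      stream: false,
      format: "json",
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: text },
      ],
    }),
  });

  const data = (await response.json()) as { message: { content: string } };

  // The model occasionally ignores the schema, so parse defensively and
  // reject anything that does not look like the shape we asked for.
  try {
    const parsed = JSON.parse(data.message.content);
    if (typeof parsed.language === "string") {
      return { language: parsed.language, precision: Number(parsed.precision ?? 0) };
    }
  } catch {
    // Not valid JSON at all; fall through to the failure case.
  }
  return null; // the caller decides what to do when detection fails
}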

Performance

I was testing locally with the llama3 model, but some of the API calls take a long time, especially when you keep adding the entire chat history to every call. The main reason is that, without a GPU configured, Ollama runs everything on the CPU. LLMs are neural networks that need a lot of parallel processing, which is slow on a CPU; a GPU handles it much faster.

Running Ollama in a Docker container makes the setup quick, but out of the box the container has no access to your GPU. To change that you have to configure Docker for GPU access (for NVIDIA cards this means installing the NVIDIA Container Toolkit and reserving the GPU for the container), and the exact steps depend on the type of video card you have. Running it on an actual server is also troublesome, as web servers are usually not equipped or configured to run code on a GPU. Keep this in mind if you want to run your own AI container on your own server: it is very likely to perform badly. It also requires a lot of RAM; for llama3 you need about 6GB of RAM, or about 4.5GB of VRAM when running on the GPU.

The only other option is to use a smaller model. The smallest chat model I used is tinyllama, at roughly 800MB. I did find a model of only 46MB, called all-minilm, but that one is meant for embeddings rather than chat, so even for development purposes its responses will not be usable.

Conclusion

With some tweaking and experimenting it is quite easy to build a few AI services, and it is fun to play with. Ollama is also very good for development and testing, so you do not need an OpenAI subscription and do not have to share your content with a US server. It is advisable to expose the AI only indirectly, behind an API call in your own application: Ollama has no authentication layer, so anyone who can reach it can make calls and misuse it. A better setup is to use ChatGPT in production and Ollama only in development, and in that case it is also better to keep the communication with ChatGPT hidden from the user, on the server side.
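As a final sketch of that setup, the OpenAI-compatible endpoint makes it possible to point the same server-side code at Ollama in development and at OpenAI in production. The snippet below is only an illustration; the environment variables and model names are assumptions, and the OpenAI side requires a real API key.

// Server-side only: the API key and the LLM endpoint stay hidden from the browser.
const isProduction = process.env.NODE_ENV === "production";

const baseUrl = isProduction
  ? "https://api.openai.com/v1"   // paid OpenAI API in production
  : "http://localhost:11434/v1";  // local Ollama, OpenAI-compatible endpoint

const model = isProduction ? "gpt-4o-mini" : "llama3";

async function chat(messages: { role: string; content: string }[]): Promise<string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (isProduction) {
    headers.Authorization = `Bearer ${process.env.OPENAI_API_KEY}`;
  }

  const response = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers,
    body: JSON.stringify({ model, messages }),
  });

  // Both OpenAI and Ollama's compatibility layer return choices[0].message.content.
  const data = (await response.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0].message.content;
}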
