LLM Engineering on GitLab with CI Services

This is a repost of my original article on Siemens' blog with some formatting enhancements.

GitLab CI services with GPU acceleration enable seamless integration of LLMs into DevOps pipelines without requiring additional infrastructure.

Large language models (LLMs) have demonstrated notable capabilities on a range of natural language tasks, facilitating advanced AI applications with natural language interfaces. Developers typically use pre-trained proprietary or open-access¹ LLMs rather than building them from scratch, due to the significant resources, data, and expertise required to develop and train these complex models. A common technique for developing LLM-infused applications is static prompt engineering, i.e., guiding LLM responses toward desired outcomes with crafted instructions.

LLM engineering often involves external (cloud) services such as managed LLM inference endpoints. This architecture introduces significant complexity for developers due to service provisioning through Infrastructure as Code (IaC) and secure access management across local development environments and continuous integration (CI) pipelines.

This article presents a novel approach to LLM engineering on GitLab by leveraging CI services to run an LLM inference server for integration testing in a CI pipeline. It extends a familiar DevOps workflow with minimal changes for LLM engineering, utilizes an entirely open toolstack, and requires no additional infrastructure beyond GitLab. Its effectiveness is exemplified by developing a simple Python library that turns an LLM into a calculator with a natural language interface via zero-shot prompting with a crafted instruction. To enable production-grade use cases, we have contributed GPU support for CI services to the GitLab Runner (released with GitLab Runner v17.10) for accelerated LLM inference.

Primer

Containers with GPUs

Docker supports NVIDIA GPU acceleration for containerized applications. For this to work, we need to install Docker Engine, NVIDIA GPU drivers, and the NVIDIA Container Toolkit, which allows Docker to interface with NVIDIA GPUs. Then, we can use the --gpus flag of the docker container run command to specify the number or selection of GPUs to use in the container. The syntax is --gpus <string>, where <string> can be:

  • all: Assigns all available GPUs to the container.
  • device=<index>: Assigns specific GPUs to the container, where <index> is a comma-separated list of GPU indices (e.g., device=0 for the first GPU) or GPU UUIDs (e.g., device=GPU-ad2367dd-a40e-6b86-6fc3-c44a2cc92c7e).

GitLab CI services

GitLab CI services are auxiliary containers that run alongside the main job container during CI pipeline execution. Services complement the CI job's runtime environment by providing necessary network-accessible resources, e.g., databases like PostgreSQL, caches like Redis, testing frameworks like Playwright, or build services like Docker-in-Docker. This approach allows for a clean, isolated, and self-contained CI job setup, enhancing the developer experience and facilitating correctness, reproducibility, and efficiency.

GitLab CI services are defined using the services keyword in the .gitlab-ci.yml file. For example, the following snippet defines a job-level PostgreSQL service that is used by the CI job test to create a database:

.gitlab-ci.yml
test:
  # ⚡️ Disable PostgreSQL password protection, don't use in production!
  variables:
    POSTGRES_HOST_AUTH_METHOD: trust

  # Service containers
  services:
    - postgres:17

  # Main job
  image: postgres:17
  script:
    - psql -h postgres -U postgres -c "CREATE DATABASE my_db;"

Ollama

Ollama is an open-source framework for building and running LLMs on local infrastructure. It pulls models from the Ollama library, allowing users to download and use pre-built models locally. Additionally, it supports creating derived models through system prompts or fine-tuning, enabling customization for specific needs. Ollama also serves models for inference via an HTTP API, facilitating integration with other applications and providing real-time responses to queries. Together, these capabilities cover the full local lifecycle of an LLM, from download and customization to serving.

To use Ollama, start the server, either natively, via Docker, or via Docker with GPU acceleration:

# Natively
ollama serve

# Via Docker
docker container run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama ollama/ollama:0.6.3

# Via Docker with GPU acceleration
docker container run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama --gpus all ollama/ollama:0.6.3

Then, for example, pull a model (e.g., Microsoft's Phi-3 Mini) via the CLI or the HTTP API

# Via the CLI
ollama pull phi3:mini

# Via the HTTP API
curl -sS http://localhost:11434/api/pull -d '{"model": "phi3:mini"}'

and send a prompt:

# Via the CLI
ollama run phi3:mini "Hello world"

# Via the HTTP API
curl -sS http://localhost:11434/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Hello world",
  "stream": false
}'

LLM serving via CI services

Instead of relying on externally hosted LLMs for developing and testing LLM-infused applications, we propose using a local LLM inference server both for local development and for integration testing in a CI pipeline. Similar to running the Ollama server locally, we configure a CI service to run the Ollama server as an auxiliary container of the CI job. For this, we add a CI job specification akin to the following to the .gitlab-ci.yml file:

.gitlab-ci.yml
tests:
  services:
    - name: ollama/ollama:0.6.3
      alias: ollama
  image: curlimages/curl
  before_script: |
    curl -sS http://ollama:11434/api/pull -d '{
      "model": "phi3:mini"
    }'
  script: |
    curl -sS http://ollama:11434/api/generate -d '{
      "model": "phi3:mini",
      "prompt": "Hello world",
      "stream": false
    }'

By default, GitLab derives the hostname of a CI service from the service name. To set an explicit and concise hostname override, we use the alias key, exposing the Ollama server via http://ollama:11434.
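For reference, code running in the main job container reaches the service under that alias. With the ollama Python client used later in this article, a minimal connection sketch, assuming the service configuration above, could look like this:

import ollama

# Reach the Ollama CI service via its alias-based hostname.
client = ollama.Client(host="http://ollama:11434")
client.pull("phi3:mini")  # ensure the model is available before generating
response = client.generate(model="phi3:mini", prompt="Hello world")
print(response.response)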

While some LLMs are sufficiently lightweight to allow inference on CPUs, many LLMs require GPU acceleration to be practical. We contributed GPU support for CI services to the GitLab Runner, allowing the presented CI setup to support a broad range of real-world use cases. A GitLab Runner administrator can configure GPU access for CI services in the [runners.docker] section of the runner's config.toml file via the service_gpus setting, akin to the gpus setting:

config.toml
[[runners]]
  (...)

  [runners.docker]
    (...)
    service_gpus = "all"

On multi-GPU hosts, pinning devices by index or UUID ensures per-job exclusive access to GPUs. Further, GPUs may be shared between the main job container and the service containers by assigning the same device pins to both:

config.toml
[[runners]]
  (...)
  gpus = "device=<INDEX>"

  [runners.docker]
    (...)
    service_gpus = "device=<INDEX>"

Example

Let's exemplify the effectiveness of the presented approach by turning an LLM into a calculator with a natural language interface via zero-shot prompting with a statically crafted system prompt. For this, we create a minimal Python project with the following filesystem layout:

📁 src
└── 📁 llm_calc
    ├── 📄 py.typed
    └── 📄 __init__.py
📁 tests
├── 📄 conftest.py
└── 📄 test_calculate.py
📄 .gitlab-ci.yml
📄 pyproject.toml
📄 uv.lock

To start, we populate pyproject.toml with essential project metadata and dependency specifications:

pyproject.toml
[project]
name = "llm-calc"
version = "0.0.0"
requires-python = ">=3.10"
dependencies = ["ollama>=0.4.7", "pydantic>=2.11.1"]

[dependency-groups]
dev = ["pytest==8.3.5"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

Then, we install the dependencies in a virtual environment using uv:

uv sync

uv also creates the uv.lock file that contains the exact versions of installed dependencies to enable reproducible environments.

Next, we implement an LLM calculator class in src/llm_calc/__init__.py, which encapsulates the internal use of an LLM and provides a simple interface for calculating a mathematical expression in natural language form:

src/llm_calc/__init__.py
from __future__ import annotations

from textwrap import dedent
from typing import Final
from typing import Literal
from typing import TypeAlias

import ollama
import pydantic


Response: TypeAlias = (
    int
    | float
    | Literal[
        "error:zero-division",
        "error:invalid-domain",
        "error:invalid-expression",
    ]
)

ResponseModel = pydantic.TypeAdapter[Response](Response)


class LLMCalculatorError(Exception):
    def __init__(self, prompt: str, message: str) -> None:
        super().__init__(message)
        self.prompt = prompt


class InvalidDomainError(LLMCalculatorError):
    def __init__(self, prompt: str) -> None:
        super().__init__(prompt, f"Invalid math domain: {prompt}")


class InvalidExpressionError(LLMCalculatorError):
    def __init__(self, prompt: str) -> None:
        super().__init__(prompt, f"Invalid expression: {prompt}")


class LLMCalculator:
    MODEL: Final[str] = "phi3:mini"

    INSTRUCTION: Final[str] = dedent(
        """\
        Perform a mathematical calculation and return the numeric result.

        To respond with an error, return one of the following literal strings:
        - "error:zero-division" if the prompt contains a division by zero
        - "error:invalid-domain" if the prompt contains a mathematical domain error
        - "error:invalid-expression" if the prompt is an invalid mathematical expression

        Calculate: \
        """
    )

    def __init__(self, client: ollama.Client | None = None) -> None:
        self._client = client or ollama.Client()

    def calculate(self, prompt: str) -> int | float:
        response = self._client.generate(
            self.MODEL,
            prompt,
            system=self.INSTRUCTION,
            options={"seed": 0},
            format=ResponseModel.json_schema(),
        )
        result = ResponseModel.validate_json(response.response)
        match result:
            case "error:zero-division":
                raise ZeroDivisionError(prompt)
            case "error:invalid-domain":
                raise InvalidDomainError(prompt)
            case "error:invalid-expression":
                raise InvalidExpressionError(prompt)
        return result

The LLMCalculator class provides a single method, calculate, which takes a user prompt as input and returns the numeric result of the calculation. Internally, the calculate method requests a structured response from the LLM, guided by the system prompt and constrained to the JSON schema generated from the Response type via the Pydantic TypeAdapter ResponseModel. The response is then parsed and validated using ResponseModel. If one of the declared and instructed error responses is encountered, a corresponding exception is raised. Otherwise, the numeric result is returned.
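
For illustration, here is a minimal usage sketch of the library, assuming a locally running Ollama server with the phi3:mini model already pulled:

from llm_calc import InvalidExpressionError
from llm_calc import LLMCalculator

calculator = LLMCalculator()  # defaults to an ollama.Client() talking to localhost:11434
print(calculator.calculate("3 plus 7"))  # e.g., 10

try:
    calculator.calculate("foo plus bar")
except InvalidExpressionError as error:
    print(f"Rejected prompt: {error.prompt}")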

To test the expected behavior of our LLM-based calculator, we add a test suite in the tests/ directory using the pytest framework. For this, we implement fixtures in tests/test_calculate.py to instantiate an Ollama client and pull our selected LLM in the setup phase of the test suite:

tests/test_calculate.py
from __future__ import annotations

import ollama
import pytest

from llm_calc import LLMCalculator


@pytest.fixture(scope="session")
def client(request: pytest.FixtureRequest) -> ollama.Client:
    return ollama.Client(request.config.getoption("--ollama-host"))


@pytest.fixture(scope="session", autouse=True)
def pull_model(client: ollama.Client) -> None:
    client.pull(LLMCalculator.MODEL)

The client fixture is parametrized by the --ollama-host option, a custom command-line flag for the pytest executable defined in tests/conftest.py. This is necessary because the test suite runs both locally, where the Ollama server's hostname is localhost, and in a CI job, where its hostname differs, as we will see later:

tests/conftest.py
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pytest


def pytest_addoption(parser: pytest.Parser) -> None:
    """Pytest options parser."""
    parser.addoption(
        "--ollama-host", action="store", default="localhost", help="Ollama server host"
    )
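
The test cases below also use a calculator fixture; its definition is assumed here to simply wire the shared client into an LLMCalculator instance:

tests/test_calculate.py
# ...

# Assumed fixture: builds the calculator under test on top of the shared Ollama client.
@pytest.fixture
def calculator(client: ollama.Client) -> LLMCalculator:
    return LLMCalculator(client)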

Next, we add a parametrized test case to the tests/test_calculate.py file to ensure expected behavior for a few user prompts:

tests/test_calculate.py
# ...

@pytest.mark.parametrize(
    ("prompt", "expected"),
    [
        ("3 plus 7", 10),
        ("123 minus 45", 78),
        ("10 divided by 2", 5),
        ("123 times 45", 5535),
    ],
)
def test_calculate(calculator: LLMCalculator, prompt: str, expected: int | float) -> None:
    result = calculator.calculate(prompt)
    assert result == expected

In addition, we add another parametrized test case to the tests/test_calculate.py file to ensure expected behavior for user prompts that should raise an exception:

tests/test_calculate.py
# ...

from llm_calc import InvalidDomainError
from llm_calc import InvalidExpressionError
from llm_calc import LLMCalculatorError

# ...

@pytest.mark.parametrize(
    ("prompt", "exception_class"),
    [
        ("foo plus bar", InvalidExpressionError),
        ("Square root of -1", InvalidDomainError),
        ("One divided by zero", ZeroDivisionError),
    ],
)
def test_calculate_raises_error(
    calculator: LLMCalculator, prompt: str, exception_class: type[Exception]
) -> None:
    with pytest.raises(exception_class) as exc:
        calculator.calculate(prompt)
    if isinstance(exc.value, LLMCalculatorError):
        assert exc.value.prompt == prompt

Now, we run the test suite locally:

uv run pytest
# or
uv run pytest --ollama-host localhost

Finally, we add the following CI job specification to the .gitlab-ci.yml file for running the test suite via GitLab CI:

.gitlab-ci.yml
tests:
  services:
    - name: ollama/ollama:0.6.3
      alias: ollama
  image: ghcr.io/astral-sh/uv:0.6.11-python3.13-bookworm-slim
  before_script:
    - uv sync --frozen
  script:
    - uv run pytest --ollama-host ollama

For the complete implementation of this example, please refer to our supplemental GitLab project.

Conclusion

Large language models (LLMs) have become a transformative technology, enabling advanced AI applications with natural language interfaces. Both proprietary and open-access LLMs exist, each offering distinct advantages and disadvantages. We present a novel approach to LLM engineering that leverages GitLab CI services to serve open-access LLMs within a CI pipeline for integration testing, enabling seamless integration of LLMs into DevOps workflows without requiring additional infrastructure. Additionally, we describe how to enable GPU access for CI services, which is a new feature we have contributed to GitLab Runner and essential for the practical use of many LLMs. An example project involving zero-shot prompting showcases the practical application of this approach.

If you find this article useful, please consider sharing your feedback and experience via the comment box below. 🙏


  1. We use open-access LLMs as an umbrella term for open-source, open-weights, and restricted-weights LLMs.
