
I Built a Local RAG System for My Volleyball Club. Here's What Actually Happened.

Building a Q&A chatbot that answers questions about match results, standings, and schedules — powered by Ollama, ChromaDB, and FastAPI, running entirely on local hardware.


Why I Did This

I help run a volleyball club — RM Volley, based in Piacenza. We have multiple teams across different age categories: Serie D women’s, Under 18, Under 16, Under 14, Seconda Divisione. Match data comes from the Italian federation (FIPAV) as spreadsheets. Standings are scraped from their website as JSON. Schedules, results, set scores, venues — it’s all there, but scattered across files that nobody wants to dig through.

The question that kept coming up: “When’s the next match for the Under 18s?” or “How did RM Volley Piacenza do last weekend?” Simple questions, but answering them meant opening an Excel file, finding the right rows, and parsing Italian sports federation jargon.

So I built a chatbot. Ask a question in natural Italian, get an answer with the actual data. No cloud APIs, no subscriptions — just Ollama running locally and a Python backend.


The Architecture

The system is a FastAPI application backed by ChromaDB for vector search and Ollama for text generation.

The request flow:

  1. User asks a question via POST /ask
  2. The API detects intent: is this about standings? Past matches? Future schedule? A specific team?
  3. A retriever searches ChromaDB for relevant document chunks
  4. A prompt builder assembles Italian system instructions + retrieved context + the question
  5. Ollama generates a response
Client  -->  FastAPI  -->  Intent Detection  -->  Retriever (ChromaDB)
                                                        |
                                                        v
                                                  Prompt Builder
                                                        |
                                                        v
                                                   Ollama (LLM)
                                                        |
                                                        v
                                                     Response
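The five steps above can be wired together roughly as follows. This is a minimal sketch with stubbed components — the function names, the stub logic, and the placeholder LLM call are all illustrative, not the project's actual API:

```python
# Sketch of the request flow, with every component stubbed out.

def detect_intent(question: str) -> str:
    """Step 2: crude intent detection (stub; the real version is richer)."""
    if "classifica" in question.lower():
        return "standings"
    return "general"

def retrieve_chunks(question: str, intent: str) -> list[str]:
    """Step 3: vector search against ChromaDB (stubbed here)."""
    return [f"[{intent} context for: {question}]"]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 4: Italian system instructions + retrieved context + question."""
    context = "\n".join(chunks)
    return f"Contesto:\n{context}\n\nDomanda: {question}"

def answer(question: str) -> str:
    """Steps 1-5 end to end; step 5 (Ollama) is a placeholder string here."""
    intent = detect_intent(question)
    chunks = retrieve_chunks(question, intent)
    prompt = build_prompt(question, chunks)
    return f"LLM({prompt})"  # in the real system, this calls Ollama
```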

The LLM part was maybe two days of work. The retrieval logic, chunking strategy, and intent routing took weeks.


Ingestion: Turning Sports Data Into Searchable Chunks

Before the system can answer anything, match data and standings need to become something a vector database can search. My data sources are specific:

  • Gare.xls — An Excel export from FIPAV with every match: dates, teams, scores, set results, venues, league names, match status
  • classifica.json — Scraped league standings with points, wins, losses, set ratios

The chunking strategy is domain-specific. No sliding windows, no generic text splitting.

One Chunk Per Match

Each row in the Excel file becomes a single semantic chunk, written as natural Italian text:

def create_match_chunk(self, match):
    """Convert a match record into a single semantic text chunk."""
    # Sentence fragments, later joined into one chunk of natural Italian text.
    parts = [
        f"Partita del {match['date']}: {match['home_team']} vs "
        f"{match['away_team']} (Squadra {match['category']})"
    ]

    if match.get("result"):
        parts.append(f"Risultato finale: {match['result']}")
        if match.get("rm_won"):  # only claim a win when RM Volley actually won
            parts.append(
                f"{match['rm_team']} ha vinto {match['result']} contro {match['opponent']}"
            )

    if match.get("set_scores"):
        parts.append(f"Parziali: {match['set_scores']}")

    parts.append(f"Impianto: {match['venue']}")
    parts.append(f"Campionato: {match['league']}")

    text = ". ".join(parts) + "."
    metadata = {"type": "match", "date": match["date"], "league": match["league"]}
    return {"id": f"match_{match['id']}", "text": text, "metadata": metadata}

The key insight: each chunk carries rich metadata — team names, date, league, result, whether RM Volley was home or away, the team category (Under 18, Serie D, etc.). This metadata is critical for filtering later.
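To make that concrete, the metadata attached to one match chunk might look like this. The field names and values are invented for illustration — the actual schema may differ — but the `"type"` field matches the `{"type": "match"}` filter used later in retrieval:

```python
# Hypothetical metadata for a single match chunk (values invented for illustration).
match_metadata = {
    "type": "match",                  # "match" vs. "standing": used as a ChromaDB filter
    "date": "2025-01-18",             # ISO date, parsed later for past/future filtering
    "home_team": "RM VOLLEY PIACENZA",
    "away_team": "Esempio Volley",    # placeholder opponent name
    "category": "Serie D",            # team category: Serie D, Under 18, Under 16, ...
    "league": "Serie D Femminile",
    "has_result": True,
    "result": "3-1",
    "rm_home": True,                  # whether RM Volley played at home
}
```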

One Chunk Per League Standings

For standings, an entire league table becomes one chunk, prefixed with question-like phrasing to help semantic matching:

def create_league_standing_chunk(self, league_name, teams):
    """Create a full league standings chunk with a question-like preamble."""
    text = (
        f"Qual è la classifica della {league_name}? "
        f"Ecco la classifica aggiornata della {league_name}:\n"
    )
    for i, team in enumerate(teams, 1):
        text += f"{i}. {team.name} - {team.points} punti "
        text += f"({team.wins} vittorie, {team.losses} sconfitte)\n"

    return {
        "id": f"standing_{league_name}",
        "text": text,
        "metadata": {"type": "standing", "league": league_name},
    }

That question-preamble trick (“Qual è la classifica della…?”) was deliberate. When a user asks about standings, the embedding of their question naturally matches the embedding of this preamble. It’s a simple technique that noticeably improved retrieval for this type of query.

Embedding Model

I’m using intfloat/multilingual-e5-small — a 384-dimensional multilingual model. This was a deliberate choice over the more common all-MiniLM-L6-v2 because all my data and queries are in Italian. The multilingual model handles Italian vocabulary and sentence structure significantly better.

Documents are embedded in batches of 32 using sentence-transformers and stored in ChromaDB with their metadata.
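A sketch of that batch-embed-and-store step, written so the embedder is a parameter (any object with an `.encode(texts)` method, such as `SentenceTransformer("intfloat/multilingual-e5-small")`). The function name is illustrative; one assumption worth noting is the `"passage: "` prefix, which the e5 model family expects on documents (with `"query: "` on queries):

```python
def embed_and_store(chunks, embedder, collection, batch_size=32):
    """Embed chunk texts in batches of `batch_size` and store them in ChromaDB.

    `embedder` is any object with .encode(list_of_texts) -> list of vectors;
    `collection` is a chromadb collection. Chunk dicts follow the ingestion
    code above: {"id": ..., "text": ..., "metadata": ...}.
    """
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        # e5 models are trained with a "passage: " prefix on documents
        texts = ["passage: " + c["text"] for c in batch]
        embeddings = embedder.encode(texts)
        collection.add(
            ids=[c["id"] for c in batch],
            embeddings=[list(map(float, e)) for e in embeddings],
            documents=[c["text"] for c in batch],
            metadatas=[c["metadata"] for c in batch],
        )
```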


The Smart Part: Intent-Based Query Routing

This is probably the most interesting piece of the system, and the one that made the biggest difference in answer quality.

Not every question should hit the vector database the same way. “What’s the standings?” and “When’s the next match for the Under 18s?” need completely different retrieval strategies.

The /ask endpoint detects intent using Italian keyword matching:

standings_keywords = ["classifica", "posizione", "punti", "graduatoria"]
past_keywords = ["recente", "giocato", "risultat", "vinto", "perso", "com'è andata"]
future_keywords = ["prossima", "prossime", "calendario", "quando gioca"]
stats_keywords = ["statistiche", "bilancio", "andamento", "forma", "stagione"]

Then it routes accordingly:

  • Standings query — retrieve only standing-type documents
  • Past match query for a team — retrieve matches with results, sorted most recent first
  • Future match query — retrieve matches without results, sorted closest upcoming first
  • Statistics query — combine past results AND next match for the detected team
  • General query — standard vector similarity search

There’s also team detection via regex (RM\s*VOLLEY\s*#?(\d+), RM\s*VOLLEY\s*PIACENZA) to scope queries to a specific team.

A small but useful detail: the system detects singular vs. plural in Italian. “La prossima partita” (the next match) returns one result. “Le prossime partite” (the next matches) returns several. This kind of language-aware routing matters when the domain is this specific.
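Pulling those pieces together, the detection step can be sketched like this. The keyword lists and team regex come straight from the description above; the function shape, the return dict, and the exact fallback counts are illustrative:

```python
import re

# Sketch of intent + team detection. The real routing has more cases,
# but the mechanics are keyword matching plus a team-name regex.

STANDINGS = ["classifica", "posizione", "punti", "graduatoria"]
PAST = ["recente", "giocato", "risultat", "vinto", "perso", "com'è andata"]
FUTURE = ["prossima", "prossime", "calendario", "quando gioca"]

TEAM_RE = re.compile(r"RM\s*VOLLEY\s*#?\d+|RM\s*VOLLEY\s*PIACENZA", re.IGNORECASE)

def detect_intent(question: str) -> dict:
    q = question.lower()
    if any(k in q for k in STANDINGS):
        intent = "standings"
    elif any(k in q for k in FUTURE):
        intent = "future"
    elif any(k in q for k in PAST):
        intent = "past"
    else:
        intent = "general"

    match = TEAM_RE.search(question)

    # Singular vs. plural: "la prossima partita" -> one result,
    # "le prossime partite" -> several.
    n_results = 1 if "prossima" in q else (3 if "prossime" in q else 5)

    return {
        "intent": intent,
        "team": match.group(0) if match else None,
        "n_results": n_results,
    }
```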


Retrieval: Where Quality Actually Lives

ChromaDB handles the vector search with cosine similarity over the HNSW index. The core retrieval is straightforward:

def retrieve(self, query, n_results=5, filter_metadata=None):
    query_embedding = self.embedder.embed_query(query)
    query_params = {
        "query_embeddings": [query_embedding],
        "n_results": n_results,
    }
    if filter_metadata:
        query_params["where"] = filter_metadata  # e.g., {"type": "match"}

    results = self.collection.query(**query_params)
    return results

But the team-specific retrieval is where things get interesting. It fetches a large set of results, then post-filters in Python:

  1. Only keep match documents (not standings)
  2. Filter by team name (case-insensitive, handling spacing variations)
  3. Parse dates and filter by past/future
  4. Sort by date (most recent first for past, closest first for future)
  5. Truncate to the requested count
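The five post-filter steps can be sketched as one function over raw hits. This assumes each hit carries the metadata fields described earlier (`type`, an ISO `date`, team names); the function and parameter names are illustrative:

```python
from datetime import date, datetime

def filter_team_matches(hits, team, want_past, limit, today=None):
    """Post-filter ChromaDB hits: matches only, one team, past or future,
    sorted by date, truncated to `limit`."""
    today = today or date.today()
    matches = []
    for hit in hits:
        meta = hit["metadata"]
        if meta.get("type") != "match":              # 1. matches only, no standings
            continue
        names = f"{meta.get('home_team', '')} {meta.get('away_team', '')}"
        # 2. team filter, case-insensitive and ignoring spacing variations
        if team.replace(" ", "").lower() not in names.replace(" ", "").lower():
            continue
        d = datetime.strptime(meta["date"], "%Y-%m-%d").date()  # 3. parse date
        if (d < today) != want_past:                 # 3. keep only past OR future
            continue
        matches.append((d, hit))
    # 4. most recent first for past, closest upcoming first for future
    matches.sort(key=lambda pair: pair[0], reverse=want_past)
    return [hit for _, hit in matches[:limit]]       # 5. truncate
```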

This hybrid approach — vector search for initial recall, then deterministic filtering for precision — works well when your data has clear structured attributes. Pure semantic search would mix up teams with similar names or return future matches when you asked about past ones.

What I Learned About n_results

I started with n_results=10 thinking more context is better. For match queries it sometimes was, but for standings and specific questions, the LLM would get confused by low-relevance chunks contradicting higher-relevance ones. The default ended up at 5 for general queries, with dynamic adjustment based on query type (1 for “next match,” more for statistics).
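That dynamic adjustment amounts to a small lookup table. The 1-for-next-match and default-of-5 values come from the text above; the intent names and the other counts are placeholders:

```python
# Illustrative mapping from detected intent to how many chunks to retrieve.
N_RESULTS = {
    "next_match": 1,   # "la prossima partita": exactly one match
    "standings": 3,    # a league table is already a full chunk (count is a guess)
    "statistics": 8,   # past results + next match need more context (count is a guess)
}

def n_results_for(intent: str) -> int:
    return N_RESULTS.get(intent, 5)  # default for general queries
```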


Prompt Design: Constraining the Model in Italian

The entire prompt is in Italian, which was a deliberate choice. The system prompt is detailed and defensive:

Sei un assistente di statistiche di pallavolo per RM Volley.

DATA ODIERNA: {today's date}

CRITICO - DISTINZIONE TEMPORALE:
- Se lo stato della partita è "da giocare" → la partita è nel FUTURO
- Se la partita ha un risultato (es. "3-1") → la partita è già stata giocata
- NON inventare risultati per partite future

CRITICO - DISTINZIONE SQUADRE (NON CONFONDERLE MAI):
- "RM VOLLEY PIACENZA" = Serie D Femminile
- "RMVOLLEY#18" = Under 18 Femminile
- "RMVOLLEY#16" = Under 16 Femminile
(these are DIFFERENT teams!)

Two things I learned the hard way:

  1. Inject today’s date. Without it, the model has no concept of “past” vs. “future.” A match scheduled for next week looks the same as one from last month if you don’t tell the model what day it is.

  2. Explicit team disambiguation. Early on, the model kept mixing up RM VOLLEY PIACENZA (the Serie D adult team) with RMVOLLEY#18 (the youth team). They have similar names. The system prompt now lists every team with its category, and the user prompt reinforces “do NOT confuse these.”

The user prompt wraps the retrieved context and adds strict instructions for how to handle standings (copy the exact order, don’t reorder) and matches (first item = most relevant, don’t confuse past and future).

Temperature is set to 0.5 — a bit of flexibility for natural-sounding Italian, but low enough to keep answers grounded in the context.


Running Ollama Locally

The system supports two LLM backends via a simple abstraction:

  • Ollama (local) — the default, running mistral:7b on local hardware
  • Groq (cloud) — optional, using llama-3.3-70b-versatile via their OpenAI-compatible API

The Ollama integration is minimal:

def generate(self, prompt, system_prompt=None, temperature=0.5, max_tokens=400):
    payload = {
        "model": self.model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,
        },
    }
    if system_prompt:
        payload["system"] = system_prompt

    response = requests.post(
        f"{self.base_url}/api/generate", 
        json=payload, 
        timeout=60
    )
    return response.json()["response"]

No streaming — the full response comes back at once. For an internal tool where responses are short (match results, standings), the wait is acceptable. Streaming is on the todo list but hasn’t been necessary yet.
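For reference, streaming against Ollama would look roughly like this: with `"stream": true`, `/api/generate` returns one JSON object per line, each carrying a `"response"` fragment, until a final object with `"done": true`. A sketch, not the project's code:

```python
import json

import requests

def parse_stream_lines(lines):
    """Yield response fragments from Ollama's newline-delimited JSON stream."""
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        yield chunk.get("response", "")

def generate_stream(base_url, model, prompt, timeout=60):
    """Stream a completion from Ollama, yielding text fragments as they arrive."""
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(f"{base_url}/api/generate", json=payload,
                       stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        yield from parse_stream_lines(resp.iter_lines())
```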

Model Choice

I settled on mistral:7b for daily use. It handles Italian well and follows structured instructions reliably. llama3.2:3b is the fallback default in code — faster but noticeably worse at following the team disambiguation rules. The Groq option with llama-3.3-70b-versatile is there for when I want higher quality without buying GPU hardware.

The honest takeaway: when your retrieval feeds the model the right context, even a 7B model gives solid answers. The quality gap between models shrinks dramatically when the input is good.


The Frontend

A vanilla HTML/JS chat interface. Dark theme, no framework, no build step. It sends POST requests to /ask and displays the response. Nothing fancy, but it works.

The frontend sends temperature: 0.5 and n_results: 10 as defaults, with the ability to filter by document type (matches vs. standings).


The Takeaway

RAG is a backend pipeline with a language model as one component. The LLM is the easy part. The hard parts are:

  1. Turning messy data into good chunks with rich metadata
  2. Routing queries to the right retrieval strategy
  3. Writing prompts that constrain the model to the retrieved context

For a domain-specific tool like this, where the data is structured and the questions are predictable, intent-based routing and metadata filtering mattered far more than model size. A 7B model answering from the right context beats a 70B model guessing.

The system works. It answers questions about match results, upcoming schedules, and league standings correctly, in Italian, from local data, without calling any external API. When it doesn’t have the information, it says so.

That last part is the whole trick.