IRS Jovem 2025 RAG chatbot
MLOps LLM RAG Chatbot POC project
IRS-Jovem-Chat is a POC developed over a weekend to sharpen my skills in AI, RAG, and MLOps, culminating in a chatbot deployed to the web.
Context
The IRS Jovem is a Portuguese tax benefit designed to support young workers by offering a partial exemption on their personal income tax (IRS) for a set period.
I believe there is a lot of confusion around this topic: the law keeps changing, and many young people, who often lack a solid tax knowledge base to begin with, end up very confused. That is why I decided to create this project: an end-to-end (development to infrastructure deployment) Retrieval-Augmented Generation (RAG) chatbot about IRS Jovem.
Disclaimer:
This project is a Proof of Concept (POC), and it does not store any personal data.
Architecture
The architecture was designed using Docker Compose to ensure full modularity and easy scalability, allowing both the frontend and backend to run multiple replicas at any time and on distinct servers/regions.
Data
Gathering data on this topic was challenging: the goal was to fetch information that is both accurate and reflects the recent law changes, so it was crucial to collect only up-to-date (2025) material. There were two sources of information: websites and PDFs.
Fetching PDFs was a manual task because only a few quality PDFs existed online. To find them, I used Google search operators to filter results by file type, date, and search query.
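As an illustration, a filtered Google search could look like the following (a hypothetical query, not necessarily the exact one used; `filetype:` and `after:` are standard Google search operators):

```
"IRS Jovem" 2025 filetype:pdf after:2025-01-01
```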
Fetching websites was done using SerpApi, a tool that scrapes Google results for a given search query and geolocation. I fetched the top 100 Google search results and then used a Python scraper to extract the text from each page's HTML.
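The project's actual scraper isn't shown here, but the HTML-to-text step can be sketched with the Python standard library alone (the SerpApi call itself is omitted since it requires an API key; `TextExtractor` and `extract_text` are illustrative names, not the project's):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from an HTML page, skipping script/style blocks."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

In practice a library such as BeautifulSoup offers more robust parsing, but the idea is the same: strip markup, keep the text that will later be chunked and embedded.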
RAG
Retrieval-augmented generation (RAG) is an AI framework that enhances text generation by integrating information retrieval with generative models. Instead of relying solely on pre-trained knowledge, RAG dynamically retrieves relevant documents or data from external sources, such as databases or the web, to provide more accurate, up-to-date, and contextually relevant responses. This approach improves accuracy, reduces hallucinations, and is particularly useful for tasks like question-answering and summarisation. By combining retrieval with generation, RAG enables AI models to generate informed and reliable responses beyond their static training data.
In this use case, I used an embedding model from Hugging Face to read all data in the PDF and the Websites and create a vector database using ChromaDB.
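The real pipeline uses a Hugging Face embedding model and ChromaDB; as a minimal, dependency-free illustration of what the retrieval step does, here is cosine-similarity search over toy vectors (the `embed` function is a deliberately crude stand-in for the real model, and all names here are illustrative):

```python
import math


def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: a 26-dim bag-of-characters
    # vector. The project uses a Hugging Face model instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]
```

A vector database like ChromaDB does essentially this, but with learned embeddings and an index that scales beyond a linear scan.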
Rate-Limiting
To spare my credit card, I had to think about a technique to prevent users from spamming queries:
- Browser fingerprinting + IP
One of the main challenges of this project was finding a way to rate-limit unauthenticated users. While I believe this will never be perfect, one approach is to combine browser fingerprinting and IP addresses, similar to a SQL compound key.
For browser fingerprinting, I used FingerprintJS, which has the disadvantage of being client-side and, therefore, susceptible to spoofing by the user.
The combination of IP address and browser fingerprinting helps differentiate users on the same network (with the same IP) and allows for more accurate rate-limiting.
- Query caching
Query caching involves creating a database to store requests made to the Gemini API. When a subsequent request is made, the system checks if it has already been processed. If so, it returns the cached response, thereby preventing redundant API calls and saving costs.
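The caching step can be sketched as follows, assuming responses are keyed by a hash of the normalized query (all names here are hypothetical; `call_llm` stands in for the real Gemini API call, and the dict stands in for the actual database):

```python
import hashlib


class QueryCache:
    """In-memory cache of LLM responses, keyed by a normalized query hash."""

    def __init__(self, call_llm):
        self.call_llm = call_llm          # hypothetical hook for the LLM API
        self.store: dict[str, str] = {}   # stand-in for a persistent database

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different phrasings
        # of the same query hit the same cache entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key not in self.store:
            # Only a cache miss triggers a (billed) API call.
            self.store[key] = self.call_llm(query)
        return self.store[key]
```

Exact-match caching like this only deduplicates identical queries; a natural extension would be semantic caching, reusing the answer when a new query embeds close to a cached one.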
- Firewall
Creating a Firewall rule with Cloudflare to block requests outside of Portugal will block a lot of requests from crawlers and bots. I also used Cloudflare to create an “I am not a robot” challenge for the user to complete before entering the website.
Embedding Model
Most of the available embedding models are trained primarily on English text, but the objective here is to handle Portuguese as well as possible, so I ended up using a trending model from Hugging Face. This embedding model showed better results than either an English-only or a Portuguese-only model.
Tech Stack
- API: FastAPI
- RAG: Llama-index
- Frontend: NextJS
- Vector database: ChromaDB
- Docker
- Cloudflare

