IRS Jovem 2025 RAG chatbot
MLOps LLM RAG Chatbot POC project
IRS-Jovem-Chat is a POC developed over a weekend to sharpen my skills in AI, RAG and MLOps, and to learn what it takes to deploy a chatbot to the web.
Context
The IRS Jovem is a Portuguese tax benefit designed to support young workers by offering a partial exemption on their personal income tax (IRS) for a set period.
I believe there is a lot of confusion around this topic: the law keeps changing, and many young people, who generally do not have a strong tax background to begin with, end up very confused, so I decided to create this project. It is an end-to-end (development to infrastructure deployment) RAG (Retrieval-Augmented Generation) chatbot about IRS Jovem.
Disclaimer:
This project is a POC (Proof of Concept) and I do not store any personal data.
Architecture
The architecture was designed around Docker Compose to keep the system fully modular and easy to scale, allowing both the frontend and the backend to run multiple replicas at any time and to be deployed on distinct servers/regions.
Data
Gathering data on this topic was challenging because the goal was to fetch information that is both accurate and reflects the most recent law changes, so it was crucial to only collect up-to-date (2025) material. There were two sources of information: websites and PDFs.
Fetching PDFs was a manual task because there were only a few quality PDFs online; to find them I relied on Google search operators, filtering by file type, date and search query.
Fetching websites was done with SERP API, a tool that helps scrape Google results based on a search query and geolocation. I fetched the top 100 Google search results and then used a Python scraper to download the HTML of each page, as sketched below.
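A minimal sketch of what that step could look like, assuming the SerpAPI HTTP endpoint, a placeholder API key and a simple requests + BeautifulSoup scraper (the actual scraper in this repo may differ):

```python
import os
import requests
from bs4 import BeautifulSoup

SERPAPI_KEY = "YOUR_SERPAPI_KEY"  # placeholder

def top_results(query: str, n: int = 100) -> list[str]:
    """Ask SerpAPI for the top organic Google results, geo-located to Portugal."""
    resp = requests.get(
        "https://serpapi.com/search",
        params={"q": query, "gl": "pt", "hl": "pt", "num": n, "api_key": SERPAPI_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return [r["link"] for r in resp.json().get("organic_results", [])]

def scrape_text(url: str) -> str:
    """Download a page and keep only its visible text."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop non-content markup
    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    os.makedirs("data", exist_ok=True)
    for i, url in enumerate(top_results("IRS Jovem 2025")):
        with open(f"data/site_{i}.txt", "w", encoding="utf-8") as f:
            f.write(scrape_text(url))
```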
RAG
Retrieval-augmented generation (RAG) is an AI framework that enhances text generation by integrating information retrieval with generative models. Instead of relying solely on pre-trained knowledge, RAG dynamically retrieves relevant documents or data from external sources, such as databases or the web, to provide more accurate, up-to-date, and contextually relevant responses. This approach improves accuracy, reduces hallucinations, and is particularly useful for tasks like question-answering and summarization. By combining retrieval with generation, RAG enables AI models to generate informed and reliable responses beyond their static training data.
In this use case, I used an embedding model from Hugging Face to embed all the data from the PDFs and the websites and build a vector database with ChromaDB.
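A minimal ingestion sketch with LlamaIndex and ChromaDB (both in the tech stack below), assuming the scraped pages and PDFs live in a local `data/` folder and using a placeholder Hugging Face model name:

```python
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Load the scraped pages and PDFs collected in the data-gathering step.
documents = SimpleDirectoryReader("data").load_data()

# Multilingual embedding model (placeholder name; see the Embedding Model section).
embed_model = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large")

# Persist the vectors in a local ChromaDB collection.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("irs_jovem")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)

# At query time, index.as_query_engine() retrieves the most relevant chunks
# and passes them to the configured LLM (Gemini, in this project's case).
```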
Rate Limiting
To spare my credit card, I had to think about techniques to prevent users from spamming queries:
- Browser fingerprinting + IP address
One of the main challenges of this project was finding a way to rate-limit unauthenticated users. While I believe this will never be perfect, one approach is to combine browser fingerprinting and IP addresses, similar to a SQL compound key (see the sketch after this list).
For browser fingerprinting, I used FingerprintJS, which has the disadvantage of being client-side and, therefore, susceptible to spoofing by the user.
The combination of IP address and browser fingerprinting helps differentiate users on the same network (with the same IP) and allows for more accurate rate limiting.
- Query caching
Query caching involves creating a database to store requests made to the Gemini API. When a subsequent request is made, the system checks if it has already been processed. If so, it returns the cached response, thereby preventing redundant API calls and saving costs (also sketched after this list).
- Firewall
Creating a firewall rule with Cloudflare to block requests from outside Portugal filters out a lot of traffic from crawlers and bots. I also used Cloudflare to add an “I am not a robot” challenge that users must complete before entering the website.
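A minimal sketch of the compound-key rate limiting idea, assuming FastAPI (from the tech stack), an in-memory counter and a hypothetical `X-Fingerprint` header carrying the FingerprintJS visitor ID from the frontend; a production setup would likely keep the counters somewhere shared, such as Redis:

```python
import time
from collections import defaultdict

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

WINDOW_SECONDS = 3600         # 1-hour window (illustrative value)
MAX_REQUESTS_PER_WINDOW = 10  # illustrative limit

# compound key (ip, fingerprint) -> timestamps of recent requests
_hits: dict[tuple[str, str], list[float]] = defaultdict(list)

def check_rate_limit(request: Request) -> None:
    ip = request.client.host
    fingerprint = request.headers.get("X-Fingerprint", "unknown")  # FingerprintJS visitor ID
    key = (ip, fingerprint)  # acts like a SQL compound key

    now = time.time()
    # Keep only the timestamps still inside the window.
    _hits[key] = [t for t in _hits[key] if now - t < WINDOW_SECONDS]
    if len(_hits[key]) >= MAX_REQUESTS_PER_WINDOW:
        raise HTTPException(status_code=429, detail="Too many requests, try again later.")
    _hits[key].append(now)

@app.post("/ask")
async def ask(question: str, request: Request):
    check_rate_limit(request)
    # ...query the RAG pipeline / Gemini here...
    return {"answer": "placeholder"}
```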
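And a minimal sketch of the query cache, assuming an in-memory dictionary keyed on the normalized question and a hypothetical `call_gemini` function standing in for the real Gemini API call; the same idea works with a persistent database as described above:

```python
_query_cache: dict[str, str] = {}

def call_gemini(question: str) -> str:
    """Hypothetical stand-in for the real Gemini API call."""
    raise NotImplementedError

def answer_with_cache(question: str) -> str:
    key = question.strip().lower()  # normalize so trivially different queries hit the cache
    if key in _query_cache:
        return _query_cache[key]    # cached answer: no Gemini call, no cost
    answer = call_gemini(question)
    _query_cache[key] = answer
    return answer
```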
Embedding Model
Most of the available embedding models are trained mainly on English text, but the objective here is to handle Portuguese as well as possible, so I ended up using a trending multilingual model from Hugging Face. This embedding model showed better results than English-only or Portuguese-only alternatives.
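As a quick sanity check that a multilingual model handles Portuguese retrieval, something like the sketch below can be used; the model name is only a placeholder, not necessarily the one used in this project:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder multilingual model; E5-style models expect "query:"/"passage:" prefixes.
model = SentenceTransformer("intfloat/multilingual-e5-large")

question = "Quem pode beneficiar do IRS Jovem?"
passages = [
    "O IRS Jovem é um benefício fiscal que dá uma isenção parcial de IRS a jovens trabalhadores.",
    "Lisboa é a capital de Portugal.",
]

q_emb = model.encode("query: " + question, convert_to_tensor=True)
p_embs = model.encode(["passage: " + p for p in passages], convert_to_tensor=True)

# A higher cosine similarity for the tax-related passage suggests the model
# handles Portuguese well enough for retrieval.
print(util.cos_sim(q_emb, p_embs))
```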
Tech Stack
- API: FastAPI
- RAG: LlamaIndex
- Frontend: Next.js
- Vector database: ChromaDB
- Docker
- Cloudflare

