ViSecRAG is a Retrieval-Augmented Generation (RAG) system designed to process and retrieve security-related information in Vietnamese. The project integrates advanced techniques for data chunking, semantic search, and response generation through AI models.
The RAG system consists of two main components:
- Knowledge Base Construction (KBC): Build a knowledge base from input documents (txt, pdf, docx)
- Question Answering Engine (QA Engine): Answer questions based on retrieved information
Data processing pipeline:
- Chunking: Break long documents into short segments, each containing a main idea
- Tokenization: Perform Vietnamese word segmentation to improve embedding quality
- Retrieval: Search for relevant information
- Generation: Generate responses based on retrieved documents
- Data Processing: Intelligent chunking using semantic_chunkers
- Vector Storage: Weaviate integration for structured data storage
- Vietnamese Processing: Support for Vietnamese natural language processing with UnderTheSea
- Custom Models: Fine-tuning pipeline for embedding and reranking models
- API Client: Client interface for system interaction
```
ViSecRAG/
├── README.md
├── figure/                                   # Images and illustration documents
└── src/
    ├── finetuning_pipeline/
    │   ├── finetune_biencoder_model.ipynb    # Fine-tune embedding model
    │   └── finetune_crossencoder_model.ipynb # Fine-tune reranking model
    └── rag_pipeline/
        ├── main.py                           # Main system entry point
        ├── config.py                         # System configuration
        ├── client.py                         # Client for Weaviate communication
        ├── chunking.py                       # Document chunking processing
        ├── retrieval.py                      # Information retrieval module
        ├── generation.py                     # Response generation module
        ├── tokenization.py                   # Tokenization processing
        ├── requirement.txt                   # List of dependencies
        ├── test.ipynb                        # System testing and demonstration
        └── ViSecRAG.ipynb                    # Main system notebook
```
- Python 3.8+
- Weaviate instance (local or cloud)
```bash
cd src/rag_pipeline
pip install -r requirement.txt
```

Key dependencies:
- semantic_chunkers: Semantic-based document chunking
- weaviate-client: Client for connecting to Weaviate database
- underthesea: Vietnamese natural language processing
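For reference, the sketch below shows what the Vietnamese word-segmentation step typically looks like with underthesea; the example sentence is only illustrative, and the project's actual logic lives in tokenization.py, which may differ.

```python
# Minimal sketch of Vietnamese word segmentation with underthesea.
# The project's actual logic is in tokenization.py; this only illustrates the idea.
from underthesea import word_tokenize

text = "Hệ thống hỏi đáp dựa trên truy xuất thông tin"  # "A retrieval-based question answering system"
# format="text" joins multi-word tokens with underscores, which helps
# embedding models treat them as single units.
print(word_tokenize(text, format="text"))
```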
Edit config.py to configure:
```python
from config import Config

config = Config()
# Configure database, models, etc.
```

Connect to the Weaviate database:

```python
from client import Client
from config import Config

config = Config()
client = Client(config)
print(client.is_ready)  # Check connection
```

Create the schema:

```python
client.create_schema(config.cluster_name)
```

Chunk the corpus and upload it:

```python
from chunking import process_data

chunks = process_data("path/to/corpus")
client.upload_data(config, chunks, config.cluster_name)
```

Retrieve relevant documents for a question:

```python
from retrieval import retrieve

results = retrieve(query="Your question")
```

Generate a response from the retrieved documents:

```python
from generation import generate

response = generate(query="Your question", retrieved_docs=results)
print(response)
```

Bi-encoder fine-tuning uses Triplet Loss to improve embedding quality. Goals:
- Bring relevant question-document pairs closer in embedding space
- Push irrelevant pairs away
Run fine-tuning: src/finetuning_pipeline/finetune_biencoder_model.ipynb
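For orientation, here is a rough sketch of Triplet Loss fine-tuning with sentence-transformers; the base model name, the example triplet, and the hyperparameters are placeholders rather than the project's actual settings (see the notebook for those).

```python
# Hedged sketch of Triplet Loss fine-tuning with sentence-transformers.
# Model name, training triplet, and hyperparameters are illustrative placeholders;
# finetune_biencoder_model.ipynb contains the actual setup.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bkai-foundation-models/vietnamese-bi-encoder")  # assumed base model

# Each triplet: (anchor question, relevant document, irrelevant document)
train_examples = [
    InputExample(texts=["anchor question ...", "relevant passage ...", "irrelevant passage ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Small margins (e.g. 0.1) gave the best results reported below.
train_loss = losses.TripletLoss(model=model, triplet_margin=0.1)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("models/biencoder-finetuned")
```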
Cross-encoder (reranker) fine-tuning uses Multiple Negatives Ranking (MNR) Loss to improve document relevance ranking. Goals:
- Pull relevant documents closer to anchor
- Push irrelevant documents away
- Improve retrieval accuracy and ranking
Run fine-tuning: src/finetuning_pipeline/finetune_crossencoder_model.ipynb
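As a conceptual illustration of the MNR objective only (written in plain PyTorch over embedding similarities for simplicity, not the cross-encoder training code from the notebook):

```python
# Conceptual sketch of Multiple Negatives Ranking (MNR) loss in plain PyTorch.
# For each (query, relevant document) pair in a batch, every other document in
# the batch serves as an in-batch negative. This is only an illustration;
# the actual training code is in finetune_crossencoder_model.ipynb.
import torch
import torch.nn.functional as F

def mnr_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # query_emb, doc_emb: (batch, dim); row i of doc_emb is the positive for query i.
    scores = F.cosine_similarity(query_emb.unsqueeze(1), doc_emb.unsqueeze(0), dim=-1) * scale
    labels = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy pulls each positive pair's score up and pushes in-batch negatives down.
    return F.cross_entropy(scores, labels)

# Toy usage with random embeddings (batch of 4, dimension 8).
queries, docs = torch.randn(4, 8), torch.randn(4, 8)
print(mnr_loss(queries, docs))
```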
Run testing notebooks:
```bash
# Main notebook
jupyter notebook src/rag_pipeline/ViSecRAG.ipynb
```

| Module | Function |
|---|---|
| `config.py` | Manage system configuration |
| `client.py` | Communicate with Weaviate database |
| `chunking.py` | Semantic-based document chunking |
| `retrieval.py` | Search for relevant information |
| `generation.py` | Generate responses from retrieved documents |
| `tokenization.py` | Vietnamese tokenization processing |
- Ensure Weaviate instance is running before starting the system
- Prepare corpus data in appropriate format before uploading
- Fine-tuned models should be saved and reused
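For example, a saved fine-tuned embedding model can be reloaded like this (the path below is a placeholder for wherever the model was saved, not a path in this repo):

```python
# Hedged sketch: reloading a saved fine-tuned embedding model for reuse.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("models/biencoder-finetuned")  # placeholder path
embedding = model.encode("Your question")
print(embedding.shape)
```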
- Contrastive fine-tuning significantly improves multilingual embedding quality for Vietnamese (in the Vietnamese securities domain).
- vietnamese-bi-encoder achieves the best results:
  - 0.9955 (margin = 0.1)
  - 0.9920 (margin = 0.15)
- Other models:
  - mpnet-base: 0.9825
  - MiniLM: 0.98
- Compared to the pretrained baselines:
  - MiniLM: 0.0755 → 0.98 (a ~12× relative improvement)
  - mpnet: 0.3465 → 0.9825 (a ~1.8× relative improvement)
- Larger margins (0.3–0.4) decrease performance:
  - vietnamese-bi-encoder degrades significantly
  - MiniLM remains more stable (> 0.90)