ViSecRAG

Introduction

ViSecRAG is a Retrieval-Augmented Generation (RAG) system designed to process and retrieve security-related information in Vietnamese. The project integrates advanced techniques for data chunking, semantic search, and response generation through AI models.

System Architecture

The RAG system consists of two main components:

  • Knowledge Base Construction (KBC): Builds a knowledge base from input documents (TXT, PDF, DOCX)
  • Question Answering Engine (QA Engine): Answers questions based on retrieved information

Data processing pipeline:

  1. Chunking: Break long documents into short segments, each containing one main idea
  2. Tokenization: Perform Vietnamese word segmentation to improve embedding quality
  3. Retrieval: Search for information relevant to the query
  4. Generation: Generate responses based on the retrieved documents
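The four stages can be sketched end to end. Everything below is a simplified, hypothetical stand-in for the real modules in src/rag_pipeline (chunking.py, tokenization.py, retrieval.py, generation.py), meant only to show how the stages fit together:

```python
from __future__ import annotations  # allows list[str] hints on Python 3.8

# Minimal end-to-end sketch of the pipeline. Every function here is a
# simplified, hypothetical stand-in for the real project modules.

def chunk(document: str, max_len: int = 5) -> list[str]:
    """Chunking: break a long document into short segments (naive fixed-size split)."""
    words = document.split()
    return [" ".join(words[i:i + max_len]) for i in range(0, len(words), max_len)]

def tokenize(text: str) -> list[str]:
    """Tokenization: stand-in for Vietnamese word segmentation (e.g. underthesea)."""
    return text.lower().split()

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Retrieval: rank chunks by token overlap (stand-in for vector search)."""
    q = set(tokenize(query))
    ranked = sorted(chunks, key=lambda c: len(q & set(tokenize(c))), reverse=True)
    return ranked[:top_k]

def generate(query: str, docs: list[str]) -> str:
    """Generation: stand-in for the LLM response step."""
    return f"Answer to '{query}' based on: " + " | ".join(docs)

corpus = "An ninh mang la linh vuc bao ve he thong va du lieu"
docs = chunk(corpus)
answer = generate("an ninh mang", retrieve("an ninh mang", docs))
```

In the real system, token overlap is replaced by embedding similarity against the Weaviate index, and the final step calls a generative model instead of string concatenation.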

Framework Architecture

Key Features

  • Data Processing: Intelligent chunking using semantic_chunkers
  • Vector Storage: Weaviate integration for structured data storage
  • Vietnamese Processing: Support for Vietnamese natural language processing with UnderTheSea
  • Custom Models: Fine-tuning pipeline for embedding and reranking models
  • API Client: Client interface for system interaction
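The chunking feature relies on the semantic_chunkers library. As a rough illustration of the underlying idea (not the library's actual API), a semantic chunker can start a new segment wherever the embedding similarity between neighbouring sentences drops below a threshold:

```python
import numpy as np

def semantic_chunks(sentences, vectors, threshold=0.5):
    """Group consecutive sentences; start a new chunk when the cosine
    similarity between neighbouring sentence embeddings drops below
    `threshold`. Illustrates the idea, not semantic_chunkers' API."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = vectors[i - 1], vectors[i]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Toy embeddings: the first two sentences are similar, the third is not.
sents = ["Firewalls filter traffic.", "They block unwanted packets.", "Pho is a noodle soup."]
vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
result = semantic_chunks(sents, vecs)  # two chunks: security sentences, then food
```

This is why each resulting chunk tends to contain a single main idea, which improves both retrieval precision and the quality of the generated answer.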

Project Structure

ViSecRAG/
├── README.md                      
├── figure/                         # Images and illustration documents
└── src/
    ├── finetuning_pipeline/
    │   ├── finetune_biencoder_model.ipynb   # Fine-tune embedding model
    │   └── finetune_crossencoder_model.ipynb   # Fine-tune reranking model
    └── rag_pipeline/
        ├── main.py                 # Main system entry point
        ├── config.py               # System configuration
        ├── client.py               # Client for Weaviate communication
        ├── chunking.py             # Document chunking processing
        ├── retrieval.py            # Information retrieval module
        ├── generation.py           # Response generation module
        ├── tokenization.py         # Tokenization processing
        ├── requirement.txt         # List of dependencies
        ├── test.ipynb              # System testing and demonstration
        └── ViSecRAG.ipynb          # Main system notebook

🚀 Installation Guide

System Requirements

  • Python 3.8+
  • Weaviate instance (local or cloud)

Install Libraries

cd src/rag_pipeline
pip install -r requirement.txt

Main Libraries

  • semantic_chunkers: Semantic-based document chunking
  • weaviate-client: Client for connecting to Weaviate database
  • underthesea: Vietnamese natural language processing

Usage

1. Configure the System

Edit config.py to configure the system (database connection, models, etc.):

from config import Config

config = Config()
# Configure database, models, etc.

2. Initialize Client

from client import Client
from config import Config

config = Config()
client = Client(config)
print(client.is_ready)  # Check connection

3. Create Schema

client.create_schema(config.cluster_name)

4. Upload Data

from chunking import process_data

chunks = process_data("path/to/corpus")
client.upload_data(config, chunks, config.cluster_name)

5. Retrieve Information

from retrieval import retrieve

results = retrieve(query="Your question")

6. Generate Response

from generation import generate

response = generate(query="Your question", retrieved_docs=results)
print(response)

🔬 Model Fine-tuning

Embedding Model (Bi-Encoder)

Uses Triplet Loss to improve embedding quality. Goals:

  • Bring relevant question-document pairs closer in embedding space
  • Push irrelevant pairs away
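The objective above can be written down in a few lines. This is a generic NumPy formulation of triplet loss with Euclidean distance, not the exact code from the fine-tuning notebook:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.1):
    """L = max(0, d(a, p) - d(a, n) + margin): pull the relevant document
    toward the question, and push the irrelevant one at least `margin`
    further away than the relevant one."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])   # question (anchor) embedding
p = np.array([0.9, 0.1])   # relevant document embedding
n = np.array([0.0, 1.0])   # irrelevant document embedding
loss = triplet_loss(a, p, n)  # 0.0: the negative is already far enough away
```

The `margin` hyperparameter here corresponds to the margin values (0.1, 0.15, etc.) reported in the Results section.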

Triplet Loss Illustration

Run fine-tuning: src/finetuning_pipeline/finetune_biencoder_model.ipynb

Reranking Model (Cross-Encoder)

Uses Multiple Negatives Ranking (MNR) Loss to improve document relevance ranking.

  • Pull relevant documents closer to anchor
  • Push irrelevant documents away
  • Improve retrieval accuracy and ranking
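MNR loss treats each query's paired document as the positive and every other document in the batch as a negative, then applies cross-entropy over the scaled similarity matrix. A generic NumPy sketch of the idea (not the notebook's actual code):

```python
import numpy as np

def mnr_loss(queries, docs, scale=20.0):
    """Multiple Negatives Ranking loss: for each query i, docs[i] is the
    positive and every other doc in the batch is a negative. Cross-entropy
    over scaled cosine similarities; the correct pairs sit on the diagonal."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = scale * (q @ d.T)                 # (batch, batch) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

queries = np.array([[1.0, 0.0], [0.0, 1.0]])
docs = np.array([[0.9, 0.1], [0.1, 0.9]])
loss = mnr_loss(queries, docs)  # near zero: each query matches its own doc
```

Because the negatives come for free from the rest of the batch, MNR needs only (question, relevant document) pairs rather than explicit triplets.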

MNR Loss Illustration

Run fine-tuning: src/finetuning_pipeline/finetune_crossencoder_model.ipynb

Testing

Run testing notebooks:

# Main notebook
jupyter notebook src/rag_pipeline/ViSecRAG.ipynb

Main Modules

| Module          | Function                                     |
| --------------- | -------------------------------------------- |
| config.py       | Manages system configuration                 |
| client.py       | Communicates with the Weaviate database      |
| chunking.py     | Semantic-based document chunking             |
| retrieval.py    | Searches for relevant information            |
| generation.py   | Generates responses from retrieved documents |
| tokenization.py | Vietnamese tokenization processing           |

Important Notes

  • Ensure the Weaviate instance is running before starting the system
  • Prepare corpus data in an appropriate format before uploading
  • Save fine-tuned models so they can be reused

📊 Results

Framework Architecture

  • Contrastive fine-tuning significantly improves multilingual embedding quality for Vietnamese in the security domain.
  • vietnamese-bi-encoder achieves the best results:
    • 0.9955 (margin = 0.1)
    • 0.9920 (margin = 0.15)
  • Other models:
    • mpnet-base: 0.9825
    • MiniLM: 0.98

Compared to pretrained:

  • MiniLM: 0.0755 → 0.98 (~13×)
  • mpnet: 0.3465 → 0.9825 (~2.8×)

Larger margins (0.3–0.4) decrease performance:

  • vietnamese-bi-encoder drops significantly
  • MiniLM is more stable (>0.90)
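The improvement factors follow directly from the reported scores:

```python
# Improvement factor = fine-tuned score / pretrained score.
minilm = 0.98 / 0.0755     # ~13x
mpnet = 0.9825 / 0.3465    # ~2.8x
```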

📂 Dataset

ViSecQA: https://huggingface.co/datasets/Dainn98/ViSecQA#

About

ViSecRAG: A comprehensive Retrieval-Augmented Generation framework featuring semantic-aware data chunking, vector-based retrieval, and fine-tuned embedding models optimized for Vietnamese language security and information extraction tasks.
