Personal Knowledge RAG System
Upload research documents, strategy reports, or technical papers and instantly extract keywords, focus areas, and key sentences using client-side TF-IDF analysis and cosine-similarity vector search — no server, no AI API, no data leaves your browser. Built as a demonstration of Retrieval-Augmented Generation (RAG) architecture principles implemented in pure JavaScript: document embeddings via term-frequency vectors, cosine-similarity scoring, and extractive summarisation.
Fully static and self-hosted. All processing runs in your browser. Supports PDF and TXT files. No data is transmitted externally.
Knowledge Base Statistics
Documents Indexed
0
Uploaded & processed
Total Words
0
Across all documents
Unique Terms
0
In vocabulary index
Keywords Extracted
0
Top TF-IDF terms
01. Document Upload
📄
Drop files here or click to browse
Process multiple documents at once
PDF TXT MD
Processing…

Files are processed entirely in your browser using the PDF.js library for PDFs and native FileReader for text files. Nothing is uploaded to any server.
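The extraction step described above can be sketched as follows. This is an illustrative outline, not the app's actual code: the `extractorFor` helper and `extractText` function are hypothetical names, and the PDF branch assumes a `pdfjsLib` global from the standard PDF.js browser build.

```javascript
// Hypothetical helper: route a file to an extractor by extension.
function extractorFor(filename) {
  return filename.toLowerCase().endsWith('.pdf') ? 'pdfjs' : 'text';
}

// Browser-only sketch. Assumes the PDF.js library is loaded as `pdfjsLib`.
async function extractText(file) {
  if (extractorFor(file.name) === 'pdfjs') {
    const data = new Uint8Array(await file.arrayBuffer());
    const pdf = await pdfjsLib.getDocument(data).promise;
    let text = '';
    for (let i = 1; i <= pdf.numPages; i++) {
      const page = await pdf.getPage(i);
      const content = await page.getTextContent();
      text += content.items.map(item => item.str).join(' ') + '\n';
    }
    return text;
  }
  return file.text(); // native Blob/File API for .txt and .md
}
```

Everything happens on `File` objects handed over by the drop zone or file picker, so no bytes ever leave the page.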

02. Document Library
No documents indexed
Upload a PDF or text file to begin
06. Vector Search — Cosine Similarity Retrieval
Knowledge base empty
Upload documents to enable semantic search across your knowledge base

Search uses TF-IDF-weighted term vectors with cosine-similarity scoring — the same mathematical foundation as the retrieval layer of production RAG systems. Results are ranked by relevance score.

07. RAG Architecture — How It Works
Step 1 — Ingestion
Document Processing
Text is extracted from PDFs or plain-text files, tokenised, and stripped of stop-words. The cleaned token stream becomes the basis for all analysis.
PDF.js Tokenisation Stop-word removal
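Step 1 can be sketched in a few lines. This is a minimal illustration, not the app's actual code; the stop-word list here is a small sample (a real list would be much longer), and the regex word boundary is deliberately crude.

```javascript
// Illustrative stop-word sample; the real list would be far larger.
const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in',
  'is', 'are', 'was', 'for', 'on', 'with', 'that', 'this', 'it', 'as', 'be']);

// Lowercase, split on non-alphanumerics, drop short tokens and stop-words.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(t => t.length > 2 && !STOP_WORDS.has(t));
}

// tokenize('Documents are tokenised in the browser')
//   → ['documents', 'tokenised', 'browser']
```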
Step 2 — Embedding
TF-IDF Vectors
Each document is encoded as a sparse TF-IDF vector over the shared vocabulary. Term frequency × inverse document frequency weights rare, discriminative terms more heavily than common ones.
TF-IDF Sparse vectors IDF weighting
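Step 2 can be sketched as below. This is an assumption-laden illustration rather than the app's own code: sparse vectors are plain objects mapping term → weight, and it uses one common TF-IDF variant (tf = count / document length, idf = log(N / df)); other variants add smoothing.

```javascript
// docs: an array of token arrays (one per document).
// Returns one sparse vector (object: term -> TF-IDF weight) per document.
function tfidfVectors(docs) {
  const N = docs.length;
  const df = {}; // document frequency: how many docs contain each term
  for (const tokens of docs) {
    for (const t of new Set(tokens)) df[t] = (df[t] || 0) + 1;
  }
  return docs.map(tokens => {
    const counts = {};
    for (const t of tokens) counts[t] = (counts[t] || 0) + 1;
    const vec = {};
    for (const [t, c] of Object.entries(counts)) {
      // Terms present in every document get idf = log(1) = 0,
      // so ubiquitous words contribute nothing to similarity.
      vec[t] = (c / tokens.length) * Math.log(N / df[t]);
    }
    return vec;
  });
}
```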
Step 3 — Retrieval
Cosine Similarity
Query terms are vectorised against the same vocabulary. Cosine similarity between the query vector and each document passage vector determines relevance ranking.
Cosine similarity Passage ranking Top-k retrieval
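Step 3 can be sketched as follows, again as an illustration under assumptions (sparse vectors as plain term → weight objects; `topK` is a hypothetical name, not the app's API). Cosine similarity is the dot product of two vectors divided by the product of their magnitudes.

```javascript
// Cosine similarity between two sparse vectors (objects: term -> weight).
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of Object.entries(a)) {
    na += w * w;
    if (t in b) dot += w * b[t];
  }
  for (const w of Object.values(b)) nb += w * w;
  return (na && nb) ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// Rank all document vectors against a query vector; keep the top k.
function topK(queryVec, docVecs, k = 3) {
  return docVecs
    .map((vec, i) => ({ doc: i, score: cosine(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Because the query is vectorised against the same vocabulary as the documents, a score of 1 means identical term distributions and 0 means no shared terms.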

This implementation mirrors the retrieval layer of production RAG systems. In a full deployment, Step 2 would use a neural embedding model (e.g. sentence-transformers) and Step 3 would query a vector database (e.g. Pinecone, Weaviate, FAISS).