Personal Knowledge RAG System
Upload research documents, strategy reports, or technical papers and instantly extract
keywords, focus areas, and key sentences using
client-side TF-IDF analysis and cosine-similarity vector search — no server, no AI API, no data leaves your browser.
Built as a demonstration of Retrieval-Augmented Generation (RAG) architecture principles applied through
pure JavaScript: document embeddings via term-frequency vectors, similarity scoring, and extractive summarisation.
Fully static and self-hosted. All processing runs in your browser. Supports PDF, TXT, and MD files. No data is transmitted externally.
Knowledge Base Statistics
Documents Indexed: 0 (uploaded & processed)
Total Words: 0 (across all documents)
Unique Terms: 0 (in vocabulary index)
Keywords Extracted: 0 (top TF-IDF terms)
01 Document Upload
Drop files here or click to browse
Process multiple documents at once
Supported formats: PDF, TXT, MD
Files are processed entirely in your browser using the PDF.js library for PDFs and native FileReader for text files. Nothing is uploaded to any server.
02 Document Library
No documents indexed
Upload a PDF or text file to begin
06 Vector Search — Cosine Similarity Retrieval
Knowledge base empty
Upload documents to enable semantic search across your knowledge base
Search uses TF-IDF weighted vectors with cosine-similarity scoring — the same mathematical foundation as the retrieval layer of production RAG systems. Results are ranked by relevance score.
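The scoring at the heart of this search can be sketched in a few lines of plain JavaScript. This is an illustrative example, not the app's actual code; the sparse `{ term: weight }` representation and the function name are assumptions.

```javascript
// Cosine similarity between two sparse vectors stored as
// { term: weight } objects. Only terms present in both vectors
// contribute to the dot product.
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const term in a) {
    if (term in b) dot += a[term] * b[term];
  }
  const norm = (v) =>
    Math.sqrt(Object.values(v).reduce((sum, w) => sum + w * w, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Identical vectors score 1; vectors with no shared terms score 0.
cosineSimilarity({ rag: 1, vector: 2 }, { rag: 1, vector: 2 }); // → 1
cosineSimilarity({ rag: 1 }, { pdf: 1 });                       // → 0
```

Because cosine similarity divides by both vector lengths, a long document is not automatically favoured over a short one — only the direction of the vector (its mix of terms) matters.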
07 RAG Architecture — How It Works
Step 1 — Ingestion
Document Processing
Text is extracted from PDFs or plain-text files, tokenised, and stripped of stop-words. The cleaned token stream becomes the basis for all analysis.
PDF.js
Tokenisation
Stop-word removal
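The tokenisation step above can be sketched as follows. The stop-word list here is a tiny illustrative sample, and the regex-based splitting is one simple choice among many; the real implementation may differ.

```javascript
// Minimal ingestion sketch: lowercase, split on non-alphanumeric
// runs, drop single characters and stop-words.
const STOP_WORDS = new Set(["the", "a", "an", "is", "are", "of", "and", "to"]);

function tokenise(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 1 && !STOP_WORDS.has(t));
}

tokenise("The retrieval layer of a RAG system");
// → ["retrieval", "layer", "rag", "system"]
```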
Step 2 — Embedding
TF-IDF Vectors
Each document is encoded as a sparse TF-IDF vector over the shared vocabulary. Term frequency × inverse document frequency weights rare, meaningful terms more heavily than common ones.
TF-IDF
Sparse vectors
IDF weighting
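A minimal sketch of the TF-IDF encoding, assuming documents arrive as token arrays. The function name and the smoothed-IDF formula are illustrative choices; IDF variants differ across implementations.

```javascript
// Build one sparse { term: weight } vector per document.
// weight = (term count / doc length) * log(1 + N / docFreq)
function tfidfVectors(docs) {
  const df = new Map(); // number of documents containing each term
  for (const tokens of docs) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }
  return docs.map((tokens) => {
    const tf = {};
    for (const term of tokens) tf[term] = (tf[term] || 0) + 1;
    const vec = {};
    for (const term in tf) {
      // Smoothed IDF: terms in every document still get a small weight.
      const idf = Math.log(1 + docs.length / df.get(term));
      vec[term] = (tf[term] / tokens.length) * idf;
    }
    return vec;
  });
}

const vecs = tfidfVectors([
  ["vector", "search", "rag"],
  ["vector", "database"],
]);
// "rag" appears in one document, "vector" in both, so in the first
// vector the rarer term receives the larger weight.
```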
Step 3 — Retrieval
Cosine Similarity
Query terms are vectorised against the same vocabulary. Cosine similarity between the query vector and each document passage vector determines relevance ranking.
Cosine similarity
Passage ranking
Top-k retrieval
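Putting ranking together, top-k retrieval is a score-sort-slice over the document vectors. This sketch assumes sparse `{ term: weight }` vectors and repeats a small cosine helper so it stands alone; all names are illustrative.

```javascript
// Score every document vector against the query vector, sort
// descending by similarity, and keep the best k.
function cosine(a, b) {
  let dot = 0;
  for (const t in a) if (t in b) dot += a[t] * b[t];
  const norm = (v) =>
    Math.sqrt(Object.values(v).reduce((s, w) => s + w * w, 0));
  const d = norm(a) * norm(b);
  return d === 0 ? 0 : dot / d;
}

function topK(queryVec, docVecs, k) {
  return docVecs
    .map((vec, id) => ({ id, score: cosine(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const results = topK(
  { vector: 1, search: 1 },
  [{ vector: 1, search: 1 }, { pdf: 1 }, { search: 1 }],
  2
);
// results[0].id === 0: the exact-match document scores highest.
```

Scoring every document is O(N) per query, which is fine at this scale; vector databases exist precisely to avoid that linear scan once N grows large.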
This implementation mirrors the retrieval layer of production RAG systems. In a full deployment, Step 2 would use a neural embedding model (e.g. sentence-transformers) and Step 3 would query a vector database (e.g. Pinecone, Weaviate, FAISS).