Portfolio Project

Smart Sentence Retriever

NLP Embeddings & Serverless Retrieval

Machine Learning · Automation · Python · AWS · Docker · NLP

Context

I wanted a quick way to retrieve relevant sentences by meaning, not exact wording.

Approach

  • Prepared the corpus (Alice in Wonderland), split it into sentences, and precomputed their embeddings.
  • Benchmarked 6+ embedding models on 800 sentences, clustering each model's embeddings into k = 2–6 clusters and tracking both silhouette score and efficiency (silhouette score per million parameters).
  • Chose the model with the best silhouette score and deployed it as a stateless AWS Lambda endpoint, with CORS enabled, that returns the top‑k sentences ranked against a query.
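
The two benchmark metrics can be sketched in plain Python. This is a minimal illustration, not the project's benchmark code: it assumes Euclidean distance, takes cluster labels as given (the clustering algorithm is not named here, so that step is left out), and the function names and toy points are hypothetical.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def silhouette_per_million_params(score, n_params):
    """Efficiency metric: silhouette score divided by parameter count in millions."""
    return score / (n_params / 1e6)


# Toy 2-d "embeddings" forming two well-separated groups (hypothetical values).
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
toy_score = silhouette_score(toy_points, toy_labels)

# Efficiency computed from the figures reported under Impact (0.313, ~568M params).
arctic_efficiency = silhouette_per_million_params(0.313, 568_000_000)
```

In a real run, the points would be the sentence embeddings from one model and the labels would come from clustering them at a given k; the same two functions would then be evaluated for each model and each k from 2 to 6.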

Impact

  • Best absolute silhouette: 0.313 – Snowflake/snowflake‑arctic‑embed‑l‑v2.0 (k=2, 1024‑d, ~568M params).
  • Best efficiency: 0.0116 per M params – jinaai/jina‑embeddings‑v3 (k=6, 1024‑d, ~12.9M params).
  • Model deployed on AWS: Snowflake/snowflake‑arctic‑embed‑l‑v2.0 (prioritizing quality); the live demo runs on a lightweight, scalable Lambda API.
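
A stateless top‑k endpoint of this kind might look roughly like the sketch below. It is an illustration under several assumptions, not the deployed code: the event shape follows the API Gateway proxy format ("httpMethod", "body"), ranking uses cosine similarity, and `embed` is a toy keyword counter standing in for the real embedding model. The sentences, `VOCAB`, and all function names are hypothetical.

```python
import json
import math

# Toy corpus and vocabulary (hypothetical); the real service would load
# precomputed model embeddings instead.
SENTENCES = [
    "The rabbit checked its watch.",
    "The tea party never ended.",
    "The queen shouted at the gardeners.",
]
VOCAB = ("rabbit", "tea", "queen")

CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "POST,OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
}


def embed(text):
    """Toy stand-in for the embedding model: counts vocabulary words."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return [float(words.count(term)) for term in VOCAB]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Built once at import time and never mutated, so each invocation depends
# only on its own request.
INDEX = [(text, embed(text)) for text in SENTENCES]


def top_k(query, k):
    """Return the k sentences most similar to the query, best first."""
    query_vec = embed(query)
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in INDEX), reverse=True)
    return [{"sentence": text, "score": round(score, 4)} for score, text in scored[:k]]


def handler(event, context):
    """Lambda entry point: answers CORS preflight and POST queries."""
    if event.get("httpMethod") == "OPTIONS":
        return {"statusCode": 204, "headers": CORS_HEADERS, "body": ""}
    body = json.loads(event.get("body") or "{}")
    query = body.get("query")
    if not query:
        return {
            "statusCode": 400,
            "headers": CORS_HEADERS,
            "body": json.dumps({"error": "missing 'query'"}),
        }
    k = max(1, min(int(body.get("k", 3)), len(INDEX)))
    return {
        "statusCode": 200,
        "headers": CORS_HEADERS,
        "body": json.dumps({"results": top_k(query, k)}),
    }
```

Calling `handler({"httpMethod": "POST", "body": json.dumps({"query": "the queen", "k": 1})}, None)` exercises the ranking path locally without any AWS tooling. Keeping the index read-only at module scope is one simple way to keep the function stateless between invocations.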

Links