Under the hood

How It Works

When you add a YouTube URL, refind runs a multi-step pipeline to prepare the content for AI-powered retrieval. Here's what happens at each stage.

Ingestion pipeline

Caption extraction

refind fetches the official YouTube captions for each video — manual captions first, auto-generated as a fallback. Captions include precise timestamps for every sentence, which is what makes citation links possible.
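The manual-first fallback can be sketched in a few lines. This is an illustrative sketch only: the `CaptionTrack` shape, the `pick_caption_track` name, and the language filter are assumptions made for the example, not refind's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CaptionTrack:
    # Hypothetical shape, used only for this example.
    language: str
    auto_generated: bool
    # Each segment is (start_seconds, end_seconds, text).
    segments: list = field(default_factory=list)


def pick_caption_track(tracks: list[CaptionTrack], language: str = "en") -> Optional[CaptionTrack]:
    """Prefer a manual track in the requested language, else an auto-generated one."""
    candidates = [t for t in tracks if t.language == language]
    manual = [t for t in candidates if not t.auto_generated]
    if manual:
        return manual[0]
    # Fall back to whatever is left (auto-generated), or None if nothing matches.
    return candidates[0] if candidates else None
```

Returning `None` when no track exists leaves it to the caller to decide whether to skip the video or report an error.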

PII redaction

Before anything is stored, an AI model scans the transcript and replaces names, email addresses, and phone numbers with a redacted placeholder. No personal information from the captions is ever written to disk.
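To make the "replace with a placeholder" step concrete, here is a deliberately simplified stand-in. It uses regular expressions rather than an AI model, so it only covers email addresses and phone-like numbers; detecting names needs a model and is out of scope for this sketch. The `[REDACTED]` placeholder text and both patterns are assumptions for illustration.

```python
import re

# Assumed placeholder text; the real placeholder may differ.
PLACEHOLDER = "[REDACTED]"

# Simplified patterns for illustration only. They will miss some formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with a placeholder."""
    text = EMAIL.sub(PLACEHOLDER, text)
    text = PHONE.sub(PLACEHOLDER, text)
    return text
```

The key property is ordering: redaction runs on the transcript in memory, and only the redacted text is handed to the storage step.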

Chunking

The transcript is split into overlapping chunks of roughly 1,500 tokens, respecting sentence boundaries so no sentence is cut in half. Each chunk remembers its timestamp range so citations can link back to the exact moment.
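A minimal sketch of sentence-aware chunking with overlap follows. It assumes the transcript has already been split into timestamped sentences, approximates tokens by counting whitespace-separated words, and implements overlap by repeating the last few sentences of the previous chunk. All names and defaults other than the 1,500-token target are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Chunk:
    text: str
    start: float  # start of the first sentence in the chunk
    end: float    # end of the last sentence in the chunk


def count_tokens(text: str) -> int:
    # Assumption: whitespace words stand in for real tokenizer tokens.
    return len(text.split())


def make_chunk(sentences: list[Sentence]) -> Chunk:
    return Chunk(
        text=" ".join(s.text for s in sentences),
        start=sentences[0].start,
        end=sentences[-1].end,
    )


def chunk_sentences(sentences, max_tokens=1500, overlap_sentences=2):
    """Group whole sentences into overlapping chunks of roughly max_tokens."""
    chunks = []
    current = []
    fresh = 0  # sentences in `current` that no emitted chunk contains yet
    for s in sentences:
        size = sum(count_tokens(x.text) for x in current)
        # Close the chunk only if it holds new material; this guarantees progress
        # even when the carried-over overlap is already large.
        if fresh and size + count_tokens(s.text) > max_tokens:
            chunks.append(make_chunk(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
            fresh = 0
        current.append(s)
        fresh += 1
    if fresh:
        chunks.append(make_chunk(current))
    return chunks
```

Because sentences are never split, a chunk can run slightly over the target when a single sentence is long, which is consistent with "roughly" 1,500 tokens.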

Embedding

Each chunk is converted into a numerical vector using a local embedding model. These vectors capture the semantic meaning of the text — similar ideas end up close together in vector space, regardless of exact wording.

Tagging

An AI classifier reads each video's title and description and assigns up to 10 topic tags (e.g. machine-learning, finance, productivity). Tags organise the Explore page and let you query across a topic without knowing specific source names.

Query pipeline

Semantic retrieval

Your question is embedded using the same model as the chunks. refind finds the chunks whose vectors are closest to your question vector — these are the passages most likely to contain your answer, even if they use different words.
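As a rough illustration of "closest vectors", the sketch below ranks chunk vectors by cosine similarity to the question vector. Cosine similarity is an assumption here (the distance measure refind uses is not stated), and a production system would typically use a vector index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = top_k([1.0, 0.0], chunk_vecs, k=2)
```

Because similarity depends on direction in vector space rather than shared words, a chunk can rank highly even when it phrases the idea differently from the question.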

Re-ranking

The top candidates are re-scored by a more precise relevance model. This two-stage approach — fast vector search followed by slower but more accurate re-ranking — keeps latency low while improving answer quality.
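The two-stage shape can be sketched as follows. The re-ranker is passed in as a function because the actual relevance model is not specified; the `word_overlap` scorer below is a toy stand-in, and the shortlist and final sizes are arbitrary assumptions.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    question: str,
    candidates: Sequence[tuple[str, float]],  # (chunk_text, vector_similarity)
    rerank_score: Callable[[str, str], float],
    shortlist_size: int = 20,
    final_size: int = 5,
) -> list[str]:
    # Stage 1: cheap cut using the similarity scores already computed.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:shortlist_size]
    # Stage 2: run the slower, more accurate scorer on the shortlist only.
    reranked = sorted(shortlist, key=lambda c: rerank_score(question, c[0]), reverse=True)
    return [text for text, _ in reranked[:final_size]]


def word_overlap(question: str, text: str) -> float:
    # Toy stand-in for a real relevance model: count shared words.
    return float(len(set(question.lower().split()) & set(text.lower().split())))
```

The expensive scorer runs at most `shortlist_size` times per question, no matter how many chunks are indexed, which is where the latency saving comes from.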

Answer synthesis

The top-ranked chunks are passed to OpenAI's language model, which synthesises them into a single coherent answer. Your question and the relevant excerpts are sent to OpenAI transiently; refind never stores your question text. Each claim in the answer is linked to its source chunk and, through it, to the exact timestamp in the original video.
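A citation link of this kind can be built from a chunk's start time using YouTube's `t` query parameter. The sketch below is one plausible way to do it; the `citation_url` helper name is hypothetical, and refind's actual link format may differ.

```python
from urllib.parse import urlencode


def citation_url(video_id: str, start_seconds: float) -> str:
    """Build a YouTube link that starts playback at the chunk's first second."""
    # Round down so playback begins just before the cited passage, not after it.
    query = urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"
```

For example, a chunk starting 83.7 seconds into a video with the placeholder ID `abc123` would link to `https://www.youtube.com/watch?v=abc123&t=83s`.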

A note on your questions

Your questions are never stored. refind logs only anonymous metadata (response time, source queried) — never the question text itself. See the Privacy Policy for the full picture.
