How It Works
The Knowledge Base is Consuelo’s semantic search layer. It goes beyond simple keyword matching — it understands the meaning of your content and finds relevant results even when the exact words don’t match.
The Indexing Pipeline
When a file is uploaded to Consuelo, the following happens automatically:
- Text Extraction: Consuelo reads the file content. Supported formats: PDF, Word (.doc/.docx), plain text, Markdown, CSV, and HTML.
- Chunking: The extracted text is split into chunks of approximately 500 tokens each, with a 50-token overlap between chunks. This overlap ensures that context isn’t lost at chunk boundaries.
- Embedding: Each chunk is converted into a 1536-dimensional vector using OpenAI’s text-embedding-3-small model. This vector captures the semantic meaning of the text.
- Storage: The chunks and their embeddings are stored in PostgreSQL using the pgvector extension, with an HNSW index for fast approximate nearest-neighbor search.
- Collection Assignment: The chunks are assigned to the workspace’s default collection (or a specific collection if you specify one via the API).
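The chunking step can be illustrated with a minimal sketch. This is not Consuelo's actual implementation: the function name `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real model tokens. Only the 500-token size and 50-token overlap come from the description above.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int = 500, overlap: int = 50
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens.

    Hypothetical helper for illustration; a real pipeline would use the
    embedding model's tokenizer rather than pre-split words.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each window starts 450 tokens after the last
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Stand-in "tokens": 1200 numbered words.
tokens = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the defaults, the last 50 tokens of each chunk are repeated as the first 50 tokens of the next one, so a sentence that straddles a boundary appears whole in at least one chunk.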
Searching
When you (or an AI agent) search the knowledge base:
- Your query is converted into a 1536-dimensional vector by the same embedding model used for indexing
- pgvector finds the chunks whose vectors are most similar (cosine similarity)
- Results are returned ranked by similarity score (0 to 1, where 1 is a perfect match)
- Only results above the minimum similarity threshold (default: 0.7) are returned
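The ranking and threshold steps can be sketched in plain Python. This is an illustration under assumptions, not the production query: the function names and the `limit` parameter are hypothetical, the scan is brute force (exact) whereas the HNSW index is approximate, and treating the 0.7 threshold as inclusive is a guess.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    min_similarity: float = 0.7,  # default threshold from the text above
    limit: int = 5,               # hypothetical parameter
) -> List[Tuple[str, float]]:
    """Score every chunk, drop low scores, return best matches first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    # Assumption: a score exactly equal to the threshold is kept.
    hits = [(text, score) for text, score in scored if score >= min_similarity]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:limit]


# Toy 3-dimensional vectors stand in for 1536-dimensional embeddings.
index = [
    ("pricing guide", [1.0, 0.0, 0.0]),
    ("battle card", [0.9, 0.1, 0.0]),
    ("team handbook", [0.0, 1.0, 0.0]),
]
for text, score in search([1.0, 0.0, 0.0], index):
    print(f"{score:.3f}  {text}")
```

In this toy index the "team handbook" vector is orthogonal to the query, so its score falls below the threshold and it is not returned.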
Collections
Collections are workspace-scoped groupings of knowledge chunks. They let you organize indexed content by topic, team, or purpose.
Default Collection
Every workspace has a default collection that’s automatically created when the first file is indexed. All auto-indexed files go into this collection.
Custom Collections
You can create custom collections via the GraphQL API to organize knowledge:
- Sales Playbook — battle cards, competitive intel, pricing guides
- Product Knowledge — feature docs, release notes, technical specs
- Onboarding — training materials, process docs, team handbook
- Industry Research — market reports, analyst briefings, case studies
Collection Operations
| Operation | GraphQL | Description |
|---|---|---|
| List collections | knowledgeCollections query | See all collections and their chunk counts |
| Create collection | createKnowledgeCollection mutation | Create a new named collection |
| Delete collection | deleteKnowledgeCollection mutation | Remove a collection and all its chunks |
| Index file | indexFileInKnowledgeBase mutation | Index a file into a specific collection |
| Search | knowledgeSearch query | Search across all or specific collections |
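As a rough sketch of how a client might prepare a `knowledgeSearch` request, the snippet below builds a standard GraphQL-over-HTTP JSON body. Only the operation name `knowledgeSearch` comes from the table above. Every argument and field name (`query`, `collectionIds`, `minSimilarity`, `content`, `similarity`) is a hypothetical placeholder, so check the actual schema before using anything like this.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical query document: argument and field names are assumptions,
# not taken from the Consuelo schema.
KNOWLEDGE_SEARCH_QUERY = """
query KnowledgeSearch($query: String!, $collectionIds: [ID!], $minSimilarity: Float) {
  knowledgeSearch(query: $query, collectionIds: $collectionIds, minSimilarity: $minSimilarity) {
    content
    similarity
  }
}
"""


def build_search_request(
    text: str,
    collection_ids: Optional[List[str]] = None,
    min_similarity: float = 0.7,
) -> Dict[str, Any]:
    """Return a GraphQL request body (query document plus variables)."""
    return {
        "query": KNOWLEDGE_SEARCH_QUERY,
        "variables": {
            "query": text,
            # None is sent as JSON null; here it is meant as "all collections",
            # which is also an assumption about the API.
            "collectionIds": collection_ids,
            "minSimilarity": min_similarity,
        },
    }


payload = build_search_request("objection handling for existing vendors")
print(json.dumps(payload["variables"], indent=2))
```

The body would be sent as an HTTP POST to the workspace's GraphQL endpoint with whatever authentication the API requires; neither is specified here.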
For AI Agents
The knowledge base is designed to be the primary way AI agents access your team’s sales content. An agent connected via the GraphQL API can:
- Search for context before a call — “What do we know about Acme Corp’s pricing concerns?”
- Find relevant scripts — “What’s our objection handling for ‘we already have a solution’?”
- Access methodology — “What are the MEDDIC qualification criteria for enterprise deals?”
- Retrieve competitive intel — “How do we compare to Competitor X on security features?”
The agent doesn’t need to know which file contains the answer. It searches by meaning and gets the most relevant chunks back.
Supported File Types
| File Type | Extension | Text Extraction |
|---|---|---|
| PDF | .pdf | Full text + page-level extraction |
| Microsoft Word | .doc, .docx | Full text extraction |
| Plain Text | .txt | Direct read |
| Markdown | .md | Direct read |
| CSV | .csv | Direct read |
| HTML | .html | Direct read |
| Images | .png, .jpg, etc. | Not indexed (stored only) |
| Video | .mp4, etc. | Not indexed (stored only) |
| Archives | .zip | Not indexed (stored only) |
OCR is not currently supported, so image-based PDFs cannot be indexed. If your PDF contains scanned images instead of selectable text, the content won’t be extracted. We recommend using text-based PDFs for best results.
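A client-side check before upload might look like the sketch below. The helper name `is_indexable` is hypothetical. The extension set mirrors the table above and assumes matching is case-insensitive, which the source does not state.

```python
from pathlib import Path

# Extensions listed as indexable in the table above.
INDEXABLE_EXTENSIONS = {".pdf", ".doc", ".docx", ".txt", ".md", ".csv", ".html"}


def is_indexable(filename: str) -> bool:
    """Return True if the file's extension is one that gets text-extracted.

    Assumption: extensions are compared case-insensitively.
    """
    return Path(filename).suffix.lower() in INDEXABLE_EXTENSIONS


for name in ["Playbook.PDF", "notes.md", "logo.png", "demo.mp4", "bundle.zip"]:
    status = "indexed" if is_indexable(name) else "stored only"
    print(f"{name}: {status}")
```

An extension check cannot detect scanned PDFs: a `.pdf` with no selectable text passes this test but still yields no extracted content.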