
AI Tool Review - ChromaDB as a local vector database for RAG

Posted on December 31, 2023 • Tags: llms ai ml chromadb rag vector db python ai tool review

ChromaDB is a Python library that provides a simple, local, in-memory vector database.

It can be used for things like storing embeddings / doing retrieval-augmented generation (RAG).

If you just want to (a) store a bunch of vectors locally, (b) query them fast, and (c) do so with as little setup as possible, then ChromaDB is a great choice for you.

However, I found the documentation pretty lacking. And it does not seem to be as “production-ready” as other solutions.

But for research projects, it’s pretty good. Here are some assorted tips / code snippets:

Setup

To install:

pip install chromadb

Quickstart

For the examples below, let's say we've already generated our embeddings using something like the following code:

import torch
from typing import Any, Dict, List
from transformers import AutoTokenizer, AutoModel

metadatas: List[Dict[str, Any]] = [ # we will filter our ChromaDB collection by these keys
  { 'id' : 1, 'name' : 'Marty', },
  { 'id' : 3, 'name' : 'Jessica', },
]
notes: List[str] = ['I have diabetes.', 'I have a heart condition.']

# Embed notes by mean-pooling the last hidden state over the token dimension (dim=1)
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
tokens = tokenizer(notes, padding=True, truncation=True, return_tensors='pt')
embeddings: torch.Tensor = model(**tokens)[0].mean(dim=1) # shape: (n_notes, hidden_dim)

query: str = 'diabetic'
query_tokens = tokenizer(query, return_tensors='pt')
query_embedding: torch.Tensor = model(**query_tokens)[0].mean(dim=1) # shape: (1, hidden_dim)
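One caveat with the snippet above (my own note, not something from the ChromaDB docs): a plain .mean(dim=1) averages padding tokens into the embedding whenever the notes have different lengths. Here's a sketch of mask-aware mean pooling, which is how sentence-transformers models are usually pooled:

import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over real tokens only
    mask = attention_mask.unsqueeze(-1).float() # (batch, seq, 1)
    summed = (last_hidden * mask).sum(dim=1)    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)    # (batch, 1)
    return summed / counts

tokens = tokenizer(notes, padding=True, truncation=True, return_tensors='pt')
embeddings = mean_pool(model(**tokens)[0], tokens['attention_mask'])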

Create persistent local DB

import chromadb
client = chromadb.PersistentClient(path="./data/chroma")
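If you don't need persistence at all, ChromaDB also ships a purely in-memory client; everything is lost when the process exits:

import chromadb
client = chromadb.Client() # in-memory only; nothing is written to disk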

Load collection (or create if not exists)

When choosing the distance metric for your collection, you should use the same metric used to train the model (e.g. for MiniLM, this was cosine).

collection = client.get_or_create_collection(
    name="my_collection", # any string name for the collection
    metadata={"hnsw:space": "cosine"}, # NOTE: This distance metric is 'cosine distance'
)
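For reference, "cosine" isn't the only option: ChromaDB's HNSW index also supports squared L2 ("l2", the default) and inner product ("ip"). The collection names below are just hypothetical examples:

collection_l2 = client.get_or_create_collection(
    name="notes_l2", # hypothetical collection name
    metadata={"hnsw:space": "l2"}, # squared L2 distance (the default)
)
collection_ip = client.get_or_create_collection(
    name="notes_ip", # hypothetical collection name
    metadata={"hnsw:space": "ip"}, # inner product
)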

Add embeddings into ChromaDB

collection.add(
  embeddings=embeddings.tolist(), # ChromaDB expects lists (or numpy arrays), not torch tensors
  metadatas=metadatas,
  documents=notes,
  ids=[ str(x['id']) for x in metadatas ], # NOTE: Must be a unique *string* for each embedding
)
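A quick sanity check that the rows actually landed (using the collection and string ids from the snippet above):

print(collection.count()) # should print 2
print(collection.get(ids=['1'])) # look up a single record by its string id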

Fetch embeddings that meet filter criteria

query_id: int = 1 # filter to records whose metadata 'id' equals 1

results = collection.query(
    query_embeddings=query_embedding.tolist(),
    where={"id": query_id},
    include=["metadatas", "documents", "distances"],
)

# Convert results into a list of dicts (the leading [0] indexes the first/only query embedding)
records: List[Dict[str, Any]] = [
  { 'id' : id, 'distance' : distance, 'text' : text, 'metadata' : metadata }
  for id, distance, text, metadata in zip(results['ids'][0], results['distances'][0], results['documents'][0], results['metadatas'][0])
]
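From here, the "generation" half of RAG is just string formatting. A minimal sketch (the prompt template below is my own, not anything ChromaDB provides):

context: str = '\n'.join(r['text'] for r in records)
prompt: str = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
# `prompt` can now be sent to whatever LLM you're using for generation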

Pitfalls

Cosine Similarity != Cosine Distance

The collection.query function always returns distances, not similarity scores (unlike some other vector DBs). This tripped me up the first time.

To get cosine similarity scores, you need to set up your collection with cosine distance, and then compute cosine similarity = 1 - cosine distance.

A code example is below:

collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:space": "cosine"} # NOTE: This does cosine *distance*, not *similarity*!
)

results = collection.query(
    query_embeddings=query_embedding.tolist(),
)

first_result_cosine_sim: float = 1 - results['distances'][0][0] # NOTE: distances are nested per-query, and we do `1 - returned dist` to convert cosine distance -> similarity
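To convince yourself the conversion is right, you can recompute the similarity by hand. A sketch, assuming you run the query against the collection from the Quickstart (which actually contains the two notes) and reusing the notes, embeddings, and query_embedding defined there:

import torch.nn.functional as F

# Find which note the top hit corresponds to, then recompute its cosine similarity manually
top_text: str = results['documents'][0][0]
idx: int = notes.index(top_text)
manual_sim: float = F.cosine_similarity(query_embedding, embeddings[idx:idx+1]).item()
assert abs(manual_sim - first_result_cosine_sim) < 1e-4 # tolerate small float error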

Takeaways

Strengths

  1. Easy to use. It was pretty trivial to save and load my own embeddings.
  2. Locally hosted. Can “reset” the database by just deleting the directory.

Limitations

  1. Incomplete documentation. It feels like the documentation isn't quite there yet.
  2. Locally hosted. Hard to collaborate with others versus cloud solutions like Pinecone.
  3. Runs in memory. So the amount of data it can work with is limited by available RAM.
