AI Tool Review - ChromaDB as a local vector database for RAG
ChromaDB is a Python library that provides a simple, local vector database that runs in-memory, with optional persistence to disk.
It can be used for things like storing embeddings and doing retrieval-augmented generation (RAG).
If you just want to (a) store a bunch of vectors locally and (b) query them fast with (c) as little setup as possible, then ChromaDB is a great choice.
However, I found the documentation pretty lacking, and it does not seem to be as “production-ready” as other solutions.
But for research projects, it’s pretty good. Here are some assorted tips and code snippets:
Setup
To install:
pip install chromadb
Quickstart
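For the examples below, let’s say we’ve already generated our embeddings using something like the following code:
from typing import Any, Dict, List

import torch
from transformers import AutoTokenizer, AutoModel

metadatas: List[Dict[str, Any]] = [ # we will filter our chromaDB by these keys
    { 'id' : 1, 'name' : 'Marty', },
    { 'id' : 3, 'name' : 'Jessica', },
]
notes: List[str] = ['I have diabetes.', 'I have a heart condition.']
# Embed notes
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
with torch.no_grad(): # inference only, so skip gradient tracking
    # [0] is the last hidden state with shape (batch, tokens, hidden);
    # averaging over dim=1 (tokens) gives one vector per note.
    # NOTE: this plain mean also averages over padding tokens.
    embeddings: torch.Tensor = model(**tokenizer(notes, padding=True, truncation=True, return_tensors='pt'))[0].mean(dim=1)
    query: str = 'diabetic'
    query_embedding: torch.Tensor = model(**tokenizer(query, return_tensors='pt'))[0].mean(dim=1)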
Create Persistent local DB
import chromadb
client = chromadb.PersistentClient(path="./data/chroma")
Load collection (or create if not exists)
When choosing the distance metric for your collection, you should use the same metric used to train the model (e.g. for MiniLM, this was cosine).
collection = client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}, # NOTE: This distance metric is 'cosine distance'
)
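To make the metric choice more concrete, here is a small pure-Python sketch of the three distances that, as far as I understand the Chroma docs, correspond to the "hnsw:space" values "l2", "ip", and "cosine". The function names are my own and the formulas are written from the documentation, not copied from Chroma's code, so treat this as an illustration only.

```python
import math
from typing import Sequence


def squared_l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed 'l2' space: sum of squared differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed 'ip' space: one minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed 'cosine' space: one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Orthogonal unit vectors: cosine similarity is 0, so cosine distance is 1.
u, v = [1.0, 0.0], [0.0, 1.0]
orthogonal_cosine = cosine_distance(u, v)
# Identical vectors: cosine similarity is 1, so cosine distance is 0.
identical_cosine = cosine_distance([3.0, 4.0], [3.0, 4.0])
```

In all three cases a smaller number means "closer", which is the opposite direction from a similarity score.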
Add embedding into ChromaDB
collection.add(
    embeddings=embeddings.tolist(), # convert the torch.Tensor to plain lists of floats
    metadatas=metadatas,
    documents=notes,
    ids=[ str(x['id']) for x in metadatas ] # NOTE: Must be unique for each embedding, and must be strings
)
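Since the ids have to be unique strings, it can help to build and check them before calling collection.add. Below is a minimal sketch; make_ids is a hypothetical helper of my own, not part of ChromaDB. How Chroma itself reacts to a repeated id may depend on the version, so checking up front is cheap insurance.

```python
from typing import Any, Dict, List


def make_ids(metadatas: List[Dict[str, Any]], key: str = 'id') -> List[str]:
    """Build string ids from a metadata key and fail early on duplicates."""
    ids = [str(m[key]) for m in metadatas]
    seen = set()
    duplicates = set()
    for i in ids:
        if i in seen:
            duplicates.add(i)
        seen.add(i)
    if duplicates:
        raise ValueError(f"Duplicate ids: {sorted(duplicates)}")
    return ids


ids = make_ids([{'id': 1, 'name': 'Marty'}, {'id': 3, 'name': 'Jessica'}])
```

The resulting list can be passed directly as the ids argument.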
Fetch embeddings that meet filter criteria
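results = collection.query(
    query_embeddings=query_embedding.tolist(), # plain nested list: one embedding per query
    where={"id": query_id}, # query_id: the metadata 'id' value to keep, e.g. 1
    include=["metadatas", "documents", "distances", ],
)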
# Convert results into List of Dicts
records: List[Dict[str, Any]] = [
{ 'id' : id, 'distance' : distance, 'text' : text, 'metadata' : metadata }
for id, distance, text, metadata in zip(results['ids'][0], results['distances'][0], results['documents'][0], results['metadatas'][0])
]
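The where filter above matches on a single metadata key. In the Chroma versions I have seen, filtering on several keys at once means wrapping the conditions in an "$and" operator instead of listing multiple keys in one dict. The sketch below builds such a filter; build_where is a hypothetical helper, and the operator behaviour is my reading of the docs, so double-check it against your installed version.

```python
from typing import Any, Dict, Optional


def build_where(**conditions: Any) -> Optional[Dict[str, Any]]:
    """Turn keyword equality conditions into a Chroma-style `where` dict."""
    clauses = [{key: {"$eq": value}} for key, value in conditions.items()]
    if not clauses:
        return None  # no filtering at all
    if len(clauses) == 1:
        return clauses[0]  # a single condition needs no wrapper
    return {"$and": clauses}  # several conditions must all hold


single = build_where(id=1)
combined = build_where(id=1, name='Marty')
```

The returned dict is meant to be passed as where=... to collection.query.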
Pitfalls
Cosine Similarity != Cosine Distance
The collection.query function always returns distances, not similarity scores (unlike some other vectorDBs). This tripped me up the first time.
To get Cosine Similarity scores, you need to set up your collection with Cosine Distance, and then do cosine similarity = 1 - cosine distance.
A code example is below:
collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:space": "cosine"} # NOTE: This does cosine *distance*, not *similarity*!
)
results = collection.query(
    query_embeddings=embedding,
)
# results['distances'] holds one list per query embedding, so [0][0] is the top hit for the first query
first_result_cosine_sim: float = 1 - results['distances'][0][0] # NOTE: need to do `1 - returned dist` to convert cosine distance -> similarity
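If you want similarities for every hit and not just the first one, the same conversion can be applied across the nested result lists. This is a minimal sketch: cosine_similarities is my own helper name, and mock_results is a hand-written dict that only mimics the shape indexed above (one inner list per query embedding), not real ChromaDB output.

```python
from typing import Any, Dict, List


def cosine_similarities(results: Dict[str, Any]) -> List[List[float]]:
    """Convert cosine distances to similarities, one inner list per query."""
    return [[1.0 - d for d in per_query] for per_query in results['distances']]


# Hand-written stand-in for a query result with one query and two hits.
mock_results = {
    'ids': [['1', '3']],
    'distances': [[0.25, 0.75]],
}
sims = cosine_similarities(mock_results)
```

This only makes sense for collections created with the cosine space; for other metrics the returned distance means something different.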
Takeaways
Strengths
- Easy to use. It was pretty trivial to save and load my own embeddings.
- Locally hosted. Can “reset” the database by just deleting the directory.
Limitations
- Incomplete documentation. It feels like the documentation isn’t quite there yet.
- Locally hosted. Hard to collaborate with others versus cloud solutions like Pinecone.
- Runs in memory, so it will be limited in how much data it can work with.
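To get a rough feel for the memory limitation, here is a back-of-the-envelope estimate of the raw vector storage. It assumes 4 bytes per value (float32) and ignores index overhead, metadata, and documents, so the real footprint will be higher. The helper names are my own; 384 is the embedding size of all-MiniLM-L6-v2.

```python
def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (assumes float32 by default)."""
    return n_vectors * dim * bytes_per_value


def to_gib(n_bytes: int) -> float:
    """Convert bytes to gibibytes."""
    return n_bytes / 1024 ** 3


# One million MiniLM embeddings (384 dimensions each).
total_bytes = raw_vector_bytes(1_000_000, 384)
total_gib = to_gib(total_bytes)  # roughly 1.43 GiB
```

So a million notes is fine on a laptop, but hundreds of millions would not fit in RAM under these assumptions.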
References
- ChromaDB website: https://docs.trychroma.com/
- The most useful ChromaDB doc page: API Cheatsheet