Drop #381 (2023-12-01): Weekend Project Edition

Build Your Own Searchable Vectorized Knowledge Database: No AI Degree Required

I realize this is the start of Advent of Code, so some folks may not need or even want the distraction of yet-another weekend project. You should always prioritize rest and recovery over “crank out one more thing”. I gave myself the freedom to do that with this year’s 30-Day Map Challenge and have not regretted it for one second.

Today’s WPE revolves around a new open-source project from Kagi dubbed VectorDB. Let’s dig into the details.

What Is VectorDB?

Kagi’s VectorDB (GH) is a “lightweight” Python package (the deps are kind of heavy IMO) that provides a VERY easy-to-use interface for storing and retrieving text using chunking, embedding, and vector search techniques. It is designed for use cases where low latency is essential and allows for efficient and accurate retrieval of relevant information from potentially massive datasets.

By converting text into high-dimensional vectors, VectorDB enables quick comparisons and searches, even when dealing with millions of documents. Additionally, embeddings capture the semantic meaning of the text, which improves the quality of search results and enables more advanced natural language processing tasks.

I decided on it for today’s WPE since it only requires Python and you don’t need to be an expert in any of the modern alphabet soup that surrounds these fancy AI language models. VectorDB takes care of chunking text and creating embeddings for it — a grunt-work task I find to be — at best — annoying. It’s also designed to be CPU-friendly, so those of us without $10K workstations or leftover GCP credits can actually have a bit of fun with far less frustration.

It also packages up the use of some modern, foundational libraries for work in this space, and that code is super-readable. If you’ve wanted some very grokable code to get started in this space, VectorDB is 100% a great candidate.

We’re also going to work with it because Kagi uses it to power Kagi Search itself. If it works for them, it’ll work for you.

Getting This Thing Installed

We’re working with icky Python things, so — as usual — it’s a potential dependency nightmare waiting to happen. There’s also a misspelled package import in the VectorDB PyPI package code that we’ll temporarily work around. And, there are some special considerations on macOS that we will account for.

The following dance in bash should get everyone going:

mkdir whatev
cd whatev

python3 -m venv vectordb
source vectordb/bin/activate

# This spaCy model is required and helps with the data prep
python3 -m pip install spacy
python3 -m spacy download en_core_web_sm

# --- Start of macOS required ops for mrpt ---

brew install llvm libomp

export LDFLAGS="-L/opt/homebrew/opt/llvm/lib -L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/libomp/include"
export PATH="/opt/homebrew/opt/llvm/bin:$PATH"

# --- End of macOS required ops for mrpt ---

python3 -m pip install git+https://github.com/vioshyvo/mrpt/

python3 -m pip install vectordb2
# yes there is a different vectordb out there
# gotta love the confusion in py-land

You will also need to edit vectordb/lib/python3.11/site-packages/vectordb/vector_search.py and change the misspelled mprt import to mrpt. Doing it this way is far easier than reinstalling the tensorflow bits from scratch.
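If you’d rather script that one-character fix than hand-edit the file, a few lines of stdlib Python will do it. The site-packages path in the usage comment assumes the venv layout from above — adjust the Python version and venv name to match yours.

```python
# Sketch: rewrite the misspelled "mprt" import to "mrpt" in vectordb's
# vector_search.py. Pure stdlib; point it at your own venv's copy.
from pathlib import Path

def fix_mrpt_import(path: Path) -> bool:
    """Replace 'mprt' with 'mrpt' in the file; return True if anything changed."""
    src = path.read_text()
    fixed = src.replace("mprt", "mrpt")
    if fixed == src:
        return False
    path.write_text(fixed)
    return True

# usage (path assumes the venv created above):
# fix_mrpt_import(Path("vectordb/lib/python3.11/site-packages/vectordb/vector_search.py"))
```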

If you can start a Python interpreter and successfully do from vectordb import Memory, you should be good to go.

Running Kagi’s Example

For the next step, I suggest copying Kagi’s slightly larger example to a file (I used vt.py because I’m horrible at naming files in circumstances like this) and running it. In my version I added an import json at the top and changed the final print to use json.dumps so I could pipe the whole thing to jq.

If you go through the sample code, the idiom is pretty basic:

  • create a “Memory” where the data will be stored. If you read through the README it shows how to use a backing store to persist this Memory and re-load it for future ops.

  • add new elements to the memory. These elements are text blobs with any associated metadata you want to attach to them. Their example uses a title and URL. But, you can add anything.

  • perform a natural language search over this Memory.

Under the covers, VectorDB uses FAISS or mrpt for the similarity search, depending on size. You can look at the code to see how this works to gain a deeper understanding of this super cool tech.

There are also quite a few options you can work with when executing a search.

Further Reading

Kagi has a real-world use-case for this similarity search idiom up on Google Colab, and I think folks might find bits of the HN convo useful as well.

Your Mission!

Grab an RSS feed of text content, your markdown notes, a directory of PDFs you’ve converted to text, … anything!

Read through the README to see what to do to ensure your work is persisted.

Then, write a small Python script to load your data into a persisted Memory, and start searching!

FIN

You may not end up using this in “production”, but it’s a fine way to have super-custom topic similarity search capability for a targeted text corpus; especially since it has all the necessary batteries included and does not require extra moving parts like a full-on database. ☮️
