👋 Hi, this is Sarah with the weekly issue of the Dutch Engineer Newsletter. In this newsletter, I cover data and machine learning concepts, code best practices, and career advice that will help you accelerate your career.
Vector databases have recently gained popularity because of the huge impact of ChatGPT and other large language models. At the end of last year, following the introduction of ChatGPT, Pinecone, which created a vector database geared toward data scientists in 2021, raised a $100M investment, and Zilliz, the start-up behind the open-source vector database Milvus, raised $60M. I will shed light on this sudden interest by explaining the basics of vector databases and how they apply to specific use cases.
In this article, we will cover:
What is a vector database?
How can we use a vector database in common applications?
Thanks to Delta for sponsoring this newsletter! I am a huge fan of Delta Lake and use it every day both in Data Engineering and Machine Learning.
What is a vector database?
A vector database is a type of database that takes advantage of similarity calculations between embeddings and vectors. Let’s dive into what this actually means.
At its core, a vector represents a location in space. If a vector has the coordinates [0,1], then its x coordinate is 0 and its y coordinate is 1. On a graph, this vector looks like this:
If we then have another vector, [1,1], that would look like this:
We can also graph both vectors on the same plot:
These two vectors are close together as they are only one unit in the x direction away from each other.
Then, if we add a third vector [4,3] on that same graph (denoted in green),
we can see that it is much further away from [0,1] and [1,1]. We can quantify how far apart two points are with the distance between them, often calculated using the Euclidean distance formula.
For just two coordinates or vectors, the Euclidean distance formula is as follows:
d = √[(x2 – x1)^2 + (y2 – y1)^2]
where,
(x1, y1) are the coordinates of one point.
(x2, y2) are the coordinates of the other point.
d is the distance between (x1, y1) and (x2, y2).
Note: as we increase in dimensions, we add a term for each additional coordinate (third, fourth, and so on) by subtracting, squaring, and including it in the sum.
d = √[(x2 – x1)^2 + (y2 – y1)^2 + (z2-z1)^2 + …]
To apply this formula, we take the difference between the two points in each coordinate, square each difference, add the squares together, and take the square root of the sum.
The distance between [0,1] and [1,1] is
d = √[(x2 – x1)^2 + (y2 – y1)^2] = √[(1 – 0)^2 + (1 – 1)^2] = √1 = 1
The distance between [0,1] and [4,3] is
d = √[(x2 – x1)^2 + (y2 – y1)^2] = √[(4 – 0)^2 + (3 – 1)^2]
= √[16 + 4] = √20 = 2√5 ≈ 4.47
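The formula above translates directly into code. Here is a minimal sketch of a general Euclidean distance function that works for any number of dimensions (the function name is my own choice for illustration):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points of equal dimension:
    subtract each pair of coordinates, square, sum, square-root."""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0, 1], [1, 1]))  # 1.0
print(euclidean_distance([0, 1], [4, 3]))  # √20 ≈ 4.47
```

Because the function sums over all coordinate pairs, the same code handles the three-dimensional (and higher) cases discussed next.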
Note that we can also calculate distances between vectors with many more than two coordinates; these are often called multidimensional vectors.
Multidimensional vectors occur often in embeddings.
What are embeddings?
Embeddings are vectors that represent real-world objects and relationships. That definition, to be honest, does not tell us much, so I will add an example to further explain.
Let’s imagine we want to compare different sentences together to see how closely related they are.
Text 1: I want ice cream
Text 2: I want a banana
We cannot directly compare the two sentences, but what we can do is pass each sentence through a mathematical function (a machine learning model) that converts it into a vector, and then calculate the distance between those vectors.
Text 1: I want ice cream → model → [1,2,3]
Text 2: I want a banana → model → [1,2,4]
And we can see they are only 1 unit away in the third coordinate.
d = √[(x2 – x1)^2 + (y2 – y1)^2 + (z2 – z1)^2] = √[(1 – 1)^2 + (2 – 2)^2 + (4 – 3)^2] = √1 = 1
If we had a third text and converted it to an embedding,
Text 3: I need banana → model → [1, 3, 4]
we can then calculate the Euclidean distance from text 1: √2 ≈ 1.41, and from text 2: √1 = 1. These calculations suggest that text 3 is closer to text 2 than to text 1, which makes sense: texts 2 and 3 both talk about bananas.
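The three-text comparison above is exactly what a similarity search does: given a query embedding, find the stored embedding with the smallest distance. A minimal sketch with the toy vectors from the example (a real model would output hundreds of dimensions, not three):

```python
import math

def euclidean_distance(a, b):
    """Subtract each coordinate pair, square, sum, square-root."""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))

# Toy embeddings from the example above.
embeddings = {
    "I want ice cream": [1, 2, 3],
    "I want a banana": [1, 2, 4],
}
query = [1, 3, 4]  # embedding of "I need banana"

# Find the stored text whose embedding is closest to the query.
closest = min(embeddings, key=lambda t: euclidean_distance(embeddings[t], query))
print(closest)  # "I want a banana"
```

A vector database does the same thing, but with indexing structures that make the nearest-neighbor lookup fast over millions of vectors instead of a linear scan over two.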
Now, if our text were not a single three-word sentence but many different sentences with thousands of words, the model would still map it to a single vector; that vector is the embedding of the text.
And those embeddings are what vector databases hold. And that distance equation we were talking about? That’s what is often referred to as similarity searching and can be calculated using other methods than Euclidean distances.
Vector databases are powerful because they can identify similar objects quickly and efficiently, provided the object is correctly embedded with an embedding or encoding model such as word2vec or BERT (Bidirectional Encoder Representations from Transformers).
How can we use a vector database in common applications?
Since vector databases can basically be used with anything that can be vectorized (or embedded), the applications are endless. In this section, I want to focus on a few specific examples and explain them more thoroughly.
Recommendation systems
A recommender system is a tool that predicts what users will like, often for products or other items. Collaborative filtering is a common technique for building recommendation systems: it looks at what similar users already like to predict what a given user will like. Most modern collaborative filtering systems use embeddings.
For this example, we will use a user's embeddings and an item's embeddings to make predictions on whether a user will like that item, connecting users with new products. If two users like similar things, they will receive similar recommendations.
I have written about a movie recommendation system here using BentoML and Airflow.
Computer Vision
Vector databases can be used in computer vision applications, such as image and video search engines. Objects within images can be represented as vectors and compared to other vectors to quickly and efficiently identify similar images. This can be particularly useful in applications such as reverse image search, where a user can input an image and quickly find similar images.
Semantic Search
Vector databases are also used in semantic search applications, which involve finding the most relevant results based on the meaning of the user's query. By representing words and phrases as vectors, vector databases can quickly and efficiently find the most similar results to the user's query. This can be particularly useful in applications such as chatbots or virtual assistants or your latest Google search, where understanding the meaning behind the user's query is crucial.
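Semantic search systems often rank results by cosine similarity rather than Euclidean distance, since it compares the direction of the embeddings instead of their magnitude. A minimal sketch with made-up document embeddings (a real system would obtain these from a model such as BERT and store them in a vector database):

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; lower means less similar."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical document embeddings.
docs = {
    "how to bake bread": np.array([0.9, 0.1, 0.0]),
    "training a neural network": np.array([0.0, 0.2, 0.9]),
}
query = np.array([0.1, 0.1, 0.95])  # embedding of the user's question

# Return the document whose embedding is most similar to the query.
best = max(docs, key=lambda d: cosine_similarity(docs[d], query))
print(best)  # "training a neural network"
```

The key point: the match is based on the meaning captured by the embeddings, not on keyword overlap, which is what makes the search "semantic".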
Conclusion
Vector databases are powerful tools that are becoming increasingly popular in the field of machine learning and artificial intelligence. They are used in a wide range of applications, including recommendation systems, image and video search engines, and natural language processing. If you are working on an application that involves the identification of similarities between objects, then a vector database may be a good choice for your project.
If you found this article helpful, please consider sharing it.