chroma vector store doesn't return similar questions (even the same question) #187

Closed
mrtunguyen opened this issue Sep 26, 2023 · 9 comments

Comments

@mrtunguyen

mrtunguyen commented Sep 26, 2023

Hi,

I ran into a problem: the Chroma vector store doesn't return exactly the same question that was added to the golden records. Digging into the code, I found that when a record is added to the Chroma collection, no embeddings are passed for it, so they end up as None. I think that's why it doesn't work as expected.

target_collection.add(documents=documents, metadatas=metadata, ids=ids)

@mrtunguyen
Author

Hmm, my bad: I concluded too quickly that the embedding ends up as None (refer to this issue).
In fact, the Chroma collection uses ONNXMiniLM_L6_V2 as its default embedding function, so that part should be fine.

But then why does querying exactly the same question that exists in target_collection return nothing?
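
For context on why an exact match is expected: with any deterministic embedding function, identical text maps to an identical vector, so a stored question should be its own nearest neighbour. The sketch below uses a toy word-count embedding (`toy_embed`, a hypothetical stand-in, not Chroma's ONNXMiniLM_L6_V2) purely to illustrate that. If a query for a stored question returns nothing at all, an empty collection is a more likely cause than the embeddings themselves.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding function: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: str, stored: List[str]) -> Optional[Tuple[str, float]]:
    """Return the most similar stored question, or None if nothing is stored."""
    if not stored:
        return None
    q = toy_embed(query)
    scored = [(s, cosine_similarity(q, toy_embed(s))) for s in stored]
    return max(scored, key=lambda pair: pair[1])
```

With a non-empty list, querying a stored question verbatim returns that question with a similarity of about 1.0; with an empty list, `nearest` returns None, which mirrors the symptom reported here.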

@MohammadrezaPourreza
Contributor

Hello @mrtunguyen,

I want to express my gratitude for your valuable contribution to enhancing our work.

I've conducted tests on the latest version of Dataherald, and using ChromaDB, I was able to successfully retrieve the same question that I had added as golden records.

One potential reason for not obtaining results from the Chroma vector store could be related to the way ChromaDB stores vectors in memory. Whenever you create a new container with the --build flag, you lose the previously stored vectors in ChromaDB. However, please be aware that you will still see the golden records stored in the Mongo collection.
Thank you once again for your efforts and contributions.

@ppmarkus
Contributor

ppmarkus commented Sep 28, 2023

For the tests I've done, it seems to be working fine. When I ask exactly the same question I get

Thought: The first example question is exactly the same as the given question. I can use the SQL query from the example and modify it to fit the given question.

It then checks that the tables, columns, etc. are those recognized by the system (i.e. those that have been processed by /api/v1/table-descriptions/sync-schemas, I guess).

Provided that checks out, it should work.
If you have a golden record that references tables or columns that have not been scanned, it might not work as expected.

@mrtunguyen
Author

Thank you all for your replies. I will check today.

One potential reason for not obtaining results from the Chroma vector store could be related to the way ChromaDB stores vectors in memory. Whenever you create a new container with the --build flag, you lose the previously stored vectors in ChromaDB. However, please be aware that you will still see the golden records stored in the Mongo collection.

In that case, shouldn't we update ChromaDB automatically with the golden records stored in Mongo?

@jcjc712
Contributor

jcjc712 commented Sep 28, 2023

@mrtunguyen
We have identified two potential solutions:

1. Script-Based Solution:
One approach involves creating a script that runs each time the app container starts up. This script would be responsible for retrieving all the 'golden_records' rows and seamlessly inserting them into Chroma.

2. Dockerized Chroma Solution:
Alternatively, you can opt to create a Chroma container within the 'docker-compose' file and configure the necessary volumes to persist the data. Then, in our current Chroma implementation located in the 'vector_store' folder, establish a connection to the newly created Chroma container.

If you do implement either of these, it would be great if you could raise a PR. Otherwise, we will get a fix in place early next week.
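
A minimal sketch of the first option, assuming the golden records can be read as a list of dicts. `InMemoryVectorStore`, `repopulate`, and the `id`/`question`/`sql_query` field names are hypothetical stand-ins for illustration, not Dataherald's actual classes or schema. In the real app the records would come from the Mongo golden_records collection and the target would be the Chroma collection.

```python
from typing import Dict, List


class InMemoryVectorStore:
    """Hypothetical stand-in for an in-memory collection; empty on every start."""

    def __init__(self) -> None:
        self.items: Dict[str, dict] = {}

    def add(self, documents: List[str], metadatas: List[dict], ids: List[str]) -> None:
        # Re-adding an existing id overwrites it, so repeated runs do not duplicate.
        for doc, meta, id_ in zip(documents, metadatas, ids):
            self.items[id_] = {"document": doc, "metadata": meta}

    def count(self) -> int:
        return len(self.items)


def repopulate(store: InMemoryVectorStore, golden_records: List[dict]) -> int:
    """Copy every persisted golden record into the freshly started vector store."""
    if not golden_records:
        return 0
    store.add(
        documents=[r["question"] for r in golden_records],
        metadatas=[{"sql_query": r["sql_query"]} for r in golden_records],
        ids=[str(r["id"]) for r in golden_records],
    )
    return len(golden_records)
```

Running something like `repopulate` at container start would bring the in-memory store back in line with what is persisted, at the cost of re-embedding every record on each start.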

@ppmarkus
Contributor

ppmarkus commented Oct 4, 2023

I have built a script-based solution that takes config from JSON files and other database stores.

Data that doesn't change much, like database-connections, table-descriptions, and instructions, can be stored as JSON in a file and loaded, but golden-records are updated over time to provide better coverage.

I run the script on demand when some config data changes.
The script just deletes all entries, such as golden-records, by first retrieving them all from the DB and then calling the delete API. After that, it calls the API to create new ones. That way I know that both MongoDB and the Chroma memory are updated (at least in the case of DefaultContextStore).

I'm wondering if this is something I could contribute?
I could make assumptions about the format (JSON), standardize the attributes, and then provide a script that refreshes the context when it is run (which could be on container load, via an API call, etc.). The files could be shared through a Docker volume, etc.

It won't satisfy every user's requirements, but it could be used as a reference.
Just a suggestion.
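
The delete-and-recreate refresh described in this comment could be sketched as follows. `GoldenRecordApi` is a hypothetical interface standing in for whatever HTTP client calls the list, delete, and create endpoints. The method names and record fields are assumptions made for illustration, not the actual Dataherald API.

```python
from typing import Dict, List, Protocol


class GoldenRecordApi(Protocol):
    """Hypothetical client interface; the method names are assumptions."""

    def list_records(self) -> List[dict]: ...
    def delete_record(self, record_id: str) -> None: ...
    def create_records(self, records: List[dict]) -> List[dict]: ...


def refresh_golden_records(api: GoldenRecordApi, desired: List[dict]) -> int:
    """Delete every existing golden record via the API, then recreate from `desired`.

    Going through the API rather than writing to the database directly lets the
    service keep its document store and its vector store in step.
    """
    for record in api.list_records():
        api.delete_record(record["id"])
    created = api.create_records(desired)
    return len(created)


class FakeApi:
    """In-memory fake used only to exercise the sketch."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}
        self._next_id = 1

    def list_records(self) -> List[dict]:
        return list(self.records.values())

    def delete_record(self, record_id: str) -> None:
        del self.records[record_id]

    def create_records(self, records: List[dict]) -> List[dict]:
        created = []
        for r in records:
            rec = dict(r, id=str(self._next_id))
            self._next_id += 1
            self.records[rec["id"]] = rec
            created.append(rec)
        return created
```

For example, `refresh_golden_records(FakeApi(), [{"question": "a"}])` leaves exactly one record behind, whatever the fake held before.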

@mrtunguyen
Author

I think it should be done if the vector store in use is Chroma; it isn't necessary in other cases (for example, Pinecone).

@ppmarkus
Contributor

ppmarkus commented Oct 5, 2023

In my case it is just the default Chroma context store, but I think it doesn't matter because the scripts use the API to upload. As long as your flavor of context store implements the dataherald.context_store.ContextStore interface, it is supported.

@jcjc712
Contributor

jcjc712 commented Oct 20, 2023

@mrtunguyen Hi, we added a script that uploads the golden records from the MongoDB golden_records collection into the vector store (Chroma or Pinecone). Please check this in the documentation.

Just run this command:

docker-compose exec app python3 -m dataherald.scripts.delete_and_populate_golden_records
