[1/3 - anomaly detection] [1/2 - KNN classification] Batch upload dataset to Qdrant (crops dataset)

Target Audience

This workflow is designed for data scientists, machine learning engineers, and developers who are working with image datasets and require efficient methods for anomaly detection and classification. It is particularly useful for those using Qdrant for vector similarity search and Voyage AI for image embeddings. Users who need to batch process large image datasets stored in Google Cloud Storage will find this workflow beneficial.

Problem Solved

This workflow automates uploading a large image dataset to Qdrant in preparation for anomaly detection and classification. It checks whether the target collection already exists, creates it if necessary, embeds the images with Voyage AI, and uploads the embeddings with their metadata to Qdrant in batches. Along the way it filters out a chosen class of images (for example, tomatoes), keeping that class out of the collection used for anomaly detection.

Workflow Steps

1. Trigger the workflow manually: The process begins when the user clicks ‘Test workflow’.
2. Retrieve images from Google Cloud Storage: The workflow fetches image URLs from a specified bucket.
3. Prepare data for Qdrant: Extracts relevant fields like public links and crop names from the fetched data.
4. Check for existing Qdrant collection: Verifies if a collection already exists to avoid duplication errors.
5. Create Qdrant Collection: If the collection does not exist, it creates a new one with specified parameters, including vector size and similarity metric.
6. Index payloads: Sets up an index on the crop_name field to optimize future queries.
7. Filter out unwanted images: Excludes images of specific crops (e.g., tomatoes) to focus the analysis on relevant data.
8. Batch processing: Splits the remaining images into batches and generates a UUID for each image, to be used as its point ID in Qdrant.
9. Embed images: Sends batch requests to the Voyage AI API to create embeddings for the images.
10. Upload to Qdrant: Finally, uploads the embedded images along with their metadata to the Qdrant collection in batches.
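
The filtering and batching steps (7 and 8) can be sketched outside the workflow editor as plain Python. This is a minimal illustration, not the workflow's actual node code: the function names and the `public_link` key are hypothetical, and only the `crop_name` field comes from the description above.

```python
import uuid
from itertools import islice


def filter_crops(items, excluded=("tomato",)):
    """Step 7: drop items whose crop_name is in the excluded set (case-insensitive)."""
    excluded_set = {name.lower() for name in excluded}
    return [item for item in items if item["crop_name"].lower() not in excluded_set]


def make_batches(items, batch_size):
    """Step 8: yield batches of items, pairing each item with a fresh UUID."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls at most batch_size items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield {
            "ids": [str(uuid.uuid4()) for _ in batch],  # one point ID per image
            "payloads": batch,
        }


if __name__ == "__main__":
    # Hypothetical sample records; real ones would come from the storage bucket listing.
    images = [
        {"crop_name": "tomato", "public_link": "https://example.com/tomato-1.jpg"},
        {"crop_name": "cucumber", "public_link": "https://example.com/cucumber-1.jpg"},
        {"crop_name": "banana", "public_link": "https://example.com/banana-1.jpg"},
        {"crop_name": "cucumber", "public_link": "https://example.com/cucumber-2.jpg"},
    ]
    for batch in make_batches(filter_crops(images), batch_size=2):
        print(len(batch["ids"]), [p["crop_name"] for p in batch["payloads"]])
```

Each batch keeps IDs and payloads in the same order, so the embeddings returned for a batch can be matched to their points by position before upload.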

Customization Guide

Users can customize this workflow by:
- Modifying the Google Cloud Storage bucket: Change the bucketName parameter to point to a different dataset.
- Adjusting the filtering criteria: Update the filter conditions to include or exclude different crop types based on the analysis requirements.
- Changing the embedding model: If using a different model from Voyage AI, update the model parameter in the embedding step.
- Altering batch sizes: Adjust the batchSize variable in the Qdrant cluster variables to optimize performance based on the dataset size.
- Customizing Qdrant collection settings: Modify the vector size and similarity metric in the collection creation step to fit different types of data or analysis needs.
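
To show how these settings relate to each other, the sketch below gathers them in one place and builds the JSON bodies that Qdrant's REST API accepts for creating a collection, indexing a payload field, and upserting a batch of points. All default values are placeholders, not the workflow's actual settings: the bucket name, collection name, model name, vector size, and batch size are assumptions, and the vector size must match the output dimension of whichever embedding model is used.

```python
from dataclasses import dataclass

# Distance metrics accepted by Qdrant when creating a collection.
QDRANT_DISTANCES = {"Cosine", "Euclid", "Dot", "Manhattan"}


@dataclass(frozen=True)
class WorkflowConfig:
    bucket_name: str = "my-crops-bucket"          # hypothetical bucket
    collection_name: str = "crops"                # hypothetical collection
    excluded_crops: tuple = ("tomato",)
    embedding_model: str = "voyage-multimodal-3"  # assumed model name
    vector_size: int = 1024                       # must match the model's output dimension
    distance: str = "Cosine"
    batch_size: int = 4                           # placeholder value

    def __post_init__(self):
        if self.distance not in QDRANT_DISTANCES:
            raise ValueError(f"unsupported distance: {self.distance}")
        if self.vector_size < 1 or self.batch_size < 1:
            raise ValueError("vector_size and batch_size must be positive")


def collection_body(cfg):
    """Body for PUT /collections/{collection_name}."""
    return {"vectors": {"size": cfg.vector_size, "distance": cfg.distance}}


def index_body(field_name="crop_name"):
    """Body for PUT /collections/{collection_name}/index (keyword index)."""
    return {"field_name": field_name, "field_schema": "keyword"}


def upsert_body(ids, vectors, payloads):
    """Body for PUT /collections/{collection_name}/points in batch form."""
    if not (len(ids) == len(vectors) == len(payloads)):
        raise ValueError("ids, vectors and payloads must have equal length")
    return {"batch": {"ids": ids, "vectors": vectors, "payloads": payloads}}


if __name__ == "__main__":
    cfg = WorkflowConfig(distance="Dot", batch_size=8)
    print(collection_body(cfg))
    print(index_body())
```

Changing the embedding model and changing the vector size usually go together: a collection created with one size rejects vectors of another, so a model swap typically means recreating the collection.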