Easily Compare LLMs Using OpenAI and Google Sheets

Easily compare outputs from two language models using OpenAI and Google Sheets. This workflow lets you evaluate model responses side by side in a chat interface while logging results for manual or automated assessment. Ideal for teams, it simplifies the process of selecting the best AI model for your needs and lets non-technical stakeholders easily review performance.

7/8/2025
21 nodes
Complex
Tags: manual, complex, langchain, splitinbatches, sticky note, summarize, aggregate, googlesheets, splitout, advanced
Categories:
Complex Workflow, Manual Triggered, Data Processing & Analysis
Integrations:
LangChain, SplitInBatches, Sticky Note, Summarize, Aggregate, GoogleSheets, SplitOut

Target Audience

- Data Scientists: Need to evaluate and compare different LLM outputs for specific use cases.
- AI Developers: Working on AI agents that require assessing multiple language models for performance.
- Product Managers: Want to make informed decisions about which LLM to implement based on real-world evaluations.
- Non-Technical Stakeholders: Can easily review and assess model outputs through Google Sheets without requiring deep technical knowledge.
Problem Solved

This workflow addresses the challenge of efficiently evaluating and comparing outputs from different language models (LLMs). It allows users to:
- Assess the performance of models side by side.
- Log responses in a structured format for easy analysis.
- Make data-driven decisions on which model to use in production based on comparative results.

Workflow Steps

- Step 1: Trigger the workflow by receiving a chat message.
- Step 2: Define the models to compare, such as openai/gpt-4.1 and mistralai/mistral-large.
- Step 3: Loop through each model, sending the same user input to both.
- Step 4: Each model generates a response, which is stored along with the input and context.
- Step 5: Responses are concatenated for comparison and logged to Google Sheets.
- Step 6: Users can evaluate the model outputs directly in the sheet, with options for manual or automated assessment.
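The core of Steps 2–5 can be sketched in plain Python. This is an illustrative sketch only, not the workflow's actual node code: the `call_model` stub stands in for the real LLM call that the LangChain nodes perform, and `compare_models` is a hypothetical helper name.

```python
def call_model(model: str, user_input: str) -> str:
    # Stub standing in for the real LLM call made by the workflow's
    # LangChain nodes; in practice this would be an API request.
    return f"[{model}] response to: {user_input}"

def compare_models(models: list[str], user_input: str):
    # Steps 3-4: send the same input to every model, storing each
    # response together with the model name and the original input.
    rows = []
    for model in models:
        rows.append({
            "model": model,
            "input": user_input,
            "response": call_model(model, user_input),
        })
    # Step 5: concatenate responses for side-by-side comparison
    # before logging them to the sheet.
    combined = "\n\n".join(f"{r['model']}:\n{r['response']}" for r in rows)
    return rows, combined

# Step 2: the two models named in the workflow.
models = ["openai/gpt-4.1", "mistralai/mistral-large"]
rows, combined = compare_models(models, "Summarize our Q3 results.")
```

Each entry in `rows` corresponds to one logged sheet row, and `combined` is the concatenated text a reviewer sees side by side.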
Customization Guide

- Model Selection: Modify the Define Models to Compare node to include additional models as needed.
- Google Sheets Template: Customize the Google Sheets structure to include additional evaluation criteria or change existing ones.
- Memory Management: Adjust the memory nodes to use different backends like Redis or Postgres if required for scalability.
- AI Agent Configuration: Define specific system prompts and tools in the AI Agent node to tailor responses for your use case.
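One way to think about the Google Sheets template customization is as a column layout you extend with evaluation criteria. The sketch below assumes a hypothetical column set (the actual template's columns may differ); adding a criterion is just adding a column name, and unfilled evaluation fields stay blank until a reviewer scores them.

```python
import csv
import io

# Hypothetical sheet columns: the last two are evaluation criteria
# that reviewers fill in manually (or an automated assessor populates).
COLUMNS = ["timestamp", "model", "input", "response", "score", "notes"]

def to_sheet_row(record: dict) -> list:
    # Missing evaluation fields are left blank for later review.
    return [record.get(col, "") for col in COLUMNS]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(COLUMNS)
writer.writerow(to_sheet_row({
    "timestamp": "2025-07-08T12:00:00Z",
    "model": "openai/gpt-4.1",
    "input": "Summarize our Q3 results.",
    "response": "Draft summary text.",
}))
```

Changing the evaluation scheme (say, adding an "accuracy" column) only requires extending `COLUMNS`; existing rows keep working because absent keys map to empty cells.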