Testcases

Overview

The Testcases feature in GTWY AI evaluates how well a system prompt and configuration perform against an expected response. This enables users to evaluate and refine their prompts for improved accuracy when calling LLM provider APIs such as those from OpenAI and Anthropic.

How It Works

For each test case, the user provides:

  1. User Message - The input prompt that the system will process.

  2. Expected Response or Tool Call - The desired output, which can be either a direct response or a tool invocation. No tools are actually executed when running a test case.
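The two inputs above can be pictured as a simple record. The following sketch is purely illustrative; the field names and structure are assumptions, not GTWY AI's actual schema:

```python
# Illustrative sketch of a test case record.
# Field names are assumptions, not GTWY AI's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestCase:
    user_message: str                          # the input prompt the system will process
    expected_response: Optional[str] = None    # expected direct text output
    expected_tool_call: Optional[dict] = None  # expected tool invocation (never executed)

# A test case expecting a direct response:
tc1 = TestCase(user_message="What is 2 + 2?", expected_response="4")

# A test case expecting a tool call instead of a text answer:
tc2 = TestCase(
    user_message="What's the weather in Paris?",
    expected_tool_call={"name": "get_weather", "arguments": {"city": "Paris"}},
)
```

Either the expected response or the expected tool call is set, mirroring the two kinds of expected output a test case can carry.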

Creating test cases from bridge history - Users can also create test cases from previous interactions stored in bridge history.

Selecting a bridge version - Users then select the specific version of the bridge to run the test cases against.

GTWY AI then calls the selected LLM API with that version's configuration and evaluates the response using one of three matching methods:

Matching Methods

  1. Exact Matching - Compares both the type and value of the expected and actual response for a precise match.

  2. AI Matching - Uses another LLM to assess the accuracy of the actual response relative to the expected one and provides a score.

  3. Similarity Matching - Measures the similarity between the expected and actual response using cosine similarity and provides a score.
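The first and third methods can be sketched in a few lines. This is a minimal illustration of the underlying ideas, not GTWY AI's implementation: exact matching compares both type and value, while similarity matching scores two embedding vectors with cosine similarity (in practice the vectors would come from an embedding model; here they are hard-coded):

```python
import math

def exact_match(expected, actual):
    # Exact matching: both the type and the value must agree.
    return type(expected) is type(actual) and expected == actual

def cosine_similarity(a, b):
    # Similarity matching: cosine similarity between two embedding vectors,
    # i.e. the dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(exact_match("4", "4"))   # True
print(exact_match(4, "4"))     # False: same-looking value, different type
score = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.5])
print(round(score, 3))         # close to 1.0 for near-identical vectors
```

AI matching is different in kind: rather than a deterministic comparison, it asks another LLM to judge how close the actual response is to the expected one and return a score.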

Score Evaluation

Instead of a pass/fail verdict, the system provides a score based on the chosen matching method. This score helps users gauge how closely the actual response aligns with the expected output, allowing for iterative improvement. GTWY AI also displays scores from previous versions, so when running test cases on a specific version, users can compare new scores with past results to track improvements and adjust their prompts accordingly.