I like a few of these metric ideas for measuring AI, like Coherence and Instruction Following (which could use a snappier name); others, like Groundedness or Verbosity, seem like a bit of a stretch.
Here are a few metrics for AI performance that we surfaced from a survey of 90+ product leaders, UX designers, and researchers:
Trust: the rate at which participants accept the first answer provided.
Repetition: the number of times users have to re-ask a prompt in a session to get a desired answer.
Frustration: the amount of negative terms/language that's used by participants in their follow-up prompts to the AI.
You can check out the survey and the participant responses here: Helio | Design Insights
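To make these three metrics concrete, here's a minimal sketch of how they might be computed from session logs. This is purely illustrative: the `sessions` data shape, the field names (`prompts`, `accepted_first_answer`), and the tiny negative-term lexicon are all assumptions, not anything from the survey itself.

```python
import string

# Toy lexicon of negative terms; a real study would use a proper
# sentiment model or a coded list from the research team. (Assumed.)
NEGATIVE_TERMS = {"no", "wrong", "useless", "frustrating"}

def _words(text):
    # Lowercase and strip punctuation so "No," matches "no".
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

def trust_rate(sessions):
    # Trust: share of sessions where the first answer was accepted.
    return sum(s["accepted_first_answer"] for s in sessions) / len(sessions)

def avg_repetition(sessions):
    # Repetition: average number of prompts per session needed
    # to reach a desired answer.
    return sum(len(s["prompts"]) for s in sessions) / len(sessions)

def frustration_score(sessions):
    # Frustration: fraction of follow-up prompts (everything after the
    # first prompt) that contain a negative term.
    followups = [p for s in sessions for p in s["prompts"][1:]]
    if not followups:
        return 0.0
    flagged = sum(any(t in _words(p) for t in NEGATIVE_TERMS) for p in followups)
    return flagged / len(followups)

# Hypothetical session logs with illustrative field names.
sessions = [
    {"prompts": ["summarize this doc"], "accepted_first_answer": True},
    {"prompts": ["summarize this doc", "No, make it shorter"],
     "accepted_first_answer": False},
]
print(trust_rate(sessions), avg_repetition(sessions), frustration_score(sessions))
```

The nice property of framing the metrics this way is that all three fall out of the same raw session logs, so they can be tracked together over time.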
Pairwise metrics involve comparing the responses of two models and picking the better one to create a win rate. This is often used when comparing a candidate model with the baseline model. These metrics work well in cases where it’s difficult to define a scoring rubric and preference is sufficient for evaluation.
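A quick sketch of the win-rate calculation described above, with ties counted as half a win (one common convention, not the only one). The `judge` here is a stand-in for whatever does the preference call in practice (a human rater or an LLM judge); the length-based toy judge is purely an assumption for the demo.

```python
def pairwise_win_rate(candidate_responses, baseline_responses, judge):
    """Return the candidate model's win rate over paired comparisons.

    `judge(cand, base)` returns "candidate", "baseline", or "tie".
    Ties are split evenly (0.5 of a win each).
    """
    wins = 0.0
    for cand, base in zip(candidate_responses, baseline_responses):
        preferred = judge(cand, base)
        if preferred == "candidate":
            wins += 1.0
        elif preferred == "tie":
            wins += 0.5
    return wins / len(candidate_responses)

# Toy judge that prefers the shorter answer -- a stand-in for a real
# human or LLM preference judge. (Illustrative assumption only.)
def length_judge(a, b):
    if len(a) < len(b):
        return "candidate"
    if len(b) < len(a):
        return "baseline"
    return "tie"

rate = pairwise_win_rate(
    ["short", "ok", "same"],
    ["a much longer answer", "ok!", "same"],
    length_judge,
)
print(rate)
```

A win rate meaningfully above 0.5 suggests the candidate model is preferred over the baseline, which is exactly why this setup works when a scoring rubric is hard to pin down.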
At a high level, this seems to align with our conversation the other day: identifying an example of an ideal response to compare against the AI's first pass allows for a round of refinement.