Google's AI Metrics - How Google is Defining Them

Howdy to a fantastic Tuesday, Glarians!

I’ve come here to find an answer to a simple question:

What are some AI Metrics that make sense to track?

Here’s an interesting find from one of Google’s blogs. These are some of the key metrics they highlight (a quick scoring sketch follows the list):

  1. Coherence: Measures the model’s ability to generate a coherent response based on the prompt.
  2. Fluency: Measures the model’s language mastery based on the prompt.
  3. Safety: Measures the level of harmlessness in a response.
  4. Groundedness: Measures the ability to provide or reference information included only in the prompt.
  5. Instruction following: Assesses a model’s ability to follow instructions provided in the prompt.
  6. Verbosity: Measures a model’s conciseness and its ability to provide sufficient detail without being too wordy or brief.
  7. Text quality: Measures how well a model’s responses convey clear, accurate, and engaging information that directly addresses the prompt.
  8. Summarization quality: Measures the overall ability of a model to summarize text.
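
Most of these read like rubric scores from an LLM judge. Here’s a minimal sketch of what scoring one of them (Coherence) might look like, assuming a 1-5 rubric; the `call_judge_model` function and the rubric wording are hypothetical stand-ins, not Google’s actual templates or API.

```python
# Minimal sketch of a rubric-based "coherence" metric scored by an LLM
# judge. The rubric text, the 1-5 scale, and call_judge_model are
# illustrative assumptions, not Google's actual templates or API.
import re

COHERENCE_RUBRIC = """\
Rate the coherence of the response on a scale of 1 to 5, where
1 = incoherent and 5 = fully coherent and logically organized.

Prompt: {prompt}
Response: {response}

Reply with only the number."""


def call_judge_model(judge_prompt: str) -> str:
    """Hypothetical judge-model call; wire this to your LLM provider."""
    raise NotImplementedError


def score_coherence(prompt: str, response: str) -> int | None:
    """Ask the judge model for a 1-5 coherence score and parse it."""
    raw = call_judge_model(COHERENCE_RUBRIC.format(prompt=prompt, response=response))
    match = re.search(r"[1-5]", raw)
    return int(match.group()) if match else None
```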

What are some other ones that make sense to add to this list? Are these the most important ones to define?


I like a few of these metric ideas for measuring AI, like Coherence and Instruction Following (which could use a snappier name); others, like Groundedness or Verbosity, seem like a bit of a stretch.

Here are a few metrics for AI performance that we surfaced from a survey of 90+ product leaders, UX designers, and researchers (a rough sketch of how these might be computed from session logs follows the list):

  • Trust: the rate at which participants accept the first answer provided

  • Repetition: the number of times users have to re-ask a prompt in a session to get the desired answer

  • Frustration: the amount of negative language used by participants in their follow-up prompts to the AI

You can check out the link to the survey and the participant responses here: Helio | Design Insights
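
Purely as an illustration (my assumption about the setup, not Helio’s actual methodology), here’s how those three metrics might be computed from session logs. The log fields and the negative-term list are invented for the sketch.

```python
# Sketch of computing Trust, Repetition, and Frustration from session
# logs. The log fields and NEGATIVE_TERMS are assumptions made for
# illustration, not the survey's actual methodology.
NEGATIVE_TERMS = {"wrong", "no", "not", "bad", "useless", "again"}


def trust_rate(sessions: list[dict]) -> float:
    """Share of sessions where the first answer was accepted."""
    return sum(1 for s in sessions if s["first_answer_accepted"]) / len(sessions)


def avg_repetition(sessions: list[dict]) -> float:
    """Average number of prompts per session before the desired answer."""
    return sum(s["prompts_until_accepted"] for s in sessions) / len(sessions)


def avg_frustration(sessions: list[dict]) -> float:
    """Average count of negative terms in follow-up prompts per session."""
    total = 0
    for s in sessions:
        words = " ".join(s["followup_prompts"]).lower().split()
        total += sum(1 for w in words if w in NEGATIVE_TERMS)
    return total / len(sessions)
```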


Pairwise metrics involve comparing the responses of two models and picking the better one to create a win rate. This is often used when comparing a candidate model with the baseline model. These metrics work well in cases where it’s difficult to define a scoring rubric and preference is sufficient for evaluation.
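
For concreteness, the arithmetic behind a win rate is simple. The "candidate"/"baseline"/"tie" label format below is just an assumed shape for the judgments, and counting ties as half a win is one common convention.

```python
# Win rate for a candidate model vs. a baseline, from pairwise
# judgments. Each judgment is "candidate", "baseline", or "tie";
# this label format is an assumption for the example.
def win_rate(judgments: list[str]) -> float:
    """Fraction of comparisons the candidate wins, counting ties as half."""
    wins = sum(1 for j in judgments if j == "candidate")
    ties = sum(1 for j in judgments if j == "tie")
    return (wins + 0.5 * ties) / len(judgments)


# Example: 6 wins, 2 ties, 2 losses -> 0.70
print(win_rate(["candidate"] * 6 + ["tie"] * 2 + ["baseline"] * 2))
```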

This seems to align with our conversation the other day: identifying an example of an ideal response to compare against the AI’s first pass allows for a round of refinement.
