Intelligence Metrics: The User vs. Product Framework

Last week, @Bryan posted a framework of UX Metrics that revealed the need for Intelligence metrics to address user interactions in the AI product space.

So we took a stab at how we’re thinking about Intelligence metrics broadly.

When developing intelligence metrics, we must address the multifaceted ways users engage with LLMs. By mapping the lifecycle of a prompt (Input and Output) against two distinct perspectives (User and Product), we identify four necessary quadrants for measurement.

1. The Input: Intent vs. Quality

When a user prompts an LLM, they apply their own criteria and understanding to try to achieve a goal. This is Prompt Intent.

However, from a product perspective, the user’s prompt may be of poor quality. This creates a critical distinction: a user may have a clear intent, but a poor input will still degrade the LLM’s performance in serving a result.

2. The Output: Satisfaction vs. Performance

Once the result is served, the user makes a judgment on Result Satisfaction based on their original intention.

The Disconnect: There is often a disconnect between what the user intended to communicate and what was actually served. As a result, the user may perceive a machine error even when the product operated exactly as designed given the input.

To address this, we need to measure Result Performance on the product side. This requires fine-grained metrics that answer technical questions:

  • Is the tone or voice of the output correct?
  • Does the summary take into account the full breadth of the content?
  • Is it providing the correct amount of specificity?
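
As a rough illustration of what a Result Performance measure could look like, here is a minimal Python sketch that turns the three questions above into a rubric. The criterion names, the 0 to 1 scale, and the unweighted average are all assumptions made for the example, not something the framework prescribes. Where the scores come from (human raters, an automated grader) is left open.

```python
from dataclasses import dataclass

# Hypothetical criterion names, one per question in the list above.
CRITERIA = ("tone", "coverage", "specificity")


@dataclass
class ResultPerformance:
    """Scores in [0, 1] for one served result, one score per criterion."""
    tone: float
    coverage: float
    specificity: float

    def overall(self) -> float:
        # Unweighted mean (an assumption); a real product may weight criteria.
        return sum(getattr(self, name) for name in CRITERIA) / len(CRITERIA)

    def weakest(self) -> str:
        # The lowest-scoring criterion is the first place to investigate.
        return min(CRITERIA, key=lambda name: getattr(self, name))


score = ResultPerformance(tone=0.9, coverage=0.5, specificity=0.7)
print(round(score.overall(), 2), score.weakest())
```

Keeping the criteria separate, instead of collapsing them into one number up front, makes it easier to see which technical question a poor result is failing.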

Breaking it Down

Across these four quadrants, different metrics are warranted.

  • The User Side: While we could focus on attitudinal and behavioral UX Metrics, we need more specific metrics to understand the user’s perception, regardless of whether that perception is technically or mechanically correct.
  • The Product Side: The metrics needed here are more nuanced and focus on the specific mechanics of how the model was designed to operate.

This framework helps categorize the complex nature of LLM interactions.
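
To make the four quadrants concrete, here is a small Python sketch of the mapping. The quadrant names come from the post ("Prompt Quality" is inferred from the "Intent vs. Quality" heading). The `diagnose` function and its labels are hypothetical, just one way to read the two output quadrants together for a single interaction.

```python
from enum import Enum


class Stage(Enum):
    INPUT = "input"
    OUTPUT = "output"


class Perspective(Enum):
    USER = "user"
    PRODUCT = "product"


# Lifecycle stage x perspective -> quadrant name.
QUADRANTS = {
    (Stage.INPUT, Perspective.USER): "Prompt Intent",
    (Stage.INPUT, Perspective.PRODUCT): "Prompt Quality",
    (Stage.OUTPUT, Perspective.USER): "Result Satisfaction",
    (Stage.OUTPUT, Perspective.PRODUCT): "Result Performance",
}


def diagnose(satisfied: bool, performed_as_designed: bool) -> str:
    """Hypothetical reading of the two output quadrants for one interaction."""
    if satisfied and performed_as_designed:
        return "aligned"
    if not satisfied and performed_as_designed:
        # The disconnect: the user sees a machine error, the product sees none.
        return "disconnect: look at the input quadrants"
    if satisfied and not performed_as_designed:
        return "lucky result: product-side issue the user did not notice"
    return "product failure"


print(diagnose(satisfied=False, performed_as_designed=True))
```

The second branch is the case described under "The Disconnect": the output quadrants disagree, so the explanation probably sits in the gap between Prompt Intent and Prompt Quality.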


Key Questions

Does this approach resonate with others working in this space?
What metrics would you propose for each quadrant?
Would you push back on the purpose or naming of the quadrants?

Calling out @Kevin_Schumacher , @ben , @nikhil_mahen , @steven_seal for input on this one.

3 Likes

Helio starts with hunches to test against. How do we know this will produce the right hunches? Or how do we know it is choosing the right testing criteria for those hunches?

2 Likes

The word ‘correct’ is used here twice, but how do we ground this word in the context of the user’s intent? Especially if the user’s intent isn’t clear from the original prompt.

2 Likes

I think we should recognize that assessing from User Intent may not be the best approach in all cases. More prescriptive implementations of AI into products will make this differentiation clearer.

BUT, there are some foundational details needed to define what ‘correct’ is, and perhaps there are guiding principles that could help in the development of the metrics.

Thoughts @schuboxaz ?

This is such a cool breakdown!! Great work here @EricZ.

Really cool seeing the chart expanding. @Bryan, you should definitely take a look at this one.

1 Like

The evolution is a bright improvement… This might fit with the work I am exploring around ontological and epistemological foundations for AI.

2 Likes

Meaty post here @EricZ ! haha I’m curious how others are evaluating prompt quality in a way that’s simple enough for product teams but still meaningful for model behavior.

1 Like

Please check my understanding. Are “Intelligence Metrics” meant to measure how well an LLM process is doing? Would “how well it’s doing” in that case be measured by the goal of the LLM feature (or whatever user outcome they are trying to achieve with it)?

Maybe I’m taking too macro of a view of this. I see that “Prompt Intent” may be the goal that we are trying to achieve in the Intelligence Metric.

2 Likes

I have a lot of thoughts on this one. I don’t want to just brain dump in a single post (unless you want me to haha). But this topic of “correct” is key to any feature, especially AI-powered ones. LLMs are probabilistic, so you really need to be careful to set user expectations that aren’t binary. It’s really about defining what progress looks like for the user and designing the product experience and success criteria around that.

2 Likes

@Doug_Curtis that’s right. I may have gotten too specific with ‘prompting’, but I think there’s a first and third-person awareness of the input and output that needs to be considered with Intelligence metrics, where there’s a dialog (of sorts) occurring.

1 Like

I think the big thing I need to wrestle with is understanding how the lines are drawn between the product builder’s responsibilities and the user’s.

On the surface I don’t think we can really have a metric around the quality of what the user entered. The user will do what they do. It’s up to the product builder to set expectations and deliver an experience where the user has made progress. Maybe Prompt Quality is actually the system prompt, and it would be directly impacted by the evals that you would run against the LLM.

Have you all talked about how Intelligence Metrics is different from standard product analytics + Evals?

1 Like

Yes, I very much agree with the idea that users will experience the product the way they want to.

[GIF: using a product]

That said, some patterns can help inform how the product will be used. Each prompt experience, with its personalized history, is unique to the person, which makes it more challenging to improve.

There are scenarios where this might be easier because the intent is more focused, like commerce. (I chatted with Danny Baker about this in relation to Adaptive UX.)

Intelligence metrics vs product analytics + Evals

@Doug_Curtis Here are some ways to think about it. What do you think?

  • Measures system thinking
    Intelligence metrics could show how the system reasons and aligns with the user, not what the user did.

  • Focuses on interaction quality, not outcomes
    Analytics look back, and evals check correctness, but intelligence metrics capture whether the exchange itself made sense.

  • Lives between feeling and doing
    They measure how clearly the system communicates and supports the next move, not just emotion or task success.

  • Surfaces misunderstandings analytics can’t see
    They reveal where the system interprets intent wrong before it becomes a behavioral failure.

  • Tracks usefulness, not just accuracy
    Evals confirm correctness, but Intelligence Metrics show whether the output actually helps the user.

2 Likes

That gif is perfect :rofl:

How are you thinking of the “system” in this example?

Zooming out a little and thinking about the goals of each of these metrics, here’s how I think of them:

  • Product Analytics: Engagement, adoption, retention. Generally: is the feature getting the users to a better place?
  • Eval: In an LLM process, are the permutations of user input, combined with my system prompt + specific model, yielding responses that are within the product builder’s intent?

Basically I’m thinking of an LLM call as an implementation detail to a broader product experience or feature. The way you measure the effectiveness of an LLM call is different than traditional analytics because UI elements are deterministic and the LLM response is probabilistic.

I would personally start with standard product analytics to make sure the feature is meeting its business goal. If it’s not and I root cause the problem to be the LLM response, I would dig into my eval dataset to ensure the user inputs from production are working correctly with my system prompt. If that’s the culprit, I’d start adjusting the prompt/changing models until the response is aligned with my intent.
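
To show what "digging into my eval dataset" could look like, here is a minimal Python sketch of an eval pass rate. Everything here is hypothetical: `pass_rate`, the `EvalCase` shape, and `fake_llm`, which is a random stand-in for a real model call. Each input is sampled several times because the response is probabilistic, so a single call says little.

```python
import random
from typing import Callable

# Hypothetical eval case: a production user input plus a check that encodes
# the product builder's intent for the response.
EvalCase = tuple[str, Callable[[str], bool]]


def pass_rate(call_llm: Callable[[str, str], str], system_prompt: str,
              cases: list[EvalCase], samples: int = 5) -> float:
    """Fraction of sampled responses that satisfy their case's check."""
    passed = total = 0
    for user_input, check in cases:
        for _ in range(samples):
            passed += check(call_llm(system_prompt, user_input))
            total += 1
    return passed / total if total else 0.0


def fake_llm(system_prompt: str, user_input: str) -> str:
    # Stand-in for a real model: follows a "brief" instruction only some of the time.
    brief = "brief" in system_prompt and random.random() < 0.8
    return "Short answer." if brief else "A very long answer. " * 20


random.seed(0)
cases = [("Summarize my notes", lambda response: len(response) < 100)]
print(pass_rate(fake_llm, "Be brief.", cases))
```

A rate below whatever threshold the team picks would point at the system prompt or model as the thing to adjust, which matches the order described above: analytics first, evals second, prompt and model changes last.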

I’m coming at this from a very technical (but still user focused) perspective, but I could very well be too much in the weeds to be helpful here! I feel like I’m missing some context. I love the idea of Intelligence Metrics and getting laser focused on what we can do as designers/builders to build user experiences in AI that shift the probability in our favor to ensure the users are benefiting.

Design constraints feel like the next big design task to figure out. How to add the right constraints/set expectations to deliver compelling user experiences using probabilistic technology seems like the name of the game as we start incorporating more AI into our products.

2 Likes

That seems about right.

Edwin has a lot of truth bombs in describing quality in AI modeling.

2 Likes

Awesome I’ll check that out! Already I’m nodding looking at that headline :sweat_smile:

2 Likes

Wow, why haven’t you shared this one with me yet @Bryan?

1 Like

I think the subtext in this conversation is around what does it mean to implement from scratch? How do we assess user engagement and product output while the build is in play? How do we shape the output based on our learnings of User Inputs, and provide affordances to the user to help them shape their inputs to align to those outputs?

1 Like

yeah I think that’s right! I might draw a picture to help think this out more, but what you said, Eric, is how I’m thinking about it. An AI product experience is basically a hybrid of traditional software and AI. The AI part can be thought of as a black box in that system, where the product builder can affect the response by changing a few variables (system prompt, model, and user input, to the extent that they can change the UI around how the user enters input).

How do we assess user engagement and product output while the build is in play?

  • I’m not sure I follow what you mean by the “build is in play” but I think this is where traditional product analytics would be used.

How do we shape the output based on our learnings of User Inputs, and provide affordances to the user to help them shape their inputs to align to those outputs?

  • I think this will all depend on the product and feature that’s being built. But in general you could use product analytics techniques to capture the high-level success rates, and for the flows where users fall off you would examine the inputs they used and the AI response to try to pinpoint whether the problem was in the AI part or not. If it is an AI/LLM problem, you would add those user inputs to your battery of eval test cases and start fiddling with the variables (system prompt, model, or re-phrasing/structuring the user input) to get the model to respond in a way that will lead to a successful feature engagement next time.
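
The "fiddling with the variables" step could be sketched roughly like this in Python. The `sweep` function, the model name, the prompts, and `stub_llm` are all made up for illustration. The stub is deterministic so the example runs without a real model, whereas a real sweep would sample each combination several times.

```python
from itertools import product
from typing import Callable


def sweep(call_llm: Callable[[str, str, str], str],
          models: list[str], system_prompts: list[str],
          failed_inputs: list[str],
          check: Callable[[str], bool]) -> list[tuple[float, str, str]]:
    """Score every (model, system prompt) pair against inputs from drop-off flows."""
    if not failed_inputs:
        return []
    results = []
    for model, prompt in product(models, system_prompts):
        hits = sum(check(call_llm(model, prompt, text)) for text in failed_inputs)
        results.append((hits / len(failed_inputs), model, prompt))
    # Best-scoring combination first.
    return sorted(results, reverse=True)


def stub_llm(model: str, system_prompt: str, user_input: str) -> str:
    # Deterministic stand-in: obeys an "uppercase" instruction, otherwise echoes.
    return user_input.upper() if "uppercase" in system_prompt else user_input


ranked = sweep(stub_llm, ["model-a"],
               ["Reply in uppercase.", "Reply normally."],
               ["refund my order"], check=str.isupper)
print(ranked[0])  # (1.0, 'model-a', 'Reply in uppercase.')
```

The inputs that caused drop-offs stay in the eval set afterwards, so later prompt or model changes get checked against them too.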

3 Likes

What if AI had to prove it was telling the truth?

The reason for quiet mistrust in AI technology today lies in the need for accountability. The future depends on AI that can verify itself.

2 Likes

Prompting will be going away soon, and IoT and Spatial technologies will replace screens:

2 Likes