Last week, @Bryan posted a framework of UX Metrics that revealed the need for Intelligence metrics to address user interactions in the AI product space.
So we took a stab at how we’re thinking about Intelligence metrics broadly.
When developing intelligence metrics, we must address the multifaceted ways users engage with LLMs. By mapping the lifecycle of a prompt (Input and Output) against two distinct perspectives (User and Product), we identify four necessary quadrants for measurement.
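To make the grid concrete, here is a minimal Python sketch of the two axes and the four quadrants they produce. The enum and function names are illustrative only, and "Prompt Quality" is an assumed label for the product-side input quadrant, which this post otherwise refers to simply as quality.

```python
from enum import Enum


class Stage(Enum):
    """Where in the prompt lifecycle the measurement happens."""
    INPUT = "input"
    OUTPUT = "output"


class Perspective(Enum):
    """Whose point of view the measurement reflects."""
    USER = "user"
    PRODUCT = "product"


# Quadrant labels. "Prompt Quality" is an assumed name for the
# product-side view of the input; the other three come from the post.
QUADRANTS = {
    (Stage.INPUT, Perspective.USER): "Prompt Intent",
    (Stage.INPUT, Perspective.PRODUCT): "Prompt Quality",
    (Stage.OUTPUT, Perspective.USER): "Result Satisfaction",
    (Stage.OUTPUT, Perspective.PRODUCT): "Result Performance",
}


def quadrant(stage: Stage, perspective: Perspective) -> str:
    """Return the quadrant label for a stage/perspective pair."""
    return QUADRANTS[(stage, perspective)]
```

Any metric we propose should be assignable to exactly one of these four keys, which is a quick check on whether the framework covers it.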
1. The Input: Intent vs. Quality
When a user prompts an LLM, they rely on their own understanding and criteria to try to achieve a goal. This is Prompt Intent.
From a product perspective, however, the user’s prompt may be of poor quality. This creates a critical distinction: a user may have a clear intent, but a poorly formed input will still limit how well the LLM can serve a result.
2. The Output: Satisfaction vs. Performance
Once the result is served, the user makes a judgment on Result Satisfaction based on their original intention.
The Disconnect: There is often a gap between what the user intended to communicate and what was actually served. As a result, the user may perceive a machine error even when the product operated exactly as designed given the input.
To address this, we need to measure Result Performance on the product side. This requires fine-grained metrics that answer technical questions:
- Is the tone or voice of the output correct?
- Does the summary take into account the full breadth of the content?
- Is it providing the correct amount of specificity?
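As a rough illustration, the three questions above could be captured as a simple scoring rubric. This is only a sketch: the 0.0 to 1.0 scale, the field names, and the unweighted average are all assumptions, and the scores themselves would have to come from a human reviewer or an automated grader that is not shown here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ResultPerformance:
    """Hypothetical rubric; each score is assumed to be between 0.0 and 1.0."""
    tone: float         # Is the tone or voice of the output correct?
    coverage: float     # Does the summary reflect the full breadth of the content?
    specificity: float  # Is the level of specificity appropriate?

    def overall(self) -> float:
        # Unweighted mean as a placeholder; real weights would depend on the product.
        return mean([self.tone, self.coverage, self.specificity])

    def weakest(self) -> str:
        # Name of the lowest-scoring criterion, useful for triage.
        scores = {
            "tone": self.tone,
            "coverage": self.coverage,
            "specificity": self.specificity,
        }
        return min(scores, key=scores.get)
```

Keeping the criteria separate, instead of collapsing them into a single number up front, makes it easier to see which aspect of the output fell short.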
Breaking it Down
Across these four quadrants, different metrics are warranted.
- The User Side: While we could rely on general attitudinal and behavioral UX Metrics, we need more specific metrics to understand the user’s perception, regardless of whether that perception is technically or mechanically correct.
- The Product Side: The need for metrics is much more nuanced, focusing on the specific mechanics of how the model was designed to operate.
This framework helps categorize the complex nature of LLM interactions.
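One way the quadrants might be put to work is to compare the user-side and product-side scores for a single interaction and flag where they disagree. The sketch below is hypothetical: the 0.0 to 1.0 scores, the 0.5 threshold, and the category labels are all assumptions for illustration, and Prompt Intent is left out because it is not obviously reducible to a single number.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """Hypothetical per-interaction scores, each assumed to be 0.0 to 1.0."""
    prompt_quality: float       # product-side view of the input
    result_satisfaction: float  # user-side judgment of the output
    result_performance: float   # product-side score of the output


def diagnose(x: Interaction, threshold: float = 0.5) -> str:
    """Categorize an interaction by where user and product views diverge."""
    satisfied = x.result_satisfaction >= threshold
    performed = x.result_performance >= threshold

    if satisfied and performed:
        return "aligned"
    if performed and not satisfied:
        # The product behaved as designed, but the user is unhappy.
        if x.prompt_quality < threshold:
            return "disconnect: likely input quality"
        return "disconnect: intent not captured"
    if satisfied and not performed:
        return "satisfied despite weak output"
    return "output problem"
```

For example, `diagnose(Interaction(0.2, 0.1, 0.9))` lands in the "disconnect: likely input quality" bucket, which is the case where the user sees a machine error that the product side would not count as one.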
Key Questions
Does this approach resonate with others working in this space?
What metrics would you propose for each quadrant?
Would you push back on the purpose or naming of the quadrants?
Calling out @Kevin_Schumacher, @ben, @nikhil_mahen, @steven_seal for input on this one.



