Intelligence Metrics: The User vs. Product Framework

How soon is soon?

My guess is that it’ll be a slow shift, until it’s not

1 Like

The shift is already happening. But yes, there will be a day when you are walking around and notice, “hey… people aren’t looking at their phones any longer… are people even using phones now?… oh wow” (that’s the moment). If you’ve been around a while, you remember that moment for things like web pages vs. the yellow pages, literal phones with screens, physical shopping vs. online shopping, and the list goes on.

2 Likes

Haha, I think we’ll find something to look at. The alternative? That awkward elevator ride of where to look.

1 Like

I had an interesting conversation this weekend. We had visited Alcatraz, and I had asked our exchange student what she liked best about our trip to SF. She said she liked Alcatraz, and had been chatting with ChatGPT about the attempted escapes and the details.

If you take ChatGPT out of the equation and replace it with Google and/or Wikipedia, the outcome can be much the same, assuming the information is of good quality.

I find using ChatGPT more engaging in this use case. It’s more interesting and focused, depending on what I wanted to learn. I did the same thing on my trip to Spain. Even had fun taking pictures to see if it could guess where I was.

Getting back to those intelligence metrics… there’s an argument for the overlap (AI Search intent study: What 50M+ ChatGPT prompts reveal).

But the generative nature of ChatGPT shifts the usage. I’d argue this will grow as the technology gets better, people learn, and the tools make it easier.

1 Like

I don’t think that’s going away. People look at their phones while walking for stimulation (I think).

Maybe if we all used Google lenses

Definitely more interesting and engaging. BUT, you are taking a decent risk of GPT lying to you.

Results: In total, 11 systematic reviews across 4 fields yielded 33 prompts to LLMs (3 LLMs×11 reviews), with 471 references analyzed. Precision rates for GPT-3.5, GPT-4, and Bard were 9.4% (13/139), 13.4% (16/119), and 0% (0/104) respectively (P<.001). Recall rates were 11.9% (13/109) for GPT-3.5 and 13.7% (15/109) for GPT-4, with Bard failing to retrieve any relevant papers (P<.001). Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001). Further analysis of nonhallucinated papers retrieved by GPT models revealed significant differences in identifying various criteria, such as randomized studies, participant criteria, and intervention criteria. The study also noted the geographical and open-access biases in the papers retrieved by the LLMs.

Conclusions: Given their current performance, it is not recommended for LLMs to be deployed as the primary or exclusive tool for conducting systematic reviews. Any references generated by such models warrant thorough validation by researchers. The high occurrence of hallucinations in LLMs highlights the necessity for refining their training and functionality before confidently using them for rigorous academic purposes.
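
For anyone who wants to sanity-check rates like the ones quoted above, here is a minimal sketch in Python. The (hits, total) pairs are copied from the quoted abstract; the names `rate`, `QUOTED_COUNTS`, and `summarize` are my own and not from the study. Recomputing from the fractions can differ from the quoted percentages in the last digit, so treat this as a rough check, not a reproduction of the paper's method.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage (0.0 if the denominator is 0)."""
    if denominator == 0:
        return 0.0
    return 100.0 * numerator / denominator


# (hits, total) pairs as quoted in the abstract above.
# Bard's recall denominator is not given in the quote, so it is left out.
QUOTED_COUNTS = {
    "GPT-3.5": {"precision": (13, 139), "recall": (13, 109), "hallucination": (55, 139)},
    "GPT-4": {"precision": (16, 119), "recall": (15, 109), "hallucination": (34, 119)},
    "Bard": {"precision": (0, 104), "hallucination": (95, 104)},
}


def summarize(counts: dict) -> dict:
    """Convert every (hits, total) pair into a percentage."""
    return {
        model: {metric: rate(*pair) for metric, pair in metrics.items()}
        for model, metrics in counts.items()
    }


summary = summarize(QUOTED_COUNTS)
for model, metrics in summary.items():
    line = ", ".join(f"{name}: {value:.1f}%" for name, value in metrics.items())
    print(f"{model}: {line}")
```

Precision here is the share of generated references that were relevant, recall is the share of the relevant references that the model found, and the hallucination rate is the share of generated references that do not exist.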

Going back to measuring AI and evaluating “truth”… I think that is what makes it difficult to judge these AI products at scale, unless there are patterns of “truth” we can all agree on… which is highly unlikely.

In product development, from the company’s perspective, the truth is measured in dollars.

Social sites started to get into problems with truth as it relates to “fact-checking.” Ultimately, it’s the social algorithm that starts to dictate what gets put in front of you, driven by the need to collect ad dollars. Jack Dorsey recently suggested that there should be a marketplace of algorithms you could adopt. I can see the same thing with AI.

For the vast majority of companies that start integrating AI into their products, I see a need to evaluate performance against at least a simple set of metrics.

2 Likes

Dorsey was onto something with ‘a marketplace of algorithms you could adopt’, but in the case of LLMs there needs to be a marketplace of axiomatic cores that can be adopted and would run your Gen AI. Similar to a sim for mobile phones.

If that were in place, it would start to build greater and more stable competition as far as ‘trustworthy’ AI results and information. It would also move that Evidence axis to the left into the so-called ‘land of wishful thinking’, because we would very quickly discover the difference between what is supposed wishful thinking, and what is actual wishful thinking.

1 Like

unless there are patterns of “truth” we can all agree on… which is highly unlikely.

People and businesses all understand what truth is. It’s the reason people and businesses can lie. Lying is just an acknowledgement that the truth exists, but freely choosing to reject truth in favor of a preferred and different answer. Isn’t it interesting that AI models this behavior precisely?

1 Like

“Truth” fractures at scale.

  • Social is optimized for engagement, not truth
  • AI is optimized for usefulness, speed, or “future” revenue

Are they trying to lie? LLMs do not lie in the human sense…

  • AI does not have beliefs, goals, or an internal model of truth (or…does it?)
  • It does not know when something is false (to your point)
  • It optimizes for likelihood, coherence, and usefulness based on training data

Dollars become the proxy for correctness inside companies under the law. But I’m not sure businesses themselves drive “truth”, it’s the people in them that do. Perhaps businesses adhere to truth?

Truth collapses into patterns that are good enough for a goal. Engagement, retention, conversion, reduced support cost, etc.

For AI in products, not sure how truth aligns, but we should be able to measure:

  • Does the output help users complete tasks?
  • Does it reduce errors or rework?
  • Does it earn trust over repeated interactions?
  • Does it behave consistently under the same conditions?
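
A rough sketch of how those four questions could be turned into numbers from an interaction log. Everything here is an assumption for illustration: the `Interaction` fields, the 1-5 trust rating, and the way each metric is defined (for example, trust is approximated as the change in average rating between the first and second half of the log).

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interaction:
    """One logged exchange; every field is a hypothetical schema."""
    prompt: str
    output: str
    task_completed: bool
    needed_rework: bool
    trust_rating: int  # assumed 1-5 rating given by the user


def product_metrics(log: list[Interaction]) -> dict[str, float]:
    """Compute four simple product-side metrics from a chronological log."""
    if not log:
        raise ValueError("log must contain at least one interaction")

    total = len(log)
    completion_rate = sum(i.task_completed for i in log) / total
    rework_rate = sum(i.needed_rework for i in log) / total

    # Trust over repeated interactions: later average minus earlier average.
    half = total // 2
    if half == 0:
        trust_trend = 0.0
    else:
        earlier = mean(i.trust_rating for i in log[:half])
        later = mean(i.trust_rating for i in log[half:])
        trust_trend = later - earlier

    # Consistency: among prompts seen more than once, the share that
    # always produced the same output.
    outputs_by_prompt = defaultdict(list)
    for i in log:
        outputs_by_prompt[i.prompt].append(i.output)
    repeated = [outs for outs in outputs_by_prompt.values() if len(outs) > 1]
    if repeated:
        consistency_rate = sum(len(set(outs)) == 1 for outs in repeated) / len(repeated)
    else:
        consistency_rate = 1.0  # no repeated prompts, so nothing contradicts consistency

    return {
        "task_completion_rate": completion_rate,
        "rework_rate": rework_rate,
        "trust_trend": trust_trend,
        "consistency_rate": consistency_rate,
    }


log = [
    Interaction("reset password", "A", True, False, 2),
    Interaction("reset password", "A", True, False, 3),
    Interaction("export report", "B", False, True, 3),
    Interaction("export report", "C", True, False, 4),
]
metrics = product_metrics(log)
print(metrics)
```

None of these numbers say anything about truth directly; they only measure whether the output was useful, stable, and trusted in practice.
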
2 Likes

Doing a good job highlighting how AI usage isn’t black and white (like most things).

One thing I wanted to note is that there’s a setting that you can change to make it more random (called temperature) that is a bit fun to play with.

Sometimes, consistency could be a bad thing. Like when you want the responses to be different every time.
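
If it helps to see what temperature does, here is a toy sketch in plain Python. It is the standard softmax-with-temperature idea, not any particular vendor's implementation, and the tokens and scores below are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens: list[str], logits: list[float],
                 temperature: float, rng: random.Random) -> str:
    """Pick one token at random, weighted by the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical next-token candidates and scores.
tokens = ["Alcatraz", "prison", "island"]
logits = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.1)  # nearly always the top token
warm = softmax_with_temperature(logits, 2.0)  # choices are spread out

rng = random.Random(0)
print("low temperature: ", [sample_token(tokens, logits, 0.1, rng) for _ in range(5)])
print("high temperature:", [sample_token(tokens, logits, 2.0, rng) for _ in range(5)])
```

At a low temperature the top-scoring token dominates, so repeated runs look consistent. At a higher temperature the lower-scoring tokens get picked more often, which is where the variety comes from.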

I think that most AI frameworks speak to the idea of truth, which is telling. Is it that truth is not understood, or are they trying to create awareness that truth isn’t happening?

We posted an audio guide for AI:

I’m always interested in the metrics: how is truth defined in metrics?

1 Like

“Truth” fractures at scale.

Is this statement 👆 true? Or does it fracture at scale?

    1. Social is optimized for engagement, not truth
    2. AI is optimized for usefulness, speed, or “future” revenue

For these reasons, AI products should be transparently aligned with these goals, instead of being a catchall that sycophantically attempts to convince its user that what it is saying is trustworthy.

AI does not have beliefs, goals, or an internal model of truth (or…does it?)

We have written a white paper to propose an axiomatic core for all AI to solve exactly this issue. This would allow an AI agent or GenAI product to be able to be aligned with a goal in an ontological + context capacity and thus remove any ambiguity or subjectivity in any output. Essentially the Core gives the AI specific beliefs, goals, and an internal model of truth.

Truth collapses into patterns that are good enough for a goal. Engagement, retention, conversion, reduced support cost, etc.

This scenario only works if the initial patterns are based on objective truth; otherwise the patterns break down into subjective data, which can be manipulated, and that is the behavior we are experiencing. Coincidentally (or not?), when humans do this we judge them as lacking discernment, that is, not understanding the difference between what is true and what is almost true. We see the AI modeling this behavior even though it has no agency; it does so because of optimization, to your point, Bryan.

Also to your point, these bullet items below are imperative to solve, but the current answers are:

  • Does the output help users complete tasks? Yes, but without trust
  • Does it reduce errors or rework? Unknown, as we cannot trust the answers
  • Does it earn trust over repeated interactions? No, because the answers are epistemological and not ontologically objective, and results can vary
  • Does it behave consistently under the same conditions? Depends, results can be randomized or vary widely

There’s a lot of work being done here to solve this issue, and by a variety of research groups.

Here are some additional discoveries in this space that are eyebrow raising:

Work being done over at AI Central and Anthropic:

Testing Science with AI
Empirical proof that AI models have been damaged by the modern science narrative

They propose AIQ as a calibration metric for AI scientific discernment, or more specifically, for evaluating artificial intelligence systems’ ability to distinguish valid scientific arguments from credentialed nonsense.

Work being done at The Center for AI Safety (CAIS):

CAIS published “Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs” (website, code, paper). In this paper, they showed that modern LLMs have coherent and transitive implicit utility functions and world models, and provided methods and code to extract them. (updated October 2025)

1 Like

@schuboxaz, I can tell that you’re doing tons of research yourself. Awesome stuff, trying to absorb it all, haha

Well, in a practical sense, we have 195 countries and four main religions… there are different definitions of truth within these larger structures. No?

Won’t exchange rates be continually influenced by these perspectives?

Speaking from a practical sense, we are changing the definition of truth (that which corresponds to reality) to subjective utilitarianism. Fair enough, but if reality ceases to have any objective grounding, it reduces truth to mere opinion and power. When perspective is all that matters, the strong simply impose their will and call it “truth,” while actual facts and natural law are discarded. This leads to a society where lies become mainstream and sanity is treated as heresy.

And it also produces a world of constant and dramatic tension between literal reality and invented reality by those in power. After all, in those 195 countries and four main religions, people will still sort their medicine bottles from their bottles of poison. Or if a government had enough power to pass a law declaring anyone over thirty to be recorded as officially ‘dead’, would the >30 contingent suddenly die? Or does reality tell us otherwise?

2 Likes

You expose the problem that systems are running into. Truth becomes something declared rather than discovered…power will always fill the gap. Definitions change, language bends, and dissent gets treated as madness.

But reality will keep asserting itself anyway, whether through biology, physics, or simply survival.

Both Christianity and Islam reject relativism and claim objective truth (Islam a bit differently)… so, for example, how can they agree on how truth is rendered in an LLM?

1 Like

Agreement on how truth is rendered in an LLM:

  • Begin by applying metaphysics/abstract objects as true ontology (not the taxonomical version they currently call ontology)
  • Use the ontology as an axiomatic core for all LLMs
  • Give the cores multimodal access to physical reality via agent swarms
  • Label each core accordingly and let reality do its work by applying the technology to ‘biology, physics, or simply survival’ - a 21st-century, Synthesophical type of cage match.
  • Add an auto-destruct mode for disobedience to its core.
  • Let the games begin - allow wagering.

1 Like

This reminds me of an idea I was in discussions about 15 years ago, the idea being a digital registration of a physical object (a baseball signed by Babe Ruth in 1934…).

Now, AI would view any eBay listing as legitimate due to a lack of ontological awareness/context, but an EFT registration of the object vetted by an organization might solve for aspects of this.

1 Like