It's All Just Context

It’s interesting how the game has been changing over the last two years. Insane amounts of improvement across AI tools, daily new findings, architectural shifts in how people think about AI, and it feels like the increments are landing faster than ever.

Ran across this video that @bryan passed along about how Caitlin Sullivan does customer report analysis with AI.

Here are my key takeaways from the video:

  1. Super users like control over how the data is synthesized. They like freedom and agency.
  2. Build context first.
  3. Break processes into multiple steps: Context → Bias evaluation → Qualitative data pulling → Data bucketing and labeling → etc.
  4. Agents mix up stories. Hallucinations are clearly still a problem with the latest models.
  5. Claude is better for synthesis and accuracy.
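The multi-step flow in takeaway 3 can be sketched roughly like this. The function and step names are my own illustration, and `ask` stands in for any callable that sends one prompt to a chat model and returns its reply (not a real Helio or Claude API):

```python
from typing import Callable

def analyze_reports(raw_reports: list[str],
                    product_context: str,
                    ask: Callable[[str], str]) -> str:
    # Step 1: build context before touching the data
    context = ask(f"Summarize what matters for this product:\n{product_context}")

    # Step 2: evaluate the data for sampling or wording bias
    bias_notes = ask(f"Context:\n{context}\nList likely biases in:\n{raw_reports}")

    # Step 3: pull qualitative data one report at a time, so stories
    # from different reports can't get mixed up in a single prompt
    quotes = [ask(f"Extract verbatim customer quotes:\n{r}") for r in raw_reports]

    # Step 4: bucket and label what was extracted
    return ask(f"Bucket and label these quotes (known biases: {bias_notes}):\n"
               f"{quotes}")
```

Keeping each step as a separate call is also what makes takeaway 4 testable: if agents mix up stories, you can see which step introduced the mix-up.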

We’re also building out our own tools for synthesizing reports in Helio. We’ve been doing a large amount of testing around what produces the best design signals, which is also really interesting, working with several flows.

Here are the results/evals of the report syntheses: a more structured approach seems to perform decently, but the cost increases significantly with the amount of processing needed to produce insights.

What I’m finding interesting is that flexibility seems to create high levels of satisfaction, especially for super users who already have their flows down using ChatGPT/Claude themselves. @MoData is one of these people, and he seems to find better results uploading data, providing specific context, and doing a back-and-forth session with AI to pull out his own crafted insights.

What I figured I’d streamline is moving the context, data, and chat window into the survey itself, providing a faster connection to the report data, rather than forcing the user into a process that, in reality, doesn’t work that well (yet).

Excited to find out how this performs vs. the more structured approach, @MoData.

Curious what other people are finding as well. What flows work for you and what flows don’t? What sort of control do you like, and what control would you rather give to AI to offload instead?


I was surprised to see so much prompting happening for what would be essentially the same research processes and patterns.

In my opinion, creating an agent (or maybe a system/swarm of agents) with the patterns and processes built into how the testing or research is produced would save significant token costs, as well as human-in-the-loop (HITL) synthesis costs, because the agents can be fine-tuned to handle exact contexts and questions. If my main job were research and testing, this would be a gold mine of efficiency for getting the best results.


Yes, interestingly so. I think that’s because ultimately super users want to maintain control. But I also think they’re consistently running into hallucinations, even with Claude.

I’m hypothesizing that the actual gap isn’t the data and hallucinations themselves; it’s how the AI is able to sift through that data. Too much context == high amounts of hallucinations. The less context, and the higher its value, the better the results.
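A toy illustration of the “less context, higher value” idea: instead of dumping every response into the prompt, score each chunk against the question and keep only the top few. A real setup would likely use embeddings; plain word overlap just keeps the sketch self-contained:

```python
def top_context(chunks: list[str], question: str, k: int = 3) -> list[str]:
    """Keep only the k chunks most relevant to the question."""
    q_words = set(question.lower().split())

    def overlap(chunk: str) -> int:
        # Crude relevance score: shared words with the question
        return len(q_words & set(chunk.lower().split()))

    # Highest-overlap chunks first; everything else stays out of the prompt
    return sorted(chunks, key=overlap, reverse=True)[:k]
```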

I think this is where a swarm of agents makes sense. They can delegate tasks and pull apart the data without overloading a single agent, but I do think a swarm of them would actually increase token spend by a significant amount.

A swarm would only increase the token amount if each agent was NOT handling a single pattern or question. Right now the prompt handles too much. A swarm makes sense when the researcher can use the agents like contextual Lego pieces, plugging in context only when needed. Theoretically this would lower costs, not increase them.

It would also make synthesis much more accurate and valuable. You could build single-agent response comparisons from the data.
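A rough sketch of the contextual-Lego-pieces idea: each agent owns exactly one research question, so its prompt stays small, and the per-agent answers can be compared side by side afterwards. The structure and the `ask` callable here are my own assumptions for illustration:

```python
from typing import Callable

def run_swarm(questions: dict[str, str],
              data: str,
              ask: Callable[[str], str]) -> dict[str, str]:
    # One focused prompt per agent: its single question plus the data,
    # instead of one giant prompt handling every pattern at once
    return {name: ask(f"{q}\n\nData:\n{data}") for name, q in questions.items()}

def compare_runs(a: dict[str, str], b: dict[str, str]) -> dict[str, bool]:
    # Because answers are keyed by agent, single-agent response
    # comparisons across runs fall out for free
    return {name: a.get(name) == b.get(name) for name in a}
```

So consistency checks (does the “pricing” agent answer the same way across two runs?) become a dictionary diff rather than a manual re-read.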


Going to be playing with different models to see how they perform today. Will let everyone know what the results look like!

A few hunches:

  • Claude will outperform the rest of the models
  • Gemini should produce some interesting results (my guess is that it will produce more verbose results)
  • OpenAI’s various models might change the name of the game depending on which part of the flows they’re deployed in