
At Freeday, our digital employees handle a high volume of customer conversations every single day, across chat, email and voice. They help customers schedule mechanic appointments, answer questions, recommend holiday destinations, create support tickets, and more.

They’re fast, tireless, and consistent. But like any member of your support team, they carry your brand in every interaction. Which raises the essential question:

How do we know they’re doing a good job?

The Survey Blind Spot

Customer Satisfaction (CSAT) surveys are one of the most widely used ways to measure support quality. After a conversation, customers are asked to rate their experience, typically on a scale from 1 (very dissatisfied) to 5 (very satisfied), with an optional comment or thumbs up/down.

The problem isn’t that CSAT is bad. In fact, with 10–15% response rates, it’s actually better than many other survey-based metrics. The problem is that the majority of customers, 85–90%, don’t respond at all.

What about all those silent conversations? We have no direct feedback, no score, and often no idea whether the customer left happy, neutral, or frustrated.

And even when customers do respond, there’s another challenge:

  • Scores often sit at the extremes: the very happy or the very unhappy.
  • It’s hard to know why someone gave a particular score, especially if they left no comment.

The result? We end up with an incomplete, skewed view of the real customer experience. One that risks overrepresenting the loudest voices and ignoring the “quiet middle,” where most interactions actually happen.

For a high-volume support operation, whether digital or human, that’s a serious risk:

  • Subtle issues affecting the average customer go unnoticed.
  • Consistently “just okay” experiences never get flagged for improvement.
  • Teams may end up optimizing for edge cases instead of the overall experience.

In other words, you’re making decisions with only part of the story.

We needed a way to fill in those blanks, to get reliable CSAT-like insights for 100% of conversations, not just the 10–15% where someone clicked a button.

So we built one.

From Gut Feeling to Data

Instead of relying only on survey responses, we use AI to simulate how a customer might rate their own experience, even if they never filled out a survey.

The AI reviews the entire conversation, from start to finish, and predicts a score from 1 to 5. Alongside the score, it also provides a short explanation, just like a customer might if you asked them “Why did you give that rating?”

To make these predictions more accurate, the AI considers five core factors:

  • Resolution: Was the issue actually solved?
  • Sentiment trajectory: Did the user’s mood improve or decline over time?
  • Tone & clarity: Was the interaction polite, clear, and easy to follow?
  • Efficiency: Did the conversation flow smoothly, or were there delays?
  • Accuracy: Was the information correct and helpful?

Whenever possible, we also feed the AI extra context. That includes things like thumbs up/down, written comments, and timestamps. These signals help it better understand tone, pacing, and the final outcome.
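To make that concrete, here is a minimal sketch of how such a prediction could be wired up. The prompt wording, the predict_csat helper, and the choice of the OpenAI client and model are illustrative assumptions, not Freeday's actual implementation.

```python
import json
from openai import OpenAI  # assumption: any LLM client with JSON output would work

client = OpenAI()  # illustrative only; requires OPENAI_API_KEY

RATING_PROMPT = """You are simulating a customer rating a support conversation.
Score the conversation from 1 (very dissatisfied) to 5 (very satisfied), weighing:
- Resolution: was the issue actually solved?
- Sentiment trajectory: did the user's mood improve or decline over time?
- Tone & clarity: was the interaction polite, clear, and easy to follow?
- Efficiency: did the conversation flow smoothly, or were there delays?
- Accuracy: was the information correct and helpful?
Extra signals (thumbs up/down, comments, timestamps) are included when available.
Respond as JSON: {"score": <1-5>, "explanation": "<one or two sentences>"}."""


def predict_csat(transcript: str, extra_signals: dict | None = None) -> dict:
    """Ask the model to rate a full conversation and explain the rating."""
    user_content = f"Conversation:\n{transcript}"
    if extra_signals:
        user_content += f"\n\nExtra signals:\n{json.dumps(extra_signals)}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RATING_PROMPT},
            {"role": "user", "content": user_content},
        ],
    )
    return json.loads(response.choices[0].message.content)
```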

How Close Is AI to the Real Thing?

This is the million-dollar question. To answer it, we first need to understand what CSAT scores actually tell us.

While CSAT is often presented as a five-point scale, in practice it functions more like a two-category system: happy or unhappy.

  • Positive Experience: Scores of 4 or 5
  • Negative Experience: Scores of 1, 2, or 3

What this means:

  • A one-point difference within the same category (like 1 vs. 2, 2 vs. 3, or 4 vs. 5) still points to the same outcome: the AI and the user agree on whether the experience was positive or negative.
  • A one-point difference across the boundary (like 3 vs. 4) is more significant, because it flips the outcome from unhappy to happy (or vice versa).

And exact scores are subjective, just as they are between humans: one person’s 4 might be another person’s 5. That’s why we see a 4 predicted on a 5 interaction as a success; the AI still identified a positive experience.
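Concretely, the outcome-agreement rate can be computed like this. This is a minimal sketch, assuming we have pairs of (predicted, actual) scores for the conversations that did receive a survey response; the function names are illustrative, not Freeday internals.

```python
def outcome(score: int) -> str:
    """Map a 1-5 CSAT score to the binary outcome used for evaluation."""
    return "positive" if score >= 4 else "negative"


def outcome_agreement(pairs: list[tuple[int, int]]) -> float:
    """Share of conversations where the AI and the customer land in the same bucket.

    `pairs` holds (predicted_score, actual_score) for conversations that do have
    a survey response, since that is the only place a ground-truth score exists.
    """
    matches = sum(outcome(pred) == outcome(actual) for pred, actual in pairs)
    return matches / len(pairs)


# Example: a 4 predicted on a 5 counts as a match; a 3 predicted on a 4 does not.
print(outcome_agreement([(4, 5), (3, 4), (5, 5), (2, 1)]))  # -> 0.75
```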

In short, when we compare the AI’s predictions against the scores customers actually gave, it gets the outcome right 74% of the time. That gives us a reliable, consistent view of every customer interaction, not just the ones where someone clicks a survey.

The obvious next question is:

Can AI ever match every customer’s score exactly?

Probably not, and that’s okay. Here’s why:

1. The Human Element

Two people can have nearly identical conversations and give totally different scores. One is just having a good day and gives a 5. The other has had a bad experience with a chatbot in the past (probably not a Freeday one 🙂) and gives a 2. These emotional layers are impossible to model fully, even with the best AI.

2. Less context = more guesswork

Some support chats are just a few messages long. That doesn’t leave much room for the model to infer tone or outcome. The shorter the chat, the more guesswork is required.

3. The Sarcasm Barrier

Even polite sarcasm or subtle humor can be hard for AI to interpret perfectly. For example, “Thanks… that was helpful” might be genuine or slightly frustrated; only context tells you which.

The key point: the AI doesn’t need to be perfect to provide value. Its predictions give a reliable, consistent view of whether a customer left satisfied or dissatisfied, which is what drives meaningful improvements.

What This Unlocks for Freeday and Our Clients

With AI-generated CSAT, we now have visibility into 100% of our digital employees’ conversations. That opens up entirely new possibilities:

✅ We can spot underperforming flows or patterns early, even when no user complains.

✅ We can slice and dice the data, by channel, team, conversation type, or any other dimension, to uncover hidden insights (see the sketch after this list).

✅ We get consistent benchmarks across channels, teams, and use cases.

✅ We can improve performance continuously based on this metric, not just when feedback happens to come in.
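Picking up the “slice and dice” point above, here is a minimal sketch of what that can look like once every conversation carries a predicted score. The DataFrame and column names are made up for illustration.

```python
import pandas as pd

# Illustrative data: in practice, every conversation row carries an AI-predicted score.
conversations = pd.DataFrame({
    "channel": ["chat", "chat", "email", "voice", "voice"],
    "predicted_score": [5, 3, 4, 2, 5],
})

# Share of positive outcomes (score >= 4) per channel, plus conversation volume.
by_channel = (
    conversations
    .assign(positive=conversations["predicted_score"] >= 4)
    .groupby("channel")
    .agg(positive_rate=("positive", "mean"), volume=("predicted_score", "size"))
)
print(by_channel)
```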

For our clients, this means we can deliver a higher level of quality control at scale.

And because this system doesn’t depend on users clicking a survey, it works quietly in the background, always on, always improving.

What’s Next

We’re continuing to refine the model, exploring things like score calibration across languages, smarter weighting of different conversation types, and improvements in edge cases like sarcasm detection.

But even in its current form, this system gives us something we didn’t have before:

A scalable, consistent, intelligent way to measure the quality of every customer interaction, not just the loud ones.

This isn’t a nice-to-have. It’s a step toward a new standard in customer support.

At Freeday, we’re not just adapting to that future, we’re helping shape it. And we’ll keep sharing what we learn along the way.