Taking down the NPS Score: KO by Probability

Summary: Intro | Measuring the importance of gossip | Too good to be true | Arbitrary measures produce arbitrary results | tl;dr

Disclaimer: The probability computation in this article is a recap of a talk delivered by Professor Mark Whitehorn at the University of Dundee in 2015, and at the PASS Business Analytics Conference in San Jose, CA in 2014. Opinions expressed are my own.

Aren’t we post-NPS hype yet? Such was my thinking until a random article came up on my feed: as one of its core objectives, a tech giant was planning to improve its Net Promoter Score by 2020. A quick internet search told me there are plenty of companies very excited about increasing their NPS, and Google Trends suggests interest in the Net Promoter methodology has been growing steadily since 2004; a mortal blow to my presumption. There is something problematic about the Net Promoter methodology that I’d like to talk about: on one hand it is an indicator of outstanding business delivery, on the other a possibly dangerous framework for workforce assessment. This article decomposes the NPS algorithm, reviews its criticism, and tests its validity from the probability perspective. I have based the scenario and the probability computation on an excellent talk delivered by Professor Mark Whitehorn. If you happen to be a manager or a person whose performance is scored with NPS, if you are into probability computations, or if you simply like debunking managerial fads, then this is a tale for you.

Measuring the importance of gossip

It is said that NPS can predict a company’s growth. NPS, the Net Promoter Score (or System), is a customer loyalty measurement method that can be applied to any business-customer interaction. After a chat with a service consultant or a 5-day training course, customers are asked to estimate how likely they are to recommend that service to a friend. The drive behind NPS is to capture the impact of word of mouth: the algorithm fishes out positive and negative influencers. Nudging customers to promote a product is every marketer’s dream, as recommendations from friends are among the most influential factors in our purchasing decisions. Capturing the marketing impact of what people say about us certainly seems worthwhile, so how is it done?

Market

In a saturated market, a friend’s recommendation can be the only product differentiator | Photo by Dinis Bazgutdinov

The NPS formula is simple. After a session or a customer interaction, the customer scores how likely they are to recommend the service. The scale runs from 0 to 10, the higher number standing for Extremely Likely. If a service is scored high, it is assumed that the respondents will walk the walk and, in the long run, bring the service provider more business. Scores of 0 to 6 suggest that a person is likely to spread negative opinions about the service; in NPS jargon, such a villain is called a Detractor. People scoring 7 or 8 are not considered influential (Neutrals). Those casting scores of 9 or 10 are believed to be likely to promote the service (Promoters). After the feedback is collected, the scores are assigned weights: every Promoter translates to 100 points, a Neutral to 0, and a Detractor to -100. To calculate the NPS, these numbers are added up and divided by the number of respondents per session. So a session with two Promoters (2 x 100) and one Neutral (0) will result in an NPS of (100 + 100 + 0) / 3 ≈ 67. While these assumptions sound reasonable on the surface, they are often questioned by the system’s opponents.
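The arithmetic above can be sketched in a few lines of Python (the article’s own scripts further down are in R); the `nps` helper is purely my own naming for illustration.

```python
def nps(promoters, neutrals, detractors):
    """Weighted average: +100 per Promoter, 0 per Neutral, -100 per Detractor."""
    respondents = promoters + neutrals + detractors
    return (promoters * 100 - detractors * 100) / respondents

# The session above: two Promoters and one Neutral
print(round(nps(2, 1, 0)))  # 67
```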

Too good to be true

Avalanched by the NPS’s good press, the system’s critics inhabit the last pages of Google Search. The claims about the Net Promoter System’s predictive power over a company’s growth are impressive and have attracted a huge following in all kinds of businesses. The opponents suggest that NPS is all fraud: they criticise the method as oblivious to other data and its findings as impossible to replicate.

Multiple studies have criticised the NPS system for oversimplifying the multidimensional nature of customer behaviour. NPS advocates hold that a single metric suffices for a relevant loyalty estimation; any behavioural or cross-channel data is ignored. Survey outcomes are taken at face value, without adjusting for any answer bias. One cannot help thinking that this overwhelming trust is doomed to backfire. Take the example of Netflix, a binge-watching mecca of TV shows, repeatedly told by its audience that documentaries and foreign movies are at the top of their to-watch lists. This is rarely true; Netflix pays little attention to its user ratings and bases its business model on the actual viewing data.

The NPS algorithm also ignores the deviation within its base dataset. A company repeatedly leaving its audience unimpressed might score the same as a company polarising its user base (think Apple). Wait, what? Indeed: a group of 1 Promoter and 4 Neutrals and a group of 3 Promoters and 2 Detractors both produce the same score of 20. These are not actionable results: the data can easily mislead unless the distribution of scores is taken into account.
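A quick sketch in Python, reusing the same hypothetical `nps` helper, shows the two groups are indistinguishable to the formula:

```python
def nps(promoters, neutrals, detractors):
    # +100 per Promoter, -100 per Detractor, averaged over the whole group
    return (promoters * 100 - detractors * 100) / (promoters + neutrals + detractors)

print(nps(1, 4, 0))  # 20.0 - one Promoter, four Neutrals
print(nps(3, 0, 2))  # 20.0 - three Promoters, two Detractors
```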

What’s perhaps most interesting is that there is no proof the algorithm works. Attempts to replicate Reichheld’s findings have all failed. A study led by Timothy Keiningham, using data from 21 firms and more than 15,500 interviews, found no significant correlation between the average promoter scores and the relative change in revenue per studied industry: “Using industries Reichheld cites as exemplars of Net Promoter, the research fails to replicate his assertions regarding the ‘clear superiority’ of Net Promoter compared with other measures in those industries.” Worth noting, this report was published in 2007, that is, 10 years ago.

Even accounting for the criticism, and granting that calling NPS the sole predictor of a company’s growth is an overstatement, the word-of-mouth effect it aims to measure is still worth studying. I’d agree myself; however, as soon as NPS stops being merely a general indicator and becomes a regulated KPI, it can have dangerous consequences for the company. Individual workers are especially at risk, as they are assessed against a measure they have only a limited influence on.

Fear

A natural reaction to being measured by NPS | Image source: Pixabay

Arbitrary measures produce arbitrary results

Companies set their NPS targets as per each manager’s ambition. Wikipedia advises that a score of over 50 is deemed excellent, while anything higher than 0 is already good. Behind every target score there is an expectation that employees meet it in their customer interactions. Succeeding or failing to do so is a likely assessment point for a coach, a trainer, a call-center employee, or a salesperson.

It’s easy to imagine a scenario in which the target NPS is set to 60, just a bit over the excellent 50; many companies would be inclined to push for the extra 10 points to demonstrate their leadership. An employee then gets their bonus only if the NPS of 60 is reached. This is the core assumption of the following scenario: how likely is it – for a very good employee – to reach an NPS of 60?

If a company has set an NPS target of 60, they must have looked at which survey combinations would give their employees that average score. It could be that over a year, 70% of respondents are Promoters, 20% are Neutrals, and 10% are Detractors (70 + 0 – 10 = 60). Another way to reach an NPS of 60 is with 80% Promoters, no Neutrals, and 20% Detractors (80 + 0 – 20 = 60). The former distribution sounds more plausible, so we’ll use it as the base for the upcoming calculations.
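As a sanity check, both yearly mixes do average out to 60: Neutrals contribute nothing, so the score reduces to the Promoter share minus the Detractor share. A trivial Python sketch:

```python
# (Promoter %, Neutral %, Detractor %) mixes from the paragraph above
for promoters, neutrals, detractors in [(70, 20, 10), (80, 0, 20)]:
    print(promoters - detractors)  # prints 60 both times
```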

This is where the first setback happens: even the very optimistic scenario of getting Promoter scores 70% of the time means that, when running one-person sessions, our imaginary coach is penalised 30% of the time. A Neutral or a Detractor in a single-person session makes the per-session NPS of 60 unreachable. Funnily enough, the target average score per annum is still met.

Coaching a two-person group is even less in the coach’s favour. If we agree there is a 70% chance that a person is a Promoter, the likelihood of two Promoters coinciding in a group is 49%. This is basic probability for two independent events. The probabilities are multiplied as in a coin toss: while getting a head or a tail is equally likely on the first throw (50% each), striking double heads has only a 25% probability (50% * 50%), as a 2-throw exercise has 4 possible combinations overall. Similarly here, the possible recommendation scores for a 2-person session are the following:

Group | Calculation | Joint Probability
P & P | 0.7 * 0.7 | 49%
P & N | 0.7 * 0.2 | 14%
P & D | 0.7 * 0.1 | 7%
N & P | 0.2 * 0.7 | 14%
N & N | 0.2 * 0.2 | 4%
N & D | 0.2 * 0.1 | 2%
D & P | 0.1 * 0.7 | 7%
D & N | 0.1 * 0.2 | 2%
D & D | 0.1 * 0.1 | 1%

The sum of these probabilities is 100%.

This is how they translate to NPS scores:

Group | NPS Score | Joint Probability
P & P | 100 | 49%
P & N | 50 | 14%
P & D | 0 | 7%
N & P | 50 | 14%
N & N | 0 | 4%
N & D | -50 | 2%
D & P | 0 | 7%
D & N | -50 | 2%
D & D | -100 | 1%

Essentially then, there is only one combination that achieves an NPS of 60, and it is the best-case scenario of two Promoters in the group. A scenario indeed so good that it happens only 49% of the time. The remaining 51% of the time, the person will be told they have not earned their bonus that day, even though their yearly average is 60.

Probability distribution in a 2-person group
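The 49% figure can be reproduced by brute force; a short Python sketch, assuming the 70/20/10 split from earlier:

```python
from itertools import product

prob = {'P': 0.7, 'N': 0.2, 'D': 0.1}    # chance of each respondent type
points = {'P': 100, 'N': 0, 'D': -100}   # NPS weight of each type

passing = 0.0
for a, b in product('PND', repeat=2):    # all 9 two-person combinations
    session_nps = (points[a] + points[b]) / 2
    if session_nps >= 60:                # only P & P (a score of 100) qualifies
        passing += prob[a] * prob[b]
print(round(passing, 2))  # 0.49
```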

Perhaps the goal of 60 is just unfortunately high. What if we lower the target score? As the initial scale is linear (from 0 to 10), we automatically assume that moving the target by a few points changes the required number of Promoter scores proportionally. Yet because the NPS formula collapses the scores onto a 3-point scale and averages the result, that linearity is lost. To illustrate just how counter-intuitive this design is, let’s use the possible scores of a 2-person session to see how likely they are to reach any NPS target between -100 and 100:

Passing scores

As you’d see from the graph, if the target were -100, everyone would pass 100% of the time, regardless of the score. Any target just above -100 is missed only in the very unlikely two-Detractor scenario (1% of the time). A target just above -50 is missed 5% of the time. A target just above 0 is achieved in 77% of the scenarios. Finally, a score above 50 is reached in only 49% of the cases.

What’s the probability of scoring 51, then? The function is not linear; instead, the NPS system imposes a step change.

NPS imposes a step change

There is no difference between setting an NPS target of 51 and one of 60, or between 1 and 49: they are exactly as likely to be met. For targets above 0 and below 50, an employee’s bonus is withheld 23% of the time. That gap goes up to 51% in the bin between 50 and 100. Repeating it for the third time: this is how the function behaves even though, on average, the coach reaches an NPS of 60.
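The step behaviour is easy to verify by sweeping the target across the 2-person score distribution; a Python sketch under the same 70/20/10 assumption:

```python
from itertools import product

prob = {'P': 0.7, 'N': 0.2, 'D': 0.1}
points = {'P': 100, 'N': 0, 'D': -100}

# probability distribution of 2-person session scores
dist = {}
for a, b in product('PND', repeat=2):
    score = (points[a] + points[b]) / 2
    dist[score] = dist.get(score, 0) + prob[a] * prob[b]

# chance of meeting each target: flat within a bin, then a sudden drop
for target in (1, 49, 51, 60):
    met = sum(p for score, p in dist.items() if score >= target)
    print(target, round(met, 2))  # 1 and 49 both give 0.77; 51 and 60 both give 0.49
```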

Admittedly, so far the probability calculation has only been done for groups of two. To check that this behaviour is not just an odd property of small numbers, we can model the results for other group sizes in R. The script below calculates the joint probability for all group sizes between 2 and 10, an incredibly mundane task if done by hand.*

R script:

# all possible NPS contributions of a single respondent
nps <- c(100, 0, -100)

# probability that a group of size n scores over the target of 60
# n is the group size
grid <- function(nps, n) {
  # grid with all possible score combinations for the group
  score <- expand.grid(replicate(n, nps, simplify = FALSE))
  # keep only the combinations whose average exceeds 60
  subs <- score[rowSums(score) / n > 60, , drop = FALSE]

  # replace each score with its respective probability
  # (Promoter 0.7, Neutral 0.2, Detractor 0.1)
  subs[subs == 100] <- 0.7
  subs[subs == 0] <- 0.2
  subs[subs == -100] <- 0.1

  # multiply the probabilities within each combination,
  # then add the combinations up
  sum(apply(subs, 1, prod))
}

# apply the function to incremental group sizes
# n is the maximum group size
allprobs <- function(nps, n) {
  res <- rep(NA_real_, n)
  for (i in 2:n) {
    res[i] <- grid(nps, i)
  }
  return(res)
}

# Run and save the result
results <- allprobs(nps, 10)

plot(results, type = 'o',
     xlab = "Group size", ylab = "Probability to score",
     ylim = c(0, 1))

The results are surprising.

NPS probability depends on a group size

The probability of scoring over the NPS target of 60 depends heavily on the group size. A group of 3 people gives the highest odds of scoring high, while 5-person groups are particularly high risk. This can potentially result in people juggling group sizes to increase the likelihood of a good result: people can get declined training, which in turn decreases the company’s sales. A method aiming to bring in sales can badly backfire.
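For the curious, the same exhaustive computation fits in a few lines of Python; as in the R script, the cut-off is an average strictly above 60:

```python
from itertools import product
from math import prod

prob = {'P': 0.7, 'N': 0.2, 'D': 0.1}
points = {'P': 100, 'N': 0, 'D': -100}

# probability that a group of size n averages strictly above 60
for n in range(2, 7):
    passing = sum(
        prod(prob[r] for r in combo)
        for combo in product('PND', repeat=n)
        if sum(points[r] for r in combo) / n > 60
    )
    print(n, round(passing, 3))  # the odds peak at 3 people and dip sharply at 5
```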

tl;dr

As this article discusses, NPS can have dangerous consequences when used as an assessment methodology rather than as an indicator of the word-of-mouth effect. Over the years, the system has been criticised for its narrow view of the customer and for the lack of proof of its influence on a company’s growth. While the formula approaches a very interesting problem, its design has proven problematic. The algorithm has some very counter-intuitive implications, such as the step change between score levels: from the probability perspective, the chance of scoring an NPS of 1 is the same as that of 30, and similarly for 51 and 99. The simple mistake of averaging scores per year and applying that average to everyday customer interactions can financially hurt even an excellent employee. Pushing the algorithm to become a target measure it was never designed to be is perhaps the biggest NPS fallacy.

If you think your company might be subject to the NPS craze, hit the share button and let them know.

Eve


Post Scriptum

*Mark Whitehorn proposes a more efficient method of calculating the probabilities: a Monte Carlo simulation. The result of 10,000 random runs matches my heavyweight calculation. Monte Carlo is a fantastic tool for this type of computational problem.

# possible respondent types: Promoter, Neutral, Detractor
vals <- c('p', 'n', 'd')

# simulate `runs` random sessions of a given group size and return
# the share of sessions scoring over 60
mcarlo <- function(people, runs) {
  NPS <- numeric(runs)
  for (i in 1:runs) {
    score <- sample(vals, people, prob = c(0.7, 0.2, 0.1), replace = TRUE)
    NPS[i] <- (sum(score == "p") - sum(score == "d")) / people * 100
  }
  sum(NPS > 60) / runs
}

# repeat the simulation for every group size up to `group`
nmcarlo <- function(group, runs) {
  res <- rep(NA_real_, group)
  for (i in 1:group) {
    res[i] <- mcarlo(i, runs)
  }
  return(res)
}

# Run and save the result
mcresults <- nmcarlo(10, 10000)

# Compare the results of both approaches
plot(results, type = 'o',
     xlab = "Group size", ylab = "Probability to score",
     ylim = c(0, 1))
lines(mcresults, col = "green")