Help!! We’re excited to test Voice Messages on Discord but can’t do a traditional A/B test… Now What?
Come with us, if you will, on a causal inference journey…
In 2023, Discord added a feature: users could now post Voice Messages in text channels, DMs, and Group DMs using the mobile app. The team felt excited about the feature and was curious to learn how users would react. Surely people love to hear each other's dulcet tones, right? But how do we measure the causal effect?
This is where things became difficult for us. Discord is rife (rife, we tell you!) with networks. We usually use A/B testing to measure the impact of our work, but many of our tests are influenced by network effects, or the feature doesn't even make sense outside of a network. Network effects occur when the behavior of users in Group A influences the behavior of users in Group B and vice versa. This can skew the results by introducing cross-group interactions and violating the assumption of independent units across control & treatment groups (the Stable Unit Treatment Value Assumption, a.k.a. SUTVA ♥️). A feature like Voice Messages is especially vulnerable to network effects. The feature only works if one person sends a Voice Message and another receives it, ya know?
The ideal, then, would be to randomize by network. This is challenging because, unfortunately, Discord's testing platform doesn't (yet!) support cluster randomization.
That leaves us with two options: a (bad) user-level A/B test, or randomizing by country. The idea behind the latter is that most networks are likely country- or language-specific, so we could mitigate network effects by comparing a treated geo against a control geo. But geo-testing isn't great either: comparing a treated country to a control country conflates all their other differences with the treatment effect. So… what's a better option?
Synthetic Controls have Entered the Chat
🔥 Synthetic controls compare one treated unit (e.g. a country) to a weighted combination (a “synthetic control”) of all other non-treated units (e.g. other countries).
Slightly longer explanation
Synthetic controls are a method developed by economist Alberto Abadie and his coauthors. The main idea: sometimes you just can't randomize. It's either not possible, unethical, or you'd sacrifice too much precision. In those cases, you can release your treatment to one group and create a composite, synthetic control made up of a weighted combination of your untreated groups.
A typical "vanilla geo-test" design would compare one treatment geo, such as Brazil, to a “similar” control geo, like Argentina. Unfortunately, geo-tests typically fail on both internal and external validity as they are neither unbiased estimates (internal validity) nor generalizable (external validity).
But why is that? Say we compare Brazil vs. Argentina. Even if we check each country's average user engagement stats on Discord, we can't control for everything. We'll still suffer from what's called "omitted variable bias": unmeasured differences between the countries that get baked into our estimate of the treatment effect. OVB is really out to get you. It would affect our testing even if we controlled for the countries' different languages, sizes, and histories!
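For the stats-curious, here's the textbook version of that bias as a generic sketch (the linear model and the "voice-chat habits" confounder are illustrative, not anything we actually fit):

```latex
% Suppose true engagement y depends on treatment T (getting Voice Messages)
% and an omitted confounder x (say, baseline voice-chat habits):
y = \beta_0 + \beta_1 T + \beta_2 x + u
% Regressing y on T alone biases the treatment estimate by the confounder's
% effect, scaled by how strongly the confounder co-moves with treatment:
\mathbb{E}[\hat{\beta}_1] = \beta_1 + \beta_2 \delta_1,
\qquad \text{where } x = \delta_0 + \delta_1 T + v
```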
The problem of external validity also applies: even if we managed to control for aalllll the ways Brazil and Argentina differ, we'll only get a reasonable estimate for the effect of releasing Voice Messages... in Brazil. That treatment effect may not generalize properly. Just because we learn that Brazil users love voice messages doesn't mean that users around the world would feel the same way. (Spoiler: Brazil loves voice messages. 🇧🇷)
For a place like Discord, synthetic controls are a much better option. Synthetic controls create a "fake" (or... synthetic) counterfactual using a weighted combination of all other geos that did not receive the treatment. So, instead of comparing Brazil to only Argentina, we would compare it to a “synthetic Brazil.” This “synthetic Brazil” could be something like 50% Argentina, 30% Uruguay, and 20% Chile (and 100% more helpful than a counterfactual of 100% Argentina).
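In symbols, the counterfactual is just a convex combination of the untreated "donor" geos (the weights below are the made-up ones from the example above):

```latex
\hat{Y}_{\text{Brazil},\,t} = \sum_{j \in \text{donors}} w_j \, Y_{j,t},
\qquad w_j \ge 0, \quad \textstyle\sum_j w_j = 1
% e.g. w_{\text{Argentina}} = 0.5,\; w_{\text{Uruguay}} = 0.3,\; w_{\text{Chile}} = 0.2
```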
Benefits of Synthetic Controls
The biggest benefit of synthetic controls is that they're relatively simple to implement. They control for observable and unobservable characteristics more effectively than geo-tests, and the quality of the pre-launch fit gives you a built-in signal on how trustworthy your counterfactual is.
Here’s the simple recipe (with a code sketch after the list):
- Outcome data for one treatment unit, multiple time periods before and after treatment.
- Outcome data for multiple control units, for those same time periods.
- A trusted analytical library to generate the synthetic control counterfactual, e.g. the Synth package in R, or a Python package such as pysyncon or SyntheticControlMethods.
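To make the recipe concrete, here's a minimal, hedged sketch of the weight-fitting step using only numpy and scipy. The variable names are illustrative placeholders, and a real library (like the ones above) layers predictors, diagnostics, and inference on top of this:

```python
import numpy as np
from scipy.optimize import minimize


def fit_synthetic_control_weights(treated_pre: np.ndarray,
                                  donors_pre: np.ndarray) -> np.ndarray:
    """Find donor weights w (w >= 0, sum(w) == 1) minimizing pre-period MSE.

    treated_pre: outcome for the treated geo, shape (n_periods,)
    donors_pre:  outcomes for the donor geos, shape (n_periods, n_donors)
    """
    n_donors = donors_pre.shape[1]

    def pre_period_mse(w: np.ndarray) -> float:
        return float(np.mean((treated_pre - donors_pre @ w) ** 2))

    result = minimize(
        pre_period_mse,
        x0=np.full(n_donors, 1.0 / n_donors),              # start at equal weights
        bounds=[(0.0, 1.0)] * n_donors,                    # no negative weights
        constraints=[{"type": "eq",
                      "fun": lambda w: np.sum(w) - 1.0}],  # weights sum to 1
        method="SLSQP",
    )
    return result.x


# Toy usage: 90 pre-launch days, 3 donor geos with known "true" weights.
rng = np.random.default_rng(0)
donors = rng.normal(100.0, 5.0, size=(90, 3))
treated = donors @ np.array([0.5, 0.3, 0.2]) + rng.normal(0.0, 1.0, 90)
print(fit_synthetic_control_weights(treated, donors))  # lands near [0.5, 0.3, 0.2]
```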
You can evaluate how well your synthetic control has, well… controlled… by examining the Mean Squared Prediction Error (MSPE) before and after the feature was rolled out. A small pre-launch MSPE means your synthetic control tracks the treated geo closely; if the MSPE then gets much bigger after the feature rolls out, that divergence is evidence of a real treatment effect rather than a poor fit. You can also do a sanity check by comparing other time periods where there was no intervention: does your synthetic control closely match your observed outcomes? If so, you're good to go!
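Here's that diagnostic as a tiny sketch (again, illustrative names, continuing in plain numpy):

```python
import numpy as np


def mspe(actual: np.ndarray, synthetic: np.ndarray) -> float:
    """Mean Squared Prediction Error between observed and synthetic outcomes."""
    return float(np.mean((actual - synthetic) ** 2))


def post_pre_mspe_ratio(actual: np.ndarray,
                        synthetic: np.ndarray,
                        launch_idx: int) -> float:
    """Post-launch MSPE divided by pre-launch MSPE.

    A small pre-launch MSPE says the synthetic control tracked the treated
    geo well; a ratio well above 1 points to a real post-launch divergence
    (a treatment effect) rather than a sloppy fit.
    """
    pre = mspe(actual[:launch_idx], synthetic[:launch_idx])
    post = mspe(actual[launch_idx:], synthetic[launch_idx:])
    return post / pre
```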
Main Weaknesses or Assumptions of Synthetic Controls
Every solution has its advantages, but nothing is perfect. One of the main weaknesses of synthetic controls is generalizability: even with a perfect "synthetic Brazil", we still only get the answer for Brazil. Can we mitigate this challenge? Thankfully, yes. We can repeat the synthetic control process by rolling the feature out to another country, for example the UK, and constructing a "counterfactual UK". We can continue doing this sequentially: one synthetic control for each geo the feature is rolled out to. Soon, we will virtualize the WORLD!!! *ahem*
Another challenge is drift: as features mature and evolve on our platform, will our synthetic control remain a reliable counterfactual? Unclear. It's probably unwise to rely on it for long-term treatment effects.
And finally: power! No, not that kind. Statistical power! Depending on the number of untreated groups and the number of time periods, we may be underpowered relative to a typical A/B test, which means our treatment effect estimates won't be as precise.
Okay, and one more small challenge: we need our data to be long and wide. That is, we need outcomes and some predictor features, observed many times over a long period, both before and after the "treatment" (e.g. rolling out Voice Messages). And we need the same data, features and time series alike, for every potential control geo.
Use case: Voice Messages
In 2023, we were developing an eagerly anticipated feature that allows users to send recorded audio messages, making communication faster and more personal. You know, Voice Messages!
We were excited about the feature because we believed it would enhance engagement by offering a more dynamic and expressive way to connect, especially in situations where typing is inconvenient. However, this feature posed a critical question: How could we accurately measure the impact of Voice Messages on our platform, considering the numerous network effects such a feature might trigger?
How we landed on the synthetic control method
Choosing the synthetic control method for Voice Messages was the result of a thorough process where we considered various experimentation approaches. We couldn't do a traditional, user-level A/B test due to potential cross-contamination: users with access to the feature might send Voice Messages to a friend without access. If we allowed the recipient to play the message, they'd be exposed to the treatment; if we didn't, the sender's experience would be affected — both of these scenarios bias the results.
We then explored a server-level A/B test, a common type of experiment at Discord. While this would better address network effects, it comes with its own challenges. Users might have access to Voice Messages in one server but not another, creating a confusing experience. Plus, our qualitative research indicated that Voice Messages were better suited to DMs and Group DMs, making a server-level test less relevant for the feature's primary use case.
Next, we considered a geo-test comparing two similar countries as control and treatment groups. However, finding truly comparable countries proved challenging. We identified a promising pair in Brazil and South Korea: both countries had high rates of DMs sent within the same country, thereby mitigating many of the network effects. They also had nearly identical rates of active users DM'ing, controlling for DM inclinations, and generated significant traffic, making them well suited for the test.
Despite these similarities, deeper analysis revealed behavioral differences that could skew results. Notably, Brazilians used voice calls—which we considered a proxy for Voice Messages—on Discord at a higher rate than South Koreans. Unsurprisingly, we later came across this article stating that Brazilians sent four times more voice notes than any other population on WhatsApp!
Finally, after considering all these approaches, we settled on the synthetic control method. Brazil would serve as the treatment group because it checked many boxes, but how could we construct a synthetic control that would closely mirror Brazil?
Creating the synthetic control method
To create our synthetic control method, we first compiled a dataset containing all the metrics we aimed to measure during the experiment, including primary, secondary, and guardrail metrics.
Next, we constructed a model with several key parameters (a hypothetical sketch follows the list):
- The feature release date, expressed as a day-of-year integer.
- The model training time window, represented by a range of day-of-year integers leading up to and including the release date.
- Predictor metrics for our target metric, serving as model parameters. These predictors should:
  - Have data available over a sufficiently long time horizon.
  - Avoid high correlation or collinearity with the target metric (a near-duplicate of the target adds no independent signal).
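For illustration, here's a hypothetical sketch of that configuration; the field names and metric names are ours, not Discord's internal API:

```python
from dataclasses import dataclass, field


@dataclass
class SyntheticControlConfig:
    release_day: int                # feature release date, day-of-year integer
    training_window: range          # day-of-year range up to and including release
    target_metric: str              # the outcome the synthetic control predicts
    predictor_metrics: list[str] = field(default_factory=list)


config = SyntheticControlConfig(
    release_day=100,
    training_window=range(10, 101),                # ~90 days of pre-launch data
    target_metric="dms_sent_per_active_user",      # illustrative metric name
    predictor_metrics=[
        "voice_call_minutes",       # long history, related but not a near-duplicate
        "daily_active_users",
    ],
)
```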
Analyzing the results
After validating our synthetic control setup, we analyzed the experiment results and saw what we were hoping for: a clear increase in user engagement after the feature's release! This gave us confidence that the feature was providing value to our users and was ready for our broader user base.
Recap
Synthetic controls are a powerful addition to your causal inference toolkit when traditional A/B tests aren't feasible—due to logistical or ethical constraints. Discord's initial rollout of voice messages exemplifies how this approach can be effectively used. The synthetic control method enabled us to test the feature while mitigating network effects. However, it's crucial to recognize the method's limitations: it requires extensive time series data and doesn't ensure generalizability.
If you made it to the bottom of this blog post and are interested in more engineering articles like this, check out the rest of the Discord Blog! Or, if you want to tackle some of these challenges yourself, we'd love to have you: explore our Careers page for any openings!