This post describes a naive way of gradually rolling out new features and data migrations and how it leads to a worse experience for an unlucky 1% of users. It covers the challenges of safely making changes to large-scale online services, what gradual rollouts are and how they fit into the bigger picture, a naive implementation of gradual rollouts using user id mod 100, how that implementation gives users with ids ending in 00 a worse experience, and what to do about it.

Engineers working on large online applications often cannot test their changes end-to-end before checking them in. The changes cannot be tested locally because modern software is complex, with a myriad of dependencies. Running a local copy of an online service (e.g. YouTube) for development and testing is infeasible once the website is built from hundreds of microservices. Engineers can write unit tests or run a local copy of a single microservice for initial testing; however, they cannot test how the change will interact with the system as a whole before checking it in.

Teams rely on feature flags to make up for the gap between local and production environments. A feature flag is a setting that can be toggled at a granular level (e.g. per-user or per-server) to enable or disable a new feature. The idea is to wrap new code in an “if feature-is-enabled” block so it is disabled by default for all user accounts. This allows engineers to safely deploy changes to a staging or production server because the new code will not run during any user request. Once the change is in production, the engineer can enable the flag just for their account to test that the change works end-to-end. Because bugs in the code could crash the server and thereby impact users who do not have the flag enabled, the engineer should do this testing on staging servers.
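To make the pattern concrete, here is a minimal sketch in Python; the flag store, function names, and ranking stubs are all hypothetical, not any particular flag library.

    # Hypothetical per-user flag store: flag name -> user ids opted in.
    ENABLED_USERS = {"new_ranking": {42, 1001}}

    def feature_is_enabled(flag: str, user_id: int) -> bool:
        return user_id in ENABLED_USERS.get(flag, set())

    def old_ranking(user_id: int) -> list:
        return ["a", "b", "c"]  # stand-in for the existing behavior

    def new_ranking(user_id: int) -> list:
        return ["c", "b", "a"]  # stand-in for the new code under test

    def handle_request(user_id: int) -> list:
        # New code is wrapped in an "if feature-is-enabled" block, so it is
        # off by default and can be enabled per account for end-to-end testing.
        if feature_is_enabled("new_ranking", user_id):
            return new_ranking(user_id)
        return old_ranking(user_id)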

Testing a feature with one or two accounts is insufficient to guarantee it works for all accounts because an application's behavior depends on the data each user has saved. To respect user privacy, companies limit access to user data, so engineers have to create their own test data, which may not be representative of a typical user account. Additionally, service behavior can vary with server load. One way to mitigate this problem is a dark launch. A dark launch runs the new code or logic for each user request, logs the results, and hides them from the user while still returning the old results. This exercises the new code paths with data from all user accounts while shielding users from potentially buggy results. The limitation of a dark launch is that it cannot test user reactions to a change, because the change is hidden from users. There is also engineering and resource overhead in setting up the dark launch, logging, and analyzing the results.
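A dark launch can be sketched in a few lines, reusing the hypothetical old_ranking and new_ranking stubs from above: the new code runs on the real request, its result is logged for offline comparison, and the user only ever sees the old result.

    import logging

    def handle_request_dark(user_id: int) -> list:
        old_result = old_ranking(user_id)
        try:
            new_result = new_ranking(user_id)
            # Log whether the new code agrees with the old, for later analysis.
            logging.info("dark_launch user=%s match=%s",
                         user_id, new_result == old_result)
        except Exception:
            # A crash in the dark-launched code must never affect the user.
            logging.exception("dark_launch failed for user=%s", user_id)
        return old_result  # the user always gets the old behavior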

Gradual rollouts are an important risk-reduction strategy for when the time finally comes to start exposing users to a change. The idea is to enable a feature for only a few users to start with and to monitor service health and user satisfaction metrics (e.g. click-through rates) for problems before enabling the feature universally. Typically a rollout follows a progression like 1%, 5%, 10%, 20%, 50%, 100% of users, with a period of time between steps to collect data and decide whether to abort or proceed. If a bug slipped through testing, or users simply don't like the change, the problem is caught early in the rollout and only ever impacts a small fraction of users. To provide a consistent user experience and collect more accurate metrics about the impact on users, user-visible features are enabled on a per-user rather than per-request basis: it would be confusing for users to see a new feature randomly appear and disappear each time they open the app.
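As a sketch of how such a progression might be driven (the flag API and metric check here are stubs, not a real framework):

    import time

    ROLLOUT_STAGES = [1, 5, 10, 20, 50, 100]  # percent of users

    def set_rollout_percentage(flag: str, pct: int) -> None:
        print(f"{flag} now enabled for {pct}% of users")  # stub flag API

    def metrics_look_healthy() -> bool:
        return True  # stub: check error rates, latency, click-through rates

    for pct in ROLLOUT_STAGES:
        set_rollout_percentage("new_ranking", pct)
        time.sleep(1)  # in reality: hours or days of metric collection
        if not metrics_look_healthy():
            set_rollout_percentage("new_ranking", 0)  # abort and investigate
            break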

A naive way of selecting which users to enable a feature for is to take the user id modulo 100 and compare it to the rollout percentage. This satisfies the requirement that a feature is consistently enabled or disabled for a user because the user's id is constant across requests. More precisely, let U be the user id and R the rollout percentage (from 0 to 100). The feature is enabled if U mod 100 < R. This assumes user ids are positive integers with uniformly distributed remainders. If this assumption is violated, some stages of the rollout will reach more or fewer users than intended. Common techniques for generating user ids are assigning sequential integers and picking large random integers. Both introduce a bias in the remainders. For example, say the service has 1234 users with ids 1, 2, 3, ..., 1234. There are 13 users for each remainder value from 1 to 34, but only 12 users for remainder 0 and for each remainder value from 35 to 99. Likewise, if user ids are uniformly random integers in the range [0, 2^n - 1], there will be a bias towards lower remainders because 2^n is not evenly divisible by 100. While this technically violates the even distribution assumption, the deviation is small enough not to matter in practice.
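In code the naive rule is one line, and the sequential-id bias from the example above is easy to verify:

    from collections import Counter

    def is_enabled(user_id: int, rollout_pct: int) -> bool:
        # Naive per-user gate: enabled iff the id's remainder is below R.
        return user_id % 100 < rollout_pct

    # With ids 1..1234, remainders 1-34 occur 13 times each, while
    # remainder 0 and remainders 35-99 occur only 12 times each.
    counts = Counter(uid % 100 for uid in range(1, 1235))
    print(counts[1], counts[34], counts[0], counts[35])  # 13 13 12 12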

The problem with using user id mod 100 to pace the rollout is that it consistently picks the same users to go first across multiple feature rollouts. Users with ids ending in 00 are always the first to have a feature enabled, then users with ids ending in 01, 02, 03, 04, and so on. Those users get a consistently less reliable experience because production bugs are most likely to appear early in a rollout. Likewise, users with ids ending in 99 get a more reliable experience, because whenever a problem is detected the rollout is stopped, possibly before ever reaching them.
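Using is_enabled from the sketch above, it is easy to see that the 1% stage selects the identical cohort for every feature:

    # At a 1% rollout the enabled set is always the ids ending in 00,
    # regardless of which feature is being rolled out.
    first_cohort = sorted(uid for uid in range(1, 1235) if is_enabled(uid, 1))
    print(first_cohort[:5])  # [100, 200, 300, 400, 500]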

Using a feature flag and experimentation framework is preferable to solving problems like this on your own. The example discussed in this post is just one of many subtle issues with feature rollouts and experimentation in general. For further reading, see Overlapping Experiment Infrastructure: More, Better, Faster Experimentation by D. Tang et al. and the Site Reliability Engineering book. That said, sometimes you have to work with a system that is not integrated with a full-fledged experimentation framework, and it can be expedient to quickly roll your own. In this case a solution is to make the remainder depend on both the user id and the feature being rolled out. If each feature flag is assigned a unique id F, the feature can be enabled if hash(U, F) mod 100 < R, where hash is a hash function that combines the user and feature ids into a single integer. Because different features have different ids, the same user is likely to have different remainders for different features and thus appear at a different point in each rollout.
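A minimal sketch of this fix, assuming string feature ids and using SHA-256 as the hash (Python's built-in hash() is randomized per process, so it would not give a stable per-user decision across servers):

    import hashlib

    def is_enabled(user_id: int, feature_id: str, rollout_pct: int) -> bool:
        # Mix the feature id into the bucketing so each feature orders
        # users differently during its rollout.
        digest = hashlib.sha256(f"{feature_id}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return bucket < rollout_pct

    # The same user usually lands in different buckets for different features,
    # so no one is stuck going first (or last) in every rollout.
    print(is_enabled(100, "new_ranking", 1), is_enabled(100, "dark_mode", 1))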