Deploying on Friday: are we still talking about this?

From No-Go to Routine, or thank God it’s Friday until you break production.

Aug 22, 2024

Note: When I say Friday, I refer to the last working day of the week. In some countries, that is not Friday. If that is your case, you know what I mean!

There is a rule of having a rule of not deploying on Friday. When you join a new company, you wait to hear about the rule of no Friday deployment. If they have the rule, then they are compliant with the universal rule of having that rule.

But what happens with the ones that don’t have that rule? Are they reckless? Are they unaware of the rule of having that rule? Or do they know something that the others don’t? We will explore that idea.

Before we start, there is an anecdote that comes to mind when I think about deploying on Friday. For a few years, at the beginning of my career, I was a freelancer in Barcelona. I made websites and apps and whatnot. I was my own DevOps and my own on-call. Soon I started cooperating with a company in which, for the first time, I heard that rule of no deployments on Friday. I was genuinely confused. Isn’t Friday the best day? I honestly preferred Friday because if something failed, I could quickly fix it on Saturday, while during the week, I had a million competing priorities. That was me being pragmatic and knowing zero about the corporate world.
Then I learned how wise that advice was, and how good it was to spend the weekend at the beach.

Friday's releases were, historically, hands down, a weekend wrecking ball.

But why, why?

I remember the time when a deployment could seriously break things. I saw teams spend a couple of days trying to fix a mess that had been introduced in a minute and could not be rolled back. Commits piled up for months, and that poorly understood entanglement was thrown on top of the working production code. Different systems had to be deployed in a specific order to ensure dependencies were met. Configuration was changed manually. Leaked-in dependencies and other monsters caused the “It worked on my machine ¯\_(ツ)_/¯”.

CI/CD was more hype than reality. Canaries, blue/green deployments, one-click rollbacks, and feature toggles were not a thing.

But that’s not the case anymore, is it?

What does a deployment look like today?

Note: if yours doesn’t look like that, you have some food for thought right here.

Changes are deployed frequently. You don’t ship months' worth of work at once; you ship small changes.
The build can take some time (which is fine because you are definitely separating the build and run stages), but the release happens almost immediately.
Infrastructure, dependencies, and the jungle of libraries, versions, and server software are managed as code in an imperative/declarative way, drastically reducing the number of things you need to know and remember to do to replicate things from one environment to another.
You may have feature toggles, so behavior changes do not actually happen at deployment time but when you decide to enable them. And you can switch them off again the moment you no longer want them.

This not only makes it easier to fix things or revert if something goes wrong but also makes it harder to break things in the first place. Also, if you have the right observability (dashboards, metrics, and alerts) and document the post-implementation validations beforehand, you know right away if something went wrong. This means we will have a relatively low “Risk Priority Number”.
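To make the "you know right away" part a bit more tangible, here is a minimal sketch of a post-deployment validation that compares the error rate after a release against a baseline. Everything in it is an assumption for illustration: the `MetricWindow` type, the function name, and the thresholds are hypothetical, and in a real setup the numbers would come from your monitoring tool rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class MetricWindow:
    """Request and error counts observed over a fixed time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when the window saw no traffic.
        return self.errors / self.requests if self.requests else 0.0


def deployment_looks_healthy(
    baseline: MetricWindow,
    after_deploy: MetricWindow,
    max_relative_increase: float = 0.5,  # hypothetical: tolerate up to +50%
    absolute_floor: float = 0.001,       # hypothetical: always tolerate 0.1%
    min_requests: int = 100,             # hypothetical minimum sample size
) -> bool:
    """Return True only when there is enough traffic and errors stay near the baseline."""
    if after_deploy.requests < min_requests:
        # Too little traffic to judge: treat the release as not yet validated.
        return False
    allowed = max(baseline.error_rate * (1 + max_relative_increase), absolute_floor)
    return after_deploy.error_rate <= allowed


# Example with made-up numbers: 0.5% errors before, 0.6% after the release.
before = MetricWindow(requests=10_000, errors=50)
after = MetricWindow(requests=2_000, errors=12)
print(deployment_looks_healthy(before, after))  # True
```

A check like this is only as good as the metrics behind it, which is why the validations are worth writing down before the release rather than improvising them afterwards.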

Risk Priority Number (RPN)

In Failure Modes and Effects Analysis (FMEA), the Risk Priority Number (RPN) is a metric used to prioritize risks. It is the product of three factors:

Severity (S): The impact of a failure, that is, how serious the consequences are if it occurs, without considering any mitigating actions or controls.
Occurrence (O): The likelihood or frequency of the failure occurring.
Detection (D): The ability to detect the failure before it reaches the end user. It considers the effectiveness of current controls or mechanisms to identify the issue.

For example, if an issue could render the checkout process unusable, the Severity would be high. But if the chances of occurrence are very low and the chances of detecting it right away are close to 100%, the overall risk is next to nothing, even more so if reverting it takes one click.
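The checkout example can be put into numbers. In conventional FMEA, each factor is rated on a scale (1 to 10 is a common choice) and the RPN is the product S × O × D, where a low Detection rating means the failure is easy to catch. The ratings below are invented purely for illustration, and the function name is hypothetical.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Compute RPN = S * O * D, assuming the common 1-10 FMEA rating scale.

    A detection rating of 1 means "almost certain to be caught",
    10 means "almost impossible to detect".
    """
    for name, value in (
        ("severity", severity),
        ("occurrence", occurrence),
        ("detection", detection),
    ):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# Illustrative ratings only: a broken checkout is severe, but here it is
# assumed to be unlikely and caught almost immediately by alerts.
checkout_break = risk_priority_number(severity=9, occurrence=2, detection=1)

# A milder issue that happens more often and goes unnoticed scores far worse.
silent_bug = risk_priority_number(severity=4, occurrence=5, detection=9)

print(checkout_break)  # 18
print(silent_bug)      # 180
```

The point of the comparison is that a scary-sounding failure can end up with a much lower score than a quiet one, as long as it is rare and you see it immediately.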

So, should I go ahead and deploy on Friday? Wait, hold your horses. Let’s see the pros and cons.

Against Friday Deployments

The popular beliefs are:

Risk of Downtime: If issues arise, they may not be addressed promptly, leading to prolonged downtime, especially if staff are unavailable or reluctant to work over the weekend.
Work-Life Balance: Teams may be forced to work late on Fridays or over the weekend to resolve deployment issues, which can negatively impact morale and work-life balance if it happens often (often being a subjectively defined word).
Limited Resources: Even with on-call teams, available staff and support resources might be reduced over the weekend, making it harder to manage complex issues. What if you depend on marketing, content, legal, or other teams that typically don’t have an on-call rotation?
Customer Impact: The response time to any incident may be longer, which may disrupt customers who rely on those services during the weekend, leading to dissatisfaction and potential revenue loss.
Time to Detection: What if there is a silent problem? The time it takes to discover and diagnose a problem can be longer during the weekend because not everyone is doing their everyday work, testing flows, checking dashboards, etc.
Operations and Customer Support: Even if everything goes well, if you are launching a new product and, for example, it requires special training, maybe you prefer to avoid a weekend when your CSRs may not have team leaders available in case of any confusion or customer queries.

Counter arguments

If you are afraid of Fridays, you should be fearful of Thursdays and every day after 3 pm. If you break something on Thursday, you will find out maybe on Friday afternoon, and then what? Would you be deploying on Friday evening to fix the Thursday bug? Okay, fine. No Friday, no Thursday, no afternoons after 3 pm. Wouldn’t it be better if you arrived at a safe way of deploying any day?
Downtime why? What the heck are you deploying? If you deploy daily as soon as you have some tested updates to deploy, all your deployments are incremental and small, and if you stick with smaller changes, there isn't much that can break.
Why customer impact? Does it mean you found an error and didn’t mitigate it immediately? Or was it not straightforward to diagnose or mitigate it, so you did not roll it back? Or you didn’t detect it? Does this mean that you deployed changes without alerts and dashboards? Well, your problem is not the deployment window but your definition of done.

Some net benefits of Friday deployments

End-of-Week Momentum: Teams often push to complete work before the weekend, potentially increasing productivity and the completion of key tasks.
It avoids confusion and frustration. A team has been working hard on a project and is ready to deliver, but they may miss a deadline because it is Friday, and they are not allowed to release it, even when they see no risk.
Resource Availability: If a company has a robust on-call system, deploying on a Friday can ensure that resources are available to address issues during the weekend rather than disrupting the workweek.
Adapt to traffic patterns. Depending on the usage patterns of your customers or from what time zones they visit, Friday or even the whole weekend may actually be a quieter period and, thus, more suitable for updates.
Foster a culture of readiness and adaptability. Muscles have to be trained. If you don’t usually deploy on Fridays, you will be in trouble if one day you need to deploy something on Friday, let’s say a bug fix or a compulsory update due, for example, to regulation or market reasons. That would be riskier and more stressful for everyone than if it were a standard practice:
- Are you going to put in place a special process?
- Do you have a sophisticated enough branching model, or are you able to toggle off all the changes that are not meant to be promoted?
- If you don’t usually have an on-call or it is not enough to cover this, do you need to coordinate with HR and Legal and agree on a compensation model, select and influence some team members to be on-call, create runbooks, and draft an operational plan?
- And the hardest part: the cultural aspect. If the common understanding is that Friday deployments are evil, there is a negative halo and pressure around it. A team that has to deploy to meet a mandatory deadline or fix a bug, aside from all the problems and pressure, will also need to prepare an excellent narrative to justify it and react to complaints and criticism.
The vicious circle of mitigating risks by freezing actions. Risk aversion is understandable, and a code freeze can be an easy escape, but as with all easy escapes, it can be abused. No deployments to production in specific periods, then UAT or stage is blocked because it is ready with something that has to be released to production after the freeze. QA is blocked because something else has to go urgently to UAT after UAT is available again. Code piles up and new arbitrary freezes are introduced to secure complex releases. A never-ending chain that will end up in extended periods with no deployments, hotfixes in certain environments without coming from the lower one, environment swapping, etc. And the inability to perform urgent deployments without breaking all the rules and invalidating all the setups. The solution becomes the problem.
The smaller the increments, the lower the risk. Piling up commits, whatever the reason, is always wrong. Deploy as soon as possible and create your processes to take yourself in that direction.

With all this, it sounds better to deploy as early as possible and treat all days equally.

But what if there is a bigger, riskier change?

On certain occasions, you have one-way-door, drop-the-bomb kind of changes. Something hard to revert or toggle off if something goes wrong.

For those, you probably need a different preparation and rules than for a more mechanical business-as-usual deployment.

In this case, you may need a specific, preplanned, and broadly aligned window. But then the day of the week is probably the least of your concerns. Maybe you will even have someone doing it outside working hours, like on a Saturday evening, or the opposite can be true: if you need a specific set of people or the change is highly unsafe or unpredictable, you may want to wait until Monday. So these situations call for a case-by-case decision rather than a rule.

Wrapping all this up

No Friday deployments is like saying “my knee hurts when I touch it, so my solution is not touching it”. Sometimes, things are a symptom of a more fundamental issue, and that issue is what you should be looking at.

Dive deep to understand why the deployment process feels scary and address the root cause. Build confidence. And then deploy whenever you need to.

Addressing the root cause doesn’t usually require reinventing the wheel, just a few well-known practices.

The foundation for a safe deployment

You may have in mind some long-term fancy things, but the reality is that the most relevant ones are essential solutions that are relatively easy to put in place:

Deploy frequently to keep changes small. The bigger the difference, the bigger the risk.
Use feature toggles for all your developments. This will help in two ways: to hide partial developments that are not supposed to reach users yet, and to deactivate a feature when the deployment goes wrong. In the future, it can also help with more things, such as progressive rollouts, A/B testing, etc.
Design and document a quick and easy rollback process for failed deployments.
Add smoke/regression testing to your process before the release.
Instrument Real User Monitoring (RUM) and application monitoring (APM).
Create dashboards and alerts to monitor relevant functional metrics (e.g., are there anomalies in the number of onboardings? are errors trending up?).
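Of the items in this list, feature toggles are probably the easiest to picture in code. The following is a minimal sketch, assuming an in-memory toggle store; the `FeatureToggles` class, the `new_discount_engine` flag, and the discount rule are all hypothetical, and a real system would typically read flags from a config service or database so they can be flipped without a deployment.

```python
class FeatureToggles:
    """Hypothetical in-memory toggle store, for illustration only."""

    def __init__(self, initial=None):
        self._flags = dict(initial or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so unfinished work stays hidden.
        return self._flags.get(name, False)

    def enable(self, name: str) -> None:
        self._flags[name] = True

    def disable(self, name: str) -> None:
        self._flags[name] = False


def checkout_total(prices_in_cents, toggles: FeatureToggles) -> int:
    """Sum the cart; apply the new (made-up) discount only when toggled on."""
    total = sum(prices_in_cents)
    if toggles.is_enabled("new_discount_engine"):
        # New behavior, already deployed but dormant until the flag is on:
        # 10% off, computed in integer cents to avoid float rounding.
        total -= total // 10
    return total


toggles = FeatureToggles()
cart = [6000, 5000]

print(checkout_total(cart, toggles))  # 11000: code is deployed, behavior unchanged

toggles.enable("new_discount_engine")
print(checkout_total(cart, toggles))  # 9900: feature released

toggles.disable("new_discount_engine")
print(checkout_total(cart, toggles))  # 11000: switched off, no redeploy needed
```

The deployment and the release become two separate events, which is what lets the first one happen on any day of the week.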

When you have these, or a good number of them, you should be able to deploy any day of the week. If you cannot have that, you had better avoid deploying on Friday, but that definitely means that you need to evolve your processes, and that should be your top priority because what you are missing is not fancy stuff but the basics.

Plan your deployment for success

All this applies to Fridays, but also to any day for that matter.

Deploy early, ideally before noon or at least a few hours before you finish the day. You want to have time for testing and rolling back, and you don’t want to leave late if it takes too long.
Avoid last-minute changes or rushing through the deployment process due to the eagerness to close things before you go. As you would do any other day, deploy only if everything is ready. Have the testing sign-off, go/no-go, or readiness review and everything in place early in advance.
Prepare a clear deployment plan: list the steps, align the people that are required, prepare a backup and rollback plan, etc.
Communicate clearly: inform all relevant parties about the deployment schedule, the plan, and any potential risks involved.

The extra mile

Nothing else is really needed to enable Friday deployments and a healthy cadence. With that said, you may want to add some additional sophistication in the longer run. Even major changes like framework upgrades, vendor migrations, etc., could be done safely and with little effort with the right things in place. You may need to invest a bit more in automated testing and multi-phased rollouts, but it takes a lot of drama away once you get there.

Automatic regression testing for all features.
Full coverage end-to-end production testing.
A full-fledged CI/CD pipeline, where every merged PR is automatically deployed to prod after a successful run of tests.
Safest release models:
- Canaries: roll out progressively to a small number of customers, monitor (errors, functional metrics, latency, etc.), and if everything goes well for a period of time, progressively increase the percentage.
- Blue/Green deployments: Keep the old and the new versions in parallel and just let the load balancer route traffic to the new instances. If things go wrong, you route all traffic to the old instances again.
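The canary idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the hashing approach for picking customers, and the stage percentages are all hypothetical, and in practice this logic usually lives in your load balancer, service mesh, or deployment tool rather than in application code.

```python
import hashlib


def canary_bucket(customer_id: str) -> int:
    """Map a customer to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_new_version(customer_id: str, canary_percent: int) -> bool:
    """A customer gets the new version when their bucket is below the percentage.

    Because the bucket is stable, anyone included at 5% is still included at 25%.
    """
    return canary_bucket(customer_id) < canary_percent


def next_canary_percent(current: int, healthy: bool, stages=(1, 5, 25, 50, 100)) -> int:
    """Move to the next stage when monitoring looks good, otherwise go back to 0."""
    if not healthy:
        # Something looks wrong: route everyone to the old version again.
        return 0
    for stage in stages:
        if stage > current:
            return stage
    return stages[-1]  # Already at full rollout.


# Walking through a rollout where monitoring stays healthy, then one where it fails.
percent = 0
for _ in range(3):
    percent = next_canary_percent(percent, healthy=True)
print(percent)                                      # 25
print(next_canary_percent(percent, healthy=False))  # 0
```

The `healthy` flag is where your dashboards and alerts plug in: the rollout only advances when the metrics you care about stay within bounds for the observation period.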

Summary

Not deploying on Fridays made total sense in a time when deployments were rarely done and risky. In today’s world, with more frequent deployments and better ways of detecting issues, toggling features off, or rolling back deployments, having deploy freezes introduces artificial limitations and can paradoxically increase the risk of introducing defects or being less ready for detection or mitigation.

Removing the deployment freezes altogether can reduce the complexity of your deployment processes and also improve your responsiveness to incidents by looking at the challenges and risks and preparing for them rather than pretending that risk can be dodged forever.

Ultimately, the decision to deploy on Fridays depends on various factors, including the deployment pipeline's maturity, the product's nature, the availability of staff, and the company's risk tolerance. Many companies avoid Friday deployments to mitigate potential risks, while others might have the infrastructure and processes to handle them effectively.

Whether or not you have thorough testing, monitoring, and rollback strategies can be the main driver for this decision. If you do, you can deploy any day. If you don’t, that gives you a very actionable hint of what your actual problems are, and you must evolve your processes until the answer is yes.

Thanks for reading Leadership in the time of the robots! This post is public so feel free to share it.

Fran Soto

Aug 25

I like how you approached it here, Carlos. It's not about using typical lines like "Deploying on Fridays is bad", that's hype like being dogmatic about any technology.

Keeping the analogy of your knee, I would say if your knee hurts, you want to go to rehab and go little by little. And Friday deployment can be like jumping and running. The risk is higher if the infrastructure and team maturity are not in the right place.

But it's not black or white. Depending on the situation and how much progress the team made towards safer deployments, you can evaluate each situation as a go or a no-go for a Friday deployment

1 reply by Carlos Robles
