From Developer to CTO Part 6: Managing Technical Debt

Contents
1.1. Technical Debt is a Fact of Life
1.2. Not All Technical Debt is Created Equal
1.3. The Origins of Technical Debt
1.4. There’s a Practical Limit to How Much Debt can be Carried
1.5. Unmanaged Technical Debt Creates Unknown Liabilities
1.6. Technical Debt Remediation Work is Not Value-Driven
1.7. A Bad Sell
1.8. Crafting the Message to Leadership
1.8.1. Example: Cost of Unplanned Work in a Component Containing Technical Debt
1.9. Technical Debt Initiatives Feel Different
1.10. Technical Debt Work Can Motivate Your Teams
1.11. The Role of the Architect in Managing Technical Debt
1.12. Proactively Managing a Technical Debt Register
1.13. Stack Rank Your Technical Debt Items
1.14. Managing Technical Debt in a Greenfield Project
1.15. Managing Technical Debt in a Legacy Project
1.16. Resource Alignment for Legacy Technical Debt Efforts
1.17. Resource Allocation - Create A Permanent Lane

1. Managing Technical Debt

1.1. Technical Debt is a Fact of Life

Your legacy may be defined by your ability to manage technical debt.

Most executives have a basic understanding of technical debt: they recognize it as a liability that costs effort to stay on top of and that, left out of control, can box their platform into a corner.  I’ll start by providing my definition of it:

The collective result of implementation choices made knowingly or unknowingly, that at some point in the future can be expected to impede the ability to extend, make performant, make secure, and make scalable, a system under development.

Made intentionally, it is often done in the interests of reducing time to market and can actually be a tool for you.  Made unintentionally, it can cause painful problems when you least need and expect them, because those kinds of debts aren’t managed risks.  Technical debt is a fact of life, and being successful isn’t about avoiding it outright but about knowing how to manage it responsibly.

1.2. Not All Technical Debt is Created Equal

Lest I give the impression that I believe technical debt to be the worst thing in the world, I want to be clear: done well, the decision to intentionally add some kinds of technical debt can be highly enabling for a business and a technology team, as it can help a team focus more on value and not get bogged down in certain kinds of technical detail.  In reality, some kinds of technical implementations can safely be deferred in favour of a shorter term but less ideal implementation, as long as the interim implementation is incremental toward and made in consideration of the future state, and as long as the point at which this needs to be resolved is clearly defined in the roadmap and there is understanding of which business goals will not be met if it goes unresolved. [1]

Of course, the real world is a little bit dirtier than we’d like it to be, and under great duress to deliver to difficult timelines, the decision to implement outright technically unfavourable solutions unfortunately sometimes needs to be made.  This is the kind of technical debt that can quickly get away from you when the pressure is on, and lead to one-way solutions from which it is difficult to extricate yourself.  This kind of technical debt can back a business into a corner over a short space of time.  Finding yourself here too often is a sign of greater problems, and responsibility for this often rests across multiple teams, rooted in the organizational approach to planning.  These debts have to be repaid as soon as possible.

Technical debt lives at different levels, from the macroscopic level of architectural issues to more localized issues with object design and coding patterns. Almost universally, developers understand the concept of technical debt and how it applies within and around their areas of the architecture and code base, at least as seen through their lens of personal maturity, and they experience a tension or strain that it places on their daily work.  This understanding does not necessarily translate into meaningful efforts to learn about or mitigate the impacts, however, and even in the presence of permission to spend some time on refactoring, there sometimes exists a reluctance from developers to change their pattern.   This is a cultural issue and a cancer, and you need good oversight to identify and remedy this.

1.3. The Origins of Technical Debt

Technical debt can sometimes start as code quality issues that escalate into designs (object models and the like) that are inflexible, difficult to make performant, and hard to change when business logic must support new features.  By the time a piece of code has been visited enough times, it may prove to have been extended thoughtlessly, or extended by well-meaning developers who were not very patterns-aware or were afraid to change its status quo.  These eventually show up as pages-long SQL statements, pages-long code artifacts full of hidden logic issues, and general sprawl that makes it hard to discern structure (a whole section on code quality follows so I won’t go too deeply into that here).  By my definition, this becomes addressable technical debt the moment it becomes a liability in terms of performance, extensibility, or quality, whether that liability becomes realized or not.  This kind of technical debt can often be remediated by small to medium scale refactoring efforts, ideally utilizing existing higher level test coverage such as API and UI testing techniques as insurance.  I’ll expand on a strategy to manage this kind of technical debt in the section below on managing brownfield or legacy projects.

Technical debt can run deeper into the architecture itself though, and its management requires a different level of commitment.  Sadly, the genesis of these kinds of issues is often insufficiently skilled early technical oversight in the growth of a new platform.  These problems tend to be pervasive and start to appear when demands on the system grow.  The scale of remediation efforts for deeper rooted issues can make them such that they require more staff with more varied disciplines across the platform, and thus can compete more directly with product roadmap feature items than, say, a medium scale refactoring confined to a single component.  Because the connections between and assumptions around platform components can change significantly in this kind of refactor, there is invariably higher risk too.

1.4. There’s a Practical Limit to How Much Debt can be Carried

Depending on the degree to which a product is hamstrung by these issues, a leadership team may have little choice but to execute a complex debt remediation plan if no short-term mitigations are possible.  One such mitigation is horizontally scaling a slowing component with more compute resources than would be typical for its load, knowing this does nothing to address the root issue.  Sometimes there are few reasonable alternatives, but a product is rarely too far gone to be salvaged.  The risks of working with an existing system are more readily quantifiable than those of the open slate of a greenfield one, so there can still be solid justification for deciding to complete a heavier refactor, on the assumption you’ve managed to retain solid architectural talent with which to lead it.  These efforts tend to directly block roadmap for some time, which is always a problem for the business.

No leadership team wants to contemplate bootstrapping another greenfield development effort to replace a legacy system.  For a start, they have weak assurance the new platform will not suffer the same fate as the first, and unless you engage a completely separate concurrent team to write it, it effectively strangles their ability to compete in the marketplace with the old platform while the new one is under development.  It goes without saying that it’s incredibly important to prevent the platform from getting to the stage where this is even a conversation.

1.5. Unmanaged Technical Debt Creates Unknown Liabilities

As I discuss later in the section on code quality, projects naturally tend toward health decline over time, and many forces combine to make this so.  I’ve observed this to be a practically universal trait across every platform I’ve ever assessed or worked with.  The common outcome is almost always that technical debt hasn’t been well managed and code quality has been allowed to decline to the point that making changes has become increasingly risky and difficult.  Again, almost universally, this is because insufficient importance was placed on it by either engineering or the business.

I’ve joined many an existing team in an architectural or leadership capacity in which no form of technical debt register existed.  This is a problem when you’re adopting a legacy platform because you don’t really understand the liabilities in its design, making it difficult to manage the risks to the platform’s quality, extensibility, maintainability, security, and performance.  This is a precarious position for a business to be in.  Success with a platform in any form depends heavily on illuminating its liabilities.

1.6. Technical Debt Remediation Work is Not Value-Driven

Let’s first state the main goals of a technical debt remediation effort: to make functional platform development easier and more efficient, to support the “ilities” – quality, extensibility, maintainability, security, and performance, and to extend the working life of the platform.  Efficiency and platform life are things the business cares about from the standpoint that they affect operational or capital expenditure (depending on which way you dice those up) and the ability to taper R&D costs down over time, but efficiency in how you get there isn’t what the business is selling to the customer.

As such, technical debt work doesn’t have direct business value as far as the customer is concerned; yes, they’re paying for features in a performant and always-on system, but these are a given and it’s of no consequence to them how that happens.  With a system that is fundamentally working, the value of technical debt work is primarily to your company itself.  Therefore, it is difficult to prioritise these efforts by comparing them to business value, and I’d go so far as to say that conversation should generally not be taking place relative to roadmap, save for efforts that the business knew were coming because of an earlier planned decision to adopt technical debt to reduce a feature’s time to market.

With that exception aside, assuming that efficiency is a goal of the organisation at large, then efficiency is something that you should continually be aiming to achieve, and as far as continual improvement goes, technical debt remediation can be about as foundational as it gets.  Ergo, technical debt remediation to any degree is something you should always be doing as business as usual.  This is the difference between driving from point A to point B in a Honda Civic versus a Hummer; the business value to you as a driver is identical, but the expense and wastage incurred getting there are very different.  These are things we can influence.

All else being equal, technical debt efforts should generally be prioritised based on their ability to make life more efficient for engineering and the platform more robust and extensible, and in ideal circumstances there should be no either-or decision against features.

1.7. A Bad Sell

Earlier in my career I was guilty of trying to sell the “at some point you won’t be able to extend the platform anymore” argument to leadership.  I didn’t realise it at the time, but that comes across as more of an attempt to force leadership’s hand to get some resources on the topic.  Right assertion – wrong approach.  Unless that particular point in time is imminent, the timeline is difficult to predict, and in any event, there are more constructive ways to structure this conversation.  If you put the case in specific, measurable terms relative to factors leadership cares about, you put decision-making power in their hands and they can own that themselves.  This is good because you want them bought-in; after all, they want to de-risk the platform, too.

1.8. Crafting the Message to Leadership

I’ve witnessed many a developer (and been that guy) burn out trying to lead this crusade with management.  This is an unfortunate state of affairs because if any effort to improve a platform is to succeed it’s these developers that you need to tap the most because they know where the figurative bodies are buried.  You and they alike need a language to argue the case and metrics to back it up.

The things that senior leadership cares about are, to name a few:

  • Poor customer experience through slow UI
  • Service outages
  • Inability to innovate at the rate they want
  • Discomfort with release quality
  • Release rollbacks and long service windows
  • Unreasonably high infrastructure costs
  • Unpredictable velocity
  • Disproportionately high development costs and project durations for the value generated

Well-substantiated and framed in these terms, it would be surprising for a leadership team to not at least sit up and take notice.  The problem is that unless there is a real trust relationship with the engineering team, anecdotal evidence alone won’t usually be sufficient to get the traction engineers need to justify budget for a concerted effort to address technical debt - these claims need to be backed with evidence for them to carry weight.  The further up the leadership hierarchy this conversation must go (and thus the more removed they are from engineering), the more important evidence becomes, because at that level, metrics are some of the most important tools leadership has to manage by.  Moreover, the metrics need to be relative to something to give them meaning; a number is just a number without a comparison point.  Some evidence can be hard to gather and making it relatable without skewing its meaning can be tricky.  I’ll provide a few examples below.

1.8.1. Example: Cost of Unplanned Work in a Component Containing Technical Debt

High development costs are naturally caused by any number of factors, but as it relates to technical debt, there are ways to demonstrate that technical debt is contributing to quality or performance issues.  Key to this is understanding the crossover between portions of the codebase where you know technical debt to reside, and the category of work effort being put in around those areas.  Specifically, what percentage of the work is post-release bug or performance fixes versus planned work?

Although there is no magic number that describes an acceptable ratio here in and of itself, graphing the measurement over time reveals a trend that confers more meaning.  For my money though, if this ratio of unplanned versus planned work is pushing 25%, I’m already quite concerned.  I have a large section on an incredibly powerful tool called CodeScene later in this essay that makes this determination much easier when integrated with a work item tracking system.
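As a minimal sketch of the measurement described above, the snippet below computes the unplanned share of effort per period for one component.  The WorkItem record, its field names, and the sample data are hypothetical rather than any tracker's schema, and the ratio is interpreted here as unplanned hours over total hours, which is an assumption; dividing by planned hours instead would be an equally defensible reading.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical work-item record; the field names are assumptions for
# illustration, not the schema of any particular tracking system.
@dataclass(frozen=True)
class WorkItem:
    period: str      # reporting period, e.g. "2024-Q1"
    component: str   # area of the code base the work touched
    hours: float     # effort logged against the item
    planned: bool    # False for post-release bug or performance fixes


def unplanned_share(items, component):
    """Return {period: unplanned hours / total hours} for one component."""
    total = defaultdict(float)
    unplanned = defaultdict(float)
    for item in items:
        if item.component != component:
            continue
        total[item.period] += item.hours
        if not item.planned:
            unplanned[item.period] += item.hours
    return {p: unplanned[p] / total[p] for p in sorted(total) if total[p] > 0}


ALERT_THRESHOLD = 0.25  # the "already quite concerned" level from the text

# Made-up sample data for a component known to carry technical debt.
items = [
    WorkItem("2024-Q1", "billing", 90, True),
    WorkItem("2024-Q1", "billing", 10, False),
    WorkItem("2024-Q2", "billing", 80, True),
    WorkItem("2024-Q2", "billing", 20, False),
    WorkItem("2024-Q3", "billing", 60, True),
    WorkItem("2024-Q3", "billing", 30, False),
    WorkItem("2024-Q3", "search", 40, True),
]

trend = unplanned_share(items, "billing")
for period, share in trend.items():
    flag = "  <-- at or above threshold" if share >= ALERT_THRESHOLD else ""
    print(f"{period}: {share:.0%} unplanned{flag}")
```

The per-period values are what you would plot as the time series; a single period's number means little without the ones before it.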

The amount of remediation work effort can also be made relative to other dimensions of the platform.  Take for example a comprehensive effort to improve the performance of the platform.  Let’s use a hypothetical UI interaction backed by a string of back-end API calls between microservices that causes a slow user experience.  Take it that several of the services and database objects upon which they depend are unoptimized, and for every call between services the transaction cost is getting additively higher.  This would drive several individual debt items that need work, and as they’re tackled individually, they would theoretically increase performance a little at a time, all else being equal.  Measured for a specific transaction type, hours of work effort can be graphed against round trip time.  As you progress, you’ll demonstrate the downward trend in transaction time, thus proving the value of the work.

There are other metrics you could graph against work effort too, such as maximum concurrent user count or infrastructure utilization for the relevant services.  If you can go into a technical debt sell with leadership and let them know exactly which metrics you’re going to measure by, you’ll build leadership’s faith in the team’s ability to continue making a difference and the barrier to acceptance of any future effort may be lowered.  Be creative with how you construct these but make them meaningful and actionable.  Even if all you’re doing is graphing one metric in a time series, the progression tells a story.
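The effort-versus-round-trip-time graph described above can be sketched as follows.  The debt item labels, hours, and timings are invented for illustration; the point is only the shape of the data you would hand to a charting tool, plus a simple check that the trend is heading the right way.

```python
from itertools import accumulate

# Hypothetical measurements for one transaction type: each entry is
# (debt item, hours spent on it, round-trip time in ms measured afterwards).
baseline_ms = 1800.0
completed = [
    ("optimise slow query in service A", 16, 1500.0),
    ("reduce chatty calls between services B and C", 24, 1100.0),
    ("tune database objects used by service C", 40, 700.0),
]

hours = [h for _, h, _ in completed]
latency = [ms for _, _, ms in completed]
cumulative_hours = list(accumulate(hours))

# Points to plot: cumulative effort on the x-axis, round-trip time on the y-axis.
series = [(0, baseline_ms)] + list(zip(cumulative_hours, latency))


def is_downward(points):
    """True if each measurement is no worse than the one before it."""
    values = [y for _, y in points]
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


total_hours = cumulative_hours[-1]
saved_ms = baseline_ms - latency[-1]

for x, y in series:
    print(f"{x:>4} h  {y:>7.0f} ms")
print(f"trend downward: {is_downward(series)}")
print(f"{saved_ms:.0f} ms saved over {total_hours} h of effort")
```

The same structure works for the other metrics mentioned, such as maximum concurrent user count, by swapping what goes on the y-axis and reversing the comparison where higher is better.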

1.9. Technical Debt Initiatives Feel Different

Assuming you’re able to generate buy-in and have reaffirmed the dimensions of the platform that are the most important to leadership, the execution plan needs to be spelled out.  Managing technical debt remediation efforts isn’t like managing a product-driven effort.  For a start, the effort needs a technical product owner versus a “business” product owner.  In the same way a “business” product owner represents the organization’s goals needed to compete in the market, a technical product owner represents the health of the platform.  This could be the CTO themselves (if they’ve the technical chops), and certainly the architect.

Secondly, the mindset of managing technical initiatives is different; they’re driven by different imperatives.  User stories by their strict definition aren’t a good way to define a unit of technical work unrelated to a user journey.  Work efforts are broken down not by the concept of an incremental, releasable MVP but more often by order of execution based on technical dependencies.  The way in which progress is presented to leadership differs too; sprint demos don’t have the same feel because the goals aren’t functional ones.  An accountability mechanism is still needed, but it needs to be based on the very metrics that were presented in the original case so the team can prove they’re making meaningful progress while you’re not necessarily in innovation mode.

1.10. Technical Debt Work Can Motivate Your Teams

For all the well-meaning initiative in the world, failure of a technology executive to succeed in securing budget for technical debt initiatives commonly results in developers becoming disgruntled about “not being allowed” to work on technical debt.  It affects their ability to work efficiently, it causes daily friction they don’t enjoy, and they don’t feel heard.  True or not, it’s a narrative present in the minds of many a developer that affects their intrinsic motivation, so it’s a conversation that needs to be had openly.

In whatever form that message to leadership is crafted, getting it home can instantly turn around a struggling team’s motivations because they see that now there can be a path to making their daily lives easier and enjoying coding more.  This will flush out the better attributes of your developers that may have been in abeyance for some time.  Job satisfaction increases, leadership talents can begin to emerge, and the platform improves.  For all the issues technical debt causes, recognizing it and committing to working on it is healthy for everyone.

1.11. The Role of the Architect in Managing Technical Debt

When it comes to managing technical debt, truly understanding what it is you’re really managing is most critical.  If the concepts underpinning the debt are too abstract for the technology executive, it is more difficult to justify a remediation effort to the business, and therefore to secure the resource allocation or funding you need for it.

Notwithstanding the above, understanding the detailed nuances of these architectural considerations isn’t something the average technology leader should necessarily have to concern themselves with, either.  An architect that pushes too much technical detail upward isn’t working their messaging optimally.  Of course, with a good architect, all this can be eminently manageable, whether you have a technical background or not.  If we put sufficient emphasis on extracting from the mire a strategy that is relevant to the business, what the design enables, what risks it has, and when it becomes a problem if not remediated at some point in time, then these are the things a non-technical leader can manage.  These are the questions that need to be posed to your architect.

1.12. Proactively Managing a Technical Debt Register

Simply, apart from generating an institutional commitment to resources to address it, the most important thing about managing technical debt is keeping yourself accountable by using a tool to track it.  When engineering a greenfield platform, you have the chance to build a discipline here right from the start, making the creation of technical debt items a proactively managed activity in which the second a compromise is recognized or required, a trackable item is created right away so it slips no-one’s mind.  A miss here will likely come back to bite you at some point.

Irrespective of the tool you use to track this, the key to getting this right is creating a team mindset of documenting an item and the trade-offs you’re making in a readily accessible form, and ensuring the business is aware of any big ones because at the end of the day, any large trade-offs need to be owned by them.  It should generally be their choice to work out whether they’re prepared to make those trade-offs and what price they’ll pay and when, not engineering’s.
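To make the idea of a register entry concrete, here is a minimal sketch of what one record might capture.  The DebtItem class, its field names, and the sample entries are all hypothetical; they are simply one way to record the trade-off, the point at which it must be resolved, and whether the business has accepted a large trade-off.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


# Hypothetical shape for a register entry; every field name is an assumption.
@dataclass
class DebtItem:
    title: str
    component: str
    trade_off: str             # what was compromised, and why
    intentional: bool          # knowingly adopted vs. discovered later
    raised_on: date
    resolve_by_trigger: str    # roadmap point at which it must be resolved
    business_goals_at_risk: List[str] = field(default_factory=list)
    large_trade_off: bool = False          # big enough that the business must own it
    acknowledged_by: Optional[str] = None  # who on the business side accepted it


def needs_business_sign_off(register):
    """Large trade-offs that nobody on the business side has accepted yet."""
    return [i for i in register if i.large_trade_off and i.acknowledged_by is None]


# Made-up entries for illustration.
register = [
    DebtItem(
        title="Database sharding deferred",
        component="persistence",
        trade_off="Single database at launch to shorten time to market",
        intentional=True,
        raised_on=date(2024, 1, 15),
        resolve_by_trigger="Before load approaches single-database capacity",
        business_goals_at_risk=["User experience as usage grows"],
        large_trade_off=True,
    ),
    DebtItem(
        title="Duplicated validation helpers",
        component="signup",
        trade_off="Copy-pasted helpers found during review",
        intentional=False,
        raised_on=date(2024, 2, 3),
        resolve_by_trigger="Next planned change to the signup flow",
    ),
]

for item in needs_business_sign_off(register):
    print(f"Awaiting business sign-off: {item.title}")
```

Whatever tool holds the register, the useful property is that an entry cannot be created without stating the trade-off and the trigger for revisiting it.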

1.13. Stack Rank Your Technical Debt Items

The list of items should be ordered with consideration to the aspects of platform and departmental performance that most resonate with leadership, and by this, I mean the above list of items about which they care the most, and that create your largest inefficiencies.  To know what’s important to them, you have to ask them.  On a scale of handling disabling debt items to creating efficiencies, to removing niggling items that annoy developers, each technical debt item can be categorized, allowing the items to be ranked in order of importance.  Naturally some may need to come first because of their blocking effect on other items, or because there is a technical dependency on their being done in a specific order, but ultimately the goal is to arrive at a list that is reflective of the values and imperatives that are most relevant to the organization as a whole.  These can change over time too, so this needs to be treated as a “living list”.
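The ranking described above, by category first but with blockers pulled forward, can be sketched as a priority-aware topological sort.  The three category names, the item ids, and the rule that a blocker inherits the urgency of whatever it blocks are assumptions made for this illustration, not a prescribed method.  The sketch also assumes every id mentioned in blocked_by appears in items.

```python
import heapq

# Hypothetical categories, most to least urgent; a lower number ranks first.
CATEGORY_RANK = {"disabling": 0, "efficiency": 1, "niggle": 2}


def stack_rank(items, blocked_by):
    """
    items:      {item_id: category}
    blocked_by: {item_id: set of item_ids that must be done first}
    Returns item ids ordered by urgency, never placing an item before
    the items that block it.
    """
    # A blocker inherits the urgency of the most urgent item it blocks.
    urgency = {i: CATEGORY_RANK[c] for i, c in items.items()}
    changed = True
    while changed:
        changed = False
        for item, blockers in blocked_by.items():
            for b in blockers:
                if urgency[item] < urgency[b]:
                    urgency[b] = urgency[item]
                    changed = True

    # Kahn's algorithm, using a heap so the most urgent ready item goes next.
    remaining = {i: set(blocked_by.get(i, ())) for i in items}
    dependents = {i: [] for i in items}
    for item, blockers in remaining.items():
        for b in blockers:
            dependents[b].append(item)

    ready = [(urgency[i], i) for i, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    ordered = []
    while ready:
        _, current = heapq.heappop(ready)
        ordered.append(current)
        for d in dependents[current]:
            remaining[d].discard(current)
            if not remaining[d]:
                heapq.heappush(ready, (urgency[d], d))

    if len(ordered) != len(items):
        raise ValueError("circular dependency between debt items")
    return ordered


# Made-up register items for illustration.
items = {
    "rename-helpers": "niggle",
    "split-shared-db": "disabling",
    "upgrade-data-layer": "efficiency",
    "slow-build": "efficiency",
}
blocked_by = {"split-shared-db": {"upgrade-data-layer"}}

ranked = stack_rank(items, blocked_by)
print(ranked)
```

Ties within a category fall back to alphabetical order of the id here; in practice that tie-break is where the conversation with leadership about what matters most belongs.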

I’ll go into more detail on the concept and function of the Architecture Committee later in this essay, but of all the tasks with which such a committee should be charged, leading the information gathering and ranking exercise and managing this register over time should be one of its most important.  It’s really important to remember that the engineers you have got you where you are – for all the good and bad – and you must cast a critical architectural eye over what results.

1.14. Managing Technical Debt in a Greenfield Project

I’ll reiterate that technical debt is a fact of life and successful leadership knows how to strike a balance in managing it.  Design compromises will always have to be made, and those will create positive outcomes for now, but they will also come with outcomes that need mitigating at some point in time.  Again, sensibly managed, this can be a tool for leadership to speed time to market for a feature.

The need to strike a balance becomes especially self-evident with greenfield development projects.  By way of example, for a web application that isn’t expected to launch with millions of users right out of the gate, it might be acceptable to defer implementation of some aspects of a scalability strategy such as database sharding for a time.  Some early implementation choices may enable quicker time to value, knowing that at some point your team may need to take time out of product roadmap (unless you have a dedicated team) to revisit this because the system will become stressed if you don’t.  This is important for the business to understand because it will come to negatively impact user experience at some point; you’ll just pay the price for a quicker launch somewhere else down the road.

1.15. Managing Technical Debt in a Legacy Project

Older brownfield projects are a different animal because all the choices have already been made that got the platform to where it is.  This debt isn’t an enabler for the business anymore – it’s just a liability.  Some projects may not even have historically had a formalised strategy for managing technical debt at all.  If you’re taking over a brownfield project, you must create one, perhaps along the lines described in this section.

In this situation, and even with an aggressive roadmap and release schedule, it should be possible to get agreement up and down the leadership chain that conducting a technical debt assessment just makes sense, if only from the viewpoint that you need to understand your risks and liabilities.  Such a recommendation should be made, it should be timeboxed, not protracted, it should include multiple resources spanning the whole stack, and it should culminate in the delivery of a technical debt register.  I say that this effort should not be protracted because it’s important to remember that technical debt can be dealt with incrementally too, and as the platform improves over time some technical debt items will drop out of existence as a side effect of another effort or diminish in relevance, so overthinking it at the start might just waste time and frustrate leadership.

In conducting said analysis, developer interviews are naturally the first place to start, but it’s important to understand that biases will present themselves that can colour the results.  Most developers who have spent any time in a platform will have their own pet peeves that they’d like to see fixed, but these cannot be viewed in isolation or at face value as they may not be the most important items to address for the platform.  Developers who have been there a long time will have more historical context though, understand the reasons certain choices were made, and be able to illuminate deeper issues.

Never forget though, that your more experienced platform developers may be responsible for the implementation of the less-than-ideal patterns that got the platform where it is, and you must aim to get a sense of this.  It can be valuable to bring in newer developers who can look at the platform with fresher eyes and be critical of it from a design, patterns, and practices viewpoint without necessarily deeply understanding the system.  Both these viewpoints are beneficial to developing a balanced plan to improve the platform.

As I’ll go into in a later section describing the CodeScene tool, some of these biases can be avoided if CodeScene is also used to comprehensively examine the whole code base; it can help you map out the glaringly problematic areas, including those that have a high density of bug fix commits.  Taken together with developer interviews, this can illuminate problems the developers themselves don’t appreciate; this is a powerful combination to get you started on building and tracking your technical debt roadmap.
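To give a feel for what "density of bug fix commits" means, the sketch below computes it by hand from a list of commit records.  This is not how CodeScene works and does not use its API; the commit data, the keyword heuristic for spotting fixes, and the minimum-commit cut-off are all assumptions for illustration.

```python
from collections import Counter

# Crude, assumed heuristic: a commit is a bug fix if its message contains
# one of these markers. Real histories need a better signal, such as a
# link to a defect in the work item tracker.
FIX_MARKERS = ("fix", "bug")


def is_bug_fix(message):
    lowered = message.lower()
    return any(marker in lowered for marker in FIX_MARKERS)


def bug_fix_density(commits, min_commits=3):
    """
    commits: iterable of (message, [paths touched])
    Returns [(path, fix commits / all commits)] sorted worst first,
    ignoring files with fewer than min_commits commits.
    """
    touched = Counter()
    fixes = Counter()
    for message, files in commits:
        fix = is_bug_fix(message)
        for path in set(files):
            touched[path] += 1
            if fix:
                fixes[path] += 1
    density = {p: fixes[p] / n for p, n in touched.items() if n >= min_commits}
    return sorted(density.items(), key=lambda kv: kv[1], reverse=True)


# Made-up commit history for illustration.
commits = [
    ("Fix rounding bug in invoice totals", ["billing/invoice.py"]),
    ("Add export endpoint", ["api/export.py", "billing/invoice.py"]),
    ("Hotfix: null customer on invoice", ["billing/invoice.py"]),
    ("Fix invoice date parsing", ["billing/invoice.py", "util/dates.py"]),
    ("Add CSV export option", ["api/export.py"]),
    ("Refactor export serializer", ["api/export.py"]),
]

hotspots = bug_fix_density(commits)
for path, share in hotspots:
    print(f"{path}: {share:.0%} of commits are fixes")
```

Cross-checking a list like this against what developers say in interviews is one way to see where their pet peeves and the evidence agree or diverge.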

1.16. Resource Alignment for Legacy Technical Debt Efforts

If you can make a convincing enough argument to get the approval to utilize some resources on your team – congratulations, and don’t mess it up because you won’t get a second chance; it’s a big bet for the business when they’ve got a live platform with lots of users.  Unless the platform is seriously ill and the business wants to throw all your resources into a resuscitation effort, you’ll get approval for a percentage of your team to be dedicated to it.  Now that you’ve got the green light, you need to learn to manage the challenges of working technical debt efforts on parallel paths with other functionally oriented teams.

The effort should be structured in consideration of how your team is aligned relative to the platform’s architecture.  Teams that are aligned vertically over their portion of the platform are best suited to doing technical debt remediation work in their own area instead of, say, another horizontal team created for this purpose (although getting fresh eyes on something is always useful).  These engineers know their own code; align them with a patterns-aware senior engineer and guide them in fixing it themselves.  You’re more likely to have success if the whole vertical team is focused on this effort, to the extent they can do so without treading on each other’s toes and creating frequent merge conflicts.  A horizontal team messing around in their code while they go about their business can just get in the way.

An engineering organization with generalist teams, as might exist in smaller development organizations, doesn’t really require a different approach, all else being equal – it’s just a matter of focus.  A team executing a technical debt remediation effort should be allowed to mess around with that portion of the code base in isolation, without overlap with functional activities to the extent permissible.  Code restructuring can change a lot of the assumptions made by the original code and continuing to create or modify functional code in line with the original expectations will likely cause compatibility issues when the two efforts are merged back together.

In both cases, your teams need to be smart and painstakingly careful with their branch and release management.

1.17. Resource Allocation - Create A Permanent Lane

Managing technical debt initiatives is such a detail-intensive, technical process that you should take any opportunity to make it more manageable.  For me, the best-case scenario is to negotiate a permanent, dedicated, non-accountable lane for resources for platform health initiatives (regardless of how this aligns with your actual team structure).  This sends a big message to the engineering team: we trust you and we value platform health.  This has the potential to alleviate much project management overhead, because now items don’t necessarily have to be as tightly interleaved with regular product management activity, and they don’t compete as heavily with product roadmap management activity either because they’re always there, ticking away in the background.

I say the lane should be non-accountable as far as the business is concerned, but for the avoidance of doubt, this still needs to be tightly project managed by Architecture, entirely in line with the value-sell made to leadership to begin with.  This isn’t carte-blanche to do whatever they want.  The difference is one of trust; being trusted, the architecture management can now focus more on managing delivery of technical value than having to manage upward.  Either way, the team must be able to show how their efforts are bearing fruit in measurable ways, contributing to the value proposition, and staying in touch with leadership to re-evaluate mission needs as they progress.

Lastly, the lane concept should not be taken to mean that the same set of resources is involved in these efforts all the time; as stated above, use the right resources as most appropriate to your architecture and team alignment.

Next Up

The rate of accumulation of technical debt is often a function of how well your team can create healthy code.  There are distinct ways in which code hygiene can be ingrained into your team and monitored to ensure ongoing consistency.  As a technology executive it is important to understand the criticality of this to your success; if you have no tools to manage it you are very vulnerable to forces not fully within your control.

You can find my discussion on this topic in my next post titled "Code Quality, Entropy, and Bugs" here.

1. I have myself made the choice to accumulate some tech debt on multiple occasions, and I have been able to do so because I had a well-managed technical debt register (see below), a clear future state vision, and a team that cared about it and was onboard with the tradeoffs we were making.
