Tuesday, March 21, 2017

Turning Our #NoEstimates Game Up A Notch

In an earlier post, I described how one of my teams went down the #NoEstimates route. We had pretty good success going down this path. As a manager and a coach, it validated for me that how long it took for a story to get done had little correlation with the size or complexity of the story. In an organization with 30 development teams, we were definitely not the first team to stop estimating stories. We had other firsts under our belt, but that was not one of them. Very soon though, there was not a single product development team still using story point estimates. It was the end of story point estimation, but, as much as I would like to tell you it was, it was not the end of estimation.

First, a quick description of our work breakdown structure. We take long-term strategic initiatives and break them down into features. These features are then broken down into stories. The teams have the freedom to decide if they want to break these stories further down into tasks or not. Most teams choose not to use tasks as they do not provide much value to them.

At the story level, as described in the previous #NoEstimates post, limiting our WIP and right-sizing the stories made us more predictable than we had ever been in years of trying to get estimation right. The transition was easy for developers, as they had always found the estimating conversations time consuming and wasteful. It was also easy for the folks on the product side of the house: the predictability gained counteracted the desire for story-level estimates. The other reason the business did not have an issue with us going to #NoEstimates was the level at which we delivered. We, as an organization, delivered Features, not Stories. Eliminating story points stopped teams from spending time guessing the exact "size" or exact "complexity" of a story. It did not stop the business from asking the teams to spend time guessing the exact "size" or exact "complexity" of a feature.

Teams across the organization were no longer estimating stories but were providing estimates for features in the following format -

The product team at that point would take these estimates and the team's forecasted capacity for the release into consideration for release planning. The product team would play 'Tetris' in order to fit as many of the features into the release as possible, based on these feature estimates. For example, if the team working on the features listed above has a projected capacity of 80, the release plan would say that they can do features A, B and C (12 + 55 + 7 = 74). These features would then be advertised as the ones we are committing to for the upcoming release.
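To make the 'Tetris' concrete, here is a minimal sketch of that style of release planning: greedily fitting feature estimates (expressed in story counts) into a forecasted capacity. The feature names and numbers are illustrative, echoing the example above; they are not real data.

```python
# Hypothetical release "Tetris": fit as many features as possible,
# in priority order, into the team's forecasted capacity.

def plan_release(features, capacity):
    """Return (planned feature names, capacity used)."""
    planned, used = [], 0
    for name, estimate in features:
        if used + estimate <= capacity:  # feature fits in what's left
            planned.append(name)
            used += estimate
    return planned, used

# Features in priority order, with story-count estimates.
features = [("A", 12), ("B", 55), ("C", 7), ("D", 20)]
planned, used = plan_release(features, capacity=80)
print(planned, used)  # ['A', 'B', 'C'] 74 -- D (20 stories) does not fit
```

The exercise looks tidy on paper, which is exactly why its problems are so easy to miss.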

On the face of it, this seems like a completely logical and sane way to go about release planning. Unfortunately, this approach suffers from the same problems that story-level estimates have.
  • It is impossible to know the exact count of stories in a feature beforehand, regardless of how much analysis we do on the feature.
  • Features, more often than not, "grow", and more stories are added while they are in flight. Planning to full capacity puts the whole release at risk if even one of the features grows.
  • In order to ensure on-time completion, the largest of the features has to be started on Day 1 of the release, even if it is the lowest-priority feature. This almost always puts higher-priority features at risk.
  • The estimates are based on initial guesses which, even when later invalidated, did not always invalidate the commitment built on top of them.
  • Large features are turned into multi-release efforts as opposed to being broken down to an appropriate size.

After struggling with these issues for a number of releases, we started to attack these problems. We realized that we already had a blueprint for how #NoEstimates had worked for us at the story level, and we decided to apply the same approach to features. To start, we collected some data on our features. We discovered that 85% of our features consisted of 25 stories or fewer. This was our stake in the ground. We asked teams to start using this as a yardstick for whether a feature is too large or not. Teams had already been doing this for stories; each team was very adept at "right-sizing" stories based on its data. The other step necessary to get #NoEstimates to work at the feature level was to limit WIP at the feature level. We asked teams to establish feature-level boards and limit WIP at each stage of the feature lifecycle (Selected, Analysis, Development, Test and Done, to start with). Teams had, again, already been doing this at the story level, so this would be a natural transition for them. We, as a coaching group, set these guidelines and left the implementation in the hands of the teams.
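The 85% analysis can be sketched in a few lines. The story counts below are invented for illustration (chosen so the yardstick comes out at 25, matching the number we found); the real analysis ran over our actual feature history.

```python
# Find the feature size (in stories) that 85% of features fall at or under.

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Invented historical story counts per completed feature.
stories_per_feature = [3, 5, 8, 9, 11, 12, 14, 14, 16, 18,
                       19, 21, 22, 23, 24, 25, 25, 31, 45, 60]
threshold = percentile(stories_per_feature, 85)
print(threshold)  # 25 -- the "is this feature too large?" yardstick
```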

The results were mixed. Some teams, for various reasons, including external pressures, implemented a loose feature WIP and didn't quite go down the path of sizing features. Other teams took the advice to heart, implemented strict WIP limits on the number of features they would work on, and would not accept any feature that was above 25 stories. When a feature looked like it would be much larger, they broke it up into smaller features that could flow through their process smoothly. There were some things missing though. Even the teams that took the approach head on did not do everything that they had been doing to make themselves predictable with stories. They were not breaking up features into the smallest deliverable features as they gained more information about them. They were violating their feature WIP limits when emergency requests came in. In other words, our roll-out was only semi-working, which means we were becoming only semi-predictable.

Of late, we have changed our approach a bit. We have picked one team to work closely with and try out our approach. Apart from ongoing coaching, we are actively encouraging them to not provide feature estimates and to break up features as much and as often as possible, especially as they approach the 25-story mark. We are trying to turn our #NoEstimates game up a notch. We are taking all the principles that we know work with stories and applying them to Epics/Features: Limit WIP, Control Batch Size and Manage For Flow at the feature level, the same way as we do at the story level. Once we have these concepts proven out with this one team, we will roll them out to the other teams in the department as well. Hopefully, all estimation conversations turn into a simple conversation: "Does this work item look like it is too large for this level? If yes, let us break it up; if no, let us start work on it."

We expect that this approach will make us more predictable with our features. The predictability gained should allow us to answer the question "When will it be done?" without lengthy estimation conversations. The expectation is that this approach will also help us get to Just In Time commitment, where a feature is committed to only after it has been started. Before the feature is started, the business can easily de-prioritize it and replace it with a different right-sized feature. This should allow us to be more agile and respond to market pressures and feedback more often. The hypothesis is that limiting WIP and controlling batch size allows these things to happen naturally. Watch this space for results in the future.

Tuesday, January 3, 2017

Why Scrum Sometimes Works And What Can You Do About It?

Scrum sometimes works great and at other times is a constant source of dissatisfaction for both developers and management. If Scrum is working well for you, most likely you have already started modifying the "rules" of Scrum to fit your context. On the other hand, in a failing Scrum implementation, you are either looking to give up on it or are about to bring in a "Scrum Expert" to help you course correct. The first reaction from your new expert friend is most likely going to be - "You are doing Scrum wrong". Scrum has some pretty straightforward rules which, admittedly, are easy to get wrong. There is probably a very low percentage of teams that adhere to all the rules in the Scrum Guide. In fact, in my personal experience, there is little correlation between adherence to the Scrum framework and the degree of success of teams.

Note: The intent of this post is not to say that Scrum is a bad methodology. It is to say that doing Scrum "by the book" is very hard. Scrum itself has brought many great things to software development, as acknowledged here. This post tries to point out that those great things are available even without a full-on adoption of Scrum. 

That does not change the fact that many teams do see improvements when Scrum is introduced. Why does this improvement happen? And if success and improvement have little correlation with the degree to which the Scrum framework is implemented, can the same improvement be seen without ever implementing Scrum? Below is my take on some of the reasons teams see improvements with Scrum and how you can get there with or without it.

Limiting Work In Progress

Scrum forces your teams to concentrate on fewer work items. Only a certain number of stories can fit into the sprint, which imposes a team-wide work in progress limit. The team is able to concentrate on the few items in the iteration. This is usually a huge shift from each developer having 20 work items active at once. Any developer will tell you how costly that is: there is constant context switching, and because you are caught up in 20 items, none of them gets done. The stories that do get done are not done with the quality that a decent developer would be proud of. Context switching kills both productivity and quality, and limiting the number of work items you are working on helps with both. Reducing "work in progress" also has the direct result of every item in progress getting done faster. That, in turn, results in more things getting done on a regular basis. There is mathematical theory behind this. If you are interested in further details on the math, please read more about Little's Law here.
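Little's Law itself is one line: average WIP = throughput × average cycle time. A quick sketch, with made-up numbers, shows why cutting WIP shortens how long each item lingers:

```python
# Little's Law: WIP = throughput * cycle_time, so on average
# cycle_time = WIP / throughput. Numbers are illustrative.

def avg_cycle_time(wip, throughput_per_week):
    """Average weeks an item spends in progress."""
    return wip / throughput_per_week

# A team finishing 5 items per week with 20 items in flight:
print(avg_cycle_time(20, 5))  # 4.0 -- each item takes ~4 weeks
# Same throughput, but WIP limited to 5:
print(avg_cycle_time(5, 5))   # 1.0 -- each item finishes in ~1 week
```

Same output rate in both cases; the only thing that changed is how long any individual item sits "in progress".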

Here is the fun part - The benefits of limiting work in progress, do not have to come through Scrum. Simply having a rule that no developer can ever work on more than one work item (or even less than one) will produce very similar if not better results. This is not an easy change for most organizations. Scrum does not prescribe it explicitly either. This could be the reason why many Scrum implementations don't achieve the success they are expected to. If you are considering going Agile, this is the one single change I would start with. Explicitly limit the number of things your developers are working on.

Small Batches

The entire idea of having time-boxed iterations in Scrum forces the team to think of the work they need to accomplish in small, incremental units. The Scrum Guide encourages these to be units that are close to a day in length. This often means that the limited amount of work the team is taking on is further broken down into smaller batches. Small batches have great benefits. They help find mistakes early, whether in the requirements, the code or the tests. They avoid big up-front design and heavy architecture work that often ends up as waste; instead, design and architecture emerge as new needs are discovered. The team continually makes measurable progress towards its goals rather than having little idea of where it stands in the overall picture. Small batches also bring a lot of predictability. Most successful Scrum teams reinforce this by saying that they will not work on anything estimated higher than 5 or 8 story points. This is not a prescribed rule of Scrum, but one that seems to be commonly used by successful Scrum implementations.

Interestingly enough, small batches are completely achievable without adopting Scrum. Developers can break down work items into batches that take 2-3 days to get done. Every time they pick up a work item, they can ask the question - can this be done in less than 3 days? If the answer is yes, they start work; otherwise, they break it down into pieces that are achievable in 3 days or less. Of course, they are allowed to be wrong, and some items will take longer. This approach will give you the same benefits of small batches with or without the adoption of full Scrum. If your current Agile implementation lacks an emphasis on small batches, that is another easy win.

Collaborative Improvement

Scrum uses its ceremonies - Planning, Daily Standup, Sprint Review and Retrospective - as tools to establish collective ownership of the process and the product within the team. The most powerful of these, for long-term improvement, is the retrospective. This is the activity where the team gets together at the end of an iteration and figures out the things that are going well and the things that can be improved. Scrum did a great job of taking the job of "figuring out efficiencies" away from managers and handing it to the teams. There are numerous retrospective techniques out there; as long as teams are looking to improve, you can employ most of them and get successful results.

For some reason, teams adopting Scrum are forced into using the same cadence (sprint length) for Planning, Retrospectives, Stakeholder Reviews and delivery. There is no reason for these cadences to be intertwined. Scrum puts retrospectives at the end of iterations or sprints. They don't have to be this way. If you are not doing Scrum and don't have sprints established (and there might be no good reason to), you can do retrospectives as and when an issue that needs the team to get together pops up. This goes hand in hand with working in small batches.

Small batches work effectively whether we are talking about developing software or making improvements on a team. Instead of building up a batch of issues to talk about, let us take care of things as soon as they come up. This might mean that there is no established cadence for a retrospective and that, in my opinion, is perfectly fine. It might work better than having a backlog of painful items build up for two to four weeks. A large retrospective backlog can lead to ineffective retrospectives as not all the important topics can be talked about before people burn out. Retrospectives, themselves, are not unique to Scrum. You do not have to be doing Scrum in order for your team to improve collaboratively. Have retrospectives as and when you need, give the people doing the work, the power to improve collaboratively.

User Feedback

Central to the entire idea of Agile is getting feedback from users and stakeholders. Scrum does this by having a sprint review/showcase at the end of every sprint. This is what really puts the Agile in Scrum, especially if you are smart enough to make decisions based on the user feedback. The problem is that Scrum regiments user feedback into the end-of-sprint review step. It is the same cadence-matching problem that retrospectives suffer from. The earlier we get this feedback the better, so why wait until the end of the sprint? Get the feedback as soon as you have something ready.

If you are working in small batches and limiting your work in progress, you are likely to have something ready for review every day. These changes should not have to wait until the end of a sprint to be shown to users and stakeholders. For a developer, 2 weeks is an eternity when it comes to remembering what he/she did. There is a great deal of efficiency in faster feedback: we can tweak something a developer just worked on before the dev has moved on to a different part of the system. Get feedback, course correct and deliver value as early and as often as possible.

The (Re)Starter Kit

Scrum is often touted and used as the starter kit for Agile. The problem is that Scrum, while appearing simple on the outside, is very hard to "do right". The reason people are attracted to Scrum as a starting point is usually that Scrum is a documented recipe: do these things and you will be Agile. Every Agile coach will tell you that just doing the steps does not make you Agile. Unfortunately, that same recipe is the reason why Scrum adoptions fail and need reboots. Developers are analytical beings. If they believe that the recipe produces the Agile cake, they will follow it in every detail. The focus shifts from the intent of the law to the letter of the law, and a system designed to help developers quickly changes into a way of micromanaging them.

Inflexible rules lead to inflexible processes, and inflexible processes by their nature are not great at adapting to your context. In order to be successful with Scrum, most of the time you have to be flexible and tweak the rules to fit the context. Is it necessary, though, to have the rules in the first place? Why not start with principles and let them define the rules in our context? The basic premise of Agile, in my opinion, is this - deliver early and often and use feedback to determine the future course you need to take. Let us take that premise and work with it in order to make things better.

 Agile does not mean Scrum, although Scrum can at times be Agile

If Scrum is failing you, or if you haven't tried it yet, there might be a simpler way to dip your toe in the Agile pool. Start with the four principles here (or a subset of them), and measure the gains they get you. The rest of the Scrum framework is barely required if you want to be Agile. 
  • Limit Work In Progress
  • Work In Small Batches
  • Improve Collaboratively
  • Get Rapid User Feedback
I would argue you can be more Agile with these four changes than you would be if you adopted full-on Scrum. Agile does not mean Scrum, although Scrum can at times be Agile. This is not to say that Scrum is a bad way to go. The point is that there are some background "intents of the law" that make Scrum work. It might be simpler to adopt these intents in order to create an Agile mindset, as opposed to adopting an Agile framework. Teams see improvement with Scrum not because they strictly adhere to Scrum rules, but because of the intentional or unintentional adoption of practices that make them Agile.

In the interest of small batches and of limiting your WIP, pick one of these four principles and start there. This might be more effective than taking on an entire framework. Each of these principles in themselves is not easy. Wouldn't it make sense to try one smaller difficult thing, rather than a set of multiple difficult things at the same time?

Tuesday, December 27, 2016

Handoffs Create Heartbreaks - A Christmas Story

Our story begins and ends at one of Florida's largest tourist attractions - The Sawgrass Mills Mall. Just like any good procrastinating parent, I waited until the last week before Christmas to get the earrings that my daughter had (strongly) hinted at. Over lunch on the 21st of December, I headed to the Swarovski store at Sawgrass Mills mall. This was also an economical decision in terms of time invested, as I was able to pick up a gift for my wife at the mall, as well. 

My daughter had shown great interest in two separate pairs of earrings. As is the most efficient method available at these stores, I walked straight to the case and located the two earrings. I asked the shopping assistant working the floor to help me get the earrings, as they were behind a locked case. The assistant, a very nice and personable gentleman, let us call him Joe, was happy to help. The fact that this was the week before Christmas meant that the store was pretty busy. Joe, did his best to help me while still helping two other customers. Since I knew exactly what I wanted, my transaction was simple and Joe led me straight to the payment counter after finding the appropriate boxes.

There were still the other customers on the floor that Joe had been helping. As Joe started ringing me up, his colleague (let us call her Mary), who had just finished ringing up another customer, suggested that he return to the floor and she would finish my transaction. Mary's was a logical proposal, as this arrangement would make sure that all transactions proceeded unimpeded: I would be able to pay for the earrings I was buying, and the other customers would get help making the best selections possible in their context. Joe took the offer, said a courteous "Happy Holidays" to me and went over to help the other customers. Mary helped with the bagging of the boxes and the finalisation of the bill. I was happy that I had been able to get two pairs of earrings for my daughter in the space of 10 minutes, as this left me with time to pick up a gift for my wife as well.

After getting home, I hid the earrings and again, like a good procrastinating parent waited until the afternoon of the 24th to actually wrap the gifts. I wrapped the two sets of earrings in one gift pack and placed it under the tree for my daughter to open on Christmas morning. For any parents with a teenage daughter, the excitement is easy to imagine. You are so sure that jewellery is going to be a success. Even if every other gift is rejected, jewellery, which is backed by strong hints, is absolutely going to work.

The night passes, and I am sure Santa has done well. My wife, my daughter, our German Shepherd and our Maltese are all gathered around the tree. We start opening gifts or sitting in boxes, based on our preference (you have no idea how small a box a 90 lb German Shepherd thinks she can fit inside). It is my daughter's turn and she is unwrapping the box with the earrings. I have done well; she has no hint of what is inside. She finally sees the Swarovski boxes and immediately knows what she got for Christmas. She opens the first box and there are the crystal hoop earrings that she wanted. The thank yous, kisses and hugs are being distributed. She opens the other box and... nothing! The box is empty. We shake the box, turn it in every direction, close it and reopen it, but the earrings do not appear. This has just turned into an anti-climactic heartbreak. The starfish-shaped earrings that I bought for my daughter are nowhere to be found. I show my daughter the picture on the box, to assure her that I had bought the right earrings, and tell her that I will visit the store soon after Christmas to get the earrings that she wanted and that I paid for. That box was definitely not worth the money I paid for it.

Not the day after Christmas (because I love my daughter, but I hate crowds), but the day after that, I headed over to the mall. Luckily, the "returning of the gifts" crowds had died down enough for me to find decent parking and not hyperventilate in an over-crowded mall. I made a beeline straight for the Swarovski store and was met there by two shopping assistants (not Joe or Mary) who were working at the time. One of the assistants, let us call her Kate, approached and asked what she could help me with. I explained the problem and asked if I could get the earrings that I had purchased. Kate seemed very surprised by the request. She explained that it was store policy to show the customer the box before finally closing it and bagging it; only after the customer has verified the contents are they supposed to bill the customer. She saw the receipt and remarked that Joe "is very good" and that she was surprised there had been a slip-up in the processing of my purchase. Kate asked me if I could wait to talk to the manager so that she could take care of it. I didn't mind waiting, and when the manager became available, she promptly took care of the matter by getting me a pair of the same earrings (after doing her due diligence of checking the video tape, of course).

If Joe is a well-respected employee, with a reputation such that people are surprised when things go wrong, how did things go wrong? Putting on my Agile Coach/Process Junkie hat, I can see exactly where the issue occurred. When Joe handed my transaction over to Mary, there was some loss of information. Mary assumed that Joe had already shown me the box, and Joe assumed that Mary would show me the box with the earrings inside. Neither of them did. This is why handoffs are dangerous. They might seem efficient at first, but there is always some loss of information when the handoff happens.

Think of all the handoffs that a single work item in software goes through. Customer request - Product Owner - Business Analyst - Software Engineer - Quality Assurance - Build Team - Operations - Customer. It is a long game of telephone where any one of these handoffs can result in a heartbreak. In the case of the earrings, it took just one handoff to cause the issue. Closing the handoff loops and eliminating handoffs is one of the reasons Agile first took flight. DevOps is the latest iteration of this. Fewer handoffs, result in fewer communication issues.

This does not mean that in every organisation all handoffs will be eradicated. Handoffs will exist, but wherever there is a handoff, there need to be explicit rules. There have to be explicit exit criteria before the work item can exit a stage. In order for Joe to hand something off to Mary, he should let her know some basic information, including whether he has shown the contents of the box to the customer and received an acknowledgment. Ideally, there is no handoff, but if there is one, the policies for the handoff should be explicit and understood; otherwise, there will be many heartbreaks on Christmas mornings.

Monday, December 12, 2016

How One Team Went #NoEstimates

This post is mostly a case study on one team. This is a team that I led for a couple of years. We challenged our way of working often. We changed our way of working often. We tried out changes and kept the ones that improved the rate at which we were getting things done and the quality of the work that was getting done. In early 2015, after being exposed to #NoEstimates, I decided to try the approach with the team.

As a traditional Scrum team, we used to observe all the major Scrum ceremonies. The major ceremonies, though, morphed and changed a lot for us over time. In April 2014, 8 months before our "No Estimates" move, we had already stopped doing sprint planning. We had a pretty steady velocity of about 30 points. Instead of spending 2-3 hours estimating and planning out each sprint, we would put an estimate on a story as we were about to start work on it. We had one rule with this "just in time" estimation - we would never work on a story that was estimated at more than 5 points. We discovered two advantages with this approach. First, instead of taking the entire team offline for half a day, we would only take 3-4 people offline for 5 minutes per story at different points during the iteration. Second, it allowed the product owner to reorder the priorities mid-sprint. We were fine with any work that had not yet started being reprioritized. Working on a product that is affected by local, state and federal regulation changes, this allowed the product owner to change priorities whenever any government changed its mind.

By the middle of 2014, we had discovered that our no-planning experiment was working out very well. To be clear, by no-planning I mean not doing sprint planning. We still had dates to hit; we had just realized that we were just as predictable whether we did sprint planning or not. January 2015 is when we first started questioning whether the estimates themselves were providing any value. I pulled up the numbers from the last four iterations (which included the Christmas and New Year's breaks) to take a closer look at what value the estimates were giving us. Below is the breakdown of the points per iteration and points per story.

What the numbers told us here was that if we had kept the same rules (don't work on anything that takes too long) and estimated everything as 2.5 points, we would have achieved the same results. Based on this analysis we did the following (this is an excerpt from an email documenting the changes) -
  1. We bring in stories that are as small as possible (just as we do today).
  2. If any story looks like it will take the dev + QA kicking the story off more than 6 working days to complete, then split the story down.
  3. For reporting purposes, consider all stories to be 2.5 points.

Eventually, as the organization got used to fewer teams reporting points and more reporting story counts, we got rid of bullet number 3 as well. Our estimation discussions became sizing discussions. We went from "How many points is this story?" to "Is this too big to work on?". This, surprisingly, made us more predictable in terms of the number of items we could get done over a time period. It also gave our PO the ability to pivot more frequently.

As a matter of coincidence, our Product Owner used to multiply the number of stories in a feature by 2.5 to figure out, based on our velocity of 30 points per sprint, roughly when that feature would be done. We took that indirection away as well. He wasn't multiplying things in his head and we were not spending time estimating; we were both simply counting stories. Challenging the norm, in this case, seemed to be making things easier for us, without any adverse effect.
Without deviation from the norm, progress is not possible. -Frank Zappa
I am personally a big fan of sizing - finding out if something is too large and needs to be broken down. I am no longer a fan of estimating. Story points are a level of indirection that the entire Agile movement could have done without. There is usually the argument that they are a good way of getting teams used to sizing their work. I agree, they are, but I am sure there are better alternatives out there for introducing sizing. Removing story points did not change our throughput or predictability, so, in lean terms, they were a form of waste. I would highly recommend doing the math for your teams and finding out if story points are providing you with value. Maybe they are, but if they are directly and consistently proportional to story counts, most likely they are wasted effort.
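As a hedged sketch of what "doing the math" can look like: if completed points per iteration track completed story counts at a near-constant points-per-story ratio, the points are adding no information beyond the count. The iteration data here is invented for illustration.

```python
# Compare points-based velocity against plain story counts.
points = [30, 28, 31, 29]   # velocity per iteration (invented)
stories = [12, 11, 12, 12]  # stories completed per iteration (invented)

ratios = [p / s for p, s in zip(points, stories)]
mean = sum(ratios) / len(ratios)
spread = max(ratios) - min(ratios)
print(round(mean, 2), round(spread, 2))  # 2.51 0.17

# A small spread around a stable mean (~2.5 points per story here)
# suggests counting stories would forecast just as well as points.
```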

Friday, November 11, 2016

The Polls Were Wrong, But Not As Wrong As You Think

Were the polls wrong? To some extent they were. What was worse though was our understanding of the polls. We took them at face value. Whether it was the media, the pollsters or the public in general, we talked about the polls in a deterministic way. Nate Silver's website, on the other hand, presented its results in a probabilistic manner. If we look at those results, things start to make a lot more sense. Yes, the underlying polls were inaccurate, but Silver adjusts for this to some extent.

As we have discussed before, the quality and accuracy of projections coming out of Monte Carlo depend squarely on the model being used as input for the projections. Silver tries to make the model a better one by adjusting the poll results. I am not sure what the Nate Silver secret sauce is, but my guess would be it involves looking at past performance of the particular poll. Also, Silver looks at a multitude of polls and combines their results. This results in a model that is not influenced greatly by one incorrect poll. It is most likely, a weighted combination where historically better performing polls have more of an impact than the more inaccurate polls. A great description of how the adjustments are done to the polls is available on Nate Silver's website - http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast

Silver then runs simulations based on these adjusted polls. After running 20,000 simulations using a thoroughly well-adjusted model, Silver's team is able to project, probabilistically, the chances of each candidate winning the election. This was what the forecast looked like on the eve of election day.

What the above result means is that more than 1 out of 4 simulations resulted in Trump winning. Hillary had about a 70% chance of winning the election. Yes, that is a greater probability than what Trump had, but it is around the same probability as that of rolling a 3 or more on an unloaded, six-sided die. It is by no means a slam dunk.
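To make the probabilistic framing concrete, here is a toy election Monte Carlo. The electoral vote counts and state win probabilities below are entirely hypothetical, and the states are treated as independent coin flips, which FiveThirtyEight's real model deliberately does not do (it correlates polling errors across states).

```python
# Toy election Monte Carlo: each state is an independent weighted coin,
# and we count how often candidate A reaches 270 electoral votes.
# All numbers are illustrative, not FiveThirtyEight's actual model.
import random

random.seed(42)

# Hypothetical (electoral votes, probability candidate A carries the state).
states = [
    (180, 0.99), (170, 0.02),            # lumped "safe" states for A and B
    (29, 0.55), (20, 0.62), (16, 0.35),  # individual swing states
    (18, 0.48), (15, 0.52), (29, 0.45),
    (10, 0.60), (11, 0.50), (6, 0.40),
    (4, 0.55), (10, 0.50), (20, 0.50),
]
TO_WIN, RUNS = 270, 20_000

wins_a = sum(
    1
    for _ in range(RUNS)
    if sum(ev for ev, p in states if random.random() < p) >= TO_WIN
)
print(f"candidate A wins {wins_a / RUNS:.1%} of {RUNS} simulated elections")
```

Even when one candidate leads in most states, a meaningful fraction of simulations still comes up the other way, which is exactly the point about the 2016 forecast.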


This is where our understanding of the polls was wrong. Yes, the polls themselves were not very accurate, but our understanding of them was even more flawed. Let us think about it in terms of a slam dunk. According to http://www.basketball-reference.com/ the probability of an attempted dunk being successful in the NBA is about 93%. That is more than 20 percentage points greater than the chances Hillary Clinton had of winning the election. In the same vein, Hillary's chances were also lower than the NBA's overall free throw percentage of 76.5%.

Silver's models also predicted which states were likely to swing the election. Notice the "Blue Wall" states being high on the list: Pennsylvania, Michigan, and Wisconsin are 2nd, 3rd, and 7th on the list.
Both the most consequential and the least consequential events in our lives are not deterministic. We live in a probabilistic world, and we need to stop thinking deterministically about how things are going to turn out. Yes, the polls were inaccurate, especially in Michigan and Wisconsin, but they still showed enough of a trend to say that this was not a sure thing. There was a better than 1 out of 4 chance of Trump moving into the White House.

Thursday, November 3, 2016

What Do NFL Quarterbacks Have To Do With Software Teams?

In the last post, we attempted to predict the number of passing yards Aaron Rodgers will accumulate in the 2016 season. We reached some interesting conclusions based on the results of the Monte Carlo simulations that we ran. What does any of this have to do with software development teams and projects? More than you would think. Coaches, agents, teams, and the players themselves are interested in two things above all - Productivity and Predictability. Sound familiar? Assuming we are maintaining quality, software team managers and directors are usually trying to increase both the productivity and predictability of their teams.

The Monte Carlo simulations for 16 games (1 season) for Aaron Rodgers can give us clues to both his productivity and predictability.  Let us take a look at what Rodgers' numbers would look like if we were to predict 16 games today.

15% Certainty: 4667 yds
30% Certainty: 4541 yds
50% Certainty: 4396 yds
70% Certainty: 4236 yds
80% Certainty: 4155 yds
85% Certainty: 4101 yds
The magnitude of these numbers, i.e. the general range of them, gives us an idea of the productivity of the quarterback. Let us take the middle number, the 50% Certainty number (the median), as an indication of this. We can compare the median prediction for Rodgers (4396 yards) to that of other QBs to get an idea of the level of production we can expect from them in a season. The spread of these numbers, i.e. the difference between the 15% certainty and 85% certainty numbers (566 yards for Aaron Rodgers), gives us an idea of the predictability or consistency of the quarterback. The lower the spread, the more consistent the QB. QBs with higher spreads are less predictable, as the answer to how many yards they would throw for changes greatly at different levels of certainty.
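A minimal sketch of extracting these two numbers from a set of simulated season totals. The totals here are stubbed with a normal distribution roughly matching Rodgers' numbers above; in practice they would come from the Monte Carlo runs themselves.

```python
# Sketch: derive the "productivity" (median) and "predictability"
# (15%-85% spread) numbers from simulated season totals.
# The totals are stubbed with a normal distribution for illustration.
import random

random.seed(7)
simulated_totals = [random.gauss(4396, 280) for _ in range(10_000)]

sorted_totals = sorted(simulated_totals, reverse=True)

def at_certainty(totals_desc, certainty):
    """Yards we are `certainty` confident of reaching: the value that a
    `certainty` fraction of the simulations meet or exceed."""
    index = int(len(totals_desc) * certainty) - 1
    return totals_desc[index]

median = at_certainty(sorted_totals, 0.50)
spread = at_certainty(sorted_totals, 0.15) - at_certainty(sorted_totals, 0.85)
print(f"median: {median:.0f} yds, 15%-85% spread: {spread:.0f} yds")
```

The same two lines of arithmetic work unchanged whether the totals are passing yards per season or stories finished per quarter.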

Now if you replace QBs with software teams and yards with stories, the same interpretations hold true. The general magnitude of these numbers, represented by the median, would be an indication of the productivity of the team. The spread of these numbers, represented by the difference between the 15% certainty and 85% certainty numbers, would be a representation of the predictability of the team. Just as in the case of quarterbacks, software teams would want the median to be high, representing higher productivity. Teams would also want the spread to be low, representing greater predictability. Running Monte Carlo simulations on story throughput, as described here, can get us these numbers for teams.

Let us take a look at what these numbers look like for 10 of the modern era quarterbacks. We have 9 currently active quarterbacks and Peyton Manning included in this dataset. We have run Monte Carlo simulations on the data from previous games (going as far back as 2011) for these quarterbacks. The graphs below show the median and spread for the results.

Once we have these numbers, we might be able to come up with a simple metric that helps us identify the kind of quarterback we need or the team that would be best suited for a project. Very frequently, one of these numbers comes at the cost of the other. Usually, the higher the median gets, the more spread out the distribution becomes. The ideal team will have a very high median and a very small spread. That gives us some hints towards how we should construct this metric. The median should probably be the numerator, so that the metric increases as the median increases, and the spread should be the denominator, so that it has the opposite effect. This gives us a very simple metric: Median/Spread. That might not be enough, though; some teams might value higher predictability and others might value higher productivity. We can use exponents to put greater emphasis on one part of the metric over the other. Furthermore, we can scale the metric so that the highest-rated quarterback in the set being compared is always 100 and the others fall in behind.
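The metric described above can be sketched in a few lines. The medians and spreads below are illustrative placeholders, not the actual simulation output.

```python
# Sketch of the balancing metric: median/spread, with optional exponents
# to weight productivity vs predictability, scaled so the best performer
# in the set is 100. All numbers below are hypothetical.

def rating(median, spread, prod_weight=1.0, pred_weight=1.0):
    return (median ** prod_weight) / (spread ** pred_weight)

qbs = {
    "QB A": (4600, 450),  # productive and consistent
    "QB B": (4800, 700),  # more productive, less consistent
    "QB C": (4200, 400),  # less productive, very consistent
}

raw = {name: rating(m, s) for name, (m, s) in qbs.items()}
top = max(raw.values())
scaled = {name: 100 * r / top for name, r in raw.items()}

for name, score in sorted(scaled.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Raising `prod_weight` above 1 shifts the ranking toward the more productive performers; raising `pred_weight` favors the more consistent ones.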

Let us start with the simplest case where the two properties - productivity and predictability are weighted equally. Let us see what our formula for Quarterback rating tells us about our chosen quarterbacks.

Drew Brees
Peyton Manning
Aaron Rodgers
Russell Wilson
Tom Brady
Jay Cutler
Andrew Luck
Ben Roethlisberger
Ryan Tannehill
Eli Manning
What the table above tells us is that if we value productivity and predictability in equal amounts, Drew Brees would be our top pick. Most of the table looks like it is giving us expected results. There is one exception: Tom Brady seems to be lagging behind Rodgers and Wilson. That runs counter to our understanding of the football world. Upon closer inspection, we see that while Brady is more productive than both Wilson and Rodgers, both of the higher-rated quarterbacks (Wilson is barely ahead) are more consistent and predictable.

Now, what if we gave productivity more weight and paid a little less attention to consistency? Let us give productivity 25% more weight and see what it does to our ratings.

Drew Brees
Peyton Manning
Aaron Rodgers
Tom Brady
Russell Wilson
Jay Cutler
Andrew Luck
Ben Roethlisberger
Eli Manning
Ryan Tannehill
Drew Brees and Peyton Manning separate themselves from the pack once again. There seems to be a little more order to the world with Brady jumping ahead of Wilson. The "also-rans" do not change their order as much except for the very bottom of the table.

Running multiple combinations of these numbers, the top two remain almost constant. Regardless of how much weight we put on the two components(productivity and predictability), Drew Brees seems to beat out the competition in every case. Peyton Manning seems to always come in right behind Brees. Russell Wilson, with the lowest "spread" in his predictions, moves further up the chart the more we rely on consistency. Tom Brady moves further up the more importance we give to productivity. Neither of them catches up to Peyton unless we say that predictability is more than twice as important as productivity.

We need to answer the same questions in a software development context. What matters more to us: Productivity or Predictability? Usually, the answer is both. Unless we are careful, though, one of them can hurt the other. We have to use them as balancing metrics. I do not have enough data to say whether this balancing metric can be used to compare teams. We can definitely use it to see if the same team (or QB) is improving in the direction we expect. We can look at the predictions for our teams regularly, figure out these numbers, and make a quick determination of whether we are becoming more predictable, more productive, or both.

Another note on this. Brees tops the table every time, Brady floats all over the place, and Eli Manning is almost always in the bottom three. Meanwhile, Brady has 4 Super Bowl rings, Eli has 2, and Brees has 1. Productivity and predictability are not the only tools for success. They are keys to making successful plans, but there are other variables in the equation. A great defense and a good running game are also needed to win championships. Similarly, for our software teams, regardless of how predictable and productive they are, working on the right things and producing quality products are imperative in order to achieve success.

Saturday, October 29, 2016

Passing Yards For Aaron Rodgers in The 2016 Season (Aka Monte Carlo Models for Aaron Rodgers)

In a post earlier this year, we tried to figure out how many passing yards Peyton Manning would put up if he were to return to football for one game. We answered this question using the techniques for probabilistic forecasting of single items. What if we were to use probabilistic forecasting techniques to figure out an entire season's worth of performances? We have already discussed in earlier posts that forecasting multiple items cannot be done by simply forecasting single items multiple times (well, kind of yes and no...). This means we have to reach into our toolbox to retrieve our preferred technique for predicting multiple items - Monte Carlo simulations. We are also going to change our subject to a more relevant one. As we are almost midway through the current season, let us pick an active player and see what predictions we can make concerning his performances for the remainder of this season. We are going to pick Green Bay Packers' quarterback Aaron Rodgers as our subject for these predictions. We will compare how well he has done against our predictions so far in the season and predict how well he will do in the remainder of it. Answering how many yards a quarterback, especially one as prolific as Aaron Rodgers, will throw for in a season can be a daunting task. As daunting as predicting the end date for a software project.

The quality and accuracy of projections coming out of Monte Carlo depend squarely on the model being used as input for the projections. The first decision we have to make is which past data points to use as inputs to our Monte Carlo simulations. Aaron Rodgers was drafted in 2005 and took over as starting quarterback for the Packers in 2008. That means we can safely ignore all games he participated in before 2008. Also, the team and the system under which Rodgers plays have clearly changed quite a bit since 2008. For this reason, we limit ourselves to the last 5 seasons. We are also going to exclude any games where Rodgers left the field injured and could not complete the game. That leaves us with 77 games, including the ongoing season. We will give each of the performances in these games equal weight in our simulations. This means that for each of the games we are trying to predict, Rodgers' performance is equally likely to be similar to any of the past 77 games.
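A sketch of this model in code: draw each simulated game uniformly, with replacement, from the past per-game yardage, sum a season's worth, and repeat many times. The yardage list below is a short illustrative stand-in for the actual 77-game history.

```python
# Sketch of the equal-weight Monte Carlo model: each simulated season
# draws 16 games uniformly (with replacement) from past per-game yards.
# The list below is a small illustrative stand-in for the 77 real games.
import random
import statistics

random.seed(1)

past_game_yards = [333, 219, 408, 158, 297, 255, 368, 281, 305, 246,
                   189, 274, 316, 235, 291, 350, 265, 227, 301, 285]

def simulate_season(games=16, runs=10_000):
    return [sum(random.choices(past_game_yards, k=games))
            for _ in range(runs)]

season_totals = simulate_season()
percentiles = statistics.quantiles(season_totals, n=20)  # 5% steps
print("50% certainty season total:", percentiles[9])     # the median
```

Reading other entries of `percentiles` gives the rest of the certainty table; the same loop with stories per week instead of yards per game forecasts a software team's quarter.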

Now that we have narrowed down our input data set to what we believe is a representative range for the upcoming games, let us see how the results we get from Monte Carlo compare to those we would have gotten from straight averages. For the seasons from 2011-2015, for the games that we are considering as input, the average for Rodgers was about 276.75 yards per game. If we were to make predictions based on this average, we would say that Rodgers will throw for 4428 yards this season. Also, for the first six games of the season (the games completed at the time of writing of this article), Rodgers, based on the average, would have accumulated 1660 yards.

Testing Predictions Against Past Games

Let us run the Monte Carlo simulations for the first 6 games of the season, assuming that these games have not yet taken place. In other words, let us pretend it is the beginning of the season and we are trying to predict how many yards Rodgers is going to throw for in the first 6 games.
We get the following results -

Predictions For The First 6 games of 2016 Season
15% Certainty: 1835 yds
30% Certainty: 1750 yds
50% Certainty: 1661 yds
70% Certainty: 1570 yds
80% Certainty: 1524 yds
85% Certainty: 1484 yds

These results can be interpreted as confidence ranges for Rodgers' performance. What Monte Carlo is telling us is that we have an 85 percent confidence that Aaron Rodgers can throw for at least 1484 yards, a 50% certainty that he can throw for 1661 yards, a 15% certainty for 1835 yards, and so on. As we see, the higher the confidence level, the lower the number of yards we can predict. So far, Rodgers has thrown for 1496 yards, which means the average-based prediction is off by 164 yards (more than the total yards in a game vs Arizona last year). The 85 percent certainty number from Monte Carlo, on the other hand, is off by 12 yards, or, for a prolific quarterback like Aaron Rodgers, the yards gained from one pass. At the beginning of the season, Rodgers (and his agent) can use this information to set expectations for the season. Coaches can use this information to plan for the season and decide how much importance they put on the run game and on their defense based on the level of confidence/risk they want to assume.

These numbers also provide some validation for our present model and give us another bit of information. The fact that we are getting such a close prediction at the 85% certainty mark tells us that Aaron Rodgers is performing at a lower level than what he is capable of. The 85% certainty mark can be equated to saying that Rodgers is performing at 15% of his maximum potential and about 30% of his median potential.
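The certainty level of an actual result can be read off the simulations directly by asking what fraction of simulated runs met or exceeded it. A sketch, with the simulated totals stubbed out rather than taken from the real 77-game model:

```python
# Sketch: invert the forecast. Given an actual yardage total, find the
# certainty level it corresponds to, i.e. the share of simulated runs
# that met or exceeded it. Simulated totals are stubbed for illustration.
import random

random.seed(3)
simulated_totals = [sum(random.choices(range(150, 420), k=6))
                    for _ in range(10_000)]
actual = 1496  # Rodgers' real total through 6 games

certainty = sum(t >= actual for t in simulated_totals) / len(simulated_totals)
print(f"{certainty:.0%} of simulations met or exceeded {actual} yds")
```

A high certainty for the observed total is another way of saying the player (or team) is running below the middle of their own historical range.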

Predicting The Rest Of The Season

Using the same methods we used to predict the first six games, we can attempt to predict the remainder of the season. We will include the 6 games that have already been played this season as part of our model. Running the model through Monte Carlo for the remaining ten games of the season gives us the following results -

Predictions For The Last 10 games of 2016 Season
15% Certainty: 2948 yds
30% Certainty: 2849 yds
50% Certainty: 2733 yds
70% Certainty: 2615 yds
80% Certainty: 2549 yds
85% Certainty: 2516 yds

Since Rodgers has already thrown for 1496 yards this season, we can try to figure out the number of yards the Packers Quarterback will rack up for the season -

Predictions For The Entire 2016 Season

15% Certainty: 4444 yds
30% Certainty: 4345 yds
50% Certainty: 4229 yds
70% Certainty: 4111 yds
80% Certainty: 4045 yds
85% Certainty: 4012 yds

If we had taken the average yards per game from the previous 5 seasons (276.75 yds/game) and used that as a projection for the 16 games in this season, we would have predicted that Rodgers would pass for 4428 yards. Based on the simulations we have run so far, it seems that Rodgers can only hit that mark with about 20% certainty. For a quarterback that is operating below par, and perhaps inspiring a lower level of confidence, we should use a number at the other end of the scale if we were forced to pick one. Using the 85% certainty number, which is 4012 yards, is probably a much safer bet to make and plan for, whether you are Rodgers, his coaches, his agent, or someone placing bets in Vegas.

A Smarter Model

Our model so far has been pretty straightforward: assume that Aaron Rodgers will perform in future games in a manner similar to one of the past 77 games. The beauty of this model is its simplicity. It requires almost zero football knowledge to understand. All it needs us to understand is that yards are a unit of measurement of a player's productivity in a football game. We do not have to understand any rules, strategies, or other measures and metrics regarding football. What if we could come up with a smarter (maybe better) model that still maintains this simplicity?

"Everything should be as simple as possible, but no simpler" - Albert Einstein

Let us run the same simulations using a model that considers the opponent the Green Bay Packers are up against. What that means is that as we try to figure out future performances, we will not randomly select from all of the past 77 games. We will instead sample from games that Aaron Rodgers has played against the particular opponent the Packers are facing. In essence, all games against the Chicago Bears will be sampled only from prior games against the Chicago Bears. Using this model, we get the following results for the first 6 games of 2016 -
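A sketch of this opponent-aware sampling, with hypothetical per-opponent yardage lists and a hypothetical schedule standing in for the real data:

```python
# Sketch of the opponent-aware model: each future game is sampled only
# from past games against that same opponent. The per-opponent yardage
# and the schedule below are illustrative, not the real data.
import random

random.seed(5)

games_vs = {  # hypothetical past yards, keyed by opponent
    "Bears":   [302, 255, 283, 327],
    "Lions":   [274, 236, 310],
    "Vikings": [219, 291, 266, 241],
}
schedule = ["Bears", "Lions", "Vikings", "Bears", "Lions", "Vikings"]

def simulate(schedule, runs=10_000):
    return [sum(random.choice(games_vs[opp]) for opp in schedule)
            for _ in range(runs)]

totals = sorted(simulate(schedule), reverse=True)
pct_85 = totals[int(len(totals) * 0.85) - 1]
print(f"85% certainty: at least {pct_85} yds over {len(schedule)} games")
```

The only change from the simpler model is the sampling pool for each game; everything downstream, including how the percentiles are read, stays the same.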

Predictions For The First 6 games of 2016 Season

15% Certainty: 1717 yds
30% Certainty: 1641 yds
50% Certainty: 1554 yds
70% Certainty: 1481 yds
80% Certainty: 1432 yds
85% Certainty: 1402 yds

These predictions are all lower than the predictions of the simpler "random" model. In fact, in this case, the actual yardage of 1496 sits at a 66% certainty, which is to say that Rodgers is performing at 34% of his maximum potential (as opposed to 15% from the "random" model). Why is this model giving us more pessimistic results? Why are the same 1496 actual yards interpreted as different levels of performance for Aaron Rodgers? Taking a closer look at the data answers the question for us. Since 2011, Rodgers' only 400+ yard games have come against Denver, Washington, and New Orleans, teams that are not on the schedule for 2016. This means that when we simulate the games for 2016 based on the opposing team, these games do not get considered at all. This lowers the projections for the group of games we are simulating.

The projections for the remainder of the season and the overall projections are as follows - 

Predictions For The Last 10 games of 2016 Season

15% Certainty: 3046 yds
30% Certainty: 2943 yds
50% Certainty: 2854 yds
70% Certainty: 2761 yds
80% Certainty: 2704 yds
85% Certainty: 2675 yds

Predictions For The Entire 2016 Season

15% Certainty: 4542 yds
30% Certainty: 4439 yds
50% Certainty: 4350 yds
70% Certainty: 4257 yds
80% Certainty: 4200 yds
85% Certainty: 4171 yds

This model suggests that Rodgers can be expected to do better than the predictions from the "simple" model for the rest of the season. The Packers' quarterback has historically performed better against the teams the Packers are going to play in the remainder of the season than against those they have played in the first 6 games. This is also borne out when we look at averages: against the first 6 opponents, Rodgers averaged 263 yds/game, as opposed to 270 yds/game against the next 10 opponents.

In Summation

In summation, we can conclude that Monte Carlo predictions (probabilistic forecasts) give us a much better chance of answering questions about multiple-game performance than averages do. Based on our level of confidence/risk tolerance, we can choose the certainty level and plan accordingly. We also see that different models give us different results. We have to figure out the best models that fit the reality of our situation, but at the same time not make them too complex or specialized. As Albert Einstein said: "Everything should be as simple as possible, but no simpler."