Re-platforming a system and the value of 1 week iterations

For the last 14 months I have been working at Shopzilla to re-platform the Inventory system. I haven’t had the time to blog about what we experimented with, what worked and what didn’t. This is the first of a number of posts I intend to put up regularly.

Scrum does not say much about how to re-platform a system effectively. How do we divide up the work? How do we communicate to the business that we are designing? How do we retrospect quickly and modify our design in response to learning more about our project and our product?

The last question here is the most important, I think. The Agile community abhors BUFD (Big Up Front Design). The reaction to this was to go completely the other way: design as one goes along, also known as Continuous Design (from XP). My belief is that the amount of design needed is directly proportional to the complexity of the system. There is a fine line to tread between not designing enough, designing just enough, and designing far too much, which leads to BUFD and potential analysis paralysis. When designing, the key is to defer all decisions to the last responsible moment.

Deferring decisions to the last responsible moment is extremely effective. It makes us assess the bare minimum we need to know in order to make the right design decisions, begin assessing technologies and so on. Funnily enough, the DSDM agile method calls this Foundations. Spotify, in their adapted Agile method, call this Think Time. Scrum is woefully lacking in this area, yet it is of fundamental importance.

So how did 1 week iterations help us?

1 week iterations have two important and valuable traits:

We get to retrospect on a weekly basis.
Stories must be small.

Retrospecting on a weekly basis allowed us to defer decisions more easily because, should we make an incorrect decision, we would only have lost 1 week of work, proofing etc. I feel this is particularly effective in periods of large uncertainty, such as at the beginning of a project when there are many unanswered questions. The driver is to give us more opportunities to gather information and make better decisions. So perhaps we realised that a decision we had deferred had to be brought back in, but that’s ok because, again, we only lost 1 week.

We have also found that small stories are particularly effective at giving direction to engineers. We currently have a limit of 5 points for any story admitted into iteration planning. This forces us to think carefully about the acceptance criteria: they have to be exact. Here are some examples. Note that our system uses VoltDB to persist data and we have a suite of Blackbox tests that test end to end system processing (see future blog post).

1 point story

[DP] – Data Publish Service meets organisational standards
As TheOrganisation.com, I want to be able to deploy, configure and troubleshoot the data publish so that the service can integrate into the organisation’s operational environment.

Acceptance criteria

Logging configuration follows organisational standards wrt access/error/console/syslog
Configurable parameters follow organisational standards (properties file outside of WAR)
Packaging follows organisational standards (deployable WAR, probably nothing to do here)
Build/archive functionality follows organisational standard (build and archive uses CM scripts on Jenkins)
There is a health-check that checks that the service and its dependencies are healthy.
There is an index page with the following features:
- a link to the healthcheck
- a link to the configuration information
- a link to the build version information
- a form that allows users to submit JSON requests to the service’s API
Relevant statistics are exposed via JMX

5 point story

[Quality] Reliable VoltDB connections
As shopzilla.com, I want services accessing Volt to have reliable connections so that they may continue to process in the event of a cluster node or network failure.

Acceptance criteria

Services that lose the connection to VoltDB will make reconnection attempts.
A successful reconnection attempt means that the service resumes functioning correctly.
Services can start up without a running VoltDB instance.
Every lost VoltDB connection and each failed reconnection attempt is logged at ERROR level.
Lack of VoltDB connectivity should lead to failing health checks.
The reconnecting algorithm should be configurable using properties – so, for instance, if the chosen algorithm means that the service will make a certain number of reconnection attempts and then give up, the number of attempts to make and time interval between attempts should be parameters.
The blackbox tests are updated so Volt does not need to be restarted on re-deployment of a component using a Volt client (i.e. DI and RDEP)

Note
It is possible that we will want to contribute this code back to VoltDB, as it may benefit us if the maintenance burden for the code is shifted to VoltDB. So the code should contain as few dependencies as possible.
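To make the criteria above a little more concrete, here is a minimal sketch of one way a configurable reconnection policy could look. It is not the code we wrote and it does not use the VoltDB client API: the `Connector` interface, the `ReconnectingClient` class and the `reconnect.maxAttempts` / `reconnect.intervalMillis` property names are all hypothetical, chosen only for illustration. It uses nothing beyond the JDK, in keeping with the note about few dependencies, and `java.util.logging`'s SEVERE level stands in for ERROR.

```java
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch: class, interface and property names are illustrative only. */
final class ReconnectingClient {

    /** Stand-in for whatever opens the real connection; returns true on success. */
    interface Connector {
        boolean tryConnect();
    }

    private static final Logger LOG = Logger.getLogger(ReconnectingClient.class.getName());

    private final Connector connector;
    private final int maxAttempts;      // assumed property: reconnect.maxAttempts
    private final long intervalMillis;  // assumed property: reconnect.intervalMillis
    private volatile boolean connected;

    ReconnectingClient(Connector connector, Properties props) {
        this.connector = connector;
        this.maxAttempts = Integer.parseInt(props.getProperty("reconnect.maxAttempts", "5"));
        this.intervalMillis = Long.parseLong(props.getProperty("reconnect.intervalMillis", "1000"));
        // No connection attempt here, so the service can start without a running database.
    }

    /** A health check would report failure while this returns false. */
    boolean isHealthy() {
        return connected;
    }

    /** Call at start-up or when the connection is lost; returns true once connected. */
    boolean reconnect() {
        connected = false;
        LOG.log(Level.SEVERE, "No database connection; starting reconnection attempts");
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connector.tryConnect()) {
                connected = true;
                return true;
            }
            LOG.log(Level.SEVERE, "Reconnection attempt {0} of {1} failed",
                    new Object[] {attempt, maxAttempts});
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt for the caller and stop retrying.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // gave up after maxAttempts; the health check keeps failing
    }
}
```

A service would build the client from its external properties file, call `reconnect()` whenever the connection drops, and wire `isHealthy()` into its health check. Giving up after a fixed number of attempts is only one possible algorithm; the story deliberately leaves that choice open.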

Think about what to implement!

Our statistics show that we regularly have a mode story size of 1 or 2 and a median of 3, so we keep everybody focused. This includes the Product Owner and Tech Lead, who must think about what is being delivered, leading to a specific, targeted set of instructions for the engineers. The feedback on this approach from engineers has been excellent. Rarely do we have any uncertainty about what should be implemented; there are usually just minor clarification questions. Rarely does a story take longer than its allotted story point size. An additional benefit is that, with almost all uncertainty removed from story implementation, we can focus on ensuring coding standards, approaches, libraries etc. are consistent across all components of the new platform.

The psychology of an engineer

1 week iterations have been so effective and well received by the team that we have continued the practice even today, when the platform has begun to mature. Sometimes we work a bit too fast for product, thanks to the investment made in the system that allows extremely fast development and release: a good problem to have! The reason, I think, is the way an engineer’s mind works. We work in a very tangible area. We can come in in the morning, find a problem, write a test to prove the problem exists, write code to fix the problem and demonstrably show we have achieved our objective of fixing that problem. This can happen extremely quickly, certainly within a day on many occasions. Now compare and contrast this to cultural change programmes, where the implementer may not know for a year or more whether what they have done is going to work! So as engineers fix issues, we can also check off the tasks. Small stories mean tasks are granular and easy to complete. This means we are also continually completing tasks and stories. To an engineer, this is very satisfying.

Why not Kanban?

The obvious question when thinking about 1 week iterations is whether the team should have adopted Kanban instead. Kanban is used when the business objectives change more often than once a week, which makes it very difficult to have a long term goal and work towards it. If the future direction and strategy are mostly known, major objectives and milestones must be met, and the uncertainty lies in the journey to achieve them via co-ordinated efforts (such as architecture/design, interfaces with other teams etc.), then 1 week iterations may be for you.

What are your experiences?

In closing, I’d be interested to know what iteration lengths have worked for you and why. Have you tried 1 week iterations? How did it turn out? What other techniques have you used to de-risk designing a major system?

Note

This is a modified post from the one that originally appeared at http://tech.shopzilla.com/2013/03/re-platforming-a-system-and-the-value-of-1-week-iterations/