No more burndown, no more definition of done

Over the last year and a half of working effectively with 1 week iterations, we have found that two aspects of the traditional way of running an Agile project are no longer required.  The first is the burndown chart, and the second is the Definition of Done.  I used to be a real standard-bearer for both these tools, but 1 week iterations have changed my mind about them.  Now I prefer quality as tasks.

With 1 week iterations, we have a cycle time of 3 – 4 days.*  This isn’t great if you consider that there are only 5 working days in a 1 week iteration.  We haven’t really been able to figure out how to get this cycle time down, even though our mean and mode story sizes are both 2.  I have a feeling that this is perhaps the most efficient cycle time for the team in our organisation once we take into account our “organisational lag” (i.e. dependencies on outside teams, previous architectural decisions on legacy systems we have to interface with, and plain old lack of knowledge about legacy systems because the people who knew them have left).  What we do know is that we have to complete the stories within one week.  About two or three days into the iteration, the team has an excellent feel for how they’re doing and what will and won’t be completed.  As a result, we don’t use the burndown chart.
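For anyone wanting to track the same number, here is a minimal sketch of the cycle time calculation, using the definition in the footnote (time from a story being put in progress to it being business accepted).  The story ids and timestamps are made up for illustration, and the sketch counts elapsed calendar time rather than working hours, which is an assumption on my part.

```python
from datetime import datetime
from statistics import mean

# Hypothetical story records: (story id, put in progress, business accepted).
STORIES = [
    ("S-1", datetime(2013, 3, 4, 9, 0), datetime(2013, 3, 7, 9, 0)),
    ("S-2", datetime(2013, 3, 4, 9, 0), datetime(2013, 3, 8, 9, 0)),
    ("S-3", datetime(2013, 3, 5, 9, 0), datetime(2013, 3, 8, 21, 0)),
]


def cycle_time_days(started, accepted):
    """Elapsed days between 'in progress' and 'business accepted'."""
    return (accepted - started).total_seconds() / 86400


def mean_cycle_time(stories):
    """Average cycle time in days over a list of story records."""
    return mean(cycle_time_days(started, accepted) for _, started, accepted in stories)


if __name__ == "__main__":
    print(f"Mean cycle time: {mean_cycle_time(STORIES):.1f} days")
```

Swapping in working-day arithmetic would change the numbers but not the shape of the calculation.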

Another interesting aspect is that we don’t need the Definition of Done.  This is a direct consequence of the business not accepting a story until the service is in production.  To ensure we don’t deploy something that is going to fall over, we have a story template that we apply to all new features, bug fixes and modifications so that we cover all the bases.  The template story has the following tasks (remember, we practice BDD):

  • Write failing BDD test
  • Implement
  • Code review
  • Fixes from code review
  • Add blackbox test OR verify current blackbox tests still work
  • Quality Engineer – verify
  • Merge and mint
  • Deploy and verify in staging
  • Demo
  • Deploy and verify in production
  • Enable statistics collection
  • Create statistics graphs
  • Update Helpdesk Monitoring Handbook

We take the above template tasks and modify as appropriate.  For example, there may be no stats to be collected or graphed because it’s a modification to a feature that already has its statistics being collected.  In this case, the statistics tasks will be removed.
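The template-then-trim idea above can be sketched in a few lines.  This is only an illustration: the task names come from the list above, but the function name and the `needs_new_statistics` flag are hypothetical, and in practice we adjust the tasks by hand on the board.

```python
# The story template, copied from the task list above.
TEMPLATE_TASKS = [
    "Write failing BDD test",
    "Implement",
    "Code review",
    "Fixes from code review",
    "Add blackbox test OR verify current blackbox tests still work",
    "Quality Engineer - verify",
    "Merge and mint",
    "Deploy and verify in staging",
    "Demo",
    "Deploy and verify in production",
    "Enable statistics collection",
    "Create statistics graphs",
    "Update Helpdesk Monitoring Handbook",
]

# Tasks that only apply when a story introduces new statistics.
STATISTICS_TASKS = {"Enable statistics collection", "Create statistics graphs"}


def tasks_for_story(needs_new_statistics=True):
    """Return the template tasks, dropping the statistics tasks when not needed."""
    if needs_new_statistics:
        return list(TEMPLATE_TASKS)
    return [task for task in TEMPLATE_TASKS if task not in STATISTICS_TASKS]


if __name__ == "__main__":
    for task in tasks_for_story(needs_new_statistics=False):
        print("-", task)
```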

An item in the template which bears mentioning is the Code Review task.  Since the inception of our project, we have used the team email to do code review for a couple of reasons.  The main reason is full transparency to all team members so we can each learn from each other.  We tried various code review tools but decided to not use them for two reasons:

  1. It was difficult to see the context of the solution because the developer, or even the tool set up, meant that only changed classes were included in the review, so we ended up going back to our IDEs anyway.
  2. Only the developer submitting the code could see the result of the review.

When performing the code review, we have another template:

  • Design – focus on the design of the solution
  • Application code
  • Unit test code
  • BDD tests
  • Sonar metrics

The above techniques mean that we have a structured manner in which to ensure that code that gets released to production does not get rolled back.  Indeed, in the last year with an average deployment rate of 2 – 3 deployments per day (manually – we’re working on automating this), we have only had a single figure number of rollbacks.  Pretty good, I think!

It would be great to know what Agile tools or techniques you have either modified or removed due to project and/or application context, so do feel free to leave a comment, below.


Footnotes:

* Cycle time is the amount of time that passes between a story being put in progress and it being business accepted.

Posted in Agile, Software Execution

Presentation on Behaviour Driven Development available

Here is the presentation I gave to the Los Angeles Java User Group about BDD and its application using Cucumber in the following contexts: testing REST APIs, testing web sites (integration with Selenium) and testing Hadoop / Hive.

The slides are here: LAJUG_v0.7

A video of the presentation is here:

Posted in Agile, Software Execution

Blackbox testing, Whitebox testing and Behaviour Driven Development

As I have mentioned before, I have been involved in the re-platform of the Inventory system here at Shopzilla for the last 14 months.  During this time we got to experiment with a couple of approaches with great success.  One was the concept of blackbox testing and whitebox testing using Behaviour Driven Development (BDD).  Before I explain what this is, let’s take a look at the new Inventory platform.

Our new Inventory platform switched from a feed batch processing approach to a streaming model.  This meant that we had a number of services deployed, each with their own responsibilities in the pipeline.  At a high level this is as follows:

Figure: A simplified view of the new inventory platform

As can be seen from the diagram, a feed is ingested by the Feed Validation Service. This validates the feed and transforms it to an internal format. The Feed Processing Service picks it up, performs delta calculations and creates individual “Offer Events” that are streamed to the downstream services where they are persisted and made available to clients.

Each of these services has a defined contract at its API and performs a subset of the overall platform functions. Together the services perform the end to end goal of ingesting a feed and making it available to clients.
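To make the "delta calculations" step a little more concrete, here is a minimal sketch of the idea: compare the previous snapshot of a feed with the new one and emit one event per offer that changed.  This is not our Feed Processing Service code; the event names (ADDED, CHANGED, REMOVED), the dictionary layout and the function name are all assumptions made for illustration.

```python
def offer_events(previous, current):
    """Compare two feed snapshots (dicts keyed by offer id) and list the changes.

    Each event is a tuple of (event type, offer id, offer data).
    """
    events = []
    for offer_id, offer in current.items():
        if offer_id not in previous:
            events.append(("ADDED", offer_id, offer))
        elif previous[offer_id] != offer:
            events.append(("CHANGED", offer_id, offer))
    for offer_id in previous:
        if offer_id not in current:
            events.append(("REMOVED", offer_id, None))
    return events


if __name__ == "__main__":
    yesterday = {227: {"price": "155"}, 551: {"price": "89.00"}}
    today = {227: {"price": "150"}, 1485: {"price": "29.0"}}
    for event in offer_events(yesterday, today):
        print(event)
```

Unchanged offers produce no event at all, which is the point of calculating a delta before streaming anything downstream.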

We needed to find a way to test not only that each service worked in isolation but also that all the services worked in conjunction with each other.  Further, we wanted to experiment with Behaviour Driven Development (BDD).  BDD has been around for several years, yet it is only recently that the frameworks have become mature enough to use in the development of a new platform.

In order to achieve confidence we came up with the concepts of Blackbox testing and Whitebox testing.

Whitebox testing is individual service or component granular testing.  Blackbox testing tests end to end system functionality.

In more detail:

Whitebox tests

  • Covers all the scenarios that the service API contract should cover
  • The person writing the test cares about the internals of the service, the what, not the how
  • Fast set up, execution and tear down
  • A large number of tests
  • Each test has a lower data set up and tear down overhead, and contains targeted data validation for the scenario under test

Blackbox tests

  • Deploy our entire platform freshly onto an integration environment – typically taking the master build of each service.
  • Are not concerned about the API contract of services, rather the path data takes through the system
  • The person writing the tests only cares about data set up and making sure that the first service in the pipeline is invoked, data is persisted in the correct places and downstream services are notified appropriately
  • Slow tests that take longer to run – typically on a scheduled basis

We implemented both the Whitebox and Blackbox tests using BDD.  So let’s take a look at example Gherkin files.

Whitebox test Gherkin file

This BDD scenario checks that a German feed can be processed:

Scenario: Accepts feed for Germany
  Given a feed file with the following contents
    | titre                  | id | link | price | sale price | description |
    | Toms klassische Schuhe | 10 |      | 10    | 5          | description |
  When the system receives an ingestion request
  Then the output file has the following header row
    unique_id   title product_url description  original_price unit_price another_header another_header2
  And maps the following content for the feed:
    | title                  | unique_id | product_url | original_price | current_price |
    | Toms klassische Schuhe | 10        |             | 1000           | 500           |

Note that there is neither much data setup nor much data verification.  A Whitebox feature file can consist of many scenarios of the same style.  So with Whitebox tests we have: a small amount of data setup, a small amount of data verification, and many scenarios per feature.
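As a rough illustration of the behaviour this scenario pins down, here is a sketch of the row mapping in Python.  It is not the service’s real code.  The column mapping is read off the two tables in the scenario, and the conversion of prices to minor units (10 becomes 1000) is inferred from the example values, so treat both as assumptions.

```python
from decimal import Decimal

# Assumed mapping from feed columns to internal columns, read off the scenario tables.
COLUMN_MAP = {
    "titre": "title",
    "id": "unique_id",
    "link": "product_url",
    "price": "original_price",
    "sale price": "current_price",
}
PRICE_COLUMNS = {"original_price", "current_price"}


def to_minor_units(amount):
    """Convert a price string such as '10' to minor units ('1000')."""
    return str(int(Decimal(amount) * 100))


def map_row(feed_row):
    """Map one feed row (a dict of column name to string) to the internal format."""
    mapped = {}
    for source, target in COLUMN_MAP.items():
        value = feed_row.get(source, "")
        if target in PRICE_COLUMNS and value:
            value = to_minor_units(value)
        mapped[target] = value
    return mapped


if __name__ == "__main__":
    row = {"titre": "Toms klassische Schuhe", "id": "10", "link": "",
           "price": "10", "sale price": "5", "description": "description"}
    print(map_row(row))
```

Using `Decimal` rather than floats keeps the price conversion exact for values such as 29.0 or 89.00.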

Blackbox test Gherkin file

Here is the blackbox equivalent of processing a feed – note we don’t care what country it is, that’s left for the Whitebox tests:

Scenario: Feed is processed by the pipeline
  Given that the ID generator service on port 6666 allocates ids up to 1970
  And all databases are cleanly initialized with no data
  And all services are reset and flushed
  And a connection can be made to all databases
  And the feed information is configured:
    | merchantId | feedId | feedPreprocessor | protocol | path    | feed     | encoding | header | delimiter | columns | quotes | country |
    | 666        | 0      | MyFeedProcessor  | http     | /feeds/ | feed.csv | UTF-8    | y      | ,         | 5       | n      | US      |
  And check that all known services are healthy
  And the delta repository contains no offers
  And a file named "feed.csv" with the following contents:
    | line |
    | id , MyCategory , Manufacturer , model , product_url , title , price , description |
    | 227 , 13.050.902 , GE , JEM25DMBB , , GE Black Spacemaker II Microwave Oven - JEM25DMBB , 155 , microwave! |
    | 1485 , 11.510.100 , Bose , 17626 , , Bose UB-20 Wall/Ceiling Brackets In Black - 17626 , 29.0 , the wall brackets |
    | 551 , 11.960.000 , Sony , PSL-X250H , , Sony Turntable - PSL-X250H , 89.00 , amazing |
  When the Feed Validation Service receives a feed to process:
    | merchantId | feedId | feedFileLocation |
    | 666        | 0      | feed.csv         |
  Then within 10 minutes, the delta repository contains:
    | oid  | in_progress |
    | 1871 | 0           |
    | 1872 | 0           |
    | 1873 | 0           |
  And the "Feed Processing Service" responds with "DONE" for merchantId 666 and feedId 0
  And the "Feed Validation Service" responds with "DONE" for merchantId 666 and feedId 0
  And within 1 minutes, the Retail Data Repository contains these offers:
    | oid  | merchantId |
    | 1871 | 666        |
    | 1872 | 666        |
    | 1873 | 666        |

So for the same function of processing a feed, you can see that the Blackbox tests are more focussed on the end to end: ensuring that known data inputs result in the services in the pipeline behaving as expected and the data being persisted in the appropriate places.
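The "within 10 minutes" steps are worth a note, because in a streaming pipeline the data arrives eventually rather than immediately.  A step like that can be backed by a small polling helper.  The sketch below shows the general technique; it is not taken from our Lettuce step definitions, and the name `eventually` and its parameters are hypothetical.

```python
import time


def eventually(condition, timeout_seconds, poll_seconds=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True or the timeout expires.

    Returns True as soon as the condition holds and False on timeout.
    clock and sleep are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


if __name__ == "__main__":
    started = time.monotonic()
    # Stand-in for "the delta repository contains the expected offers".
    ready = lambda: time.monotonic() - started > 0.2
    print(eventually(ready, timeout_seconds=2, poll_seconds=0.05))
```

A step such as "within 10 minutes, the delta repository contains" would then call the helper with a 600 second timeout and a condition that queries the repository.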

When to run Whitebox and Blackbox tests

We actually run our Whitebox tests as part of our Maven build.  Cucumber JVM, which we selected as our BDD framework, makes the integration extremely easy via its Cucumber JUnit runner.  Our builds typically take anything from 1 to 4 minutes for a full Maven clean install.  Note that the 4 minute build applies to only one service, which has Hadoop MapReduce based tests using MiniMRCluster, and that is slow in and of itself.  We’ve also made some inroads into efficient BDD testing with Hadoop, but that will be the subject of another blog post.

Our Blackbox tests are run at midnight and take about 40 minutes.

Developing with Blackbox tests

As you can imagine, Whitebox tests are very fast to develop with given they only test one service and are executed as part of the build.  Blackbox tests, on the other hand, by their nature require a dedicated environment with all services deployed.  We can’t have lots of these as it becomes resource expensive and quite unmanageable.  Instead, team members have to communicate when they are using the integration environment, then wait about 2 or 3 minutes while they point the integration environment towards their dev machine.  This process is slow because it:

  • Sets up all servers to point back to the developer’s machine so that the BDD framework can communicate with the environment when executing the tests
  • Deploys the latest version of the services to the integration environment, including any branches of services the developer is working on

As you can imagine, the long running nature of these tests means that the turnaround time is not great.  However, we are very specific with the Blackbox tests we create and, now the platform is maturing, our addition of blackbox tests has decreased considerably.  Many changes now only require updates to a service’s Whitebox tests.

BDD frameworks

Finally, a quick note on BDD frameworks.  Originally we assessed two options for executing BDD tests: Python with Lettuce, and Cucumber JVM.  After our initial assessment we thought Python and Lettuce would be faster, so we used that.  While using it to develop real functionality, we began to suspect that Python/Lettuce was much slower than Cucumber JVM.  On top of that, Cucumber JVM’s codebase was being committed to regularly and features were being added very quickly.  So we converted a suite of tests over to Cucumber JVM and found a roughly 25% reduction in execution time.  We then gradually migrated all Whitebox tests for all services to Cucumber JVM and haven’t looked back.  Our Blackbox tests are still in Python and Lettuce, which we’ve gotten used to working with.  The Blackbox tests have a slow initial set up and execution time anyway, so the Python/Lettuce decision has far less of an impact.  If we were to re-write the Blackbox tests, we’d probably stick with the same approach but use Cucumber JVM.

Future direction

Blackbox and Whitebox tests have worked very well for us, especially the way that writing a failing BDD whitebox test, before any code, makes us think about what we’re about to implement.  We will continue working this way until a better method comes along.

As for the speed of the Blackbox tests, we’re always working on streamlining them.  I believe as of writing we’ve just got them down to 20 minutes execution time from 40 minutes.

So how about you?  Have you used BDD or the concept of Blackbox and Whitebox tests for a multi-service system?  How did it work out?


This is a modified post from the one that originally appeared at

Posted in Architecture, Software Execution, Technology

Re-platforming a system and the value of 1 week iterations

For the last 14 months I have been working at Shopzilla to re-platform the Inventory system.  I haven’t had the time to blog about what we experimented with, what worked, what didn’t.  This is the first of a number of posts I intend on putting up regularly.

When trying to figure out how to re-platform a system, Scrum does not say much about how to execute this effectively.  How do we divide up the work? How do we communicate to the business that we are designing?  How do we retrospect quickly and modify our design in response to learning more about our project and our product?

The last question here is the most important, I think.  The Agile community abhors BUFD – Big Up Front Design.  The reaction to this was to go completely the other way.  Design as one goes along – also known as Continuous Design (from XP).  My belief is the amount of design is directly proportional to the complexity of a system.  There is a fine line to tread between not designing enough, designing just enough and way too much design: getting into BUFD and potential analysis paralysis.  When designing, the key is to defer all decisions to the last responsible moment.

Deferring decisions to the last responsible moment is extremely effective.  It makes us look at and assess the bare minimum we need to know in order to make the right design decisions, know enough to begin assessing technologies etc. Funnily enough, the DSDM agile method calls this Foundations.  Spotify in their adapted Agile method call this Think Time.  Scrum woefully lacks in this area, yet it is of fundamental importance.

So how did 1 week iterations help us?

1 week iterations have two important and valuable traits:

  • We get to retrospect on a weekly basis.
  • Stories must be small.

Retrospecting on a weekly basis allowed us to defer decisions more easily because, should we make an incorrect decision, we would only have lost 1 week of work, proofing etc.  I feel this is particularly effective in periods of large uncertainty, such as at the beginning of a project when there are many unanswered questions.  The driver is to give us more opportunities to gather more information and make better decisions.  Sometimes we realised that a decision we had deferred needed to be brought back in, but that’s ok because, again, we only lost 1 week.

We have also found that small stories are particularly effective at giving direction to engineers.  We currently have a limit of 5 points for any story admitted into iteration planning. This forced us to think carefully about the acceptance criteria. They had to be exact.  Here are some examples. Note that our system uses VoltDB to persist data and we have a suite of Blackbox tests that test end to end system processing (see future blog post).

1 point story

[DP] – Data Publish Service meets organisational standards
As, I want to be able to deploy, configure and troubleshoot the data publish so that the service can integrate into the organisation’s operational environment.

Acceptance criteria

  • Logging configuration follows organisational standards wrt access/error/console/syslog
  • Configurable parameters follow organisational standards (properties file outside of WAR)
  • Packaging follows organisational standards (deployable WAR, probably nothing to do here)
  • Build/archive functionality follows organisational standard (build and archive uses CM scripts on Jenkins)
  • There is a health-check that checks that the service and its dependencies are healthy.
  • There is an index page with the following features:
    • a link to the healthcheck
    • a link to the configuration information
    • a link to the build version information
    • a form that allows users to submit JSON requests to the service’s API
  • Relevant statistics are exposed via JMX
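The health-check criterion above ("the service and its dependencies are healthy") boils down to aggregating a set of individual checks.  Here is a minimal sketch of that aggregation.  The check names and the shape of the result are hypothetical and do not describe our actual health-check endpoint.

```python
def health_check(checks):
    """Run named checks; the service is healthy only if every check passes.

    checks maps a name to a zero-argument callable returning True or False.
    A check that raises an exception is treated as unhealthy.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return {"healthy": all(results.values()), "checks": results}


def unreachable_database():
    """Stand-in for a dependency check that cannot reach its target."""
    raise ConnectionError("database unreachable")


if __name__ == "__main__":
    print(health_check({"service": lambda: True, "database": unreachable_database}))
```

Reporting each dependency separately, as well as the overall flag, makes the troubleshooting part of the story much easier.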

5 point story

[Quality] Reliable VoltDB connections
As, I want services accessing Volt to have reliable connections so that they may continue to process in the event of a cluster node or network failure.

Acceptance criteria

  • Services that lose the connection to VoltDB will make reconnection attempts.
  • A successful reconnection attempt means that the service resumes functioning correctly.
  • Services can start up without a running VoltDB instance.
  • Every lost VoltDB connection and each failed reconnection attempt is logged at ERROR level.
  • Lack of VoltDB connectivity should lead to failing health checks.
  • The reconnecting algorithm should be configurable using properties – so, for instance, if the chosen algorithm means that the service will make a certain number of reconnection attempts and then give up, the number of attempts to make and time interval between attempts should be parameters.
  • The blackbox tests are updated so Volt does not need to be restarted on re-deployment of a component using a Volt client (i.e. DI and RDEP)

It is possible that we will want to contribute this code back to VoltDB, as it may benefit us if the maintenance burden for the code is shifted to VoltDB. So the code should contain as few dependencies as possible.
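To show how exact acceptance criteria translate almost directly into code, here is a sketch of the "fixed number of attempts, then give up" variant mentioned in the criteria.  It is not our implementation and it does not use the VoltDB client API: `connect` is a hypothetical callable that either returns a connection or raises `ConnectionError`, and the two parameters stand in for values that would be read from a properties file.

```python
import logging
import time

log = logging.getLogger("reconnect")


def reconnect(connect, max_attempts, interval_seconds, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is used up.

    Every failed attempt is logged at ERROR level.  Returns the connection,
    or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            log.error("Reconnection attempt %d of %d failed: %s",
                      attempt, max_attempts, exc)
            if attempt < max_attempts:
                sleep(interval_seconds)
    return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    failures = iter([True, True, False])

    def flaky_connect():
        # Fails twice, then succeeds; a stand-in for a real client connection.
        if next(failures):
            raise ConnectionError("node down")
        return "connection"

    print(reconnect(flaky_connect, max_attempts=5, interval_seconds=0.1))
```

A health check would report the service as unhealthy for as long as this function has not returned a connection.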

Think about what to implement!

Our statistics show that we regularly have a mode story size of 1 or 2 and a median of 3, so we keep everybody focused.  This includes the Product Owner and Tech Lead who must think about what is being delivered, leading to a specific, targeted set of instructions for the engineers.  The feedback on this approach from engineers has been excellent.  Rarely do we have any uncertainty about what should be implemented; it is usually just minor clarification questions.  Rarely does a story take longer than its allotted story point size.  An additional benefit of this is that, with almost all uncertainty removed from story implementation, we can focus on ensuring coding standards, approaches, libraries etc. are consistent across all components of the new platform.
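If you want to keep an eye on the same two numbers, they are a one-liner each.  The story sizes below are invented for illustration and are not our real iteration data; they are simply chosen so that the mode comes out at 1 and the median at 3.

```python
from statistics import median, mode

# Hypothetical story sizes (in points) across a few iterations, capped at 5.
STORY_POINTS = [1, 1, 1, 1, 2, 3, 3, 3, 5, 5, 5]


def summarise(points):
    """Return the mode and median story size for a list of story points."""
    return {"mode": mode(points), "median": median(points)}


if __name__ == "__main__":
    print(summarise(STORY_POINTS))
```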

The psychology of an engineer

1 week iterations have been so effective and well received by the team that we have continued the practice even today when the platform has begun to mature.  Sometimes we work a bit too fast for product, thanks to the investment made in the system that allows extremely fast development and release – a good problem to have!  The reason, I think, is the way an engineer’s mind works.  We work in a very tangible area.  We can come in in the morning, find a problem, write a test to prove the problem exists, write code to fix the problem and demonstrably show we have achieved our objective of fixing that problem.  This can happen extremely quickly – certainly within a day on many occasions.  Now compare and contrast this to cultural change programmes where the implementer never knows whether what they have done is going to work for a year or more!  So as engineers fix issues, we can also check off the tasks.  Small stories mean tasks are granular and easy to complete.  This means we are also continually completing tasks and stories.  To an engineer, this is very satisfying.

Why not Kanban?

The obvious question when thinking about 1 week iterations is whether the team should have adopted Kanban instead.  Kanban is used when the business objectives change on a sub 1 week basis.  This makes it very difficult to have a long term goal and go towards it.  If the future direction and strategy are mostly known, major objectives and milestones must be met, and there is uncertainty about the journey needed to achieve them via co-ordinated efforts (such as architecture/design, interfaces with other teams etc.), then 1 week iterations may be for you.

What are your experiences?

In closing, I’d be interested to know what iteration lengths have worked for you and why.  Have you tried 1 week iterations?  How did it turn out? What other techniques have you used to de-risk designing a major system?


This is a modified post from the one that originally appeared at

Posted in Agile