Friday, June 1, 2012

Yellow Squad Weekly Retrospective Minutes: June 1, 2012

Introduction

What is this post?

I'm the lead for the "Yellow" squad in Canonical's collection of geographically distributed, agile squads.  We're directed to work as needed on various web and cloud projects and technologies.  Every Friday, our squad has a call to review what happened in the past week and see what we can learn from it.  We follow a simple, evolving format that we keep track of on a wiki.  This post contains the minutes of one of those meetings.

Why read it?

The point of the meeting, and of these minutes, is to share and learn.  We'd be happy if you do both of those.  You might be interested in our technical topics, in the problems we encounter, or in the process changes that we try to follow based on our successes and failures.

What are we working on right now?

Our current project is applying LXC virtualization to the 5+ hour test suite of the Launchpad web application.  By parallelizing the test suite across lightweight virtual machines on the same box, we've gotten the time down to under 40 minutes.  That's still not ideal, but it is a whole lot better.

Now read the minutes!


Attendance

Attending: bac benji frankban gary_poster
Apologies: gmb
(These are freenode.net nicks)

Project plan

  • We are waiting on the two new 24 core machines in the data center to actually run the virtualized tests in production.  We continue to use a 32 core EC2 machine and Juju for our tests now.
  • We are fixing big bugs and we have a very bad record for passing runs lately, but it feels like we're about to return to green runs better than before.

New tricks

frankban: ``apt-get install stress``

The stress command is a cool debugging tool that can stress load average, memory, CPU, and disk writes.  It is good for seeing how testrunner works on a loaded machine, and frankban used it to examine some timeout bugs in our JavaScript tests.

Be careful with arguments because you can freeze your machine.  You can mitigate the risk in two ways.  First, you can limit the stress to only a certain amount of time, like 20 seconds.  Second, you can use a dry run.
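As a rough sketch of those two mitigations, here is a small Python wrapper that always passes a time limit and defaults to a dry run.  The helper names (stress_command, run_stress) and the default values are made up for illustration.  The --cpu, --timeout, and --dry-run options are the ones listed in stress's help output as far as we know; check ``stress --help`` on your machine before relying on them.

```python
import shutil
import subprocess


def stress_command(cpu_workers=2, timeout_seconds=20, dry_run=True):
    """Build the argument list for a bounded stress run.

    Hypothetical helper: it always sets a timeout, and it defaults to a
    dry run so nothing is actually loaded until you opt in.
    """
    argv = [
        "stress",
        "--cpu", str(cpu_workers),
        "--timeout", "%ds" % timeout_seconds,
    ]
    if dry_run:
        # Ask stress to show what it would do without doing it.
        argv.append("--dry-run")
    return argv


def run_stress(**kwargs):
    """Run stress with the arguments built above, if it is installed."""
    if shutil.which("stress") is None:
        raise RuntimeError("stress is not installed (apt-get install stress)")
    return subprocess.run(stress_command(**kwargs), check=True)
```

Calling run_stress() with no arguments performs only the dry run; run_stress(dry_run=False) would load two CPU workers for 20 seconds and then stop.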

This tool can also help identify bugs that are actually testing real elapsed time, which is not reliable in a test environment.  That leads to our next topic...

frankban: Don't test actual elapsed time.

When writing a test that needs to look at elapsed time, test this virtually, rather than with real wall-clock time.  You could use a mock or factor your code to get the test behavior you need without actually looking at the clock.  Never assume anything will be a certain time--the amount of time between one line in your file and the next cannot be relied upon.  Benji: time hates us all.
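One way to do this, sketched in Python, is to pass the clock into the code under test so that a test can substitute a fake one.  RateLimiter and FakeClock are hypothetical names invented for this example; they are not from our code base.

```python
import time


class RateLimiter:
    """Hypothetical code under test: allow one action per `interval` seconds."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self._clock = clock  # injected, so a test can supply a fake clock
        self._last = None

    def allow(self):
        now = self._clock()
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            return True
        return False


class FakeClock:
    """Test double: time only moves when the test says so."""

    def __init__(self, start=0):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def test_rate_limiter():
    clock = FakeClock()
    limiter = RateLimiter(interval=10, clock=clock)
    assert limiter.allow()       # the first call always passes
    assert not limiter.allow()   # no virtual time has elapsed
    clock.advance(9)
    assert not limiter.allow()   # still inside the interval
    clock.advance(1)
    assert limiter.allow()       # exactly 10 virtual seconds have passed
```

The test never sleeps and never reads the real clock, so it behaves the same on an idle laptop and on a heavily loaded machine.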

benji: termbeamer beta test is going well 

Termbeamer (a secure way to share terminal sessions over Jabber, such as GTalk) is still going well.  We had a productive handoff thanks to it this week. No new features this week.  [Message from editor: try it!  It's particularly easy to try on Ubuntu, but should work fine on other systems with Python too.]

benji: checkmarkable.com is available

Robert Collins gave us access to checkmarkable.com.  Have any of us used it? gary_poster: I tried it for the checklist for this meeting and found it wanting for this use case: it seems to be only for simple checklists, a la CHR tasks.

bac: Anyone else trying to use HP cloud?

Brad tried to sign up for HP cloud.  He's waiting on an answer back from Jorge and then plans to try out our EC2 tasks on the HP cloud.

Successes

gary_poster: lpsetup LEP discussion led to positive changes

We will have a better API for the slack project after working on the LEP.  benji: It also may have been useful as a marketing/informational tool.  Engagement with Robert was a particularly helpful aspect of the process.  (We then discussed bzr lightweight checkouts and colocated branches to make sure everyone knew what they were and how to use them.)

gary_poster: try getting unstuck by convening a panel

We had two problematic bugs this week: bug 996729 and bug 1007111.  They were very similar to bug 1004088 from last week.  In all three cases, the person or people working on the bug were stuck.  When one particular person saw the bug (benji, bac, and gmb, respectively) that person had an insight within the first few minutes that unstuck the problem.  These were big successes.

We agreed that this chain of success might suggest a process: getting unstuck by convening a panel. When you are stuck, convene a panel of the entire squad.  Present the base problem as broadly as possible first, to try and get different approaches from what you are doing.  If that fails, show where you are stuck now.  Constrain time to 15 or 30 minutes max, and if there's more than two or three minutes of silence, stop.

This works nicely with our policy to regard any kanban card that hasn't moved for a day as a problem to address in our daily call.  We can then convene the panel on the problem at that time, when we are already together.

Convening a panel does not replace collaboration, but augments it. Collaboration should still be used before and after the panel.

We'll give this a try for a bit and keep using the tool/process if it seems useful.  We did not establish an official experiment, though we really should set an evaluation time.

Pain

gary_poster: gary blocked the board

gary_poster blocked the kanban board with an in-progress card for days, and then the resolution was “delete the card! We are not doing it after all! Ha ha ha. Um.” Squad lead responsibilities blocked him, and other problems took precedence over that card.  Suggestion: gary must collaborate when he starts a card.  benji: could have own kanban board lane, but probably not as good a resolution.

benji: always ping when the hangout url is ready

On IRC, when we are going to have a Google hangout, always ping everyone when the URL is posted.

frankban: Late-in-project redesign for lpsetup was painful

It would have been better if lpsetup's LEP had started earlier, so that we would not have had to do rework.  The project was almost ready to be released.  We redirected/redesigned the project at the end again. gary_poster: agree, but it is a lesson we discussed last week: when you work on something that the larger team maintains, do a LEP early.  We didn't do that this time, and paid the price.

gary_poster: The card for bug 996729 is not moving

We decided that this had already been discussed sufficiently.
