Monday, June 18, 2012

Yellow Squad Weekly Retrospective Minutes: June 15

Introduction

What is this post?

I'm the lead for the "Yellow" squad in Canonical's collection of geographically distributed, agile squads.  We're directed to work as needed on various web and cloud projects and technologies.  Every Friday, our squad has a call to review what happened in the past week and see what we can learn from it.  We follow a simple, evolving format that we keep track of on a wiki.  This post contains the minutes of one of those meetings.

Why read it?

The point of the meeting, and of these minutes, is to share and learn.  We'd be happy if you do both of those.  You might be interested in our technical topics, in the problems we encounter, or in the process changes we make based on our successes and failures.

What are we working on right now?

Our current project is applying LXC virtualization to the 5+ hour test suite of the Launchpad web application.  By parallelizing the test suite across lightweight virtual machines on the same box, we've gotten the time down to under 40 minutes.  That's still not ideal, but it is a whole lot better.

Now read the minutes!

Attendance

Attending: the gang's all here (bac benji frankban gary_poster gmb)
(These are freenode.net nicks)

Project plan

  • We have fixed all the critical bugs we know about for the parallel testing project, except for recurring problems with bugs 994752, 1011793, and 1002820.  See discussion in "Pain" below.
  • Our lxc workers are completing at very different times--a difference of more than 10 minutes, typically.  If we can fix this, perhaps we could save another five minutes (of the roughly 38 we currently spend on our 32-core testing EC2 machine).  See discussion in "Pain" below.
  • We are still waiting on the two new 24-core machines in the data center before we can run the virtualized tests in production.  In the meantime, we continue to use a 32-core EC2 machine and Juju for our tests.
  • We are moving on to our stretch goals.  lpsetup gets first billing, and hopefully we'll also have time for a Juju Launchpad developer charm and a revised "ec2 test" command.

Action Items

Gary was supposed to add a number of checklists for us to follow, about working across teams, running the morning meeting, and working as the observer in pair programming.  This is done: all of them are started, and listed with our other checklists, here.

New tricks

gmb: Beware: zope.testing will occasionally swallow ImportErrors

If zope.testing reports not being able to find a test, try importing your test module in the Python interpreter: the real problem may be an ImportError that zope.testing swallowed.  gmb suggests that the new tagline for zope.testing is "Likes to mess with you!"
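For example, a quick interpreter session (the module name and the missing dependency here are hypothetical) surfaces the swallowed error immediately:

    >>> import lp.services.foo.tests.test_foo
    Traceback (most recent call last):
      ...
    ImportError: No module named missing_dependency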

benji: Beware: poweroff in an lxc container might power off your host

You need to be very careful when calling poweroff in your lxc container, because the shell you are typing into might not be your lxc container but your host!  gary_poster: I've thought it might be nice to have an lxc-poweroff installed in the container for just this reason.  It would not exist on the host, and in the container it would simply be an alias for, or link to, poweroff.
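As a rough sketch of that idea (our own illustration, not anything lxc ships; the script name and the environment check are assumptions), a small guard installed only inside containers could refuse to run on the host:

    #!/usr/bin/env python
    # Hypothetical "lxc-poweroff" guard: only call poweroff if this machine
    # looks like a container, so typing it on the host does nothing.
    import subprocess
    import sys

    def in_container():
        # lxc starts the container's init with a "container" variable in its
        # environment, so peeking at /proc/1/environ is one rough check.
        try:
            with open("/proc/1/environ", "rb") as environ:
                return b"container=" in environ.read()
        except IOError:
            return False

    if in_container():
        subprocess.call(["poweroff"])
    else:
        sys.exit("Refusing: this looks like the host, not an lxc container.")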

benji: Beware: if you poweroff ec2 you might break your next juju run

If you poweroff your ec2 host (see the previous warning), juju may not work properly the next time you start a new environment: one of your machines may hang in "pending" forever.  It's far from obvious, but this is caused by juju trying to reuse a security group name that is still attached to your old, non-terminated ec2 instance.  You need to terminate that old instance (e.g. in the AWS EC2 console) so that the next juju startup won't hang in "pending" forever.
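If you would rather script the cleanup than click around the console, something along these lines should work with the boto library (a sketch only; the region, and the assumption that juju's security group names start with "juju-", are ours):

    # Hypothetical cleanup script: terminate any non-terminated EC2 instances
    # still attached to a juju security group so the group name can be reused.
    # Assumes AWS credentials are in the usual boto locations.
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")  # assumed region
    for reservation in conn.get_all_instances():
        group_names = [group.name for group in reservation.groups]
        if not any(name.startswith("juju-") for name in group_names):
            continue
        for instance in reservation.instances:
            if instance.state != "terminated":
                print("terminating %s (%s)" % (instance.id, ", ".join(group_names)))
                instance.terminate()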

frankban: Beware: change working directory before imports

Launchpad's setup code did a lot of imports, including docutils, and then changed the working directory.  That broke things, because docutils sets global state at import time based on the current working directory.  frankban fixed this by changing the working directory before doing the imports.  benji: two corollaries are "global state hates you" and "don't do anything at module import time."
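A minimal illustration of the trap and the fix (the directory here is a hypothetical example):

    import os

    # Wrong: importing docutils first captures state relative to the old
    # working directory, and the later chdir does not update it:
    #
    #     import docutils.core
    #     os.chdir("/srv/launchpad")   # too late; import-time state already set
    #
    # Right: change the working directory first, then import.
    os.chdir("/srv/launchpad")          # hypothetical target directory
    import docutils.core                # the import now sees the directory we want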

Successes

gary_poster: Look at how well our test runs are doing!

We had a recent record-breaking 13 passing runs in a row, which got us back above the 90% success rate; and we only have a few bugs left.  Let's take a second and be happy about it!  benji: learning about lxc and ec2 and ebs has been a lot of fun, and it has been nice to be working on and helping with code and tools that are essential to the company.  gmb: agrees.

Pain

benji: What can we learn from bug 996729: Global state hates you, and stdout is global state

[Editor's note: The kanban card for bug 996729 took more than a day to move out of coding, so it automatically became a topic for the weekly call, per our checklist.]

This bug was about making sure that the Launchpad testrunner's subunit-generation code is the only code that controls stdout and stderr when tests are run in subunit mode (--subunit).  This is important because the testrunner must produce a parsable subunit stream, and random junk can pollute it and break subunit consumers.  Getting it to work properly has taken more than a week.  Why?

Global state hates you.  stdout and stderr are global state, and code everywhere expects to be able to write to them, and manipulate them.  For example, zope.testing uses stderr as a back channel for the child process (running a test layer) to talk to the parent process (aggregating the results).  Because running tests involves arbitrary code, that's very unsafe. Another example that we encountered is forking: remember that forking copies global state, including things you might not realize like the Python buffer of stdout and stderr.  That one was a bit painful to debug.
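Here is a toy example of that forking gotcha (our own illustration, not Launchpad code): when stdout is a pipe or file it is block-buffered, the fork copies the buffer, and without the flush the first line is written twice.

    import os
    import sys

    print("written before the fork")    # if stdout is a pipe or file, this
                                         # line may still be in the buffer...
    sys.stdout.flush()                   # ...and the fork copies that buffer,
                                         # so flush it or both processes will
                                         # emit the line when they exit

    pid = os.fork()
    if pid == 0:
        print("child")
        sys.exit(0)
    os.waitpid(pid, 0)
    print("parent")

Run it as "python demo.py | cat" with the flush removed to see the first line printed twice.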


The Python "mutable" and "pretend I'm immutable" variants of stdout and __stdout__ are no solution either.  We had to try to assert priority: noone gets to write to stdout except our code...so we ended up having to mutate sys.__stdout__ too.  Does that mean that there should be an ___stdout___ and ___stderr___?  When does it end?  (We briefly jokingly suggest names "dunder" for double under, "tunder" for triple under, and so on.)  As an associated point, don't assume __stdout__ and __stderr__ are the things that were in stdout and stderr before you started mucking with them.  Who knows what some crazy other code has done with those globals?


gmb: If you muck with stdout and stderr, we on the yellow squad will get paranoid and run around with shotguns.


benji/gmb: testr's expectation that subunit will own stdout is problematic from our perspective: ssh combines stderr and stdout, stdout is used for many things in arbitrary code, and it is even required for things like pdb.  We talked about other options but stuck with this approach because Robert preferred it, we understood his point, and there was no guarantee that the alternatives, like giving subunit its own channel, would be any less problematic within the large scope of our project.

bac: What can we learn from bug 682772: Don't go on vacation with an active card

[Editor's note: The kanban card for bug 682772 took more than a day to move out of coding, so it automatically became a topic for the weekly call, per our checklist.]

Don't leave for vacation when you have an active card, unless you've explicitly handed it off.

gary_poster: Increased timeouts are not enough

We still see the LXC startup timeouts that were supposed to be solved by the fix for bug 994752, which increased timeouts and made some code more efficient.  On our test EC2 machines, we have increased the timeout to as much as 10 minutes and still, rarely, seen startup timeouts.  Bug 1002820, and possibly bug 1011793, have not been solved by increasing timeouts.  They continue to occur--rarely, but frequently enough to keep us from achieving our goal statistics.


A simple thing to try is to decrease the number of LXC instances we run at once, using a hack until bug 957145 is addressed [Editor's note: which it has been now--thanks Robert!].  If total test time starts increasing, maybe that's too little parallelization.  If tests get better passing stats while test time stays about the same, maybe we've found some kind of ideal.  But this takes a long time to determine, and it doesn't dig into the underlying problem.  What else can we do to investigate and solve it?


benji: Bug 1011793 may be solved by restoring a missing log directory.  To investigate the rest, try using iostat and vmstat, and pidstat -T for each lxc container.

[Editor's note: since the call, we have done some of this investigation, and have some interesting results.  We intend to do a bit more data gathering and then file a bug for our fine LXC engineers to explore; and we think that bug 1002820 is a separate issue.]

gary_poster: The discrepancy between worker completion times is really costing us

Often the first LXC container in a parallel test run finishes its tests 12 or 13 minutes before the last container in the same run.  Even given a three-minute timeout for LXC starts (see the previous item), that's a big discrepancy.  Why is it happening?


We agreed it was worth filing a card for, but did not have any immediate insights.  We briefly lamented that we could not feed tests to workers dynamically, just in time; instead we have to decide which tests each worker will run at the start of the test run.
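To make the distinction concrete, here is a toy sketch (our own illustration, not the actual testrunner code) of the static partitioning we do today; a dynamic scheme would instead have each worker pull its next test from a shared queue as it finishes:

    def partition(tests, worker_count):
        """Split the test list into one fixed bucket per worker, up front.

        One unlucky bucket of slow tests makes its worker finish long after
        the others, which is roughly the discrepancy we are seeing.
        """
        buckets = [[] for _ in range(worker_count)]
        for index, test in enumerate(tests):
            buckets[index % worker_count].append(test)
        return buckets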


[Editor's note: investigating this problem later in the day led to us filing bug 1013921, which might help significantly and shave off as much as five minutes from the total test time.]

