Friday, June 8, 2012

Yellow Squad Weekly Retrospective Minutes: June 8, 2012


Introduction

What is this post?

I'm the lead for the "Yellow" squad in Canonical's collection of geographically distributed, agile squads.  We're directed to work as needed on various web and cloud projects and technologies.  Every Friday, our squad has a call to review what happened in the past week and see what we can learn from it.  We follow a simple, evolving format that we keep track of on a wiki.  This post contains the minutes of one of those meetings.

Why read it?

The point of the meeting, and of these minutes, is to share and learn.  We'd be happy if you do both of those.  You might be interested in our technical topics, in the problems we encounter, or in the process changes we try to make based on our successes and failures.

What are we working on right now?

Our current project is applying LXC virtualization to the 5+ hour test suite of the Launchpad web application.  By parallelizing the test suite across lightweight virtual machines on the same box, we've gotten the time down to under 40 minutes.  That's still not ideal, but it is a whole lot better.

Now read the minutes!

Attendance

Attending:  benji frankban gary_poster gmb
Apologies: bac
(These are freenode.net nicks)

Project plan

  • We are fixing bugs and, as hoped, we're back to getting a lot of green runs--even more than before.  Since Tuesday we have had a 74% pass rate.  Only 2 of the 35 recent runs had new failures, and those were pretty easy to fix.  Only bug 996729 remains from the legacy bugs.
  • We are still waiting on the two new 24-core machines in the data center before we can run the virtualized tests in production.  For now we continue to use a 32-core EC2 machine and Juju for our tests.
  • gary_poster has a crazy plan for converting our "ec2" command into a combination of smaller parts: lpsetup; a Juju Launchpad dev charm that uses lpsetup; a subordinate charm to let the Launchpad charm reuse previous Launchpad builds on EC2, saved on EBS snapshots, so the Launchpad charm can start faster; and a much smaller "ec2" command that's only responsible for merging, starting the test runner, and sending emails.  Watch for a proposal coming to a wiki near you!

Action Items

No action items from last week.

New tricks

frankban: the "fixtures" test fixture package.

If you have not investigated Robert Collins' Python test fixtures (Launchpad, PyPI), frankban recommends it.  This week frankban worked with a number of fixtures.  In particular, the FakeLogger fixture is very useful for ensuring that global logs are not printed to standard output (and frankban recently worked with Robert to improve it, with a release coming soon).  The environment variable fixture was very useful for fixing a test failure too--it sets specified environment variables at the start of a test and automatically and correctly restores them at the end.  Another nice feature of fixtures is that they can be combined easily.
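
Here's a minimal sketch of the kind of usage frankban describes, using testtools with the fixtures package (the test class, test names, and the LP_DEBUG variable are made up for illustration):

import logging
import os

import testtools
from fixtures import EnvironmentVariable, FakeLogger


class ExampleTest(testtools.TestCase):

    def test_environment_variable_is_restored(self):
        # EnvironmentVariable sets the variable for the duration of the
        # test and restores (or removes) the original value afterwards.
        self.useFixture(EnvironmentVariable('LP_DEBUG', '1'))
        self.assertEqual('1', os.environ['LP_DEBUG'])

    def test_log_output_is_captured(self):
        # FakeLogger intercepts logging so nothing leaks to stdout; the
        # captured text is available on the fixture afterwards.
        logger = self.useFixture(FakeLogger())
        logging.info('hello from the test')
        self.assertIn('hello from the test', logger.output)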

benji: debugging output streams

If you are trying to figure out what is being written to stdout, stderr, __stdout__, and __stderr__, benji recommends replacing those file objects with an object that tees output both to its normal destination and to a file.  The file can also include debugging information.  In benji's case, simply marking where each write began and ended was enough to solve his problem, because it made clear that a particular piece of output was a distinct unit; you could also include tracebacks to show which code is writing which messages.
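
A rough sketch of the kind of tee object benji means (the class name and the log file paths are illustrative, not his actual code):

import sys


class TeeFile(object):
    """Write everything to the real stream and also to a debug log."""

    def __init__(self, stream, log_path):
        self.stream = stream
        self.log = open(log_path, 'a')

    def write(self, data):
        # Mark each write as a distinct unit so the divisions are easy to
        # see; you could also log traceback.format_stack() here to record
        # which code produced each write.
        self.log.write('--- write ---\n')
        self.log.write(data)
        self.log.flush()
        return self.stream.write(data)

    def flush(self):
        self.log.flush()
        return self.stream.flush()


# Tee the streams you care about; the log paths are just examples.
sys.stdout = TeeFile(sys.__stdout__, '/tmp/stdout-debug.log')
sys.stderr = TeeFile(sys.__stderr__, '/tmp/stderr-debug.log')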

gary_poster: killing lots of lxc ephemerals is annoying, but I have a band-aid

lxc-start-ephemeral is a script (in Ubuntu's lxc packaging) that starts a temporary lxc instance.  It can use another lxc instance as a base, writing all filesystem changes to memory (via overlayfs).  That makes it very fast and effective for parallel work like ours, because filesystem I/O is not a bottleneck.

Right now there are some circumstances in which lxc-start-ephemeral will not shut down properly: it has signal handling, but for some reason it sometimes fails to clean up, and none of the standard lxc tools (lxc-stop, lxc-destroy) help.  When you have lots of these instances--we run 32 at a time right now--it can be quite annoying.

For us, then, this is the kind of thing we need to do (thanks to benji for refinements; mistakes are gary_poster's). [June 11, 2012: updated to correct three errors.]

find /var/lib/lxc/ -mindepth 1 -maxdepth 1 -name '*-temp-*' -printf '%f\n' | xargs -n 1 sudo lxc-stop -n
sudo umount /var/lib/lxc/*-temp-*/ephemeralbind /var/lib/lxc/*-temp-* /tmp/lxc-lp-*
sudo rm -rf /var/lib/lxc/*-temp-* /tmp/lxc-lp-*


frankban: what about fixing lxc-start-ephemeral to handle signals better?  gary_poster: yes; I'm not sure what's causing this and haven't gotten around to diagnosing it.  But look, I have a band-aid!

gmb & gary_poster: bisect and conquer for test isolation problems

One of the bigger sources of our problems in parallel testing is test isolation.  Launchpad has run its tests all together and in the same order for years: "tests pass" means that the suite passes when run collectively in that usual order.

The parallel testing project divides the tests up across processes, so the grouping is now variable.  To deliver a more robust testing system and discover test bugs faster, we went a step further and run the tests with --shuffle: a random ordering within the random grouping.

When changing ordering and grouping causes test failures, it's usually a sign of test isolation problems.  We've had to identify and fix a lot of those, so we've come up with a process to diagnose them.

The first step is to be able to identify what tests were run.  We worked with Robert Collins and Jono Lange to let our test parallelization tool, testr, include subunit tags that identify what tests are run together.  Our buildbot configuration includes the lists of what tests are run together, and when a test failure happens, the report includes the name of the list that ran the failed test.  You might notice that bugs we file generally include the associated test list (see bug 1010251, for example, which begins with a link to "worker-17"'s tests).

We also need to be able to run tests in the order that the testrunner had them.  Our testrunner can do that with ./bin/test --load-list, thanks to some changes bac made.

Now we are ready to bisect and conquer.  Here's our process.

  1. Does the test fail by itself?  If so, your test probably relies on some other test being run before it, and that's not the kind of isolation error we're talking about here.  If not, great: let's bisect.
  2. Actually, before we bisect, let's optimize and shorten the test run time.  Delete all the tests after the failed test from the test list.  Unless the future can affect the past, you won't need them.
  3. Now for one more optimization that's really specific to Launchpad: delete all the tests that are not in the same "layer" as the failing test.  Each layer provides shared, reused setup (such as memcache setup or database setup), and generally each layer runs in its own process.  Therefore, a failed test will generally only be affected by other tests in the same layer (more on that caveat later).  For now, go with it: we're only going to run tests in the same layer.  Run ./bin/test --load-list=YOUR_TEST_LIST --list-tests, where YOUR_TEST_LIST is the list you trimmed in step 2.  When you don't run with --subunit, the output includes layer names.  Find the first test in the last layer of the result--the first test that is in the same layer as the failing test--and delete all tests prior to it from your test list.
  4. Now start running ./bin/test --load-list YOUR_TEST_LIST with the first half of the list plus the last test (the failing test), and then again with the second half of the list plus the failing test.  If one of those runs fails, repeat this step on the failing half, making the list smaller and smaller until you've identified the test that triggers the failure.  Then go look at that other test and clean up the isolation problem.
  5. On the other hand, if you reach a point where neither half of the list triggers the failure (and the problem is not intermittent), you may have an N-way interaction: you must run three or more tests together to trigger the problem state.  I think our record is four non-isolated tests together triggering the failure in the final test.  In that case you'll need to divide the list into groups and bisect each group.
This process works well for us.  It's also scriptable. gary_poster might or might not be close to a rough Go version of a script that does this.
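
To make the idea concrete, here is a rough Python sketch of that bisect loop (gary_poster's script-in-progress is in Go; run_tests below is an illustrative wrapper around ./bin/test --load-list, not an existing tool):

import subprocess
import tempfile


def run_tests(test_list):
    # Write the tests to a temporary list file and run them in order;
    # return True if they all pass.
    with tempfile.NamedTemporaryFile('w', suffix='.list', delete=False) as f:
        f.write('\n'.join(test_list) + '\n')
        list_path = f.name
    return subprocess.call(['./bin/test', '--load-list', list_path]) == 0


def find_culprit(candidates, failing_test):
    # candidates: the tests that ran before failing_test, already trimmed
    # as in steps 2 and 3.  Returns the single test that triggers the
    # failure, or None if no single test does (an N-way interaction).
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        if not run_tests(first + [failing_test]):
            candidates = first
        elif not run_tests(second + [failing_test]):
            candidates = second
        else:
            return None
    if candidates and not run_tests(candidates + [failing_test]):
        return candidates[0]
    return None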


About the optimization in step three: it's not really safe, but it almost always works for us.  Possible reasons for it not working include filesystem changes, layers with real teardowns, and layers that don't have to change processes to start up.

benji: escaping from the lxc login

OK, this didn't actually happen at the meeting, but later I found out from benji how to escape from the lxc login.  For instance, if you use lxc-start to start an lxc container and then lxc-console to use it, logging out won't get you back to your shell: you'll just be challenged to log in again.  The trick is to press ctrl-a q (or, within screen, ctrl-a a q) when you are being asked to log in.

Successes

No innovations to learn from this week

Pain

gmb: -t and --load-list don't work together in Launchpad's bin/test

This is a bug.  Beware.  (Maybe we should file it!)

gary_poster: danger zone kanban cards

We regard any active kanban card that doesn't move for a day as a problem to be solved.  We had two of them this week (bug 996729 and bug 682772), and our efforts to move them failed.  Why?  benji and gmb: our testing environment made bug 996729 hard to work on--some failures only show up when the tests run under buildbot, and we don't know why.  We also made some mistakes that slowed us down.

Pair programming usually helps us move cards, but it didn't for these two.  Does the observing programmer need to keep some things in mind?  We agree that the observer should be actively skeptical and watch the other person's back: when something doesn't make sense, that's a trigger for both parties to step back and check assumptions.  ACTION: gary_poster will try turning these thoughts into a simple checklist.

benji: Timezone differences introduced some slowdowns and reduced the effectiveness of pairing.  I had no one to pair with after gmb's end of day, because everyone else was busy on other tasks.

We agree that we need to refine our process for "danger zone" cards that are not moving.  As before, if an active card does not move for 24 hours, we should apply problem solving at the morning meeting and encourage pairing.  However, if the card is still blocked after another 24 hours and pairing has been a problem, we need to pause at least one of the other active tasks in order to enable pairing/swarming on the problem.  benji: a checklist for the morning meeting would help us follow this process, and it could include the "convene a panel" pattern we discussed last week.  ACTION: gary_poster will do this.

frankban: after we've successfully completed a card that went into the "danger zone", we should share what we learned with a mini-postmortem.  That can go into the morning meeting checklist too.

gary_poster: slowdowns caused by working across teams

We've had two kanban cards waiting for months on SpamapS to have time to finish them.  One needs him to sponsor python-shelltoolbox into Ubuntu, and the other needs him to package the Python charm helpers we provided (and which depend on python-shelltoolbox).  He is very busy.  We've been bothering him about it every couple of weeks.  What could we have done better to get this to happen?

gmb: we could have taken this to canonical-tech, and could still. gary_poster: yes, but these were specifically about the charm helpers, and SpamapS owns them and has certain requirements for them (in particular, the one charmhelpers project makes several packages, one for each language). gmb: maybe we should have gone to the Juju team rather than SpamapS explicitly? gary_poster: he is the charmhelpers guy. gmb: yes, but then juju team could schedule/prioritize it within their own goals, and maybe also work together. gary_poster: so is the lesson that we should never ask any single person to do something? gmb: no... but we need something new. Timezone differences also make pings very difficult. 

Perhaps when working across teams we should ask for an estimated delivery date and schedule a call for that date.  If the delivery doesn't happen, then on that call we should ask for three things: a revised delivery date, another call scheduled for the new date, and a plan to try something else if the second delivery date also slips.  ACTION: gary_poster will convert this into an experimental checklist for how to handle inter-team requests.

gary_poster: checklists or flowcharts?

The checklists that we discuss seem like flowcharts, not checklists. benji: keep them as checklists to keep them loose.
