Monday, July 2, 2012

Yellow Squad Weekly Retrospective Minutes: June 29

Introduction

What is this post?

I'm the lead for the "Yellow" squad in Canonical's collection of geographically distributed, agile squads.  We're directed to work as needed on various web and cloud projects and technologies.  Every Friday, our squad has a call to review what happened in the past week and see what we can learn from it.  We follow a simple, evolving format that we keep track of on a wiki.  This post contains the minutes of one of those meetings.

Why read it?

The point of the meeting, and of these minutes, is to share and learn.  We'd be happy if you do both of those.  You might be interested in our technical topics, in the problems we encounter, or in the process changes we make based on our successes and failures.

What are we working on right now?

Our current project is applying LXC virtualization to the 5+ hour test suite of the Launchpad web application.  By parallelizing the test suite across lightweight virtual machines on the same box, we've gotten the time down to under 40 minutes.  That's still not ideal, but it is a whole lot better.

Now read the minutes!

Attendance

Attending: bac benji frankban gary_poster.
gmb is unavailable, though he shared notes for one item.
(These are freenode.net nicks)

Project report

Status

gmb was unavailable this week, though he still worked on a related project, as noted below.
  • We've twice gotten as high as a 97% success rate over a rolling three-day period of test runs, and once as low as 83%.  Our goal is 95% or higher.
  • We had one new failure, which occurred once: networking did not start within our one-minute timeout in the (non-ephemeral) LXC container we use for initially building the code before the tests are run.  It worked fine before and after the failure on the same machine, and we don't know how to investigate further.  It smells a bit related to bug 1014916, mentioned last week as getting a fix, but the symptoms are somewhat different, and the symptoms described in bug 1014916 have not recurred since we instituted the workaround.
  • As in past weeks, we saw a few instances of timeout-related failures from bugs 974617 and 1011847, and one timeout-related failure from bug 1002820.  For the first pair, we asked stub and lifeless for assistance and opinions.  stub mentioned increasing the timeout again, but since we are already at what we perceive to be a generous three-minute timeout, we have not pursued that yet.
  • The two previous bullet points describe the entirety of the failures we encountered.
  • Working on a sprint with the blue squad (jam, jelmer, vila, and mgz) this week in Amsterdam, gmb has led an effort to get the parallel testing story working with Launchpad running on Python 2.7 in Precise LXC containers (it currently runs on Python 2.6 in Lucid LXC containers, matching the production configuration in the data center).  On Friday, gmb and the blue squad made a breakthrough on this front: they fixed a bug in zope.testing and got the tests fully running in the 2.7/Precise environment.
  • We are still waiting to hear from IS whether the two new 24-core machines that will actually run the virtualized tests in production have arrived.
  • We proposed an approach to configure the two production machines, leveraging lpsetup, and have not yet heard back from IS.
  • We have made progress on the refactoring of lpsetup.  In particular, we committed initial versions of the inithost and initlxc commands.
  • Led by bac, and thanks to help from James Westby and Diogo Matsubara, we now have tarmac running tests and managing commits to our main lpsetup branch.  bac also set up tarmac to run tests and manage commits for our zope.testing fork.
Also, this announcement is a week late, but thanks to gmb, Launchpad now has a screencast on fixing bugs.  Take a look!

Goals for next week

We've added this section to the Friday call and minutes in order to eliminate the biweekly status emails I was producing.  The "goals" section was the only part of those emails that was not duplicated elsewhere and that we deemed important.

frankban will be unavailable all of next week, and bac, benji, and gary_poster will be unavailable on Wednesday.
  • Continue running parallel tests on the EC2 32 core machine and aggregating results.
  • Make another attempt on at least one of bugs 974617/1011847 and 1002820.
  • Land initial and usable versions of the remaining lpsetup commands: get, update, and inittests.
  • Package and use a refactored version of lpsetup for our parallel testing setup, to validate that it still works there.
  • Agree with IS on an approach to configuring the two new production machines.

Action Items

  • ACTION: gary_poster will make a kanban card to create a first cut at smoke tests for all of our subcommands.
    COMPLETED. gary_poster made the card, and benji and frankban subdivided and completed it.
  • ACTION: gary_poster will make a kanban card to make the tarmac gatekeeper enforce our test coverage expectations (the code review of the enforcement will also include the discussion as to what we are enforcing as a first cut).
    COMPLETED. gary_poster made the card and bac completed it (see "Successes" for a more complete discussion).  This ended up not involving a code review, and so the test coverage conversation did not happen.
  • ACTION: gary_poster will create a slack card for investigating integration test approaches.  If someone works on this in slack time and shows us a way forward, we'll open this conversation again.  Until that point, or until we successfully release lpsetup for developer usage, integration tests are postponed and effectively discarded.
    COMPLETED.  gary_poster made a card.  frankban and gary_poster also discussed how to do this because of some manual testing on ec2 that they both had done.  They proposed a way forward.  See "Problems" for discussion.
  • ACTION: bac will research how to get and integrate tarmac resources (a testing machine) for a project.  He will first consult with matsubara about this.  The results will be new/improved documentation on how to get tarmac deployed for a project, and/or information on what it would take to make this easier.
    COMPLETED. bac documented how to do this with James Westby's puppet scripts and Canonistack here: 
    https://dev.launchpad.net/yellow/TarmacOnCanonistack.  Diogo Matsubara's current approach was very nice but required us to have access to the QA lab.  We requested access from IS and have not heard back.  James' approach let us move forward quickly.  He is reportedly working on a Juju solution to replace the Puppet scripts, and we'll be interested in that when it is ready.  This action item came from concerns about Launchpad's zope.testing fork.  bac also integrated tarmac with our zope.testing fork as part of this effort, to gate landing code with running tests.

New tricks

gary_poster: how can code determine if it is being run within an LXC container?

gary_poster asked Serge Hallyn if there were a reliable way for code to determine if it is being run within an LXC container.   Serge said yes, and gave these steps (note that this may be Ubuntu-specific; Ubuntu-specific is good enough for us right now).
  • if /bin/running-in-container is present (always true on Precise and above), run it and check for a 0 return value
  • else, if lxc-is-container is not present, assume lxcguest is not installed and you're not in a container (or are in a trimmed container)
  • else, run lxc-is-container: if it returns 0, you're in a container; if 1, you're not

gary_poster translated that into this Python code, which seems to work everywhere he's tried it so far.

import subprocess, errno
def running_in_container():
    # 'running-in-container' is Precise and greater; 'lxc-is-container' is
    # Lucid.  These are provided by the lxcguest package.
    for command in ('running-in-container', 'lxc-is-container'):
        try:
            return not subprocess.call([command])
        except OSError as err:
            # ENOENT means "No such file or directory."
            if err.errno != errno.ENOENT:
                raise
    return False

Someone else on the #ubuntu-server freenode channel also recommended https://github.com/kwilczynski/facter-facts/blob/master/lxc.rb for ideas, which gary_poster passes on without having given it much more than a glance so far.

benji: use -s flag to combine nosetests and pdb

benji discovered that nosetests eats stdout by default, which is not terribly helpful if you want to use pdb.  Use nosetests' -s flag for great justice.
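
For example, with a breakpoint dropped into a test (a minimal sketch; the test and module names here are made up):

import pdb

def test_something():
    pdb.set_trace()  # drop into the debugger when the test runs
    assert 1 + 1 == 2

Running plain nosetests swallows the pdb prompt along with the rest of stdout; nosetests -s test_example.py leaves stdout alone, so you can actually see and interact with the debugger.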

bac: nosetests will not discover test modules if the execute bit is set

See title.  bac found this surprising.  nosetests --help gives the workaround and explanation.
  --exe                 Look for tests in python modules that are executable.
                        Normal behavior is to exclude executable modules,
                        since they may not be import-safe [NOSE_INCLUDE_EXE]
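
So there are two ways out (the file name below is hypothetical): clear the execute bit, or pass the flag.

chmod -x test_foo.py    # clear the execute bit so nose will discover the module
nosetests --exe         # or tell nose to consider executable modules anyway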

gmb: beware: lpsetup probably shouldn't overwrite your SSH keys with nonsense

gmb pointed out (via a pre-recorded note) that it is possible, and arguably too easy, to make lpsetup overwrite your SSH keys with nonsense.  Admittedly, what he did was a mistake, but still.  We already have a card for making an interactive installation story for lpsetup, but this is worth its own bug (https://bugs.launchpad.net/lpsetup/+bug/1018823) and kanban card.

bac: beware: multiple running gpg-agents are bad. Years-old login files (.profile, .bashrc) are bad.

It's pretty common to be warned that you should not have multiple running gpg-agents.  What you might not realize is that the login files you accrete across distribution upgrades over the years can each start a gpg-agent, leaving several running without you noticing.  bac didn't notice.  Beware!  I suppose the lesson to be learned is that you should carefully review your login files after each distribution upgrade?
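
A quick way to check for the problem (a minimal sketch, not from the meeting; it assumes pgrep is available):

import subprocess

# pgrep prints one PID per matching process and exits non-zero on no match.
proc = subprocess.Popen(['pgrep', '-x', 'gpg-agent'], stdout=subprocess.PIPE)
pids = proc.communicate()[0].split()
if len(pids) > 1:
    print 'warning: %d gpg-agents are running: %s' % (
        len(pids), ', '.join(pids))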

Successes

bac: a fast-to-deploy, repeatable tarmac story

In last week's meeting, we identified some costly mistakes: we had committed failing tests to two projects, lpsetup and Launchpad's zope.testing fork.  We wanted to make that impossible via automation.  tarmac is a merge manager for Launchpad that can run tests before it merges, and it is widely used at Canonical.  We wanted to use it for this automation.

bac took on this task and made excellent progress: tarmac now gates both projects, automatically merging from approved merge proposals unless the merged code fails a test run.  

A primary source of that success was incrementalism--making incremental steps towards the goal, in order to bring value as quickly as possible.  bac brought value quickly by choosing a solution that could be immediately available (Canonical's internal OpenStack cloud resources), rather than the alternative, which requires IS to get around to giving us access to new resources.  The solution also does not use Juju, which we would have preferred; but waiting on the Juju charm to be written would not have brought value as quickly.  We should be able to migrate to a Juju charm when it is available, but meanwhile, we have something working and bringing value now.

Another important source was communication and company sharing.  We published our meeting minutes last week, communicating our needs and plans.  James Westby read them, and offered to share his solution.  James and bac coordinated, and James' solution was the quick-to-deploy one that we have now. That's a big validation for us of the effort we are making to share these minutes!  It's also a big cause for a thank you to James.  Thanks!

bac took the communication idea two steps further.  

Pain

benji: decoy doctests

lpsetup has some docstrings that have examples in them.  The examples look suspiciously like doctests, and benji thought they were.  This caused him some confusion, because the examples were a decoy: they looked like doctests, but they were not hooked up to actually run in the test suite.  The examples are instead rewritten as unit tests in the normal test suite.

Could we either remove the docstring tests or hook them up?

gary_poster: which should it be, removal or test suite integration?  If the examples in a docstring are good for documentation, we should keep them and make them run in the test suite.  Even if the examples are only moderately valuable as examples, they can also effectively provide a warning system for when the docstring's prose needs to be reviewed.
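
Hooking them up is cheap, for the record.  A minimal sketch, assuming a hypothetical lpsetup.commands module holding the examples and a runner that honors the test_suite() convention (zope.testrunner does):

import doctest
import unittest

import lpsetup.commands  # hypothetical module with example-bearing docstrings

def test_suite():
    # DocTestSuite turns each docstring example in the module into a test
    # case, so the examples fail loudly when the code drifts.
    suite = unittest.TestSuite()
    suite.addTest(doctest.DocTestSuite(lpsetup.commands))
    return suite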

[Editor: We didn't talk about it much, and didn't come to a strong resolution, but we are generally preferring removal at this time as a matter of practice.]

benji: chafing on frameworks

lpsetup is a fairly small package, but it also works as its own small framework.  In order to implement the subcommand pattern (in which you can run "[COMMAND] [SUBCOMMAND] [options]", like "lpsetup inithost" or "lpsetup initlxc" or "lpsetup get"), the main "lpsetup" command calls the subcommands (e.g., "inithost," "initlxc," etc.).  Therefore, when you write a subcommand, you are experiencing inversion of control, which is a primary characteristic of a framework.  Moreover, the subcommands are generally created by subclassing and extending a subcommand base class, which is another pattern typical of a framework.
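
The shape of the pattern, as a minimal sketch (class and command names are illustrative, not lpsetup's actual code; argparse is assumed, which is in the standard library from Python 2.7):

import argparse

class SubCommand(object):
    # Base class: the framework calls into subclasses (inversion of control).
    name = None

    def add_arguments(self, parser):
        pass  # subclasses add their own options here

    def run(self, args):
        raise NotImplementedError

class InitHost(SubCommand):
    name = 'inithost'

    def run(self, args):
        print 'initializing host...'

def main(argv=None):
    parser = argparse.ArgumentParser(prog='lpsetup')
    subparsers = parser.add_subparsers()
    for cls in (InitHost,):
        sub = subparsers.add_parser(cls.name)
        command = cls()
        command.add_arguments(sub)
        sub.set_defaults(run=command.run)
    args = parser.parse_args(argv)
    # The framework, not the subcommand author, drives execution from here.
    args.run(args)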

benji has been burned by frameworks, and prefers libraries.  For lpsetup, the framework is small and malleable enough that the annoyances encountered have been only minor, but in the future he would prefer to avoid inversion of control entirely unless it is truly called for.  His examples of reasonable inversion of control: select-loop code like Twisted, UI toolkits like GTK, and URL dispatch like Ruby on Rails ("RoR").

frankban: isn't the lpsetup approach really a similar pattern to RoR URL dispatch? 
benji: maybe.  I'm not that worried about lpsetup.  In fact...

benji: ...gary_poster asked me to talk about this, after I mentioned the thought to him. Why, gary_poster? 
gary_poster: our squad has had pain in the past with developers rewriting other developers' code.  A concrete example is our lp2kanban project (https://launchpad.net/lp2kanban), which pushes Launchpad bug state to LeanKitKanban.  One developer switched us from another developer's functional approach to an object-oriented approach.  If we can roughly agree on design goals initially, that should reduce rework, and maybe reduce friction.

gary_poster: Another point is that this appears to be a particular problem for slack projects done by an individual that become team-maintained projects--at least, we have two data points in this direction, lpsetup and lp2kanban.  We have already realized that slack projects need to be analyzed for their future maintenance expectations when they are first proposed, and this is further confirmation of that.

benji: how can we agree on a design productively?
bac/benji: we want to encourage autonomy and avoid design-by-committee.
gary_poster: I think our prototype process can address this.  Our (incredibly simple) checklist about this says that a person or two should prototype, and then we all come together to discuss and agree on what the rewrite should look like, and then we actually write the code with TDD.
bac/benji: If a developer has an issue with the design after we've discussed and agreed, our default stance should be "When in Rome...": follow the existing design.  A corollary for us might be that "if you want to rebuild Rome, ask the citizenry first": we should only rewrite if we have built consensus.

gary_poster: updating lpsetup's system workarounds over time

One of the goals that Robert gave us for the lpsetup project was that we would be able to run it again in the future and have it update a system to remove workarounds that were no longer necessary and add newly discovered workarounds.  Our preexisting code did not attempt this, and we have not tried to code this yet.  gary_poster made a strawman proposal about how to do this in code.  What do we think?

benji: this sounds a lot like Debian packages.  It's a hard problem, and packaging systems have been working on it for a long time.  Maybe we should have a Debian package that manages our workarounds for an LXC host, and one that manages our workarounds for an LXC container?  We all think benji might be rather clever.  ACTION: gary_poster will try to arrange a time to discuss this with Robert.
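
To make the problem concrete (this is an illustrative shape only, not gary_poster's actual strawman): one option is a registry of named workarounds, each knowing how to apply and revert itself, so that a later run can re-apply current workarounds and revert retired ones.

class Workaround(object):
    # Hypothetical base class.  Retired entries stay registered so a
    # future run can still revert them.
    name = None
    retired = False

    def apply(self):
        raise NotImplementedError

    def revert(self):
        raise NotImplementedError

def update(workarounds):
    for workaround in workarounds:
        if workaround.retired:
            workaround.revert()
        else:
            workaround.apply()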

gary_poster/frankban: integration tests for lpsetup, take 2

Last week we talked about how we might make integration tests for lpsetup, and resolved to create a slack card to investigate.  In the course of doing manual integration tests, gary_poster gathered some information that might help automated tests.  frankban had already done similar work. gary_poster and frankban discussed it.  gary_poster recorded notes from the discussion and made a simple proposal for a way forward (https://lists.launchpad.net/yellow/msg00971.html).  Any comments?

No comments.  ACTION: gary_poster will make a kanban card for developing an integration test suite that works in the way described for the first (manual run) step.
