On monotone selectors

This is the first post in a small series showing off some of the new functionality you can expect in the next major version of monotone. While there is no fixed release date set yet, we plan to release it this fall. If you look at the roadmap you'll see that most things have already been implemented and merged into mainline, so we’re definitely on track 🙂

Anyway, let’s begin this little series with the selector rewrite Tim merged a couple of weeks ago. Selectors are one of monotone’s main concepts for picking revisions by something other than their 40-character hash id, and are therefore very useful for “navigating” between different development lines.

Monotone up until 0.48 already knows many selectors – you can select revisions by tag, by branch, by author, by custom cert values and so on. Selectors can be combined to calculate the intersection of two sets, as in “show me all revisions from author ‘Jon’ on branch ‘my.project’”, which essentially looks like this:

$ mtn automate select "a:jon/b:my.project"

The syntax for these selectors is nice and simple – each selector is prefixed with a unique character and multiple selectors are concatenated with a single slash. While these old-style selectors solved many use cases, some remained unresolved, and users coming from other DVCSes like Darcs had a rather hard time figuring out how to accomplish a particular selection in monotone.

A particularly good example is “how can I easily view the changes on a development branch since the last merge point?”. Up until now you either had to figure out the revision of the merge point manually by looking at the output of log, or use some scary construct like the following:

$ mtn au common_ancestors $(mtn au select h:main.branch) \
    $(mtn au select h:) | mtn au erase_ancestors -@-

Enter selector functions

Luckily, from 0.99 onwards you don’t have to write these things anymore. Please give the new selector functions a warm round of applause!

$ mtn au select "lca(h:main.branch;h:feature.branch)"

In this example “lca” stands for the “least common ancestors” function, which takes two other selectors as its arguments. Inside a workspace the syntax gets even shorter, because an empty head selector h: defaults to the branch recorded in the workspace options; so if you’re in the feature.branch workspace, just type:

$ mtn au select "lca(h:main.branch;h:)"

Quite convenient, eh? This is not only shorter, but also up to five times faster than the complex command line above. Of course the selector can be used directly in a call to diff or log, like so:

$ mtn diff -r "lca(h:main.branch;h:)"
$ mtn log --to "children(lca(h:main.branch;h:))"

But huh, what’s that nested children call, you ask? Well, the lca function picks the merge point in the _main branch_, and if the revision graph goes around that point, log would otherwise happily log more parents (i.e. earlier revisions) on the feature branch. The call to children ensures that we pick the merge revision on the feature branch and therefore really stop logging at that revision.

Test drive

There are many more of these selector functions; explaining them all in detail is out of scope here, so please have a look at “composite selectors” in the nightly-built manual.
And if you want an early look and to play around without having to compile anything yourself – at least if you’re on openSUSE or Fedora – just download the binaries from our nightly builds.

MySQL partitioning benchmark

I had a little research task at work today where I needed to evaluate which MySQL storage engine and technique would be fastest for retrieving lots of (as in millions of) log records. I stumbled upon this post which explains the new horizontal partitioning features of MySQL 5.1, and what I read there made me curious to test it myself, also because the original author forgot to include a test with (non-)partitioned but indexed tables.

This was my test setup: Linux 2.6.34, MySQL Community Server 5.1.46, Intel Pentium D CPU at 3.2 GHz, 2 GB RAM.

Test MyISAM tables

The table definitions are copied and adapted from the aforementioned article:

CREATE TABLE myi_no_part (
      c1 int default NULL,
      c2 varchar(30) default NULL,
      c3 date default NULL
) engine=MyISAM;

CREATE TABLE myi_no_part_index (
      c1 int default NULL,
      c2 varchar(30) default NULL,
      c3 date default NULL,
      index(c3)
) engine=MyISAM;

CREATE TABLE myi_part (
  c1 int default NULL,
  c2 varchar(30) default NULL,
  c3 date default NULL
) PARTITION BY RANGE (year(c3))
(PARTITION p0 VALUES LESS THAN (1995),
 PARTITION p1 VALUES LESS THAN (1996),
 PARTITION p2 VALUES LESS THAN (1997),
 PARTITION p3 VALUES LESS THAN (1998),
 PARTITION p4 VALUES LESS THAN (1999),
 PARTITION p5 VALUES LESS THAN (2000),
 PARTITION p6 VALUES LESS THAN (2001),
 PARTITION p7 VALUES LESS THAN (2002),
 PARTITION p8 VALUES LESS THAN (2003),
 PARTITION p9 VALUES LESS THAN (2004),
 PARTITION p10 VALUES LESS THAN (2010),
 PARTITION p11 VALUES LESS THAN MAXVALUE) 
 engine=MyISAM;

CREATE TABLE myi_part_index (
  c1 int default NULL,
  c2 varchar(30) default NULL,
  c3 date default NULL,
  index(c3)
) PARTITION BY RANGE (year(c3))
(PARTITION p0 VALUES LESS THAN (1995),
 PARTITION p1 VALUES LESS THAN (1996),
 PARTITION p2 VALUES LESS THAN (1997),
 PARTITION p3 VALUES LESS THAN (1998),
 PARTITION p4 VALUES LESS THAN (1999),
 PARTITION p5 VALUES LESS THAN (2000),
 PARTITION p6 VALUES LESS THAN (2001),
 PARTITION p7 VALUES LESS THAN (2002),
 PARTITION p8 VALUES LESS THAN (2003),
 PARTITION p9 VALUES LESS THAN (2004),
 PARTITION p10 VALUES LESS THAN (2010),
 PARTITION p11 VALUES LESS THAN MAXVALUE) 
 engine=MyISAM;

Test Archive tables

Since MySQL’s Archive engine only supports a single index, which is primarily used for the primary id, I left out the indexed versions here:

CREATE TABLE ar_no_part (
      c1 int default NULL,
      c2 varchar(30) default NULL,
      c3 date default NULL
) engine=Archive;

CREATE TABLE ar_part (
  c1 int default NULL,
  c2 varchar(30) default NULL,
  c3 date default NULL
) PARTITION BY RANGE (year(c3))
(PARTITION p0 VALUES LESS THAN (1995),
 PARTITION p1 VALUES LESS THAN (1996),
 PARTITION p2 VALUES LESS THAN (1997),
 PARTITION p3 VALUES LESS THAN (1998),
 PARTITION p4 VALUES LESS THAN (1999),
 PARTITION p5 VALUES LESS THAN (2000),
 PARTITION p6 VALUES LESS THAN (2001),
 PARTITION p7 VALUES LESS THAN (2002),
 PARTITION p8 VALUES LESS THAN (2003),
 PARTITION p9 VALUES LESS THAN (2004),
 PARTITION p10 VALUES LESS THAN (2010),
 PARTITION p11 VALUES LESS THAN MAXVALUE) 
 engine=Archive;

Test data

I re-used the procedure from the original article to create about 8 million test records, spread randomly over the complete partitioned range, and subsequently copied the generated data to the other tables:

delimiter //

CREATE PROCEDURE load_part_tab()
begin
  declare v int default 0;
  while v < 8000000 do
    insert into myi_no_part
    values (v, 'testing partitions',
            adddate('1995-01-01', (rand(v) * 36520) mod 3652));
    set v = v + 1;
  end while;
end
//

delimiter ;

call load_part_tab;

insert into myi_no_part_index select * from myi_no_part;

...

Test query and the results

I used the same query to retrieve data from all of the tables:

select count(*) from TABLE_NAME 
where c3 > date '1995-01-01' and c3 < date '1995-12-31';

and these were the results (mean values of several executions):

table                 exec time
--------------------  ---------
`myi_no_part`         ~ 6.4s
`myi_no_part_index`   ~ 1.2s
`myi_part`            ~ 0.7s
`myi_part_index`      ~ 1.3s
`ar_no_part`          ~ 10.2s
`ar_part`             ~ 1.1s

These results were actually pretty surprising to me, for various reasons:

  • I would not have thought that sensible partitioning could beat an index on the column in question while at the same time saving the disk space the index would otherwise need (roughly 1/3 of the total data size in this test case).
  • The values for `myi_no_part` were actually better than expected – I would have thought they should be much worse, especially compared with the numbers from the author of the original article.
  • The Archive engine adds nothing to the mix but disadvantages. Maybe my test case is flawed because I "only" tested with 8 million rows, but one can clearly see that a partitioned MyISAM table beats a partitioned Archive table by more than 40%, so using the Archive engine gives you no advantages, only drawbacks, such as not being able to delete records or add additional indexes.
  • Apparently partitioning and indexing the column in question is slightly slower rather than faster; however, if one only queries a subset of a partition (like restricting to where c3 > date '1995-06-01' and c3 < date '1995-08-31'), the index wins again – ~0.3s with index vs. ~0.7s without. (The EXPLAIN PARTITIONS sketch below shows how to check which partitions a query actually touches.)
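Not part of the benchmark above, but handy for sanity-checking these numbers: MySQL 5.1 can show which partitions a query will actually read via EXPLAIN PARTITIONS. A minimal sketch against the partitioned MyISAM table from above:

-- the "partitions" column of the output should list only the partition(s)
-- covering 1995 (p1, possibly also p0) instead of all twelve
EXPLAIN PARTITIONS
SELECT count(*) FROM myi_part
WHERE c3 > date '1995-01-01' AND c3 < date '1995-12-31';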

Conclusion

MySQL's partitioning is a great new feature in 5.1 and should be used as a complement to careful and sensible indexing.

Software now patentable in Germany

If you haven’t heard the news already: the Federal Court of Justice in Germany recently declared software patentable without any reasonable limitation (German coverage on the news service heise.de here and here).

While there are many efforts in the United States to fix the brokenness of their patent system – also with respect to software patents, which have done more harm in the last decades than anything else – we here in Europe, and especially in Germany, are just making the same mistakes again.

This is a very bad day for the freeware, shareware and also the open source scene – look out for patent trolls near you in the future…

Interoperability without openness?

I was pointed yesterday to a page from the Free Software Foundation Europe (FSFE for short) which describes the changes between the original and the current draft of the “European Interoperability Framework” (EIF) – to quote the conclusion:

[…] we can only conclude that the European Commission is giving strong preference to the viewpoint of a single lobby group. Regarding interoperability and open standards, key places of the consultation document were modified to comply with the demands of the BSA. Input given by other groups was not considered on this issue. Beyond ignoring this input, the Commission has apparently decided to ignore the success of the first version of the EIF, and to abandon its efforts towards actually achieving interoperability in eGovernment services.

EIFv2: Tracking the loss of interoperability – enjoy the read…

openSUSE madness

Just in case you wonder why a simple `sudo zypper install <package>` sometimes pulls in dozens of unneeded, but possibly related packages: it’s not a bug, it’s a feature!

While Debian by default merely suggests these additional packages during installation, openSUSE installs them all by default. Try it with `git` and you’ll get this whole lot: `cvsps git git-core git-cvs git-email git-gui gitk git-svn git-web libpurple-tcl subversion-perl tcl tk xchat-tcl`.

There are two ways to get rid of this nasty behaviour:

1. Temporarily, by adding the `--no-recommends` option to your call (see the sketch below)

2. Permanently, by editing `/etc/zypp/zypper.conf` and setting `installRecommends = no` in the `[solver]` section.
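In practice the temporary variant boils down to this (using `git` as the example package again):

# skip the recommended packages just for this call
$ sudo zypper install --no-recommends git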

Hey, at least they provide an option to disable it, though it’s completely beyond me why anybody would want this enabled by default. Maybe they get a cookie for every additional package download…?

openSUSE build service client ported

I used to create packages for a couple of open source projects for the openSUSE Linux distribution. They have this really nice build service running on build.opensuse.org, on which you can – despite its name – also build packages for other Linux distributions like Fedora, Gentoo or Debian.

While the web-based interface of the service is nice, some configurations and local builds require the command line client osc, which is Python-based and works similarly to subversion. This client, however, was only packaged for the main distros the build service itself supports and was unavailable elsewhere, e.g. on Mac OS X, so I created a MacPort for it today (installable via `sudo port install osc`).
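If you have never used osc before, a typical remote session looks roughly like this – project and package names are made up, it’s just a sketch of the subversion-like workflow:

# check out a package from the build service and work on it locally
$ osc checkout home:someuser somepackage
$ cd home:someuser/somepackage

# review local changes and send them back to the service
$ osc status
$ osc commit -m "update to 1.2.3"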

Of course local Linux builds are not possible with it, as the complete build environment is missing on Mac OS X, but I think it’s still useful for maintaining and managing remote builds on the service itself. Have fun with it!

Why the lucky stiff

I actually don’t know where this guy got his name from, nor how exactly I stumbled upon him or his excellent book on Ruby, but one thing is for sure: he’s one of those adorable people with more than one or two talents. You can tell as much from the first few pages of the aforementioned book, which – even if you’re not interested in Ruby – is worth a read alone for all its ingenious anecdotes, weird examples and funny cartoons. So if you have half an hour or two, take a look and think about it.

And yet again a pattern repeats which I have already encountered in other areas of my life: by the time I get to know a particular band – or, in this case, software evangelist – they have already retired from their work for unknown reasons. All the best to you, you lucky stiff. I’m sure you’ll do an excellent job in the “offline world” as well, wherever you are…

Read encrypted emails via webmail?

I was recently asked how to read encrypted emails securely via webmail in an untrusted environment. Imagine you’re sitting at someone else’s computer and absolutely need to check your inbox for this one encrypted email containing a password without which you can’t continue. Or you’re in some internet cafe and have just received an important encrypted email – how would you read it?

Actually, the only thing that comes to my mind here is a combination of Portable Firefox and FireGPG on a (possibly encrypted) USB stick. This, of course, poses a couple of problems:

  1. If you don’t know which OS your “target” computer runs, you need this “tandem” in at least three different binary versions: Mac OS X, Linux and Windows. While this doesn’t sound too hard (three partitions on the same drive), it will probably be harder to encrypt all three and still end up with something “plug-and-mail-ready” for whatever the target OS is.
  2. If you use a non-standard webmailer (i.e. not a public service but your own setup, like I have with RoundCube Webmail), you won’t get really good FireGPG integration (i.e. no interface buttons, auto-decryption and the like) unless the webmail software itself plans to support FireGPG. (RoundCube has targeted it for “later”.)
  3. And maybe the greatest show-stopper is the question: is it really secure in untrusted environments at all? After all, GnuPG needs to load your private key into RAM to decrypt your message, and if it resides there unprotected (does it?), it could at any time be read out by some hidden daemon and boom, your private key would be compromised…

How would you solve this dilemma? A VPN to a trusted PC from which you send and receive emails?

If there are no other good solutions then I guess people will have to choose between accessibility from everywhere and email security. And I bet they don’t choose security…

SSL Verification with Qt and a custom CA certificate

So the other day I wanted to make the application updater for guitone SSL-aware. The server setup was an easy job: add the new domain (guitone.thomaskeller.biz) to cacert.org, create a new certificate request with the new SubjectAltName (and all the other, already existing alternative names – a procedure where this script comes in handy), upload it to CAcert, sign it there, download and install the new cert on my server, set up an SSL vhost for the domain – done!

Now, on Qt’s side of things using SSL is rather easy as well; the only thing you have to do is give the setHost method an additional parameter:

QHttp * con = new QHttp();
con->setHost("some.host.com", QHttp::ConnectionModeHttps);
con->get("/index.html");
// connect to QHttp's done() signal and read the response

This should actually work for all legit SSL setups if Qt (or, to be more precise, the underlying openssl setup) knows about the root certificate with which your server certificate has been signed. Unfortunately, CAcert’s root certificate is not installed in most cases, so you basically have two options:

  1. Connect QHttp’s sslErrors(...) signal to the QHttp::ignoreSslErrors() slot. This, of course, pretty much defeats the whole purpose of an SSL connection, because the user is not warned about any SSL error, so legitimate errors (expired or malicious certificates) are silently ignored as well. (*)
  2. Make the root certificate of CAcert known to the local setup, so the verification process can proceed properly.

I decided to do the latter. This is how the code should look now:

QHttp * con = new QHttp();
QFile certFile("path/to/root.crt");
// note: don't open the file inside Q_ASSERT - the call would be
// compiled out in release builds
bool opened = certFile.open(QIODevice::ReadOnly);
Q_ASSERT(opened);
QSslCertificate cert(&certFile, QSsl::Pem);
// this replaces the internal QTcpSocket QHttp uses; unfortunately
// we cannot reuse that one because Qt does not provide an accessor
// for it
QSslSocket * sslSocket = new QSslSocket(this);
sslSocket->addCaCertificate(cert);
con->setSocket(sslSocket);
con->setHost("some.host.com", QHttp::ConnectionModeHttps);
con->get("/index.html");
// connect to QHttp's done() signal and read the response

Particularly interesting to note here is that the QIODevice (in my case the QFile instance) has to be opened explicitly before it is handed to QSslCertificate. I did not do this at first; Qt gave me neither a warning nor an error, but simply refused to verify my server certificate because it never loaded the root certificate properly.

(*) One could, of course, check the exact SSL error via QSslError::error() – in our case this would be e.g. QSslError::UnableToGetLocalIssuerCertificate – but this is rather hacky and could certainly be abused by a man in the middle as well.
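For completeness, such a hand-rolled check could look roughly like the following sketch – the surrounding class and the `con` member are made up, and as said, I would not recommend going down this road:

// hypothetical slot, connected to QHttp's sslErrors(const QList<QSslError> &) signal
void Updater::onSslErrors(const QList<QSslError> & errors)
{
    foreach (const QSslError & err, errors) {
        // only tolerate the one error the unknown CAcert root produces;
        // anything else (expired certificate, host mismatch, ...) stays fatal
        if (err.error() != QSslError::UnableToGetLocalIssuerCertificate)
            return;
    }
    con->ignoreSslErrors();
}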

Better than grep

If you’re a programmer and use the command line heavily, you’ve certainly come across a big nuisance in GNU’s grep utility: the verbosity needed to exclude files from being searched, such as backup files (*~ or *#) or any kind of VCS inventory stupidity like .svn or CVS directories. (Did I mention earlier that this is one of the many reasons why I hate svn and cvs with a passion – for the fact that they clutter my workspace with this crap?)

Anyway, with grep you usually end up with something like this:

$ grep -R myterm | grep -v '\.svn' | grep -v '~:'

or, to speed up your searches a bit and not even let grep crawl things you don’t want to see anyway:

$ grep -R myterm `find . -type f | grep -v '\.svn' | grep -v '~$'`

Does anybody else think this syntax is just hilarious and total overkill? So I looked at grep’s manpage for some kind of .greprc which I could place in my home directory to define all the things I want to exclude, but apparently no such file is recognized by grep.

Finally I did a Google search for “grep ignore svn directories” and found ack – a Perl tool which resolves this and other problems with grep. My personal feature highlights:

  • You can use real Perl regular expressions (no fiddling with grep -E and friends)
  • -A, -B and -C options work just like I know them from grep
  • Output is highlighted with terminal colors and very much cleaned up in comparison with grep
  • There are predefined sets of file extensions to search, e.g. ack --php foo will search all PHP files (php, php3, php4 and so on) for foo, and such sets can also be excluded, e.g. with --noperl; see the examples below
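To give you an idea, here are a couple of everyday invocations (search term and paths are of course made up):

# recursive search from the current directory, backup and VCS junk skipped automatically
$ ack myterm

# search only PHP files, case-insensitively, with two lines of context
$ ack --php -i -C 2 myterm

# limit the search to a subdirectory
$ ack myterm lib/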

So I’m more than satisfied with this – if only I had found it earlier! It saves my day!

Oh, I think I forgot to mention one absolute killer feature of ack – to quote the author:

Command name is 25% fewer characters to type! Save days of free-time! Heck, it’s 50% shorter compared to grep -r.

Go get it while it’s hot!