Batch-remove empty lines at the end of many Confluence pages

In a customer project we decided to write a larger chunk of documentation collaboratively in Atlassian’s Confluence and export it to Word with Scroll Office, a third-party Confluence plugin.

That worked fine so far, but we soon figured out that we had been kind of sloppy with empty lines at the end of each page, which were obviously carried over into the final document. So instead of going over each and every page and removing the empty lines there, I thought it might be easier to do this directly on the database, in our case MySQL.

The query was quickly developed, but then I realized that MySQL has no PREG_REPLACE function built in, so I first needed to install a UDF (user-defined function). Luckily, this UDF worked out of the box, and so the query could be finalized:

UPDATE BODYCONTENT
JOIN CONTENT ON CONTENT.CONTENTID=BODYCONTENT.CONTENTID
AND CONTENTTYPE LIKE "PAGE" AND PREVVER IS NULL
SET BODY=PREG_REPLACE("/(<p>&nbsp;<.p>)+$/", "", BODY)
WHERE BODY LIKE "%<p>&nbsp;</p>";

This query updates all current pages (no old versions) from all spaces that end with at least one empty line – an empty paragraph tag is Confluence’s internal markup for that – and removes all of these empty lines from all matched pages.

This was tested with MySQL 5.5.35, lib_mysqludf_preg 1.2-rc2 and Confluence 5.4.2.

I don’t need to mention that it is – of course – highly recommended to back up your database before you execute this query on your server, right?

jazzlib – an alternative for reading ZIP files in Java

Java has had zip-reading capabilities for a long time, naturally, because `jar` files are simply compressed zip files with some metadata. The needed classes reside in the `java.util.zip` package and are `ZipInputStream` and `ZipEntry`.

Recently, however, `ZipInputStream` gave me a huge headache. My use case was as simple as

* read the zip entries of a list of zip files (each varying in size, but usually around 20MB)
* skip to the zip entry that has a certain name (a single text file with only two bytes of contents)
* read the contents of this zip entry and close the zip
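The lookup above can be sketched roughly like this. The code is written against `java.util.zip`; since jazzlib duplicates that API, switching to it only means swapping the imports to `net.sf.jazzlib.*`. The class name and the entry name used later are just illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Reads the contents of a single named entry from a zip stream.
// Written against java.util.zip; jazzlib's net.sf.jazzlib package
// offers the same classes, so only the imports need to change.
public class ZipEntryReader {

    public static String readEntry(InputStream in, String entryName)
            throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        try {
            ZipEntry entry;
            // walk the entry list until we hit the one we want
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[4096];
                    int n;
                    // read() is relative to the current entry and
                    // returns -1 at the entry's end
                    while ((n = zip.read(buf)) > 0) {
                        out.write(buf, 0, n);
                    }
                    return out.toString("UTF-8");
                }
            }
            return null; // entry not found
        } finally {
            zip.close();
        }
    }
}
```

The timings mentioned below were measured with exactly this kind of entry walk.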

Doing this for about 25 files took my Pentium D (2GHz) with 3GB of RAM roughly **20 seconds**. Wow, 20 seconds, really? I created a test case and profiled the code in question separately with [YourKit](http://www.yourkit.com) (which is a really great tool, by the way!).

It got stuck quite a bit in `java.util.zip.Inflater.inflateBytes` – but that seemed to use native code, so I couldn’t profile any further.

So I went on and searched for an alternative to `java.util.zip` – and luckily I found one in [jazzlib](http://jazzlib.sourceforge.net), which provides a pure-Java implementation of ZIP compression and decompression. This library is GPL-licensed (with a small exception clause to prevent the pervasiveness of the GPL) and comes in two versions: one that duplicates the individual library classes underneath `java.util.zip` (as a drop-in replacement for JDK versions where this is missing) and one that comes in its own namespace, `net.sf.jazzlib`.

After I went for the second version, I restarted my test and it only took about **7 seconds** this time. At first I thought that there must be some downside to this approach, so I checked the timings for a complete decompression of the archive, but the timings here were on par with the ones from `java.util.zip` (roughly 5 seconds for a single 20MB file).

I haven’t tested compression speed, because it doesn’t matter much for my use case, but the decompression speed alone is astonishing. I wonder why nobody else stumbled upon these performance problems before…

Debugging Java WebStart applications with Eclipse

The last couple of days I went through a little nightmare: I needed to debug a Java application which showed some weird behaviour only when loaded via WebStart, but not when executed within Eclipse.

[Multiple](http://www.jacoozi.com/index.php?option=com_content&task=view&id=119&Itemid=134) [internet](http://javaswamy.blogspot.com/2008/10/debugging-jws-with-eclipse.html) [resources](http://www.ibm.com/developerworks/java/library/os-eclipse-javadebug/index.html) told me to use the `JAVAWS_VM_ARGS` environment variable to tell the Java VM to either create a debug socket itself or connect to a socket opened by Eclipse, then simply set a breakpoint and wait for things to happen. But nothing worked out for me until I found two very important issues nobody had written about so far:

* Under Java 6, `JAVAWS_VM_ARGS` seems to be completely ignored by `javaws`, on Linux as well as on Windows. The way to go was to use separate `-J` options in the call to `javaws` – and given a running Eclipse debug server, you finally saw the threads of your running application.
* Wait, did these threads really belong to _your_ running application? This was the second issue. Apparently `javaws` forks away as soon as it has started and runs your application in a separate VM, which _of course_ does not get the original debug options passed. Again, some internet resources said it would be enough to load the jnlp file from the net and start it locally, but actually this did not work out. The magic option to pass to `javaws` here was `-Xnofork`, and finally, on the next run, I saw my application’s threads in Eclipse and my breakpoints magically worked as expected.

Here is the complete command line example again (given a running Eclipse debug server on port 8000):

javaws -Xnofork -J-Xdebug -J-Xnoagent \
-J-Xrunjdwp:transport=dt_socket,server=n,address=8000 \
path/to/app.jnlp

Hope that helps someone 🙂

#112123

Today I completed my ScrumMaster certification. After the two days with [Joseph Pelrine in early November](http://www.thomaskeller.biz/blog/2010/11/05/why-i-do-open-source-development/) they finally sent me a login for their website scrumalliance.com, offered a short web-based test (which you couldn’t really fail) and tada – now I can call myself [“Certified ScrumMaster”](/bewerbung/csm.pdf).

So everything seemed to look fine, but wait – why does this certificate have an expiry date tacked onto it? Uh, yes, erm, I’m supposed to renew it every year from 2012 on, for 50 bucks. It’s not like your driver’s license – once you’ve learned to drive your car, you cannot unlearn it, right? All the knowledge about Scrum will of course be totally forgotten and buried in my head once the magic year 2012 is reached… now that’s business!

But hey, I got something in return. I’m listed with the other 112122 fellows on their website as a CSM and am “allowed” to use the term “CSM”, the “logo” (which I had to extract from the certificate itself with Photoshop, because they do not offer a download) and the nice little Scrum community they built around the website.

Thank god Joseph was entertaining and educational enough during the two days in November that all the (company) money was worth the certification. The certificate itself and the “services” around it are certainly not worth it.

The fuzzy cloud

Until now, “the cloud” (the computing paradigm) was for me a more or less fuzzily defined hype whose advantages I failed to really “see”: I delegate infrastructure resources to some external “hoster” and pay for that, but I instantly get this nagging fear that I’ll lose full control over my resources and data when I do so.

While I am (unfortunately) not in the right business for experiencing and working with “the cloud” myself, I’m still very interested in learning about the possibilities and changes cloud computing introduces, especially from the development point of view.

Today I stumbled upon [Adrian Cockcroft](http://perfcap.blogspot.com/)’s outstanding [presentation](http://www.slideshare.net/adrianco/netflix-on-cloud-combined-slides-for-dev-and-ops) in which he outlines how Netflix, the company he works for, introduced cloud-based services step by step as a replacement for traditional data center services and monolithic architectures. After reading through the roughly hundred slides I’m much more confident that all this *is* the right direction for the future, also (but not solely) because it forces us developers to write much more domain-independent, easily serviceable, stripped-down code.

So if you have 30 minutes for a very good introduction to practical cloud computing, I urge you to [go ahead and read the slides](http://www.slideshare.net/adrianco/netflix-on-cloud-combined-slides-for-dev-and-ops).

MySQL partitioning benchmark

I had a little research task at work today: I needed to evaluate which MySQL storage engine and technique would be the fastest for retrieving lots of (as in millions of) log records. I stumbled upon this post, which explained the new horizontal partitioning features of MySQL 5.1, and what I read there made me curious to test it myself, also because the original author forgot to include a test with a (non-)partitioned, but indexed table.

This is my test setup: Linux 2.6.34, MySQL community server 5.1.46, Intel Pentium D CPU with 3.2GHz, 2GB RAM

Test MyISAM tables

The table definitions are copied and adapted from the aforementioned article:

CREATE TABLE myi_no_part (
      c1 int default NULL,
      c2 varchar(30) default NULL,
      c3 date default NULL
) engine=MyISAM;

CREATE TABLE myi_no_part_index (
      c1 int default NULL,
      c2 varchar(30) default NULL,
      c3 date default NULL,
      index(c3)
) engine=MyISAM;

CREATE TABLE myi_part (
  c1 int default NULL,
  c2 varchar(30) default NULL,
  c3 date default NULL
) PARTITION BY RANGE (year(c3))
(PARTITION p0 VALUES LESS THAN (1995),
 PARTITION p1 VALUES LESS THAN (1996),
 PARTITION p2 VALUES LESS THAN (1997),
 PARTITION p3 VALUES LESS THAN (1998),
 PARTITION p4 VALUES LESS THAN (1999),
 PARTITION p5 VALUES LESS THAN (2000),
 PARTITION p6 VALUES LESS THAN (2001),
 PARTITION p7 VALUES LESS THAN (2002),
 PARTITION p8 VALUES LESS THAN (2003),
 PARTITION p9 VALUES LESS THAN (2004),
 PARTITION p10 VALUES LESS THAN (2010),
 PARTITION p11 VALUES LESS THAN MAXVALUE) 
 engine=MyISAM;

CREATE TABLE myi_part_index (
  c1 int default NULL,
  c2 varchar(30) default NULL,
  c3 date default NULL,
  index(c3)
) PARTITION BY RANGE (year(c3))
(PARTITION p0 VALUES LESS THAN (1995),
 PARTITION p1 VALUES LESS THAN (1996),
 PARTITION p2 VALUES LESS THAN (1997),
 PARTITION p3 VALUES LESS THAN (1998),
 PARTITION p4 VALUES LESS THAN (1999),
 PARTITION p5 VALUES LESS THAN (2000),
 PARTITION p6 VALUES LESS THAN (2001),
 PARTITION p7 VALUES LESS THAN (2002),
 PARTITION p8 VALUES LESS THAN (2003),
 PARTITION p9 VALUES LESS THAN (2004),
 PARTITION p10 VALUES LESS THAN (2010),
 PARTITION p11 VALUES LESS THAN MAXVALUE) 
 engine=MyISAM;

Test Archive tables

Since MySQL’s Archive engine only supports a single index, which is primarily used for identifying the primary id, I left out the indexed versions here:

CREATE TABLE ar_no_part (
      c1 int default NULL,
      c2 varchar(30) default NULL,
      c3 date default NULL
) engine=Archive;

CREATE TABLE ar_part (
  c1 int default NULL,
  c2 varchar(30) default NULL,
  c3 date default NULL
) PARTITION BY RANGE (year(c3))
(PARTITION p0 VALUES LESS THAN (1995),
 PARTITION p1 VALUES LESS THAN (1996),
 PARTITION p2 VALUES LESS THAN (1997),
 PARTITION p3 VALUES LESS THAN (1998),
 PARTITION p4 VALUES LESS THAN (1999),
 PARTITION p5 VALUES LESS THAN (2000),
 PARTITION p6 VALUES LESS THAN (2001),
 PARTITION p7 VALUES LESS THAN (2002),
 PARTITION p8 VALUES LESS THAN (2003),
 PARTITION p9 VALUES LESS THAN (2004),
 PARTITION p10 VALUES LESS THAN (2010),
 PARTITION p11 VALUES LESS THAN MAXVALUE) 
 engine=Archive;

Test data

I re-used the procedure to create about 8 million test records, spread randomly over the complete partitioned range, and subsequently copied the generated data into the other tables:

delimiter //

CREATE PROCEDURE load_part_tab()
begin
  declare v int default 0;
  while v < 8000000 do
    insert into myi_no_part
    values (v, 'testing partitions',
            adddate('1995-01-01', (rand(v)*36520) mod 3652));
    set v = v + 1;
  end while;
end
//

delimiter ;

call load_part_tab();

insert into myi_no_part_index select * from myi_no_part;

...

Test query and the results

I used the same query to retrieve data from all of the tables:

select count(*) from TABLE_NAME 
where c3 > date '1995-01-01' and c3 < date '1995-12-31';

and these were the results (mean values of several executions):

| table               | exec time |
|---------------------|-----------|
| `myi_no_part`       | ~ 6.4s    |
| `myi_no_part_index` | ~ 1.2s    |
| `myi_part`          | ~ 0.7s    |
| `myi_part_index`    | ~ 1.3s    |
| `ar_no_part`        | ~ 10.2s   |
| `ar_part`           | ~ 1.1s    |

These results were actually pretty surprising to me, for various reasons:

* I would not have thought that intelligent partitioning would beat an index on the column in question while saving the disk space for the index at the same time (roughly 1/3 of the total data size in this test case).
* The values for `myi_no_part` were actually better than expected – I would have thought they’d be much worse, also compared with the values from the author of the original article.
* The Archive engine actually adds nothing to the mix but disadvantages. Maybe my test case is flawed because I “only” tested with 8 million rows, but one can clearly see that a partitioned MyISAM table beats a partitioned Archive table by more than 40%, so the Archive engine gives you no advantages, only drawbacks, such as not being able to delete records or add additional indexes.
* Apparently partitioning *and* indexing the column in question is slightly slower rather than faster; however, if one queries a subset of a partition (like restricting to `where c3 > date '1995-06-01' and c3 < date '1995-08-31'`), it is faster – ~0.3s with the index vs. ~0.7s without.

Conclusion

MySQL's partitioning is a great new feature in 5.1 and should be used to complement subtle and wise indexing.

Symfony development

Last week the second incarnation of Symfony Live came to an end, and I just found the time to check out a couple of shared slides from the event.

Definitely interesting stuff going on there, especially the preview release of Symfony 2.0, whose code has been available on GitHub for a couple of weeks and which makes major changes to the “good old way” one used to write symfony applications (actions are now “controllers”, extensions are “bundles”, and, well, a dozen other things changed as well, of course… you can read everything in detail here).

Also, Doctrine 2.0 seems to be the first PHP ORM which decouples the modelling approach from the actual database abstraction layer, skips the need for base classes and enables model definition via annotations. They also seem to be fighting the overly complex magic of Doctrine 1.x (one of my top complaints about Doctrine in comparison to, e.g., Propel) – maybe I’ll revisit Doctrine when the next version becomes stable.

The guys at Sensio Labs really have a fast development pace, and I increasingly get the impression that the Symfony ecosystem is the major competitor of the Zend Framework. Community-wise, I think Symfony is already much bigger than any other PHP framework.

Never trust doctrine:data-dump…

…and especially not if you get the impression that the dump will afterwards be readable by symfony’s `doctrine:data-load` command.

It was a costly lesson today when I tried to re-import a dump of a couple of Sympal tables. One of them, the one which models the menu items, has a nested set behaviour, and apparently this one cannot be restored properly by Doctrine:

[Doctrine_Record_UnknownPropertyException]                                    
  Unknown record property / related component "children" 
  on "sfSympalMenuItem"

Apparently this particular issue has popped up a couple of times in the past for other people as well (Google for it), and while the help text of `doctrine:data-dump` still (as of Doctrine 1.2) blatantly states

The doctrine:data-dump task dumps database data:

./symfony doctrine:data-dump

The task dumps the database data in data/fixtures/%target%.

The dump file is in the YML format and can be reimported
by using the doctrine:data-load task.

./symfony doctrine:data-load

(with emphasis on “can be reimported”)

the author of Doctrine, Jonathan Wage, told me today on Sympal’s IRC (shortened):

<jonwage> we don’t want people to think you can dump and then restore
<jonwage> that is not what the data fixtures are for
<jonwage> b/c dumping and then loading will never work
<jonwage> an ORM modifies data on the way in and on the way out
<me> I mean the least thing doctrine could do there is that if it detects the nested set behaviour it should error out clearly on dump
<jonwage> so you can’t dump the data through an ORM and then try and reload it
<jonwage> i.e. hashed passwords
<me> if dumping is “never” going to work – why do you support dumping into yaml at all?!
<jonwage> if we do that then we would have to throw errors in sooooooo many other cases too
<jonwage> because it is at least a little bit of a convenience
<me> its like a half-baked feature then
<jonwage> we dump the raw data
<jonwage> and you can tweak it
<jonwage> thats my point though, it will ALWAYS be a half baked feature thats why we document it that way
<jonwage> it can NEVER work 100% the way you want it to
<jonwage> so if we fix that one thing, a million other things will be reported that we cannot fix
<jonwage> bc an ORM is not a backup and restore tool
<jonwage> it is impossible

Now I know that as well. My only problem was that I wondered “what is wrong with my fixtures” the whole time and never dared to ask “what is wrong with doctrine”…

Tip: Logging with Symfony >= 1.2

Imagine you have a business method in your model which needs to be accessible from two environments: from a symfony task and from the web. So far so good; now what if this business method should be able to log its output somewhere visibly – in the case of the command line task to the console and to a file, and in the case of the web application to the default logging mechanisms used there?

Getting the logger in web context is easy, all you have to do is

$logger = sfContext::getInstance()->getLogger();

but it’s a little harder to do for the command line task.

By default, no symfony context is created for a command line task, and even if one is created, the above call returns an instance of `sfNoLogger`. Logging in command line applications happens through the `sfTask::logSection()` method, which basically throws an event at the dispatcher created in `SYMFONY_DIR/lib/command/cli.php`. There you can also see that an instance of `sfCommandLogger` is created, but there is no way to get your fingers on this instance, because it is purely local.

So what can we do? Parameterizing the business method with the `sfTask` instance and using `logSection()` is obviously no solution, because this would break in the web context, where no such `sfTask` instance exists…

My solution was a bit more straightforward – I simply decided not to use the task-supplied logging at all, but created my own logger like this:

$dispatcher = new sfEventDispatcher();
$logger = new sfAggregateLogger($dispatcher);
$logger->addLogger(new sfCommandLogger($dispatcher));
// optionally add another file logger
if ($logToFile)
{
    $logger->addLogger(
        new sfFileLogger($dispatcher, ...)
    );
}

Hope this helps somebody.

Doctrine Horror

My latest Symfony project uses Doctrine as ORM, which is considered to be a lot better than Propel by many people…

Well, not by me. Doctrine seems to have a couple of very good concepts, amongst them built-in validators, a powerful query language, and last but not least, an easy schema language. (Though to be fair, Propel will gain most of these useful things in the future as well or already has, f.e. with its `PropelQuery` feature.)

But Doctrine also fails in many areas: the massive use of overloads everywhere makes it very hard to debug, and even worse, it tries to outsmart you (the developer) in many places, which makes it even harder to debug the things Doctrine doesn’t get right.

A simple example – consider this schema:

Foo:
  columns:
     id: { type: integer(5), primary: true, autoincrement: true }
     name: { type: string }

Bar:
  columns:
     id: { type: integer(5), primary: true, autoincrement: true }
     name: { type: string }

FooBarBaz:
  columns:
     foo_id: { type: integer(5), primary: true }
     bar_id: { type: integer(5), primary: true }
     name: { type: string }

(I’ll skip the relation setup here; Doctrine should find all relations with an additional `detect_relations: true`)

So what do you expect to see when you call this?

$obj = new FooBarBaz();
print_r($obj->toArray());

Well, I expected to get an empty object with a `NULL`ed `foo_id` and `bar_id` – but I didn’t! For me, `foo_id` was filled with a 1. Wait, where did this come from?

After I had dug deep enough into `Doctrine_Record`, I saw that this value is automatically assigned in the constructor, coming from a statically incremented `$_index` variable. I could revert this by writing my own constructor and calling `assignIdentifier()` like this:

class FooBarBaz extends BaseFooBarBaz 
{
   public function __construct()
   {
      parent::__construct();
      $this->assignIdentifier(false);
   }
}

but now this object could no longer be added to a `Doctrine_Collection` (which is a bummer, because if you want to extend object lists with “default” empty objects, you will most likely stumble upon a `Doctrine_Collection`, the default data structure returned for every SQL query).

So you might ask “Why the hell does all this impose a problem for you?”

Well, if you work with the `FooForm` the doctrine plugin created for you in Symfony and want to add `FooBarBazForm` via `sfForm::embedFormForEach` a couple of times (similar to the use case described here), you suddenly have the problem that the embedded form for the appended new `FooBarBaz` object “magically” gets the `foo_id` of a wrong (maybe non-existing) `Foo` object, and you wonder where the heck this comes from…

I have learned my lesson over the last one and a half days. I promise I’ll never *ever* create a table with a multi-column primary key in Doctrine again, and I’m returning to Propel for my next project.