Interpreting a WordPress Export XML file

Tags:

In preparation to writing a tool that can import most of an existing WordPress site into WebGUI (see the Kickstarter to release WebGUI Version 8) the good news is that a WordPress export file can be loaded into Perl quite easily:

#!/usr/bin/perl use XML::Simple qw(:strict); my $xs = XML::Simple->new(); my $ref = $xs->XMLin($ARGV[0], ForceArray => 1, KeyAttr => []);

Pass that script the name of your export file, and you can peruse the results with

use Data::Dumper; print Dumper($ref);

Here's what you will find:

WordPress contents

That's the more-or-less organization of this file. As you can see it delightfully shows that some of WordPress's foundations still suffer original cracks.


Example code:

#!/usr/bin/perl

use XML::Simple qw(:strict);
binmode(STDOUT, ":utf8");   # enable Unicode output

my $xs = XML::Simple->new();
my $ref = $xs->XMLin($ARGV[0], ForceArray => 1, KeyAttr => []);

use Data::Dumper;

use HTML::TreeBuilder;
sub rectify_html {

    my $munged_text = shift;
    $munged_text =~ s/\n\n/\n<p>/g;

    my @pre_segments;
    my $seg_id=0;
    $munged_text =~ s{<pre\b(.*?)>(.*?)</pre\s*>}{$pre_segments[++$seg_id]=$2; "<pre data-seg=\"$seg_id\"$1></pre>";}gsex;

    my $atree = HTML::TreeBuilder->new();

    # Prepare to store comments. Requires wrapping in <html><body>
    $atree->store_comments(1);
    $atree->parse("<html><body>$munged_text</body></html>");

    # Replace original text contents for <pre> elements
    foreach my $pre_element ($atree->look_down('_tag', 'pre')) {
    $pre_element->push_content($pre_segments[$pre_element->attr('data-seg')]);
    $pre_element->attr('data-seg',undef);
    }

    # print Dumper($atree);
    print $atree->as_HTML(undef, ' ', {});
}


# Here we show just the contents of the posts and pages.

my $posts = $ref->{channel}[0]->{item};

foreach my $post (@{$posts}) {
    if ($post->{'wp:post_type'}[0] =~ /^post|page$/) {
    print "Entry with ID=$post->{'wp:post_id'}[0] of type $post->{'wp:post_type'}[0] contains:\n";
    print $post->{'content:encoded'}[0];
    print "\n-----------------\n";
    print rectify_html($post->{'content:encoded'}[0]);
    print "\n-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-\n";
    } else {
    print Dumper($post->{'wp:post_type'});
    
    }
}