Spelunking in M&G's Zapiro Archive
Those who follow me will know that I used to maintain a web frontend to the Mail & Guardian Online Zapiro archive.
M&G used to have a rather crufty website. Subscriber-only content was trivial to access (for non-subscribers), URLs were ugly, and dinosaurs roamed in the far corners of the site. It had RSS feeds, but not an RSS feed for the Zapiro archive (or any specific-interest RSS feeds for that matter). I don't check websites; I read RSS feeds.
Me being a young geek with a little too much spare time, I put together zapiro.rivera.za.net, as a ~200-line PHP script (with no SQL DB) that was really nice to use (in my books) and gave me a Zapiro RSS feed.
When they noticed, the powers that be at M&G weren't too impressed with it, because it deprived them of eyeballs (and hot-linked their Zapiro images). However, I felt satisfied that I was merely providing fair-use access to their content and allowing people to follow it who wouldn't have been able to otherwise. The site never got much traffic, so thus far it's not been a serious problem.
Around June this year, M&G redesigned their website, and I don't think I even noticed (did I say something about them not having decent feeds?). This redesign broke the machinery in zapiro.rivera.za.net but I didn't notice that because Zapiro had taken a sabbatical earlier this year, and was going weeks without posting cartoons.
Enough back-story. Point is, I took a look at the new M&G Zapiro Archive this evening and was shocked. Before I go into all my problems with it, let me just disclaim that they are rather nit-picky, but if these problems weren't there the site would be a hell of a lot more usable:
- There are still no useful RSS feeds. There is a rather terse selection of general feeds.
- The Archive menu only goes back to 2001. M&G has Zapiro cartoons going back to 1999.
- Archive menu URLs are in /Month/Year format. Did anyone even think about URL scheme when they were designing?
- Each cartoon has two URLs. Ok, I guess they weren't thinking about URL scheme.
- Today's cartoon has the /zapiro/all/ URL. Yesterday's is /zapiro/all/1, etc., going back to the beginning of time (currently residing at /zapiro/all/1870). Way to go with permalinks, guys. Oh, and did you notice that they are all titled "Latest Zapiro"?
- Clicking on the "Comments" link or using the "Archive" menu below takes you to something like /zapiro/fullcartoon/1. Oh, except 1 gives us a non-existent cartoon at the beginning of this Unix Epoch. But take a closer look: it has tags associated. Can anyone say WTF?
The insanity continues: 2 gives us a cartoon from September 1999. 3-25 are more non-existent wonders, and then things go backwards in time until 36 which jumps us to June 3 2008. (Hmm, I think that may have been around the M&G redesign launch date.)
We move forward in time until 40, when we start moving backwards from May 2008, through many seas of well-tagged gaps, to ... well somewhere. (OK, so I got bored and didn't manually crawl 2000 pages, but would you?) Some cartoons are in totally the wrong position, we randomly move backwards and forwards and sideways.
Finally things settle down, and we go forwards again (with gaps of course) from 2054 to today's cartoon at 2101, a fine Zapiro specimen if ever I saw one.
Why was I doing all this mind-numbing crawling, you ask? Well, I wanted to know if I could do anything to make my Zapiro scraper work again. The answer? Not simply. They don't have any sensible way to locate the cartoon from a specific day, short of crawling the entire archive and recording the URLs found. I don't think there is any logic to this LSD-induced URL scheme.
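For the curious, the "crawl everything and record what you find" approach could look roughly like the sketch below. It's Python rather than the PHP my scraper was written in, and it's a sketch under assumptions, not a working scraper: only the /zapiro/fullcartoon/N path comes from the site; `fetch_date` is a hypothetical callable you'd have to write yourself to download page N and pull the cartoon's date out of it (returning None for a gap), and `build_date_index` is a name I made up.

```python
from datetime import date
from typing import Callable, Dict, Optional

# Path pattern observed on the site; N is the opaque cartoon number.
FULLCARTOON_PATH = "/zapiro/fullcartoon/{n}"

# Placeholder pages appear dated at the start of the Unix Epoch.
EPOCH = date(1970, 1, 1)


def build_date_index(
    fetch_date: Callable[[int], Optional[date]],
    first_id: int,
    last_id: int,
) -> Dict[date, str]:
    """Walk every cartoon number and record which path holds which day.

    fetch_date is a hypothetical helper: given a cartoon number, it
    returns the publication date found on that page, or None for a gap.
    """
    index: Dict[date, str] = {}
    for n in range(first_id, last_id + 1):
        published = fetch_date(n)
        if published is None or published == EPOCH:
            continue  # gap or epoch-dated placeholder: nothing to record
        # Numbers are not in date order, so key on the date itself.
        # Keep the first path seen if a date turns up more than once.
        index.setdefault(published, FULLCARTOON_PATH.format(n=n))
    return index
```

Once built, `index[date(2008, 6, 3)]` hands back whatever path that day's cartoon was found at. The catch, of course, is that you have to fetch every single page before you can look anything up, which is exactly why I said "not simply".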
URL schemes matter. This seems to be something that the big guns haven't noticed. I don't think it's a coincidence that the most expensive CMSs out there have the worst URLs, whereas WordPress and Drupal (with pathauto) encourage sensible URLs and are Open Source.
Sure, most users don't change what they see in the address bar, but if people are going to link into your site, you should provide nice permalinks. Then, if you want anyone to build anything on top of your site (where anyone includes yourself), it would really help if you had a sane URL scheme. Finally, it gives you geek-cred. :-)
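To make "sane URL scheme" concrete, here is a minimal sketch of a date-based permalink, in Python. The /zapiro/YYYY/MM/DD/ layout and the function names are entirely my own invention for illustration; nothing like this exists on the M&G site. The point is that the URL and the date can be computed from each other, with no crawling.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical scheme: /zapiro/YYYY/MM/DD/ (year first, zero-padded).
PERMALINK_RE = re.compile(r"/zapiro/(\d{4})/(\d{2})/(\d{2})/")


def permalink(d: date) -> str:
    """Build the permalink path for the cartoon published on day d."""
    return f"/zapiro/{d.year:04d}/{d.month:02d}/{d.day:02d}/"


def parse_permalink(path: str) -> Optional[date]:
    """Recover the date from a permalink path, or None if it isn't one."""
    match = PERMALINK_RE.fullmatch(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # well-formed path but impossible date, e.g. month 13
```

With something like this, anyone who knows the date knows the URL, trimming the path back to /zapiro/2008/06/ has an obvious meaning (a month archive), and yesterday's link still points at yesterday's cartoon tomorrow.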
While I think of a better way to get my scraper working again, Happy Spelunking!