The Content Ripper

by Mike Shea on 30 December 2002

Already started on the content ripper.

Here's an original Roger Ebert review and here's the ripped version of that same review. Another example: an article about a new J.R.R. Tolkien book and the ripped version of that same article.

Why is this a good thing? Four big reasons:

It can be easily read on portable devices like a Palm using AvantGo. It can be easily read by accessibility devices for the disabled. It is far faster to download for all readers on all network connections. It gets rid of the non-essential parts of the webpage.
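To make the "gets rid of non-essential parts" idea concrete, here is a rough Python sketch. It is not the actual ripper code, just one way the stripping could work. It assumes the article text lives in <p> elements and that everything outside them (navigation, scripts, styles, ads) can be thrown away, which will not hold for every site. The names ParagraphRipper and rip are made up for this example.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphRipper(HTMLParser):
    """Collects the text of <p> elements and ignores everything else."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = None       # pieces of the <p> currently open, or None
        self._skip_depth = 0   # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends where the next begins
            self._buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None and not self._skip_depth:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            # Collapse runs of whitespace left over from the page layout.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._buf = None

    def close(self):
        super().close()
        self._flush()


def rip(html_text):
    """Return a bare-bones page holding only the paragraph text."""
    parser = ParagraphRipper()
    parser.feed(html_text)
    parser.close()
    body = "\n".join("<p>%s</p>" % escape(p) for p in parser.paragraphs)
    return "<html><body>\n%s\n</body></html>" % body
```

Calling rip() on a downloaded page returns a small page of plain paragraphs. Inline markup such as bold text and links is flattened to text, which is a deliberate simplification in this sketch.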

I've got a couple of problems with it so far. It is hard to pass URLs around when you've got lots of embedded CGI characters moving from system to system. I can't get the output to be fully XHTML Strict compliant without somehow passing it through a web version of HTML Tidy, yet another process on another machine. It won't work well with multi-page articles unless I can rig up some kind of smart URL redirect that keeps everything together. I can't imagine the copyright problems this would bring up if it were ever heavily used. There are probably 50 problems I haven't seen yet as well. For a 10 minute start, it works for Ebert's stuff and that's good enough for me.
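On the URL problem, one common approach (a sketch, not necessarily what the ripper does) is to percent-encode the whole target URL before embedding it as a single query parameter. The ripper address http://example.com/rip.cgi and the parameter name url below are hypothetical, as are the two function names.

```python
from urllib.parse import parse_qs, quote, urlsplit

RIPPER = "http://example.com/rip.cgi"  # hypothetical ripper address


def ripper_link(target_url):
    # safe="" also encodes "/", ":", "?", "&" and "=", so the target's own
    # query string cannot be mistaken for the ripper's parameters.
    return RIPPER + "?url=" + quote(target_url, safe="")


def target_from_link(link):
    # parse_qs percent-decodes the values it returns.
    return parse_qs(urlsplit(link).query)["url"][0]
```

With this, a target such as http://example.com/cgi-bin/article.cgi?id=42&page=1 travels as one opaque value and comes back unchanged from target_from_link().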

Apps like this prove one thing: if the people in control of the content cared about all their readers, they'd come up with an HTML Basic version of all their content using a standard. Mayhaps it's RDF, but probably not.