Web Scraper Pt. 02: Reading MSDN into Something Simpler

Reading things on msdn.microsoft.com has always been difficult because each page surrounds one line of content with all kinds of other junk. I wanted a simpler representation that I could print or read without clicking on too many links.

So I made this web scraper, which grabs content from an MSDN node (e.g., a webpage like http://msdn.microsoft.com/en-us/library/ms679320(VS.85).aspx), simplifies it, stores it locally, then tries to visit all of its sub pages to do the same. The result is an HTML file named after the title of the node passed to it, plus a directory containing the sub pages, each of which may have a directory containing sub pages of its own. There will also be one file named $title.long.html, which contains the text of all nodes visited.
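As a rough illustration of the output layout described above, here is a minimal Python sketch. It is not the Perl script from this post. The PAGES table, fetch(), save_node(), and scrape() are hypothetical names, and the in-memory table stands in for downloading and simplifying real MSDN pages.

```python
import os
import tempfile

# Hypothetical stand-in for the network:
# url -> (title, simplified body, list of sub page urls).
PAGES = {
    "root": ("Debugging", "<p>Root text</p>", ["a", "b"]),
    "a": ("Functions", "<p>Functions text</p>", ["a1"]),
    "a1": ("OpenProcess", "<p>OpenProcess text</p>", []),
    "b": ("Structures", "<p>Structures text</p>", []),
}


def fetch(url):
    """Return (title, body, sub_urls); a real scraper would download and simplify here."""
    return PAGES[url]


def save_node(url, out_dir, long_parts):
    """Write <title>.html into out_dir, then put sub pages under out_dir/<title>/."""
    title, body, sub_urls = fetch(url)
    page = "<h1>%s</h1>\n%s\n" % (title, body)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, title + ".html"), "w", encoding="utf-8") as fh:
        fh.write(page)
    long_parts.append(page)  # collected for the combined .long.html file
    for sub in sub_urls:
        save_node(sub, os.path.join(out_dir, title), long_parts)
    return title


def scrape(root_url, out_dir):
    """Save the whole tree, plus one <title>.long.html holding every visited node."""
    long_parts = []
    title = save_node(root_url, out_dir, long_parts)
    with open(os.path.join(out_dir, title + ".long.html"), "w", encoding="utf-8") as fh:
        fh.write("".join(long_parts))
    return title


if __name__ == "__main__":
    out = tempfile.mkdtemp()
    scrape("root", out)
    for dirpath, _dirs, files in os.walk(out):
        for name in sorted(files):
            print(os.path.relpath(os.path.join(dirpath, name), out))
```

A directory is only created for a node that actually has sub pages, since os.makedirs runs when the first sub page is saved.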

This web scraper makes use of WWW::Mechanize, HTML::TreeBuilder, HTML::TreeBuilder::XPath, and WWW::Mechanize::TreeBuilder, so I can use an XPath expression on anything I want. Despite this, it has still been a bit difficult to figure out in all cases which links in the content of a page are the sub pages, so for the sake of time here is the imperfect script. It works for only some MSDN nodes; for others you will get a tree of links that is a bit off, and for others the script will go into an infinite loop (if the max_depth variable is set high enough):
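The block below is not the script from this post. It is a minimal Python sketch, under stated assumptions, of two ideas from the paragraph above: picking sub page links with an XPath-style query, and keeping the crawl from looping forever. The sample pages, the div id "content", and the function names are all hypothetical. Python's xml.etree only supports a limited XPath subset and needs well-formed markup, so it stands in for HTML::TreeBuilder::XPath here. The visited set is an addition for illustration; the post itself only mentions max_depth.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample pages keyed by URL.
# "b" links back to "root", so following every link naively would never stop.
PAGES = {
    "root": "<html><div id='nav'><a href='home'>Home</a></div>"
            "<div id='content'><a href='a'>A</a><a href='b'>B</a></div></html>",
    "a": "<html><div id='content'><p>leaf</p></div></html>",
    "b": "<html><div id='content'><a href='root'>Up</a><a href='a'>A</a></div></html>",
}


def sub_links(markup):
    """Return hrefs of links inside the content div only, skipping navigation links."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//div[@id='content']//a")]


def crawl(url, max_depth, depth=0, visited=None):
    """Visit url and its sub pages depth-first; return URLs in visiting order."""
    if visited is None:
        visited = set()
    # Stop on pages that are too deep, already seen, or unknown.
    if depth > max_depth or url in visited or url not in PAGES:
        return []
    visited.add(url)
    order = [url]
    for link in sub_links(PAGES[url]):
        order.extend(crawl(link, max_depth, depth + 1, visited))
    return order


if __name__ == "__main__":
    print(crawl("root", max_depth=10))  # ['root', 'a', 'b']
    print(crawl("root", max_depth=0))   # ['root']
```

With the visited set in place, a large max_depth no longer causes a loop, because the link from "b" back to "root" is skipped.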

Posted at 10:56 pm on March 7, 2009 | 1 comment | Filed Under: Uncategorized | read on

Web Scraper Pt. 01: Youtube Subscription Videos

It used to be that there was no way to view the newest videos released by my YouTube subscriptions on my iPhone. Now they’ve made it a bit easier: you can go to the page and click a video to pop up the YouTube player program. But it’s still a bit complicated to navigate, and you’ll have to load a few extra pages before you get to the actual video.

So I created a simple web scraper using Perl’s WWW::Mechanize library, with parsing done by XML::LibXML’s XPath subroutines, that sends an email via Mail::Sendmail listing links to all the newest videos.

I have my computer run this every day so I can consistently get the newest vids on my phone. The user and pass listed in this script are for temporary YouTube accounts I made just to test that this thing works. Fill in these values with account and email info of your own.
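The scrape-then-email flow described above can be sketched roughly as follows. This is a Python illustration, not the Perl script from this post. The sample markup, the "video" class, the addresses, and the function names are all assumptions, and the message is only built, not sent.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage

# Hypothetical, well-formed stand-in for a subscriptions page.
SAMPLE = """<html><body>
<div class="video"><a href="/watch?v=abc123">First video</a></div>
<div class="video"><a href="/watch?v=def456">Second video</a></div>
<div class="ad"><a href="/promo">Promo</a></div>
</body></html>"""


def newest_videos(markup, base="http://www.youtube.com"):
    """Return (title, absolute url) pairs for links inside video blocks only."""
    root = ET.fromstring(markup)
    links = root.findall(".//div[@class='video']/a")
    return [(a.text, base + a.get("href")) for a in links]


def build_digest(videos, sender, recipient):
    """Build (but do not send) a plain-text email listing each video and its link."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "%d new subscription videos" % len(videos)
    msg.set_content("\n".join("%s\n%s\n" % (title, url) for title, url in videos))
    return msg


if __name__ == "__main__":
    digest = build_digest(newest_videos(SAMPLE), "me@example.com", "me@example.com")
    print(digest)
    # To actually send it, something like this could be used, assuming a local mail server:
    #   import smtplib
    #   with smtplib.SMTP("localhost") as smtp:
    #       smtp.send_message(digest)
```

Running a script like this once a day from a scheduler such as cron would match the daily digest described above.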

Posted at 3:28 am on February 18, 2009 | leave a comment | Filed Under: Uncategorized | read on
