<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Digital Ambulation &#187; python</title>
	<atom:link href="http://www.alanbriolat.co.uk/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.alanbriolat.co.uk</link>
	<description>Life, programming and general geekery</description>
	<lastBuildDate>Mon, 09 Jan 2012 12:07:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Quick and dirty screen scraping of UK MP expenses data</title>
		<link>http://www.alanbriolat.co.uk/2009/05/quick-and-dirty-screen-scraping-of-uk-mp-expenses-data/</link>
		<comments>http://www.alanbriolat.co.uk/2009/05/quick-and-dirty-screen-scraping-of-uk-mp-expenses-data/#comments</comments>
		<pubDate>Wed, 13 May 2009 21:41:56 +0000</pubDate>
		<dc:creator>Alan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[politics]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.alanbriolat.co.uk/?p=340</guid>
		<description><![CDATA[A dominating story in the news recently has been that of UK Members of Parliament &#8220;abusing&#8221; the expenses system. As part of this, expense claims data has been released to the public, but unsurprisingly not in a simple CSV format that just anybody can play with. All I could find were PDF files and news [...]]]></description>
			<content:encoded><![CDATA[<p>A dominating story in the news recently has been that of UK Members of Parliament &#8220;abusing&#8221; the expenses system.  As part of this, expense claims data has been released to the public, but unsurprisingly not in a simple CSV format that just anybody can play with.  All I could find were PDF files and news sites representing the data in their own way.</p>
<p>This led me to wonder how easy it would be to &#8220;scrape&#8221; the data from one of these sites.  Having heard about <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, a Python HTML &#8220;tag soup&#8221; parser, I opened up an interactive Python session and started to play with the data from <a href="http://news.bbc.co.uk/1/hi/uk_politics/8044207.stm">this BBC News page</a>.  Luckily for me, the BBC page&#8217;s HTML isn&#8217;t <em>too</em> ugly, so figuring out how to get the data rows wasn&#8217;t that hard.</p>
<p>The end result is the following Python script which scrapes the data from the BBC News page and saves it in both CSV and JSON formats.</p>
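<p>For readers of the feed, here is a minimal sketch of the same idea. It is not the original script. It uses the standard library's <code>html.parser</code> instead of BeautifulSoup so that it runs without extra dependencies. It also assumes the data sits in ordinary table rows whose first row holds the column names, which may not match the BBC page's actual markup. The names <code>TableRowParser</code>, <code>scrape_rows</code>, <code>to_csv</code> and <code>to_json</code> are made up for illustration.</p>

```python
import csv
import io
import json
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_rows(html):
    """Return all table rows found in an HTML string (assumed layout)."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Serialise the rows as CSV text, one line per row."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialise the rows as a JSON list of objects keyed by the header row."""
    if not rows:
        return "[]"
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body], indent=2)
```

<p>To use it, fetch the page's HTML by whatever means you prefer, pass the string to <code>scrape_rows</code>, and write the results of <code>to_csv</code> and <code>to_json</code> to files. A real page will probably need extra filtering to skip layout tables and non-data rows.</p>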
<p>Note: There is a file embedded within this post; please visit the post to download it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.alanbriolat.co.uk/2009/05/quick-and-dirty-screen-scraping-of-uk-mp-expenses-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

