For anyone interested (yes, this is mostly for you, @HolyHaddock), Westminster Hubble is written in a combination of PHP and JavaScript, with a MySQL database as its backend.
Most of the clever work happens in the background, set off by a number of cron jobs with various tasks such as keeping the MP list in sync with TheyWorkForYou, polling our blogs, generating statistics on the contents of the database, and the big one: trawling through all the MPs’ feeds themselves.
That last one is a mammoth job, and trying to keep up has been a constant battle against allowed cron intervals and PHP timeouts, since for now we can only afford shared hosting for the site rather than our own dedicated server. We keep a record of the last time an MP’s feeds were checked, and every five minutes we pick the 60 MPs whose feeds were checked longest ago and check them. 60 is a rough value arrived at through some pretty low-tech testing, and there’s still plenty of work to do to optimise this. With 650 MPs in total, checking 60 every five minutes means we cycle through everyone in about an hour, which isn’t too bad, though this will get much worse once we add in MEPs and members of the regional Parliament and Assemblies.
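To make the arithmetic concrete, here is a minimal JavaScript sketch of the "pick the 60 oldest" step and the cycle-time sum. It is an illustration only: the real job runs in PHP against MySQL, and the field names (`id`, `lastChecked`) and function names are assumptions, not the site's actual schema or code.

```javascript
// Hypothetical sketch: pick the MPs whose feeds were checked longest ago.
// Field names (id, lastChecked) are assumptions, not the real schema.
const BATCH_SIZE = 60;       // MPs checked per cron run (figure from the post)
const INTERVAL_MINUTES = 5;  // cron interval (figure from the post)

function pickOldest(mps, batchSize = BATCH_SIZE) {
  // Sort a copy by lastChecked ascending so the stalest entries come first.
  return [...mps]
    .sort((a, b) => a.lastChecked - b.lastChecked)
    .slice(0, batchSize);
}

function fullCycleMinutes(totalMps, batchSize = BATCH_SIZE, interval = INTERVAL_MINUTES) {
  // Number of cron runs needed to reach everyone once, times the interval.
  return Math.ceil(totalMps / batchSize) * interval;
}

// Example data: 650 MPs, each last checked at a distinct time.
const mps = Array.from({ length: 650 }, (_, i) => ({ id: i, lastChecked: i * 1000 }));
const batch = pickOldest(mps);
// After a run, each MP in the batch would get lastChecked set to "now",
// which sends it to the back of the queue for the next run.
console.log(batch.length);           // 60
console.log(fullCycleMinutes(650));  // 55
```

So 650 MPs at 60 per run is 11 runs, or 55 minutes for a full cycle. In the database itself the same selection would presumably be an `ORDER BY` on the last-checked column with a `LIMIT 60`.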
Items that get scraped are added to the cache table in the Westminster Hubble database, from where they’re served at user request without having to re-visit the original feeds. We use SimplePie to find and scrape RSS feeds, after my own attempt proved to be more trouble than it was worth. SimplePie manages its own cache as a flat file structure, and uses its own intelligence to try and detect when feeds are unchanged, lightening our server load when scraping feeds that don’t update very often.
There’s currently no expiry condition for items in our cache. Disk space is not an issue, but load times may prove to be at some point in the future. If and when they do, we will start removing the oldest items from the cache, possibly with some kind of type bias so that blog posts hang around longer than tweets.
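If we do get to that point, the type bias could be as simple as a per-type maximum age. The sketch below is purely hypothetical: nothing like it exists in the site yet, and the type names, the retention periods and the `published` field are all invented for illustration.

```javascript
// Hypothetical sketch of type-biased expiry; not part of the site today.
// Type names and retention periods are invented for illustration.
const DAY_MS = 24 * 60 * 60 * 1000;
const MAX_AGE_DAYS = { tweet: 30, blog: 365 };
const DEFAULT_MAX_AGE_DAYS = 90;  // fallback for any other item type

function isExpired(item, now) {
  const maxDays = MAX_AGE_DAYS[item.type] ?? DEFAULT_MAX_AGE_DAYS;
  return now - item.published > maxDays * DAY_MS;
}

function pruneCache(items, now = Date.now()) {
  // Keep only the items that are still within their type's retention period.
  return items.filter((item) => !isExpired(item, now));
}

const now = Date.UTC(2011, 0, 1);  // fixed "current time" so the example is repeatable
const cache = [
  { type: "tweet", published: now - 60 * DAY_MS },  // 60-day-old tweet: dropped
  { type: "blog", published: now - 60 * DAY_MS },   // 60-day-old blog post: kept
  { type: "tweet", published: now - 5 * DAY_MS },   // 5-day-old tweet: kept
];
const kept = pruneCache(cache, now);
console.log(kept.map((item) => item.type));  // blog, tweet
```

A fixed age limit per type is only one way of doing it. Another would be to cap the table at a total row count and weight each item's age by type before deciding which ones to drop.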
On the user experience side, there’s nothing very complicated going on. jQuery is used extensively for pulling in page contents, so that pages with feeds on them load quickly. Likewise, we use jQuery to filter feeds and to switch between Search, Map and List on the home page without reloading, and we use the Autocomplete jQuery plugin on our search box.
The Map view is powered by the Google Maps API, and we generate the data for the pins from TheyWorkForYou’s database of constituency locations.
All in all it’s not been a tremendously difficult project - there have been no major hurdles that have caused me to tear clumps of hair out or affected the stocks of coffee producers. Though that said, Westminster Hubble is still in beta, and there could be many more issues ahead…
Comments
Thanks for the insight.
One question is whether you cache changes made to MP profiles by the public before they're made live?
Also, any chance of dashes in the URLs of names? It would improve Googleability for you.
We do! We just have a simple moderation queue that stores up every change people make, and admins can click through the list and approve/reject.
There's no intelligence there at the moment, so if we start getting a lot of spam submissions we might have to figure out an easy way of weeding them out.
Thanks for the tip on dashes, I didn't realise Google cared! I'll get that implemented.
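For the curious, the moderation queue idea boils down to something like the sketch below. It is JavaScript for illustration only, and the class, method and field names are all assumptions rather than our real code: a change is stored instead of applied, and it only reaches the profile if an admin approves it.

```javascript
// Hypothetical sketch of a moderation queue; all names here are assumptions.
class ModerationQueue {
  constructor() {
    this.pending = [];  // changes awaiting an admin decision
    this.nextId = 1;
  }

  // Store a proposed change instead of applying it straight away.
  submit(mpId, field, newValue) {
    const change = { id: this.nextId++, mpId, field, newValue };
    this.pending.push(change);
    return change.id;
  }

  // Remove a change from the queue; apply it to the profile only if approved.
  review(changeId, approved, profiles) {
    const index = this.pending.findIndex((c) => c.id === changeId);
    if (index === -1) throw new Error(`No pending change with id ${changeId}`);
    const [change] = this.pending.splice(index, 1);
    if (approved) {
      profiles[change.mpId][change.field] = change.newValue;
    }
    return change;
  }
}

const profiles = { 42: { twitter: "" } };  // made-up profile data
const queue = new ModerationQueue();
const goodId = queue.submit(42, "twitter", "example_mp");
const spamId = queue.submit(42, "twitter", "buy-cheap-pills");
queue.review(goodId, true, profiles);   // approved: goes live
queue.review(spamId, false, profiles);  // rejected: discarded
console.log(profiles[42].twitter);      // example_mp
```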
Interesting. I have a project which has been partially rolled out (no, no public names!) where I want public additions/edits, but need to make sure there's a moderation queue. I'll take a further look into that!
Regarding Google, it certainly prefers dashes over underscores or anything else - it pretty much treats them as spaces.
Ah okay, if it even prefers dashes to underscores I still have some work to do! :D
As a first attempt I swapped out spaces for underscores, because names can themselves contain dashes and I wanted it to be unambiguous. We could certainly change to dashes, though, as long as we make the lookup a bit cleverer about what happens when the user types in e.g. wh.com/Johnny-Double-Barrelled.
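One way the "bit more clever" lookup might work is to compare names and URLs on a normalised key, so that spaces, underscores and dashes all count as the same separator. This is only a sketch under that assumption: the function names and the example names are made up, and it is JavaScript for illustration rather than the PHP the site would actually need.

```javascript
// Hypothetical sketch: compare names and URL slugs on a normalised key, so
// spaces, underscores and dashes are interchangeable when looking someone up.
function nameKey(text) {
  return text
    .toLowerCase()
    .replace(/[\s_-]+/g, " ")  // any run of separators becomes a single space
    .trim();
}

// Build the dashed URL form of a name.
function toSlug(name) {
  return name.trim().replace(/\s+/g, "-");
}

// Find the name matching a requested slug, or null if there is none.
function findBySlug(slug, names) {
  const wanted = nameKey(slug);
  return names.find((name) => nameKey(name) === wanted) ?? null;
}

const names = ["Johnny Double-Barrelled", "Jane Smith"];  // made-up names
console.log(toSlug("Johnny Double-Barrelled"));             // Johnny-Double-Barrelled
console.log(findBySlug("Johnny-Double-Barrelled", names));  // Johnny Double-Barrelled
console.log(findBySlug("Johnny_Double-Barrelled", names));  // Johnny Double-Barrelled
console.log(findBySlug("jane-smith", names));               // Jane Smith
```

The trade-off is that two people whose names differ only in where the dash falls would share a key, so a real version would need a tie-break for that case, such as the constituency.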