Problems sorting search results?

I recently wrote about how you can sort search results on the Hansard prototype site.

We’re still getting reports of the functions not working as people expect them to. If you have a problem sorting search results, please let us know – we’d be particularly interested in what browser and operating system you’re using.

3rd November Deployment

We’ve just deployed the most recent build of the Hansard application code. We’ve also reparsed the source files, added some additional content, improved the parsing of some earlier pages and added some previously missing volumes.

In addition to the usual collection of minor alterations and typo fixes, publicly visible improvements continue, including:

* an extended and improved method of generating identifiers for text sections

* correcting source file volumes when parsed out of sequence

* improving the speed of the parsing and text reindexing processes

* handling additional time formats in divisions, constituency formats, member suffixes

* linking to other Acts/Bills with the same name in the index from Act/Bill pages: for example – http://hansard.millbanksystems.com/acts/access-to-health-records-act-1990 links to http://hansard.millbanksystems.com/acts/a#Access%20to%20Health%20Records%20Act

* improved identification of Bills

* improved identification of Bill and Act mentions

* more detail in volume tables: for example – http://hansard.millbanksystems.com/volumes/5C/

* sections with an empty title are now given the title “Summary of day”

* added “search in this period” form to timeline pages

* revised front page style to columns, moved some front page material to this site

* made skip to links contingent on having sections to skip to

* added “On This Day”

* extended division parsing to handle tables that use ‘th’ tags rather than ‘td’ tags for headers

* handling offices that start with a non-letter

* Windows-friendly fonts, hopefully

* improved print style sheet

* adjusted Lords parliament.uk URL generation for the post-2006 version of their URLs

* permitted oral questions to be parsed in Westminster Hall sittings

* changed Volume table layout, column number style, Volume by column pages

* fixed sections for Acts and Bills

* lots and lots of caching

* Hansard reference now at top of section

* explanation of faceted search

* Improving text alignment in nav by columns columns

* Adding width to nav by columns columns to make easier to use

* link to google issue reporting form

* Adding info on sittings loaded versus sittings found to series index

* Adding last parser run time to volumes index view

* Handling case where numbers in square brackets occur outside the text of a question

* Enabling the display of lists of missing volumes in the volumes by series index

* Specifying “text/html” content type template in opensearch definition

* Fixing error thrown by non-specific Hansard reference “H C Deb 1958-59 December 1958”
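To illustrate the division-parsing change in the list above (header cells marked up with ‘th’ rather than ‘td’), here is a minimal sketch in Python. It is not the site’s own parser, which is part of a Rails application; the class name, function name and the sample table markup are all assumptions made up for the example.

```python
from html.parser import HTMLParser


class HeaderRowParser(HTMLParser):
    """Collect the cell text of the first row of a table.

    Cells are accepted whether they are marked up as <th> or <td>.
    (Hypothetical helper, not taken from the Hansard application.)
    """

    def __init__(self):
        super().__init__()
        self.headers = []             # text of each cell in the first row
        self._in_cell = False         # True while inside a first-row cell
        self._first_row_done = False  # set once the first </tr> is seen
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td") and not self._first_row_done:
            self._in_cell = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._in_cell:
            self.headers.append("".join(self._buffer).strip())
            self._in_cell = False
        elif tag == "tr":
            self._first_row_done = True


def header_cells(table_html):
    """Return the header texts of a table, for either header markup."""
    parser = HeaderRowParser()
    parser.feed(table_html)
    parser.close()
    return parser.headers
```

The point of the sketch is only that the header row is read by position (the first row), so the choice of tag no longer matters.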

Work also continues on a new method of matching names of Members speaking. We’re not yet rolling this out, but hope to soon. We’re particularly glad to be able to thank Leigh Rayment for his work on memberships of the House of Commons, which has been a valuable resource.

Workshop on Finding and Re-using Public Information

I and another member of the development team recently attended a ‘Workshop on Finding and Re-using Public Information’, held at the London Knowledge Lab: “This informal, hands-on workshop will bring government information experts together with those who are interested in finding and re-using government information.”

The event was compered very ably by Rufus Pollock, with Jonathan Gray updating a wiki live on the projector, both of the Open Knowledge Foundation.

Sorting search results

The Hansard prototype site can be searched from ‘outside’ – with a search engine that spiders the site – or from ‘inside’, with the search engine developed as part of the site’s functions.

You can search the site for the word ‘bingo’ using Google by restricting the results Google returns: add “site:hansard.millbanksystems.com” to your query – http://www.google.co.uk/search?q=bingo+site:hansard.millbanksystems.com.

A similar feature is available using Yahoo’s search engine: http://search.yahoo.com/search?p=bingo&vs=hansard.millbanksystems.com and Microsoft’s Live Search: http://search.live.com/results.aspx?q=bingo+site:hansard.millbanksystems.com. Other search engines are available.
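If you want to generate links like these for other search terms, something along the following lines would do it. This is a rough Python sketch: the function name and the ENGINES table are made up for the example, and the endpoints and parameter names are simply copied from the Google and Live Search URLs given here.

```python
from urllib.parse import urlencode

# Endpoints and query parameter names copied from the example URLs;
# the dictionary itself is a hypothetical convenience.
ENGINES = {
    "google": ("http://www.google.co.uk/search", "q"),
    "live": ("http://search.live.com/results.aspx", "q"),
}


def site_search_url(engine, query, site="hansard.millbanksystems.com"):
    """Build a search URL whose results are restricted to one site."""
    base, param = ENGINES[engine]
    # The "site:" operator asks the engine for results from one host only.
    restricted = f"{query} site:{site}"
    # safe=":" leaves the colon unescaped; spaces are encoded as "+".
    return base + "?" + urlencode({param: restricted}, safe=":")


print(site_search_url("google", "bingo"))
```

Yahoo’s URL passes the site in a separate parameter rather than using a “site:” operator, so it is left out of the sketch.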

There’s also a Google Custom Search for the site, which allows you to refine results by using parts of the URL to identify Acts, Bills and so on.

A search for ‘bingo’ using the internal search engine can be seen and bookmarked at http://hansard.millbanksystems.com/search/bingo. Once a list of search results is returned, you can sort the results. The ‘Sort by MOST RECENT’ button shows the results sorted by their latest date first. The ‘Sort by EARLIEST’ button shows the results sorted by their earliest date first. The ‘Sort by MOST RELEVANT’ button sorts the results in the default order decided by the search engine – that which it considers the most relevant to the query.
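The three orderings named on the buttons can be sketched in a few lines of Python. This is an illustration only, not the site’s code: the result records, their field names and the relevance scores are invented, and on the real site the relevance order comes from the search engine itself.

```python
from datetime import date

# Invented sample results; titles, dates and scores are placeholders.
results = [
    {"title": "Example debate A", "date": date(1955, 3, 2), "score": 0.4},
    {"title": "Example debate B", "date": date(1988, 11, 15), "score": 0.9},
    {"title": "Example debate C", "date": date(1921, 7, 30), "score": 0.6},
]


def sort_results(results, order="relevance"):
    """Return a new list of results in the requested order."""
    if order == "most_recent":
        # Latest date first.
        return sorted(results, key=lambda r: r["date"], reverse=True)
    if order == "earliest":
        # Earliest date first.
        return sorted(results, key=lambda r: r["date"])
    if order == "relevance":
        # Highest score first.
        return sorted(results, key=lambda r: r["score"], reverse=True)
    raise ValueError(f"unknown sort order: {order}")


for order in ("most_recent", "earliest", "relevance"):
    print(order, [r["title"] for r in sort_results(results, order)])
```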

FreeLegalWeb BarCamp

I and another member of the development team will be talking about the Hansard prototype site and related matters at the Royal Society of Arts on Saturday 18th October, some time between 10.10am and 5.00pm.

It’ll be in the Adelphi Room, Royal Society of Arts, 8 John Adam Street, London WC2N 6EZ.

I’m very grateful to Nick Holmes of the Free Legal Web for arranging the barcamp – and to John Sheridan of OPSI for arranging the venue.

So – what is a barcamp anyway? It’s an “open, participatory workshop-event, whose content is provided by participants” (Wikipedia). If you get the opportunity to attend and contribute to one – do!

How ‘official’ is the Hansard prototype site?

The site is generated from information from Hansard, the Official Report of Parliament. It is neither a complete nor an official record. Material from the site should not be used as a reference to or cited as Hansard. The material on this site cannot be held to be authoritative. Speakers’ names, the dates of first and last speeches, offices held and other data are identified from available contributions, which may not be complete. Information presented is generated from the available XML files: there are likely to be errors, omissions or repetition.

The site is an experiment. Elements of the site are likely to change without notice. Functionality provided by the site may be altered, become unavailable, change behaviour, or be removed entirely. URLs may change without notice. The site may be unavailable for periods without notice. The site may be withdrawn entirely without notice. No guarantees of uptime, speed or service are offered.

The site has been sponsored by Parliament in order to test and demonstrate user interfaces for historic data, certain functionality and for other exploratory work. Information used to make the site has been provided by the Hansard Digitisation Project, directed by the Directorate of Information Services of the House of Commons and the Library of the House of Lords.

The time and resources used to generate the site have been and continue to be paid for by the House of Commons and the House of Lords. The site is not part of the official Parliament site, nor is it intended to become part of the official site in its current form. The site is supported only on a best efforts basis. The site is not supported by the Department of Information Services and comments should not be directed to them or to the Web Centre.

Material on the site remains under Crown and Parliamentary Copyright. Within these copyright constraints, you are encouraged to use and to explore the information provided.

Where can I find Hansard after 2005 or before 1803?

Contemporary Hansard can be found on the UK Parliament site, including content going back to 1988 in the House of Commons and 1995 in the House of Lords. Hansard is available on the site about three hours after debates.

The 18th Century Official Parliamentary Publications Portal includes material from before 1803, continuing up to 1834. Access to the site is only available to the higher and further education academic communities within the UK and other selected institutions.

The UK Parliament lists Digitised Historical Parliamentary Material – including Public, Local and Private Acts; Statutory Instruments; Proceedings and Journals; Judgements of the House of Lords; historic records of the devolved Parliament and Assembly of Northern Ireland, the Parliaments of Scotland and the Welsh Assembly; Acts of the Oireachtas and debates of the Dáil and Seanad Éireann.

Institute of Historical Research

I’ll be demoing some features of the Hansard prototype site at the Institute of Historical Research on Tuesday 7th October, 5.15pm.

It’ll be in the ‘Low Countries’ Room, University of London, Senate House, Malet Street, London WC1E 7HU.

I’m very grateful to Paul Seaward of the History of Parliament Trust – together with Colin Brooks of Sussex University and Valerie Cromwell – for the invitation to speak.

Search

We talked recently to various people interested in providing search over parliamentary data in other Parliaments. One thing that I hope came across when we described search on the site is that search doesn’t have to be complicated or expensive to implement, and certainly shouldn’t be complicated for users.

By keeping our pages simple and self-descriptive and by making sure that each one reflects a logical piece of content (e.g. one debate rather than a physical page from the original Hansard volume), we try to make relevant information from Hansard easy to find from search engines outside the site.

The search on the site itself is implemented through Solr, a web service wrapper around the Lucene Java search engine library, a long-established Open Source project. Solr supports faceted search, so we can show people using the site how the speeches relevant to their query break down over time, by speaker and by the type of debate they appeared in, and let them use these facets to home in on specific results.
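For anyone unfamiliar with facets, the idea can be shown with a small Python sketch. Solr computes these counts itself on the server; the code below only imitates the result, and the sample speeches, field names and helper functions are all invented for the illustration.

```python
from collections import Counter

# Invented hits for some query; every value here is a placeholder.
speeches = [
    {"speaker": "Member A", "decade": "1950s", "type": "Commons"},
    {"speaker": "Member B", "decade": "1950s", "type": "Lords"},
    {"speaker": "Member A", "decade": "1960s", "type": "Commons"},
    {"speaker": "Member C", "decade": "1960s", "type": "Written Answers"},
]


def facet_counts(hits, field):
    """Count how many hits fall under each value of one field."""
    return Counter(hit[field] for hit in hits)


def narrow(hits, field, value):
    """Keep only the hits matching a chosen facet value."""
    return [hit for hit in hits if hit[field] == value]


# Show the breakdown, then home in on one decade and break down again.
print(facet_counts(speeches, "speaker"))
sixties = narrow(speeches, "decade", "1960s")
print(facet_counts(sixties, "type"))
```

Each facet is just a count per value over the current hits, and choosing a value filters the hits before the counts are taken again.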

We integrate our Rails application with Solr using the acts_as_solr plugin. We’ve made a few changes to the plugin, mostly to speed up the process of indexing content, but basically we’re using Solr out of the box. We have 13 million speeches indexed at the moment, and queries on the site usually return in under 10 seconds.

Our initial focus was on making sure that our internal search was providing something useful. Now we’ll also be working on making it speedy!

BarCampLondon5 Spillover

I’m Robert Brook, a member of the Prototyping team. I’ll be at BarCampLondon5 Spillover this Saturday 27th September – and will be glad to talk about the Hansard prototype – that is, if anyone actually wants to hear about it.

The event will be held at the BCS offices in central London, just off the Strand. You can register for free using Eventwax.

Actually – is there anything else I should talk about?

(I edited this post to refer to myself as a person. It was all starting to get a bit weird talking about Robert as though he was someone else.)