Yesterday I attended the Data and News Sourcing workshop co-organised by the Media Standards Trust and the BBC College of Journalism. There were two sessions running in parallel, and Martin Belam will no doubt write about the crowdsourcing news and crime data sessions I did not attend.
The first session was titled Open Government data, data mining and the semantic web. This is an area I have a degree of familiarity with, but it was interesting to hear stories of wrestling with data on a day-to-day basis, as well as about the general lack of journalism being done with the data published to date.
Alex Wood gave an interesting account of a BBC World Service data journalism project looking at the global occurrence of road accidents. Working initially with World Health Organisation data, Alex made the point that the initial dataset helps you ask the right questions but does not necessarily give you the final answer. Scatter plots quickly show anomalies in the data and raise issues about how the data is collected and categorised. This is when you need to start talking to the people who understand and collect the data.
Chris Taggart spoke of similar challenges when dealing with local data. As the founder of openlylocal.com he has dedicated a number of years to collecting data about and from local councils and politicians. The messiness of the data, the varying formats and the lack of IDs to stitch datasets together mean Openly Local would not exist had it not been for passionate individuals dedicating time and resources to it. Chris's most recent collaboration, OpenCorporates, represents a similar labour of love, and co-founder Rob McKinnon spoke of the challenge of stitching together datasets when governments and councils have no common notion of a corporation running through their data. Nigel Shadbolt was quick to point out that common identifiers (URIs) to tie together disparate datasets are an important outcome of the data.gov.uk work and its embrace of an approach that works with the web.
Aside from the challenges of collecting the data and shaping it into something meaningful, Alex emphasised the importance of telling stories with the data. Kevin Marsh (College of Journalism) made the interesting counterpoint that a story is not always necessary. Newspapers have for years provided data alongside stories: weather information, TV listings and stock prices. Much of the value of a newspaper is shared between stories and pure information, and in a digital environment this is no different. In fact the collection of a dataset like Openly Local can facilitate services that provide useful information at a very local and targeted level. This, it was suggested, is the modern equivalent of the local newspaper's role as an information and data provider. Very local data has a potentially huge, and currently unrecognised, value to audiences.
It was clear from the discussion that working with data is an involved and time-intensive process. Perhaps for this reason we have not seen more stories or applications come out of the initial successes of opening data in the UK. Chris did question why organisations like the BBC were not biting his hand off to get access to the OpenCorporates dataset.
The second session was Expert sources in science and health. There was some interesting discussion regarding the use of expert sources in the media and the role organisations like the Science Media Centre play in ensuring expert scientists are available. Ben Goldacre raised the issue of transparency in journalism and how few stories link through to original sources such as research papers. He cited his long-running battle to get the BBC to link to sources from their science stories.
Mark Henderson of The Times spoke about the difficulties journalists face in linking to sources. Often, at the time a story is written, a research paper will not yet be published online or will be hard to find. Even if you do link to the source, publishers are notorious for changing the URLs of papers. Some of these issues have been addressed by the introduction of digital object identifiers (DOIs), but these are not used consistently across the publishing community.
The relative importance of communication and investigation to journalism was also questioned. The panellists emphasised the importance of communication for the mainstream press: relaying the developments of science and the playing out of the scientific process, in contrast to pointing to experts as a source of facts. Investigative journalism in science is particularly difficult because it requires deep technical expertise in a given area. Because of this, it was suggested, the professional blogger community is better placed to provide this analysis: bloggers often focus solely on their particular area of expertise, have the freedom to explore topics, and have the range of contacts necessary to draw upon.
It did occur to me that the challenges of investigative journalism in science are comparable to the challenges of doing journalism with the datasets currently being opened up. It will take a community of passionate experts to interrogate, analyse and uncover stories in very complex and specialised datasets. To be sustainable, and to encourage the best data journalism, that community will, like the blogging community, need the support of the mainstream media.
At the same time there is a role for technologies like Linked Data to reduce the cost of collecting and analysing data and so make data-sourced journalism easier to do. A common theme across the sessions was the need for clear and persistent URLs for documents, to aid linking to sources, as well as URLs for common things (like corporations), to enable the joining up of the data where the interesting stories lie. As Ben Goldacre said, the information architecture of journalism needs to be vastly improved.
In particular I found the following quote from Dan Conover compelling:
“[The] raw material of this information economy is essentially like oil shale: the latent value is obvious, but the cost of extracting these information resources from today’s existing deposits (think web archives) is so high given today’s technology that no one is going to spend a dime to start the project.”
Stijn comments further on this point:
“…Both approaches [emphasis on structured news formats, and rock solid metadata at the story level] wish to extract more value from journalism through structure and relationships. Both approaches have you trade a little hurt during content creation for yet-to-materialize advantages. That’s unavoidable — no such thing as a free lunch.”
Essentially, this is the annotation of news articles with controlled vocabularies. I see the potential impact of the semantic web slightly differently, though I do not disagree in principle that annotating journalistic output is a useful activity. I think perhaps too much emphasis is placed on the extraction of knowledge from editorial assets. I believe the oil shale of journalism is the by-product of the process itself.
The Guardian is a case in point. I get the impression that the Datablog started out with a hunch that it might be of interest to publish some of the spreadsheets Guardian journalists collected and curated in the process of writing stories. What has been particularly remarkable is that the success of the Datablog has probably been greater than that of the Open Platform. Why? Because it gave access to something that had not been available before.
These by-products are, to me, the true oil shale of journalism.
The part the semantic web and Linked Data play is, in my opinion, no more than reducing the cost, and increasing the ease, of using these datasets in useful ways. I have written before about how we have used semantic web technologies at the BBC to build websites. Combining BBC editorial assets with commercial data and open data sources enabled the BBC to do things it would never have dreamed of doing with internally managed datasets or bespoke taxonomies.
By using Linked Data techniques and simple tools like Google Refine (with the DERI RDF extension) it would be relatively simple to map the Datablog spreadsheets to common RDF vocabularies and identifiers. These datasets could then be used to add context and navigation, and to weave new narrative threads through the Guardian's editorial output, or anyone else's for that matter, in much the same way that Wildlife Finder has used open datasets like DBpedia to support the delivery of BBC wildlife programmes.
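To make that concrete, here is a minimal sketch of what such a mapping might look like in code, using the rdflib Python library rather than Google Refine. The tiny in-line dataset, its column names and the ex: namespace are invented for illustration; only the DBpedia resource URIs are real.

```python
# A minimal sketch (not the Guardian's or BBC's actual pipeline) of mapping a
# spreadsheet-style dataset to RDF, reusing DBpedia identifiers for the things
# the rows are about. The data and the ex: vocabulary are invented.
import csv
import io

from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

DBR = Namespace("http://dbpedia.org/resource/")    # shared DBpedia identifiers
EX = Namespace("http://example.org/road-deaths/")  # hypothetical dataset namespace

# Invented rows standing in for a Datablog-style spreadsheet.
SPREADSHEET = """country,road_deaths_per_100k,year
United_Kingdom,3.6,2009
Brazil,22.5,2009
"""

g = Graph()
g.bind("dbr", DBR)
g.bind("ex", EX)

for row in csv.DictReader(io.StringIO(SPREADSHEET)):
    # Reusing DBpedia's URI for the country means this data can later be
    # joined with anyone else's data about the same thing.
    country = DBR[row["country"]]
    obs = EX[f"observation/{row['country']}/{row['year']}"]

    g.add((obs, RDF.type, EX.RoadDeathObservation))
    g.add((obs, EX.country, country))
    g.add((obs, EX.deathsPer100k,
           Literal(row["road_deaths_per_100k"], datatype=XSD.decimal)))
    g.add((obs, EX.year, Literal(row["year"], datatype=XSD.gYear)))

print(g.serialize(format="turtle"))
```

The point is less the tooling than the shared identifiers: because the country is named by a DBpedia URI rather than a local string, the output can be joined with any other dataset that uses the same URI.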
The main issue for the semantic web and Linked Data is the cost incurred by the current lack of expertise, the barrier of learning new (and in places complicated and unintuitive) things, and the relative immaturity of the technologies. With time this will change, and the savings made from the ease of integrating disparate datasets, together with the value of mining the 'raw material of this information economy', will justify the costs.
How does the emergence of the semantic web and its associated technologies change the way we approach user experience design and more specifically information architecture?
In Tim Berners-Lee's original proposal for the web he gave us the basic ingredients to build the web of documents as we experience it today. This gave us an easy means to publish documents, refer to them with URLs and point from one document to another with a hyperlink.
In many ways the web became a victim of its own success. The simplicity with which we could publish documents meant we were soon overwhelmed. At this point information architects were employed to group documents into manageable piles.
The process would often take the form of a content audit: grouping an organisation's documents into similar types, giving these groups labels, and arranging them into small hierarchies. If you were lucky you might also do some user testing, for example a card sort to see whether a user of your site would expect to find a document in the place you had grouped it.
The problem with this approach is that if we start out focusing on documents, our sites turn out document-centric: for example, navigation built around categories like pictures, news and features, or opinion and archive.
If we step back and think about it, the user coming to our site does not have a mental image of a document, but rather of the player or team they are interested in. People are interested in things, not documents.
This leads us to move away from a document-oriented approach to web development towards a thing-focused one, and with this move comes the need for new tools and approaches to information architecture.
One approach at the BBC has been to use Domain Driven Design. DDD encourages you, before you have written a line of code or drawn a wireframe, to collectively understand the things in the problem space you are trying to solve and the relationships between them. This model becomes the ubiquitous language used by all members of the project. At this point we can also test our model against the mental model of the user, ensuring the user's mental model is built into the very core of the site.
If we look at an example: when the BBC wanted to open up its archive of wildlife clips, instead of starting by publishing pages for the clips it first published pages for the things of interest and the links between those things, a page for every species, linked to habitats and behaviours.
Each of these pages then links back to the species it relates to, so you soon start to build up a dense network of links between these things. The emphasis in this approach is to shift the focus from the content to the model. The assets are associated with the things in the model, but the model provides the context.
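To illustrate the shift in emphasis, here is a minimal, hypothetical sketch of such a thing-centred model in Python. The classes, species and clip path are invented and are not the BBC's actual schema; the point is that the content hangs off the things, and the links between things run in both directions.

```python
# A hypothetical sketch of a thing-centred domain model: species, habitats and
# behaviours are first-class things, content (clips) is attached to them, and
# links are recorded in both directions to build a dense network of pages.
from dataclasses import dataclass, field


@dataclass
class Habitat:
    name: str
    species: list["Species"] = field(default_factory=list)


@dataclass
class Behaviour:
    name: str
    species: list["Species"] = field(default_factory=list)


@dataclass
class Species:
    name: str
    habitats: list[Habitat] = field(default_factory=list)
    behaviours: list[Behaviour] = field(default_factory=list)
    clips: list[str] = field(default_factory=list)  # content hangs off the thing


def link(species: Species, habitat: Habitat, behaviour: Behaviour) -> None:
    # Record each relationship in both directions; this is what produces
    # the dense network of pages pointing at one another.
    species.habitats.append(habitat)
    habitat.species.append(species)
    species.behaviours.append(behaviour)
    behaviour.species.append(species)


savannah = Habitat("Savannah")
predation = Behaviour("Predation")
lion = Species("Lion", clips=["/clips/lion-hunt"])  # invented clip path
link(lion, savannah, predation)

# The habitat page can now list its species with no extra editorial effort.
print([s.name for s in savannah.species])  # ['Lion']
```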
Anyone who has been involved in building even a modest taxonomy for a site will understand the maintenance overhead this introduces. Given the rich relationships an ontology-like approach requires, it would not have been feasible for the BBC to build and manage this product by hand.
Instead the model was populated by sourcing data from the web and stitching it together with common web identifiers, in this case from DBpedia. Different sources of data can therefore provide the concepts, and the links between concepts, at no extra cost to the BBC. Where a concept is missing or wrong in Wikipedia, the team of editorial experts at the BBC edits or creates the concept in Wikipedia. This means that not only are we reusing what is already available on the web, but where it is wrong or missing we correct it at source so that others benefit.
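For a flavour of what sourcing concepts from the web can look like, here is a small illustrative query against the public DBpedia SPARQL endpoint using the SPARQLWrapper Python library. It is not the Wildlife Finder implementation, just a sketch of looking a concept up by its shared identifier, and it needs network access to run.

```python
# Illustrative only: fetch the English abstract for a concept (here, the lion)
# from DBpedia, using the concept's shared web identifier.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
        dbr:Lion dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:200])
```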
One outcome of focusing on publishing URLs for things, and creating a dense network of links between them, has been a great benefit in terms of Google ranking: in the case of Wildlife Finder, some species pages are placed above their equivalent Wikipedia page in UK Google searches.
This approach has not been restricted to wildlife but has been used across the BBC, including for the World Cup.
The BBC football site as it exists today consists of a limited number of editorially managed indexes, which means that editorial resource dictates the types of aggregation the site can offer. So we have no index for the England team or for Brazil, but rather a general and slightly meaningless index called Internationals. In addition, the BBC purchases sports statistics from an external provider, but at the moment the two are not brought together to tell a coherent story.
So, to address these challenges, the starting point was to think about the things of importance to the World Cup, as opposed to the documents.
The approach was to focus on the model and then associate content with the things in the model. As the model is device agnostic the views that provide the user experience on top of this can be tailored to be the best we have to offer for a given device.
The starting point of the modelling was to recognise the importance of the event to sport. If we can handle events we can represent the majority of sports. Building upon the existing events ontology we then set about specialising it in order that we could represent the complex structure of a sports competition.
For instance, the World Cup is a multi-stage event made up of a group stage and a knockout stage, each of which contains rounds, and those rounds contain matches.
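A minimal sketch of that structure in Python might look like the following. The classes stand in for the specialised events ontology rather than reproducing it, and the fixtures are illustrative only.

```python
# An illustrative multi-stage event model: an event contains stages, stages
# contain rounds, and rounds contain matches.
from dataclasses import dataclass, field


@dataclass
class Match:
    home: str
    away: str


@dataclass
class Round:
    name: str
    matches: list[Match] = field(default_factory=list)


@dataclass
class Stage:
    name: str  # e.g. "Group stage" or "Knockout stage"
    rounds: list[Round] = field(default_factory=list)


@dataclass
class MultiStageEvent:
    name: str
    stages: list[Stage] = field(default_factory=list)


# Illustrative fixtures only.
world_cup = MultiStageEvent(
    "FIFA World Cup 2010",
    stages=[
        Stage("Group stage", rounds=[
            Round("Group F", matches=[Match("Italy", "New Zealand")]),
            Round("Group C", matches=[Match("England", "USA")]),
        ]),
        Stage("Knockout stage", rounds=[
            Round("Final", matches=[Match("Team A", "Team B")]),
        ]),
    ],
)

# Walk the structure: stage -> round -> match.
for stage in world_cup.stages:
    for rnd in stage.rounds:
        for match in rnd.matches:
            print(f"{stage.name} / {rnd.name}: {match.home} v {match.away}")
```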
Once we had developed the model we then decided upon the views we would want to show the user for a variety of devices. For example, HTML web views would include, amongst other things, teams, players and groups.
Once we knew the views we wanted to create, we could be sure that if journalists annotated content with a small number of tag types the model could handle the rest. So we asked them to tag with player, team, competition and venue. By keeping the tagging simple we ensured it would be of high quality.
Here we have a team page for Italy: you can see how the approach starts to bring stories and data together in a more coherent way. Though aggregating assets with tags is not particularly novel, when we look at the group pages we can see how this approach is different.
You might remember that we did not ask journalists to tag with group, but we are still able to construct this view for users because the model knows which teams played in which group and which players played for which team. No additional editorial intervention was needed to generate these additional views, as the sketch below illustrates.
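Here is a toy sketch of that inference in Python. The model and the stories are invented for illustration (though Italy and New Zealand really were in Group F, and England in Group C): the journalist only ever supplies a team tag, and the group view falls out of the model.

```python
# Journalists tag stories with teams only; the domain model knows which team
# belongs to which group, so group pages can be assembled with no extra tagging.
MODEL = {
    # team -> group, known by the model rather than supplied by the tagger
    "Italy": "Group F",
    "New Zealand": "Group F",
    "England": "Group C",
}

STORIES = [
    {"headline": "Italy name squad", "tags": ["Italy"]},
    {"headline": "New Zealand arrive in South Africa", "tags": ["New Zealand"]},
    {"headline": "England injury update", "tags": ["England"]},
]


def stories_for_group(group: str) -> list[str]:
    """Aggregate stories for a group by following team tags through the model."""
    teams = {team for team, g in MODEL.items() if g == group}
    return [s["headline"] for s in STORIES if teams.intersection(s["tags"])]


print(stories_for_group("Group F"))
# ['Italy name squad', 'New Zealand arrive in South Africa']
```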
Focusing on the model allowed us to easily integrate a variety of data sources and pull them together to provide a coherent user experience. In addition, by tagging content with concepts from the model we increase the return we get on the cost of tagging: a tag backed by a web-scale identifier enables content to be contextualised in ways that were previously impossible.
In summary, we have looked at a number of ways in which semantic web thinking changes the way we work. Firstly, we start developing sites with processes that encourage us to focus on things and the relationships between them, as opposed to documents; secondly, it introduces a culture of building with open vocabularies, adding context and links that would never otherwise be possible; and finally, it enables us to maximise the value we get out of tagging content, as the World Cup group page example shows.
All this thinking is heavily influenced by the work of my colleagues at the BBC. Notably Michael Smethurst, Tom Scott, Chris Sizemore and Michael Atherton. For further reading regarding this approach and the work of the BBC there is no better starting point than Michael’s posts on the BBC internet blog and Tom Scott’s personal blog.
We express our identities through our collections. Online these collections take the form of Amazon wishlists, Last.fm playlists and lists of friends on Facebook. Perhaps less consciously, we also have search histories, purchase profiles and a trail of cookies picked up from website visits.
In David Siegel’s book Pull he posits a future where our personal details are consolidated in a private space in the web:
“Your personal data locker will store your personal ontology. It helps you find television shows and movies, it helps you learn about wines you might enjoy, it helps you find bargains online, plan a trip, find events you might want to attend, or spot a new restaurant, and it can help with dating life if you’re single. Hook it in to your everyday activities and you’ll build an ontology with millions of triples, all of which make your data locker into a ’smart’ virtual assistant that continues to learn as you go through the day.”
Few news organisations have attempted to bridge this gap between the news story and our personal profiles. The New York Times is perhaps the exception, taking a user's LinkedIn account, looking at their area of work and then serving contextual stories and ads related to it.
In some respects SEO (and the optimising of keywords in story titles) could be considered a crude attempt by news organisations at mapping stories to the profiles (keyword search patterns) of their intended audience. We have recently seen a move away from SEO effort in the news industry in favour of building more meaningful relationships with loyal customers.
I suspect that with time we will see effort focused on mapping the model of the news domain to the domain of the user (the personal data locker): relating the context of the story to the things of importance in our world, such as the topics, events, work, people and hobbies.
The initial impetus for writing this series of posts was the increasing presence of information architectures driven by metadata and the impact this has on editorial curation.
How does moving from a document focused view of the world to a thing focused view change the role of the collection?
We took Wildlife Finder as our example. Wildlife Finder is built upon a domain modelled approach and dynamically aggregates content and data around the ‘things’ in the model. Collections can then be used to build editorial layers on top. As Tom Scott points out:
Tom goes on to say that by releasing the data for Wildlife Finder it means that “our audiences and ‘users’ could also build stories”.
In Pivot’s own words:
In short, datasets are organized as collections. Results can be as granular or as big-picture as the user desires, and correlations and patterns are easy to see and examine through powerful but simple visualizations. Imagine browsing through thumbnails representing Kiva loans, then sorting the loans by the different types of businesses they helped establish.
In order for Pivot to work, datasets need to be in a certain format. I suspect that Linked Data will lend itself to these types of tools, and that products like Wildlife Finder, which have focused on curating context as opposed to curating content, will benefit greatly.