By Karen Loasby
Following their recent ISKO UK presentation, Karen Loasby discusses automatic classification with freelance information architect Helen Lippell and BBC information architect Silver Oliver.
KL: Firstly, when did you get involved in automatic classification and why?
HL: I suppose I'm a grizzled veteran. I started life as a manual indexer for the Financial Times, indexing news articles. I enjoyed the process, and the opportunity to work on an automation project came up. It was a chance to make the product better; I was interested in language and enjoyed reading newspapers.
SO: I was at the BBC, working on IA and taxonomy projects. The BBC was trying to set up a semi-automatic system and was getting the IAs to manage the system. It was an area I was really interested in and so was happy to get involved.
KL: What are the different types of system?
HL: Broadly, there are two types. The first is rules-based, which uses Boolean logic (i.e. AND/OR/NOT constructions) and may use other operators (e.g. ones that look at word order, word proximity and content structure) to look for patterns in the text and apply terms accordingly. The second is Bayesian, or statistical, systems, which are trained on a set of representative documents to identify useful or 'statistically rare' concepts.
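As an editorial illustration of the rules-based approach Helen describes, here is a minimal Python sketch. The terms, the rules and the `near` proximity operator are all hypothetical, invented for this example; commercial tools offer much richer rule languages.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def near(tokens, first, second, window):
    """True if `first` and `second` occur within `window` tokens of each other."""
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(abs(i - j) <= window
               for i in first_positions for j in second_positions)

# Hypothetical rules: each taxonomy term maps to a Boolean test on the tokens.
RULES = {
    # OR combined with AND and a proximity test
    "Interest rates": lambda t: ("interest" in t and "rates" in t)
                                or near(t, "base", "rate", 2),
    # AND / NOT
    "Football": lambda t: "football" in t and "american" not in t,
    # proximity OR single keyword
    "Mergers": lambda t: near(t, "takeover", "bid", 3) or "merger" in t,
}

def classify(text):
    """Return, in alphabetical order, every term whose rule fires on the text."""
    tokens = tokenize(text)
    return sorted(term for term, rule in RULES.items() if rule(tokens))

print(classify("The Bank raised its base rate as a takeover bid for the club stalled"))
print(classify("American football returns to Wembley"))
```

The first article picks up two terms through proximity rules; the second picks up none, because the NOT clause excludes it.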
SO: Choosing an approach isn't straightforward. Neither approach is necessarily easier or quicker. I don't want to say it depends, but it does. Combining multiple approaches could be the way forward.
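For contrast with rules, the statistical approach can be sketched as a tiny naive Bayes classifier written from scratch. This is an editorial illustration only: the class name, the four training sentences and the two labels are invented, and a real system would be trained on a far larger set of representative documents.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> training documents seen
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Start from the log prior, then add a log likelihood per token.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training set
model = NaiveBayes()
model.train("Shares fell as the bank cut interest rates", "Finance")
model.train("The central bank raised rates to curb inflation", "Finance")
model.train("The striker scored twice in the cup final", "Sport")
model.train("United won the league after a late goal", "Sport")

print(model.classify("The bank held interest rates"))
print(model.classify("A late goal won the cup"))
```

Nobody writes a rule here; the cost moves to choosing and maintaining a good training set.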
KL: How have you used these tools?
HL: At ft.com I used them to apply terms from different facets to news articles. More recently, in a central government project, I've led work to use automatic classification to re-tag all content on a substantial website against IPSV version 2, leveraging the speed and consistency of an automated approach.
SO: I've used them to power BBC Topic pages. In the past, it was more about the back-end. Now it is becoming about driving scalable user journeys. That's an area I'm really interested in.
KL: What roles are there for humans in these processes?
HL: I'm not describing job roles here, so much as tasks:
- There's a need to research the domain in order to train the system, particularly for rules-based systems. Rules have to be created, or training sets chosen for statistical systems
- There's quality checking the performance of the system; that has to be a human task, and lots of it is needed. It can be good to involve editorial staff here
- Designing, or being involved in the design of, the product that uses the automatic classification
- There's usually a role for someone to re-design the tool itself, to make it something that actually feels nice to use and administer. Many of them are pretty awful to use
- And someone has to sell the value of the system to the business.
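The quality-checking task in the list above is often summarised with precision and recall, comparing the terms the system applied with those a human indexer would have chosen. The sketch below is an editorial illustration; the document IDs and terms are invented.

```python
def precision_recall(human_tags, machine_tags):
    """Compare machine-applied terms with a human indexer's terms, per document.

    Both arguments map a document ID to a set of terms.
    """
    true_pos = false_pos = false_neg = 0
    for doc_id, expected in human_tags.items():
        applied = machine_tags.get(doc_id, set())
        true_pos += len(applied & expected)    # terms both agreed on
        false_pos += len(applied - expected)   # terms only the machine applied
        false_neg += len(expected - applied)   # terms the machine missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical sample checked by an editor
human = {"a1": {"Banking", "Interest rates"}, "a2": {"Football"}}
machine = {"a1": {"Banking"}, "a2": {"Football", "Banking"}}

precision, recall = precision_recall(human, machine)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision falls when the system applies terms it should not have; recall falls when it misses terms it should have applied.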
SO: New technologies always claim to remove people from the process but they are usually only shifting resources around; you never really manage to get rid of people. The 'semantic search' style products are based on very complicated ontologies. These need to be configured so you have to understand the complexity. They can be pretty 'black-box' so understanding what they are doing can be a time-consuming task.
HL: At the FT, some of the manual indexers moved on to work on the automatic system. Not everyone, but there was a role if people were interested. You are achieving more with those people, though.
KL: What pitfalls and problems have you found?
SO: The question is, can automatically applied metadata create meaningful scalable user journeys? Can you go on a nice journey from the bottom up? If you apply metadata to content, does that create a user journey between content that a user actually wants to follow? We think this is right, but we don't know. The 'about' narrative can be fairly weak. We need the relationships to be stronger, more typed, more ontological.
Usage is shifting user expectations. If you use it to cross-link content, then expectations of quality may be higher than when metadata is used for, say, search results ranking.
Interestingly, the New York Times now bases related links on your LinkedIn profile, if you are logged in. It gets the industry you work in from LinkedIn, then targets related links around your area of work. There are potential problems with this happening in the background without clearly communicating what is happening to the user.
HL: The main barrier to successful implementation of automatic classification systems is the English language itself! The quirks that make English fantastic for cryptic crosswords are the ones that cause ambiguity, multiple meanings and flexibility in parts of speech. Words shift meaning over time, new words come into the language, and idioms can be a nightmare. All of these may impair the ability of an automated system to provide the high quality that is needed for the kind of applications that Silver is talking about.
KL: How do taxonomies fit in?
HL: Taxonomies can be the glue of an automatic classification implementation. They are the vocabulary that rules, whether Boolean or statistical, are built upon, allowing concepts to be applied consistently to content. Taxonomies also provide the framework of relationships, such as synonyms and related terms between concepts - they help the automatic system to understand the domain in the way that users do.
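To illustrate how a taxonomy's synonyms and related terms can feed a classifier, here is a minimal editorial sketch. The taxonomy fragment, its field names and the `tag` function are hypothetical and not taken from any system mentioned in the interview.

```python
import re

# Hypothetical taxonomy fragment: preferred term -> synonyms and related terms
TAXONOMY = {
    "Interest rates": {"synonyms": ["base rate", "cost of borrowing"],
                       "related": ["Inflation"]},
    "Inflation": {"synonyms": ["rising prices", "cpi"],
                  "related": ["Interest rates"]},
}

def build_lookup(taxonomy):
    """Map every label (preferred or synonym, lowercased) to its preferred term."""
    lookup = {}
    for preferred, entry in taxonomy.items():
        for label in [preferred] + entry["synonyms"]:
            lookup[label.lower()] = preferred
    return lookup

def tag(text, taxonomy):
    """Return (applied terms, related terms worth suggesting to an editor)."""
    lookup = build_lookup(taxonomy)
    lowered = text.lower()
    applied = set()
    for label, preferred in lookup.items():
        # Whole-phrase match so that 'cpi' does not fire inside another word.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            applied.add(preferred)
    suggested = {rel for term in applied for rel in taxonomy[term]["related"]}
    return sorted(applied), sorted(suggested - applied)

print(tag("The cost of borrowing rose again", TAXONOMY))
```

Whichever synonym the article uses, it receives the same preferred term, which is the consistency Helen refers to.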
SO: We have started to talk about web-scale identifiers, something that performs like barcodes and ISBNs. As we start to use data on a Web scale, we will need commonly shared web identifiers. An example of this is BBC Music's use of MusicBrainz IDs for artists and tracks. These are commonly understood IDs for these entities on the web. There has been a recent surge of services (OpenCalais, Zemanta, Muddy Boots) offering auto-classification systems that provide web-scale identifiers (usually DBpedia). It has become somewhat of a holy grail of auto-classification.
KL: ... and folksonomies?
HL: Where there is user interest, and a potential or existing community, folksonomies/tags are a valuable repository of concepts that an automatic classification system can use in the same way as a more controlled vocabulary. Also, some domains may lend themselves to more flexible approaches to automatic tagging e.g. where they are very fast-moving, interactive or subjective. In this situation, an automated system may not be able to keep up. Finally, where the content is non-textual (such as music), leveraging the interest and knowledge of users may be the best way of generating information for automatic indexing systems.
SO: We're avoiding them at the moment, but they have their place. I'm not convinced, outside of particular use cases, that there is the user motivation. I did find a Japanese site called Nico Nico Douga. Here, free text tags are used to tag user-submitted videos. Only five tags are allowed at a time, so huge tag wars kick off. Eventually a stasis is reached with a negotiated semantic meaning. Interestingly, most tags are unique to Nico Nico Douga and some become so important they warrant their own entries in Wikipedia. In addition, videos are now made for tags as opposed to the other way round. For me, this is where folksonomy is interesting, but we are committed to disambiguated web-scale identifiers for driving our business requirements.
KL: When do you think auto-classification is, and isn't, suitable?
HL: Some of the key things to consider are:
- Volume and importance of content to be indexed - so for a technical or scientific system, where individual items have a high value, the greatest value from indexing may very well come from specialist manual indexers who know the domain inside out
- Timeliness - automatic classification can offer great benefits in getting well-tagged content out of the door, especially where there is a need for content to be published quickly e.g. news
- Integration with business processes for content production i.e. when automatic classification is seen as a tool to help support the people who publish and produce content, it has a greater chance of being successful than if it has no other purpose than to "get rid of a few indexers".
SO: I think it can always be suitable. The Powerhouse Museum does this well (http://www.powerhousemuseum.com/). It combines curatorial annotation, user tags and auto-cat on all its content items. I think what it does really well is make it clear to the user exactly where these navigation items are coming from. You have to be transparent about where different annotations are coming from and their relative value.
KL: What are your hopes and expectations for auto-cat?
HL: I would like to see better tools with more user-friendly user interfaces. I also think it's time for automatic classification to move beyond its place behind-the-scenes to being a valuable part of how we present, organise and link information on the web, especially the relatively uncharted territory of the semantic web.
SO: Again, the Holy Grail for me is a high quality, publicly available auto-cat service that provides web-scale identifiers. Watch for OpenCalais, Zemanta and Muddy Boots.
KL: Thank you, both, very much.
Karen Loasby is contributing editor of the FUMSI Manage practice area.
Silver Oliver is an Information Architect currently working at the BBC. His background is in Library science but most of the last six years has been spent working in user experience design. His interests are in metadata, taxonomies and navigation. Silver blogs at www.blockslabpillar.com
Helen Lippell is a freelance information architect. She has worked in the information field ever since realising a degree in Latin and Economics didn't open up an obvious career path. She started working life as an indexer for the Financial Times, and developed automatic categorisation systems for ft.com. She worked in metadata management and information architecture for bbc.co.uk, and is currently building taxonomies in central government. She lives in London and enjoys reading, cycling and following hopeless football teams.
Copyright 2010 Free Pint Limited