
Automatic Classification: A Panel Discussion

January 2009

Written by Karen Loasby

Following their recent ISKO UK presentation, Karen Loasby discusses automatic classification with freelance information architect Helen Lippell and BBC information architect Silver Oliver.

KL: Firstly, when did you get involved in automatic classification and why?

HL: I suppose I'm a grizzled veteran. I started life as a manual indexer for the Financial Times, indexing news articles. I enjoyed the process, and the opportunity to work on an automation project came up. It was a chance to make the product better; I was interested in language and enjoyed reading newspapers. 

SO: I was at the BBC, working on IA and taxonomy projects. The BBC was trying to set up a semi-automatic system and was getting the IAs to manage the system. It was an area I was really interested in and so was happy to get involved. 

KL: What are the different types of system?

HL: Broadly, there are two types. The first is rules-based, which uses Boolean logic (i.e. AND/OR/NOT constructions) and may use other operators (e.g. ones that look at word order, word proximity and content structure) to look for patterns in the text and apply terms accordingly. The second is Bayesian, or statistical, systems, which are trained using a set of representative documents to identify useful or 'statistically rare' concepts.
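As a rough illustration of the rules-based approach Helen describes, the Python sketch below combines AND/OR/NOT tests with a simple proximity operator. The rules, the term names and the `near` helper are invented for this example and do not come from any of the systems discussed.

```python
import re


def tokens(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(words, a, b, window=5):
    """Proximity operator: True if a and b occur within `window` tokens."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


# Hypothetical rules: each maps a term to a Boolean test over the tokens.
RULES = {
    # (interest AND rates) OR (bank NEAR rate)
    "Interest rates": lambda w: ("interest" in w and "rates" in w)
    or near(w, "bank", "rate"),
    # (football OR striker) AND NOT american
    "Football": lambda w: ("football" in w or "striker" in w)
    and "american" not in w,
}


def classify(text):
    """Return every term whose rule matches the text."""
    words = tokens(text)
    return sorted(term for term, rule in RULES.items() if rule(words))


print(classify("The Bank raised its base rate"))
print(classify("The striker signed for an American football team"))
```

The first call returns `['Interest rates']` through the proximity test; the second returns an empty list because the NOT clause excludes it. Real rule sets tend to be much larger and need the kind of domain research described later in the discussion.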

SO: Choosing an approach isn't straightforward. Neither approach is necessarily easier or quicker. I don't want to say it depends, but it does. Combining multiple approaches could be the way forward. 
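For comparison with rules, a statistical system of the kind mentioned above can be sketched as a small naive Bayes classifier trained on example documents. This is a minimal illustration using only the Python standard library; the class name, the training sentences and the labels are all made up, and a production system would need far more training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # training documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokens(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(n_docs / total_docs)
            for word in tokens(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training set, standing in for a set of representative documents.
model = NaiveBayes()
model.train("Shares fell as the central bank raised interest rates", "Economy")
model.train("Inflation and interest rates worry the markets", "Economy")
model.train("The striker scored a late goal in the cup final", "Sport")
model.train("The manager praised the team after the match", "Sport")

print(model.classify("The bank is expected to cut rates"))
```

No rules are written by hand here: the behaviour comes entirely from the training documents, which is why choosing a representative training set matters so much.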

KL: How have you used these tools?

HL: At ft.com I used them to apply terms from different facets to news articles. More recently, in a central government project, I've led work to use automatic classification to re-tag all content on a substantial website against IPSV version 2, leveraging the speed and consistency of an automated approach. 

SO: I've used them to power BBC Topic pages. In the past, it was more about the back-end. Now it is becoming about driving scalable user journeys. That's an area I'm really interested in. 

KL: What roles are there for humans in these processes?

HL: I'm not describing job roles here, so much as tasks: 

  • There's a need to research the domain in order to train the system, particularly for rules-based systems. Rules have to be created, or training sets chosen for statistical systems
  • There's quality checking the performance of the system. That has to be a human task, and lots of it is needed. It can be good to involve editorial staff here
  • Designing, or being involved in the design of, the product that uses the automatic classification
  • There's usually a role for someone to re-design the tool itself, to make it something that actually feels nice to use and administer. Many of them are pretty awful to use
  • And someone has to sell the value of the system to the business.
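One common way to put numbers on the quality-checking task in the list above is to compare the tags the system applied with the tags a human indexer would have chosen. The sketch below computes precision and recall for a single document. The function name and the sample tags are hypothetical, and this is only one possible measure, not one the panel says it used.

```python
def precision_recall(auto_tags, human_tags):
    """Compare automatically applied tags with a human indexer's tags.

    Precision: share of automatic tags the human agreed with.
    Recall: share of the human's tags the system found.
    """
    auto, human = set(auto_tags), set(human_tags)
    correct = auto & human
    precision = len(correct) / len(auto) if auto else 0.0
    recall = len(correct) / len(human) if human else 0.0
    return precision, recall


# Invented example: the system applied three tags, the indexer chose four.
auto = ["Banking", "Interest rates", "Football"]
human = ["Banking", "Interest rates", "Inflation", "Economy"]
precision, recall = precision_recall(auto, human)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision suggests rules or training data that are too loose; low recall suggests concepts the system is missing. Either way, a person still has to produce the human tags and judge the results.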

SO: New technologies always claim to remove people from the process but they are usually only shifting resources around; you never really manage to get rid of people. The 'semantic search' style products are based on very complicated ontologies. These need to be configured, so you have to understand the complexity. They can be pretty 'black-box', so understanding what they are doing can be a time-consuming task.

HL: At the FT, some of the manual indexers moved on to work on the automatic system. Not everyone, but there was a role if people were interested. You are achieving more with those people, though. 

KL: What pitfalls and problems have you found?

SO: The question is, can automatically applied metadata create meaningful scalable user journeys? Can you go on a nice journey from the bottom up? If you apply metadata to content, does that create a user journey between content that a user actually wants to follow?  We think this is right, but we don't know. The 'about' narrative can be fairly weak. We need the relationships to be stronger, more typed, more ontological. 

Usage is shifting user expectations. If you use it to cross-link content, then expectations of quality may be higher than when metadata is used for, say, search results ranking. 

Interestingly, the New York Times now bases related links on your LinkedIn profile, if you are logged in. It gets the industry you work in from LinkedIn, then targets related links around your area of work. There are potential problems with this happening in the background without clearly communicating what is happening to the user.

HL: The main barrier to successful implementation of automatic classification systems is the English language itself! The quirks that make English fantastic for cryptic crosswords, such as ambiguity, multiple meanings and flexibility in parts of speech, are the same ones that trip up automated systems. Words shift meaning over time, new words come into the language, and idioms can be a nightmare. All of these may impair the ability of an automated system to provide the high quality that is needed for the kind of applications that Silver is talking about.

KL: How do taxonomies fit in?

HL: Taxonomies can be the glue of an automatic classification implementation. They are the vocabulary that rules, whether Boolean or statistical, are built upon, allowing concepts to be applied consistently to content. Taxonomies also provide the framework of relationships, such as synonyms and related terms between concepts - they help the automatic system to understand the domain in the way that users do.
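To make the role of the taxonomy more concrete, the sketch below stores each concept with its synonyms and related terms, tags text with the preferred term whichever synonym appears, and then follows the related-term links. The two-concept taxonomy and the function names are invented for illustration and are not drawn from IPSV or any other real vocabulary.

```python
import re

# Hypothetical taxonomy: preferred term -> synonyms and related terms.
TAXONOMY = {
    "Interest rates": {
        "synonyms": ["base rate", "bank rate"],
        "related": ["Inflation"],
    },
    "Inflation": {
        "synonyms": ["cost of living", "rising prices"],
        "related": ["Interest rates"],
    },
}


def tag(text, taxonomy=TAXONOMY):
    """Apply the preferred term whenever it or one of its synonyms appears."""
    found = set()
    for term, entry in taxonomy.items():
        for phrase in [term] + entry["synonyms"]:
            # Whole-phrase, case-insensitive match.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                found.add(term)
    return sorted(found)


def related(terms, taxonomy=TAXONOMY):
    """Suggest related concepts that have not already been applied."""
    suggestions = set()
    for term in terms:
        suggestions.update(taxonomy[term]["related"])
    return sorted(suggestions - set(terms))


print(tag("The Bank held its base rate despite rising prices."))
print(related(["Interest rates"]))
```

Because every synonym resolves to one preferred term, content is tagged consistently however an author phrases the concept, and the related-term links give a starting point for cross-linking.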

SO: We have started to talk about web-scale identifiers, something that performs like barcodes and ISBNs. As we start to use data on a Web scale, we will need commonly shared web identifiers. An example of this would be BBC Music's use of MusicBrainz IDs for artists and tracks. These are commonly understood IDs for these entities on the web. There has been a recent surge of services (OpenCalais, Zemanta, Muddy Boots) offering auto-classification systems that provide web-scale identifiers (usually DBpedia). It has become somewhat of a holy grail of auto-classification.

KL: ... and folksonomies?

HL: Where there is user interest, and a potential or existing community, folksonomies/tags are a valuable repository of concepts that an automatic classification system can use in the same way as a more controlled vocabulary. Also, some domains may lend themselves to more flexible approaches to automatic tagging, e.g. where they are very fast-moving, interactive or subjective. In this situation, an automated system may not be able to keep up. Finally, where the content is non-textual (such as music), leveraging the interest and knowledge of users may be the best way of generating information for automatic indexing systems.

SO: We're avoiding them at the moment, but they have their place. I'm not convinced, outside of particular use cases, that there is the user motivation. I did find a Japanese site called Nico Nico Douga. Here, free text tags are used to tag user-submitted videos. Only five tags are allowed at a time, so huge tag wars kick off. Eventually a stasis is reached with a negotiated semantic meaning. Interestingly, most tags are totally unique to Nico Nico Douga and some become so important they warrant their own entries in Wikipedia. In addition, videos are now made for tags as opposed to the other way round. For me, this is where folksonomy is interesting, but we are committed to disambiguated web-scale identifiers for driving our business requirements.

KL: When do you think auto-classification is, and isn't, suitable?

HL: Some of the key things to consider are:

  • Volume and importance of content to be indexed - so for a technical or scientific system, where individual items have a high value, the greatest value from indexing may very well come from specialist manual indexers who know the domain inside out
  • Timeliness - automatic classification can offer great benefits in getting well-tagged content out of the door, especially where there is a need for content to be published quickly e.g. news
  • Integration with business processes for content production i.e. when automatic classification is seen as a tool to help support the people who publish and produce content, it has a greater chance of being successful than if it has no other purpose than to "get rid of a few indexers".

SO: I think it can always be suitable. The Powerhouse Museum does this well (http://www.powerhousemuseum.com/). It combines curatorial annotation, user tags and auto-cat on all its content items. I think what it does really well is make it clear to the user exactly where these navigation items are coming from. You have to be transparent about where different annotations are coming from and their relative value.

KL: What are your hopes and expectations for auto-cat?

HL: I would like to see better tools with more user-friendly user interfaces. I also think it's time for automatic classification to move beyond its place behind-the-scenes to being a valuable part of how we present, organise and link information on the web, especially the relatively uncharted territory of the semantic web.

SO: Again, the Holy Grail for me is a high quality, publicly available auto-cat service that provides web-scale identifiers. Watch for OpenCalais, Zemanta and Muddy Boots.

KL: Thank you, both, very much.


Karen Loasby is contributing editor of the FUMSI Manage practice area.

Silver Oliver is an Information Architect currently working at the BBC. His background is in library science but most of the last six years has been spent working in user experience design. His interests are in metadata, taxonomies and navigation. Silver blogs at www.blockslabpillar.com




Helen Lippell is a freelance information architect. She has worked in the information field ever since realising a degree in Latin and Economics didn't open up an obvious career path. She started working life as an indexer for the Financial Times, and developed automatic categorisation systems for ft.com. She worked in metadata management and information architecture for bbc.co.uk, and is currently building taxonomies in central government. She lives in London and enjoys reading, cycling and following hopeless football teams.



Copyright 2008 Free Pint Ltd.
