Refining TOI’s supply-side taxonomy
This post was originally published at next.timesofindia.com with Niharika Bisaria.
Sometimes, to achieve X you need to do Y and the outcomes of Y are not immediately visible. Rebuilding TOI’s Taxonomy was one such infrastructural project for us.
Lack of processes
- There is no written standard operating procedure (SOP) for what makes a story an explainer versus an opinion piece. Hence, the decision on how to classify a story varies.
- Sometimes, for tactical gains, after an election concluded, decisions were taken to file election-related stories in the India or Politics section instead of the Elections section.
As a result, it is often difficult to know where to look for a particular story.
The immediate impact is that our audiences struggle to find relevant content.
We first discovered this problem when we were redoing the Covid-19 coverage in April 2021. A content audit of the website revealed that, beyond hard news on Coronavirus, the TOI website already had the following content, but it was all siloed in individual sections:
- Coronabytes — a newsletter on Coronavirus that was read by over a million subscribers.
- A daily podcast on the Covid-19 situation.
- A daily LiveBlog that kept audiences up-to-date about the situation on the ground.
- Etimes had a rich collection of explainers on Covid-19 prevention and treatment.
- TOI Plus had over 100 nuanced pieces, the most useful of which were Q&As with doctors.
- The data team had built rich dashboards that tracked the spread of the virus.
While we had a rich set of information on Covid-19, discovery by audiences was not easy because the content was scattered across sections.
Strategically, AI is inevitable.
However, recommendation and personalization algorithms need clean data, and taxonomy labels are among the features used for model building. A clear-cut taxonomy would help keep the database clean; otherwise, errors in the input data would propagate into the AI’s output.
Hence, we embarked on a project to correct TOI’s taxonomy.
Must we reinvent the wheel?
Our first hunch was to implement either the IPTC or the IAB taxonomy. In fact, Times Internet’s ad-tech product Columbia already had algorithms auto-classifying content by the IAB taxonomy. However, off-the-shelf taxonomies don’t work for all use cases.
Below are some decisions we took before starting the discovery process.
- Start from the End: TOI Plus, our subscription product, was the startup in the group. It was new and the future in many ways. Hence, we wanted to work backwards from it.
- Content Journey: The Editorial Team should get clear signals on when a story should scale up into a cluster, a cluster into a microsite/landing page, and finally into a section. We showed the use of clusters previously, in the blog post “Editors can mix-n-match Newscards for different use cases”.
- Backward Compatibility: We wanted to minimize the disruption among the editorial teams and the website, yet build a slow ramp up to the goal. Hence, the new taxonomy should map back to the existing taxonomy used in TOI, Etimes, and our Sports Section. This would allow us to build user journeys across the two taxonomies.
- Extensible: The taxonomy is a hierarchical tree, so if new sub-topics need to be created, the taxonomy governance team can choose to add them.
- Tagging SOP: Until an AI classification model took over, the desk would have to tag. Hence, it was important that the rules for tagging be deterministic enough to take us down to at least level two (or three) of the tree.
- Editorial Leads/SPOCs: To aid with the commissioning decisions, SPOCs were assigned to various topics/levels within the taxonomy.
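To make the Backward Compatibility and Extensible decisions concrete, here is a minimal sketch of a hierarchical taxonomy whose nodes can map back to a legacy section. It is an illustration under assumptions, not TOI’s actual data model: the `TopicNode` class, the `resolve_legacy_section` helper, and the "Judiciary" subtopic are all hypothetical. The rule it assumes is that a story inherits the legacy section of the nearest tagged ancestor on its path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TopicNode:
    """One node of a hypothetical taxonomy tree."""
    name: str
    # Section in the existing (legacy) taxonomy, if one is mapped here.
    legacy_section: Optional[str] = None
    children: Dict[str, "TopicNode"] = field(default_factory=dict)

    def add(self, name: str, legacy_section: Optional[str] = None) -> "TopicNode":
        """Attach a new sub-topic, which keeps the tree extensible."""
        node = TopicNode(name, legacy_section)
        self.children[name] = node
        return node


def resolve_legacy_section(root: TopicNode, path: List[str]) -> Optional[str]:
    """Walk `path` from the root and return the legacy section of the
    deepest node on the path that has one (assumed inheritance rule)."""
    node = root
    section = root.legacy_section
    for name in path:
        node = node.children[name]  # raises KeyError for an unknown topic
        if node.legacy_section is not None:
            section = node.legacy_section
    return section


# Toy tree: only "India and Constitution" comes from the post;
# the subtopics and mappings are made up for illustration.
root = TopicNode("root")
india = root.add("India and Constitution", legacy_section="India")
india.add("Elections", legacy_section="Elections")
india.add("Judiciary")  # no mapping of its own, inherits "India"

print(resolve_legacy_section(root, ["India and Constitution", "Judiciary"]))
print(resolve_legacy_section(root, ["India and Constitution", "Elections"]))
```

With a mapping like this, a story tagged in the new tree can still be routed to an existing section page, so both taxonomies can coexist during the ramp-up.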
Discovering the taxonomy
We always knew that discovering a new taxonomy should be editorial-led and product-supported. Hence, to lead this effort we hired a senior editor supported by 2–3 people. The goal was to encode her judgment as training data for an AI/ML algorithm.
Over the next 3–4 months, they iteratively narrowed down to a hierarchical taxonomy, working from the content already published over the years and using methods similar to Grounded Theory.
Constant Feedback Cycle
We also did rounds of checks with the TOI Plus editors who were allocated as SPOCs to parts of the taxonomy. The SPOCs evaluated the taxonomy logic and helped us to create bundles which we plugged in at the bottom of articles to provide the reader with active links for related stories. (We will talk about this in a separate blog.)
Along with the Denmark team, we had multiple rounds of consultation with our News Editor and Cities Editor to narrow down to a tagging UX that was least inconvenient: no memorization by the desk and the fewest possible clicks.
The final taxonomy tree has 242 subtopics (nodes) spread across a depth of 4 and width of 98.
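For readers who want to compute such figures for their own tree, the sketch below counts nodes, depth, and width with a breadth-first walk. It assumes the taxonomy is stored as a nested dictionary and that "width" means the largest number of nodes on any single level; the post does not define width, so that reading is an assumption. The `tree_stats` function and the toy topics are hypothetical.

```python
from collections import deque


def tree_stats(tree):
    """Return (node_count, depth, width) for a nested dict such as
    {topic: {subtopic: {...}}}. Level 1 is the top-level topics."""
    queue = deque((children, 1) for children in tree.values())
    per_level = {}  # level -> number of nodes on that level
    while queue:
        children, level = queue.popleft()
        per_level[level] = per_level.get(level, 0) + 1
        for grandchildren in children.values():
            queue.append((grandchildren, level + 1))
    node_count = sum(per_level.values())
    depth = max(per_level, default=0)
    width = max(per_level.values(), default=0)
    return node_count, depth, width


# Toy taxonomy, made up for illustration.
toy = {
    "India": {"Elections": {}, "Judiciary": {"Supreme Court": {}}},
    "World": {"South Asia": {}},
}
print(tree_stats(toy))  # (6, 3, 3)
```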
Deeper understanding of topic performance
Combining the taxonomy with our story rating algorithm showed us which topics do well. At a very high level, stories related to South Asia do twice as well as stories related to the rest of the world, including China. This gives us signals to hire new writers. We’re integrating the taxonomy with Signals, our editorial analytics dashboard.
Replace section pages with taxonomy pages
Each Level 1 topic in the taxonomy tree gets a landing page of its own. For example, in the topic “India and Constitution”, we cover subtopics like:
We were already rating TOI+ stories for editorial analytics. When combined with the taxonomy, this gave us an early recommendation algorithm that we plugged in at the bottom of TOI+ stories. This widget has a 107% higher CTR than the earlier widget that was doing the same job.
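One way a rating-plus-taxonomy recommender could work is sketched below. This is not the algorithm behind the widget, which the post does not describe. It assumes each story carries a taxonomy path and a numeric editorial rating, and it ranks candidates first by how much of the path they share with the current story, then by rating. The `recommend` function, the field names, and the sample stories are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading taxonomy levels two paths have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def recommend(current, candidates, k=3):
    """Return up to k story ids, closest taxonomy match first,
    with ties broken by the higher rating."""
    scored = []
    for story in candidates:
        if story["id"] == current["id"]:
            continue  # never recommend the story being read
        overlap = shared_prefix_len(current["path"], story["path"])
        if overlap == 0:
            continue  # unrelated topic
        scored.append((overlap, story["rating"], story["id"]))
    scored.sort(key=lambda t: (-t[0], -t[1]))
    return [story_id for _, _, story_id in scored[:k]]


# Made-up stories for illustration.
current = {"id": "x", "path": ("India", "Elections"), "rating": 4}
stories = [
    {"id": "a", "path": ("India", "Elections"), "rating": 3},
    {"id": "b", "path": ("India", "Judiciary"), "rating": 5},
    {"id": "c", "path": ("World", "South Asia"), "rating": 5},
    {"id": "d", "path": ("India", "Elections"), "rating": 4},
]
print(recommend(current, stories))  # ['d', 'a', 'b']
```

Putting taxonomy overlap ahead of rating keeps the widget on-topic; swapping the sort keys would instead favour the best-rated stories anywhere under the same Level 1 topic.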
Personalization: This project lays the foundation for introducing the ‘Follow’ button.
AI/ML: While we are tagging articles manually right now, the eventual goal is for AI to suggest and auto-tag stories. For this, we are exploring topic modeling algorithms like Top2Vec and BERTopic.