Refining TOI’s supply-side taxonomy | Ritvvij Parrikh

Refining TOI’s supply-side taxonomy

Published Oct 07, 2022
Updated Jan 11, 2023

This post was originally published at next.timesofindia.com with Niharika Bisaria.


Sometimes, to achieve X you need to do Y and the outcomes of Y are not immediately visible. Rebuilding TOI’s Taxonomy was one such infrastructural project for us.

Need

Categories don’t work. Sections are an ineffective way to organize news websites. Online, people organize around interests.

Lack of processes

Moreover:

  • There is no written standard operating procedure (SOP) for what makes a story an explainer versus an opinion piece. As a result, classification decisions vary.
  • Sometimes, for tactical gains, election-related stories were filed in the India or Politics section instead of the Elections section once an election had concluded.

Because of this, it is often difficult to know where to look for a particular story.

We first discovered this problem when we were redoing the Covid-19 coverage in April 2021. A content audit of the website revealed that, beyond hard news on Coronavirus, the TOI website already had the following content, all of it siloed in individual sections:

  • Coronabytes — a newsletter on Coronavirus that was read by over a million subscribers.
  • A daily podcast on the Covid-19 situation.
  • A daily LiveBlog that kept audiences up-to-date about the situation on the ground.
  • Etimes had a rich collection of explainers on Covid-19 prevention and treatment.
  • TOI Plus had over 100 nuanced pieces; the most useful among them were Q&As with doctors.
  • The data team had built rich dashboards that tracked the spread of the virus.

While we had a rich set of information on Covid-19, audiences could not discover it easily because the content was scattered across sections.

There was a second reason to fix this. Recommendation and personalization algorithms need clean data, such as taxonomy labels, as features for model building. A clear-cut taxonomy would keep the database clean; without it, errors in the input data would propagate into the AI’s output.

Hence, we embarked on a project to correct TOI’s taxonomy.


Must we reinvent the wheel?

Our first hunch was to implement either the IPTC or the IAB taxonomy. In fact, Times Internet’s ad-tech product Columbia already had algorithms auto-classifying content by the IAB taxonomy. However, off-the-shelf taxonomies don’t work for all use cases.


Criteria

Below are some decisions we took before starting the discovery process.

  • Start from the End: TOI Plus, our subscription product, was the startup in the group. It was new and the future in many ways. Hence, we wanted to work backwards from it.
  • Content Journey: The Editorial Team should get clear signals on when a story should scale up into a cluster, a cluster into a microsite/landing page, and finally into a section. Previously, in the blog post “Editors can mix-n-match Newscards for different use cases”, we showed the use of clusters.
Credits: As of May 21, 2022, we have a cluster running on data stories related to NFHS data.
  • Backward Compatibility: We wanted to minimize the disruption among the editorial teams and the website, yet build a slow ramp up to the goal. Hence, the new taxonomy should map back to the existing taxonomy used in TOI, Etimes, and our Sports Section. This would allow us to build user journeys across the two taxonomies.
  • Extendible: The taxonomy is a hierarchical tree, so the taxonomy governance team can add new sub-topics whenever they are needed.
  • Tagging SOP: Until an AI classification model took over, the desk would have to tag. Hence, it was important that the rules for tagging be deterministic enough to take us down to at least level two (or three) of the tree.
  • Editorial Leads/SPOCs: To aid with the commissioning decisions, SPOCs were assigned to various topics/levels within the taxonomy.
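The backward-compatibility, extendibility, and SPOC criteria above can be pictured with a small data structure. What follows is a minimal sketch, not TOI’s actual implementation: the `Topic` class, its fields, the "Business" legacy section, and the "economy-lead" SPOC are all hypothetical names chosen for illustration. Each node can carry a link to an existing section, and a node without one inherits the link from its nearest ancestor.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Topic:
    """One node in a hierarchical taxonomy tree (illustrative only)."""
    name: str
    legacy_section: Optional[str] = None  # assumed link to the old section
    spoc: Optional[str] = None            # assumed editorial lead for this node
    parent: Optional["Topic"] = field(default=None, repr=False, compare=False)
    children: Dict[str, "Topic"] = field(default_factory=dict)

    def add_subtopic(self, name: str, legacy_section: Optional[str] = None,
                     spoc: Optional[str] = None) -> "Topic":
        """Extend the tree in place; existing nodes are left untouched."""
        if name in self.children:
            raise ValueError(f"{name!r} already exists under {self.name!r}")
        child = Topic(name, legacy_section, spoc, parent=self)
        self.children[name] = child
        return child

    def path(self) -> List[str]:
        """Topic names from level 1 down to this node (the root is skipped)."""
        names: List[str] = []
        node: Optional[Topic] = self
        while node is not None and node.parent is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))

    def resolve_legacy_section(self) -> Optional[str]:
        """Walk up the tree until a node with a legacy mapping is found."""
        node: Optional[Topic] = self
        while node is not None:
            if node.legacy_section:
                return node.legacy_section
            node = node.parent
        return None


root = Topic("TOI")  # virtual root, not a topic itself
economy = root.add_subtopic("Economy", legacy_section="Business",
                            spoc="economy-lead")
budget = economy.add_subtopic("Budget")

print(budget.path())                    # ['Economy', 'Budget']
print(budget.resolve_legacy_section())  # Business
```

Resolving the legacy section by walking up the tree means a newly added sub-topic is immediately usable on the old site without any extra mapping work.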

Discovering the taxonomy

We always knew that discovering a new taxonomy should be editorial-led and product-supported. Hence, to lead this effort we hired a senior editor supported by 2–3 people. The goal was to encode her judgment as training data for an AI/ML algorithm.

Over the next 3–4 months, they iteratively narrowed down to a hierarchical taxonomy from the content already published over the years, using methods similar to Grounded Theory.

Below are some broad rules we adopted along the way:

Iterative: As with any grounded theory-like method, we went through multiple rounds of refining the taxonomy. Our aim was to make our taxonomy agnostic of individual interpretations and biases. We discussed it within the product and edit teams, revised our rules again and again, and finally created the logic for each taxonomy topic.

Focus on audience segments: Each topic in the taxonomy ought to have a narrow audience segment. For example, should a story on the impact of Covid on mental health go to a section on Covid or to a section on mental health? Every time we reached such a point, we asked ourselves a follow-up question: would most audiences interested in mental health read this story, or would most audiences interested in Covid read it?

Easy first: Some stories are easy to classify, for example business, personal finance, etc. We completed these topics first and then moved to the more nuanced ones like secularism, democracy, etc.

Splitting and merging: Often a particular topic would become too big. For instance, we realized we had innumerable stories about the economy, so we broke it down further into sub-topics like budget, expenditure, revenue, reforms, FDI, etc. Conversely, if a topic was too shallow (it was not covered in depth, or there was no intent to cover it in depth), we did not create a new sub-topic for it.

Knowing when to stop: It was possible to go even deeper and keep creating new topics, as the themes that stories covered were manifold. However, once we were satisfied that we had covered a reasonably large layer of topics, we decided to stop.

We also did rounds of checks with the TOI Plus editors who were allocated as SPOCs to parts of the taxonomy. The SPOCs evaluated the taxonomy logic and helped us create bundles, which we plugged in at the bottom of articles to give readers active links to related stories. (We will talk about this in a separate blog.)

Along with the Denmark team, we had multiple rounds of consultation with our News Editor and Cities Editor to narrow down to a tagging UX that was least inconvenient: no memorization by the desk and the fewest possible clicks.


Outcome

The final taxonomy tree has 242 subtopics (nodes) spread across a depth of 4 and width of 98.
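For readers who want to compute the same three numbers for their own taxonomy, here is a minimal sketch. It assumes the tree is stored as nested dictionaries and that "width" means the largest number of nodes on any single level; the post does not define width, so that reading is an assumption. The toy tree reuses a few topic names from this post and is not the real taxonomy.

```python
from collections import Counter
from typing import Dict, Tuple

Tree = Dict[str, "Tree"]  # topic name -> its subtopics


def tree_shape(tree: Tree) -> Tuple[int, int, int]:
    """Return (node count, depth, width) of a nested-dict taxonomy.

    Width is taken here as the largest number of nodes on one level,
    which is an assumption about how the metric is defined.
    """
    per_level: Counter = Counter()
    stack = [(tree, 1)]  # level-1 topics sit directly under the root dict
    while stack:
        subtree, level = stack.pop()
        for children in subtree.values():
            per_level[level] += 1
            stack.append((children, level + 1))
    if not per_level:
        return 0, 0, 0
    return sum(per_level.values()), max(per_level), max(per_level.values())


toy_tree: Tree = {
    "Economy": {"Budget": {}, "Reforms": {}, "FDI": {}},
    "India and Constitution": {"Secularism": {}, "Federalism": {}},
}

print(tree_shape(toy_tree))  # (7, 2, 5)
```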

Discovery of new topics

Over the years, we have covered stories in areas that we had not formally recognized as beats. For example, we found that we had over 100 deeply researched stories on the Internet. The topic covered stories on BigTech, Internet and Culture, Privacy, Regulation, Security, etc. This was different from our Cryptocurrency or Web 3.0 section.

Almost 10% of our coverage overlapped with Etimes, yet these TOI Plus stories were not filed in the right sections.

Combining the taxonomy with our story rating algorithm showed us which topics do well. At a very high level, stories related to South Asia do twice as well as stories related to the rest of the world, including China. This gives us signals on where to hire new writers. We’re integrating the taxonomy with Signals, our editorial analytics dashboard.

Credits: Screenshots from the Signals Editorial Analytics platform. More on this later.
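The kind of roll-up described above (story ratings grouped by taxonomy topic) can be sketched in a few lines. This is an illustration under assumptions, not the Signals implementation: the `topic_path` and `rating` field names are hypothetical, and the ratings below are made-up toy values, not TOI data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def mean_rating_by_topic(stories: List[dict], level: int = 1
                         ) -> Dict[Tuple[str, ...], float]:
    """Average story rating per taxonomy topic at the given tree level."""
    grouped = defaultdict(list)
    for story in stories:
        path = story["topic_path"]
        if len(path) >= level:  # skip stories tagged above this level
            grouped[tuple(path[:level])].append(story["rating"])
    return {topic: mean(ratings) for topic, ratings in grouped.items()}


# Toy data for illustration only.
toy_stories = [
    {"topic_path": ["Economy", "Budget"], "rating": 4.0},
    {"topic_path": ["Economy", "FDI"], "rating": 2.0},
    {"topic_path": ["Internet", "Privacy"], "rating": 5.0},
]

print(mean_rating_by_topic(toy_stories))
# {('Economy',): 3.0, ('Internet',): 5.0}
```

Raising `level` gives the same comparison one step further down the tree, which is where the question of whether a sub-topic deserves its own writer tends to get decided.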

Audience-facing rollout

Showcase diversity of topics

On a typical day, ~80% of TOI+ stories would be tagged in the India section. Since the rollout of the taxonomy, our audiences can see the diversity of topics that the TOI+ edit team covers.

Before and After the taxonomy

Each Level 1 topic in the taxonomy tree gets a landing page of its own. For example, in the topic “India and Constitution”, we cover subtopics like:

  • Democratic
  • Fraternity
  • Liberty
  • Federalism
  • Justice
  • Equality
  • Secularism

We were already rating TOI+ stories for editorial analytics. Combined with the taxonomy, these ratings gave us an early recommendation algorithm that we plugged in at the bottom of TOI+ articles. This widget has a 107% higher CTR than the earlier widget that did the same job, that is, a little over twice the click-through rate.
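The post does not spell out how ratings and taxonomy were combined, so the following is only one plausible sketch of such a widget, not the production logic. It assumes each story carries a taxonomy path and a numeric rating, prefers candidates that share the deepest taxonomy prefix with the current story, and breaks ties by rating. The `Story` type and all field names are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Story(NamedTuple):
    story_id: str
    topic_path: Sequence[str]  # e.g. ("Economy", "Budget")
    rating: float              # editorial rating; scale is an assumption


def shared_depth(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading taxonomy levels two paths have in common."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def recommend(current: Story, candidates: List[Story], k: int = 3) -> List[Story]:
    """Pick up to k related stories: closest topic first, then best rated."""
    scored = [
        (shared_depth(current.topic_path, c.topic_path), c.rating, c)
        for c in candidates
        if c.story_id != current.story_id
    ]
    related = [item for item in scored if item[0] > 0]  # drop unrelated topics
    related.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [story for _, _, story in related[:k]]


current = Story("a", ("Economy", "Budget"), 3.0)
pool = [
    Story("b", ("Economy", "Budget"), 2.0),
    Story("c", ("Economy", "FDI"), 5.0),
    Story("d", ("Internet", "Privacy"), 5.0),
]

print([s.story_id for s in recommend(current, pool, k=2)])  # ['b', 'c']
```

Ranking by topic closeness before rating keeps the widget on-topic; swapping the order of the sort key would instead favour the best-rated story anywhere in the same Level 1 topic.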


What next

Personalization: This project lays the foundation for introducing the ‘Follow’ button.

AI/ML: While we are tagging articles manually right now, the eventual goal is for AI to suggest and auto-tag stories. For this, we are exploring topic modeling algorithms like Top2Vec and BERTopic.