Refining topic modeling to automate taxonomy #JournalismAI
In the previous post, we explored various topic modeling algorithms and tested their accuracy. We were leaning towards BERTopic because it produced more diverse topics and was a tad more accurate (52.97%) than Top2Vec.
Our next goal was to increase the accuracy of the algorithm.
Approaches to increase accuracy include:
- Testing pre-trained embedding models to see if they fit our use case.
- Tuning the hyper-parameters of UMAP and HDBSCAN.
Trying out more embeddings
We know that training in BERTopic takes significantly longer than in Top2Vec. Hence, we ran a quick test, training and testing on only 5,000 stories, to see whether the results were promising.
First, we tried out pre-trained sentence transformers models in BERTopic. Below were the results.
We also tried out the universal-sentence-encoder with Top2Vec but found the accuracy to be very low (33%).
Finally, we tried Doc2Vec, which generates both word and document embeddings. This performed significantly better (>70%).
Evaluating failed cases
We grouped all failed cases by section (and other dimensions) to see if there was a clear pattern and found that stories from the Etimes (entertainment, TV, web series, and lifestyle) section accounted for 97% of the failed cases.
It turns out that most entertainment stories are focused on named entities (celebrities) rather than topics. Top2Vec still classified these stories into 344 topics. While this could be correct from a traditional topic-determination perspective, it did not fit our use case, i.e. identifying newscycles (what Twitter calls Trends).
Hence, we decided to retrain and test the model after dropping Etimes stories from the dataset.
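The failure audit and the subsequent filtering are straightforward in pandas. The column names below (`section`, `failed`) and the toy data are assumptions for illustration only.

```python
# Sketch: auditing failed cases by section, then dropping Etimes stories.
# Column names and data are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "story_id": [1, 2, 3, 4, 5],
    "section":  ["Etimes", "India", "Etimes", "Sports", "Etimes"],
    "failed":   [True, False, True, False, True],
})

# Share of all failures contributed by each section
failure_share = (
    df[df["failed"]].groupby("section").size() / df["failed"].sum()
)
print(failure_share)

# Retrain on the dataset without Etimes stories
train_df = df[df["section"] != "Etimes"]
```

Grouping by other dimensions (publish time, story length, author desk) is the same pattern with a different `groupby` key.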
Selecting an algorithm
We retrained the Doc2Vec model with 25,000 stories (after removing the Etimes stories), and the accuracy of the model rose from 50.73% (reported in the previous post) to 73.55%.
Satisfied with the accuracy for now, we decided to move to production and test the results in real-world scenarios.
Dashboard for supervising the output
The dashboard lists all the topics that the algorithm has found.
Editors can choose a topic to see all the stories in it.
Editors can label the topic and write a 160-character description for it.
Editors can also mark a story as a False Positive, i.e., a story that the algorithm placed in a topic but that doesn't belong there.
On doing so, the row appears struck out in the dashboard.
Editors can also fix False Negatives, i.e. stories that the Editor feels are part of the topic but that the algorithm did not catch. They do this by adding the story's ID.
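One way to picture the editor corrections the dashboard collects is a small per-topic record like the sketch below. This data model is hypothetical; the dashboard's actual schema may differ.

```python
# Hypothetical sketch of a per-topic review record for editor corrections.
from dataclasses import dataclass, field

@dataclass
class TopicReview:
    topic_id: int
    label: str = ""                     # editor-assigned topic label
    description: str = ""               # up to 160 characters
    false_positives: set = field(default_factory=set)  # story IDs struck out
    false_negatives: set = field(default_factory=set)  # story IDs added by editors

    def set_description(self, text: str) -> None:
        if len(text) > 160:
            raise ValueError("description must be at most 160 characters")
        self.description = text

review = TopicReview(topic_id=42, label="Budget 2023")
review.set_description("Coverage of the 2023 union budget announcements.")
review.false_positives.add("story-101")  # algorithm tagged it; editor disagrees
review.false_negatives.add("story-205")  # editor adds a missed story by its ID
```

Keeping these corrections in a structured form also makes them reusable later as labeled data for re-evaluating the model.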
We’ll deploy the algorithm in the production environment and generate and tag topics against TOI and TOI+ stories live, with a 15-minute delay.
To build confidence in the algorithm’s ability to find and tag topics, we will also get our Editors to use the dashboard.
We are also anticipating that editors might ask for functionality to merge two topics or split a topic into subtopics.
Finally, we’ll also start work on building a basic timeline.