Researchers at HSE in St Petersburg Develop Superior Machine Learning Model for Determining Text Topics

They also revealed poor performance of neural networks on such tasks

Topic models are machine learning algorithms designed to analyse large text collections based on their topics. Scientists at HSE Campus in St Petersburg compared five topic models to determine which ones performed better. Two models, including GLDAW developed by the Laboratory for Social and Cognitive Informatics at HSE Campus in St Petersburg, made the lowest number of errors. The paper has been published in PeerJ Computer Science.

Determining the topic of a publication is usually not difficult for the human brain. For example, any editor can easily tag this article with science, artificial intelligence, and machine learning. However, the process of sorting information can be time-consuming for a person, which becomes critical when dealing with a large volume of data. A modern computer can perform this task much faster, but it requires solving a challenging problem: identifying the meaning of documents based on their content and categorising them accordingly.

This is achieved through topic modelling, a branch of machine learning that aims to categorise texts by topic. Topic modelling is used to facilitate information retrieval, analyse mass media, identify community topics in social networks, detect trends in scientific publications, and address various other tasks. For example, analysing financial news can help predict trading volumes on the stock exchange, which are significantly influenced by politicians' statements and economic events.

Here's how working with topic models typically unfolds: the algorithm takes a collection of text documents as input. As output, it assigns each document a degree of belonging to specific topics. These assessments are based on the frequency of word usage and the relationships between words and sentences. Thus, words such as ‘scientists,’ ‘laboratory,’ ‘analysis,’ ‘investigated,’ and ‘algorithms’ found in this text categorise it under the topic of ‘science.’

However, many words can appear in texts covering various topics. For example, the word ‘work’ is often used in texts about industrial production or the labour market. When it appears in the phrase ‘scientific work,’ though, it marks the text as pertaining to ‘science.’ Such relationships, expressed mathematically through probability matrices, form the core of these algorithms.
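As a rough illustration of how word probabilities can translate into a document's degree of belonging to each topic, the sketch below uses a small hand-made probability table. The topics, words, and numbers are invented for illustration, and the averaging rule is a deliberate simplification rather than the inference procedure used by any of the models discussed in this article.

```python
# Toy word-given-topic probabilities (invented values, for illustration only).
WORD_GIVEN_TOPIC = {
    "science": {"scientists": 0.30, "laboratory": 0.25, "algorithms": 0.25,
                "work": 0.15, "market": 0.05},
    "economy": {"scientists": 0.05, "laboratory": 0.05, "algorithms": 0.10,
                "work": 0.40, "market": 0.40},
}


def topic_shares(words):
    """Return each topic's share for a document given as a list of words.

    For every known word, the word's probabilities under each topic are
    normalised to sum to 1; the document's share is the average over words.
    """
    totals = {topic: 0.0 for topic in WORD_GIVEN_TOPIC}
    counted = 0
    for word in words:
        probs = {t: table.get(word, 0.0) for t, table in WORD_GIVEN_TOPIC.items()}
        norm = sum(probs.values())
        if norm == 0.0:
            continue  # the word is unknown to every topic, so skip it
        for topic, p in probs.items():
            totals[topic] += p / norm
        counted += 1
    if counted == 0:
        return totals  # no known words: every share stays at 0.0
    return {topic: value / counted for topic, value in totals.items()}


# 'work' alone leans towards 'economy', but alongside 'scientists' and
# 'laboratory' the document as a whole leans towards 'science'.
shares = topic_shares(["scientists", "laboratory", "work"])
print(shares)
```

An ambiguous word such as ‘work’ contributes to both topics, so the surrounding words decide where the document ends up.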

Topic models can be enhanced by creating embeddings—fixed-length vectors that describe a specific entity based on various parameters. These embeddings serve as additional information acquired through training the model on millions of texts. 

Any phrase or text, such as this news item, can be represented as a sequence of numbers: a vector, that is, a point in a vector space. In machine learning, these numerical representations are referred to as embeddings. The idea is that measuring distances and detecting similarities between such vectors becomes easier, allowing comparisons between two or more texts. If the embeddings describing the texts are sufficiently similar, then the texts likely belong to the same category or cluster, that is, to a specific topic.
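One common way to compare two embeddings is cosine similarity, which measures the angle between the vectors. The sketch below is a minimal example under stated assumptions: the three-dimensional vectors and their names are made up for illustration (real embeddings typically have hundreds of dimensions), and the article does not say which similarity measure the researchers used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of three texts (values invented for illustration).
science_news = [0.9, 0.1, 0.2]
ml_paper = [0.8, 0.2, 0.3]
sports_report = [0.1, 0.9, 0.1]

close = cosine_similarity(science_news, ml_paper)
far = cosine_similarity(science_news, sports_report)
print(close > far)  # the two science-related texts are more similar
```

Texts whose embeddings point in nearly the same direction score close to 1 and are candidates for the same topic.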

Scientists at the HSE Laboratory for Social and Cognitive Informatics in St Petersburg examined five topic models—ETM, GLDAW, GSM, WTM-GMM and W-LDA, which are based on different mathematical principles:

  • ETM is a model co-authored by David M. Blei, one of the founders of the field of topic modelling in machine learning. The model builds on latent Dirichlet allocation and combines variational inference, used to calculate probability distributions, with embeddings.
  • Two models—GSM and WTM-GMM—are neural topic models.
  • W-LDA is based on Gibbs sampling and incorporates embeddings, but also uses latent Dirichlet allocation, similar to the Blei model.
  • GLDAW relies on a broader collection of embeddings to determine the association of words with topics.

For any topic model to perform effectively, it is crucial to determine the optimal number of categories or clusters into which the information should be divided. This is an additional challenge when tuning algorithms.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

Typically, a person does not know in advance how many topics are present in the information flow, so the task of determining the number of topics must be delegated to the machine. To accomplish this, we proposed measuring a certain amount of information as the inverse of chaos. If there is a lot of chaos, then there is little information, and vice versa. This allows for estimating the number of clusters, or in our case, topics associated with the dataset. We applied these principles in the GLDAW model.

The researchers evaluated the models on stability (number of errors), coherence (establishing connections), and Renyi entropy (measuring the degree of chaos). The algorithms' performance was tested on three datasets: materials from the Russian-language news resource Lenta.ru and two English-language datasets, 20 Newsgroups and WoS. These datasets were chosen because all their texts had already been assigned tags, making it possible to evaluate how well the algorithms identified the topics.
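To give a sense of what ‘measuring the degree of chaos’ means, the sketch below computes the standard Renyi entropy of a probability distribution, H_alpha = log(sum of p_i^alpha) / (1 - alpha). This is the textbook definition only: the article does not describe how the researchers apply it to topic models, so the function name, the choice of alpha = 2, and the example distributions are assumptions made for illustration.

```python
import math


def renyi_entropy(probs, alpha=2.0):
    """Renyi entropy of order alpha (natural log) of a discrete distribution."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalise so the values sum to 1, and drop zero entries.
    normalised = [p / total for p in probs if p > 0]
    return math.log(sum(p ** alpha for p in normalised)) / (1.0 - alpha)


# A flat distribution is 'chaotic'; a peaked one carries clearer structure.
uniform = renyi_entropy([0.25, 0.25, 0.25, 0.25])
peaked = renyi_entropy([0.97, 0.01, 0.01, 0.01])
print(uniform > peaked)  # the flat distribution has higher entropy
```

Under this reading, a number of topics that yields low entropy corresponds to less chaos and therefore more information, in line with the principle described in the quote above.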

The experiment showed that ETM outperformed other models in terms of coherence on the Lenta.ru and 20 Newsgroups datasets, while GLDAW ranked first for the WoS dataset. Additionally, GLDAW exhibited the highest stability among the tested models, effectively determined the optimal number of topics, and performed well on shorter texts typical of social networks.

Sergey Koltsov, primary author of the paper, Leading Research Fellow, Laboratory of Social and Cognitive Informatics

We improved the GLDAW algorithm by incorporating a large collection of external embeddings derived from millions of documents. This enhancement enabled more accurate determination of semantic coherence between words and, consequently, more precise grouping of texts.

GSM, WTM-GMM and W-LDA demonstrated lower performance than ETM and GLDAW across all three measures. This finding surprised the researchers, as neural network models are generally considered superior to other types of models in many aspects of machine learning. The scientists have yet to determine the reasons for their poor performance in topic modelling.
