‘Bots Are Simply Imitators, not Artists’: How to Distinguish Artificial Intellect from a Real Author

Today, text bots like ChatGPT perform many tasks that used to be human work. In our place, they can rewrite ‘War and Peace’ in a Shakespearean style, write a thesis on Ancient Mesopotamia, or create a Valentine’s Day card. But is there any way to identify an AI-generated text and distinguish it from work done by a human being? Can we catch out a robot? Vasilii Gromov, Deputy Head of the HSE School of Data Analysis and Artificial Intelligence and Professor at the HSE Faculty of Computer Science, addressed these questions in his lecture ‘Catch out a Bot, or the Large-Scale Structure of Natural Intelligence’ for the Znanie intellectual society.

‘Why are modern texts created and who writes them?’ asked Vasilii Gromov. His generation and the generation of his listeners grew up on works written by people for people. The authors of such texts put a certain meaning into their works and had a certain goal, whether the book was ‘Sleeping Beauty,’ ‘War and Peace,’ or a textbook on mathematical analysis, the professor notes. Nowadays, however, children are surrounded from a very early age by texts written by an unknown author with an unclear purpose for an undefined audience. Vasilii Gromov and his colleagues wondered whether such a child would grow up the same way previous generations did.

The ongoing change is neither good nor bad, because the world is transforming. Humankind is now experiencing the ‘co-evolution of artificial intelligence and humans.’ As it develops rapidly, AI is adapting to humans, but humans are also beginning to adapt to artificial intelligence. To secure our future, or at least for ‘basic information hygiene,’ we need to learn to distinguish texts generated by bots (artificial intelligence systems that generate texts in natural languages such as Russian or Chinese) from those written by people.

Given a collection of texts already generated by a specific bot, it would not be difficult to tell whether a new text was written by that bot or by a human: we simply need to feed a large number of similarly generated texts into a neural network, and the mission is accomplished. However, after this, no one would continue using that particular bot, and it would simply be replaced by another artificial intelligence. Scientists therefore need to develop a mechanism capable of distinguishing any bot from any human. To do this, we need to look at the structure of language itself, which brings us to research that explains natural languages from a mathematical point of view. Let’s take a look at the necessary steps.

The scientific field of natural language processing works, in particular, with the representation of words and sequences of words (n-grams, where n is the number of words) as vectors, that is, ordered lists of numbers. Together, these vectors form a vector space.
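As a minimal sketch of this representation, assume a tiny hand-made table of two-dimensional word vectors. The table `TOY_VECTORS` and the helper names are illustrative and not taken from the lecture, and concatenation is only one of several possible ways to build an n-gram vector.

```python
from typing import Dict, List, Tuple

# Hypothetical 2-dimensional word vectors; real systems learn vectors
# with hundreds of dimensions from large text corpora.
TOY_VECTORS: Dict[str, List[float]] = {
    "bots": [0.9, 0.1],
    "imitate": [0.4, 0.7],
    "artists": [0.2, 0.8],
}


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive words in the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vector(gram: Tuple[str, ...]) -> List[float]:
    """Concatenate the word vectors of an n-gram into one longer vector."""
    vector: List[float] = []
    for word in gram:
        vector.extend(TOY_VECTORS[word])
    return vector


tokens = ["bots", "imitate", "artists"]
bigrams = ngrams(tokens, 2)
bigram_vectors = [ngram_vector(gram) for gram in bigrams]
```

Here the three-word sentence yields two bigrams, and each bigram becomes a four-number vector, a point in a four-dimensional space.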

Working with the representation of individual words reveals that the vocabulary of bots is no different from that of an ordinary person. However, as soon as it comes to sequences of two or three words, the sequences generated by bots turn out to be significantly more predictable and much poorer in linguistic terms than those that even the most poorly educated person can create (for example, a bot is more likely to repeat patterns). The difference between the n-gram sequences of bots and those of people is statistically significant even for large bots such as ChatGPT, and this is what helps catch them.
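One crude way to put a number on ‘repeating patterns’ is the share of distinct n-grams in a text. The sketch below is a stand-in for illustration only: it is not the statistic used in the research, and the function name and sample texts are invented.

```python
from typing import List


def distinct_ngram_ratio(tokens: List[str], n: int) -> float:
    """Share of n-grams that are unique: 1.0 means no n-gram repeats."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)


# Invented sample texts, used only to exercise the function.
repetitive = "the cat sat and the cat sat and the cat sat".split()
varied = "a grey cat dozed while rain tapped on the old roof".split()

repetitive_ratio = distinct_ngram_ratio(repetitive, 2)
varied_ratio = distinct_ngram_ratio(varied, 2)
```

A lower ratio means more repeated word pairs. A real comparison would need large samples of text and a proper significance test rather than two short sentences.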

Further mathematical study of natural language allows scholars to draw conclusions about where such word vectors are located in space. There are regions of the vector space (especially when it comes to sequences of words) that only bots visit, and others that only people visit. Most regions (90–95%) are used by both, but there are separate bot-only areas, which is another way to catch bots out.
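The idea of shared and exclusive regions can be illustrated by splitting a space into grid cells and comparing which cells each group occupies. This is a toy sketch under strong assumptions: the points are invented two-dimensional stand-ins for n-gram vectors, and a fixed grid is only one hypothetical way to define a ‘region.’

```python
from typing import List, Set, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def occupied_cells(points: List[Point], cell_size: float) -> Set[Cell]:
    """Map each 2-D point to the grid cell that contains it."""
    return {(int(x // cell_size), int(y // cell_size)) for x, y in points}


# Invented 2-D points standing in for n-gram vectors.
human_points = [(0.1, 0.2), (1.4, 0.3), (2.6, 2.7)]
bot_points = [(0.3, 0.1), (1.2, 0.4), (0.2, 2.5)]

human_cells = occupied_cells(human_points, cell_size=1.0)
bot_cells = occupied_cells(bot_points, cell_size=1.0)

shared = human_cells & bot_cells      # regions visited by both
bot_only = bot_cells - human_cells    # regions only bots visit
human_only = human_cells - bot_cells  # regions only people visit
```

A text whose n-grams fall mostly into `bot_only` cells would be a candidate for machine generation under this simplified picture.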

Clustering is a mathematical operation that combines sets of similar elements into groups, or clusters. When the word sequences produced by bots are clustered, the resulting clusters turn out to be rigid, compact, and free of discrepancies. When the word sequences of people of different genders and ages, with different education and backgrounds, are clustered, the result is blurrier, less distinct clusters. Humans think significantly less clearly than bots, and this is another way to catch the bots.
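Compactness can be measured in many ways. The sketch below assumes the clustering has already been done and uses one simple, hypothetical measure, the mean distance from each point to the cluster centre, on invented two-dimensional points.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean_distance_to_centroid(points: List[Point]) -> float:
    """Average Euclidean distance from each point to the cluster centre."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)


# Invented clusters: one tight (bot-like), one spread out (human-like).
tight_cluster = [(1.0, 1.0), (1.0, 1.2), (1.2, 1.0), (1.2, 1.2)]
blurry_cluster = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]

tight_spread = mean_distance_to_centroid(tight_cluster)
blurry_spread = mean_distance_to_centroid(blurry_cluster)
```

The smaller the value, the more compact the cluster, so `tight_spread` comes out well below `blurry_spread`.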

If we represent each word or each n-gram as a vector, then their entire collection can be represented as a geometric object, a certain surface in a multidimensional space. For example, if we take all possible word sequences in Russian, we may find that they do not fill the entire semantic space, but only part of it. Scientists can study and measure this collection as a surface, and even compare it with other surfaces (for example, with the surface of the English language). Every surface in space has a dimension, i.e., the number of independent parameters necessary to describe a point on it (for points on a sphere, for example, these are two values: longitude and latitude).
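The sphere example can be made concrete: a point on a unit sphere sits in three-dimensional space, yet two numbers are enough to locate it. The function name and sample coordinates below are illustrative.

```python
import math
from typing import Tuple


def sphere_point(latitude_deg: float, longitude_deg: float) -> Tuple[float, float, float]:
    """Convert latitude and longitude on a unit sphere to x, y, z."""
    lat = math.radians(latitude_deg)
    lon = math.radians(longitude_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


x, y, z = sphere_point(55.75, 37.62)  # arbitrary sample coordinates
radius = math.sqrt(x * x + y * y + z * z)  # stays at 1 up to rounding
```

Three coordinates come out, but only two parameters go in, which is why the surface of a sphere has dimension two.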

Studying the dimension of natural language, Vasilii Gromov expected to find an infinite value, but in the end, analysts came to the conclusion that language has a dimension of about 9–10. This figure varies slightly from language to language, but one thing is certain: human language lies in a space of higher dimension than the bot's language.

Finally, the results of a recent study from 2023 showed that this surface has ‘holes’ in it, like Swiss cheese. The holes are those areas of semantic space that our language has not yet reached. Although analysts cannot yet say clearly what is hidden behind them, they can detect them. Different languages have different holes, also referred to as ‘blind spots.’ When catching bots, it is important to remember that people are drawn to the boundaries of such holes, because they use language to create new meanings and ideas. Bots, meanwhile, as trained programs, move away from these holes, which makes the task of catching them easier for now. Surprisingly, it is humour that most often appears at the boundaries of such holes.

‘Bots are simply imitators, not artists. Technology does not stand still, so we must try to solve this “bot-catching” problem and understand what a language is from a mathematical point of view,’ summarised Vasilii Gromov.

Date

5 March 2024
