'Colossal misunderstanding': What is the aim of BgGPT, Bulgaria’s generative AI tool?
INSAIT's language model goes beyond being a public chatbot
It doesn't sound very "smart", it hallucinates, is easily manipulated, and even denies knowing its own creators and its Bulgarian roots. It has been described as a "blunder", a "disaster" and a product that undermines the otherwise positive image of the Institute of Computer Science, Artificial Intelligence and Technology (INSAIT). It has also drawn sharp criticism of the Bulgarian state for having poured "millions" (the exact amount is not known) into supporting its creation.
Meet BgGPT, the long-awaited first large language model built for the Bulgarian language, which officially went live on 3 March (the country's national holiday). Within hours of its release, social networks were flooded with mocking memes and comments about it.
"This is an initial-phase model, in a few weeks it will be even better, and it will clean itself of errors," INSAIT founder Martin Vechev said in defense in an interview with bTV.
Amidst the recent comments against BgGPT, IT and data science experts have noted that INSAIT's large language model goes beyond being a public chatbot and that we should not draw sweeping conclusions from interacting with it at this early stage in its development.
"The mockery of BgGPT is a sign of a colossal misunderstanding of what this model does and why its launch is actually a big deal," commented Dobroslav Dimitrov, chairman of the Bulgarian Association of Software Companies (BASCOM), to Economic.bg.
Context is important
The platform, which the general Bulgarian public got access to on 3 March, is something akin to a demo, just one variant of the model that INSAIT has created. This version is tailored to perform all kinds of tasks so that it can be tested by anyone. And this is where one of the keys to the (mis)understanding of BgGPT lies: people compare it to chatbots that already exist, but that comparison alone is not enough.
"The fact that it does not do very well in such a comprehensive field of action is completely logical," Nikola Tulechki, a linguist and data expert at Ontotext, told Economic.bg.
"This is not the final product, it's only a demo that INSAIT has released. It is not what Bulgarian businesses are expected to use."
According to Nikola Tulechki, it is very important to consider the context surrounding the creation and purpose of BgGPT. "Its performance compared to the OpenAI models is poorer, and that's absolutely expected," commented the expert, explaining the reason: "The INSAIT model is much smaller, and that's by design."
In other words, it hasn’t been designed to compete with ChatGPT in terms of comprehensive knowledge.
"Yes, it'll certainly make mistakes, hallucinate, and all that, but that's part of the nature of these models. That is why they can do the things they do: because they are not linear, they can make mistakes and learn," adds Dobroslav Dimitrov, chairman of BASCOM.
Recalling how, just a year or so ago, people made fun of ChatGPT for not knowing the recipe for tarator (a popular Bulgarian cold soup), he says the current ridicule of BgGPT's inaccuracies is a waste of energy. What is more indicative of the model's capabilities, for one, is its speed of work.
In this regard, Nikola Tulechki adds that part of INSAIT's achievement consists precisely in choosing an architecture that best fits the application's goals.
Size matters
The size of a language model is generally measured by the number of parameters it operates on. When INSAIT first test-launched BgGPT in January, it reported that the model had 7 billion of these. By comparison, OpenAI's GPT-3 has 175 billion, while GPT-4's size is undisclosed but is widely estimated to exceed a trillion. The difference is huge, but:
"The fact that a (generative) model is large is not necessarily a good thing," points out Nikola Tulechki, citing one of the main reasons: "the larger a model is, the more expensive it gets not only to train and create but also to use".
Nikola Tulechki explains that for a generative model to work efficiently, all its parameters must be loaded into the memory of the infrastructure used. Roughly speaking, a single graphics card can handle models with between 7 and 14 billion parameters. Above that limit a second card becomes necessary, which makes the process more expensive.
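To make the arithmetic concrete, here is a rough back-of-the-envelope sketch (our own illustration, not INSAIT's figures) of why parameter count translates directly into hardware cost. It assumes 16-bit weights and ignores the extra memory that inference itself consumes:

```python
# Approximate rule of thumb: GPU memory for the weights alone is
#   parameter count x bytes per parameter (plus runtime overhead).

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just for the weights, in gigabytes.

    bytes_per_param: 2 for 16-bit weights, 4 for 32-bit.
    """
    return params_billion * 1e9 * bytes_per_param / 1024**3

for name, size in [("7B (BgGPT-scale)", 7), ("14B", 14), ("175B (GPT-3-scale)", 175)]:
    print(f"{name}: ~{weights_gb(size):.0f} GB of VRAM for 16-bit weights")
```

At 16-bit precision, a 7-billion-parameter model needs roughly 13 GB just for its weights, within reach of a single consumer-grade card, while a GPT-3-scale model needs hundreds of gigabytes spread across many cards.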
"INSAIT wants to make a small model so that it can be used affordably by Bulgarian businesses," points out Nikola Tulechki.
He gives the example of GPT-4, which puts multiple graphics cards into action to answer a single query. "Behind all that there's an infrastructure that costs about a million dollars."
Along these lines, the fact that BgGPT is nowhere near the size of the OpenAI models is a very deliberate decision, tied to the idea of making the Bulgarian model easily accessible and budget-friendly.
"OpenAI models can only be run on very, very large-scale infrastructure. INSAIT, however, have made a model that can be used by anyone, on their own infrastructure and at a relatively low cost," points out Nikola Tulechki.
He adds that companies don't necessarily have to invest in buying their own graphics cards; they can rent cloud infrastructure onto which they can download and fine-tune the model.
Open to adaptation
Its small size and open-source nature allow the Bulgarian language model to be downloaded onto an organization's own infrastructure and adapted to the specific needs of a given company or institution.
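As a hedged illustration of what "download and adapt" can look like in practice, here is a minimal sketch using the open-source Hugging Face transformers library. The model identifier below is an assumption for illustration; the exact name should be taken from INSAIT's official release:

```python
# Minimal sketch: run an open-weights Bulgarian model on your own hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "INSAIT-Institute/BgGPT-7B-Instruct-v0.1"  # illustrative ID, verify against the official release

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # load in the precision the weights ship in
    device_map="auto",    # place the ~13 GB of 16-bit weights on the available GPU
)

# Ask a question in Bulgarian and generate an answer locally.
prompt = "Какви данъци дължа при покупка на жилище в България?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```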
Here are a few examples of possible integrations:
- the model can be used to develop an interface between an institution and citizens, replacing the person at the counter. "For example, when you want to pay your local tax, the tool can explain very clearly why, how much and where you need to pay, saving you tedious digging through the pages of the municipality," explains Dobroslav Dimitrov.
- it can also hook into an internal data source and serve as an interface that helps those working in an institution interact more productively with the data they have collected (a rough sketch of this pattern follows the list). "Imagine, for example, a very, very well-performing search engine over massive internal data sets that not only finds text documents but also returns meaningful, analytical answers based on the information contained within them," says Nikola Tulechki.
- companies dealing with legislative matters can build their own models, which will be extremely useful to the lawyers working in them. These could help prepare contracts or compare legislation across the different countries where they operate, among other things.
- in the field of education, a model can be trained on data from a particular publisher's textbooks, for example, and then used to give additional explanations of the material.
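The "search engine that answers" idea above is commonly built as retrieval-augmented generation: relevant documents are found first, then handed to the model as context. The sketch below is our own minimal illustration of the pattern, with a toy word-overlap retriever and a stubbed-out model call; a real deployment would use a proper search index and the locally hosted model:

```python
# Toy retrieval-augmented generation: retrieve, then ask the model to answer
# using only the retrieved context.

DOCUMENTS = {
    "tax_2023.txt": "Local property tax is due by 30 June and can be paid online.",
    "permits.txt": "Building permits are issued by the municipal architecture office.",
}

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        DOCUMENTS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def ask_model(prompt: str) -> str:
    """Stub: in practice, this calls the locally hosted language model."""
    return f"(model answer based on prompt: {prompt!r})"

question = "When is the local property tax due?"
context = "\n".join(retrieve(question))
answer = ask_model(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```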
When asked whether Bulgarian institutions and businesses are ready to recognize the need to invest in such models, Dobroslav Dimitrov says that IT companies in particular are already doing so. As for the public administration, there are some proactive municipalities, like Burgas, which, according to him, are also open to riding the wave. There are already experts available who understand how to train the model for a particular business's needs, he says.
"It's a tool, a very, very powerful tool, that can be used by a pretty wide range of IT professionals to improve the lives of people like you and me. That's the big news. And that difficult task has been done by INSAIT," commented Dobroslav Dimitrov.
Science and the Bulgarian language
Part of this hard task was training the language model specifically on Bulgarian data. Many have tried to dismiss the project by pointing out that models like ChatGPT already perform very well in Bulgarian. However, according to Nikola Tulechki and Dobroslav Dimitrov, the fact that BgGPT was trained on Bulgarian from the start gives it advantages that will be useful in its subsequent applications.
"ChatGPT, for instance, was trained primarily on English-language data and primarily in the context of US law. This means that it is much more useful for American cases and problems than for Bulgarian ones. After all, to put it simply, the data it was trained on is not ours," explains Dobroslav Dimitrov.
Nikola Tulechki once again draws attention to the fact that factuality is not the model's strong point; rather, it was taught "to understand the Bulgarian language in its nuances".
"Going forward, further work is required on the [Bulgarian] open model so that it will learn to solve specific tasks."
"The need to have our own [Bulgarian language] model is absolutely mandatory, because if we lack our own sovereign science in this field, we won't have much of a chance and will be left to work only with whatever big tech happens to come up with," adds Dobroslav Dimitrov.