UCT builds AI language model to support all 11 official South African languages
Posted by Editorial on 06/05/2026 in TECH NEWS
Research introduces a new dataset and base model designed to expand AI capabilities in underrepresented languages and enable local development of language tools.

The UCT researchers behind MzansiLM. From left: Simbarashe Mawere, Anri Lombard, Dr Jan Buys, and Dr Francois Meyer.
Researchers at the University of Cape Town (UCT) have developed a language model trained specifically on South Africa’s 11 official written languages, addressing a longstanding limitation in how artificial intelligence systems handle local linguistic diversity.
The project combines two components: MzansiText, a curated multilingual dataset, and MzansiLM, a language model trained from the ground up using that data. The work, led by Anri Lombard and Dr Jan Buys from UCT’s Department of Computer Science alongside Dr Francois Meyer and collaborators, will be presented at the Language Resources and Evaluation Conference in Spain.
The research responds to a structural issue in language technology. Most AI systems perform unevenly across languages because training data is not equally available. This gap is particularly visible in South Africa, where several official languages fall into what researchers classify as “low-resource” due to limited textual data.
“In language modelling, languages are considered low-resource primarily because there are far fewer and smaller textual datasets available in these languages for training language models,” said Dr Buys. “Our dataset, MzansiText, is still small compared to data available for high-resource languages such as English and major European and Asian languages, but larger than previous datasets for South African languages.”
MzansiLM is designed to cover all 11 official written languages in a single model, including those that have received little attention in previous research. According to the team, it is the first publicly available decoder-only language model to explicitly target all 11.
“There has been real progress in language modelling for African languages, including some South African ones like isiXhosa and isiZulu,” said Dr Meyer. “But most existing models only cover a subset of languages. With MzansiLM, we wanted to build a single model focused specifically on South Africa that covers all 11 official written languages, including those that are often left out.”
The model has 125 million parameters, well below the scale of commercial systems, yet it showed competitive performance on benchmarks for South African languages, in some cases outperforming larger open-source models on tasks including text generation in isiXhosa.
The researchers emphasise that MzansiLM is not intended as a direct user-facing assistant. It serves as a base model that can be adapted for specific applications through fine-tuning, allowing developers to build tools such as summarisation systems or data annotation solutions tailored to local languages.
“In practice, that means developers could build tools for specific use cases, for example summarising information or annotating raw data, in South African languages,” Meyer said. “Adapting MzansiLM for a limited use case might be more effective and affordable than relying on proprietary large language models, if you want users to be able to interact with a system in their home language.”
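The researchers do not describe a specific fine-tuning recipe, but a typical workflow for adapting a small decoder-only base model to a task like summarisation might look like the sketch below, written with the Hugging Face transformers library. The model identifier, the prompt format, and the toy isiXhosa example are illustrative assumptions, not details from the paper.

```python
# Minimal fine-tuning sketch for a small decoder-only base model.
# MODEL_ID is hypothetical; the article does not name the actual repository.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

MODEL_ID = "uct-nlp/mzansilm-125m"  # placeholder identifier (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    # Decoder-only tokenizers often ship without a padding token.
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Toy task data: text-plus-summary pairs in a prompt format of our choosing.
# A real project would use a task-specific corpus in the target language.
examples = [
    {"text": "Umbhalo: <long isiXhosa passage> Isishwankathelo: <short summary>"},
]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_list(examples).map(
    tokenize, batched=True, remove_columns=["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mzansilm-summariser", num_train_epochs=3),
    train_dataset=dataset,
    # mlm=False gives the standard causal LM objective: labels = input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because the base model is small, a fine-tuning run of this kind can be done on modest hardware, which is part of the affordability argument Meyer makes above.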
The current version reflects the limitations of available data. “Our findings show that the model can work well when fine-tuned for specific tasks but is not yet able to work well for general-purpose user interaction or instruction following, due to the limited training data,” Buys explained. “This helps to explain why even larger language models don’t yet work as well when used in languages other than English.”
The team has released both the dataset and the model publicly, positioning the work as a foundation for further development. According to the researchers, expanding AI capabilities in South African languages will depend on continued collaboration and the availability of shared data and tools.
“A lot of the progress we were able to make depends on earlier open research from the African Natural Language Processing research community, so continuing that openness is essential,” Lombard said. “We still need better and broader data sources, stronger benchmarks, and the kind of shared datasets, models, code, and results that make it possible for others to reproduce and extend the work.”