Sweden’s National Library trains AI models on Swedish text to preserve 500 years of history, linguistics and humanities literature


The National Library of Sweden is harnessing AI technology developed by NVIDIA to preserve almost half a millennium of literature in digital form.

The library is renowned for archiving ancient and modern Swedish literature and is now working on converting millions of documents into accessible digital assets. The project will benefit researchers in humanities subjects, linguistics, history and media studies, and it also plays a principal role in the preservation and showcasing of mediaeval manuscripts.

Swedish law requires that a copy of everything officially published in Swedish be submitted to the National Library of Sweden (also known as Kungliga Biblioteket) for public archival record. This includes state documentation, journals, books, plays, internet content, menus, all TV/film/radio media and even video games. This enormous body of data, 26 petabytes in total, has provided a plethora of information for NVIDIA DGX systems and everything needed for a comprehensive Swedish-language training program for AI models.

Researchers are currently developing more than 24 open-source transformer models to enable research at the library building in Humlegården, Stockholm, and at other academic institutions around the country.

In 2019 the Kungliga Biblioteket, also known as KB, established a department called KBLab. Researchers began experimenting on just 5 GB of Swedish-language text and drew inspiration from early language processing models created by Google. Soon after, the lab began testing AI training methods on an international data set of Dutch, German and Norwegian text. This work continues, with the aim of training larger models for international language research and content translation.

As results grew more positive, researchers at KBLab began to focus more on their own body of Swedish-language data and upgraded their systems from NVIDIA GPUs to NVIDIA DGX systems.

The current DGX models are effective in helping researchers create specialized data sets to understand the specific context and nature of every piece of Swedish-language content. From postcards to blog posts, videos and social media, this technology will also enable language analysts to review how written and spoken Swedish has evolved over time, its societal influences and its distinction from other European languages.

In addition to the transformer models, KBLab are working on an AI sound-transcription tool to create a written record of existing digital media.

Partnering with the University of Gothenburg, KBLab have also announced an upcoming project to support the Swedish Academy’s work to modernize data-driven techniques for creating Swedish-language dictionaries. 

