Using technology recycling to develop a named entity recogniser for Afrikaans

Peer-Reviewed Research
  • SDG 16
  • Abstract:

    Named entity recognition is a core technology, widely used in applications such as question-and-answer systems, automated text summarisation systems, information retrieval, machine translation, bioinformatics and search engines. Owing to the resources needed, the labour-intensive methods required and the difficulty of developing a named entity recogniser (NER), the focus has mainly been on developing NERs for resource-rich languages such as Dutch, Swedish and Spanish, while related resource-scarce languages such as Afrikaans, Norwegian and Portuguese have been neglected. The lack of NERs limits access to information and also slows the general development of resources for such languages. This article presents a time- and cost-efficient method based on the principle of technology recycling (or technology transfer) that can be used to develop NERs for suitable resource-scarce languages; its application to Afrikaans is used here as a case study. Six experiments are described, which differ only in the pre-processing methods (A2DC and gazetteers) and the language of the input data used. The final experiment yielded an F-score of 0.72 for the identification and an F-score of 0.65 for the classification of named entities. This study provides evidence for the usefulness of technology recycling in developing an NER for Afrikaans and potentially for other languages that have so far been neglected.
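
    The two reported figures separate finding an entity from labelling it. The abstract does not say how either was scored, so the following is only a minimal sketch under stated assumptions: the balanced F-score (harmonic mean of precision and recall), exact span matching for identification, and span-plus-type matching for classification. The entity spans, type labels and function names are hypothetical and are not taken from the article.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def score(gold: set, predicted: set, use_type: bool) -> float:
    """Score predictions against gold entities given as (start, end, type).

    With use_type=False only the span counts (identification);
    with use_type=True the type label must match too (classification).
    """
    def key(entity):
        return entity if use_type else entity[:2]

    gold_keys = {key(e) for e in gold}
    pred_keys = {key(e) for e in predicted}
    true_pos = len(gold_keys & pred_keys)
    return f_score(true_pos,
                   len(pred_keys - gold_keys),
                   len(gold_keys - pred_keys))


# Hypothetical toy data: token spans with entity types (assumed labels).
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
predicted = {(0, 2, "PER"), (5, 6, "ORG"), (12, 13, "LOC")}

identification = score(gold, predicted, use_type=False)
classification = score(gold, predicted, use_type=True)
print(f"identification F-score: {identification:.2f}")
print(f"classification F-score: {classification:.2f}")
```

    Under these assumptions the classification score can never exceed the identification score, because an entity must first be found before its type can be judged correct. That is consistent with the ordering of the two figures reported above.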