ArabicNLP

Key NLP Resources & Collaborations

Your main access points for datasets, models, and long-term collaborations

Explore the core sources powering Arabic NLP research: models, datasets, repositories, and institutional collaborations.

Featured Arabic NLP Resources

Tarab

A large-scale Arabic creative-text corpus that unifies song lyrics and poetry. It contains 2,557,311 verses and 13,509,336 tokens and spans Classical Arabic, Modern Standard Arabic (MSA), and six major regional dialect groups.

AraFinNews

212,500 financial news articles from Argaam.com for summarisation, event extraction, and financial NLP.

ArabJobs

8,546 job ads with metadata for gender, salary, profession, and dialect-sensitive labour market analysis.

MCWC

Multilingual constitutions from 191 nations, aligned for legal NLP and MT research.

Kalimat

18,256 cleaned Arabic news articles in a modern ML-ready format.

Habibi

30,000+ Arabic song lyrics across 18 countries for dialect, authorship, and cultural NLP.

EASC

153 documents with 765 human-made summaries; a foundational resource for Arabic extractive summarisation.

MultiLing

Multilingual single-document and multi-document abstractive summarisation in over 10 languages.

Arabic Dialects

A corpus focused on bivalency and code-switching, covering five major Arabic dialect varieties.

DARES

DARES (Dataset for Arabic Readability Estimation of School Materials) supports transformer-based evaluation of the readability of school materials in Saudi Arabia, from grades 1 to 12.

OSMAN

OSMAN is a readability metric for Arabic that serves as the closest counterpart to the Flesch measure. The repository also includes a parallel Arabic–English dataset from the UN corpus.
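For readers unfamiliar with the Flesch measure that OSMAN is compared with, here is a minimal sketch of the standard English Flesch Reading Ease score. It is not OSMAN's formula, which this page does not give. The function name `flesch_reading_ease` and its count-based parameters are illustrative assumptions, not part of the OSMAN repository.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Classic Flesch Reading Ease for English text, computed from raw counts.

    Hypothetical helper for illustration only; this is NOT the OSMAN metric.
    Higher scores indicate easier text.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")

    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words

    # Standard Flesch coefficients: longer sentences and longer words lower the score.
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    # Example counts: 100 words, 5 sentences, 150 syllables.
    print(flesch_reading_ease(100, 5, 150))
```

The sketch takes pre-computed counts rather than raw text, because syllable counting is language-specific and is the part an Arabic-oriented metric has to handle differently.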

Welcome to ArabicNLP.uk