Welcome to ArabicNLP.uk
Home to open-source Arabic NLP datasets, corpora, and tools — created and maintained by Dr Mo El-Haj.
Explore Resources
Open-source datasets, corpora, and tools for Arabic NLP
Your main access points for datasets, models, and long-term collaborations
Explore the core sources powering Arabic NLP research — models, datasets, repositories, and institutional collaborations.
A large-scale Arabic creative-text corpus that unifies song lyrics and poetry, containing 2,557,311 verses and 13,509,336 tokens and spanning Classical Arabic, MSA, and six major regional dialect groups.
212,500 financial news articles from Argaam.com for summarisation, event extraction, and financial NLP.
8,546 job ads with metadata for gender, salary, profession, and dialect-sensitive labour market analysis.
Multilingual constitutions from 191 nations, aligned for legal NLP and MT research.
18,256 cleaned Arabic news articles in a modern ML-ready format.
30,000+ Arabic song lyrics across 18 countries for dialect, authorship, and cultural NLP.
EASC contains 153 documents and 765 human-made summaries. A foundational resource for Arabic extractive summarisation.
Multilingual single-document and multi-document abstractive summarisation in over 10 languages.
A bivalency and code-switching focused corpus covering five major Arabic dialect varieties.
DARES: Dataset for Arabic Readability Estimation of School Materials. Transformer-based evaluation of the readability of school materials in Saudi Arabia, covering grades 1 to 12.
Osman is a readability metric for Arabic that functions as the closest counterpart to the Flesch measure. The repository also includes a parallel Arabic–English dataset from the UN corpus.