Welcome to ArabicNLP.uk
Home to open-source Arabic NLP datasets, corpora, and tools — created and maintained by Dr Mo El-Haj.
Explore Resources
Open-source datasets, corpora, and tools for Arabic NLP
Your main access points for datasets, models, and long-term collaborations
Explore the core sources powering Arabic NLP research — models, datasets, repositories, and institutional collaborations.
212,500 financial news articles from Argaam.com for summarisation, event extraction, and financial NLP.
8,546 job ads with metadata for gender, salary, profession, and dialect-sensitive labour market analysis.
Multilingual constitutions from 191 nations, aligned for legal NLP and MT research.
18,256 cleaned Arabic news articles in a modern ML-ready format.
30,000+ Arabic song lyrics across 18 countries for dialect, authorship, and cultural NLP.
EASC contains 153 documents and 765 human-made summaries. A foundational resource for Arabic extractive summarisation.
Multilingual single-document and multi-document abstractive summarisation in over 10 languages.
A corpus focused on bivalency and code-switching, covering five major Arabic dialect varieties.
DARES (Dataset for Arabic Readability Estimation of School Materials): Transformer-based evaluation of the readability of school materials in Saudi Arabia, covering grades 1 to 12.
Osman is a readability metric for Arabic that functions as the closest counterpart to the Flesch measure. The repository also includes a parallel Arabic–English dataset from the UN corpus.