Towards Representation of Sudanese Arabic Dialect in Large Language Models
A collaborative initiative to build a high-quality dataset and fine-tune language models that understand and generate Sudanese Arabic.
Sudanese Arabic is a widely spoken but underrepresented dialect in the field of Natural Language Processing (NLP). This project aims to:
- Collect and annotate Sudanese Arabic text from diverse sources.
- Create a balanced and labeled corpus.
- Fine-tune language models that support Arabic (e.g., AraBERT, CAMeLBERT, LLaMA).
- Evaluate model performance on dialect comprehension and generation.
- Fork this repository.
- Create a new branch: `git checkout -b feature/your-task`
- Make your changes and commit them: `git commit -m "Your message"`
- Push your branch and open a Pull Request.
- Use GitHub Issues or Discussions to coordinate and communicate.
Area | Description |
---|---|
🗂️ Data Collection | Gather Sudanese Arabic from social media, transcripts, and oral storytelling. |
📝 Annotation | Label text using dialect-specific guidelines. |
🔧 Script Writing | Automate preprocessing, cleaning, and formatting tasks. |
🧪 Model Fine-tuning | Fine-tune LLMs using the Sudanese corpus. |
📊 Evaluation | Test model understanding and generation of Sudanese Arabic. |
📢 Communication | Help with outreach, documentation, and community involvement. |
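As an illustration of the kind of script the Script Writing area calls for, here is a minimal cleaning sketch. The rules it applies (dropping URLs, stripping diacritics and tatweel, collapsing whitespace) are assumptions rather than the project's agreed preprocessing, and `clean_text` is a hypothetical helper name.

```python
import re

# Assumed cleaning rules; the project's actual preprocessing may differ.
URL_RE = re.compile(r"https?://\S+")
# Arabic short-vowel marks (tashkeel) plus the superscript alef.
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel is the elongation character sometimes used for emphasis.
TATWEEL = "\u0640"


def clean_text(text: str) -> str:
    """Hypothetical helper: strip URLs, diacritics, and tatweel, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = text.replace(TATWEEL, "")
    return " ".join(text.split())
```

Whether diacritics should be removed at all is a decision for the annotation guidelines, since they can carry pronunciation cues that matter for dialect work.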
Goal: Consistent and culturally accurate annotation of Sudanese Arabic text.
- Sudanese Arabic vs. Modern Standard Arabic (MSA).
- Tag terms unique to Khartoum, Darfur, East, North, South Sudan.
- Respect Sudanese usage (e.g., شنو؟ instead of ماذا؟).
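One way to picture the labels these points imply is a small record type. This is a sketch only: the field names, the `VARIETIES` and `REGIONS` label sets, and the `Annotation` class are hypothetical and are not the schema defined in the project's guidelines.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed label sets derived from the bullets above; not the official schema.
VARIETIES = {"sudanese", "msa"}
REGIONS = {"khartoum", "darfur", "east", "north", "south"}


@dataclass
class Annotation:
    text: str
    variety: str                  # "sudanese" or "msa"
    region: Optional[str] = None  # set only for region-specific terms

    def __post_init__(self) -> None:
        # Reject labels outside the assumed sets so typos surface early.
        if self.variety not in VARIETIES:
            raise ValueError(f"unknown variety: {self.variety!r}")
        if self.region is not None and self.region not in REGIONS:
            raise ValueError(f"unknown region: {self.region!r}")


# The Sudanese form and its MSA counterpart from the example above.
examples = [
    Annotation(text="شنو؟", variety="sudanese"),
    Annotation(text="ماذا؟", variety="msa"),
]
```

Validating at construction time keeps mislabeled rows out of the corpus instead of discovering them during fine-tuning.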
- Daily Conversations
- Political Speech
- Social Media Expressions
- Folk Stories / Proverbs
- Songs / Poetry
Full guidelines can be found in `docs/annotation-guidelines.md`.
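Since the project aims for a balanced corpus, a quick check of how samples are spread across the content categories listed above may be useful. The sketch below assumes each sample carries one category key; the key names in `CATEGORIES` and the `category_shares` function are hypothetical.

```python
from collections import Counter

# Assumed category keys mirroring the list above; the names are hypothetical.
CATEGORIES = [
    "daily_conversation",
    "political_speech",
    "social_media",
    "folk_story_proverb",
    "song_poetry",
]


def category_shares(labels):
    """Return each category's share of the samples (0.0 when a category is empty)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}
```

A category whose share is 0.0 or far below the others is a candidate for targeted data collection.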