This repository contains the code and documentation for the GEC project.
This README is split into five parts.
Overall, we load these modules from Compute Canada:
module load python/3.10 scipy-stack gcc arrow cuda cudnn rust
If your environment is not set up, you can run the following commands to set it up:
virtualenv --no-download ENV
source ENV/bin/activate
pip install --no-index --upgrade pip
pip install -r requirements.txt
Note: This might take a while to run.
Gector Implementation for Arabic Language
- Install NVIDIA-Apex (for using AMP with DeepSpeed):
  git clone https://github.com/NVIDIA/apex
  cd apex
  pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
- Modify the scripts in scripts/ to fit your system, as a lot of them have my configs in them.
- Generate edits from parallel sentences:
sbatch scripts/prepare_data.sh
- Train the model:
sbatch scripts/train.sh
- Run inference:
sbatch scripts/predict.sh
In the inference script you might have to modify which epoch's checkpoint you want to use for the model.
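If you are unsure which file corresponds to which epoch, a minimal sketch like the following can help locate it before you edit predict.sh (the output directory and file-name pattern are assumptions, not necessarily what train.py writes):

# Illustrative only: find the checkpoint for a given epoch so you can point
# scripts/predict.sh at it. Directory and naming pattern are assumed.
from pathlib import Path

epoch = 5                                   # epoch you want to evaluate
ckpt_dir = Path("outputs/checkpoints")      # assumed checkpoint directory
matches = sorted(ckpt_dir.glob(f"*epoch*{epoch}*"))
if not matches:
    raise FileNotFoundError(f"no checkpoint for epoch {epoch} in {ckpt_dir}")
print("Use this checkpoint in predict.sh:", matches[0])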
First, download all the models, even if you are running this on Cedar, as these models are often unavailable there.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
model.save_pretrained("llama-7b")
tokenizer.save_pretrained("llama-7b")
1. Convert Meta's released weights into Hugging Face format. Follow this guide:
https://huggingface.co/docs/transformers/main/model_doc/llama
2. Make sure you have cloned the released weight diff onto your local machine. The weight diff is located at:
https://huggingface.co/tatsu-lab/alpaca-7b/tree/main
3. Run this function with the correct paths. E.g.,
python weight_diff.py recover --path_raw <path_to_step_1_dir> --path_diff <path_to_step_2_dir> --path_tuned <path_to_store_recovered_weights>
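Once recovered, the Alpaca weights can be saved locally the same way as the other models (a sketch; the placeholder is the directory passed as --path_tuned above, and "alpaca-7b" is just an arbitrary local name):

from transformers import AutoTokenizer, AutoModelForCausalLM

# <path_to_store_recovered_weights> is the --path_tuned directory from the step above.
tokenizer = AutoTokenizer.from_pretrained("<path_to_store_recovered_weights>")
model = AutoModelForCausalLM.from_pretrained("<path_to_store_recovered_weights>")
model.save_pretrained("alpaca-7b")
tokenizer.save_pretrained("alpaca-7b")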
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.3")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.3")
model.save_pretrained("vicuna-13b-v1.3")
tokenizer.save_pretrained("vicuna-13b-v1.3")
For Bactrian-X, the LoRA adapters are:
MBZUAI/bactrian-x-bloom-7b1-lora
MBZUAI/bactrian-x-llama-7b-lora
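These are LoRA adapters rather than full checkpoints, so they need to be attached to their base model. A minimal sketch with the peft library (assumed usage, not the repo's exact code):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Base model for the llama adapter; use bigscience/bloom-7b1 for the bloom adapter.
base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# Attach the LoRA weights on top of the base model.
model = PeftModel.from_pretrained(base, "MBZUAI/bactrian-x-llama-7b-lora")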
To do training without PEFT, use this script:
sbatch run_jasmine.sh
To do training with PEFT, use this script:
sbatch run_peft_jasmine.sh
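For reference, "training with PEFT" here means wrapping the base model in a LoRA adapter so only a small set of parameters is updated. A rough sketch (the hyperparameters and target modules are assumptions, not necessarily those in train_peft_jasmine.py):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("llama-7b")   # local copy from above

# Assumed LoRA settings, for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the adapter weights are trainable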
To run inference on the models, use this script:
python infer_jasmine.py
Note: You might have to change the model names to run these. I didn't use an sbatch script to run these, as they are pretty slow.
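At its core, inference just loads the fine-tuned checkpoint and generates a correction. A minimal sketch (the model path and prompt format are assumptions, not what infer_jasmine.py does verbatim):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "path/to/finetuned-model"            # change to the model you want to test
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

prompt = "Correct the following sentence: ..."   # prompt format is an assumption
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))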
To run inference on the models with PEFT, use this script:
python generate_alpaca.py
To run the T5s, you can use the following scripts. Note: this has been modified to generate test, dev, and 2015 results all in one go.
sbatch run_t5.sh <model_name>
If you want to run this on synthetic data, use this script; it does two-stage training:
sbatch run_t5_synth.sh <model_name>
To just generate results without doing any training, use:
sbatch run_t5_test.sh <model_name>
To run the models as different scripts, just modify the following and run them:
./sbatch_run_t5.sh
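For context, the T5 setup treats GEC as sequence-to-sequence generation: the erroneous sentence goes in and the corrected sentence comes out. A minimal sketch (the model name is only a placeholder for whatever <model_name> you pass, and the input format is an assumption):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-base"               # placeholder for <model_name>
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

source = "..."                                # an erroneous sentence
inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))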
To run the corruptor, use the following script. Note: you need to add the dataset path to the script before running it:
./sbatch_run_corruptor.sh
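For intuition, the corruptor turns clean sentences into synthetic source-target pairs by injecting noise. A toy sketch of the idea (not the actual run_corruptor.py logic):

import random

def corrupt(sentence: str, p: float = 0.15) -> str:
    """Randomly drop or duplicate words with probability p to simulate errors."""
    noisy = []
    for word in sentence.split():
        r = random.random()
        if r < p / 2:
            continue                  # drop the word
        noisy.append(word)
        if r > 1 - p / 2:
            noisy.append(word)        # duplicate the word
    return " ".join(noisy)

clean = "This sentence is completely correct ."
print(corrupt(clean), "->", clean)    # a synthetic (source, target) pair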
The directory structure of the repository is as follows:

.
├── LLMs
│   ├── bloom_train.sh
│   ├── generate_alpaca.py
│   ├── infer_jasmine.py
│   ├── run_jasmine.sh
│   ├── run_peft_jasmine.sh
│   ├── train_jasmine.py
│   └── train_peft_jasmine.py
├── T5s
│   ├── run_summarization.py
│   ├── run_t5.sh
│   ├── run_t5_synth.sh
│   ├── run_t5_test.sh
│   └── sbatch_run_t5.sh
├── corrupt
│   ├── run_corruptor.py
│   ├── run_corruptor.sh
│   └── sbatch_run_corruptor.sh
├── gector
│   ├── configs
│   │   ├── ds_config_basic.json
│   │   └── ds_config_zero1.json
│   ├── data
│   │   ├── verb-form-vocab.txt
│   │   └── vocabulary
│   │       ├── d_tags.txt
│   │       ├── labels.txt
│   │       ├── labels_ar.txt
│   │       ├── labels_zh.txt
│   │       └── non_padded_namespaces.txt
│   ├── predict.py
│   ├── scripts
│   │   ├── predict.sh
│   │   ├── prepare_data.sh
│   │   └── train.sh
│   ├── src
│   │   ├── __pycache__
│   │   │   ├── dataset.cpython-310.pyc
│   │   │   ├── model.cpython-310.pyc
│   │   │   ├── predictor.cpython-310.pyc
│   │   │   └── trainer.cpython-310.pyc
│   │   ├── dataset.py
│   │   ├── model.py
│   │   ├── predictor.py
│   │   └── trainer.py
│   ├── train.py
│   └── utils
│       ├── __pycache__
│       │   ├── common_utils.cpython-310.pyc
│       │   ├── helpers.cpython-310.pyc
│       │   └── mismatched_utils.cpython-310.pyc
│       ├── change.txt
│       ├── common_utils.py
│       ├── csv_to_edits.py
│       ├── gen_labels.py
│       ├── generate_labels.py
│       ├── helpers.py
│       ├── labels.sh
│       ├── mismatched_utils.py
│       ├── orgtext.txt
│       ├── preprocess_data.py
│       ├── segment.py
│       └── tokenization.py
├── README.md
└── reqirements.txt