---
license: apache-2.0
datasets:
  - PetraAI/PetraAI
language:
  - ar
  - en
  - ch
  - zh
metrics:
  - accuracy
  - bertscore
  - bleu
  - chrf
  - code_eval
  - brier_score
tags:
  - chemistry
  - biology
  - finance
  - legal
  - music
  - code
  - art
  - climate
  - medical
  - text-generation-inference
---

Inference Speed

The result is generated using this script: the input batch size is 1, the decode strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better).

PETRA

Overview

PETRA is a multilingual dataset for training and evaluating AI systems on a diverse range of tasks across multiple modalities. It contains data in Arabic and English for tasks including translation, summarization, question answering, and more.

Dataset Structure

  • Data is separated by language into /ar and /en directories (a loading sketch follows the task list below)
  • Within each language directory, data is separated by task into subdirectories
  • Tasks include:
    • Translation
    • Summarization
    • Conversational
    • Feature extraction
    • Zero-shot classification
    • Text generation
    • Fill mask
    • Sentence similarity
    • Text-to-speech
    • Automatic speech recognition
    • Text classification
    • Token classification
    • Table question answering
    • Question answering
    • Text2text generation
    • Audio-to-audio
    • Audio classification
    • Voice activity detection
    • Depth estimation
    • Image classification
    • Object detection
    • Image segmentation
    • Text-to-image
    • Image-to-text
    • Image-to-image
    • Unconditional image generation
    • Reinforcement learning
    • Video classification
    • Robotics
    • Tabular classification
    • Tabular regression
    • Table-to-text
    • Multiple choice
    • Text retrieval
    • Tabular-to-text
    • Text-to-video
    • Time series forecasting
    • Visual question answering
    • Zero-shot image classification
    • Graph ML
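
With this layout, the dataset can be loaded straight from the Hub with 🤗 datasets. A minimal sketch, assuming the /ar and /en directories are exposed as data_dir subfolders; the exact configuration names are not confirmed by this card:

from datasets import load_dataset

# load the full dataset from the Hub (repo id taken from the citation URL below)
ds = load_dataset("PetraAI/PetraAI")

# assumption: restrict to a single language directory such as /ar
ar_ds = load_dataset("PetraAI/PetraAI", data_dir="ar")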

Dataset Tags

  • code
  • art
  • chemistry
  • biology
  • finance
  • legal
  • music
  • climate
  • medical

Dataset Size

1M < n < 10M samples

Licenses

Apache 2.0

Citation

If you use this dataset, please cite it as:

[cite paper, arXiv, etc]

@article{PetraAI2022PetraAI,
  title={PetraAI: A Massive Multilingual Dataset for Machine Learning},
  author={First Last and First Last},
  journal={arXiv},
  year={2022},
  url={https://huggingface.co/datasets/PetraAI/PetraAI}
}

Contact

For any questions, please reach out to shadilytn@gmail.com

Dataset Cards

What are Dataset Cards?

Each dataset may be documented by the README.md file in the repository. This file is called a dataset card, and the Hugging Face Hub will render its contents on the dataset’s main page. To inform users about how to responsibly use the data, it’s a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub. Tags are defined in a YAML metadata section at the top of the README.md file.

Dataset card metadata

A dataset repo will render its README.md as a dataset card. To control how the Hub displays the card, you should create a YAML section in the README file to define some metadata. Start by adding three --- at the top, then include all of the relevant metadata, and close the section with another group of --- like the example below:
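
---
language:
  - ar
  - en
license: apache-2.0
tags:
  - code
  - medical
---

The values here mirror this dataset's own front matter; any field from the metadata specification can be included.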

The metadata that you add to the dataset card enables certain interactions on the Hub. For example:

  • Allow users to filter and discover datasets at https://huggingface.co/datasets.

  • If you choose a license using the keywords listed in the right column of this table, the license will be displayed on the dataset page.

When creating a README.md file in a dataset repository on the Hub, use the Metadata UI to fill in the main metadata.

For the full list of metadata fields, see the detailed dataset card metadata specification here.

Dataset card creation guide

For a step-by-step guide on creating a dataset card, check out the Create a dataset card guide.

Reading through existing dataset cards, such as the ELI5 dataset card, is a great way to familiarize yourself with the common conventions.

Linking a Paper

If the dataset card includes a link to a paper on arXiv, the Hub will extract the arXiv ID and include it in the dataset tags with the format arxiv:<PAPER ID>. Clicking on the tag will let you:

  • Visit the Paper page

  • Filter for other models on the Hub that cite the same paper.

Read more about paper pages here.

https://huggingface.co/docs/hub/paper-pages

What is StarCoder?

StarCoder is a language model (LM) trained on source code and natural language text. Its training data incorporates more than 80 different programming languages as well as text extracted from GitHub issues, commits, and notebooks. This repository showcases how to get an overview of this LM's capabilities.

News

  • May 9, 2023: We've fine-tuned StarCoder to act as a helpful coding assistant 💬! Check out the chat/ directory for the training code and play with the model here.

Disclaimer

Before you can use the model, go to hf.co/bigcode/starcoder and accept the agreement. Then make sure you are logged into the Hugging Face Hub with:

huggingface-cli login

Table of Contents

  1. Quickstart
  2. Fine-tuning
  3. Evaluation
  4. Inference hardware requirements

Quickstart

StarCoder was trained on GitHub code, so it can be used to perform code generation. More precisely, the model can complete the implementation of a function or infer the following characters in a line of code. This can be done with the help of the 🤗 transformers library.

Installation

First, we have to install all the libraries listed in requirements.txt:

pip install -r requirements.txt

Code generation

The code generation pipeline is as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# to save memory consider using fp16 or bf16 by specifying torch_dtype=torch.float16 for example
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
# clean_up_tokenization_spaces=False prevents a tokenizer edge case which can result in spaces being removed around punctuation
print(tokenizer.decode(outputs[0], clean_up_tokenization_spaces=False))

or

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
checkpoint = "bigcode/starcoder"

model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
print(pipe("def hello():"))
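
By default, generate() in the first snippet returns only a short continuation; generation parameters control the length and sampling behavior. A quick sketch with illustrative values that are not prescribed by this README:

outputs = model.generate(
    inputs,
    max_new_tokens=64,                    # length of the completion
    do_sample=True,                       # sample instead of greedy decoding
    temperature=0.2,                      # low temperature tends to suit code
    pad_token_id=tokenizer.eos_token_id,  # avoid the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], clean_up_tokenization_spaces=False))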

For hardware requirements, check the section Inference hardware requirements.

Text-generation-inference

docker run -p 8080:80 -v $PWD/data:/data -e HUGGING_FACE_HUB_TOKEN=<YOUR BIGCODE ENABLED TOKEN> -d  ghcr.io/huggingface/text-generation-inference:latest --model-id bigcode/starcoder --max-total-tokens 8192
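
Once the container is running, completions can be requested over HTTP. A minimal sketch against text-generation-inference's /generate endpoint (the prompt and parameters are illustrative):

curl http://127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "def print_hello_world():", "parameters": {"max_new_tokens": 32}}'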

For more details, see here.

Fine-tuning

Here, we showcase how we can fine-tune this LM on a specific downstream task.

Step by step installation with conda

Create a new conda environment and activate it

conda create -n env
conda activate env

Install the PyTorch version compatible with your version of CUDA (see here); for example, the following command works with CUDA 11.6:

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

Install transformers and peft

conda install -c huggingface transformers 
pip install git+https://github.com/huggingface/peft.git

Note that you can install the latest development version of transformers (from the main branch) by using

pip install git+https://github.com/huggingface/transformers

Install datasets, accelerate and huggingface_hub

conda install -c huggingface -c conda-forge datasets
conda install -c conda-forge accelerate
conda install -c conda-forge huggingface_hub

Finally, install bitsandbytes and wandb

pip install bitsandbytes
pip install wandb

To get the full list of arguments with descriptions you can run the following command on any script:

python scripts/some_script.py --help

Before you run any of the scripts make sure you are logged in and can push to the hub:

huggingface-cli login

Make sure you are logged in wandb:

wandb login

Now that everything is done, you can clone the repository and get into the corresponding directory.

Datasets

StarCoder can be fine-tuned to achieve multiple downstream tasks. Our interest here is to fine-tune StarCoder in order to make it follow instructions. Instruction fine-tuning has gained a lot of attention recently, as it proposes a simple framework that teaches language models to align their outputs with human needs. That procedure requires quality instruction datasets, which contain multiple instruction-answer pairs. Unfortunately, such datasets are not ubiquitous, but thanks to Hugging Face 🤗's datasets library we have access to some good proxies. To fine-tune cheaply and efficiently, we use Hugging Face 🤗's PEFT as well as Tim Dettmers' bitsandbytes.
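
For intuition, PEFT wraps the frozen base model with small trainable low-rank (LoRA) adapters, so only a tiny fraction of the parameters is updated. A minimal sketch; the hyperparameters are illustrative and not the values used by finetune/finetune.py:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
lora_config = LoraConfig(
    r=16,                  # rank of the adapter matrices
    lora_alpha=32,         # scaling factor applied to the adapter output
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the trainable fraction of weights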

Stack Exchange (SE)

Stack Exchange is a well-known network of Q&A websites covering diverse fields. It is a place where a user can ask a question and obtain answers from other users. Those answers are scored and ranked based on their quality. Stack exchange instruction is a dataset that was obtained by scraping the site in order to build a collection of Q&A pairs. A language model can then be fine-tuned on that dataset so that it elicits strong and diverse question-answering skills.
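
To inspect the data before training, the dataset can be streamed from the Hub. A quick sketch; mapping the script's --subset to data_dir is an assumption, so check the dataset page:

from datasets import load_dataset

ds = load_dataset(
    "ArmelR/stack-exchange-instruction",
    data_dir="data/finetune",  # assumption: mirrors the --subset argument below
    split="train",
    streaming=True,            # iterate without downloading everything
)
print(next(iter(ds)))          # one question/response record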

To execute the fine-tuning script run the following command:

python finetune/finetune.py \
  --model_path="bigcode/starcoder"\
  --dataset_name="ArmelR/stack-exchange-instruction"\
  --subset="data/finetune"\
  --split="train"\
  --size_valid_set 10000\
  --streaming\
  --seq_length 2048\
  --max_steps 1000\
  --batch_size 1\
  --input_column_name="question"\
  --output_column_name="response"\
  --gradient_accumulation_steps 16\
  --learning_rate 1e-4\
  --lr_scheduler_type="cosine"\
  --num_warmup_steps 100\
  --weight_decay 0.05\
  --output_dir="./checkpoints"

The size of the SE dataset is more manageable when using streaming. We also have to specify the split of the dataset to use. For more details, check the dataset's page on 🤗. Similarly, we can modify the command to account for the availability of GPUs:

python -m torch.distributed.launch \
  --nproc_per_node number_of_gpus finetune/finetune.py \
  --model_path="bigcode/starcoder"\
  --dataset_name="ArmelR/stack-exchange-instruction"\
  --subset="data/finetune"\
  --split="train"\
  --size_valid_set 10000\
  --streaming \
  --seq_length 2048\
  --max_steps 1000\
  --batch_size 1\
  --input_column_name="question"\
  --output_column_name="response"\
  --gradient_accumulation_steps 16\
  --learning_rate 1e-4\
  --lr_scheduler_type="cosine"\
  --num_warmup_steps 100\
  --weight_decay 0.05\
  --output_dir="./checkpoints"

Merging PEFT adapter layers

If you train a model with PEFT, you'll need to merge the adapter layers with the base model if you want to run inference / evaluation. To do so, run:

python finetune/merge_peft_adapters.py --base_model_name_or_path model_to_merge --peft_model_path model_checkpoint

# Push merged model to the Hub
python finetune/merge_peft_adapters.py --base_model_name_or_path model_to_merge --peft_model_path model_checkpoint --push_to_hub

For example

python finetune/merge_peft_adapters.py --base_model_name_or_path bigcode/starcoder --peft_model_path checkpoints/checkpoint-1000 --push_to_hub
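
Under the hood, merging folds the low-rank adapter weights back into the base model's weight matrices. A rough sketch of the equivalent peft calls (the paths are illustrative):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
model = PeftModel.from_pretrained(base, "checkpoints/checkpoint-1000")  # adapter checkpoint
merged = model.merge_and_unload()  # adds the LoRA deltas into the base weights
merged.save_pretrained("starcoder-merged")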

Evaluation

To evaluate StarCoder and its derivatives, you can use the BigCode-Evaluation-Harness for evaluating Code LLMs.
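
For illustration, a typical harness run looks roughly like the following, executed from a clone of the harness (flags can differ between versions, so check its README):

accelerate launch main.py \
    --model bigcode/starcoder \
    --tasks humaneval \
    --allow_code_execution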

Inference hardware requirements

In FP32 the model requires more than 60GB of RAM; you can load it in FP16 or BF16 with ~30GB, or in 8-bit with under 20GB of RAM:

# make sure you have accelerate and bitsandbytes installed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
# for fp16, replace `load_in_8bit=True` with `torch_dtype=torch.float16`
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", device_map="auto", load_in_8bit=True)
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
# expected output: Memory footprint: 15939.61 MB

You can also try starcoder.cpp, a C++ implementation based on the ggml library.
