
tznurmin/TEA


Taxonomic Entity Augmentation (TEA)

TEA is a text augmentation tool that helps prevent machine learning models from overfitting to important but repetitive content in NLP examples drawn from biological texts. TEA targets taxonomic species names and strain names: it either switches them automatically to other valid taxonomic names or scrambles specified strain names in the text.

To see TEA in action, refer to the TEA_ft repository.

Installation

You will need a tokenizer that is compatible with the Hugging Face libraries. The Transformers package from Hugging Face includes the required dependency. To install it, run:

pip install transformers

Next, clone this repository and run the following to install TEA as a Python package:

cd TEA
pip install .

Quickstart

The package provides two general text augmentation strategies.

To switch species:

from transformers import AutoTokenizer
from tea import TEA

tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-base-cased-v1.2', do_lower_case=False, model_max_length=100000)
tea = TEA(tokenizer)

tea.switch('Hello E. coli!')
# => 'Hello D. cephalotes!'

To scramble strains:

from transformers import AutoTokenizer
from tea import TEA

tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-base-cased-v1.2', do_lower_case=False, model_max_length=100000)
tea = TEA(tokenizer)

tea.scramble('E. coli strain HB101 is a handy laboratory strain for molecular biology laboratory work.', ['HB101'])
# => 'E. coli strain FQ414 is a handy laboratory strain for molecular biology laboratory work.'

# passing the name together with the 'strain' prefix also works
tea.scramble('E. coli strain HB101 is a handy laboratory strain for molecular biology laboratory work.', ['strain HB101'])
# => 'E. coli strain SW565 is a handy laboratory strain for molecular biology laboratory work.'
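
The two strategies above each return one augmented string per call. A training set usually needs several variants per example, so the following is a minimal sketch of how that could be done. It assumes that repeated calls to switch or scramble can return different replacements (the differing example outputs above suggest this, but it is not confirmed here). The helper augment_variants and its parameters are hypothetical and not part of TEA.

```python
from typing import Callable, List


def augment_variants(
    augment: Callable[[str], str],
    text: str,
    n: int,
    max_tries: int = 50,
) -> List[str]:
    """Collect up to n distinct augmented versions of text.

    augment is any callable that maps a string to a string, for example
    tea.switch (assumption: each call may pick a different replacement).
    """
    variants: List[str] = []
    seen = {text}  # treat an output identical to the input as a non-variant
    tries = 0
    # max_tries bounds the loop in case the augmenter keeps repeating itself
    while len(variants) < n and tries < max_tries:
        tries += 1
        candidate = augment(text)
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants


# Usage with the tea object from the Quickstart (not executed here):
# augment_variants(tea.switch, 'Hello E. coli!', n=5)
# augment_variants(lambda t: tea.scramble(t, ['HB101']), 'E. coli strain HB101 ...', n=5)
```

The helper takes a plain callable rather than a TEA instance, so the same loop works for both strategies and can be tried out with a stub function before the tokenizer is downloaded.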

Dataset generation

The gen_strategy.py script shows how TEA can be used as part of a more advanced dataset generation pipeline. The script assumes that TEA_curated_data is located in the directory from which it is run. Run the following command to download the curated data:

wget https://github.com/tznurmin/TEA_curated_data/archive/refs/tags/v1.0.tar.gz -qO - | tar -xz && mv TEA_curated_data-1.0 TEA_curated_data
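
On systems without wget, the same download, extract and rename steps can be sketched with the Python standard library. This is an illustrative equivalent of the command above, not something shipped with TEA: the function names are hypothetical, and the code assumes the archive contains a single top-level directory, as the mv step in the command implies.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path

# URL taken from the wget command above
CURATED_DATA_URL = (
    "https://github.com/tznurmin/TEA_curated_data/archive/refs/tags/v1.0.tar.gz"
)


def extract_curated_data(fileobj, dest="TEA_curated_data"):
    """Unpack a gzipped tar stream and rename its top-level directory to dest."""
    dest = Path(dest)
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")
    # Extract next to dest so the final move stays on the same filesystem
    with tempfile.TemporaryDirectory(dir=dest.parent) as tmp:
        with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
            archive.extractall(tmp)
        roots = list(Path(tmp).iterdir())
        # Assumption: one top-level directory (e.g. TEA_curated_data-1.0)
        if len(roots) != 1 or not roots[0].is_dir():
            raise ValueError("expected a single top-level directory in the archive")
        shutil.move(str(roots[0]), str(dest))
    return dest


def download_curated_data(url=CURATED_DATA_URL, dest="TEA_curated_data"):
    """Stream the release archive from url and unpack it into dest."""
    with urllib.request.urlopen(url) as response:
        return extract_curated_data(response, dest)


# To fetch the data into the current directory (performs a network request):
# download_curated_data()
```

Splitting the extraction from the download keeps the network call in one small function, so the unpacking logic can also be run on an archive that has already been downloaded.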