ROSA+: RWKV's ROSA implementation with fallback statistical predictor
ROSA+ is an extension of ROSA, the statistical next-token predictor proposed by BlinkDL as part of extending the RWKV language model. It provides an intuitive Python interface as well as a fallback Witten–Bell predictor for unknown sequences.
The implementation is self-contained in rosaplus.py. You can download the repository and use it from there.
# example.py
from rosaplus import ROSAPlus
import requests
# Train on Tiny Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
# Download the text
response = requests.get(url)
text = response.text
print("Downloaded text.")
# Initialize model
m = ROSAPlus(max_order=1048576, use_eot=False, seed=0)
m.train_example(text) # Train ROSA
m.build_lm() # Train fallback predictor
# Prompting
prompt = "ROMEO:" # Novel text
max_tokens = 256
# Eval mode
print(prompt + m.generate(prompt, steps=max_tokens))
# Saving model
m.save("rosa-model.json")
m2 = ROSAPlus.load("rosa-model.json") # Loading model
Output: (verbatim)
ROMEO:
In faith, I will. Let me peruse this face.
Mercutio's kinsman, noble County Paris!
What said my man, when my betossed soul
Did not attend him as we rode? I think
...
ROSA+ can also be used to generate novel sequences that do not show up in the training dataset. You can enable this by always using the fallback predictor. It often leads to coherent, surprising results.
# add always_fallback=True to the example
print(prompt + m.generate(prompt, steps=max_tokens, always_fallback=True))
Output: (novel)
ROMEO:
The exchange of joy
That only Warwick's daughter.
CLARENCE:
To whom, my lord; the foe vantage.
But make you read no other, and look'd deadly that name remains;
The cruelty and envy of the people,
Permitted by our faces
For man or master; then it for some
As you can see, these arrangements of sentences do not show up in the dataset (try Ctrl+F). Rather, ROSA+ intelligently splices and pulls together the features from ROSA to perform next-character prediction.
For any given prefix, you can also get the probability distribution for the next token:
# Eval mode
print(m.get_dist("ROMEO:\nOh, how could yo"))
Output:
{'u': 0.9999989177710094, 'n': 5.442332067424175e-07, 'k': 5.379892385443467e-07, 'r': 6.0439900862193395e-12, ' ...
As you can see, ROSA+ is extremely confident that 'u' is the next token (and it is correct!).
This is just a standalone example of ROSA and does not provide RWKV integration. You will have to go to the RWKV Discord or ask the main maintainer (BlinkDL) for assistance in this regard.
ROSA+ extends ROSA by:
- Allowing training and sampling on individual sequences, similar to an LLM
- Utilizing a (coherent) Witten–Bell-based fallback predictor for when ROSA is unsure of the next token.
This makes generation extremely fast, since ROSA handles roughly 99% of predictions and the fallback is only invoked for novel sequences.
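For intuition, here is a minimal, self-contained sketch of interpolated Witten–Bell smoothing over character n-grams -- the textbook recursion the fallback is based on, not the actual code in rosaplus.py (the class name, order, and data structures below are illustrative assumptions):
# witten_bell_sketch.py -- illustrative only, NOT the rosaplus.py implementation
from collections import defaultdict

class WittenBellChar:
    """Interpolated Witten-Bell smoothing over character n-grams."""

    def __init__(self, order=4):
        self.order = order
        # counts[history][next_char] = how often next_char followed history
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, ch in enumerate(text):
            # record ch against every history of length 0..order ending at i
            for n in range(min(self.order, i) + 1):
                self.counts[text[i - n:i]][ch] += 1

    def prob(self, ch, history):
        return self._wb(ch, history[-self.order:])  # keep at most `order` chars of context

    def _wb(self, ch, history):
        follows = self.counts.get(history, {})
        count = sum(follows.values())      # c(h)
        types = len(follows)               # T(h): distinct continuations of h
        if history == "":
            # base case: interpolate with a uniform distribution over the
            # characters seen in training (treated as the vocabulary here)
            vocab = max(types, 1)
            return (follows.get(ch, 0) + types / vocab) / (count + types) if count else 1.0 / vocab
        lower = self._wb(ch, history[1:])  # back off by dropping the oldest character
        if count == 0:
            return lower
        # P(ch | h) = (c(h, ch) + T(h) * P(ch | h')) / (c(h) + T(h))
        return (follows.get(ch, 0) + types * lower) / (count + types)

lm = WittenBellChar(order=4)
lm.train("to be or not to be")
print(lm.prob("e", "to b"))  # high: 'e' always follows "to b" in the training text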
Tokenization: The default tokenization is character-based (support for other tokenizers is coming soon).
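Concretely, character-based tokenization just treats every character as its own token:
# Character-level tokenization: each character is one token.
tokens = list("ROMEO:")   # ['R', 'O', 'M', 'E', 'O', ':']
text = "".join(tokens)    # detokenization simply joins the characters back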
If you install orjson, ROSA+ will use it automatically, which makes model import/export far faster. Docs coming soon.
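For reference, an optional orjson dependency is typically handled with a pattern like the one below. This is an assumption about the general approach, not a verified excerpt from rosaplus.py:
# Typical optional-dependency pattern (illustrative, not the rosaplus.py source):
# fall back to the stdlib json module when orjson is not installed.
try:
    import orjson

    def dumps(obj):
        return orjson.dumps(obj).decode("utf-8")  # orjson returns bytes

    def loads(data):
        return orjson.loads(data)
except ImportError:
    import json

    def dumps(obj):
        return json.dumps(obj)

    def loads(data):
        return json.loads(data)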
ROSA+ is entirely statistics-based -- it builds on the ROSA predictor proposed by BlinkDL and adds a probabilistic predictor as a fallback. However, this means it only has a database-like understanding of text -- it can stitch together multiple sentences and demonstrate grammar, but it lacks the contextual understanding of an NN (RWKV, Transformer, etc.).
For instance, when trained on Shakespeare with always_fallback=True (forcing novel predictions), it generates text that "looks right" but switches between characters every stanza.
COMINIUS:
Well, one nail;
Right noble is thy mercy dried their watches of chance and thy lord's false love;
For both of you are birds of selfsame feather.
KING EDWARD IV:
Peace, wilful boy, or I will charm your tongue.
CLARENCE:
Unhappy fortune! by my troth, I looked upon his faith iron cook you, sir, he bid me knocks; ha! let me be unrolled and said 'The better for our purpose.'
KING RICHARD III:
So proud the name of Henry with your holy look'd on me,
And wouldst do not break your oaths; for of that sin
May deny her aiding have these nothing here.
AUTOLYCUS:
I hope so, sir; for I have about me manner doth accuse my husband, I
...
A ChatGPT analysis of ROSA+'s lines offers some insight:
Short answer: it’s Shakespeare-flavored, not Shakespearean. It reads like a collage of misquoted or remixed lines, with scrambled idioms, mixed plays (Juliet/Romeo with Buckingham and Gaunt), and meter/grammar that don’t line up with blank verse.
Quick notes:
* “Now, by Saint Peter’s Church…” and “I have forgot why I did call thee back” echo *Romeo and Juliet*, but they’re spliced into new contexts.
* “The world goes his bonnet to an oystery” mangles Pistol’s “The world’s mine oyster.”
* Shifts between **you/thee/thou/thy** are inconsistent (use *thou* as subject, *thee* as object, *thy/thine* as possessives).
* Many lines don’t scan as iambic pentameter (10 syllables, mostly unstressed–stressed).
A true NN-based model would outperform a standalone ROSA+ implementation because of its understanding of actual context. While ROSA+ has impressive surface-level understanding, it lacks the deeper semantic understanding expressed by NNs.
You can view all the samples in the samples directory -- interestingly, in sample_default_output.txt, the model falls into an attractor state halfway through, repeating itself every ~3k lines. However, in sample_novel_output.txt, you can spot some very novel sentences:
LADY ANNE:
Well, well, peace be with you, sir, he bid me know the points o' the dead
May walk again: if such thing but what I am,
I would wish it gone,
The phrases "Well, well, peace be with you" and "I would wish it gone" never show up in the training data.
Potential use cases:
- Autocorrect / word prediction (see the sketch after this list)
- Translation (possibly)
- Features for a lower level model
- Generating surface-level text that fools detectors
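As an example of the autocorrect use case, a toy next-character completer can be built on top of the get_dist() call shown earlier. The complete() helper below is hypothetical and not part of rosaplus.py:
# Toy "autocorrect"-style completion using only get_dist() from the example above.
def complete(model, prefix, n_chars=10):
    """Greedily append the single most likely next character n_chars times."""
    out = prefix
    for _ in range(n_chars):
        dist = model.get_dist(out)       # {char: probability, ...}
        out += max(dist, key=dist.get)   # pick the most probable next character
    return out

# With the Shakespeare-trained model `m` from the example above:
print(complete(m, "ROMEO:\nOh, how could yo", n_chars=5))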
One may be able to create a coherent language model simply by feeding ROSA+ embeddings into a GRU. Since ROSA+ captures the immediate surface-level features of text, a sufficiently capable neural network may be able to operate on these embeddings and alter the distribution for more fine-grained understanding.
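As a purely hypothetical sketch of that idea (nothing below exists in the repository; the vocabulary size and feature handling are assumptions), the ROSA+ next-character distribution could be treated as the per-step input of a small PyTorch GRU that re-ranks it:
# Hypothetical sketch: feed ROSA+ next-character distributions into a GRU
# that outputs adjusted logits. Not implemented in this repository.
import torch
import torch.nn as nn

class ROSAPlusGRU(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        # Input at each step: the ROSA+ probability vector over the vocabulary
        self.gru = nn.GRU(vocab_size, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, rosa_dists):
        # rosa_dists: (batch, seq_len, vocab_size) of ROSA+ probabilities
        h, _ = self.gru(rosa_dists)
        return self.head(h)  # adjusted logits for the next character

# Shapes only -- real features would come from m.get_dist() for each prefix.
model = ROSAPlusGRU(vocab_size=65)
dummy = torch.rand(1, 32, 65)
dummy = dummy / dummy.sum(-1, keepdim=True)  # normalize rows to probabilities
logits = model(dummy)                        # (1, 32, 65)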
Unless statistical LMs incorporate some kind of statistical attention mechanism (which is possible!), they will never be able to grasp a high-level understanding of text the way humans and neural LMs do. A statistical LM cannot copy data/tokens from one place to another, operate on a continuous state, blend together tokens across different spaces, perform few-shot learning (which needs a neural mechanism), or do transfer learning (no state vectors). Therefore, its purpose remains limited to grasping surface-level features of text, like syntax or general structure.
Google pushed to make its translation software (which, into the 2010s, was still n-gram-based statistical MT) the best of its time, but even LSTMs (which were invented well before Transformers) managed to outperform it.
Do not let this discourage you, though. It may be practical to incorporate some kind of continuous state vector/representation within a statistical model, making it drastically more efficient than LLMs while preserving the benefits of NN-based models. This is an active area of research at Bellevue College ML (BCML) -- and if pioneered, it could result in language models thousands of times more efficient.