Cosine similarity scores between random words are well above 0.9

When I calculate the cosine similarity between the embeddings (mean pooling, as implemented in sentence-transformers) of random English words, I get scores well above 0.9, and I can't quite understand why. Can you help me understand why this might be happening?

Here is the code to reproduce:

````python
import torch
import numpy as np
import random
from sentence_transformers import SentenceTransformer, util
import seaborn as sns
import matplotlib.pyplot as plt

def randomWords(amount):
    # wget https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
    with open('words_alpha.txt') as f: 
        words = f.read().splitlines()
        return [random.choice(words) for _ in range(amount)]

model = SentenceTransformer('allenai/longformer-base-4096')
rand_words = randomWords(300)
rand_embeddings = model.encode(rand_words)
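rand_rand_similarities = util.cos_sim(rand_embeddings, rand_embeddings).numpy()
# keep only the upper triangle (i < j) so that self-similarities (always 1.0)
# and mirrored duplicate pairs do not end up in the histogram
pairwise_scores = rand_rand_similarities[np.triu_indices_from(rand_rand_similarities, k=1)]

# plot distribution of similarity scores
fig = plt.figure(figsize=(10,5))
sns.histplot(pairwise_scores, label='rand-rand')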
plt.legend()
plt.show()
````
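For reference, here is a minimal NumPy-only sketch of what I understand the pipeline above to be doing: mean pooling over token vectors, then pairwise cosine similarity. It does not call sentence-transformers, the helper names (`mean_pool`, `cosine_matrix`, `off_diagonal`) are made up for illustration, and the data is synthetic. The large offset added to every token vector is purely an assumption, included to show how a component shared by all embeddings could push cosine scores toward 1. I have not confirmed that this is what happens with `allenai/longformer-base-4096`.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per item, ignoring padded positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def cosine_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def off_diagonal(matrix):
    """Entries strictly above the diagonal, i.e. each distinct pair once."""
    return matrix[np.triu_indices_from(matrix, k=1)]

# Synthetic stand-in for model output (NOT real Longformer activations):
# random token vectors plus a large offset shared by every token.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(50, 4, 32)) + 5.0  # (words, tokens per word, dim)
mask = np.ones((50, 4), dtype=int)

pooled = mean_pool(tokens, mask)
raw_scores = off_diagonal(cosine_matrix(pooled))

# Diagnostic: subtract the mean embedding before comparing, which removes
# the component that all vectors have in common.
centered_scores = off_diagonal(cosine_matrix(pooled - pooled.mean(axis=0)))

print(f"mean cosine, raw:      {raw_scores.mean():.3f}")
print(f"mean cosine, centered: {centered_scores.mean():.3f}")
```

If the same centering step is applied to `rand_embeddings` from the reproduction script and the scores spread out around zero, that would suggest the high similarities come from a direction common to all embeddings rather than from the words themselves.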

Cosine similarity scores between random words are well above 0.9 #259
