Large Language Model (LLM) experts often use an analogy when talking about words: while all words are spelled differently, some are clearly more alike than others. ‘Yellow’ and ‘red’ are of course different words with different meanings, but they are more like each other than either is like ‘chicken’.
So I decided to see if I could play with that. There are numerous LLMs out there and they all encode words differently. Plus, words often need context to be understood properly (e.g. “read” in “I will read” means something different from “read” in “I read”).
Perplexity helped me a bit here with my query for “How do I look at the vectors for specific words encoded by the transformers library?”
#!/usr/bin/env python3
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_word_vector(word, model, tokenizer):
    # Tokenize the word and convert to PyTorch tensors
    inputs = tokenizer(word, return_tensors="pt")
    # Run the model without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # Drop the [CLS] and [SEP] positions, then average the remaining
    # token vectors from the last hidden state
    word_embedding = outputs.last_hidden_state.squeeze()[1:-1].mean(dim=0)
    return word_embedding
Each of those vectors is 768 floating-point values long. That can’t be all the detail there is, but this is just an exercise.
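The slicing in `get_word_vector` is easy to skim past, so here is a minimal sketch of what `[1:-1].mean(dim=0)` does, assuming BERT's tokenizer wraps the input in `[CLS]` and `[SEP]` and may split a rarer word into several word pieces. Plain Python lists stand in for the tensor so it runs without downloading a model. The `mean_pool` helper and the tiny 3-value rows are made up for illustration; real rows have 768 values.

```python
def mean_pool(token_rows):
    """Average the rows between the first and last one, column by column."""
    inner = token_rows[1:-1]          # drop the [CLS] and [SEP] rows
    dims = len(inner[0])
    return [sum(row[d] for row in inner) / len(inner) for d in range(dims)]

# Hypothetical hidden states for a word split into two word pieces.
rows = [
    [9.0, 9.0, 9.0],   # [CLS]
    [1.0, 2.0, 3.0],   # first word piece
    [3.0, 4.0, 5.0],   # second word piece
    [7.0, 7.0, 7.0],   # [SEP]
]
print(mean_pool(rows))   # [2.0, 3.0, 4.0]
```

So a word that the tokenizer splits into pieces still ends up as a single vector, which is the average of its pieces.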
So which words are alike and which are different? In these runs, the dot products of the word vectors land roughly between 150 and 175 for related words and lower for completely unrelated ones.
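# Assumed setup: the post does not show how wordpairs and wordvector were built
wordpairs = [("red", "green"), ("green", "yellow"), ("yellow", "banana")]
words = {word for pair in wordpairs for word in pair}
wordvector = {word: get_word_vector(word, model, tokenizer).numpy() for word in words}

for word1, word2 in wordpairs:
    dp = np.dot(wordvector[word1], wordvector[word2])
    print(word1, word2, dp)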
Word1 | Word2 | Dot-product | Comment |
red | green | 169.48 | Makes sense |
green | yellow | 173.48 | |
yellow | banana | 154 | |
red | banana | 155 | Surprised it’s higher than yellow |
left | right | 164.47 | |
black | white | 164.13 | |
How low can I go?
statue | absurd | 118.83 | |
screwdriver | epistolary | 78.05 | |
obedient | molasses | 58.3 | |
tardy | nominative | 41.82 | |
psoriasis | Watts | 33.65 | |
My best match is invisible and transparent at 173.
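Raw dot products grow with the length of the vectors as well as with how closely they point the same way, so a common companion measure is cosine similarity, which divides the length out. This is a minimal sketch in plain Python, not something the experiment above ran; `dot`, `cosine_similarity` and the toy vectors are illustrative assumptions.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by both vector lengths; ranges from -1 to 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

short = [1.0, 2.0, 2.0]
long_ = [10.0, 20.0, 20.0]   # same direction as `short`, ten times longer
other = [2.0, -1.0, 0.0]     # perpendicular to `short`

print(dot(short, short), dot(short, long_))   # the dot product grows with length
print(cosine_similarity(short, long_))        # same direction
print(cosine_similarity(short, other))        # no overlap in direction
```

With NumPy arrays the same quantity is `np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))`. Whether it would reorder the pairs listed here is untested.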
Oddly you’d think synonyms would do better:
Some surprises:
Likelihood words carry rough quantities. If I told you people would probably attend an event, I’m presuming at least 50% will show up. If I said they will definitely attend, that’s closer to 90-ish percent. ‘Maybe’ would imply 30-ish, I presume. Still, these are all expressions of probability, so I would expect them to be similar.
certainly | definitely | 135.12376 | |
probably | impossible | 141.61134 | |
absolutely | never | 147.89499 | |
I’m sure there’s a reason behind this. Maybe I’ll try phrases next, or try to find out *where* these words are the most similar (explore the encodings more).
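One way to read "where" is dimension by dimension: a dot product is a sum of per-dimension products, so the largest terms show which coordinates drive the score. Here is a minimal sketch under that interpretation, with a hypothetical `top_contributions` helper and toy 4-value vectors standing in for the 768-value ones.

```python
def top_contributions(a, b, k=2):
    """Return the k (index, product) pairs that add the most to dot(a, b)."""
    products = [(i, x * y) for i, (x, y) in enumerate(zip(a, b))]
    return sorted(products, key=lambda pair: pair[1], reverse=True)[:k]

vec_a = [0.5, 4.0, -1.0, 2.0]
vec_b = [0.5, 3.0, 2.0, 2.0]

print(top_contributions(vec_a, vec_b))   # [(1, 12.0), (3, 4.0)]
```

If the same few dimensions dominate every pair, that might explain why unrelated words still score fairly high. That is a guess to check, not a result.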