Deterministic semantic encoding - word similarity, valence detection, and clustering without machine learning
Learn how to encode words into deterministic numeric codes that capture semantic meaning, part of speech, abstractness, and sentiment valence
Oyemi is an offline semantic lexicon that converts words into structured numeric codes. Unlike word embeddings (Word2Vec, GloVe), Oyemi codes are deterministic, interpretable, and dependency-free: the same word always yields the same codes, and every digit carries a documented meaning. Each code follows the format HHHH-LLLLL-P-A-V:
| Component | Meaning | Example |
|---|---|---|
| HHHH | Semantic superclass (category) | 0121 = emotion.fear |
| LLLLL | Local synset ID | 00003 = specific sense |
| P | Part of speech | 1=noun, 2=verb, 3=adj, 4=adv |
| A | Abstractness | 0=concrete, 1=mixed, 2=abstract |
| V | Valence (sentiment) | 0=neutral, 1=positive, 2=negative |
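To make the layout concrete, here is a minimal pure-Python sketch (not part of Oyemi's API) that splits a raw code string into its five fields. The lookup tables simply mirror the component table above, and the example code is the primary sense of "fear" shown later in this guide:

```python
# Minimal sketch: decode a raw HHHH-LLLLL-P-A-V code by hand.
# The mappings mirror the component table above.
POS = {"1": "noun", "2": "verb", "3": "adj", "4": "adv"}
ABSTRACTNESS = {"0": "concrete", "1": "mixed", "2": "abstract"}
VALENCE = {"0": "neutral", "1": "positive", "2": "negative"}

def decode(code: str) -> dict:
    superclass, synset, pos, abstractness, valence = code.split("-")
    return {
        "superclass": superclass,                    # HHHH: semantic category
        "synset": synset,                            # LLLLL: local synset ID
        "pos": POS[pos],                             # P: part of speech
        "abstractness": ABSTRACTNESS[abstractness],  # A: concrete/mixed/abstract
        "valence": VALENCE[valence],                 # V: sentiment
    }

print(decode("0121-00003-1-2-2"))
# {'superclass': '0121', 'synset': '00003', 'pos': 'noun',
#  'abstractness': 'abstract', 'valence': 'negative'}
```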
Install Oyemi from PyPI. The package includes the pre-built lexicon - no additional downloads required:
pip install oyemi
Successfully installed oyemi-3.0.1
Encode words to get their semantic codes. Words with multiple meanings return multiple codes:
from oyemi import Encoder
# Initialize encoder
enc = Encoder()
# Encode a simple word
codes = enc.encode("happy")
print("Codes for 'happy':", codes)
# Polysemous word (multiple meanings)
codes = enc.encode("bank")
print("Codes for 'bank':", codes[:3]) # First 3 senses
# Check lexicon size
print(f"Lexicon contains {enc.word_count:,} words")
Codes for 'happy': ['3010-00001-3-1-1', '3999-05469-3-1-1', '3999-05731-3-1-1']
Codes for 'bank': ['0174-00012-1-0-0', '0045-00089-1-0-0', '2030-00156-2-1-0']
Lexicon contains 145,014 words
Use encode_parsed() to get structured SemanticCode objects with named attributes:
# Get parsed semantic codes
parsed = enc.encode_parsed("fear")
# Examine the primary sense
primary = parsed[0]
print(f"Word: fear")
print(f" Code: {primary.raw}")
print(f" Superclass: {primary.superclass}")
print(f" Part of Speech: {primary.pos_name}")
print(f" Abstractness: {primary.abstractness_name}")
print(f" Valence: {primary.valence_name}")
# Compare positive vs negative words
for word in ["love", "hate", "table"]:
    p = enc.encode_parsed(word)[0]
    print(f"{word:10} -> {p.valence_name}")
Word: fear
 Code: 0121-00003-1-2-2
 Superclass: 0121
 Part of Speech: noun
 Abstractness: abstract
 Valence: negative
love       -> positive
hate       -> negative
table      -> neutral
Oyemi provides built-in sentiment detection without any ML models. Perfect for deterministic text analysis:
# Analyze sentiment of a sentence
sentence = "The manager was incompetent and the layoffs were devastating"
# Tokenize and analyze
words = sentence.lower().split()
valence_counts = {'positive': 0, 'negative': 0, 'neutral': 0}
for word in words:
    try:
        parsed = enc.encode_parsed(word, raise_on_unknown=False)
        if parsed:
            valence = parsed[0].valence_name
            valence_counts[valence] += 1
            if valence != 'neutral':
                print(f" {word}: {valence}")
    except Exception:
        # Skip anything the lexicon cannot parse
        pass
print(f"\nValence Summary: {valence_counts}")
# Calculate sentiment score
total = sum(valence_counts.values())
score = (valence_counts['positive'] - valence_counts['negative']) / total
print(f"Sentiment Score: {score:.2f}")
 incompetent: negative
 layoffs: negative
 devastating: negative
Valence Summary: {'positive': 0, 'negative': 3, 'neutral': 5}
Sentiment Score: -0.38
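In practice you may want that loop wrapped in a reusable function. The helper below is a hypothetical wrapper (not part of Oyemi's API) around the same encode_parsed calls, returning a score in [-1, 1]:

```python
from oyemi import Encoder

enc = Encoder()

def sentence_valence(sentence: str) -> float:
    """Hypothetical helper: net valence of known words, in [-1, 1]."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for word in sentence.lower().split():
        parsed = enc.encode_parsed(word, raise_on_unknown=False)
        if parsed:  # skip words missing from the lexicon
            counts[parsed[0].valence_name] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no known words: treat as neutral
    return (counts["positive"] - counts["negative"]) / total

score = sentence_valence("The manager was incompetent and the layoffs were devastating")
print(f"{score:.2f}")  # -0.38, matching the summary above
```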
Find true synonyms using WordNet synset matching - words that share the same meaning:
from oyemi import find_synonyms
# Find synonyms for emotional words
for word in ["happy", "angry", "fired"]:
    syns = find_synonyms(word, limit=5)
    print(f"{word}: {syns}")
# Get weighted synonyms (higher weight = closer match)
weighted = find_synonyms("fear", return_weighted=True, limit=5)
print("\nWeighted synonyms for 'fear':")
for syn, weight in weighted:
    print(f" {syn}: {weight:.2f}")
happy: ['felicitous', 'glad', 'well-chosen']
angry: ['furious', 'raging', 'tempestuous', 'wild']
fired: ['discharged', 'dismissed', 'laid-off', 'pink-slipped']

Weighted synonyms for 'fear':
 dread: 1.00
 fearfulness: 1.00
 fright: 0.85
 reverence: 0.50
 awe: 0.50
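A natural use of the weights is query expansion: keep only synonyms above a confidence threshold. A short sketch (expand_query is a hypothetical helper), assuming the (synonym, weight) tuples shown above:

```python
from oyemi import find_synonyms

def expand_query(word: str, min_weight: float = 0.8) -> list[str]:
    """Hypothetical helper: expand a search term with close synonyms only."""
    weighted = find_synonyms(word, return_weighted=True, limit=10)
    return [word] + [syn for syn, weight in weighted if weight >= min_weight]

print(expand_query("fear"))
# ['fear', 'dread', 'fearfulness', 'fright'] given the weights above;
# 'reverence' and 'awe' (0.50) fall below the 0.8 threshold
```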
Calculate similarity between words based on their semantic codes - no embeddings required:
from oyemi import semantic_similarity
# Compare word pairs
pairs = [
    ("happy", "joyful"),    # Synonyms
    ("happy", "sad"),       # Antonyms
    ("dog", "cat"),         # Same category
    ("dog", "computer"),    # Different categories
    ("layoff", "fired"),    # Related workplace terms
]
print("Semantic Similarity Scores:")
for w1, w2 in pairs:
    sim = semantic_similarity(w1, w2)
    print(f" {w1:12} <-> {w2:12}: {sim:.2f}")
Semantic Similarity Scores:
 happy        <-> joyful      : 0.85
 happy        <-> sad         : 0.42
 dog          <-> cat         : 0.78
 dog          <-> computer    : 0.15
 layoff       <-> fired       : 0.72
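Because the scores are deterministic, they work well for simple nearest-word lookups against a fixed candidate list. A sketch (best_match is a hypothetical helper, not an Oyemi function):

```python
from oyemi import semantic_similarity

def best_match(word: str, candidates: list[str]) -> str:
    """Hypothetical helper: pick the semantically closest candidate."""
    return max(candidates, key=lambda c: semantic_similarity(word, c))

print(best_match("dog", ["cat", "computer"]))
# 'cat' (0.78 vs 0.15, per the scores above)
```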
Group words by their semantic category (superclass) for automatic topic clustering:
from oyemi import cluster_by_superclass
# Words from employee feedback
words = [
    "manager", "boss", "supervisor",    # Leadership
    "salary", "bonus", "compensation",  # Money
    "layoff", "fired", "terminated",    # Employment
    "stress", "anxiety", "fear",        # Emotions
]
# Cluster by semantic category
clusters = cluster_by_superclass(words)
print("Semantic Clusters:")
for superclass, cluster_words in clusters.items():
    print(f"\n [{superclass}]")
    for w in cluster_words:
        print(f" - {w}")
Semantic Clusters:
 [0214] Leadership
 - manager
 - boss
 - supervisor

 [0220] Compensation
 - salary
 - bonus
 - compensation

 [0233] Employment Actions
 - layoff
 - fired
 - terminated

 [0121] Emotions
 - stress
 - anxiety
 - fear
What you can do with deterministic semantic encoding:

- **Reproducible:** the same input always produces the same output. No model randomness, no training variance. Perfect for regulated industries (see the quick check below).
- **Lightweight:** no NLTK, no transformers, no GPU required at runtime. Just pure Python with a bundled SQLite lexicon.
- **Interpretable:** every code component has meaning. Superclass 0121 means "emotion.fear", not a black-box 768-dimensional vector.
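The reproducibility claim is easy to verify: two independent encoder instances should produce byte-identical codes, because the lexicon is a static lookup rather than a trained model. A minimal check using only the calls shown above:

```python
from oyemi import Encoder

# Two independent encoders, same word: same codes expected.
a = Encoder().encode("happy")
b = Encoder().encode("happy")
assert a == b, "codes should be identical across instances and runs"
print("deterministic:", a == b)  # deterministic: True
```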
Add deterministic semantic encoding to your NLP pipeline in minutes.
Oyemi v3.0.1 | Lexicon built from Princeton WordNet + SentiWordNet