# Oyemi

Deterministic Semantic Word Encoding for NLP - No Machine Learning Required
## Installation

Install Oyemi from PyPI with pip. The package ships with a pre-built lexicon of 145K+ words; no additional downloads are required:
```bash
# Install from PyPI
pip install oyemi

# Verify installation
python -c "from Oyemi import Encoder; print(Encoder().word_count)"
# 145014
```
**Key features:**

- Zero runtime dependencies - just pure Python + SQLite
- 100% deterministic - the same input always gives the same output
- No internet required - fully offline operation
- Works on Python 3.8+
## Quick Start

Get started with Oyemi in just a few lines:
```python
from Oyemi import Encoder, encode, semantic_similarity

# Simple encoding
codes = encode("happy")
print(codes)
# ['3010-00001-3-1-1', '3999-05469-3-1-1', '3999-05731-3-1-1']

# Using an encoder instance
enc = Encoder()
parsed = enc.encode_parsed("fear")
print(parsed[0].valence_name)
# 'negative'

# Semantic similarity
sim = semantic_similarity("happy", "joyful")
print(sim)
# 0.85
```
## Code Format

Oyemi codes follow the format `HHHH-LLLLL-P-A-V`:
| Field | Meaning |
| --- | --- |
| `HHHH` | Superclass - semantic category (e.g., `0121` = emotion.fear) |
| `LLLLL` | Local ID - specific synset within the superclass |
| `P` | Part of speech - 1=noun, 2=verb, 3=adjective, 4=adverb |
| `A` | Abstractness - 0=concrete, 1=mixed, 2=abstract |
| `V` | Valence - 0=neutral, 1=positive, 2=negative |
Example: `0121-00003-1-2-2` for "fear" means:

- Superclass `0121` (emotion category)
- Local synset ID `00003`
- Part of speech: noun (1)
- Abstractness: abstract (2)
- Valence: negative (2)
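Because the format is fixed-width and dash-delimited, the fields can be pulled apart with ordinary string operations. The sketch below is illustrative only (Oyemi's own `SemanticCode` objects, covered under Parsed Codes, do this for you); the `parse_code` helper and its name maps are not part of the library:

```python
# Illustrative only: a minimal parser for the HHHH-LLLLL-P-A-V format,
# using plain string splitting rather than Oyemi's SemanticCode class.
POS_NAMES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb"}
ABSTRACTNESS_NAMES = {0: "concrete", 1: "mixed", 2: "abstract"}
VALENCE_NAMES = {0: "neutral", 1: "positive", 2: "negative"}

def parse_code(code: str) -> dict:
    superclass, local_id, pos, abstractness, valence = code.split("-")
    return {
        "superclass": superclass,
        "local_id": local_id,
        "synset_id": f"{superclass}-{local_id}",
        "pos": POS_NAMES[int(pos)],
        "abstractness": ABSTRACTNESS_NAMES[int(abstractness)],
        "valence": VALENCE_NAMES[int(valence)],
    }

print(parse_code("0121-00003-1-2-2"))
# {'superclass': '0121', 'local_id': '00003', 'synset_id': '0121-00003',
#  'pos': 'noun', 'abstractness': 'abstract', 'valence': 'negative'}
```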
## Word Encoding

Encode words to get their semantic codes:
```python
from Oyemi import Encoder

enc = Encoder()

# Simple encoding returns a list of code strings
codes = enc.encode("bank")
print(codes)
# ['0174-00012-1-0-0', '0045-00089-1-0-0', '2030-00156-2-1-0']
# Multiple codes = polysemous word (multiple meanings)

# Check whether a word exists
print(enc.has_word("serendipity"))  # True
print(enc.has_word("asdfgh"))       # False

# Get lexicon stats
print(f"Lexicon size: {enc.word_count:,} words")
# Lexicon size: 145,014 words
```
## Parsed Codes

Use `encode_parsed()` to get structured `SemanticCode` objects:
```python
parsed = enc.encode_parsed("fear")
code = parsed[0]

# Access components directly
print(code.raw)                # '0121-00003-1-2-2'
print(code.superclass)         # '0121'
print(code.local_id)           # '00003'
print(code.pos)                # 1
print(code.pos_name)           # 'noun'
print(code.abstractness)       # 2
print(code.abstractness_name)  # 'abstract'
print(code.valence)            # 2
print(code.valence_name)       # 'negative'

# Get the full synset ID
print(code.synset_id)          # '0121-00003'
```
### SemanticCode Properties

| Property | Description |
| --- | --- |
| `raw` | Full code string |
| `superclass` | 4-digit category code |
| `local_id` | 5-digit synset ID within the category |
| `synset_id` | Combined superclass + local_id |
| `pos` / `pos_name` | Part of speech (int / string) |
| `abstractness` / `abstractness_name` | Concreteness level |
| `valence` / `valence_name` | Sentiment polarity |
## Valence Detection

Oyemi provides built-in sentiment detection with 95%+ accuracy:
```python
# Check the valence of individual words
for word in ["happy", "angry", "table", "fired"]:
    parsed = enc.encode_parsed(word)
    if parsed:
        print(f"{word}: {parsed[0].valence_name}")
# Output:
# happy: positive
# angry: negative
# table: neutral
# fired: negative

# Analyze sentence sentiment
sentence = "The manager was incompetent and the layoffs were devastating"
words = sentence.lower().split()
valence_counts = {'positive': 0, 'negative': 0, 'neutral': 0}
for word in words:
    parsed = enc.encode_parsed(word, raise_on_unknown=False)
    if parsed:
        valence_counts[parsed[0].valence_name] += 1
print(valence_counts)
# {'positive': 0, 'negative': 3, 'neutral': 5}
```
## Text Analysis

Analyze entire text strings for valence/sentiment with `analyze_text()`:
```python
from Oyemi import analyze_text, Encoder

# Convenience function
result = analyze_text("I feel hopeful but anxious about the future")
print(f"Score: {result.valence_score:+.2f}")
print(f"Sentiment: {result.sentiment}")
print(f"Positive: {result.positive_words}")
print(f"Negative: {result.negative_words}")
# Output:
# Score: +0.00
# Sentiment: neutral
# Positive: ['hopeful']
# Negative: ['anxious']

# Using an encoder instance
enc = Encoder()
result = enc.analyze_text("The team achieved great success")
print(f"Analyzed: {result.analyzed_words} words")
print(f"Positive: {result.positive_pct:.1f}%")
print(f"Negative: {result.negative_pct:.1f}%")

# Convert to a dict for JSON serialization
data = result.to_dict()
print(data)
```
### TextAnalysis Properties

| Property | Description |
| --- | --- |
| `total_words` | Total words extracted from the text |
| `analyzed_words` | Words found in the lexicon |
| `positive_words` | List of positive-valence words |
| `negative_words` | List of negative-valence words |
| `neutral_words` | List of neutral-valence words |
| `valence_score` | Overall score (-1.0 to +1.0) |
| `sentiment` | Label: "positive", "negative", or "neutral" |
| `to_dict()` | Convert to a dictionary for JSON |
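The `+0.00` score in the example above is consistent with a simple word-count score: one positive and one negative word cancel out. The pure-Python sketch below illustrates that intuition; the tiny `TOY_LEXICON` dict and the exact formula (positive minus negative, over analyzed words) are assumptions for illustration, not Oyemi's actual implementation:

```python
# Hypothetical sketch of a valence score: (positive - negative) / analyzed.
# TOY_LEXICON stands in for Oyemi's 145K-word database.
TOY_LEXICON = {"hopeful": "positive", "anxious": "negative", "future": "neutral"}

def toy_analyze(text: str) -> float:
    # Keep only words the toy lexicon knows about
    words = [w for w in text.lower().split() if w in TOY_LEXICON]
    pos = sum(1 for w in words if TOY_LEXICON[w] == "positive")
    neg = sum(1 for w in words if TOY_LEXICON[w] == "negative")
    return (pos - neg) / len(words) if words else 0.0

print(toy_analyze("I feel hopeful but anxious about the future"))
# 0.0  (one positive and one negative word cancel out)
```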
## Synonym Discovery

Find true synonyms using WordNet synset matching:
```python
from Oyemi import find_synonyms

# Basic synonym search
syns = find_synonyms("happy", limit=5)
print(syns)
# ['felicitous', 'glad', 'well-chosen']

# Weighted synonyms (higher = closer match)
weighted = find_synonyms("fear", return_weighted=True, limit=5)
for syn, weight in weighted:
    print(f"  {syn}: {weight:.2f}")
# dread: 1.00
# fearfulness: 1.00
# fright: 0.85

# With constraints
syns = find_synonyms(
    "run",
    limit=10,
    pos_lock=True,           # Same part of speech only
    abstractness_lock=True,  # Don't mix abstract/concrete
)
```
### Parameters

| Parameter | Description |
| --- | --- |
| `word` | Word to find synonyms for |
| `limit` | Maximum synonyms to return (default: 20) |
| `pos_lock` | Only return the same part of speech (default: True) |
| `abstractness_lock` | Don't mix abstract/concrete (default: True) |
| `return_weighted` | Return (word, weight) tuples (default: False) |
## Antonym Detection

Check and retrieve antonyms for words:
```python
from Oyemi import are_antonyms, get_antonyms

# Check whether two words are antonyms
print(are_antonyms("happy", "sad"))    # True
print(are_antonyms("good", "bad"))     # True
print(are_antonyms("happy", "table"))  # False

# Get all antonyms for a word
antonyms = get_antonyms("happy")
print(antonyms)
# ['sad', 'unhappy', 'sorrowful', ...]
```
## Semantic Similarity

Calculate similarity between words based on their semantic codes:
```python
from Oyemi import semantic_similarity, find_similar

# Compare two words
sim = semantic_similarity("happy", "joyful")
print(f"happy <-> joyful: {sim:.2f}")  # 0.85

# Compare multiple pairs
pairs = [
    ("dog", "cat"),       # Same category
    ("dog", "computer"),  # Different categories
    ("layoff", "fired"),  # Related terms
]
for w1, w2 in pairs:
    print(f"{w1} <-> {w2}: {semantic_similarity(w1, w2):.2f}")

# Find similar words
similar = find_similar("happy", top_n=5)
print(similar)
# [('joyful', 0.85), ('glad', 0.82), ('cheerful', 0.78), ...]
```
## Topic Clustering

Group words by their semantic category (superclass):
```python
from Oyemi import cluster_by_superclass

# Words from employee feedback
words = [
    "manager", "boss", "supervisor",
    "salary", "bonus", "compensation",
    "layoff", "fired", "terminated",
    "stress", "anxiety", "fear",
]

# Cluster by semantic category
clusters = cluster_by_superclass(words)
for superclass, cluster_words in clusters.items():
    print(f"\n[{superclass}]")
    for w in cluster_words:
        print(f"  - {w}")
# Output:
# [0214] Leadership
#   - manager
#   - boss
#   - supervisor
# [0220] Compensation
#   - salary
#   - bonus
# ...
```
## Distance Functions

Calculate semantic distance between words or codes:
```python
from Oyemi import code_distance, word_distance, DistanceResult

# Distance between codes directly
dist = code_distance("0121-00003-1-2-2", "0121-00005-1-2-2")
print(f"Distance: {dist}")

# Distance between words (returns a DistanceResult)
result = word_distance("happy", "sad")
print(f"Distance: {result.distance}")
print(f"Same superclass: {result.same_superclass}")
print(f"Same POS: {result.same_pos}")
```
## Lexicon Storage

Access the underlying SQLite lexicon directly:
```python
from Oyemi import get_storage, LexiconStorage

# Get the storage instance
storage = get_storage()

# Query the lexicon directly
print(f"Total words: {storage.word_count}")
print(f"Total synsets: {storage.synset_count}")

# Low-level lookup
codes = storage.lookup("happy")
print(codes)
```
## Error Handling

Oyemi provides specific exceptions for different error cases:
```python
from Oyemi import (
    OyemiError,            # Base exception
    UnknownWordError,      # Word not in lexicon
    LexiconNotFoundError,  # Database file missing
    InvalidCodeError,      # Malformed code string
)

try:
    codes = enc.encode("asdfghjkl")
except UnknownWordError as e:
    print(f"Word not found: {e.word}")

# Or use raise_on_unknown=False
codes = enc.encode("asdfghjkl", raise_on_unknown=False)
print(codes)  # []
```
## API Reference

**Encoder class:** `encode()`, `encode_parsed()`, `has_word()`, `find_synonyms()`, `are_antonyms()`, `get_antonyms()`

**Distance functions:** `code_distance()`, `word_distance()`, `semantic_similarity()`, `find_similar()`

**Clustering:** `cluster_by_superclass()`

**Storage:** `get_storage()`, `LexiconStorage`
## Superclass Categories

Oyemi uses hierarchical semantic categories based on WordNet hypernyms:
| Range | Category | Covers |
| --- | --- | --- |
| 0100-0199 | Psychological | Emotions, mental states, cognition, feelings |
| 0200-0299 | Social | People, organizations, roles, relationships |
| 0300-0399 | Physical | Objects, materials, substances, artifacts |
| 0400-0499 | Actions | Verbs, activities, processes, events |
| 1000-1999 | Nature | Animals, plants, natural phenomena |
| 2000-2999 | Activities | Work, sports, communication, motion |
| 3000-3999 | Attributes | Qualities, properties, states, conditions |
| 4000-4999 | Relations | Spatial, temporal, logical relations |
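Because the ranges are contiguous numeric bands, a superclass code can be bucketed into its top-level category with a simple lookup. A minimal sketch (illustrative only; `top_level_category` is not an Oyemi function, and codes outside the listed ranges fall through to "Unknown"):

```python
# Illustrative only: map a 4-digit superclass code to its top-level
# category using the ranges listed above.
CATEGORY_RANGES = [
    (100, 199, "Psychological"),
    (200, 299, "Social"),
    (300, 399, "Physical"),
    (400, 499, "Actions"),
    (1000, 1999, "Nature"),
    (2000, 2999, "Activities"),
    (3000, 3999, "Attributes"),
    (4000, 4999, "Relations"),
]

def top_level_category(superclass: str) -> str:
    n = int(superclass)
    for lo, hi, name in CATEGORY_RANGES:
        if lo <= n <= hi:
            return name
    return "Unknown"

print(top_level_category("0121"))  # Psychological
print(top_level_category("3010"))  # Attributes
```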
## Use Cases

- **Sentiment Analysis** - Deterministic valence detection without ML models
- **Search Enhancement** - Query expansion using synonyms and similar words
- **Topic Modeling** - Group documents by semantic category
- **Text Classification** - Use codes as features for classifiers
- **Lexical Databases** - Build domain-specific semantic indexes
- **Educational Tools** - Vocabulary analysis and language learning
## See Oyemi in Action

Check out the interactive tutorials with real code examples.