Open Source

Oyemi

Deterministic Semantic Word Encoding for NLP - No Machine Learning Required

Installation

Install Oyemi from PyPI using pip. The package includes a pre-built lexicon with 145K+ words - no additional downloads required:

# Install from PyPI
pip install oyemi

# Verify installation
python -c "from Oyemi import Encoder; print(Encoder().word_count)"
# 145014

Key Features:

  • Zero runtime dependencies - just pure Python + SQLite
  • 100% deterministic - same input always gives same output
  • No internet required - fully offline operation
  • Works on Python 3.8+

Quick Start

Get started with Oyemi in just a few lines:

from Oyemi import Encoder, encode, semantic_similarity

# Simple encoding
codes = encode("happy")
print(codes)
# ['3010-00001-3-1-1', '3999-05469-3-1-1', '3999-05731-3-1-1']

# Using encoder instance
enc = Encoder()
parsed = enc.encode_parsed("fear")
print(parsed[0].valence_name)
# 'negative'

# Semantic similarity
sim = semantic_similarity("happy", "joyful")
print(sim)
# 0.85

Code Format

Oyemi codes follow the format HHHH-LLLLL-P-A-V:

  • HHHH (Superclass) - Semantic category (e.g., 0121 = emotion.fear)
  • LLLLL (Local ID) - Specific synset within the superclass
  • P (Part of Speech) - 1=noun, 2=verb, 3=adjective, 4=adverb
  • A (Abstractness) - 0=concrete, 1=mixed, 2=abstract
  • V (Valence) - 0=neutral, 1=positive, 2=negative

Example: 0121-00003-1-2-2 for "fear" means:

  • Superclass 0121 (emotion category)
  • Local synset ID 00003
  • Part of speech: noun (1)
  • Abstractness: abstract (2)
  • Valence: negative (2)
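
For illustration only, the five fields can be pulled apart with plain string splitting, using just the layout above (the library's own parsed representation is covered under Parsed Codes below):

# Split a raw code string into its five fields by hand
superclass, local_id, pos, abstractness, valence = "0121-00003-1-2-2".split("-")

print(superclass)        # '0121'  -> semantic category
print(local_id)          # '00003' -> synset within the category
print(int(pos))          # 1 -> noun
print(int(abstractness)) # 2 -> abstract
print(int(valence))      # 2 -> negative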

Word Encoding

Encode words to get their semantic codes:

from Oyemi import Encoder

enc = Encoder()

# Simple encoding returns list of code strings
codes = enc.encode("bank")
print(codes)
# ['0174-00012-1-0-0', '0045-00089-1-0-0', '2030-00156-2-1-0']
# Multiple codes = polysemous word (multiple meanings)

# Check if word exists
print(enc.has_word("serendipity"))  # True
print(enc.has_word("asdfgh"))       # False

# Get lexicon stats
print(f"Lexicon size: {enc.word_count:,} words")
# Lexicon size: 145,014 words

Parsed Codes

Use encode_parsed() to get structured SemanticCode objects:

parsed = enc.encode_parsed("fear")
code = parsed[0]

# Access components directly
print(code.raw)              # '0121-00003-1-2-2'
print(code.superclass)       # '0121'
print(code.local_id)         # '00003'
print(code.pos)              # 1
print(code.pos_name)         # 'noun'
print(code.abstractness)     # 2
print(code.abstractness_name) # 'abstract'
print(code.valence)          # 2
print(code.valence_name)     # 'negative'

# Get full synset ID
print(code.synset_id)        # '0121-00003'

SemanticCode Properties

  • raw - Full code string
  • superclass - 4-digit category code
  • local_id - 5-digit synset ID within the category
  • synset_id - Combined superclass + local_id
  • pos / pos_name - Part of speech (int / string)
  • abstractness / abstractness_name - Concreteness level
  • valence / valence_name - Sentiment polarity
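
As a small usage sketch built only on the properties above, the senses of a polysemous word can be grouped by part of speech (the output shape shown is illustrative):

from Oyemi import Encoder

enc = Encoder()

# Group the senses of "bank" by part of speech
by_pos = {}
for code in enc.encode_parsed("bank"):
    by_pos.setdefault(code.pos_name, []).append(code.synset_id)

for pos_name, synset_ids in by_pos.items():
    print(f"{pos_name}: {synset_ids}")
# noun: ['0174-00012', '0045-00089']
# verb: ['2030-00156']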

Valence Detection

Oyemi provides built-in sentiment detection with 95%+ accuracy:

# Check valence of individual words
for word in ["happy", "angry", "table", "fired"]:
    parsed = enc.encode_parsed(word)
    if parsed:
        print(f"{word}: {parsed[0].valence_name}")

# Output:
# happy: positive
# angry: negative
# table: neutral
# fired: negative

# Analyze sentence sentiment
sentence = "The manager was incompetent and the layoffs were devastating"
words = sentence.lower().split()
valence_counts = {'positive': 0, 'negative': 0, 'neutral': 0}

for word in words:
    parsed = enc.encode_parsed(word, raise_on_unknown=False)
    if parsed:
        valence_counts[parsed[0].valence_name] += 1

print(valence_counts)
# {'positive': 0, 'negative': 3, 'neutral': 5}

Text Analysis

Analyze entire text strings for valence/sentiment with analyze_text():

from Oyemi import analyze_text, Encoder

# Convenience function
result = analyze_text("I feel hopeful but anxious about the future")

print(f"Score: {result.valence_score:+.2f}")
print(f"Sentiment: {result.sentiment}")
print(f"Positive: {result.positive_words}")
print(f"Negative: {result.negative_words}")

# Output:
# Score: +0.00
# Sentiment: neutral
# Positive: ['hopeful']
# Negative: ['anxious']

# Using encoder instance
enc = Encoder()
result = enc.analyze_text("The team achieved great success")

print(f"Analyzed: {result.analyzed_words} words")
print(f"Positive: {result.positive_pct:.1f}%")
print(f"Negative: {result.negative_pct:.1f}%")

# Convert to dict for JSON serialization
data = result.to_dict()
print(data)

TextAnalysis Properties

  • total_words - Total words extracted from the text
  • analyzed_words - Words found in the lexicon
  • positive_words - List of positive valence words
  • negative_words - List of negative valence words
  • neutral_words - List of neutral valence words
  • valence_score - Overall score (-1.0 to +1.0)
  • sentiment - Label: "positive", "negative", or "neutral"
  • to_dict() - Convert to a dictionary for JSON
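
As a quick sketch using only the properties above, a batch of texts can be analyzed and serialized in a few lines (the feedback strings and output here are made up):

import json
from Oyemi import analyze_text

feedback = [
    "Great mentorship and a supportive team",
    "Constant stress and unrealistic deadlines",
]

results = [analyze_text(text) for text in feedback]

# to_dict() makes each result JSON-serializable
print(json.dumps([r.to_dict() for r in results], indent=2))

# The sentiment labels alone are often enough for a quick tally
print([r.sentiment for r in results])
# e.g. ['positive', 'negative']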

Synonym Discovery

Find true synonyms using WordNet synset matching:

from Oyemi import find_synonyms

# Basic synonym search
syns = find_synonyms("happy", limit=5)
print(syns)
# ['felicitous', 'glad', 'well-chosen']

# Weighted synonyms (higher = closer match)
weighted = find_synonyms("fear", return_weighted=True, limit=5)
for syn, weight in weighted:
    print(f"  {syn}: {weight:.2f}")
#   dread: 1.00
#   fearfulness: 1.00
#   fright: 0.85

# With constraints
syns = find_synonyms(
    "run",
    limit=10,
    pos_lock=True,           # Same part of speech only
    abstractness_lock=True   # Don't mix abstract/concrete
)

Parameters

  • word - Word to find synonyms for
  • limit - Maximum synonyms to return (default: 20)
  • pos_lock - Only return the same part of speech (default: True)
  • abstractness_lock - Don't mix abstract/concrete (default: True)
  • return_weighted - Return (word, weight) tuples (default: False)
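
A common application is simple query expansion (see Search Enhancement under Use Cases); here is a minimal sketch in which the expand_query helper is illustrative, not part of the library:

from Oyemi import find_synonyms

def expand_query(terms, per_term=3):
    """Illustrative helper: append a few synonyms to each search term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(find_synonyms(term, limit=per_term))
    return expanded

print(expand_query(["happy", "fear"]))
# e.g. ['happy', 'felicitous', 'glad', 'well-chosen', 'fear', 'dread', 'fearfulness', ...]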

Antonym Detection

Check and retrieve antonyms for words:

from Oyemi import are_antonyms, get_antonyms

# Check if two words are antonyms
print(are_antonyms("happy", "sad"))    # True
print(are_antonyms("good", "bad"))     # True
print(are_antonyms("happy", "table"))  # False

# Get all antonyms for a word
antonyms = get_antonyms("happy")
print(antonyms)
# ['sad', 'unhappy', 'sorrowful', ...]

Semantic Similarity

Calculate similarity between words based on their semantic codes:

from Oyemi import semantic_similarity, find_similar

# Compare two words
sim = semantic_similarity("happy", "joyful")
print(f"happy <-> joyful: {sim:.2f}")  # 0.85

# Compare multiple pairs
pairs = [
    ("dog", "cat"),          # Same category
    ("dog", "computer"),     # Different categories
    ("layoff", "fired"),     # Related terms
]
for w1, w2 in pairs:
    print(f"{w1} <-> {w2}: {semantic_similarity(w1, w2):.2f}")

# Find similar words
similar = find_similar("happy", top_n=5)
print(similar)
# [('joyful', 0.85), ('glad', 0.82), ('cheerful', 0.78), ...]

Topic Clustering

Group words by their semantic category (superclass):

from Oyemi import cluster_by_superclass

# Words from employee feedback
words = [
    "manager", "boss", "supervisor",
    "salary", "bonus", "compensation",
    "layoff", "fired", "terminated",
    "stress", "anxiety", "fear",
]

# Cluster by semantic category
clusters = cluster_by_superclass(words)

for superclass, cluster_words in clusters.items():
    print(f"\n[{superclass}]")
    for w in cluster_words:
        print(f"  - {w}")

# Example output (the dict keys are 4-digit superclass codes; labels added here for readability):
# [0214]   (leadership/management terms)
#   - manager
#   - boss
#   - supervisor
# [0220]   (compensation terms)
#   - salary
#   - bonus
# ...

Distance Functions

Calculate semantic distance between words or codes:

from Oyemi import code_distance, word_distance, DistanceResult

# Distance between codes directly
dist = code_distance("0121-00003-1-2-2", "0121-00005-1-2-2")
print(f"Distance: {dist}")

# Distance between words (returns DistanceResult)
result = word_distance("happy", "sad")
print(f"Distance: {result.distance}")
print(f"Same superclass: {result.same_superclass}")
print(f"Same POS: {result.same_pos}")

Lexicon Storage

Access the underlying SQLite lexicon directly:

from Oyemi import get_storage, LexiconStorage

# Get storage instance
storage = get_storage()

# Query the lexicon directly
print(f"Total words: {storage.word_count}")
print(f"Total synsets: {storage.synset_count}")

# Low-level lookup
codes = storage.lookup("happy")
print(codes)

Error Handling

Oyemi provides specific exceptions for different error cases:

from Oyemi import (
    Encoder,
    OyemiError,           # Base exception
    UnknownWordError,     # Word not in lexicon
    LexiconNotFoundError, # Database file missing
    InvalidCodeError,     # Malformed code string
)

enc = Encoder()

try:
    codes = enc.encode("asdfghjkl")
except UnknownWordError as e:
    print(f"Word not found: {e.word}")

# Or use raise_on_unknown=False
codes = enc.encode("asdfghjkl", raise_on_unknown=False)
print(codes)  # []

API Reference

  • Encoder class - encode(), encode_parsed(), analyze_text(), has_word(), find_synonyms(), are_antonyms(), get_antonyms()
  • Distance functions - code_distance(), word_distance(), semantic_similarity(), find_similar()
  • Clustering - cluster_by_superclass()
  • Storage - get_storage(), LexiconStorage

Superclass Categories

Oyemi uses hierarchical semantic categories based on WordNet hypernyms:

  • 0100-0199: Psychological - emotions, mental states, cognition, feelings
  • 0200-0299: Social - people, organizations, roles, relationships
  • 0300-0399: Physical - objects, materials, substances, artifacts
  • 0400-0499: Actions - verbs, activities, processes, events
  • 1000-1999: Nature - animals, plants, natural phenomena
  • 2000-2999: Activities - work, sports, communication, motion
  • 3000-3999: Attributes - qualities, properties, states, conditions
  • 4000-4999: Relations - spatial, temporal, logical relations
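
These ranges make category filtering straightforward; the keep_psychological helper below is an illustrative sketch (not part of the library) built only on the documented encode_parsed() API:

from Oyemi import Encoder

enc = Encoder()

def keep_psychological(words):
    """Illustrative filter: keep words with at least one sense in 0100-0199."""
    kept = []
    for word in words:
        for code in enc.encode_parsed(word, raise_on_unknown=False):
            if 100 <= int(code.superclass) <= 199:
                kept.append(word)
                break
    return kept

print(keep_psychological(["fear", "table", "anxiety"]))
# e.g. ['fear', 'anxiety']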

Use Cases

  • Sentiment Analysis - Deterministic valence detection without ML models
  • Search Enhancement - Query expansion using synonyms and similar words
  • Topic Modeling - Group documents by semantic category
  • Text Classification - Use codes as features for classifiers (see the sketch after this list)
  • Lexical Databases - Build domain-specific semantic indexes
  • Educational Tools - Vocabulary analysis and language learning
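
For the text-classification idea, one simple representation is a bag of superclasses instead of a bag of words; the featurize helper below is an illustrative sketch (not part of the library) built on the documented encode_parsed() API:

from collections import Counter
from Oyemi import Encoder

enc = Encoder()

def featurize(text):
    """Illustrative featurizer: count semantic superclasses instead of raw tokens."""
    counts = Counter()
    for word in text.lower().split():
        parsed = enc.encode_parsed(word, raise_on_unknown=False)
        if parsed:
            counts[parsed[0].superclass] += 1
    return counts

print(featurize("The manager praised the team for great work"))
# e.g. Counter({'0214': 1, ...})  (exact codes depend on the lexicon)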

See Oyemi in Action

Check out the interactive tutorials with real code examples.
