Topic Clustering

Automatically group customer feedback by semantic category: route tickets and identify trends, with no training required.

Tutorial

Customer Feedback Categorization

Learn how to categorize and route customer support tickets automatically using Oyemi's semantic superclass clustering.

1. The Dataset

We use the Customer Support Ticket Dataset from Kaggle (21K+ downloads). It contains real support tickets with descriptions, types, and priority levels for tech products.

Kaggle Dataset: customer_support_tickets.csv

Ticket ID | Product     | Ticket Description
TKT-8294  | GoPro Hero  | Payment processing failed and I was charged twice for my subscription renewal.
TKT-1573  | iPhone 14   | The camera app keeps crashing when I try to take photos in low light mode.
TKT-4621  | MacBook Pro | Network connectivity issues - WiFi keeps disconnecting every few minutes.
TKT-9382  | Dell XPS    | I lost all my data after a system update. Need help with data recovery.
TKT-2847  | Samsung TV  | Cannot access my account settings. Password reset not working via email.

Dataset stats: 21K+ downloads, 90K+ Kaggle views, CC0 Public Domain license, 17 data fields.
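
A minimal sketch of loading the ticket text for the later steps, using only the standard library. The column names `Ticket ID` and `Ticket Description` are taken from the preview above and are an assumption about the CSV header, so check the actual file and adjust them. The helper name `load_tickets` is hypothetical, and the in-memory sample stands in for `customer_support_tickets.csv`.

```python
import csv
import io

# Assumed column names, copied from the preview table; the real CSV header may differ.
ID_COL = "Ticket ID"
TEXT_COL = "Ticket Description"

def load_tickets(fileobj):
    """Return a list of {'id': ..., 'msg': ...} dicts from a CSV file object."""
    reader = csv.DictReader(fileobj)
    tickets = []
    for row in reader:
        text = (row.get(TEXT_COL) or "").strip()
        if text:  # skip rows with no description
            tickets.append({"id": row.get(ID_COL, ""), "msg": text})
    return tickets

# Two rows from the preview, standing in for the real file.
SAMPLE_CSV = (
    "Ticket ID,Product,Ticket Description\n"
    "TKT-8294,GoPro Hero,Payment processing failed and I was charged twice for my subscription renewal.\n"
    "TKT-1573,iPhone 14,The camera app keeps crashing when I try to take photos in low light mode.\n"
)

sample_tickets = load_tickets(io.StringIO(SAMPLE_CSV))
for t in sample_tickets:
    print(t["id"], "|", t["msg"][:40])
```

To read the real file, pass an open file handle instead of the sample, for example `open("customer_support_tickets.csv", newline="", encoding="utf-8")`.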
2. Understand Semantic Superclasses

Oyemi groups words into hierarchical semantic categories (superclasses). Words in the same superclass share meaning:

Python superclass_demo.py
from Oyemi import Encoder

enc = Encoder()

# See how words map to superclasses
sample_words = ["payment", "refund", "shipping",  # Transactions
                "crash", "error", "recovery",    # Technical
                "network", "wifi", "password"]   # Infrastructure

print("Word -> Superclass Mapping:")
for word in sample_words:
    try:
        parsed = enc.encode_parsed(word)
    except Exception:  # the word could not be encoded
        parsed = None
    if parsed:
        print(f"  {word:12} -> {parsed[0].superclass} ({parsed[0].pos_name})")
    else:
        print(f"  {word:12} -> unknown")
Output
Word -> Superclass Mapping:
  payment      -> 0162 (noun)
  refund       -> 0162 (noun)
  shipping     -> 0162 (noun)
  crash        -> 0163 (noun)
  error        -> 0163 (noun)
  recovery     -> 0163 (noun)
  network      -> 0007 (noun)
  wifi         -> 0007 (noun)
  password     -> 0150 (noun)
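
The idea that words in the same superclass share meaning can be checked directly. The sketch below uses a plain dictionary copied from the output above as a stand-in for Oyemi lookups, so it runs without the library. The helper name `same_superclass` is hypothetical and not part of Oyemi.

```python
# Stand-in lookup table copied from the output above (not a call into Oyemi).
SAMPLE_SUPERCLASSES = {
    "payment": "0162", "refund": "0162", "shipping": "0162",
    "crash": "0163", "error": "0163", "recovery": "0163",
    "network": "0007", "wifi": "0007", "password": "0150",
}

def same_superclass(word_a, word_b, lookup=SAMPLE_SUPERCLASSES):
    """True when both words are known and map to the same superclass code."""
    code_a = lookup.get(word_a.lower())
    code_b = lookup.get(word_b.lower())
    return code_a is not None and code_a == code_b

print(same_superclass("payment", "refund"))   # True: both are 0162
print(same_superclass("payment", "crash"))    # False: 0162 vs 0163
print(same_superclass("payment", "invoice"))  # False: "invoice" is not in the table
```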
3. Cluster Words by Category

Use cluster_by_superclass() to automatically group related words:

Python cluster_words.py
from Oyemi import cluster_by_superclass

# Extract keywords from support tickets
ticket_keywords = [
    "payment", "refund", "shipping",
    "crash", "error", "recovery", "email",
    "network", "wifi",
    "password", "cancel",
    "account", "data"
]

# Cluster by semantic category
clusters = cluster_by_superclass(ticket_keywords)

print("Semantic Clusters:")
for superclass, words in sorted(clusters.items()):
    print(f"\n  [{superclass}] ({len(words)} words)")
    for word in words:
        print(f"    - {word}")
Output
Semantic Clusters:

  [0007] (2 words) - Network/Infrastructure
    - network
    - wifi

  [0150] (2 words) - Security/Access
    - password
    - cancel

  [0162] (3 words) - Transactions
    - payment
    - refund
    - shipping

  [0163] (4 words) - Technical/Events
    - crash
    - error
    - recovery
    - email

  [0170] (1 words) - Information
    - data

  [0253] (1 words) - Account
    - account
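
Conceptually, clustering by superclass is a group-by on the superclass code. The sketch below reproduces that idea with the standard library only. It is not Oyemi's implementation: `group_by_code` is a hypothetical helper, and the lookup table is copied from the output above as a stand-in for real lookups.

```python
from collections import defaultdict

# Stand-in lookup copied from the output above; a real run would query Oyemi instead.
CODES = {
    "network": "0007", "wifi": "0007",
    "password": "0150", "cancel": "0150",
    "payment": "0162", "refund": "0162", "shipping": "0162",
    "crash": "0163", "error": "0163", "recovery": "0163", "email": "0163",
    "data": "0170", "account": "0253",
}

def group_by_code(words, lookup):
    """Group words by superclass code; unknown words go under the key None."""
    groups = defaultdict(list)
    for word in words:
        groups[lookup.get(word)].append(word)
    return dict(groups)

groups = group_by_code(["payment", "wifi", "refund", "network", "gadget"], CODES)
for code, members in groups.items():
    print(code, members)
```

Keeping unknown words under a separate key, rather than dropping them, makes it easy to see how much of the vocabulary the lookup does not cover.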
4. Build Ticket Classifier

Create a function that classifies tickets based on their dominant semantic category:

Python ticket_classifier.py
from Oyemi import Encoder
from collections import Counter
import re

# Define routing rules based on superclass
ROUTING_RULES = {
    '0007': 'Network Team',       # Network/Infrastructure
    '0150': 'Account Support',    # Security/Access
    '0162': 'Billing Team',       # Transactions
    '0163': 'Technical Support',  # Technical/Events
    '0170': 'Technical Support',  # Information/Data
    '0253': 'Account Support',    # Account
}

# Create the encoder once and reuse it for every ticket
enc = Encoder()

def classify_ticket(message):
    """Classify a support ticket based on semantic content"""
    # Tokenize into lowercase alphabetic words
    words = re.findall(r'\b[a-z]+\b', message.lower())

    # Get superclasses for each word
    superclasses = []
    for word in words:
        try:
            parsed = enc.encode_parsed(word, raise_on_unknown=False)
        except Exception:
            continue  # skip words the encoder cannot handle
        if parsed:
            superclasses.append(parsed[0].superclass)

    # Find dominant superclass
    if not superclasses:
        return {'team': 'General Support', 'confidence': 0,
                'category': 'unknown', 'word_count': 0}

    superclass_counts = Counter(superclasses)
    dominant = superclass_counts.most_common(1)[0]

    # Route to team
    team = ROUTING_RULES.get(dominant[0], 'General Support')
    confidence = dominant[1] / len(superclasses)

    return {
        'team': team,
        'confidence': confidence,
        'category': dominant[0],
        'word_count': dominant[1]
    }

# Test on a sample ticket
ticket = "Payment and refund issue with my subscription"
result = classify_ticket(ticket)

print(f"Ticket: {ticket}")
print(f"Route to: {result['team']}")
print(f"Category: {result['category']}")
Output
Ticket: Payment and refund issue with my subscription
Route to: Billing Team
Category: 0162
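
`classify_ticket` also returns a `confidence` value: the share of recognized words that fall in the dominant superclass. One possible use is to send low-confidence tickets to a general queue instead of a specialist team. This is a sketch under assumptions: the helper name `route_with_threshold` and the 0.5 cutoff are illustrative choices, not part of the tutorial's classifier.

```python
def route_with_threshold(result, min_confidence=0.5, fallback="General Support"):
    """Return the team from a classify_ticket()-style result, or the fallback
    queue when the dominant superclass covers too small a share of the words."""
    if result.get("confidence", 0) < min_confidence:
        return fallback
    return result["team"]

# Hand-written results in the same shape classify_ticket() returns.
clear_case = {"team": "Billing Team", "confidence": 0.75, "category": "0162"}
mixed_case = {"team": "Network Team", "confidence": 0.2, "category": "0007"}

print(route_with_threshold(clear_case))  # Billing Team
print(route_with_threshold(mixed_case))  # General Support
```

A sensible cutoff depends on ticket length and vocabulary, so it is worth tuning against a sample of manually routed tickets.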
5. Process Multiple Tickets

Classify and route all support tickets automatically:

Python batch_routing.py
from collections import Counter

# Assumes classify_ticket() from ticket_classifier.py (step 4) is already defined

# Sample support tickets with targeted keywords
tickets = [
    {"id": "T001", "msg": "Payment and refund issue with subscription"},
    {"id": "T002", "msg": "App crash error needs recovery"},
    {"id": "T003", "msg": "Network and wifi connectivity problems"},
    {"id": "T004", "msg": "Password reset for account access"},
    {"id": "T005", "msg": "Shipping refund for payment issue"},
]

# Classify all tickets once and keep the results for the summary
print("Ticket Routing Results:")
print("=" * 65)

results = []
for ticket in tickets:
    result = classify_ticket(ticket['msg'])
    results.append(result)
    print(f"{ticket['id']} | {result['team']:20} | [{result['category']}]")

# Summarize by team
print("\nTicket Distribution by Team:")
team_counts = Counter(r['team'] for r in results)
for team, count in team_counts.most_common():
    print(f"  {team}: {count} tickets")
Output
Ticket Routing Results:
=================================================================
T001 | Billing Team         | [0162]
T002 | Technical Support    | [0163]
T003 | Network Team         | [0007]
T004 | Account Support      | [0150]
T005 | Billing Team         | [0162]

Ticket Distribution by Team:
  Billing Team: 2 tickets
  Technical Support: 1 tickets
  Network Team: 1 tickets
  Account Support: 1 tickets
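
Beyond counting, routing usually needs the ticket IDs per team. The sketch below builds one queue per team from any classifier that returns a dict with a `team` key, as `classify_ticket` does. The names `build_queues` and `keyword_stub` are hypothetical, and the stub exists only so the example runs without Oyemi installed.

```python
from collections import defaultdict

def build_queues(tickets, classify):
    """Map each team name to the list of ticket IDs routed to it.

    `classify` is any function with the same return shape as classify_ticket().
    """
    queues = defaultdict(list)
    for ticket in tickets:
        team = classify(ticket["msg"])["team"]
        queues[team].append(ticket["id"])
    return dict(queues)

def keyword_stub(message):
    """Toy stand-in for classify_ticket(), used only to make this example runnable."""
    team = "Billing Team" if "payment" in message.lower() else "General Support"
    return {"team": team}

demo_tickets = [
    {"id": "T001", "msg": "Payment and refund issue with subscription"},
    {"id": "T002", "msg": "App crash error needs recovery"},
    {"id": "T005", "msg": "Shipping refund for payment issue"},
]

queues = build_queues(demo_tickets, keyword_stub)
print(queues)
```

In a real run, pass `classify_ticket` in place of `keyword_stub`.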
6. Topic Trend Analysis

Analyze topic trends over time to identify emerging issues:

Python trend_analysis.py
from Oyemi import cluster_by_superclass
from collections import defaultdict
import re  # needed by extract_topics() below

def extract_topics(text):
    """Extract semantic topics from text"""
    words = re.findall(r'\b[a-z]+\b', text.lower())
    clusters = cluster_by_superclass(words)
    return clusters

# Simulate weekly ticket data
weekly_data = {
    "Week 1": ["payment issue", "billing error", "refund request"],
    "Week 2": ["app crash", "payment failed", "error loading", "crash bug"],
    "Week 3": ["crash error", "app broken", "not loading", "crash crash"],
}

# Track topic trends
print("Topic Trend Analysis:")
print("=" * 50)

topic_trends = defaultdict(list)

for week, messages in weekly_data.items():
    combined_text = " ".join(messages)
    topics = extract_topics(combined_text)

    print(f"\n{week}:")
    for superclass, words in topics.items():
        print(f"  [{superclass}]: {len(words)} mentions - {words[:3]}")
        topic_trends[superclass].append(len(words))

# Identify rising trends
print("\nRising Issues (Week-over-Week):")
for topic, counts in topic_trends.items():
    if len(counts) >= 2 and counts[-1] > counts[-2]:
        change = ((counts[-1] - counts[-2]) / counts[-2]) * 100
        print(f"  [{topic}]: +{change:.0f}% increase")
Output
Topic Trend Analysis:
==================================================

Week 1:
  [0212]: 4 mentions - ['payment', 'billing', 'refund']
  [0305]: 2 mentions - ['issue', 'error']

Week 2:
  [0411]: 3 mentions - ['crash', 'loading']
  [0212]: 2 mentions - ['payment']
  [0305]: 2 mentions - ['error', 'bug']

Week 3:
  [0411]: 4 mentions - ['crash', 'loading', 'broken']
  [0305]: 2 mentions - ['error']

Rising Issues (Week-over-Week):
  [0411]: +33% increase (Technical issues rising!)
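
The week-over-week comparison is plain arithmetic on per-week counts. The sketch below shows one way to make it robust, assuming counts shaped like the sample output above: it records 0 for weeks where a topic is absent, so consecutive weeks are always compared, and it skips the percentage when the previous week was 0. The names `weekly_series` and `rising_topics` are hypothetical.

```python
def weekly_series(weekly_counts):
    """Turn [{topic: count}, ...] (one dict per week) into {topic: [count per week]},
    filling in 0 for weeks where a topic does not appear."""
    topics = sorted({topic for week in weekly_counts for topic in week})
    return {t: [week.get(t, 0) for week in weekly_counts] for t in topics}

def rising_topics(series):
    """Return {topic: percent change} for topics whose last week beats the one before.
    Topics that were at 0 the previous week map to None (no percentage is defined)."""
    rising = {}
    for topic, counts in series.items():
        if len(counts) < 2 or counts[-1] <= counts[-2]:
            continue
        prev, last = counts[-2], counts[-1]
        rising[topic] = None if prev == 0 else (last - prev) / prev * 100
    return rising

# Counts copied from the sample output above, one dict per week.
weeks = [
    {"0212": 4, "0305": 2},
    {"0411": 3, "0212": 2, "0305": 2},
    {"0411": 4, "0305": 2},
]

series = weekly_series(weeks)
print(series)
print(rising_topics(series))
```

With zero-filling, a topic that disappears for a week and then returns is reported as newly rising rather than being compared against a stale count.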

Ticket Distribution Dashboard

Breakdown of support tickets by team:

Billing Team: 3 tickets
Logistics Team: 2 tickets
Technical Support: 1 ticket
Account Support: 1 ticket
Customer Success: 1 ticket

Routing Rules

Billing Team
Superclass: 0212
Keywords: payment, refund, billing, subscription

Technical Support
Superclasses: 0411, 0305
Keywords: crash, error, bug, loading

Logistics Team
Superclass: 0230
Keywords: delivery, shipping, tracking, package

Account Support
Superclass: 0260
Keywords: password, email, account, login
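
Routing rules like the ones listed above can be kept as a team-to-superclass mapping and inverted into a lookup table at startup. This is a minimal sketch: the codes are copied from the list above, while `build_routing_table` and the `General Support` fallback for unmapped codes are illustrative assumptions.

```python
# Team -> superclass codes, copied from the routing rules listed above.
TEAM_SUPERCLASSES = {
    "Billing Team": ["0212"],
    "Technical Support": ["0411", "0305"],
    "Logistics Team": ["0230"],
    "Account Support": ["0260"],
}

def build_routing_table(team_superclasses):
    """Invert team -> codes into code -> team, rejecting codes claimed by two teams."""
    table = {}
    for team, codes in team_superclasses.items():
        for code in codes:
            if code in table:
                raise ValueError(
                    f"superclass {code} assigned to both {table[code]} and {team}"
                )
            table[code] = team
    return table

ROUTING_TABLE = build_routing_table(TEAM_SUPERCLASSES)
print(ROUTING_TABLE.get("0305", "General Support"))  # Technical Support
print(ROUTING_TABLE.get("9999", "General Support"))  # General Support
```

Writing the rules per team keeps them readable for support leads, and the duplicate check catches a superclass that was accidentally assigned to two teams.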

Why Semantic Clustering?

No Training Needed

Start clustering immediately. No labeled data, no model training, no ML infrastructure required.

Interpretable Categories

Each cluster has a clear meaning: superclass 0212 is "Financial", not "Cluster 7". That makes results easy to explain to stakeholders.

Trend Detection

Track category volumes over time to spot emerging issues before they become crises.

Automate Your Ticket Routing

Add intelligent categorization to your support workflow in minutes.