Extract risk indicators and compliance concerns from financial documents using KeyNeg
Learn how to identify negative language patterns in SEC 10K filings from 69 publicly traded companies
We're using the Financial Q&A 10K dataset from Kaggle, containing 7,000 question-answer pairs extracted from SEC 10K filings of 69 companies including NVIDIA, Apple, Amazon, Goldman Sachs, and more.
| ticker | filing | question | context |
|---|---|---|---|
| NVDA | 2023_10K | What area did NVIDIA initially focus on? | Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields... |
| BAC | 2023_10K | What regulatory requirements affect operations? | We are subject to extensive regulation and supervision under federal and state banking laws... |
| AMZN | 2023_10K | What are key risk factors? | Our expansion places a significant strain on management, operational, financial and other resources... |
| JNJ | 2023_10K | What legal proceedings are disclosed? | The Company and certain of its subsidiaries are involved in various lawsuits and claims... |
| GS | 2023_10K | How does market volatility affect business? | Our businesses may be adversely affected by conditions in global financial markets... |
Download the dataset from Kaggle and load it using pandas. The "context" column contains the actual 10K filing text we'll analyze:
```python
import kagglehub
import pandas as pd

# Download the Financial Q&A 10K dataset
path = kagglehub.dataset_download('yousefsaeedian/financial-q-and-a-10k')

# Load the CSV file
df = pd.read_csv(f'{path}/Financial-QA-10k.csv')

# View the dataset structure
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Unique companies: {df['ticker'].nunique()}")
```
```
Dataset shape: (7000, 5)
Columns: ['question', 'answer', 'context', 'ticker', 'filing']
Unique companies: 69
```
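Before sampling, it can help to check how many contexts each ticker contributes, so thinly represented companies are not over-interpreted later. A minimal sketch on a mock frame mirroring the dataset's columns (`mock_df` and its rows are illustrative stand-ins, not real filing text):

```python
import pandas as pd

# Mock frame with the same columns as the Kaggle CSV (illustrative only;
# the real dataset has 7,000 rows across 69 tickers)
mock_df = pd.DataFrame({
    "ticker": ["NVDA", "NVDA", "BAC", "AMZN"],
    "filing": ["2023_10K"] * 4,
    "question": ["q1", "q2", "q3", "q4"],
    "answer": ["a1", "a2", "a3", "a4"],
    "context": ["c1", "c2", "c3", "c4"],
})

# Contexts available per company
per_ticker = mock_df["ticker"].value_counts()
print(per_ticker.to_dict())
```

The same `value_counts()` call on the real `df` shows whether a 500-row sample will leave some tickers with only one or two contexts.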
KeyNeg's sentiment labels include categories relevant to financial text, such as compliance issues, technical debt, and lack of transparency:
```python
from keyneg import KeyNeg

# Initialize the analyzer
kn = KeyNeg()

# KeyNeg includes financial-relevant sentiment labels:
# - compliance issues
# - technical debt
# - lack of transparency
# - safety concerns
# - ethical violations
# - downsizing
# - bureaucracy
# - and 90+ more...
print("KeyNeg initialized for financial document analysis")
```
```
KeyNeg initialized for financial document analysis
```
Let's analyze a single context from a 10K filing to understand the output:
```python
# Sample context from a bank's 10K filing
context = """We are subject to extensive regulation and supervision
under federal and state banking laws. Failure to comply with
applicable regulatory requirements could result in significant
penalties, restrictions on business activities, and reputational harm."""

# Analyze the context
result = kn.analyze(context)

# View the results
print("Top Sentiment:", result['top_sentiment'])
print("Negativity Score:", result['negativity_score'])
print("All Sentiments:", [s[0] for s in result['sentiments']])
print("Categories:", result['categories'])
```
```
Top Sentiment: compliance issues
Negativity Score: 0.38
All Sentiments: ['compliance issues', 'lack of transparency']
Categories: ['policy_systemic_issues', 'customer_market_discontent']
```
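The result dictionary can feed a simple screening rule. A minimal sketch, assuming the output shape shown above; the 0.35 threshold and the per-label scores in the mock result are arbitrary illustrative values, not KeyNeg defaults:

```python
def flag_high_risk(result, threshold=0.35):
    """Flag a context whose negativity score exceeds a screening threshold."""
    if result["negativity_score"] < threshold:
        return None
    return {
        "top_sentiment": result["top_sentiment"],
        "score": result["negativity_score"],
        "labels": [name for name, _ in result["sentiments"]],
    }

# Mock result mimicking the analyzer output shown above
# (the 0.61 / 0.44 label scores are made up for illustration)
result = {
    "top_sentiment": "compliance issues",
    "negativity_score": 0.38,
    "sentiments": [("compliance issues", 0.61), ("lack of transparency", 0.44)],
    "categories": ["policy_systemic_issues", "customer_market_discontent"],
}

print(flag_high_risk(result))
```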
Analyze a sample of 500 10K contexts to identify common negative language patterns across companies:
```python
from collections import Counter

# Sample 500 contexts for analysis
sample_df = df.sample(n=500, random_state=42)
contexts = sample_df['context'].tolist()

# Analyze all contexts in batch
results = kn.analyze_batch(contexts)

# Count contexts with negative sentiment
negative_contexts = [r for r in results if r['sentiments']]
print(f"Contexts with negative sentiment: {len(negative_contexts)}/500")

# Aggregate all sentiments
all_sentiments = []
for r in results:
    if r['sentiments']:
        sentiment_names = [s[0] for s in r['sentiments']]
        all_sentiments.extend(sentiment_names)

# Display top sentiments
sentiment_counts = Counter(all_sentiments)
print("\nTop 10 Negative Sentiments in 10K Filings:")
for sentiment, count in sentiment_counts.most_common(10):
    print(f"  {sentiment}: {count}")
```
```
Contexts with negative sentiment: 148/500

Top 10 Negative Sentiments in 10K Filings:
  technical debt: 46
  compliance issues: 40
  lack of transparency: 14
  no growth opportunities: 14
  safety concerns: 12
  undervalued: 11
  bureaucracy: 9
  false advertising: 8
  downsizing: 7
  unfair compensation: 6
```
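Raw counts are easier to compare across samples of different sizes when expressed as shares of flagged contexts. A small sketch with hard-coded counts mirroring the run above:

```python
from collections import Counter

flagged = 148  # contexts with at least one negative sentiment, from the run above
sentiment_counts = Counter({
    "technical debt": 46,
    "compliance issues": 40,
    "lack of transparency": 14,
})

# Express each sentiment as the share of flagged contexts that mention it
shares = {name: count / flagged for name, count in sentiment_counts.items()}
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
# technical debt: 31.1%
# compliance issues: 27.0%
# lack of transparency: 9.5%
```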
Group results by ticker symbol to identify which companies have the most negative language in their 10K filings:
```python
# Track negativity by company
tickers = sample_df['ticker'].tolist()
company_data = {}
for i, r in enumerate(results):
    ticker = tickers[i]
    if r['sentiments']:
        if ticker not in company_data:
            company_data[ticker] = {'count': 0, 'sentiments': [], 'total_neg': 0}
        company_data[ticker]['count'] += 1
        company_data[ticker]['total_neg'] += r['negativity_score']
        sentiment_names = [s[0] for s in r['sentiments']]
        company_data[ticker]['sentiments'].extend(sentiment_names)

# Calculate average negativity and sort
company_scores = []
for ticker, data in company_data.items():
    if data['count'] >= 3:
        avg_neg = data['total_neg'] / data['count']
        top_issues = Counter(data['sentiments']).most_common(3)
        company_scores.append((ticker, avg_neg, data['count'], top_issues))
company_scores.sort(key=lambda x: x[1], reverse=True)

# Display top companies
print("Companies with Most Negative 10K Language:")
for ticker, avg_neg, count, top_issues in company_scores[:10]:
    issues = ', '.join([f'{i[0]}({i[1]})' for i in top_issues])
    print(f"  {ticker}: {avg_neg:.2f} ({count} contexts) - {issues}")
```
```
Companies with Most Negative 10K Language:
  BAC: 0.37 (3 contexts) - compliance issues(1), technical debt(1), lack of transparency(1)
  AXP: 0.36 (4 contexts) - technical debt(3), undervalued(1), unfair compensation(1)
  EFX: 0.36 (4 contexts) - technical debt(3), compliance issues(1), bureaucracy(1)
  LVS: 0.36 (3 contexts) - technical debt(1), unrealistic deadlines(1), compliance issues(1)
  V: 0.36 (5 contexts) - compliance issues(4), bureaucracy(1), technical debt(1)
  GS: 0.35 (4 contexts) - undervalued(2), technical debt(2), compliance issues(1)
  AMZN: 0.35 (3 contexts) - no growth opportunities(2), technical debt(1)
  LLY: 0.35 (5 contexts) - compliance issues(2), technical debt(1), false advertising(1)
  JNJ: 0.35 (4 contexts) - technical debt(2), disengagement(1), skills obsolescence(1)
  GILD: 0.35 (4 contexts) - safety concerns(2), compliance issues(2), false advertising(1)
```
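The same per-company aggregation can also be expressed as a pandas groupby, which scales more naturally to larger samples. A sketch on mocked per-context scores (the tickers and values are illustrative, not from the run above):

```python
import pandas as pd

# Mocked per-context negativity scores -- illustrative values only
scores = pd.DataFrame({
    "ticker": ["BAC", "BAC", "BAC", "V", "V"],
    "negativity_score": [0.40, 0.36, 0.35, 0.37, 0.35],
})

summary = (scores.groupby("ticker")["negativity_score"]
           .agg(avg_neg="mean", contexts="count")
           .query("contexts >= 3")          # same minimum-sample rule as above
           .sort_values("avg_neg", ascending=False))
print(summary)
```

With only two contexts, V is dropped by the `contexts >= 3` filter, just as the loop above skips companies with fewer than three flagged contexts.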
The aggregate sentiment counts and the company-level ranking above suggest a few patterns:
- Financial services companies (BAC, V, GS) show heavy compliance-related language, reflecting the regulated nature of the industry.
- "Technical debt" appeared in 46 contexts: companies frequently disclose technology infrastructure challenges and legacy-system issues.
- Pharmaceutical companies (GILD, LLY, JNJ) surface "safety concerns" as a top issue, as expected given FDA oversight and product liability.
KeyNeg can help with a range of financial document analysis tasks:

- Identify potential risks and red flags in 10K filings before they become material issues.
- Screen companies for negative language patterns during M&A or investment analysis.
- Monitor disclosure language for compliance concerns across your portfolio.
- Compare negative sentiment patterns across competitors in the same industry.
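The screening tasks above can share one small helper. A sketch of a portfolio-screening function; `stub_analyze` is a keyword-matching stand-in so the example runs without KeyNeg installed, and in practice you would pass `kn.analyze` instead:

```python
def screen_contexts(contexts, analyze, threshold=0.35):
    """Return (index, top_sentiment, score) for contexts over the threshold.

    `analyze` is any callable returning a KeyNeg-style dict with
    'top_sentiment' and 'negativity_score' keys (e.g. kn.analyze).
    """
    flagged = []
    for i, text in enumerate(contexts):
        result = analyze(text)
        if result["negativity_score"] >= threshold:
            flagged.append((i, result["top_sentiment"], result["negativity_score"]))
    return sorted(flagged, key=lambda item: -item[2])

# Stub analyzer for demonstration: scores a context by whether it
# mentions a risk keyword (NOT how KeyNeg works internally)
def stub_analyze(text):
    risky = "penalties" in text or "lawsuits" in text
    return {"top_sentiment": "compliance issues" if risky else None,
            "negativity_score": 0.4 if risky else 0.1}

contexts = ["Failure to comply could result in significant penalties.",
            "We expanded into several computationally intensive fields."]
print(screen_contexts(contexts, stub_analyze))
# [(0, 'compliance issues', 0.4)]
```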
Use KeyNeg to extract risk indicators and compliance concerns from SEC filings, earnings calls, and financial reports.
Dataset: Financial Q&A 10K by Yousef Saeedian on Kaggle (7,000 Q&A pairs)