NLP-Cyber-Harm-Detection

NLP Fraud/Scam Detection Baseline Models


A comprehensive baseline implementation for fraud and scam detection using Natural Language Processing techniques. This project provides multiple approaches, from simple keyword-based detection to advanced BERT-based classification.

๐Ÿ“ Repository

🔗 GitHub Repository: https://github.com/RockENZO/NLP-Cyber-Harm-Detection.git

🎯 Project Overview

This project implements baseline models for detecting fraudulent content (scams, phishing, spam) in text data, based on analysis of existing similar projects. It includes traditional ML baselines, BERT- and DistilBERT-based classifiers, an LLM-powered reasoning pipeline, and interactive demo tools (see Models Implemented below).

⚡ DistilBERT Model Highlights

NEW: The project now includes a production-ready DistilBERT model with significant advantages: it runs roughly 60% faster than BERT while retaining about 97% of its performance (see Expected Performance below).

The DistilBERT model is trained for multiclass classification, providing granular fraud-type detection rather than just binary fraud/legitimate classification.
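
As a quick illustration, here is a minimal sketch of loading the saved model with Hugging Face Transformers and scoring one message. The model/tokenizer paths follow the project structure below; the mapping from class index to class name depends on how the training notebook encoded the labels:

import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Paths per the repository layout; adjust if your model lives elsewhere
tokenizer = DistilBertTokenizerFast.from_pretrained('models/distilbert_tokenizer')
model = DistilBertForSequenceClassification.from_pretrained('models/distilbert_model')
model.eval()

inputs = tokenizer("You won $10,000! Send fee to claim",
                   return_tensors='pt', truncation=True, max_length=128)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
print(probs.argmax().item(), probs.max().item())   # predicted class index and confidence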

๐Ÿ“ Project Structure

├── README.md                          # This comprehensive documentation
├── requirements.txt                   # Python dependencies
├── final_fraud_detection_dataset.csv  # Training dataset (Git LFS)
├── models/                            # Saved trained models
│   ├── model.zip                      # Compressed model bundle (excluded from git)
│   ├── bert_model/                    # Trained BERT model files
│   ├── bert_tokenizer/                # BERT tokenizer files
│   ├── distilbert_model/              # Trained DistilBERT model files (60% faster)
│   └── distilbert_tokenizer/          # DistilBERT tokenizer files
├── training/                          # Training scripts and notebooks
│   ├── baseline_fraud_detection.py    # Traditional ML baseline models
│   ├── bert_fraud_detection.py        # BERT-based classifier
│   ├── fraud_detection_baseline.ipynb # Interactive Jupyter notebook
│   └── kaggle_fraud_detection.ipynb   # Kaggle-optimized training notebook
├── demos/                             # Demo and testing tools
│   ├── fraud_detection_demo.py        # Full-featured demo script
│   ├── fraud_detection_demo.ipynb     # Interactive demo notebook
│   └── quick_demo.py                  # Quick verification script
├── reasoning/                         # 🧠 AI-powered reasoning pipeline
│   ├── GPT2_Fraud_Reasoning.ipynb     # GPT2-based reasoning analysis
│   └── KaggleLLMsReasoning.ipynb      # Local reasoning notebook
├── docs/                              # Documentation
│   └── nlp_terms_explanation.md       # NLP concepts explanation
├── runs/                              # Training run outputs and analysis results
│   ├── fraud_analysis_results_20250916_155231.csv
│   ├── fraud-detection-kaggle-training-bert-run.ipynb
│   ├── gpt2_fraud_analysis_20250917_034015.csv
│   ├── LLMsReasoningResultVisualization.ipynb
│   ├── MultipleLLMsReasoning(small-models).ipynb
│   └── LLMsStats/                     # LLM performance comparison charts
│       ├── llm_category_heatmap.png
│       ├── llm_comparison_table.csv
│       ├── llm_performance_comparison.png
│       ├── llm_quality_radar.png
│       ├── llm_size_performance.png
│       ├── llm_speed_quality_scatter.png
│       ├── llm_model_size_comparison.png  # Model size vs performance charts
│       └── llm_speed_quality_bubble.png   # Speed vs quality bubble chart
├── .gitattributes                     # Git LFS configuration
├── .gitignore                         # Git ignore rules
└── .git/                              # Git repository

🚀 Quick Start

Option 1: Use a Model Trained on Kaggle

If you have already trained a model on Kaggle:

  1. Install Dependencies
    pip install torch transformers pandas numpy matplotlib seaborn jupyter
    
  2. Quick Test Your Model
    python demos/quick_demo.py
    
  3. Interactive Demo Notebook
    jupyter notebook demos/fraud_detection_demo.ipynb
    
  4. Full Demo Script
    python demos/fraud_detection_demo.py
    
  5. Local AI Reasoning
    # Upload reasoning/KaggleLLMsReasoning.ipynb to Kaggle
    # Enable GPU accelerator
    # Run all cells for fraud detection + AI explanations
    # Download results - no API costs
    

📊 LLM Performance Analysis: Check runs/LLMsStats/ for performance comparisons.

Option 2: Train from Scratch

  1. Install Dependencies
    pip install -r requirements.txt
    
  2. Run Traditional ML Baselines
    python training/baseline_fraud_detection.py
    
  3. Run BERT Baseline (requires more computational resources)
    python training/bert_fraud_detection.py
    
Kaggle Training

  1. Upload final_fraud_detection_dataset.csv to Kaggle
  2. Create a new notebook and copy the code from runs/fraud-detection-kaggle-training-bert-run.ipynb
  3. Enable GPU accelerator for fast BERT training
  4. Download the trained models from Kaggle output
  5. Use the demo scripts to test your trained model

Note: The dataset is stored with Git LFS due to its size (~158MB). Clone the repository and run git lfs pull to download the full dataset. Large model files like model.zip are excluded from git to keep the repository size manageable.
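
A typical clone then looks like this (standard Git LFS workflow; requires git-lfs to be installed):

git lfs install
git clone https://github.com/RockENZO/NLP-Cyber-Harm-Detection.git
cd NLP-Cyber-Harm-Detection
git lfs pull    # fetch the full ~158MB dataset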

📊 LLM Performance Analysis Results

The runs/LLMsStats/ directory contains performance comparisons of the LLMs evaluated for fraud-reasoning tasks (charts plus a summary CSV).

📊 Models Implemented

1. Traditional ML Baselines (training/baseline_fraud_detection.py)

2. BERT-Based Classifier (training/bert_fraud_detection.py)

3. DistilBERT-Based Classifier (training/kaggle_fraud_detection.ipynb)

4. Kaggle Training Notebook (runs/fraud-detection-kaggle-training-bert-run.ipynb)

5. AI-Powered Reasoning Pipeline (reasoning/)

🤖 LLM Model Selection for Reasoning

LLM Model Size vs Performance: see runs/LLMsStats/llm_model_size_comparison.png

LLM Speed vs Quality: see runs/LLMsStats/llm_speed_quality_bubble.png

🎮 Demo and Testing Tools

Once you have a trained model, use these tools to test and demonstrate fraud detection capabilities:

1. fraud_detection_demo.ipynb - Interactive demo notebook

2. fraud_detection_demo.py - Full-featured demo script

3. quick_demo.py - Quick verification script

🎯 Fraud Types Detected

Your trained model can detect these 9 classes (a mapping sketch follows the list):

  1. legitimate - Normal, safe messages
  2. phishing - Attempts to steal credentials/personal info
  3. tech_support_scam - Fake technical support
  4. reward_scam - Fake prizes/lottery winnings
  5. job_scam - Fraudulent employment opportunities
  6. sms_spam - Unwanted promotional messages
  7. popup_scam - Fake security alerts
  8. refund_scam - Fake refund/billing notifications
  9. ssn_scam - Social Security number theft attempts
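
If you need the class names programmatically, a simple lookup table works. The index order below is illustrative only; it must match whatever label encoding the training notebook actually used:

# Illustrative index order; align with the training label encoding
ID2LABEL = {
    0: 'legitimate', 1: 'phishing', 2: 'tech_support_scam',
    3: 'reward_scam', 4: 'job_scam', 5: 'sms_spam',
    6: 'popup_scam', 7: 'refund_scam', 8: 'ssn_scam',
}

def is_fraud(class_id: int) -> bool:
    # Everything except 'legitimate' counts as fraud
    return ID2LABEL[class_id] != 'legitimate'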

💡 Demo Usage Examples

Single Prediction

from demos.fraud_detection_demo import FraudDetectionDemo

demo = FraudDetectionDemo()
result = demo.predict_single("Your account has been compromised! Click here now!")
print(f"Prediction: {result['predicted_class']}")
print(f"Confidence: {result['confidence']:.4f}")
print(f"Is Fraud: {result['is_fraud']}")

Batch Prediction

texts = [
    "Meeting at 3 PM tomorrow",
    "URGENT: Verify your SSN now!",
    "You won $10,000! Send fee to claim"
]

results = demo.predict_batch(texts)
for result in results:
    print(f"{result['predicted_class']}: {result['text']}")

Interactive Jupyter Demo

# In demos/fraud_detection_demo.ipynb
your_text = "Your Netflix subscription has expired. Update your payment method to continue watching."
result = predict_fraud(your_text)
display_prediction(result)

📈 Expected Performance

Based on similar projects and baseline implementations:

Model Type              Expected Accuracy   F1-Score     Notes
Simple Rule-Based       60-70%              0.6-0.7      Quick prototype
TF-IDF + LogReg         80-90%              0.8-0.9      Good baseline
TF-IDF + SVM            80-90%              0.8-0.9      Robust to noise
BERT Fine-tuned         90-95%              0.90-0.95    Best performance
DistilBERT Fine-tuned   89-94%              0.89-0.94    60% faster, ~97% of BERT performance

Demo Troubleshooting

Model Not Loading

Low Performance

Memory Issues

Customization Tips

🔧 Configuration

Traditional ML Parameters

# In training/baseline_fraud_detection.py
vectorizer = TfidfVectorizer(
    max_features=5000,    # Vocabulary size
    stop_words='english', # Remove common words
    ngram_range=(1, 2)    # Use unigrams and bigrams
)
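
The baseline pairs this vectorizer with linear classifiers such as Logistic Regression (per the Expected Performance table). A minimal end-to-end sketch with toy data; variable names here are illustrative, not taken from the script:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
# Toy training data; the real script trains on the bundled dataset
pipeline.fit(["Meeting at 3 PM tomorrow", "URGENT: Verify your SSN now!"],
             ["normal", "fraud"])
print(pipeline.predict(["You won $10,000! Send fee to claim"]))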

BERT Configuration

# In training/bert_fraud_detection.py
classifier = BERTFraudClassifier(
    model_name='bert-base-uncased',  # Or 'distilbert-base-uncased' for faster training
    max_length=128,                  # Maximum sequence length
    num_classes=2                    # Binary classification
)
# In training/kaggle_fraud_detection.ipynb
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', 
    num_labels=9                     # Multiclass classification (8 fraud types + legitimate)
)
batch_size = 16      # Can use larger batches due to lower memory usage
max_length = 128     # Maximum sequence length
epochs = 3          # Faster training allows more epochs
learning_rate = 2e-5 # DistilBERT learning rate
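
For context, these hyperparameters map directly onto Hugging Face's Trainer API. The following is a minimal, self-contained sketch with a toy two-example dataset; the actual notebook may structure training differently:

import torch
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizerFast, Trainer, TrainingArguments)

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=9)

# Toy two-example dataset in the map-style form Trainer expects
enc = tokenizer(["Meeting at 3 PM", "You won $10,000! Send fee to claim"],
                truncation=True, max_length=128, padding=True)
train_dataset = [
    {'input_ids': torch.tensor(enc['input_ids'][i]),
     'attention_mask': torch.tensor(enc['attention_mask'][i]),
     'labels': torch.tensor(y)}
    for i, y in enumerate([0, 3])   # illustrative class indices
]

args = TrainingArguments(output_dir='distilbert-fraud',
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_dataset).train()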

Kaggle Training Configuration

# In runs/fraud-detection-kaggle-training-bert-run.ipynb
batch_size = 16      # Adjust based on GPU memory
max_length = 128     # Maximum sequence length
epochs = 3          # Training epochs
learning_rate = 2e-5 # BERT learning rate

📊 Sample Results

Traditional ML Output:

LOGISTIC_REGRESSION:
  Accuracy: 0.889
  F1-Score: 0.889
  AUC Score: 0.944

SVM:
  Accuracy: 0.889
  F1-Score: 0.889
  AUC Score: 0.944

BERT Output:

BERT Evaluation Results:
              precision    recall  f1-score   support

      normal       0.92      0.92      0.92        38
       fraud       0.92      0.92      0.92        37

    accuracy                           0.92        75
   macro avg       0.92      0.92      0.92        75
weighted avg       0.92      0.92      0.92        75

DistilBERT Output (Multiclass):

DistilBERT Multiclass Evaluation Results:
                    precision    recall  f1-score   support

         job_scam       0.89      0.94      0.91        32
       legitimate       0.95      0.91      0.93        45
         phishing       0.92      0.90      0.91        41
       popup_scam       0.88      0.92      0.90        38
      refund_scam       0.91      0.88      0.89        34
      reward_scam       0.90      0.93      0.91        36
         sms_spam       0.93      0.89      0.91        43
         ssn_scam       0.87      0.91      0.89        35
tech_support_scam       0.94      0.89      0.91        37

         accuracy                           0.91       341
        macro avg       0.91      0.91      0.91       341
     weighted avg       0.91      0.91      0.91       341

📊 DistilBERT Overall Metrics:
Accuracy: 0.9120
F1-Score (Macro): 0.9088
F1-Score (Weighted): 0.9115

📋 Data Requirements

Current Implementation

  1. SMS Spam Collection - Classic spam detection
  2. Phishing URL Dataset - URL-based fraud detection
  3. Enron Email Dataset - Email fraud detection
  4. Social Media Scam Data - Social platform fraud

Data Format Expected

data = pd.DataFrame({
    'message': ['text content here', ...],
    'label': ['fraud', 'normal', ...]
})
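
To sanity-check the bundled dataset against this format, something like the following helps. The rename shown is hypothetical, since the actual column headers in the CSV may already match:

import pandas as pd

df = pd.read_csv('final_fraud_detection_dataset.csv')  # tracked via Git LFS
print(df.columns.tolist())    # inspect the actual column names
# Hypothetical rename if the headers differ from message/label:
# df = df.rename(columns={'text': 'message', 'category': 'label'})
print(df.head())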

๐Ÿ› ๏ธ Extending the Models

Adding New Features

# Add sentiment analysis
from textblob import TextBlob

def add_sentiment_features(text):
    blob = TextBlob(text)
    return {
        'polarity': blob.sentiment.polarity,
        'subjectivity': blob.sentiment.subjectivity
    }

Custom Preprocessing

import re

def custom_preprocess(text):
    # Domain-specific preprocessing: strip URLs, normalize currency amounts
    text = re.sub(r'https?://\S+', ' ', text)                 # remove URLs
    text = re.sub(r'[$€£]\s?\d[\d,.]*', ' CURRENCY ', text)   # normalize currency
    return text

Ensemble Methods

# Combine multiple models (assumes the fitted vectorizer and trained models from above)
def ensemble_predict(text):
    features = vectorizer.transform([text])
    lr_pred = lr_model.predict_proba(features)[0][1]    # P(fraud) from logistic regression
    svm_pred = svm_model.predict_proba(features)[0][1]  # SVM needs probability=True at fit time
    bert_pred = bert_model.predict(text)['probabilities']['fraud']

    # Weighted average of the three fraud probabilities
    final_score = 0.3 * lr_pred + 0.3 * svm_pred + 0.4 * bert_pred
    return 'fraud' if final_score > 0.5 else 'normal'

๐ŸŒ Deployment Options

Using the Demo Framework

# Production-oriented integration using the demo class
from demos.fraud_detection_demo import FraudDetectionDemo

detector = FraudDetectionDemo()

# Single prediction; alert_user is a placeholder for your notification hook
result = detector.predict_single(user_message)
if result['is_fraud'] and result['confidence'] > 0.8:
    alert_user(result['predicted_class'])

Flask Web App

from flask import Flask, request, jsonify
from demos.fraud_detection_demo import FraudDetectionDemo

app = Flask(__name__)
detector = FraudDetectionDemo()

@app.route('/predict', methods=['POST'])
def predict():
    text = request.json['text']
    result = detector.predict_single(text)
    return jsonify({
        'prediction': result['predicted_class'],
        'is_fraud': result['is_fraud'],
        'confidence': result['confidence']
    })

if __name__ == '__main__':
    app.run(port=5000)
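
Assuming the app runs locally on the default port 5000, a client call might look like:

import requests

resp = requests.post('http://localhost:5000/predict',
                     json={'text': 'URGENT: Verify your SSN now!'})
print(resp.json())   # {'prediction': ..., 'is_fraud': ..., 'confidence': ...}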

Streamlit Dashboard

import streamlit as st
from demos.fraud_detection_demo import FraudDetectionDemo

st.title("🛡️ Fraud Detection System")
detector = FraudDetectionDemo()

text_input = st.text_area("Enter message to analyze:")

if st.button("Analyze"):
    result = detector.predict_single(text_input)
    
    if result['is_fraud']:
        st.error(f"🚨 FRAUD DETECTED: {result['predicted_class']}")
    else:
        st.success("✅ Message appears legitimate")
    
    st.write(f"Confidence: {result['confidence']:.2%}")
    
    # Show probability distribution
    st.bar_chart(result['all_probabilities'])
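
To launch the dashboard, save the snippet as, for example, app.py and run:

streamlit run app.py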

๐Ÿ” Evaluation Metrics

The models are evaluated using Accuracy, Precision, Recall, F1-score, and AUC (see Sample Results above).

For fraud detection, Recall is often the most important metric (don't miss actual fraud).
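
A minimal sketch of computing these metrics with scikit-learn, using toy values for illustration:

from sklearn.metrics import classification_report, roc_auc_score

y_true = [1, 0, 1, 0]                 # 1 = fraud, 0 = normal
y_pred = [1, 0, 0, 0]
y_score = [0.9, 0.2, 0.4, 0.1]        # predicted fraud probabilities

print(classification_report(y_true, y_pred, target_names=['normal', 'fraud']))
print('AUC:', roc_auc_score(y_true, y_score))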

🚧 Known Limitations

  1. Sample Data: Currently uses synthetic data; real datasets needed for production
  2. Class Imbalance: Real fraud data is typically very imbalanced (see the mitigation sketch after this list)
  3. Context: Simple models may miss contextual nuances
  4. Adversarial Examples: Sophisticated scammers may evade detection
  5. Language Support: Currently English-only
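
On the class-imbalance point, one cheap mitigation in the traditional baselines is scikit-learn's built-in class weighting; a sketch:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# class_weight='balanced' reweights classes inversely to their frequency,
# so scarce fraud examples are not drowned out by legitimate messages
clf = make_pipeline(TfidfVectorizer(max_features=5000),
                    LogisticRegression(class_weight='balanced', max_iter=1000))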

🔮 Next Steps

For Model Development

  1. Data Collection: Gather real fraud/scam datasets
  2. Feature Engineering: Add metadata features (sender, timestamp, etc.)
  3. Advanced Models: Experiment with RoBERTa, DistilBERT (already implemented), or domain-specific models
  4. Active Learning: Implement feedback loop for continuous improvement
  5. Multi-modal: Combine text with image analysis for comprehensive detection

For Production Deployment

  1. Performance Optimization: Optimize for low-latency inference using the demo framework
  2. A/B Testing: Compare model performance in production using demo tools
  3. Real-time Processing: Integrate demo classes into streaming systems
  4. Monitoring: Use demo tools to validate model performance over time
  5. User Interface: Build on the Streamlit demo for user-facing applications

For Demo Enhancement

  1. Interactive Web App: Extend the Streamlit demo with more features
  2. API Development: Use the demo classes to build REST APIs
  3. Batch Processing: Implement large-scale batch prediction capabilities
  4. Model Comparison: Add functionality to compare multiple model versions
  5. Feedback Collection: Integrate user feedback mechanisms for continuous learning

📚 References

Based on analysis of existing open-source fraud/scam detection projects.

🔗 Quick Reference

Training Files

  training/baseline_fraud_detection.py, training/bert_fraud_detection.py,
  training/fraud_detection_baseline.ipynb, training/kaggle_fraud_detection.ipynb

Demo Files

  demos/quick_demo.py, demos/fraud_detection_demo.py, demos/fraud_detection_demo.ipynb

Model Files (after training)

  models/bert_model/, models/bert_tokenizer/, models/distilbert_model/, models/distilbert_tokenizer/

Commands

# Quick test
python demos/quick_demo.py

# Interactive demo
jupyter notebook demos/fraud_detection_demo.ipynb

# Full demo
python demos/fraud_detection_demo.py

# Train from scratch
python training/baseline_fraud_detection.py

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add your improvements
  4. Submit a pull request

Note: This is a baseline implementation. For production use, consider real fraud datasets, class-imbalance handling, adversarial robustness, multilingual support, and ongoing monitoring (see Known Limitations and Next Steps above).