A comprehensive baseline implementation for fraud and scam detection using Natural Language Processing techniques. This project provides multiple approaches, from simple keyword-based detection to advanced BERT-based classification.
GitHub Repository: https://github.com/RockENZO/NLP-Cyber-Harm-Detection.git
This project implements baseline models for detecting fraudulent content (scams, phishing, spam) in text data, based on analysis of existing similar projects. It includes:
NEW: The project now includes a production-ready DistilBERT model with significant advantages:
The DistilBERT model is trained for multiclass classification, providing granular fraud type detection rather than just binary fraud/legitimate classification.
├── README.md  # This comprehensive documentation
├── requirements.txt  # Python dependencies
├── final_fraud_detection_dataset.csv  # Training dataset (Git LFS)
├── models/  # Saved trained models
│   ├── model.zip  # Compressed model bundle (excluded from git)
│   ├── bert_model/  # Trained BERT model files
│   ├── bert_tokenizer/  # BERT tokenizer files
│   ├── distilbert_model/  # Trained DistilBERT model files (60% faster)
│   └── distilbert_tokenizer/  # DistilBERT tokenizer files
├── training/  # Training scripts and notebooks
│   ├── baseline_fraud_detection.py  # Traditional ML baseline models
│   ├── bert_fraud_detection.py  # BERT-based classifier
│   ├── fraud_detection_baseline.ipynb  # Interactive Jupyter notebook
│   └── kaggle_fraud_detection.ipynb  # Kaggle-optimized training notebook
├── demos/  # Demo and testing tools
│   ├── fraud_detection_demo.py  # Full-featured demo script
│   ├── fraud_detection_demo.ipynb  # Interactive demo notebook
│   └── quick_demo.py  # Quick verification script
├── reasoning/  # 🧠 AI-powered reasoning pipeline
│   ├── GPT2_Fraud_Reasoning.ipynb  # GPT2-based reasoning analysis
│   └── KaggleLLMsReasoning.ipynb  # Local reasoning notebook
├── docs/  # Documentation
│   └── nlp_terms_explanation.md  # NLP concepts explanation
├── runs/  # Training run outputs and analysis results
│   ├── fraud_analysis_results_20250916_155231.csv
│   ├── fraud-detection-kaggle-training-bert-run.ipynb
│   ├── gpt2_fraud_analysis_20250917_034015.csv
│   ├── LLMsReasoningResultVisualization.ipynb
│   ├── MultipleLLMsReasoning(small-models).ipynb
│   └── LLMsStats/  # LLM performance comparison charts
│       ├── llm_category_heatmap.png
│       ├── llm_comparison_table.csv
│       ├── llm_performance_comparison.png
│       ├── llm_quality_radar.png
│       ├── llm_size_performance.png
│       ├── llm_speed_quality_scatter.png
│       ├── llm_model_size_comparison.png  # Model size vs performance charts
│       └── llm_speed_quality_bubble.png  # Speed vs quality bubble chart
├── .gitattributes  # Git LFS configuration
├── .gitignore  # Git ignore rules
└── .git/  # Git repository
If you have already trained a model on Kaggle:
pip install torch transformers pandas numpy matplotlib seaborn jupyter
python demos/quick_demo.py
jupyter notebook demos/fraud_detection_demo.ipynb
python demos/fraud_detection_demo.py
# Upload KaggleGPTReasoning.ipynb to Kaggle
# Enable GPU accelerator
# Run all cells for fraud detection + AI explanations
# Download results - no API costs
LLM Performance Analysis: Check runs/LLMsStats/ for performance comparisons.
pip install -r requirements.txt
python training/baseline_fraud_detection.py
python training/bert_fraud_detection.py
Upload final_fraud_detection_dataset.csv to Kaggle and use training/fraud-detection-kaggle-training-bert-run.ipynb
Note: The dataset is stored with Git LFS due to its size (~158MB). After cloning, run git lfs pull to download the full dataset. Large model files like model.zip are excluded from git to keep the repository size manageable.
The runs/LLMsStats/ directory contains LLM model analysis for fraud reasoning tasks.
training/baseline_fraud_detection.py
training/bert_fraud_detection.py
training/kaggle_fraud_detection.ipynb
runs/fraud-detection-kaggle-training-bert-run.ipynb
reasoning/
reasoning/KaggleGPTReasoning.ipynb for local reasoning analysis

Once you have a trained model, use these tools to test and demonstrate fraud detection capabilities:
demos/fraud_detection_demo.ipynb
demos/fraud_detection_demo.py
demos/quick_demo.py
Your trained model can detect these 9 classes:
from demos.fraud_detection_demo import FraudDetectionDemo
demo = FraudDetectionDemo()
result = demo.predict_single("Your account has been compromised! Click here now!")
print(f"Prediction: {result['predicted_class']}")
print(f"Confidence: {result['confidence']:.4f}")
print(f"Is Fraud: {result['is_fraud']}")
texts = [
    "Meeting at 3 PM tomorrow",
    "URGENT: Verify your SSN now!",
    "You won $10,000! Send fee to claim"
]
results = demo.predict_batch(texts)
for result in results:
    print(f"{result['predicted_class']}: {result['text']}")
# In demos/fraud_detection_demo.ipynb
your_text = "Your Netflix subscription has expired. Update your payment method to continue watching."
result = predict_fraud(your_text)
display_prediction(result)
Based on similar projects and baseline implementations:
| Model Type | Expected Accuracy | F1-Score | Notes |
|---|---|---|---|
| Simple Rule-Based | 60-70% | 0.6-0.7 | Quick prototype |
| TF-IDF + LogReg | 80-90% | 0.8-0.9 | Good baseline |
| TF-IDF + SVM | 80-90% | 0.8-0.9 | Robust to noise |
| BERT Fine-tuned | 90-95% | 0.9-0.95 | Best performance |
| DistilBERT Fine-tuned | 89-94% | 0.89-0.94 | 60% faster, 97% of BERT performance |
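To make the "Simple Rule-Based" row concrete, here is a minimal sketch of a keyword scorer. The patterns, weights, and threshold are hypothetical values chosen for illustration; they are not taken from the project's scripts.

```python
import re

# Hypothetical patterns and weights, for illustration only.
SUSPICIOUS_PATTERNS = {
    r"\burgent\b": 2,
    r"\bverify\b": 2,
    r"\bssn\b": 3,
    r"\bclick here\b": 2,
    r"\byou won\b": 3,
    r"\$\d": 1,
}


def rule_based_score(text):
    """Sum the weights of every suspicious pattern found in the text."""
    lowered = text.lower()
    return sum(
        weight
        for pattern, weight in SUSPICIOUS_PATTERNS.items()
        if re.search(pattern, lowered)
    )


def rule_based_predict(text, threshold=3):
    """Return 'fraud' when the score reaches the (assumed) threshold."""
    return "fraud" if rule_based_score(text) >= threshold else "normal"


if __name__ == "__main__":
    for message in ["Meeting at 3 PM tomorrow", "URGENT: Verify your SSN now!"]:
        print(message, "->", rule_based_predict(message))
```

A scorer like this is quick to prototype but misses any scam phrased without the listed keywords, which is consistent with the lower accuracy range expected for rule-based approaches.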
models/bert_model/ and models/bert_tokenizer/ exist (for BERT)
models/distilbert_model/ and models/distilbert_tokenizer/ exist (for DistilBERT)
pip install torch transformers pandas numpy matplotlib seaborn
predict_batch()
FraudDetectionDemo class as a starting point for applications

# In training/baseline_fraud_detection.py
vectorizer = TfidfVectorizer(
    max_features=5000,     # Vocabulary size
    stop_words='english',  # Remove common words
    ngram_range=(1, 2)     # Use unigrams and bigrams
)
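The ngram_range=(1, 2) setting means each message contributes both single words and adjacent word pairs as features. The sketch below shows that idea in plain Python. It is a simplification: it splits on whitespace and skips stop-word removal, so it does not reproduce scikit-learn's tokenizer, and the function name word_ngrams is made up for this example.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of the text for n in the inclusive range."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the message.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


if __name__ == "__main__":
    print(word_ngrams("verify your account"))
```

Bigrams such as "verify your" let a linear model pick up short phrases that single words cannot express.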
# In training/bert_fraud_detection.py
classifier = BERTFraudClassifier(
    model_name='bert-base-uncased',  # Or 'distilbert-base-uncased' for faster training
    max_length=128,                  # Maximum sequence length
    num_classes=2                    # Binary classification
)
# In training/kaggle_fraud_detection.ipynb
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=10  # Multiclass classification (9 fraud types + legitimate)
)
batch_size = 16 # Can use larger batches due to lower memory usage
max_length = 128 # Maximum sequence length
epochs = 3 # Faster training allows more epochs
learning_rate = 2e-5 # DistilBERT learning rate
# In runs/fraud-detection-kaggle-training-bert-run.ipynb
batch_size = 16 # Adjust based on GPU memory
max_length = 128 # Maximum sequence length
epochs = 3 # Training epochs
learning_rate = 2e-5 # BERT learning rate
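As a rough guide to what these settings imply, the number of optimizer steps follows from the dataset size, batch size, and epoch count. This is a minimal sketch: the dataset size of 1000 is a hypothetical figure, and it assumes the final partial batch is kept and no gradient accumulation is used.

```python
import math


def total_training_steps(num_examples, batch_size=16, epochs=3):
    """Estimate optimizer steps, keeping the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # 1000 is a made-up training-set size, not the size of the project's dataset.
    print(total_training_steps(1000))
```

An estimate like this is useful when configuring a learning-rate schedule, which typically needs the total step count up front.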
LOGISTIC_REGRESSION:
Accuracy: 0.889
F1-Score: 0.889
AUC Score: 0.944
SVM:
Accuracy: 0.889
F1-Score: 0.889
AUC Score: 0.944
BERT Evaluation Results:
precision recall f1-score support
normal 0.92 0.92 0.92 38
fraud 0.92 0.92 0.92 37
accuracy 0.92 75
macro avg 0.92 0.92 0.92 75
weighted avg 0.92 0.92 0.92 75
DistilBERT Multiclass Evaluation Results:
precision recall f1-score support
job_scam 0.89 0.94 0.91 32
legitimate 0.95 0.91 0.93 45
phishing 0.92 0.90 0.91 41
popup_scam 0.88 0.92 0.90 38
refund_scam 0.91 0.88 0.89 34
reward_scam 0.90 0.93 0.91 36
sms_spam 0.93 0.89 0.91 43
ssn_scam 0.87 0.91 0.89 35
tech_support_scam 0.94 0.89 0.91 37
accuracy 0.91 341
macro avg 0.91 0.91 0.91 341
weighted avg 0.91 0.91 0.91 341
DistilBERT Overall Metrics:
Accuracy: 0.9120
F1-Score (Macro): 0.9088
F1-Score (Weighted): 0.9115
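The difference between the macro and weighted F1 scores can be sketched from the per-class rows of the DistilBERT report above. Macro F1 averages the classes equally, while weighted F1 weights each class by its support. Because the inputs here are rounded to two decimals, the results will differ slightly from the overall metrics listed.

```python
# (f1-score, support) pairs copied from the DistilBERT multiclass report.
REPORT = {
    "job_scam": (0.91, 32),
    "legitimate": (0.93, 45),
    "phishing": (0.91, 41),
    "popup_scam": (0.90, 38),
    "refund_scam": (0.89, 34),
    "reward_scam": (0.91, 36),
    "sms_spam": (0.91, 43),
    "ssn_scam": (0.89, 35),
    "tech_support_scam": (0.91, 37),
}


def macro_f1(report):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in report.values()) / len(report)


def weighted_f1(report):
    """Mean of the per-class F1 scores, weighted by class support."""
    total_support = sum(support for _, support in report.values())
    return sum(f1 * support for f1, support in report.values()) / total_support


if __name__ == "__main__":
    print(f"Macro F1:    {macro_f1(REPORT):.4f}")
    print(f"Weighted F1: {weighted_f1(REPORT):.4f}")
```

When classes are roughly balanced, as they are here, the two averages stay close; a large gap would suggest that small classes perform very differently from large ones.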
data = pd.DataFrame({
    'message': ['text content here', ...],
    'label': ['fraud', 'normal', ...]
})
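Before training on your own data, it can help to confirm that the file has the message and label columns shown above. The following is a sketch using only the standard library; the function name load_labeled_rows is hypothetical, and whether the project's own CSV uses exactly these column names is an assumption.

```python
import csv
import io

REQUIRED_COLUMNS = {"message", "label"}


def load_labeled_rows(csv_file):
    """Read (message, label) pairs, failing early if a required column is missing."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    # Skip rows whose message is empty or whitespace only.
    return [
        (row["message"], row["label"])
        for row in reader
        if (row["message"] or "").strip()
    ]


if __name__ == "__main__":
    sample = io.StringIO("message,label\nYou won a prize,fraud\nMeeting at 3 PM,normal\n")
    print(load_labeled_rows(sample))
```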
# Add sentiment analysis
from textblob import TextBlob

def add_sentiment_features(text):
    blob = TextBlob(text)
    return {
        'polarity': blob.sentiment.polarity,
        'subjectivity': blob.sentiment.subjectivity
    }

def custom_preprocess(text):
    # Add domain-specific preprocessing
    text = remove_urls(text)
    text = normalize_currency(text)
    return text
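The helpers remove_urls and normalize_currency are not defined in this README. One possible implementation, offered only as a sketch, uses regular expressions; the patterns and the <MONEY> placeholder token are assumptions rather than the project's actual choices.

```python
import re

# Assumed patterns: http(s) or www links, and amounts prefixed by $, € or £.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
CURRENCY_PATTERN = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?")


def remove_urls(text):
    """Replace every URL with a single space."""
    return URL_PATTERN.sub(" ", text)


def normalize_currency(text):
    """Replace currency amounts with a placeholder token."""
    return CURRENCY_PATTERN.sub("<MONEY>", text)


def custom_preprocess(text):
    """Apply both helpers, then collapse repeated whitespace."""
    text = remove_urls(text)
    text = normalize_currency(text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(custom_preprocess("You won $10,000! Claim at http://example.com/win now"))
```

Mapping all amounts to one token lets the model learn that "a sum of money was mentioned" without memorizing specific figures.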
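
# Combine multiple models
# Assumes lr_model and svm_model are scikit-learn pipelines that include the
# TF-IDF vectorizer, and that the SVM was trained with probability=True.
def ensemble_predict(text):
    lr_pred = lr_model.predict_proba([text])[0][1]    # predict_proba expects a list of texts
    svm_pred = svm_model.predict_proba([text])[0][1]
    bert_pred = bert_model.predict(text)['probabilities']['fraud']
    # Weighted average
    final_score = 0.3 * lr_pred + 0.3 * svm_pred + 0.4 * bert_pred
    return 'fraud' if final_score > 0.5 else 'normal'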
# Production-ready integration using the demo class
from demos.fraud_detection_demo import FraudDetectionDemo
detector = FraudDetectionDemo()
# Single prediction
result = detector.predict_single(user_message)
if result['is_fraud'] and result['confidence'] > 0.8:
    alert_user(result['predicted_class'])
from flask import Flask, request, jsonify
from demos.fraud_detection_demo import FraudDetectionDemo
app = Flask(__name__)
detector = FraudDetectionDemo()

@app.route('/predict', methods=['POST'])
def predict():
    text = request.json['text']
    result = detector.predict_single(text)
    return jsonify({
        'prediction': result['predicted_class'],
        'is_fraud': result['is_fraud'],
        'confidence': result['confidence']
    })
import streamlit as st
from demos.fraud_detection_demo import FraudDetectionDemo

st.title("🛡️ Fraud Detection System")
detector = FraudDetectionDemo()
text_input = st.text_area("Enter message to analyze:")
if st.button("Analyze"):
    result = detector.predict_single(text_input)
    if result['is_fraud']:
        st.error(f"🚨 FRAUD DETECTED: {result['predicted_class']}")
    else:
        st.success("✅ Message appears legitimate")
    st.write(f"Confidence: {result['confidence']:.2%}")
    # Show probability distribution
    st.bar_chart(result['all_probabilities'])
The models are evaluated using:
For fraud detection, Recall is often most important (don't miss actual fraud).
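To show why recall matters here, the sketch below computes precision, recall, and F1 from raw confusion counts. The counts in the usage example are hypothetical and only illustrate how a model can look precise while still missing fraud.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: the model never raises a false alarm,
    # yet it misses 7 of the 37 fraudulent messages.
    precision, recall, f1 = precision_recall_f1(tp=30, fp=0, fn=7)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example precision is perfect while recall is not, which is exactly the situation the recall metric is meant to expose.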
Based on analysis of existing projects including:
training/baseline_fraud_detection.py - Traditional ML models
training/bert_fraud_detection.py - BERT training script
runs/fraud-detection-kaggle-training-bert-run.ipynb - Kaggle BERT training notebook
demos/fraud_detection_demo.ipynb - Interactive demo notebook
demos/fraud_detection_demo.py - Full demo script
demos/quick_demo.py - Quick verification
models/bert_model/ - Trained BERT model
models/bert_tokenizer/ - BERT tokenizer
models/distilbert_model/ - Trained DistilBERT model (60% faster)
models/distilbert_tokenizer/ - DistilBERT tokenizer
models/model.zip - Compressed model bundle (excluded from git)

# Quick test
python demos/quick_demo.py
# Interactive demo
jupyter notebook demos/fraud_detection_demo.ipynb
# Full demo
python demos/fraud_detection_demo.py
# Train from scratch
python training/baseline_fraud_detection.py
Note: This is a baseline implementation. For production use, consider: