Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Real-Time Facial Expression Recognition Using YOLO Object Detection Models

Tech 2

1. Introduction

Facial expression recognition represents a critical research area within computer vision and affective computing. The advancement of deep learning techniques has enabled significant improvements in expression recognition accuracy and efficiency. This system implements facial expression detection using YOLOv8, YOLOv7, YOLOv6, and YOLOv5 architectures, providing real-time detection capabilities through an intuitive web interface.

The YOLO (You Only Look Once) family of algorithms has revolutionized object detection by processing images in a single forward pass, achieving remarkable speed while maintaining competitive accuracy. This characteristic makes YOLO particularly suitable for real-time expression recognition applications.

1.1 System Features

The implemented facial expression recognition system offers comprehensive functionality:

Real-time Camera Detection: The system captures video streams from webcams and processes each frame for facial expression identification. Detection results appear instantly on the user interface with bounding boxes and confidence scores.

Image File Analysis: Users can upload static images for batch processing. The system analyzes each uploaded image and displays recognized facial expressions with corresponding confidence levels.

Video File Processing: The system processes video files frame-by-frame, identifying and labeling expressions throughout the entire video duration. Users can review the annotated video with expression markers.

Model Selection: Multiple YOLO versions are integrated into the system, allowing users to compare performance across different model architectures (YOLOv8, YOLOv7, YOLOv6, YOLOv5).

Display Modes: The system supports simultaneous or individual display of detection results and original frames, enabling intuitive comparison.

Filtering Capabilities: Dropdown menus allow users to isolate specific expression categories for focused analysis.

Parameter Adjustment: Users can dynamically modify confidence thresholds and IOU (Intersection over Union) values to optimize detection performance.

Result Export: Detection results export to CSV format, while annotated videos save as AVI files for further analysis and archiving.

2. Research Background and Significance

2.1 Background

Traditional facial expression recognition methods face significant challenges when processing real-time video streams and handling complex backgrounds. The YOLO algorithm family addresses these limitations through its single-pass detection mechanism.

YOLO Evolution:

  • YOLOv1 (2015): Introduced regression-based object detection, predicting object locations and categories in a single forward pass
  • YOLOv2: Implemented batch normalization and anchor box mechanisms for improved convergence
  • YOLOv3: Incorporated deeper network structures with Feature Pyramid Networks (FPN)
  • YOLOv4 and YOLOv5: Further optimized network architectures and training strategies
  • YOLOv6, YOLOv7, YOLOv8: Introduced advanced training techniques and architectural improvements

2.2 Significance

This research holds substantial importance across multiple dimensions:

Performance Enhancement: The YOLO series continuously improves detection accuracy through architectural innovations. YOLOv8 achieves particularly strong results in expression recognition tasks while maintaining real-time processing speeds.

Practical Applications: Applications span intelligent surveillance, affective computing, human-computer interaction, online education, and mental health assessment. Real-time expression analysis enables adaptive responses in various contexts.

Technical Innovation: Recent YOLO versions introduce anchor-free detection mechanisms and advanced training strategies, providing new approaches for expression recognition challenges.

3. Dataset Processing

3.1 Data Collection and Annotation

The dataset comprises facial images spanning multiple expression categories, captured under diverse lighting conditions, backgrounds, and viewing angles. Annotation includes facial region boundaries and expression category labels.

3.2 Preprocessing

Image Resizing: All input images are resized to the model-required dimensions (typically 640x640 pixels) to ensure consistent input formatting.

Normalization: Pixel values undergo normalization to the [0, 1] range for optimal model processing.

3.3 Data Augmentation

Multiple augmentation strategies enhance dataset diversity and model generalization:

  • Random horizontal flipping
  • Rotation within ±15 degrees
  • Random cropping and scaling
  • Color jittering (brightness, contrast, saturation)
  • Mosaic augmentation: Combining four training images into a composite training sample

3.4 Dataset Splitting

The dataset divides into training (80%), validation (10%), and testing (10%) sets. Category distribution analysis ensures balanced representation across expression classes.

3.5 Expression Categories

The system recognizes seven fundamental facial expressions:

  • Angry
  • Disgust
  • Fear
  • Happy
  • Neutral
  • Sad
  • Surprise

4. Algorithm Implementation

4.1 Network Architecture

Backbone: YOLOv8 employs CSP (Cross Stage Partial networks) and ELAN (Enhanced Layer Aggregation Network) structures for robust feature extraction.

Neck: The Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) enable multi-scale feature fusion.

Head: A decoupled head design separates classification and localization tasks, improving overall performance.

4.2 Loss Function

The Distribution Focal Loss optimizes detection performance by adjusting loss distributions to emphasize harder examples and rare expression categories.

4.3 Anchor-Free Mechanism

YOLOv8 implements anchor-free detection, directly predicting bounding box center points and dimensions without relying on predefined anchor boxes.

4.4 Code Implementation

Environment Setup

# Install required dependencies
pip install torch torchvision opencv-python-headless streamlit ultralytics

Dataset Class Definition

import cv2
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class FacialExpressionDataset(Dataset):
    def __init__(self, image_paths, annotations, transform=None):
        self.image_paths = image_paths
        self.annotations = annotations
        self.transform = transform
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, index):
        image = cv2.imread(self.image_paths[index])
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        label = self.annotations[index]
        
        if self.transform:
            image = self.transform(image)
        
        return image, label

# Define transformation pipeline
augmentation = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((640, 640)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                        std=[0.229, 0.224, 0.225])
])

# Initialize data loader
training_data = FacialExpressionDataset(
    image_paths=train_paths,
    annotations=train_labels,
    transform=augmentation
)
train_loader = DataLoader(training_data, batch_size=16, shuffle=True)

Model Training

from ultralytics import YOLO

def train_expression_model():
    # Load base model
    detection_model = YOLO('yolov8n.pt')
    
    # Configure training parameters
    training_config = {
        'data': 'expression_dataset.yaml',
        'epochs': 100,
        'batch': 16,
        'imgsz': 640,
        'device': 'cuda' if torch.cuda.is_available() else 'cpu',
        'optimizer': 'AdamW',
        'lr0': 0.001,
        'weight_decay': 0.0005,
        'warmup_epochs': 3,
        'save_period': 10
    }
    
    # Execute training
    results = detection_model.train(**training_config)
    
    return results

if __name__ == "__main__":
    train_expression_model()

Model Inference

import numpy as np

class ExpressionDetector:
    def __init__(self, model_path):
        self.model = YOLO(model_path)
        self.class_names = ['angry', 'disgust', 'fear', 'happy', 
                           'neutral', 'sad', 'surprise']
    
    def detect_expressions(self, image_path):
        # Load and preprocess image
        image = cv2.imread(image_path)
        
        # Run inference
        predictions = self.model.predict(image, conf=0.5, iou=0.45)
        
        results = []
        for pred in predictions:
            boxes = pred.boxes
            for box in boxes:
                x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
                confidence = float(box.conf[0])
                class_id = int(box.cls[0])
                
                results.append({
                    'bbox': [int(x1), int(y1), int(x2), int(y2)],
                    'expression': self.class_names[class_id],
                    'confidence': round(confidence, 3)
                })
                
                # Draw annotations
                cv2.rectangle(image, 
                             (int(x1), int(y1)), 
                             (int(x2), int(y2)), 
                             (0, 255, 0), 2)
                label = f"{self.class_names[class_id]}: {confidence:.2f}"
                cv2.putText(image, label, 
                           (int(x1), int(y1) - 10),
                           cv2.FONT_HERSHEY_SIMPLEX, 
                           0.6, (0, 255, 0), 2)
        
        return image, results
    
    def process_video(self, video_path, output_path):
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*'XVID')
        out = cv2.VideoWriter(output_path, fourcc, 30.0, 
                             (int(cap.get(3)), int(cap.get(4))))
        
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
                
            annotated_frame, _ = self.detect_expressions(frame)
            out.write(annotated_frame)
            
        cap.release()
        out.release()

Web Interface

import streamlit as st
import numpy as np
from PIL import Image

class ExpressionRecognitionUI:
    def __init__(self):
        st.set_page_config(
            page_title="Expression Recognition System",
            layout="wide"
        )
        self.detector = None
        self.initialize_session_state()
    
    def initialize_session_state(self):
        if 'model_loaded' not in st.session_state:
            st.session_state.model_loaded = False
        if 'results_history' not in st.session_state:
            st.session_state.results_history = []
    
    def render_sidebar(self):
        st.sidebar.header("Configuration")
        
        # Model selection
        model_choice = st.sidebar.selectbox(
            "Select Model",
            ['yolov8n.pt', 'yolov7.pt', 'yolov6n.pt', 'yolov5nu.pt']
        )
        
        # Threshold controls
        conf_threshold = st.sidebar.slider(
            "Confidence Threshold", 
            min_value=0.1, 
            max_value=1.0, 
            value=0.5
        )
        iou_threshold = st.sidebar.slider(
            "IOU Threshold",
            min_value=0.1,
            max_value=1.0,
            value=0.45
        )
        
        # Input source selection
        input_source = st.sidebar.radio(
            "Input Source",
            ["Camera", "Image Upload", "Video Upload"]
        )
        
        return model_choice, conf_threshold, iou_threshold, input_source
    
    def process_image_upload(self, uploaded_file, conf, iou):
        # Decode uploaded image
        image_bytes = uploaded_file.read()
        image_array = np.frombuffer(image_bytes, np.uint8)
        image = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
        
        # Run detection
        results = self.detector.model.predict(
            image, conf=conf, iou=iou
        )
        
        # Annotate results
        annotated = image.copy()
        for result in results:
            boxes = result.boxes
            for box in boxes:
                x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
                label = f"{result.names[int(box.cls[0])]}: {box.conf[0]:.2f}"
                cv2.rectangle(annotated, 
                            (int(x1), int(y1)), 
                            (int(x2), int(y2)), 
                            (0, 255, 0), 2)
                cv2.putText(annotated, label,
                           (int(x1), int(y1)-10),
                           cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                           (0, 255, 0), 2)
        
        return annotated
    
    def main(self):
        st.title("Facial Expression Recognition System")
        
        # Render sidebar and get configuration
        model_path, conf_thresh, iou_thresh, source = self.render_sidebar()
        
        # Initialize detector
        if not st.session_state.model_loaded:
            with st.spinner("Loading model..."):
                self.detector = ExpressionDetector(model_path)
                st.session_state.model_loaded = True
        
        # Main content area
        col1, col2 = st.columns(2)
        
        with col1:
            st.subheader("Input")
            if source == "Image Upload":
                uploaded_image = st.file_uploader(
                    "Choose an image", 
                    type=['jpg', 'jpeg', 'png']
                )
                if uploaded_image:
                    st.image(uploaded_image, use_container_width=True)
            elif source == "Video Upload":
                uploaded_video = st.file_uploader(
                    "Choose a video",
                    type=['mp4', 'avi']
                )
            else:
                st.info("Camera input would be processed here")
        
        with col2:
            st.subheader("Detection Results")
            if st.button("Start Detection"):
                with st.spinner("Processing..."):
                    # Processing logic
                    st.success("Detection completed!")

if __name__ == "__main__":
    app = ExpressionRecognitionUI()
    app.main()

5. Experimental Results

5.1 Experimental Setup

Training and evaluation employed identical datasets across all YOLO variants to ensure fair comparison. Performance metrics included F1-Score and mean Average Precision (mAP).

5.2 Results

Model Image Size mAP@50-95 CPU Speed (ms) Parameters (M) FLOPs (G)
YOLOv5nu 640 34.3 73.6 2.6 7.7
YOLOv8n 640 37.3 80.4 3.2 8.7
YOLOv6N 640 37.5 - 4.7 11.4
YOLOv7-tiny 640 37.4 - 6.01 13.1
Model mAP F1-Score
YOLOv5nu 0.989 0.98
YOLOv6n 0.988 0.98
YOLOv7-tiny 0.987 0.98
YOLOv8n 0.989 0.99

5.3 Analysis

Performance Metrics: All models achieved competitive mAP scores exceeding 0.98, with YOLOv8n slightly outperforming in F1-Score (0.99). The minimal performance gap indicates that all YOLO variants effectively capture facial expression features.

Detection Speed: YOLOv5nu demonstrates the fastest CPU inference speed, while YOLOv8n achieves optimal balance between accuracy and computational efficiency.

Confusion Matrix: Expression categories with distinct facial configurations (happy, surprise) achieved high recognition rates. Similar expressions (anger, disgust) occasionally misclassify due to overlapping facial muscle patterns.

6. System Architecture

6.1 Components

DetectionUI Class: Manages user interface interactions, coordinates between input sources and detection modules.

YOLODetector Class: Encapsulates model loading, preprocessing, prediction, and post-processing operations.

ResultLogger: Records detection metadata including expression labels, confidence scores, and spatial coordinates.

LogTable: Formats and displays detection results in tabular form for user review.

6.2 Processing Pipeline

  1. System initialization: Load configuration parameters and model weights
  2. Input handling: Capture camera feed, load uploaded files, or process video streams
  3. Frame processing: Apply preprocessing transformations
  4. Model inference: Execute forward pass through YOLO network
  5. Post-processing: Apply confidence thresholds and non-maximum suppression
  6. Result visualization: Overlay bounding boxes and labels
  7. Export: Save detection results to CSV or annotated video files

7. Future Directions

Model Optimization: Investigate neural architecture search techniques for developing specialized expression recognition models with reduced computational requirements.

Multi-modal Integration: Explore combining visual expression data with audio and text modalities for comprehensive emotion understanding.

Cross-domain Adaptation: Develop domain adaptation techniques to improve generalization across diverse demographic groups and environmental conditions.

Application Expansion: Deploy expression recognition in educational platforms for student engagement monitoring, healthcare systems for mood assessment, and smart environments for adaptive responses.

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.