Graph Attention Networks: Adaptive Feature Aggregation via Self-Attention

Graph Attention Networks (GATs) introduce attention mechanisms into graph neural architectures to enable adaptive weighting of neighboring nodes. Proposed by Veličković et al. in 2017, the architecture computes importance coefficients between connected nodes through masked self-attention layers, allowing the model to focus on the most relevant parts of each node's neighborhood without requiring costly matrix operations such as inversion or upfront knowledge of the global graph structure.

The attention mechanism adapts the concept originally developed for sequence modeling in neural machine translation. Unlike Graph Convolutional Networks that rely on fixed normalization constants based on node degree, GATs dynamically compute attention coefficients by comparing node features across edges. This produces learnable weights that indicate which neighbors contribute most significantly to a node's updated representation.
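
For contrast, a GCN-style weight for an edge (i, j) is a constant fixed by node degrees alone. A hand-computed example (the degrees are illustrative):

import math

# GCN symmetric normalization: neighbor j's weight in node i's update is
# 1 / sqrt(deg(i) * deg(j)), set by the graph structure and never learned.
deg_i, deg_j = 4, 2
w_ij = 1 / math.sqrt(deg_i * deg_j)
print(w_ij)  # 0.3535..., identical on every forward pass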

A single GAT layer first applies a shared linear transformation to every node's features, then scores each edge with a learnable attention function. For nodes i and j connected by an edge, the attention coefficient e_ij represents the importance of node j's features to node i. These coefficients are normalized across each node's neighborhood with a softmax, and the resulting weights are used to compute a weighted combination of the transformed neighbor features. Multi-head attention extends this by performing K independent attention computations in parallel and concatenating or averaging their outputs.
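
Before moving to a full model, the core computation can be sketched in plain PyTorch. This is a minimal single-head illustration with toy dimensions and a hand-written adjacency matrix, not an optimized implementation; the GATConv layer used below performs the equivalent computation with sparse message passing:

import torch
import torch.nn.functional as F

num_nodes, in_dim, out_dim = 3, 4, 2
h = torch.randn(num_nodes, in_dim)     # input node features
W = torch.randn(in_dim, out_dim)       # shared linear transformation
a = torch.randn(2 * out_dim)           # learnable attention vector

Wh = h @ W                             # transform every node's features

# e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every ordered pair (i, j)
pairs = torch.cat([Wh.repeat_interleave(num_nodes, dim=0),
                   Wh.repeat(num_nodes, 1)], dim=1)
e = F.leaky_relu(pairs @ a, negative_slope=0.2).view(num_nodes, num_nodes)

# Masked softmax: each node attends only to its neighbors (self-loops kept)
adj = torch.tensor([[1, 1, 0],
                    [1, 1, 1],
                    [0, 1, 1]], dtype=torch.bool)
alpha = torch.softmax(e.masked_fill(~adj, float('-inf')), dim=1)

h_new = alpha @ Wh                     # attention-weighted aggregation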

An implementation using PyTorch Geometric demonstrates this architecture for node classification on the Cora citation dataset:

import torch
import torch.nn.functional as F
from torch.nn import Module
from torch_geometric.nn import GATConv
from torch_geometric.datasets import Planetoid

class AttnGraphNet(Module):
    def __init__(self, feat_dim, hidden_dim, class_dim, attn_heads=4):
        super().__init__()
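        # First attention layer: attn_heads heads run in parallel and their
        # outputs are concatenated, widening features to hidden_dim * attn_heads.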
        self.layer1 = GATConv(feat_dim, hidden_dim, heads=attn_heads, dropout=0.6)
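        # Output layer: a single head mapping to class scores; in GATConv the
        # dropout argument is applied to the normalized attention coefficients.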
        self.layer2 = GATConv(hidden_dim * attn_heads, class_dim, heads=1, 
                              concat=False, dropout=0.6)
        
    def forward(self, node_feats, edge_list):
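        # Multi-head attention over each node's 1-hop neighborhood, followed
        # by a non-linearity and dropout for regularization.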
        h = self.layer1(node_feats, edge_list)
        h = F.elu(h)
        h = F.dropout(h, p=0.6, training=self.training)
        h = self.layer2(h, edge_list)
        return F.log_softmax(h, dim=1)

def optimize(model, data, optimizer):
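    # One training step: forward pass, negative log-likelihood loss on the
    # training nodes only, then backpropagation and a parameter update.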
    model.train()
    optimizer.zero_grad()
    predictions = model(data.x, data.edge_index)
    loss = F.nll_loss(predictions[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()

def evaluate(model, data):
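    # Report accuracy on the train/validation/test splits defined by the
    # dataset's boolean masks, with gradient tracking disabled.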
    model.eval()
    with torch.no_grad():
        logits = model(data.x, data.edge_index)
        predictions = logits.argmax(dim=1)
        train_acc = (predictions[data.train_mask] == data.y[data.train_mask]).sum().item() / data.train_mask.sum().item()
        val_acc = (predictions[data.val_mask] == data.y[data.val_mask]).sum().item() / data.val_mask.sum().item()
        test_acc = (predictions[data.test_mask] == data.y[data.test_mask]).sum().item() / data.test_mask.sum().item()
    return train_acc, val_acc, test_acc

# Configuration
hidden_units = 16
attention_heads = 8
learning_rate = 0.005
weight_decay = 5e-4
epochs = 300

# Data preparation
dataset = Planetoid(root='/tmp/Cora', name='Cora')
graph = dataset[0]

# Model initialization
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
classifier = AttnGraphNet(dataset.num_features, hidden_units, 
                         dataset.num_classes, attention_heads).to(device)
graph = graph.to(device)
optimizer = torch.optim.Adam(classifier.parameters(), lr=learning_rate, 
                            weight_decay=weight_decay)

# Training procedure
for iteration in range(1, epochs + 1):
    loss = optimize(classifier, graph, optimizer)
    if iteration % 50 == 0:
        train_acc, val_acc, test_acc = evaluate(classifier, graph)
        print(f'Iter: {iteration:03d}, Loss: {loss:.4f}, Train: {train_acc:.4f}, Val: {val_acc:.4f}, Test: {test_acc:.4f}')

This implementation uses multi-head attention to stabilize the learning process and increase model capacity. Because the attention mechanism assigns a different importance to each neighbor, the architecture is particularly effective for graphs with heterogeneous neighborhood structures or widely varying node degrees.
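
To make the head-combination behavior concrete, a quick shape check with illustrative dimensions shows concatenation versus averaging in GATConv:

import torch
from torch_geometric.nn import GATConv

x = torch.randn(4, 8)                      # 4 nodes, 8 features each
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 0, 3, 2]])  # directed edge list

concat_layer = GATConv(8, 16, heads=4)                  # concatenate heads
average_layer = GATConv(8, 16, heads=4, concat=False)   # average heads

print(concat_layer(x, edge_index).shape)   # torch.Size([4, 64])
print(average_layer(x, edge_index).shape)  # torch.Size([4, 16])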
