In my previous article, I discussed how morphological feature extractors mimic the way biological experts visually assess images.
This time, I want to go a step further and explore a new question:
Can different architectures complement each other to build an AI that “sees” like an expert?
Introduction: Rethinking Model Architecture Design
While building a high accuracy visual recognition model, I ran into a key challenge:
How do we get AI to not just “see” an image, but actually understand the features that matter?
Traditional CNNs excel at capturing local details like fur texture or ear shape, but they often miss the bigger picture. Transformers, on the other hand, are great at modeling global relationships (how different regions of an image interact), but they can easily overlook fine-grained cues.
This insight led me to explore combining the strengths of both architectures to create a model that not only captures fine details but also comprehends the bigger picture.
While developing PawMatchAI, a 124-breed dog classification system, I went through three major architectural phases:
1. Early Stage: EfficientNetV2-M + Multi-Head Attention
I started with EfficientNetV2-M and added a multi-head attention module.
I experimented with 4, 8, and 16 heads—eventually settling on 8, which gave the best results.
This setup reached an F1 score of 78%, but it felt more like a technical combination than a cohesive design.
2. Refinement: Focal Loss + Advanced Data Augmentation
After closely analyzing the dataset, I noticed a class imbalance: some breeds appeared far more frequently than others, skewing the model's predictions.
To address this, I introduced Focal Loss, along with RandAugment and MixUp, to make the data distribution more balanced and diverse.
This pushed the F1 score up to 82.3%.
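The article doesn't include the loss code, so here is a minimal Focal Loss sketch in PyTorch for reference; the gamma and alpha values are illustrative defaults, not the project's actual settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Down-weights easy examples so training focuses on hard, under-represented classes."""
    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample cross-entropy
        pt = torch.exp(-ce)                                      # probability of the true class
        return (self.alpha * (1 - pt) ** self.gamma * ce).mean()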
3. Breakthrough: Switching to ConvNextV2-Base + Training Optimization
Next, I replaced the backbone with ConvNextV2-Base, and optimized the training using OneCycleLR and a progressive unfreezing strategy.
The F1 score climbed to 87.89%.
But during real-world testing, the model still struggled with visually similar breeds, indicating room for improvement in generalization.
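For readers who want to try a similar training setup, the sketch below shows one way to wire up OneCycleLR and progressive unfreezing; model, train_loader, and num_epochs are assumed to exist, the hyperparameters are placeholders rather than the project's exact values, and the stages attribute assumes a timm ConvNeXt-style backbone.

import torch
from torch.optim.lr_scheduler import OneCycleLR

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = OneCycleLR(
    optimizer,
    max_lr=1e-3,
    epochs=num_epochs,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,  # fraction of the schedule spent warming up
)
# Call scheduler.step() after every optimizer.step(), i.e. once per batch.

# Progressive unfreezing: start with the backbone frozen...
for param in model.backbone.parameters():
    param.requires_grad = False

# ...then gradually unfreeze the deepest stages as training progresses.
def unfreeze_last_stages(model, num_stages):
    for stage in model.backbone.stages[-num_stages:]:
        for param in stage.parameters():
            param.requires_grad = True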
4. Final Step: Building a Truly Hybrid Architecture
After reviewing the first three phases, I realized the core issue: stacking technologies isn’t the same as getting them to work together.
What I needed was true collaboration between the CNN, the Transformer, and the morphological feature extractor, each playing to its strengths. So I restructured the entire pipeline.
ConvNextV2 was in charge of extracting detailed local features.
The morphological module acted like a domain expert, highlighting features critical for breed identification.
Finally, the multi-head attention brought it all together by modeling global relationships.
This time, they weren't just independent modules; they were a team.
CNNs identified the details, the morphology module amplified the meaningful ones, and the attention mechanism tied everything into a coherent global view.
Key Result: The F1 score rose to 88.70%, but more importantly, this gain came from the model learning to understand morphology, not just memorize textures or colors.
It started recognizing subtle structural features—just like a real expert would—making better generalizations across visually similar breeds.
If you’re interested, I’ve written more about morphological feature extractors here.
These extractors mimic how biological experts assess shape and structure, enhancing critical visual cues like ear shape and body proportions.
They’re a vital part of this hybrid design, filling the gaps traditional models tend to overlook.
In this article, I’ll walk through:
- The strengths and limitations of CNNs vs. Transformers—and how they can complement each other
- Why I ultimately chose ConvNextV2 over EfficientNetV2
- The technical details of multi-head attention and how I decided the number of heads
- How all these elements came together in a unified hybrid architecture
- And finally, how heatmaps reveal that the AI is learning to “see” key features, just like a human expert
1. The Strengths and Limitations of CNNs and Transformers
In the previous section, I discussed how CNNs and Transformers can effectively complement each other. Now, let’s take a closer look at what sets each architecture apart, their individual strengths, limitations, and how their differences make them work so well together.
1.1 The Strength of CNNs: Great with Details, Limited in Scope
CNNs are like meticulous artists: they can draw fine lines beautifully, but they often miss the bigger composition.
Strong at Local Feature Extraction
CNNs are excellent at capturing edges, textures, and shapes—ideal for distinguishing fine-grained features like ear shapes, nose proportions, and fur patterns across dog breeds.
Computational Efficiency
With parameter sharing, CNNs process high-resolution images more efficiently, making them well-suited for large-scale visual tasks.
Translation Invariance
Even when a dog’s pose varies, CNNs can still reliably identify its breed.
That said, CNNs have two key limitations:
Limited Receptive Field:
CNNs expand their field of view layer by layer, but early-stage neurons only “see” small patches of pixels. As a result, it’s difficult for them to connect features that are spatially far apart.
For instance: When identifying a German Shepherd, the CNN might spot upright ears and a sloped back separately, but struggle to associate them as defining characteristics of the breed.
Lack of Global Feature Integration:
CNNs excel at local stacking of features, but they’re less adept at combining information from distant regions.
Example: To distinguish a Siberian Husky from an Alaskan Malamute, it's not just about one feature; it's about the combination of ear shape, facial proportions, tail posture, and body size. CNNs often struggle to consider these elements holistically.
1.2 The Strength of Transformers: Global Awareness, But Less Precise
Transformers are like master strategists with a bird's-eye view: they quickly spot patterns but aren't great at filling in the fine details.
Capturing Global Context
Thanks to their self-attention mechanism, Transformers can directly link any two features in an image, no matter how far apart they are.
Dynamic Attention Weighting
Unlike CNNs’ fixed kernels, Transformers dynamically allocate focus based on context.
Example: When identifying a Poodle, the model may prioritize fur texture; when it sees a Bulldog, it might focus more on facial structure.
But Transformers also have two major drawbacks:
High Computational Cost:
Self-attention has a time complexity of O(n²). As image resolution increases, so does the cost—making training more intensive.
Weak at Capturing Fine Details:
Transformers lack CNNs’ “built-in intuition” that nearby pixels are usually related.
Example: On their own, Transformers might miss subtle differences in fur texture or eye shape, details that are crucial for distinguishing visually similar breeds.
1.3 Why a Hybrid Architecture Is Necessary
Let's take a real-world case:
How do you distinguish a Golden Retriever from a Labrador Retriever?
They’re both beloved family dogs with similar size and temperament. But experts can easily tell them apart by observing:
- Golden Retrievers have long, dense coats ranging from golden to dark gold, more elongated heads, and distinct feathering around ears, legs, and tails.
- Labradors, on the other hand, have short, double-layered coats, more compact bodies, rounder heads, and thick otter-like tails. Their coats come in yellow, chocolate, or black.
Interestingly, for humans this distinction is relatively easy: "long hair vs. short hair" might be all you need.
But for AI, relying solely on coat length (a texture-based feature) is often unreliable. Lighting, image quality, or even a trimmed Golden Retriever can confuse the model.
When analyzing this challenge, we can see…
The problem with using only CNNs:
- While CNNs can detect individual features like “coat length” or “tail shape,” they struggle with combinations like “head shape + fur type + body structure.” This issue worsens when the dog is in a different pose.
The problem with using only Transformers:
- Transformers can associate features across the image, but they’re not great at picking up fine-grained cues like slight variations in fur texture or subtle head contours. They also require large datasets to achieve expert-level performance.
- Plus, their computational cost increases sharply with image resolution, slowing down training.
These limitations highlight a core truth:
Fine-grained visual recognition requires both local detail extraction and global relationship modeling.
A truly expert system like a veterinarian or show judge must inspect features up close while understanding the overall structure. That’s exactly where hybrid architectures shine.
1.4 The Advantages of a Hybrid Architecture
This is why we need hybrid architectures, systems that combine CNNs' precision in local features with Transformers' ability to model global relationships:
- CNNs: Extract local, fine-grained features like fur texture and ear shape, crucial for spotting subtle differences.
- Transformers: Capture long-range dependencies (e.g., head shape + body size + eye color), allowing the model to reason holistically.
- Morphological Feature Extractors: Mimic human expert judgment by emphasizing diagnostic features, bridging the gap left by data-driven models.
Such an architecture not only boosts evaluation metrics like the F1 Score, but more importantly, it enables the AI to genuinely understand the subtle distinctions between breeds, getting closer to the way human experts think. The model learns to weigh multiple features together, instead of over-relying on one or two unstable cues.
In the next section, I’ll dive into how I actually built this hybrid architecture, especially how I selected and integrated the right components.
2. Why I Chose ConvNextV2: Key Innovations Behind the Backbone
Among the many visual recognition architectures available, why did I choose ConvNextV2 as the backbone of my project?
Because its design effectively combines the best of both worlds: the CNN’s ability to extract precise local features, and the Transformer’s strength in capturing long-range dependencies.
Let’s break down three core innovations that made it the right fit.
2.1 FCMAE Self-Supervised Learning: Adaptive Learning Inspired by the Human Brain
Imagine learning to navigate with your eyes covered: your brain becomes laser-focused on memorizing the details you can perceive.
ConvNextV2 uses a self-supervised pretraining strategy similar to that of Vision Transformers.
During pretraining, about 60% of the input is intentionally masked out in patch-sized blocks, and the model must learn to reconstruct the missing regions.
This “make learning harder on purpose” approach actually leads to three major benefits:
- Comprehensive Feature Learning
The model learns the underlying structure and patterns of an image, not just the most obvious visual cues. In the context of breed classification, this means it pays attention to fur texture, skeletal structure, and body proportions, instead of relying solely on color or shape.
- Reduced Dependence on Labeled Data
By pretraining on unlabeled dog images, the model develops strong visual representations. Later, with just a small amount of labeled data, it can fine-tune effectively, saving significant annotation effort.
- Improved Recognition of Rare Patterns
The reconstruction task pushes the model to learn generalized visual rules, enhancing its ability to identify rare or underrepresented breeds.
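To make the masking idea concrete, here's a generic masked-image-modeling sketch: randomly mask roughly 60% of patch-sized blocks and penalize reconstruction only on the masked regions. This illustrates the general technique, not ConvNextV2's exact FCMAE implementation; the patch size and helper names are assumptions.

import torch
import torch.nn.functional as F

def random_patch_mask(images, patch_size=32, mask_ratio=0.6):
    """Return a pixel-level binary mask (1 = masked) built from random patches."""
    B, _, H, W = images.shape
    num_patches = (H // patch_size) * (W // patch_size)
    num_masked = int(num_patches * mask_ratio)
    ids = torch.rand(B, num_patches, device=images.device).argsort(dim=1)
    mask = torch.zeros(B, num_patches, device=images.device)
    mask.scatter_(1, ids[:, :num_masked], 1.0)            # mark ~60% of patches as masked
    mask = mask.view(B, 1, H // patch_size, W // patch_size)
    return F.interpolate(mask, scale_factor=patch_size, mode="nearest")

def masked_reconstruction_loss(reconstruction, images, mask):
    # Only masked regions contribute to the loss, as in masked autoencoders
    return ((reconstruction - images) ** 2 * mask).sum() / mask.sum().clamp(min=1)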
2.2 GRN Global Calibration: Mimicking an Expert’s Attention
Like a seasoned photographer who adjusts the exposure of each element to highlight what truly matters.
GRN (Global Response Normalization) is arguably the most impactful innovation in ConvNextV2, giving CNNs a degree of global awareness that was previously lacking:
- Dynamic Feature Recalibration
GRN globally normalizes the feature map, amplifying the most discriminative signals while suppressing irrelevant ones. For instance, when identifying a German Shepherd, it emphasizes the upright ears and sloped back while minimizing background noise.
- Enhanced Sensitivity to Subtle Differences
This normalization sharpens feature contrast, making it easier to spot fine-grained differences, which is critical for telling apart breeds like the Siberian Husky and Alaskan Malamute.
- Focus on Diagnostic Features
GRN helps the model prioritize features that truly matter for classification, rather than relying on statistically correlated but causally irrelevant cues.
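For reference, here's a minimal GRN sketch based on the formulation described in the ConvNeXt V2 paper: aggregate a global response per channel with an L2 norm, divide by the mean response across channels, then apply a learnable affine with a residual path. It assumes a channels-last layout (N, H, W, C), as in the official implementation.

import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization over a channels-last feature map (N, H, W, C)."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)   # global response per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)     # compare against the channel mean
        return self.gamma * (x * nx) + self.beta + x         # recalibrate, keep a residual path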
2.3 Sparse and Efficient Convolutions: More with Less
Like a streamlined team where each member plays to their strengths, reducing redundancy while boosting performance.
ConvNextV2 incorporates architectural optimizations such as depthwise separable convolutions and sparse connections, resulting in three major gains:
- Improved Computational Efficiency
By breaking down convolutions into smaller, more efficient steps, the model reduces its computational load. This allows it to process high-resolution dog images and detect fine visual differences without requiring excessive resources.
- Expanded Effective Receptive Field
The layout of convolutions is designed to extend the model's field of view, helping it analyze both overall body structure and local details simultaneously.
- Parameter Efficiency
The architecture ensures that each parameter carries more learning capacity, extracting richer, more nuanced information using the same amount of compute.
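To illustrate the first point, here's a generic depthwise-separable convolution block in PyTorch. It shows the factorization technique itself, not ConvNextV2's exact block, which additionally uses a large 7x7 depthwise kernel, LayerNorm, an inverted bottleneck, and GRN.

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Factorizes a standard convolution into a per-channel (depthwise) convolution
    followed by a 1x1 (pointwise) convolution, substantially cutting parameters and
    FLOPs compared to a full convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_ch)        # one filter per input channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))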
2.4 Why ConvNextV2 Was the Right Fit for a Hybrid Architecture
ConvNextV2 turned out to be the perfect backbone for this hybrid system, not just because of its performance, but because it embodies the very philosophy of fusion.
It retains the local precision of CNNs while adopting key design concepts from Transformers to expand its global awareness. This duality makes it a natural bridge between CNNs and Transformers, capable of preserving fine-grained details while understanding the broader context.
It also lays the groundwork for additional modules like multi-head attention and morphological feature extractors, ensuring the model starts with a complete, balanced feature set.
In short, ConvNextV2 doesn't just "see the parts"; it starts to understand how the parts come together. And in a task like dog breed classification, where both minute differences and overall structure matter, this kind of foundation is what transforms an ordinary model into one that can reason like an expert.
3. Technical Implementation of the MultiHeadAttention Mechanism
In neural networks, the core concept of the attention mechanism is to enable models to “focus” on key parts of the input, similar to how human experts consciously focus on specific features (such as ear shape, muzzle length, tail posture) when identifying dog breeds.
The Multi-Head Attention (MHA) mechanism further enhances this ability:
“Rather than having one expert evaluate all features, it’s better to form a panel of experts, letting each focus on different details, and then synthesize a final judgment!”
Mathematically, MHA uses multiple linear projections to allow the model to simultaneously learn different feature associations, further enhancing performance.
3.1 Understanding MultiHeadAttention from a Mathematical Perspective
The core idea of MultiHeadAttention is to use multiple different projections to allow the model to simultaneously attend to patterns in different subspaces. Mathematically, it first projects input features into three roles: Query, Key, and Value, then calculates the similarity between Query (Q) and Key (K), and uses this similarity to perform weighted averaging of Values.
The basic formula can be expressed as:
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]
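As a quick sanity check, the formula translates directly into a few lines of PyTorch; this is a generic, standalone illustration rather than the project's code.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v                              # weighted average of the values

q = k = v = torch.randn(2, 5, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])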
3.2 Application of Einstein Summation Convention in Attention Calculation
In the implementation, I used the torch.einsum function, based on the Einstein summation convention, to efficiently calculate attention scores:

energy = torch.einsum("nqd,nkd->nqk", [q, k])

This means:
- q has shape (batch_size, num_heads, head_dim)
- k has shape (batch_size, num_heads, head_dim)
- The dot product is taken over the shared dimension d, producing an attention score matrix of shape (batch_size, num_heads, num_heads); in this design, each head's projected vector acts as both a query and a key.

This is essentially "calculating the similarity between each query and all keys," generating an attention weight matrix.
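If the einsum notation is unfamiliar, the same computation can be written as a batched matrix multiply; the two versions below produce identical results (a standalone check with random tensors):

import torch

N, num_heads, head_dim = 4, 8, 64
q = torch.randn(N, num_heads, head_dim)
k = torch.randn(N, num_heads, head_dim)

energy_einsum = torch.einsum("nqd,nkd->nqk", [q, k])
energy_bmm = torch.bmm(q, k.transpose(1, 2))        # same computation, written explicitly

print(torch.allclose(energy_einsum, energy_bmm))    # True
print(energy_einsum.shape)                          # torch.Size([4, 8, 8])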
3.3 Implementation Code Analysis
Key implementation code for MultiHeadAttention:
def forward(self, x):
    N = x.shape[0]  # batch size

    # 1. Project input, prepare for multi-head attention calculation
    x = self.fc_in(x)  # (N, input_dim) -> (N, scaled_dim)

    # 2. Calculate Query, Key, Value, and reshape into multi-head form
    q = self.query(x).view(N, self.num_heads, self.head_dim)  # query
    k = self.key(x).view(N, self.num_heads, self.head_dim)    # key
    v = self.value(x).view(N, self.num_heads, self.head_dim)  # value

    # 3. Calculate attention scores (similarity matrix)
    energy = torch.einsum("nqd,nkd->nqk", [q, k])

    # 4. Scale, then apply softmax to normalize the weights
    attention = F.softmax(energy / (self.head_dim ** 0.5), dim=2)

    # 5. Use attention weights to perform a weighted sum over the values
    #    (note the shared index k between the attention weights and v)
    out = torch.einsum("nqk,nkd->nqd", [attention, v])

    # 6. Rearrange output and pass through the final linear layer
    out = out.reshape(N, self.scaled_dim)
    out = self.fc_out(out)
    return out
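The article only shows forward(); for completeness, here's one way the constructor could look, inferred from the attribute names used above. The dimension choices are assumptions for illustration, not the project's exact values.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, input_dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = max(1, input_dim // num_heads)   # per-head feature size (assumed)
        self.scaled_dim = self.head_dim * num_heads      # working dimension divisible by num_heads

        self.fc_in = nn.Linear(input_dim, self.scaled_dim)        # shared input projection
        self.query = nn.Linear(self.scaled_dim, self.scaled_dim)
        self.key = nn.Linear(self.scaled_dim, self.scaled_dim)
        self.value = nn.Linear(self.scaled_dim, self.scaled_dim)
        self.fc_out = nn.Linear(self.scaled_dim, input_dim)       # project back to input_dim (assumed)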
3.3.1. Steps 1-2: Projection and Multi-Head Splitting
First, input features are projected through a linear layer, and then separately projected into query, key, and value spaces. Importantly, these projections not only change the feature representation but also split them into multiple “heads,” each attending to different feature subspaces.
3.3.2. Steps 3-4: Attention Calculation
The attention scores come from the einsum shown earlier: each query vector is compared against every key vector via a dot product, the result is scaled by the square root of the head dimension to keep the values numerically stable, and softmax then converts the scores into weights that sum to one for each query.
3.3.3. Steps 5-6: Weighted Aggregation and Output Projection
Using the calculated attention weights, weighted summation is performed on the value vectors to obtain the attended feature representation. Finally, outputs from all heads are concatenated and passed through an output projection layer to get the final result.
This implementation has the following simplifications and adjustments compared to standard Transformer MultiHeadAttention:
- Query, key, and value come from the same input (self-attention), which suits features obtained from the CNN backbone.
- It uses einsum operations to simplify the matrix calculations.
- The projection layers are designed for dimensional consistency, facilitating integration with other modules.
3.4 How Attention Mechanisms Enhance Understanding of Morphological Feature Relationships
The multi-head attention mechanism brings three core advantages to dog breed recognition:
3.4.1. Feature Relationship Modeling
Just as a professional veterinarian not only sees that ears are upright but also notices how this combines with tail curl degree and skull shape to form a dog breed’s “feature combination.”
It can establish associations between different morphological features, capturing their synergistic relationships, not just seeing “what features exist” but observing “how these features combine.”
Application: The model can learn that a combination of “pointed ears + curled tail + medium build” points to specific Northern dog breeds.
3.4.2. Dynamic Feature Importance Assessment
Just as experts know to focus particularly on fur texture when identifying Poodles, while focusing mainly on the distinctive nose and head structure when identifying Bulldogs.
It dynamically adjusts focus on different features based on the specific content of the input.
Key features vary across different breeds, and the attention mechanism can adaptively focus.
Application: When seeing a Border Collie, the model might focus more on fur color distribution; when seeing a Dachshund, it might focus more on body proportions.
3.4.3. Complementary Information Integration
Like a team of experts with different specializations, one focusing on skeletal structure, another on fur features, another analyzing behavioral posture, making a more comprehensive judgment together.
Through multiple attention heads, each simultaneously captures different types of feature relationships. Each head can focus on a specific type of feature or relationship pattern.
Application: One head might primarily focus on color patterns, another on body proportions, and yet another on facial features, ultimately synthesizing these perspectives to make a judgment.
By combining these three capabilities, the MultiHeadAttention mechanism goes beyond identifying individual features, it learns to model the complex relationships between them, capturing subtle patterns that emerge from their combinations and enabling more accurate recognition.
4. Implementation Details of the Hybrid Architecture
4.1 The Overall Architectural Flow
When designing this hybrid architecture, my goal was simple yet ambitious:
Let each component do what it does best, and build a complementary system where they enhance one another.
Much like a well-orchestrated symphony, each instrument (or module) plays its role; only together can they create harmony.
In this setup:
- The CNN focuses on capturing local details.
- The morphological feature extractor enhances key structural features.
- The multi-head attention module learns how these features interact.
As shown in the diagram above, the overall model operates through five key stages:
4.1.1. Feature Extraction
Once an image enters the model, ConvNextV2 takes charge of extracting foundational features, such as fur color, contours, and texture. This is where the AI begins to “see” the basic shape and appearance of the dog.
4.1.2. Morphological Feature Enhancement
These initial features are then refined by the morphological feature extractor. This module functions like an expert’s eye—highlighting structural characteristics such as ear shape and body proportions. Here, the AI learns to focus on what actually matters.
4.1.3. Feature Fusion
Next comes the feature fusion layer, which merges the local features with the enhanced morphological cues. But this isn't just a simple concatenation; the layer also models how these features interact, ensuring the AI doesn't treat them in isolation but rather understands how they combine to convey meaning.
4.1.4. Feature Relationship Modeling
The fused features are passed into the multi-head attention module, which builds contextual relationships between different attributes. The model begins to understand combinations like “ear shape + fur texture + facial proportions” rather than looking at each trait independently.
4.1.5. Final Classification
After all these layers of processing, the model moves to its final classifier, where it makes a prediction about the dog’s breed, based on the rich, integrated understanding it has developed.
4.2 Integrating ConvNextV2 and Parameter Setup
For implementation, I chose the pretrained ConvNextV2-base model as the backbone:
self.backbone = timm.create_model(
    'convnextv2_base',
    pretrained=True,
    num_classes=0)  # Use only the feature extractor; remove the original classification head
Depending on the input image size or backbone architecture, the feature output dimensions may vary. To build a robust and flexible system, I designed a dynamic feature dimension detection mechanism:
with torch.no_grad():
    dummy_input = torch.randn(1, 3, 224, 224)
    features = self.backbone(dummy_input)
    if len(features.shape) > 2:
        features = features.mean([-2, -1])  # Global average pooling to produce a 1D feature vector
    self.feature_dim = features.shape[1]
This ensures the system automatically adapts to any feature shape changes, keeping all downstream components functioning properly.
4.3 Intelligent Configuration of the Multi-Head Attention Layer
As mentioned earlier, I experimented with several head counts. Too many heads increased computation and risked overfitting. I ultimately settled on eight, but allowed the number of heads to adjust automatically based on feature dimensions:
self.num_heads = max(1, min(8, self.feature_dim // 64))
self.attention = MultiHeadAttention(self.feature_dim, num_heads=self.num_heads)
4.4 Making CNN, Transformers, and Morphological Features Work Together
The morphological feature extractor works hand-in-hand with the attention mechanism.
While the former provides structured representations of key traits, the latter models relationships among these features:
# Feature fusion
combined_features = torch.cat([
    features,                           # Base features
    morphological_features,             # Morphological features
    features * morphological_features   # Interaction between features
], dim=1)
fused_features = self.feature_fusion(combined_features)

# Apply attention
attended_features = self.attention(fused_features)

# Final classification
logits = self.classifier(attended_features)
return logits, attended_features
A special note about the third component, features * morphological_features: this isn't just a mathematical multiplication. It creates a form of dialogue between the two feature sets, allowing them to influence each other and generate richer representations.
For example, suppose the model picks up “pointy ears” from the base features, while the morphological module detects a “small head-to-body ratio.”
Individually, these may not be conclusive, but their interaction may strongly suggest a specific breed, like a Corgi or Finnish Spitz. It’s no longer just about recognizing ears or head size, the model learns to interpret how features work together, much like an expert would.
This full pipeline from feature extraction, through morphological enhancement and attention-driven modeling, to prediction is my vision of what an ideal architecture should look like.
The design has several key advantages:
- The morphological extractor brings structured, expert-inspired understanding.
- The multi-head attention uncovers contextual relationships between traits.
- The feature fusion layer captures nonlinear interactions through element-wise multiplication.
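The feature_fusion layer itself isn't shown in the article; as a hypothetical sketch, it could be a small projection block over the concatenated vector, which is three times the backbone's feature dimension. The layer sizes, activation, and dropout rate below are illustrative assumptions, not the project's actual configuration.

import torch.nn as nn

# self.feature_dim is the backbone dimension detected earlier (assumed available here)
self.feature_fusion = nn.Sequential(
    nn.Linear(self.feature_dim * 3, self.feature_dim),  # base + morphological + interaction
    nn.LayerNorm(self.feature_dim),
    nn.GELU(),
    nn.Dropout(0.3),
)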
4.5 Technical Challenges and How I Solved Them
Building a hybrid architecture like this was far from smooth sailing.
Here are several challenges I faced and how solving them helped me improve the overall design:
4.5.1. Mismatched Feature Dimensions
- Challenge: Output sizes varied across modules, especially when switching backbone networks.
- Solution: In addition to the dynamic dimension detection mentioned earlier, I implemented adaptive projection layers to unify the feature dimensions.
4.5.2. Balancing Performance and Efficiency
- Challenge: More complexity meant more computation.
- Solution: I dynamically adjusted the number of attention heads and used efficient einsum operations to optimize performance.
4.5.3. Overfitting Risk
- Challenge: Hybrid models are more prone to overfitting, especially with smaller training sets.
- Solution: I applied LayerNorm, Dropout, and weight decay for regularization.
4.5.4. Gradient Flow Issues
- Challenge: Deep architectures often suffer from vanishing or exploding gradients.
- Solution: I introduced residual connections to ensure gradients flow smoothly during both forward and backward passes.
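As a generic illustration of that last point (not the project's exact code), a residual connection can be added by wrapping any sub-module so its input bypasses it directly, giving gradients a clean path during backpropagation:

import torch.nn as nn

class Residual(nn.Module):
    """Wraps a module f and computes x + f(x), preserving gradient flow."""
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        return x + self.module(x)  # the identity path lets gradients flow unchanged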
If you’re interested in exploring the full implementation, feel free to check out the GitHub project here.
5. Performance Evaluation and Heatmap Analysis
The value of a hybrid architecture lies not only in its quantitative performance but also in how it qualitatively “thinks.”
In this section, we’ll use confidence score statistics and heatmap analysis to demonstrate how the model evolved from CNN → CNN+Transformer → CNN+Transformer+MFE, and how each stage brought its visual reasoning closer to that of a human expert.
To ensure that the performance differences came purely from architecture design, I retrained each model using the exact same dataset, augmentation methods, loss function, and training parameters. The only variation was the presence or absence of the Transformer and morphological modules.
In terms of F1 score, the CNN-only model reached 87.83%, the CNN+Transformer variant performed slightly better at 89.48%, and the final hybrid model scored 88.70%. While the CNN+Transformer variant showed the highest score on paper, that didn't always translate into more reliable predictions. In practice, the hybrid model was more consistent and handled similar-looking or blurry cases more reliably.
5.1 Confidence Scores and Statistical Insights
I tested 17 images of Border Collies, including standard photos, artistic illustrations, and various camera angles, to thoroughly assess the three architectures.
While other breeds were also included in the broader evaluation, I chose Border Collie as a representative case due to its distinctive features and frequent confusion with similar breeds.
Figure 1: Model Confidence Score Comparison
As shown above, there are clear performance differences across the three models.
A notable example is Sample #3, where the CNN-only model misclassified the Border Collie as a Collie, with a low confidence score of 0.2492.
While the CNN+Transformer corrected this error, it introduced a new one in Sample #5, misidentifying it as a Shiba Inu with 0.2305 confidence.
The final CNN+Transformer+MFE model correctly identified all samples without error. What’s interesting here is that both misclassifications occurred at low confidence levels (below 0.25).
This suggests that even when the model makes a mistake, it retains a sense of uncertainty—a desirable trait in real world applications. We want models to be cautious when unsure, rather than confidently wrong.
Figure 2: Confidence Score Distribution
Looking at the distribution of confidence scores, the improvement becomes even more evident.
The CNN-only model mostly predicted in the 0.4–0.5 range, with few samples reaching beyond 0.6.
CNN+Transformer showed better concentration around 0.5–0.6, but still had only one sample in the 0.7–0.8 high-confidence range.
The CNN+Transformer+MFE model stood out with 6 samples reaching the 0.7–0.8 confidence level.
This rightward shift in distribution reveals more than just accuracy, it reflects certainty.
The model is evolving from “barely correct” to “confidently correct,” which significantly enhances its reliability in real-world deployment.
Figure 3: Statistical Summary of Model Performance
A deeper statistical breakdown highlights consistent improvements:
Mean confidence score rose from 0.4639 (CNN) to 0.5245 (CNN+Transformer), and finally 0.6122 with the full hybrid setup—a 31.9% increase overall.
Median score jumped from 0.4665 to 0.6827, confirming the overall shift toward higher confidence.
The proportion of high-confidence predictions (≥ 0.5) also showed striking gains:
- CNN: 41.18%
- CNN+Transformer: 64.71%
- CNN+Transformer+MFE: 82.35%
This means that with the final architecture, most predictions are not only correct but confidently correct.
You might notice a slight increase in standard deviation (from 0.1237 to 0.1616), which might seem like a negative at first. But in reality, this reflects a more nuanced response to input complexity:
The model is highly confident on easier samples, and appropriately cautious on harder ones. The improvement in maximum confidence value (from 0.6343 to 0.7746) further shows how this hybrid architecture can make more decisive and assured judgments when presented with straightforward samples.
5.2 Heatmap Analysis: Tracing the Evolution of Model Reasoning
While statistical metrics are helpful, they don’t tell the full story.
To truly understand how the model makes decisions, we need to see what it sees and heatmaps make this possible.
In these heatmaps, red indicates areas of high attention, highlighting the regions the model relies on most during prediction. By analyzing these attention maps, we can observe how each model interprets visual information, revealing fundamental differences in their reasoning styles.
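The article doesn't specify how the heatmaps were generated; a common approach for CNN-based models is Grad-CAM, and the sketch below shows that general technique using hooks on a chosen convolutional layer. It is a generic illustration under that assumption, and it also handles models (like the hybrid described above) that return a (logits, features) tuple.

import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by the gradient
    of the chosen class score, average over channels, and upsample to image size."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    try:
        output = model(image)                                   # image: (1, 3, H, W)
        logits = output[0] if isinstance(output, tuple) else output
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()

        acts, grads = activations[0], gradients[0]              # (1, C, h, w)
        weights = grads.mean(dim=(2, 3), keepdim=True)          # per-channel importance
        cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8)).squeeze().detach()
    finally:
        h1.remove()
        h2.remove()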
Let’s walk through one representative case.
5.2.1 Frontal View of a Border Collie: From Local Eye Focus to Structured Morphological Understanding
When presented with a frontal image of a Border Collie, the three models reveal distinct attention patterns, reflecting how their architectural designs shape visual understanding.
The CNN-only model produces a heatmap with two sharp attention peaks, both centered on the dog’s eyes. This indicates a strong reliance on local features while overlooking other morphological traits like the ears or facial outline. While eyes are indeed important, focusing solely on them makes the model more vulnerable to variations in pose or lighting. The resulting confidence score of 0.5581 reflects this limitation.
With the CNN+Transformer model, the attention becomes more distributed. The heatmap forms a loose M-shaped pattern, extending beyond the eyes to include the forehead and the space between the eyes. This shift suggests that the model begins to understand spatial relationships between features, not just the features themselves. This added contextual awareness leads to a stronger confidence score of 0.6559.
The CNN+Transformer+MFE model shows the most structured and comprehensive attention map. The heat is symmetrically distributed across the eyes, ears, and the broader facial region. This indicates that the model has moved beyond feature detection and is now capturing how features are arranged as part of a meaningful whole. The Morphological Feature Extractor plays a key role here, helping the model grasp the structural signature of the breed. This deeper understanding boosts the confidence to 0.6972.
Together, these three heatmaps represent a clear progression in visual reasoning, from isolated feature detection, to inter-feature context, and finally to structural interpretation. Even though ConvNeXtV2 is already a powerful backbone, adding Transformer and MFE modules enables the model to not just see features but to understand them as part of a coherent morphological pattern. This shift is subtle but crucial, especially for fine-grained tasks like breed classification.
5.2.2 Error Case Analysis: From Misclassification to True Understanding



This is a case where the CNN-only model misclassified a Border Collie.
Looking at the heatmap, we can see why. The model focuses almost entirely on a single eye, ignoring most of the face. This kind of over-reliance on one local feature makes it easy to confuse breeds that share similar traits, in this case a Collie, which also has a similar eye shape and color contrast.
What the model misses are the broader facial proportions and structural details that define a Border Collie. Its low confidence score of 0.2492 reflects that uncertainty.
With the CNN+Transformer model, attention shifts in a more promising direction. It now covers both eyes and parts of the forehead, creating a more balanced attention pattern. This suggests the model is beginning to connect multiple features, rather than depending on just one.
Thanks to self-attention, it can better interpret relationships between facial components, leading to the correct prediction — Border Collie. The confidence score rises to 0.5484, more than double the previous model’s.
The CNN+Transformer+MFE model takes this further by improving morphological awareness. The heatmap now extends to the nose and muzzle, capturing nuanced traits like facial length and mouth shape. These are subtle but important cues that help distinguish herding breeds from one another.
The MFE module seems to guide the model toward structural combinations, not just isolated features. As a result, confidence increases again to 0.5693, showing a more stable, breed-specific understanding.
This progression from a narrow focus on a single eye, to integrating facial traits, and finally to interpreting structural morphology, highlights how hybrid models support more accurate and generalizable visual reasoning.
In this example, the CNN-only model focuses almost entirely on one side of the dog’s face. The rest of the image is nearly ignored. This kind of narrow attention suggests the model didn’t have enough visual context to make a strong decision. It guessed correctly this time, but with a low confidence score of 0.2238, it’s clear that the prediction wasn’t based on solid reasoning.
The CNN+Transformer model shows a broader attention span, but it introduces a different issue, the heatmap becomes scattered. You can even spot a strong attention spike on the far right, completely unrelated to the dog. This kind of misplaced focus likely led to a misclassification as a Shiba Inu, and the confidence score was still low at 0.2305.
This highlights an important point:
Adding a Transformer doesn’t guarantee better judgment unless the model learns where to look. Without guidance, self-attention can amplify the wrong signals and create confusion rather than clarity.
With the CNN+Transformer+MFE model, the attention becomes more focused and structured. The model now looks at key regions like the eyes, nose, and chest, building a more meaningful understanding of the image. But even here, the confidence remains low at 0.1835, despite the correct prediction. This image clearly presented a real challenge for all three models.
That’s what makes this case so interesting.
It reminds us that a correct prediction doesn't always mean the model was confident. In harder scenarios (unusual poses, subtle features, cluttered backgrounds), even the most advanced models can hesitate.
And that’s where confidence scores become invaluable.
They help flag uncertain cases, making it easier to design review pipelines where human experts can step in and verify tricky predictions.
5.2.3 Recognizing Artistic Renderings: Testing the Limits of Generalization



Artistic images pose a unique challenge for visual recognition systems. Unlike standard photos with crisp textures and clear lighting, painted artworks are often abstract and distorted. This forces models to rely less on superficial cues and more on deeper, structural understanding. In that sense, they serve as a perfect stress test for generalization.
Let’s see how the three models handle this scenario.
Starting with the CNN-only model, the attention map is scattered, with focus diffused across both sides of the image. There’s no clear structure — just a vague attempt to “see everything,” which usually means the model is unsure what to focus on. That uncertainty is reflected in its confidence score of 0.5394, sitting in the lower-mid range. The model makes the correct guess, but it’s far from confident.
Next, the CNN+Transformer model shows a clear improvement. Its attention sharpens and clusters around more meaningful regions, particularly near the eyes and ears. Even with the stylized brushstrokes, the model seems to infer, “this could be an ear” or “that looks like the facial outline.” It’s starting to map anatomical cues, not just visual textures. The confidence score rises to 0.6977, suggesting a more structured understanding is taking shape.
Finally, we look at the CNN+Transformer+MFE hybrid model. This one locks in with precision. The heatmap centers tightly on the intersection of the eyes and nose — arguably the most distinctive and stable region for identifying a Border Collie, even in abstract form. It’s no longer guessing based on appearance. It’s reading the dog’s underlying structure.
This leap is largely thanks to the MFE, which helps the model focus on features that persist, even when style or detail varies. The result? A confident score of 0.7457, the highest among all three.
This experiment makes something clear:
Hybrid models don't just get better at recognition; they get better at reasoning.
They learn to look past visual noise and focus on what matters most: structure, proportion, and pattern. And that’s what makes them reliable, especially in the unpredictable, messy real world of images.
Conclusion
As deep learning evolves, we’ve moved from CNNs to Transformers—and now toward hybrid architectures that combine the best of both. This shift reflects a broader change in AI design philosophy: from seeking purity to embracing fusion.
Think of it like cooking. Great chefs don’t insist on one technique. They mix sautéing, boiling, and frying depending on the ingredient. Similarly, hybrid models combine different architectural “flavors” to suit the task at hand.
This fusion design offers several key benefits:
- Complementary strengths: Like combining a microscope and a telescope, hybrid models capture both fine details and global context.
- Structured understanding: Morphological feature extractors bring expert-level domain insights, allowing models not just to see, but to truly understand.
- Dynamic adaptability: Future models might adjust internal attention patterns based on the image, emphasizing texture for spotted breeds, or structure for solid-colored ones.
- Wider applicability: From medical imaging to biodiversity and art authentication, any task involving fine-grained visual distinctions can benefit from this approach.
This visual system, blending ConvNeXtV2, attention mechanisms, and morphological reasoning, proves that accuracy and intelligence don't come from any single architecture, but from the right combination of ideas.
Perhaps the future of AI won’t rely on one perfect design, but on learning to combine cognitive strategies just as the human brain does.
References & Data Source
Research References
- Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
- Dosovitskiy, A., et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
- Liu, Z., et al. (2022). ConvNeXt: A ConvNet for the 2020s. CVPR 2022.
- Woo, S., et al. (2023). ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. CVPR 2023.
- Rocktäschel, T. (2018). Einstein Summation Notation Explained Visually. rockt.github.io
- PyTorch documentation: torch.einsum
Dataset Sources
- Stanford Dogs Dataset – Kaggle Dataset
Originally sourced from Stanford Vision Lab – ImageNet Dogs.
License: Non-commercial research and educational use only.
Citation: Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel Dataset for Fine-Grained Image Categorization. FGVC Workshop, CVPR, 2011.
- Unsplash Images – Additional images of four breeds (Bichon Frise, Dachshund, Shiba Inu, Havanese) were sourced from Unsplash for dataset augmentation.
Thank you for reading. Through developing PawMatchAI, I’ve learned many valuable lessons about AI vision systems and feature recognition. If you have any perspectives or topics you’d like to discuss, I welcome the opportunity to exchange ideas. Email
GitHub
Disclaimer
The methods and approaches described in this article are based on my personal research and experimental findings. While the Hybrid Architecture has demonstrated improvements in specific scenarios, its performance may vary depending on datasets, implementation details, and training conditions.
This article is intended for educational and informational purposes only. Readers should conduct independent evaluations and adapt the approach based on their specific use cases. No guarantees are made regarding its effectiveness across all applications.