Comparing Vision Transformers and Convolutional Neural Networks: A Systematic Analysis
Abstract
Vision Transformers (ViTs) have emerged as powerful alternatives to Convolutional Neural Networks (CNNs) for image classification, yet systematic comparisons under controlled settings remain limited despite the growing adoption of transformer-based vision models. This article presents a comprehensive evaluation of ViTs and CNNs across identical datasets, training conditions, and computational budgets. Multiple architectures, including ResNet, EfficientNet, ViT-Base, and DeiT, are trained on benchmark datasets such as CIFAR-10 and CIFAR-100 as well as customized real-world datasets, and are evaluated on accuracy, F1-score, training stability, adversarial robustness, and inference latency. Results demonstrate that ViTs outperform CNNs on larger datasets and exhibit superior robustness to noise and perturbations, while CNNs retain advantages on small datasets owing to the strong inductive biases embedded in convolutional architectures. The effective receptive field of deep convolutional networks follows a Gaussian distribution centered on each spatial location, whereas vision transformers learn spatial relationships entirely from data through global self-attention. Dataset scale therefore fundamentally shapes the relative performance of the two architectural families: transformers require substantial training data to discover useful attention patterns, while convolutional networks converge efficiently on smaller datasets through built-in spatial priors. The article identifies the specific conditions under which each architecture holds a clear advantage. These findings deepen the understanding of transformer-based vision models and offer practical guidance for architecture selection in applied machine learning systems.
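
The sketch below illustrates the kind of controlled comparison the abstract describes: one CNN (ResNet-18) and one ViT (ViT-B/16) trained on CIFAR-10 under an identical data pipeline and optimizer, then scored on the same test set. It is a minimal, hypothetical reconstruction using PyTorch/torchvision; the specific model variants, hyperparameters, and one-epoch budget are illustrative assumptions, not the paper's reported configuration, and it reports accuracy only, omitting the F1, robustness, and latency metrics the study uses.

```python
# Hypothetical sketch of a matched-budget ViT-vs-CNN comparison on CIFAR-10.
# Settings (optimizer, learning rate, epochs) are illustrative, not the paper's.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms
from torchvision.models import resnet18, vit_b_16

device = "cuda" if torch.cuda.is_available() else "cpu"

# Identical data pipeline for both architectures (the "controlled setting").
tf = transforms.Compose([transforms.ToTensor(),
                         transforms.Normalize((0.5,) * 3, (0.5,) * 3)])
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=tf)
test_set = torchvision.datasets.CIFAR10("data", train=False, download=True, transform=tf)
train_dl = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_dl = torch.utils.data.DataLoader(test_set, batch_size=256)

def evaluate(model):
    """Top-1 accuracy on the shared test set."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in test_dl:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(1) == y).sum().item()
    return correct / len(test_set)

def train_and_score(model, epochs=1):
    """Train under the same optimizer and epoch budget, then evaluate."""
    model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for x, y in train_dl:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return evaluate(model)

models = {
    "ResNet-18": resnet18(num_classes=10),
    # image_size=32 adapts the ViT patch embedding to CIFAR-sized inputs.
    "ViT-B/16": vit_b_16(image_size=32, num_classes=10),
}
for name, m in models.items():
    print(f"{name}: test accuracy = {train_and_score(m):.3f}")
```

In practice the two families are usually tuned separately (ViTs commonly with AdamW and heavy augmentation, CNNs often with SGD and momentum), so a matched-budget comparison of the kind the abstract describes requires fixing these choices explicitly rather than using each architecture's customary recipe.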
Article information
Journal
Journal of Computer Science and Technology Studies
Volume (Issue)
8 (2)
Pages
19-26
Published
Copyright
Copyright (c) 2026. This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Open access
