Research Article

Comparing Vision Transformers and Convolutional Neural Networks: A Systematic Analysis

Authors

  • Chandrasekar Adhithya Harsha Pasumarthi, Independent Researcher, USA

Abstract

Vision Transformers (ViTs) have emerged as powerful alternatives to Convolutional Neural Networks (CNNs) for image classification. Despite the growing adoption of transformer-based vision models, systematic comparisons under controlled settings remain limited. This article presents a comprehensive evaluation of ViTs and CNNs under identical datasets, training conditions, and computational budgets. Multiple architectures, including ResNet, EfficientNet, ViT-Base, and DeiT, are trained on benchmark datasets such as CIFAR-10 and CIFAR-100 as well as custom real-world datasets. Performance is evaluated in terms of accuracy, F1-score, training stability, adversarial robustness, and inference latency. Results show that ViTs outperform CNNs on larger datasets and are more robust to noise and perturbations, whereas CNNs retain an advantage on small datasets because of the strong inductive biases built into convolutional architectures. The effective receptive field in deep convolutional networks follows a Gaussian-like distribution centered on each spatial location, while vision transformers learn spatial relationships entirely from data through global self-attention. Dataset scale fundamentally determines the relative performance of the two architectural families: transformers require substantial training data to discover effective attention patterns, whereas convolutional networks converge efficiently on smaller datasets thanks to their built-in spatial priors. The article identifies the specific conditions under which each architecture has a clear advantage. The findings contribute to the understanding of transformer-based vision models and offer practical guidance for architecture selection in applied machine learning systems.
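The contrast between local convolutional priors and global self-attention described in the abstract can be illustrated with a toy sketch. This is not the article's experimental code. It assumes a one-dimensional signal, a fixed 3-tap kernel, and attention with identity projections (queries, keys, and values are the tokens themselves), and all names in it are hypothetical. It perturbs the last input element and measures how much the first output changes under each operation.

```python
import math

def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation: each output sees only len(kernel) inputs."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Toy assumption: identity projections, so queries, keys, and values
    are the input tokens themselves (no learned weights).
    """
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
                  for key in tokens]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[c] for w, v in zip(weights, tokens))
                        for c in range(d)])
    return outputs

def as_tokens(values):
    """Treat each scalar as a one-dimensional token."""
    return [[x] for x in values]

signal = [0.1, 0.4, -0.2, 0.3, 0.0, -0.1, 0.2, 0.5]
perturbed = signal[:-1] + [signal[-1] + 1.0]  # change only the last element

kernel = [0.25, 0.5, 0.25]  # hypothetical fixed smoothing kernel

# Change in the FIRST output when the LAST input is perturbed.
conv_delta = abs(conv1d(signal, kernel)[0] - conv1d(perturbed, kernel)[0])
attn_delta = abs(self_attention(as_tokens(signal))[0][0]
                 - self_attention(as_tokens(perturbed))[0][0])

print("conv change at position 0:", conv_delta)
print("attention change at position 0:", attn_delta)
```

In this sketch the first convolution output is unaffected by the distant perturbation, because its window covers only the first three inputs, while the first attention output shifts, because every token contributes to every output. A single convolution layer is local by construction. Attention has no such built-in locality, which is consistent with the abstract's point that transformers must learn spatial structure from data.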

Article information

Journal

Journal of Computer Science and Technology Studies

Volume (Issue)

8 (2)

Pages

19-26

Published

2026-01-28

How to Cite

Chandrasekar Adhithya Harsha Pasumarthi. (2026). Comparing Vision Transformers and Convolutional Neural Networks: A Systematic Analysis. Journal of Computer Science and Technology Studies, 8(2), 19-26. https://doi.org/10.32996/jcsts.2026.8.2.3

Keywords:

Vision Transformers, Convolutional Neural Networks, Image Classification, Deep Learning Architectures, Adversarial Robustness, Transfer Learning