The standard ViT (Vision Transformer) model is poorly suited to multi-level image classification. To address this, this study proposes HICViT (Hierarchical Feature Fusion Vision Transformer), a ViT-based model for hierarchical image classification. The input is passed through a ViT feature extraction module to generate multiple feature maps, each carrying an abstract feature representation at a different level. Guided by the hierarchical labels, the features extracted by ViT are mapped to level-specific features, and a HIC method fuses the features across levels to improve the classification performance of the model. The proposed model is compared with a variety of advanced deep learning models on three datasets: CIFAR-10, CIFAR-100, and CUB-200-2011. On CIFAR-10, the proposed method achieves classification accuracies of 99.70%, 98.80%, and 97.80% at the first, second, and third levels, respectively. On CIFAR-100, the accuracies at the first, second, and third levels are 95.23%, 93.54%, and 90.12%, respectively. On CUB-200-2011, the accuracies at the first and second levels are 98.09% and 93.66%, respectively. These results indicate that the proposed model achieves higher classification accuracy than the compared models.
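
The level-wise mapping and fusion described above can be illustrated with a minimal sketch. The abstract does not specify how the HIC method fuses features, so the code below assumes one plausible scheme: each level projects the backbone feature vector, and the features of the preceding level are concatenated into the input of the next. The class name `HierarchicalHead`, its parameters, and the class counts per level are hypothetical and not taken from the paper; the backbone feature is a placeholder vector rather than the output of a real ViT.

```python
import random


def init_weights(d_in, d_out, rng):
    """Small random weight matrix with d_in rows and d_out columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]


def linear(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


class HierarchicalHead:
    """Hypothetical multi-level head, NOT the paper's HIC method.

    Assumption: one projection and one classifier per label level, with the
    previous level's features concatenated into the next level's input.
    """

    def __init__(self, d_backbone, d_level, classes_per_level, seed=0):
        rng = random.Random(seed)
        self.proj, self.cls = [], []
        for i, n_classes in enumerate(classes_per_level):
            # Levels after the first also receive the previous level's features.
            d_in = d_backbone + (d_level if i > 0 else 0)
            self.proj.append(init_weights(d_in, d_level, rng))
            self.cls.append(init_weights(d_level, n_classes, rng))

    def forward(self, feat):
        """Return one list of logits per level for a single feature vector."""
        logits, prev = [], []
        for proj, cls in zip(self.proj, self.cls):
            fused = feat + prev  # fusion by concatenation (assumed)
            level_feat = [max(0.0, v) for v in linear(fused, proj)]  # ReLU
            logits.append(linear(level_feat, cls))
            prev = level_feat
        return logits


# Illustrative class counts for a three-level hierarchy (assumed values).
head = HierarchicalHead(d_backbone=8, d_level=4, classes_per_level=[2, 7, 10])
level_logits = head.forward([0.1] * 8)  # placeholder for a ViT feature vector
print([len(l) for l in level_logits])
```

In training, a scheme of this kind would typically sum one cross-entropy loss per level, so that every level of the label hierarchy supervises the shared backbone; whether HICViT weights the levels equally is not stated in the abstract.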