QIANets: Quantum-Integrated Adaptive Networks for Reduced Latency and Improved Inference Times in CNN Models

1Saint Andrews Turi Molo, 2American School in London, 3Orizont Theoretical Lyceum, 4Munich International School

Abstract

Convolutional neural networks (CNNs) have made significant advances in computer vision tasks, yet their high inference times and latency often limit real-world applicability. While model compression techniques have gained popularity as solutions, they often overlook the critical balance between low latency and uncompromised accuracy. By harnessing three quantum-inspired concepts – quantum-inspired pruning, tensor decomposition, and annealing-based matrix factorization – we introduce QIANets: a novel approach that redesigns the traditional GoogLeNet, DenseNet, and ResNet-18 model architectures to process parameters and computations more efficiently while maintaining low inference times. Despite experimental limitations, the method was tested and evaluated, demonstrating reductions in inference times along with effective preservation of accuracy.

Introduction

The field of computer vision (CV) has recently experienced a substantial rise in interest (Su & Crandall, 2021). This surge has created transformative advancements, driving the development of deep learning models, particularly those based on the convolutional architecture, such as DenseNet, GoogLeNet, and ResNet-18. These methods have significantly optimized neural networks for image processing tasks, achieving state-of-the-art performance across multiple benchmarks (Anumol, 2023; He et al., 2016; Szegedy et al., 2015). However, the increasing computational complexity, memory consumption, and model size – often comprising millions to billions of parameters – pose substantial challenges for deployment, especially in time-sensitive and computationally limited scenarios. The demand for low-latency processing in real-time applications, such as image processing and automated CV systems, is critical; compact models are needed for faster responses (Honegger et al., 2014).

To address these issues, researchers have explored various optimization techniques to reduce inference times and latency while maintaining high accuracy. Model compression techniques such as pruning, quantization, and knowledge distillation have shown promise in enhancing model efficiency (Li et al., 2023). Yet, these methods often come with trade-offs that can impact model performance, necessitating a careful balance between energy efficiency and accuracy.

In recent years, the principles of quantum computing have emerged as an avenue for accelerating inference in machine learning (Divya & Peter, 2021). Quantum-inspired methods, which leverage phenomena such as quantum optimization algorithms, strive to maintain model performance while reducing computational requirements, thereby offering significant speedups for certain tasks (Pandey et al., 2023). Meanwhile, traditional model compression techniques reduce the size of neural networks by removing less important weights, sacrificing accuracy for lower latency (Francy & Singh, 2024). By integrating concepts from quantum mechanics into convolutional neural network (CNN) models, our approach seeks to address these limitations. We explore the potential of designing CNNs to balance improved inference times with minimal accuracy loss, creating a novel solution.

Within this context, we employ three key quantum-inspired principles:

1. Quantum-inspired pruning: reducing model size by removing unnecessary parameters, guided by quantum approximation algorithms;
2. Tensor decomposition: breaking down high-dimensional tensors into smaller components to reduce computational complexity; and
3. Annealing-based matrix factorization: optimizing matrix factorization by using annealing techniques to find efficient representations of the data.

Our work addresses the following research question: How can principles from quantum computing be used to design and optimize CNNs to reduce latency and improve inference times, while still maintaining stable accuracies across various models?

In this paper, we propose Quantum-Integrated Adaptive Networks (QIANets) – a comprehensive framework that integrates these quantum computing techniques into the DenseNet, GoogLeNet, and ResNet-18 architectures. To the best of our knowledge, this is the first attempt to: 1) apply quantum computing-inspired algorithms to these models’ architectures to reduce computational requirements and achieve efficient performance improvements, and 2) specifically target these models.

The contributions of this work include:

• QIANets: a comprehensive framework that integrates QAOA-inspired pruning, tensor decomposition and quantum annealing-inspired matrix factorization into three CNNs.

• An exploration of the trade-offs between latency, inference time, and accuracy, highlighting the effects of applying quantum principles to CNN models for real-time optimization.

Related Works

Our proposed method builds upon the ideas of model compression and quantum-inspired techniques to improve the inference times of CNNs.

2.1 Model Compression Techniques

Pruning is one of the most effective ways to accelerate CNNs. Cheng et al. (2018) provided a comprehensive review of model compression techniques for deep neural networks (DNNs), specifically focusing on parameter pruning methods: the removal of individual weights based on importance. This significantly reduces model size, while generally preserving the model’s performance.

Despite these advancements, conventional pruning techniques have limitations: 1) they can be costly when applied during the training phase, and 2) they risk prematurely removing important weights. Hou et al. (2022) proposed a novel methodology called CHEX for training-based channel pruning and regrowing of channels throughout the training process. By employing a column subset selection (CSS) formulation, CHEX allocates and reassigns channels across layers, allowing for significant model compression without requiring a fully pre-trained model.

2.2 Quantum-Inspired Techniques for CNNs

Quantum computing is currently recognized as a potential game-changer for various fields, including NLP, due to its ability to process complex data more efficiently than classical computers. Shi et al. (2022) proposed a quantum-inspired architecture for convolutional neural networks (QICNNs), using complex-valued weights to enhance the representational capacity of traditional CNNs. The authors show that their QICNNs achieve higher classification accuracy and faster convergence on benchmark datasets compared to standard CNNs. In contrast, our methodology prioritizes structural optimization for greater computational efficiency. We focus on reducing latency and improving inference times by employing various quantum-inspired techniques to decrease computational overhead in the tested CNNs.

Hu et al. (2022) set a high standard in the field by addressing the unique challenges of compressing quantum neural networks (QNNs). Their CompVQC framework leverages an alternating direction method of multipliers (ADMM) approach, achieving a remarkable reduction in circuit depth by over 2.5 times with less than 1% accuracy degradation. While their results in QNN compression are impressive, our research introduces a novel first-attempt technique that applies QAOA-inspired pruning, tensor decomposition and quantum annealing-inspired matrix factorization to classical CNNs. Our work can complement their approach, underscoring the potential of integrating quantum concepts to classical networks, and may lead to further improvements in model efficiency.

Methodology

3.1 Quantum-Inspired Pruning

Pruning is a widely applied technique for reducing CNN complexities; early studies have demonstrated the effectiveness of pruning in optimizing neural networks (LeCun et al., 1989; Hanson and Pratt, 1989; Hassibi et al., 1993). Our method implements a new approach to optimization through the Quantum Approximate Optimization Algorithm (QAOA). QAOA is designed to address problems in combinatorial optimization, which involves finding the best solution from a finite set of solutions.

Similarly, we adapt these principles by framing pruning as a probabilistic optimization problem. The goal is to identify the most important weights to retain while allowing others to approach zero. For a neural network layer with weights represented as a tensor $W \in \mathbb{R}^{C_{out} \times C_{in} \times H \times W}$, we define the importance of each weight by its absolute value: $I_{i,j} = |W_{i,j}|$.

To facilitate decision-making regarding weight retention, we normalize these importance scores with the softmax function to derive probabilities:

$$P_{i,j} = \frac{e^{I_{i,j}}}{\sum_{k,l} e^{I_{k,l}}}$$


These probabilities are then utilized in a quantum-inspired decision-making process, where weights are selected for pruning based on a threshold $\lambda$, influenced by a hyperparameter known as layer sparsity.

$$R_{i,j} = \begin{cases} 1 & \text{if } P_{i,j} \geq \lambda \\ 0 & \text{otherwise} \end{cases}$$


Here, $R_{i,j}$ serves as a binary retain mask indicating whether a weight is pruned (set to 0) or retained. The threshold $\lambda$ is calibrated to ensure that approximately $100\alpha\%$ of the weights are pruned, where $\alpha$ is the layer-sparsity hyperparameter.

When implemented, we adopt an iterative approach for pruning the model across multiple stages. Each iteration recalculates the retain mask based on updated probabilities derived from the current weights. To enhance this process, we introduce a neighboring entanglement mechanism: when a weight is pruned, its adjacent weights in the tensor may also be pruned with a specified probability Pentangle. This mechanism simulates quantum entanglement, reflecting the idea that nearby weights may exhibit correlated behavior and can be pruned collectively. For convolutional layers specifically, this sequential pruning strategy is executed over several iterations, progressively reducing the number of parameters in the model while maintaining performance integrity.
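To make the procedure concrete, the sketch below shows one pruning iteration under the definitions above (absolute-value importance, softmax probabilities, thresholding at $\lambda$, and neighboring entanglement). It is a minimal PyTorch illustration, not our released implementation; the flattened, one-dimensional notion of "adjacent" weights and the default entanglement probability are assumptions.

```python
import torch

def quantum_inspired_prune(weight: torch.Tensor, sparsity: float,
                           p_entangle: float = 0.1) -> torch.Tensor:
    """One iteration of QAOA-inspired pruning (illustrative sketch).

    sparsity: fraction alpha of weights to prune this iteration.
    p_entangle: probability that a pruned weight's neighbor is also pruned.
    """
    flat = weight.detach().flatten()
    importance = flat.abs()                    # I_{i,j} = |W_{i,j}|
    probs = torch.softmax(importance, dim=0)   # P_{i,j} via softmax
    k = int(sparsity * flat.numel())
    if k == 0:
        return weight
    # Threshold lambda: the k-th smallest probability, so that roughly
    # 100*alpha% of the weights fall at or below it and are pruned.
    lam = torch.kthvalue(probs, k).values
    retain = probs > lam                       # binary retain mask R
    # Neighboring entanglement: a pruned weight drags its (flattened)
    # right-hand neighbor to zero with probability p_entangle.
    entangled = torch.roll(~retain, shifts=1, dims=0) \
        & (torch.rand_like(flat) < p_entangle)
    retain = retain & ~entangled
    return (flat * retain).view_as(weight)
```

In the full iterative scheme, this routine would be reapplied over several stages per convolutional layer, recomputing the probabilities from the updated weights each time.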

3.2 Tensor Decomposition

Tensor decomposition further reduces the dimensionality of the weight tensor while preserving essential information for accurate predictions. This technique is inspired by Quantum Circuit Learning (QCL), where high-dimensional tensors are decomposed into lower-dimensional forms for efficient training of quantum circuits.

For a weight tensor $W \in \mathbb{R}^{C_{out} \times C_{in} \times H \times W}$, we use Singular Value Decomposition (SVD) on its flattened matrix representation $W_f \in \mathbb{R}^{C_{out} \times (C_{in} \cdot H \cdot W)}$, decomposing it as $W_f = U \Sigma V^T$.

Here, $U \in \mathbb{R}^{C_{out} \times r}$ and $V \in \mathbb{R}^{(C_{in} \cdot H \cdot W) \times r}$ have orthonormal columns, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix of singular values. The rank $r$, chosen as a hyperparameter, controls the compression level by retaining only the top $r$ singular values, leading to a reduced-rank approximation $W_f \approx U_r \Sigma_r V_r^T$. This reduces the number of parameters and computational costs during inference.

After tensor decomposition, the original weight tensor is reconstructed using the truncated matrices, reshaping the compressed weights back into their original form. This process significantly decreases the number of parameters without greatly affecting model performance. We apply this method to each convolutional layer, creating lower-rank approximations to control the model’s capacity.
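The flatten, truncate, and reshape pipeline described above can be expressed compactly with PyTorch's SVD; the sketch below is a minimal illustration, not the exact implementation used in our experiments.

```python
import torch

def svd_compress(conv_weight: torch.Tensor, rank: int) -> torch.Tensor:
    """Rank-r approximation of a conv weight (C_out, C_in, H, W)."""
    c_out = conv_weight.shape[0]
    w_flat = conv_weight.detach().reshape(c_out, -1)   # (C_out, C_in*H*W)
    U, S, Vh = torch.linalg.svd(w_flat, full_matrices=False)
    r = min(rank, S.numel())
    w_approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]  # U_r Σ_r V_r^T
    return w_approx.view_as(conv_weight)
```

Reconstructing the dense tensor, as above, keeps the layer's interface unchanged; storing the truncated factors separately would be an alternative when memory savings are the priority.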

Figure 1: An illustrative diagram showcasing the framework used for Quantum-Inspired Pruning, Tensor Decomposition, and Quantum Annealing-Inspired Matrix Factorization.

3.3 Quantum Annealing-Inspired Matrix Factorization

Quantum annealing solves optimization problems by evolving a system toward its lowest energy state. We apply this concept to factorize weight tensors, treating the factorization as an optimization problem aimed at minimizing the difference between the original weights and their factorized representation.

Given a weight matrix $W \in \mathbb{R}^{m \times n}$, we seek to factor it into two lower-dimensional matrices $W_1 \in \mathbb{R}^{m \times r}$ and $W_2 \in \mathbb{R}^{r \times n}$, where $r$ is a hyperparameter that controls the rank: $W \approx W_1 W_2$.

The objective is to minimize the reconstruction error: $\mathcal{L}(W_1, W_2) = \|W - W_1 W_2\|_F^2$

Here, $\|\cdot\|_F$ denotes the Frobenius norm, measuring the difference between the original and factorized matrices. We use an iterative optimization procedure inspired by quantum annealing to minimize this loss.

The factorization employs gradient-based optimization, initializing $W_1$ and $W_2$ randomly. We iteratively minimize the loss function using an optimizer such as L-BFGS, which is suitable for small parameter sets and non-convex landscapes. This optimization simulates quantum annealing by gradually reducing the step size, ensuring convergence to a local minimum. Once complete, the compressed weight matrix is defined as $W_c = W_1 W_2$. This compressed matrix replaces the original, reducing model complexity while maintaining performance.
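A minimal sketch of this procedure is given below, assuming a simple geometric "cooling" schedule on the L-BFGS step size; the number of outer steps and the decay factor are illustrative choices, not values from our experiments.

```python
import torch

def annealed_factorize(W: torch.Tensor, rank: int,
                       outer_steps: int = 5, decay: float = 0.5):
    """Factor W (m x n) into W1 (m x r) @ W2 (r x n) by minimizing the
    squared Frobenius reconstruction error with a shrinking step size."""
    m, n = W.shape
    W1 = torch.randn(m, rank, requires_grad=True)
    W2 = torch.randn(rank, n, requires_grad=True)
    lr = 1.0
    for _ in range(outer_steps):
        opt = torch.optim.LBFGS([W1, W2], lr=lr, max_iter=20)

        def closure():
            opt.zero_grad()
            loss = ((W - W1 @ W2) ** 2).sum()   # ||W - W1 W2||_F^2
            loss.backward()
            return loss

        opt.step(closure)
        lr *= decay   # "cooling": gradually reduce the step size
    return (W1 @ W2).detach()                   # compressed matrix W_c
```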

Experiments and Results

We applied our method to compress three CNNs: DenseNet, GoogLeNet, and ResNet-18, all on the CIFAR-10 dataset. These networks were selected for their different design structures and computational demands, such as parameter count, depth, and layer types, providing a comprehensive assessment of our method’s effectiveness across different models. We evaluated the models for image classification performance, focusing on metrics such as inference time, speedup ratio, and accuracy.

Each experiment involved 1) applying the QIANets framework to the respective model architecture and 2) evaluating the models and comparing them to their baseline counterparts. The results, including the networks’ changes before and after compression, are shown in Table 1.

Table 1: Model performance comparison before and after compression (rounded figures; latency in seconds per image)

Model        Base Accuracy    Base Latency (s/image)    New Accuracy    New Latency (s/image)    Compression Ratio
GoogLeNet    94%              0.00096                   86%             0.00083                  1:0.52
ResNet-18    93%              0.00011                   87%             0.00007                  1:0.61
DenseNet     94%              0.00050                   88%             0.00042                  1:0.56

4.1 Experimental Setup

The experiments were conducted within the cloud-based PyTorch framework, utilizing the CIFAR-10 dataset. The CIFAR-10 dataset, which consists of 60,000 32x32 RGB color images in 10 classes, with 6,000 images per class, was split into training and validation sets with an 80/20 ratio. To meet the input size requirements of the models, images were resized to 224x224 pixels. Data preprocessing included normalizing pixel values to the range [-1, 1] using the mean and standard deviation of the dataset. Moreover, data augmentation strategies (random horizontal flipping and random cropping) were applied to generate data variability and improve the models’ performance on unseen data.
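As a rough sketch, the preprocessing pipeline described above could be assembled with torchvision as follows; the 4-pixel crop padding is an assumption on our part.

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10
from torch.utils.data import random_split

transform = T.Compose([
    T.RandomCrop(32, padding=4),       # augmentation: random cropping
    T.RandomHorizontalFlip(),          # augmentation: random flipping
    T.Resize(224),                     # match the models' input size
    T.ToTensor(),
    T.Normalize((0.5, 0.5, 0.5),       # map pixel values into [-1, 1]
                (0.5, 0.5, 0.5)),
])

full_train = CIFAR10(root="./data", train=True, download=True,
                     transform=transform)
train_set, val_set = random_split(full_train, [40_000, 10_000])  # 80/20
```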

All computations were accelerated using CUDA on an NVIDIA A40 GPU via Runpod. Each model underwent training for 50 epochs utilizing the Adam optimizer, with an initial learning rate of 0.001 and weight decay of 1e-4. To run our experiments, the compute used was approximately 1,615,680 TFLOPS-seconds. To ensure consistency across models, batch sizes of 128 were used for training, while batch sizes of 256 were employed for evaluation and testing.
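For reference, per-image latency of the kind reported in Table 1 can be measured along these lines; the warm-up count and batch handling here are assumptions rather than our exact protocol.

```python
import time
import torch

@torch.no_grad()
def seconds_per_image(model, loader, device="cuda", warmup_batches=3):
    """Average inference time per image over a DataLoader."""
    model.eval().to(device)
    it = iter(loader)
    for _ in range(warmup_batches):            # warm up CUDA kernels
        model(next(it)[0].to(device))
    torch.cuda.synchronize()
    start, n = time.perf_counter(), 0
    for images, _ in loader:
        model(images.to(device))
        n += images.size(0)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n
```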

4.2 Hyperparameter Tuning

Hyperparameter tuning is performed using Optuna, a framework that automates hyperparameter search. Optuna experiments with various combinations of hyperparameters, including batch size, learning rate, and ECA (Efficient Channel Attention) kernel size, and dynamically adjusts them based on each trial. Following each trial, validation accuracy is computed to assess the effectiveness of the current configuration; this feedback guides the sampling of subsequent trials, converging toward hyperparameter settings that optimize model performance.
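A minimal Optuna objective of the kind described above might look as follows; the search ranges and the build_model / train_and_validate helpers are hypothetical stand-ins for our training code.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space over the tuned hyperparameters.
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
    eca_kernel = trial.suggest_int("eca_kernel", 3, 9, step=2)

    model = build_model(eca_kernel_size=eca_kernel)       # hypothetical
    return train_and_validate(model, lr, batch_size)      # returns val accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_params)
```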

4.3 Model-Specific Analysis

To effectively accommodate the unique architectures of each model, we made targeted adjustments to the QIANets method. These modifications were carefully designed to be minimal, ensuring that all models were trained and evaluated under consistent and fair conditions throughout the experiments.

4.3.1 GoogLeNet

GoogLeNet is a convolutional network with nine multi-scale processing Inception modules. Within our experiments, the QIANets framework targets these modules, pruning their convolutional weights with a layer sparsity of 0.1417 (14.17% of weights pruned) and employing a rank of 41 to decompose and factorize the remaining weights. ECA and Multi-Scale Fusion are applied to the outputs of the modules, integrating multi-scale attention into the parallel branches.

Performance Progression Across Epochs: The training process involved 10 trials of 10 epochs each, followed by a final trial of 50 epochs. During Trial 0 of hyperparameter optimization, the model began with a validation accuracy of 21.08% at Epoch 1 and quickly progressed to 70.18% by Epoch 10, indicating rapid learning during the first stages of training. The highest validation accuracy during tuning was 80.19%, reached in Trial 5. Ultimately, the final quantum-inspired GoogLeNet's test accuracy was 86.65% after fine-tuning, a notable improvement over the earlier 80.19%, though still below the baseline accuracy of 94.29%.

Loss Reduction: Over the 50 epochs, the model's loss steadily decreased, showing consistent improvement. It began at 2.5205 in Epoch 1 of Trial 0 and fell to 0.8256 by Epoch 50, effectively minimizing error throughout training. A key outcome of this experiment was the final 13.65% reduction in inference time, down to 0.000835 seconds per image, which underscores the efficiency of our approach compared to the baseline GoogLeNet's 0.000967 seconds.

4.3.2 DenseNet

We experimented on DenseNet – a CNN structured with 12 dense blocks featuring layer-by-layer connections. This intricate connectivity requires careful application of QAOA-inspired pruning to ensure that weight removal does not disrupt the model's feature reuse and overall flow of information within the network. To enhance channel-wise interactions, we apply ECA and Multi-Scale Fusion after the dense blocks, allowing the model to leverage both local and global feature representations effectively.

Performance Progression Across Epochs: DenseNet was trained over 10 trials of 10 epochs each, followed by a final trial of 50 epochs to fine-tune performance. During Trial 0, the model began with a validation accuracy of 9.66% at Epoch 1 and exhibited minimal improvement by Epoch 10, reaching 10.34%. In contrast, Trial 1 demonstrated significant learning progression, starting at 27.53% and achieving a remarkable 81.33% by Epoch 10. The highest validation accuracy across all trials peaked at 86.65% in Trial 1, where the model used a layer sparsity of approximately 0.3779 (roughly 38% of the weights pruned while maintaining performance). After extensive fine-tuning, the final quantum-inspired DenseNet achieved a test accuracy of 88.52%, a significant improvement over the earlier 86.65% and approaching the baseline accuracy of 94.05%.

Loss Reduction: The model demonstrated steady loss reduction throughout training, beginning at 2.3028 during Epoch 1 of Trial 0 and decreasing to 0.5606 by Epoch 10 of Trial 1, indicating effective error minimization. This consistent decline reflects the model's ability to optimize its parameters and improve performance across trials. One of the standout results of this experiment was the final 15.20% reduction in inference time, dropping to 0.001043 seconds per image, a considerable improvement over the baseline DenseNet's 0.00123 seconds.

4.3.3 ResNet-18

Lastly, ResNet-18 is a CNN characterized by its residual learning framework and shortcut connections, which facilitate the training of deep networks. Within our experiments, the QIANets framework targets the residual blocks of the model, reducing less significant weights and channels identified by ECA's straightforward 1D convolution. Feature maps are then combined by Multi-Scale Fusion and refined, highlighting essential features across different scales.

Performance Progression: Throughout the trials, there were notable fluctuations in performance. The highest validation accuracy across all trials peaked at 91.42% in Trial 4, where the model used a layer sparsity of approximately 0.3779 (roughly 38% of the weights pruned while maintaining performance). After extensive fine-tuning, the final quantum-inspired ResNet-18 reached a test accuracy of 87.11%, a significant improvement over the earlier 84.56% (the highest accuracy before final fine-tuning) and approaching the baseline accuracy of 93.11%.

Loss Reduction: The loss reduction across trials also followed a clear downward trend. In Trial 3, with a layer sparsity of 0.4805 and a rank of 10, the validation loss dropped from 1.9847 in the first epoch to 0.6321 by the tenth epoch, indicating better model convergence. One of the standout results of this experiment was the final 36.4% reduction in inference time, dropping to 0.00007 seconds per image, an improvement over the baseline ResNet-18's 0.00011 seconds per image.

Conclusion

In this paper, we introduced the QIANets framework and applied it to three prominent convolutional neural networks – DenseNet, GoogLeNet, and ResNet-18 – with the objective of reducing latency and improving inference times while maintaining minimal accuracy loss. Our experimental results showed varying degrees of success when applying quantum-inspired techniques, revealing important insights into the trade-offs between latency and accuracy across different architectures.

Limitations

While our results demonstrate the potential of QIANets and quantum-inspired principles in model compression, they also reveal several factors that influence the performance of our approach:

  1. Data Constraints: The evaluation was restricted to the relatively simple CIFAR-10 dataset, which may not reflect the diversity, complexity, and scalability challenges posed by larger, real-world datasets. Additionally, due to the computational expense of each run, the approach was only tested on a limited number of trials.
  2. Model Adaptation: The lack of adaptation across different model architectures may hinder the QIANets framework’s ability to optimize the balance between latency and accuracy. Performance in certain cases does not guarantee similar results across architectures without mandatory model-specific adjustments, which can complicate future adaptations.
  3. Hardware Limitations: This study does not consider hardware-specific limitations. Our techniques have not yet been optimized for specialized hardware, such as custom FPGAs or GPUs, which could potentially reduce latency and improve data throughput.
  4. Scope of Focus: The narrow focus on CNNs excludes newer, more advanced architectures, restricting the scope of our framework’s potential. Integrating our concept with transformers and other attention-based models could significantly expand QIANets’ applicability.

Future works should address these limitations by conducting more in-depth experiments that assess both the scalability and practical relevance of the quantum-inspired techniques.

References

  1. Su, N. M., & Crandall, D. J. (2021). The affective growth of computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9291–9300. doi:10.1109/CVPR46437.2021.00929259
  2. Anumol, C. S. (2023, November). Advancements in CNN Architectures for Computer Vision: A Comprehensive Review. In 2023 Annual International Conference on Emerging Research Areas: International Conference on Intelligent Systems (AICERA/ICIS), pp. 1–7. IEEE. doi:10.1109/AICERA.2023.9580123263
  3. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. doi:10.1109/CVPR.2016.90
  4. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. doi:10.1109/CVPR.2015.7298594
  5. Honegger, D., Oleynikova, H., & Pollefeys, M. (2014, September). Real-time and low latency embedded computer vision hardware based on a combination of FPGA and mobile CPU. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4930–4935. IEEE. doi:10.1109/IROS.2014.6943192
  6. Li, Z., Li, H., & Meng, L. (2023). Model compression for deep neural networks: A survey. Computers, 12(3), 60. doi:10.3390/computers12030060
  7. Divya, R., & Peter, J. D. (2021, November). Quantum machine learning: A comprehensive review on optimization of machine learning algorithms. In 2021 Fourth International Conference on Microelectronics, Signals & Systems (ICMSS), pp. 1–6. IEEE. doi:10.1109/ICMSS53240.2021.9532427
  8. Pandey, S., Basisth, N. J., Sachan, T., Kumari, N., & Pakray, P. (2023). Quantum machine learning for natural language processing application. Physica A: Statistical Mechanics and its Applications, 627, 129123. doi:10.1016/j.physa.2022.129123
  9. Francy, S., & Singh, R. (2024). Edge AI: Evaluation of Model Compression Techniques for Convolutional Neural Networks. arXiv preprint arXiv:2409.02134. arXiv:2409.02134
  10. Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282. In Proceedings of the IEEE Signal Processing Magazine, 35(1), 126–136. doi:10.1109/MSP.2017.2765695
  11. Hou, Z., Qin, M., Sun, F., Ma, X., Yuan, K., Xu, Y., ... & Kung, S. Y. (2022). Chex: Channel exploration for CNN model compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12287–12298. doi:10.1109/CVPR52688.2022.01204
  12. Han, S., Pool, J., Tran, J., & Dally, W. (2015). Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28.
  13. Shi, S., Wang, Z., Cui, G., Wang, S., Shang, R., Li, W., ... & Gu, Y. (2022). Quantum-inspired complex convolutional neural networks. Applied Intelligence, 52(15), 17912–17921. doi:10.1007/s10489-022-03525-3
  14. Hu, Z., Dong, P., Wang, Z., Lin, Y., Wang, Y., & Jiang, W. (2022, October). Quantum neural network compression. In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, pp. 1–9. doi:10.1145/3508352.3549415
  15. Tomut, A., Jahromi, S. S., Singh, S., Ishtiaq, F., Muñoz, C., Bajaj, P. S., ... & Orus, R. (2024). CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks. arXiv preprint arXiv:2401.14109. arXiv:2401.14109
  16. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551. doi:10.1162/neco.1989.1.4.541
  17. Hanson, S., & Pratt, L. (1988). Comparing biases for minimal network construction with back-propagation. Advances in Neural Information Processing Systems, 1.
  18. Hassibi, B., Stork, D. G., & Wolff, G. J. (1993, March). Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299. IEEE. doi:10.1109/ICNN.1993.298572
