Edited by: Po Yang, The University of Sheffield, United Kingdom
Reviewed by: Jun Liu, Weifang University of Science and Technology, China
Guoxiong Zhou, Central South University Forestry and Technology, China
Jakub Nalepa, Silesian University of Technology, Poland
*Correspondence: Qiudong Yu,
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Grapes are prone to various diseases throughout their growth cycle, and failure to promptly control these diseases can result in reduced production and even complete crop failure. Effective disease control is therefore essential for maximizing grape yield, and accurate disease identification plays a crucial role in this process. In this paper, we propose a real-time and lightweight detection model, Fusion Transformer YOLO (FTR-YOLO), for the detection of four grape diseases. The dataset primarily comprises RGB images acquired from plantations situated in North China.
Firstly, we introduce a lightweight high-performance VoVNet, which utilizes ghost convolutions and a learnable downsampling layer. This backbone is further improved by integrating effective squeeze and excitation blocks and residual connections into the OSA module. These enhancements improve detection accuracy while keeping the network lightweight. Secondly, an improved dual-flow PAN+FPN structure with a real-time Transformer is adopted in the neck component, by incorporating 2D position embedding and a single-scale Transformer encoder into the last feature map. This modification enables real-time performance and improved accuracy in detecting small targets. Finally, we adopt a Decoupled Head based on the improved Task-Aligned Predictor in the head component, which balances accuracy and speed.
Experimental results demonstrate that FTR-YOLO achieves high performance across various evaluation metrics, with a mean Average Precision (mAP) of 90.67%, 44 Frames Per Second (FPS), and a parameter size of 24.5M.
The FTR-YOLO presented in this paper provides a real-time and lightweight solution for the detection of grape diseases. This model effectively assists farmers in detecting grape diseases.
China’s extensive agricultural heritage, spanning over 2000 years, encompasses grape cultivation. Not only is China a significant grape-producing nation, but it also stands as the largest exporter of grapes worldwide. Grapes are not only consumed directly but are also processed into various products such as grape juice, raisins, wine, and other valuable commodities, thus holding substantial commercial value (
The development of computer vision and machine learning technology provides a new solution for real-time automatic detection of crop diseases (
Deep learning can automatically learn the hierarchical features of different disease regions without manual design of feature extractors and classifiers, and offers excellent generalization ability and robustness. The detection of crop diseases through CNNs has become a new hotspot in intelligent agriculture research.
Recent applications of machine learning and deep learning in crop disease detection are summarized below. Deep learning, especially the CNN, has also made contributions to grape disease detection.
Comparison of the advantages and disadvantages of different methods.
Method | Advantage | Disadvantage |
---|---|---|
Machine learning | Less data and computing resources. | Difficult to handle complex problems. |
Deep learning | The model can automatically learn image feature representations. | More data and computing resources. |
Deep learning | Higher accuracy and generalization hold significant practical value. | Additional data, annotations, and computing resources are necessary. |
There are also several challenges in grape disease detection: (1) grape fruits and inflorescences are small and dense, making the affected area, which can be very small, difficult to detect; (2) photos taken in natural scenes are susceptible to external interference; (3) the model needs to balance detection accuracy with the lightweight requirements of deployment and real-time performance. To address these challenges, this paper proposes a real-time detection model, Fusion Transformer YOLO (FTR-YOLO), for grape diseases. The main contributions of this paper are summarized as follows:
Regarding the limited range of disease types detected by other models and their reliance on non-natural environments, we collected a dataset of four grape diseases (anthracnose, grapevine white rot, gray mold, and powdery mildew) in natural environments, covering different parts such as leaves, fruits, and inflorescences. The dataset primarily comprises RGB images acquired from plantations situated in North China.
In the backbone, we integrate a learnable downsampling (LDS) layer, effective squeeze and excitation (eSE) blocks, and residual connections into VoVNet, effectively improving the network’s ability to extract feature information. In the neck component, an improved real-time Transformer with two-dimensional (2D) position embedding and a single-scale Transformer encoder (SSTE) is applied to the last feature map for accurate detection of small targets. In the head component, a Decoupled Head based on the improved Task-Aligned Predictor (ITAP) is adopted to optimize detection accuracy.
To address the challenges of deploying applications built on models with large capacity and slow inference speed, we replace convolutions with ghost modules, abandon the Transformer decoder, and adopt the more efficient SSTE together with the shallower VoVNet-39, ensuring a lightweight model and fast detection.
The rest of the article is organized as follows:
To build the grape disease detection dataset, a smartphone was used to collect photos in local orchards. The photos were taken in different time periods, weather conditions, and scenes. A labeling tool was used to annotate the images: regions of interest were marked manually with rectangles, and the configuration files were then generated automatically.
Data augmentation is employed to expand the number of images in the training dataset. The methods include random flipping, Gaussian blur, affine transformation, random cropping, padding, and so forth. During training, randomly selected images are augmented with one or several of these operations, as sketched below.
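As a concrete illustration, the following is a minimal, image-level sketch of such a pipeline built with torchvision; the probabilities, kernel sizes, and ranges are illustrative assumptions rather than the exact values used in our experiments, and in detection training the bounding boxes must be transformed together with the image.

```python
import torch
from torchvision import transforms

# Illustrative augmentation pipeline (assumed parameters): random flipping,
# Gaussian blur, affine transformation, and random cropping with resizing
# (filling) back to the 640 x 640 training resolution.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0))], p=0.3),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    transforms.RandomResizedCrop(size=640, scale=(0.6, 1.0)),
])

image = torch.rand(3, 640, 640)  # stand-in for one RGB orchard photo
augmented = augment(image)
```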
The number of samples for each category is shown in
The number of samples for each disease type.
Disease | Sample size | Number of labeled samples (bounding box) | Percent of bounding box samples |
---|---|---|---|
Anthracnose | 1200 | 4587 | 20.53% |
White rot | 1200 | 6025 | 26.97% |
Gray mold | 1200 | 5160 | 23.09% |
Powdery mildew | 1200 | 6571 | 29.41% |
Total | 4800 | 22343 | 100% |
The overall structure of FTR-YOLO is shown in
The architecture of FTR-YOLO.
In backbone component, a lightweight high-performance VoVnet (LH-VoVNet) (
One of the challenges with DenseNet (
The architecture of DenseNet and VoVNet.
At present, in common networks, downsampling of feature maps is usually performed by the first Conv. of each stage.
Two different methods of downsampling.
To solve this problem, the LDS layer is adopted: in Path A, downsampling is moved from the 1 × 1 Conv. to the following 3 × 3 Conv., while the identity branch (Path B) is downsampled by an added avg-pool. This avoids the information loss caused by combining a 1 × 1 Conv. with a stride. Details are shown in
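A minimal PyTorch sketch of the LDS layer under this description is given below; the BN/ReLU placement, channel widths, and the additive merge of the two paths are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LDSLayer(nn.Module):
    """Learnable downsampling: Path A keeps the 1x1 Conv at stride 1 and lets
    the following 3x3 Conv do the stride-2 downsampling; the identity Path B
    is downsampled by an added average pooling layer."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.path_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),  # 1x1, no stride
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1, bias=False),  # 3x3 downsamples
            nn.BatchNorm2d(out_ch),
        )
        self.path_b = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),              # identity-branch downsampling
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.path_a(x) + self.path_b(x))

x = torch.rand(1, 128, 80, 80)
print(LDSLayer(128, 256)(x).shape)  # torch.Size([1, 256, 40, 40])
```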
The pivotal element of the VoVnet lies in the OSA module as described in
The core idea of the SE block is to learn the feature weights according to the loss through the network ( Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the SE block first squeezes the spatial information into a channel descriptor via global average pooling $F_{gap}(X)$ and then models channel interdependencies with two fully connected (FC) layers:

$$A_{SE}(X) = \sigma\left(W_{2}\,\delta\left(W_{1}\,F_{gap}(X)\right)\right)$$

where $W_{1} \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_{2} \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weights of the two FC layers, $\delta$ is the ReLU activation, and $\sigma$ is the sigmoid function.

In the SE block, to avoid the computational burden of such a large model, the reduction ratio $r$ shrinks the channel dimension from $C$ to $C/r$ in the first FC layer. This dimension reduction, however, causes a loss of channel information.

Therefore, we adopt eSE, which uses only one FC layer with $C$ channels in place of the two reduced FC layers, thereby preserving the channel dimension:

$$A_{eSE}(X) = \sigma\left(W_{C}\,F_{gap}(X)\right), \qquad X_{refine} = A_{eSE}(X) \otimes X$$

where $W_{C}$ is the weight of the single FC layer and $\otimes$ denotes channel-wise multiplication.
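A minimal sketch of the eSE block implied by the formula above, with the single C-channel FC layer implemented as a 1 × 1 convolution (a common choice, assumed here):

```python
import torch
import torch.nn as nn

class eSEBlock(nn.Module):
    """Effective squeeze-and-excitation: one FC layer with C channels, so the
    channel dimension is never reduced and no channel information is lost."""
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # F_gap: global average pooling
        self.fc = nn.Conv2d(channels, channels, 1)  # W_C: single FC layer as a 1x1 conv
        self.gate = nn.Sigmoid()                    # sigma

    def forward(self, x):
        attention = self.gate(self.fc(self.gap(x)))  # A_eSE(X)
        return x * attention                         # X_refine: channel-wise reweighting

x = torch.rand(2, 256, 40, 40)
print(eSEBlock(256)(x).shape)  # shape is unchanged: torch.Size([2, 256, 40, 40])
```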
It can be seen from
To solve this problem, this paper adopts the Ghost Module, which can generate a large number of feature maps with cheap operations. This method reduces the computation and parameter volume while preserving the performance of the algorithm.
In the feature maps extracted by mainstream deep neural networks, rich and even redundant information usually ensures a comprehensive understanding of the input data. These redundant maps are called ghost maps.
The ghost module consists of two parts: one part is the feature map generated by an ordinary Conv., and the other part is the ghost maps generated by a simple linear operation $\Phi$. Assume the input feature map has size $h \times w \times c$, the output feature map has size $h' \times w' \times n$, the convolution kernel size is $k \times k$, each intrinsic feature map yields $s$ maps in total ($s-1$ of them ghost maps), and the kernel size of the linear operation is $d \times d$. The theoretical speed-up ratio of the ghost module over ordinary convolution is then

$$r_{s} = \frac{n \cdot h' \cdot w' \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h' \cdot w' \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h' \cdot w' \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s$$

where the numerator is the complexity of ordinary convolution and the denominator is the complexity of the ghost module.
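A minimal sketch of a ghost module consistent with this formulation is shown below; following the original GhostNet design, the cheap linear operation $\Phi$ is realized as a depthwise convolution, and the ratio s = 2 and kernel sizes are assumed defaults.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2, dw_kernel: int = 3):
        super().__init__()
        init_ch = out_ch // ratio                    # n/s intrinsic feature maps
        self.primary = nn.Sequential(                # ordinary convolution part
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(                  # cheap linear operation (depthwise conv)
            nn.Conv2d(init_ch, init_ch * (ratio - 1), dw_kernel,
                      padding=dw_kernel // 2, groups=init_ch, bias=False),
            nn.BatchNorm2d(init_ch * (ratio - 1)),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        intrinsic = self.primary(x)
        ghosts = self.cheap(intrinsic)               # (s - 1) ghost maps per intrinsic map
        return torch.cat([intrinsic, ghosts], dim=1)

print(GhostModule(64, 128)(torch.rand(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```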
Finally, the 3 × 3 Conv. in the RE-OSA module is replaced with the ghost module, yielding the GC-RE-OSA module (
The structure of GC-RE-OSA module.
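Combining the pieces above, a sketch of the GC-RE-OSA module might look as follows; it reuses the GhostModule and eSEBlock sketches from earlier, and the number of layers (five, the original OSA default) and the choice to apply the residual connection only when channel counts match are assumptions.

```python
import torch
import torch.nn as nn

class GCREOSA(nn.Module):
    """One-shot aggregation with ghost convolutions (GC), an eSE gate, and a
    residual (RE) connection. GhostModule and eSEBlock are the sketches above."""
    def __init__(self, in_ch: int, stage_ch: int, out_ch: int, num_layers: int = 5):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(GhostModule(ch, stage_ch))
            ch = stage_ch
        # One-shot aggregation: concatenate the input and every intermediate output once.
        self.aggregate = nn.Conv2d(in_ch + num_layers * stage_ch, out_ch, 1)
        self.ese = eSEBlock(out_ch)
        self.use_residual = in_ch == out_ch

    def forward(self, x):
        features, h = [x], x
        for layer in self.layers:
            h = layer(h)
            features.append(h)
        out = self.ese(self.aggregate(torch.cat(features, dim=1)))
        return out + x if self.use_residual else out  # residual connection (RE)

print(GCREOSA(128, 64, 128)(torch.rand(1, 128, 80, 80)).shape)
```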
The specific structure of LH-VoVnet can be found in
The specific structure of LH-VoVnet.
Type | Output stride | Stage | Output channel |
---|---|---|---|
Stem | 2 | 3 × 3 Ghost-conv., 64, stride = 2 | 64 |
Stage 1 | 4 | LDS layer ×1, GC-RE-OSA ×1 | 128 |
Stage 2 | 8 | LDS layer ×1, GC-RE-OSA ×1 | 256 |
Stage 3 | 16 | LDS layer ×1, GC-RE-OSA ×2 | 512 |
Stage 4 | 32 | LDS layer ×1, GC-RE-OSA ×2 | 1024 |
Indeed, the Transformer model relies on a global attention mechanism that requires substantial computational resources for optimal performance (
Within the neck component, we utilize the current optimal dual-flow PAN + FPN structure and enhance it through integration with the GC-RE-OSA module introduced in this paper.
To enhance detection accuracy, an improved global attention mechanism based on the Vision Transformer (ViT) is introduced. This design takes into consideration that some grape diseases may look similar, while others occupy only limited areas. Incorporating this improved global attention mechanism allows the different grape diseases to be detected more accurately.
The current common detection transformer (DETR) algorithms extract the last three layers of feature maps (C3, C4, and C5) from the backbone network as the input. However, this approach usually has two problems:
Previous DETRs, such as deformable DETR (
Compared to the shallower C3 and C4 features, the deepest C5 feature map contains deeper, higher-level, and richer semantic features. These semantic features are more useful for distinguishing different objects and better suited to the Transformer. Shallow features contribute little owing to their lack of semantic content.
To address these issues, we select only the C5 feature map output by the backbone network as the Transformer input. To retain key feature information as much as possible, the simple flattening of the feature map into a vector is replaced with 2D encoding in the Position Embedding module (
The Multi-Head Self-Attention (MHSA) aggregation in the Transformer combines input elements without differentiating their positions; thus, the Transformer is permutation-invariant. To alleviate this issue, we embed spatial information into the feature map by adding 2D position encoding to the final layer feature map. Specifically, the original sine and cosine positional encodings in Position Embedding are extended to column and row positional encodings, which are finally concatenated.
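A sketch of such a 2D position embedding is given below: sine and cosine encodings are built separately for rows and columns and concatenated channel-wise. The frequency scheme follows the standard sinusoidal encoding, and the even split of channels between the row and column parts is an assumption.

```python
import math
import torch

def position_embedding_2d(h: int, w: int, dim: int) -> torch.Tensor:
    """Return an (h, w, dim) embedding: half the channels encode the row index,
    half encode the column index (dim must be divisible by 4)."""
    half = dim // 2
    freq = torch.exp(torch.arange(0, half, 2) * (-math.log(10000.0) / half))
    ys = torch.arange(h, dtype=torch.float32).unsqueeze(1) * freq  # (h, half/2)
    xs = torch.arange(w, dtype=torch.float32).unsqueeze(1) * freq  # (w, half/2)
    row = torch.cat([ys.sin(), ys.cos()], dim=1)   # row positional encoding
    col = torch.cat([xs.sin(), xs.cos()], dim=1)   # column positional encoding
    row = row.unsqueeze(1).expand(h, w, half)
    col = col.unsqueeze(0).expand(h, w, half)
    return torch.cat([row, col], dim=-1)           # concatenate row and column parts

print(position_embedding_2d(20, 20, 256).shape)    # torch.Size([20, 20, 256])
```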
After the feature map is processed by the 2D position embedding, we use a single-scale Transformer encoder (SSTE), which contains only one encoder layer (MHSA + feed-forward network), to process the Q, K, and V outputs at three scales. Note that the three scales share one SSTE; through this shared operation, the information of the three scales can interact to some extent. Finally, the processing results are concatenated into a vector, which is then reshaped back into a 2D feature map, denoted F5. In the neck, C3, C4, and F5 are sent to the dual-flow PAN + FPN for multi-scale feature fusion. See
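A minimal single-scale encoder sketch under this description follows; it shows only the C5 to F5 path for one scale (the sharing across three scales is omitted), and the head count and FFN width are assumptions.

```python
import torch
import torch.nn as nn

class SSTE(nn.Module):
    """Single-scale Transformer encoder: exactly one encoder layer
    (MHSA + feed-forward network) applied to the flattened feature map."""
    def __init__(self, dim: int = 256, heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=ffn_dim, batch_first=True)

    def forward(self, c5: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        b, c, h, w = c5.shape
        tokens = c5.flatten(2).transpose(1, 2)        # (B, H*W, C)
        tokens = tokens + pos.reshape(1, h * w, c)    # add the 2D position embedding
        out = self.layer(tokens)                      # one MHSA + FFN layer
        return out.transpose(1, 2).reshape(b, c, h, w)  # back to a 2D map: F5

c5 = torch.rand(1, 256, 20, 20)
pos = torch.rand(20, 20, 256)  # e.g., the 2D sine-cosine embedding sketched above
print(SSTE()(c5, pos).shape)   # torch.Size([1, 256, 20, 20])
```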
In order to achieve better information fusion of the three-layer feature maps (C3, C4, and F5), our enhanced neck implements a dual-stream PAN + FPN architecture, which is featured in the latest YOLO series. In addition to this, we have introduced GC-RE-OSA module to ensure faster detection speed while preserving accuracy. A comparison between YOLOv5 (
Two different neck structures.
For the head component, we employ a Decoupled Head that performs the classification and regression tasks separately via two distinct convolutional branches. Furthermore, our architecture includes the ITAP within each branch, which enhances the interaction between the two tasks.
Object detection commonly faces a task conflict between classification and localization. While the decoupled head has been successfully applied to SOTA YOLO models in YOLOX (
To address this issue, we drew inspiration from the TAP in TOOD (
ITAP decoupled head structures.
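For illustration, a minimal sketch of a decoupled head is given below: classification and regression run through separate convolutional branches. The ITAP task-interaction layers are omitted, and the DFL-style distribution output (reg_max bins per box side) is an assumption based on the DFL loss used later.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int, reg_max: int = 16):
        super().__init__()
        def branch(out_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
                nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
                nn.Conv2d(in_ch, out_ch, 1))
        self.cls_branch = branch(num_classes)   # classification scores
        self.reg_branch = branch(4 * reg_max)   # box distribution (4 sides x reg_max bins)

    def forward(self, feat):
        return self.cls_branch(feat), self.reg_branch(feat)

cls_out, reg_out = DecoupledHead(256, num_classes=4)(torch.rand(1, 256, 40, 40))
print(cls_out.shape, reg_out.shape)  # four disease classes; 64 regression channels
```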
The loss calculation in our study employs a label assignment strategy. SimOTA is employed in YOLOX, v6, and v7 to enhance their performance, while task alignment learning (TAL), proposed in TOOD, is used in YOLOv8. This strategy selects positive samples based on the weighted scores of the classification and regression branches within the loss function. For the classification branch, we utilize the varifocal loss (VFL) (
VFL utilizes the target score to assign weight to the loss of positive samples. This implementation significantly amplifies the impact of positive samples with high IoU on the loss function. Consequently, the model prioritizes high-quality samples during the training phase while de-emphasizing the low-quality ones. Similarly, both approaches utilize IoU-aware classification score (IACS) as the target for prediction. This enables effective learning of a combined representation that includes both classification score and localization quality estimation. By employing DFL to tackle the uncertainty associated with bounding boxes, the network gains the ability to swiftly concentrate on the distribution of neighboring regions surrounding the target location. See
The VFL is formulated as

$$\mathrm{VFL}(p, q) = \begin{cases} -q\left(q \log p + (1-q)\log(1-p)\right), & q > 0 \\ -\alpha p^{\gamma} \log(1-p), & q = 0 \end{cases}$$

where $p$ is the predicted IACS, $q$ is the target score (for a positive sample, the IoU between the predicted box and its ground-truth box; for a negative sample, 0), $\alpha$ is the weighting factor of the negative samples, and $\gamma$ is the focusing parameter.
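A compact sketch of this loss is given below; the defaults α = 0.75 and γ = 2.0 follow the original VFL paper and are assumptions here.

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits: torch.Tensor, target_score: torch.Tensor,
                   alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """Positives (q > 0) are weighted by the IoU-aware target score q; negatives
    are down-weighted focally by alpha * p^gamma, matching the formula above."""
    p = pred_logits.sigmoid()
    q = target_score
    weight = torch.where(q > 0, q, alpha * p.pow(gamma))
    bce = F.binary_cross_entropy_with_logits(pred_logits, q, reduction="none")
    return (weight * bce).sum()

logits = torch.randn(8, 4)    # 8 anchors, 4 disease classes
targets = torch.zeros(8, 4)
targets[0, 1] = 0.83          # one positive sample with IoU 0.83
print(varifocal_loss(logits, targets))
```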
The experimental hardware environment is configured with an Intel i7-13700 CPU, 32GB RAM, and a GeForce RTX 3090 GPU. The operating system is Windows 10 Professional, the programming language is Python 3.8, and the acceleration environment is CUDA 11.1 and cuDNN 8.2.0. The training parameters used in the experiments are shown in
The implementation details of training parameters.
Parameter | Value | Parameter | Value |
---|---|---|---|
Optimizer | AdamW | Weight decay | 0.0005 |
Learning rate | 0.001 | Momentum | 0.937 |
Batch size | 8 | warmup steps | 300 |
Image size | 640*640 | Epochs | 200 |
NMS threshold | 0.7 | EMA decay | 0.9998 |
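In PyTorch, the core optimizer settings from the table translate roughly as below; mapping the listed momentum of 0.937 to AdamW's β1 is an assumption, and the warmup and EMA schedules are omitted.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # stand-in for the FTR-YOLO network
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                        # learning rate 0.001
    betas=(0.937, 0.999),           # beta1 taken as the listed momentum (assumed)
    weight_decay=5e-4)              # weight decay 0.0005
```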
The improved network is composed of the backbone, neck, and head, so the influence of each part's improvement on model performance is verified separately.
In this paper, the LH-VoVNet is verified through experiments. The improvements include: (1) the LDS layer is used for downsampling; (2) the eSE block and residual connections are added to the OSA module (RE-OSA); (3) the Conv. is replaced with the Ghost Module to further lighten the network. The results of the ablation study are shown in
The results of the ablation study of backbone components.
Methods | mAP@0.5 | Params(M) | FPS |
---|---|---|---|
VoVnet | 84.62 | 49.0 | 38 |
+LDS layer | 85.68 | 49.4 | 37 |
+RE-OSA module | 86.04 | 53.5 | 24 |
+Ghost-conv. | 84.93 | **18.3** | **68** |
LH-VoVNet | **86.79** | 24.7 | 56 |
Bold values represent the optimal values.
On the basis of VoVNet, adding the LDS layer/RE-OSA module improves accuracy by 1.06%/1.42% mAP, respectively. Replacing ordinary convolutions with Ghost-conv. greatly reduces the number of parameters (−62.7%), significantly improves the FPS (+78.9%), and slightly improves detection performance (+0.31%). Finally, integrating all three components yields the optimal mAP of 86.79% (+2.17%), with 24.7M Params (−50.1%) and 56 FPS (+47.4%), achieving a lightweight, real-time backbone.
To verify the effectiveness of the proposed neck, we evaluate the indicators of the set of variants designed in
The experimental results are shown in
The results of the ablation study of neck components.
Methods | mAP@0.5 | Params(M) | Latency(ms) | FPS |
---|---|---|---|---|
YOLOv5 neck | 86.79 | 24.7 | 52.5 | 56 |
+Real-time Transformer | 88.20 | 25.8 | 77.3 | 46 |
+GC-RE-OSA module | 87.22 | | | |
Ours neck | **88.85** | 22.5 | 56.3 | 49 |
Bold values represent the optimal values.
To verify the effectiveness of the proposed head, we evaluate the indicators of the set of variants designed in
The results of the ablation study of head & loss components.
Methods | mAP@0.5 | Params(M) | Latency(ms) | FPS |
---|---|---|---|---|
YOLOv5 head | 88.85 | **22.5** | **56.3** | **49** |
+ITAP decoupled head | 89.46 | 23.9 | 60.0 | 45 |
+SimOTA | 89.12 | 23.0 | 57.1 | 48 |
+TAL | 89.91 | 23.1 | 57.3 | 48 |
Ours head | **90.67** | 24.5 | 61.5 | 44 |
Bold values represent the optimal values.
On the basis of the YOLOv5 head, adding the ITAP Decoupled Head delivers a 0.61% AP improvement, while increasing the number of parameters by 6.2% and the latency by 6.6%, and decreasing the FPS by 8.2%. This indicates that the improved head has minimal impact on parameter count and computational speed while enhancing detection accuracy. Adding SimOTA delivers a 0.27% AP improvement, with the parameters/latency/FPS fluctuating slightly by +2.2%/+1.4%/−2.0%. Adding TAL delivers a 1.06% AP improvement, with the parameters/latency/FPS fluctuating slightly by +2.7%/+1.8%/−2.0%. Comparing the label assignments of SimOTA and TAL, TAL exhibits superior performance and is therefore the choice adopted in this paper. Finally, we adopt the hybrid ITAP Decoupled Head + TAL, yielding an optimized mAP of 90.67% (+1.82%). The model's parameters and latency increase to 24.5M (+8.9%) and 61.5 ms (+9.2%), respectively, while the FPS decreases to 44 (−10.2%).
The comparison results of different methods.
Method | Size | Params(M) | AP-1* | AP-2* | AP-3* | AP-4* | mAP@0.5 | FPS | p value |
---|---|---|---|---|---|---|---|---|---|
Yolo V5 | 640*640 | 46.3 | 87.42 | 76.03 | 85.29 | 88.18 | 84.23 | 40 | < 0.01 |
Yolo V6 | 640*640 | 59.0 | 90.70 | 88.59 | 80.37 | 90.14 | 87.54 | 37 | < 0.01 |
Yolo V7 | 640*640 | 36.6 | 89.51 | 90.44 | 84.23 | 91.26 | 88.86 | 41 | < 0.01 |
Yolo V8 | 640*640 | 43.3 | 89.83 | 88.47 | 85.30 | 92.08 | 88.92 | **44** | < 0.01 |
PP-YOLOE | 640*640 | 52.2 | 88.15 | 78.59 | 84.42 | 91.84 | 85.75 | 41 | < 0.01 |
DINO-DETR | 800*1333 | 47.4 | | 90.58 | | | **91.12** | 2 | —— |
FTR-YOLO | 640*640 | **24.5** | 90.73 | | 88.54 | 92.74 | 90.67 | **44** | < 0.01 |
*AP-1 to AP-4 denote the AP for anthracnose, white rot, gray mold, and powdery mildew, respectively.
Bold values represent the optimal values.
Compared to the real-time detectors YOLOv5/YOLOv6/YOLOv7/YOLOv8/PP-YOLOE, FTR-YOLO improves accuracy by 6.44%/3.13%/1.81%/1.75%/4.92% mAP, increases FPS by 10.0%/18.9%/7.3%/0.0%/7.3%, and reduces the number of parameters by 47.1%/58.5%/33.1%/43.4%/53.1%. Even among the AP metrics for the four categories, FTR-YOLO consistently demonstrates the best performance. Additionally, the differences in AP values among the four disease categories are relatively small, indicating that FTR-YOLO exhibits good robustness. This demonstrates the superior performance of FTR-YOLO compared to state-of-the-art YOLO detectors in terms of accuracy, speed, and model size.
In order to determine the statistical significance of the differences between the algorithms, we performed four independent repeated experiments for each algorithm. A t-test was then performed on the results; the p values (< 0.01) indicate that the differences are statistically significant.
Compared to DINO-DETR, the number of parameters/mAP/FPS experience a fluctuation by −48.3%/−0.45%/+2100.0%. This observation highlights that, while DINO achieves a slightly higher mAP of 0.45% compared to FTR-YOLO, it fails to meet real-time requirements due to its significantly lower FPS (2). Furthermore, there is no discernible advantage in terms of model lightweight.
Different disease types, periods, and locations result in different characteristics and sizes. The improved network proposed in this paper effectively enhances detection accuracy in small-object scenarios. To verify small-object detection performance, the test dataset is divided into five groups based on the size of the diseased area: 0%–10%, 10%–20%, 20%–40%, 40%–60%, and 60%–90%, labeled XS, S, M, L, and XL, respectively. The detection accuracy of six common algorithms and FTR-YOLO is compared across the five size groups; a sketch of the grouping rule follows.
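A sketch of the grouping rule, assuming the lesion size is measured as the fraction of the image area covered by its bounding box:

```python
def size_group(box_area: float, image_area: float) -> str:
    """Map a lesion to XS/S/M/L/XL by the fraction of the image it covers."""
    ratio = box_area / image_area
    if ratio <= 0.10:
        return "XS"   # 0%-10%
    elif ratio <= 0.20:
        return "S"    # 10%-20%
    elif ratio <= 0.40:
        return "M"    # 20%-40%
    elif ratio <= 0.60:
        return "L"    # 40%-60%
    else:
        return "XL"   # 60%-90% (the largest group in this study)

print(size_group(box_area=50 * 60, image_area=640 * 640))  # "XS"
```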
As shown in
Object size sensitivity analysis.
Batch Random Resize is applied to a batch of images, which helps increase the diversity and randomness of the data. By introducing such variations during training, the model becomes more robust and generalizes better to unseen examples. This technique contributes to the overall performance and generalization ability of the model in tasks such as object detection and image classification. In our experiments, each batch was randomly resized to one of the following sizes: [320, 384, 448, 480, 512, 544, 576, 640, 672, 704, 736, 768].
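A minimal sketch of batch random resize follows; bilinear interpolation and square targets are assumptions, and in detection training the box coordinates are rescaled by the same factor.

```python
import random
import torch
import torch.nn.functional as F

SIZES = [320, 384, 448, 480, 512, 544, 576, 640, 672, 704, 736, 768]

def batch_random_resize(images: torch.Tensor) -> torch.Tensor:
    """Resize a whole (B, C, H, W) batch to one randomly chosen target size."""
    s = random.choice(SIZES)
    return F.interpolate(images, size=(s, s), mode="bilinear", align_corners=False)

batch = torch.rand(8, 3, 640, 640)
print(batch_random_resize(batch).shape)  # e.g. torch.Size([8, 3, 512, 512])
```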
To further validate the detection performance on images of varying sizes, we categorized the dataset into three groups based on different sizes: (1) small size, less than or equal to 480; (2) medium size, ranging from 480 to 768; (3) large size, greater than 768.
Image size sensitivity analysis.
The detection accuracy among samples of different sizes does not show significant variation, as illustrated in
Based on the comparative evaluation in
Because VoVNet-39 has fewer layers, lightweight ghost modules replace ordinary convolutions, and the real-time Transformer consists only of a 2D position embedding and a single-scale Transformer encoder with no decoder, FTR-YOLO achieves FPS comparable to YOLOv8 while delivering optimal results (
On the other hand, DINO-DETR, with its multi-scale Transformer encoder and decoder, possesses more input feature maps and layers, resulting in better performance for object detection. It outperforms FTR-YOLO in specific metrics such as mAP in
The precision–recall curves of each disease are provided in
The p–r curve of FTR-YOLO.
The detection results for the four grape diseases are shown in
The detection results of FTR-YOLO.
The detection results of different methods.
The experimental results in
In summary, the FTR-YOLO model proposed in this paper achieves accurate, real-time, and lightweight intelligent detection of four common grape diseases in natural environments. The model incorporates several improvements in its components. In the backbone, the LH-VoVNet is introduced, which includes the LDS layer and Ghost-conv.; additionally, eSE blocks and residual connections are added to the OSA module (GC-RE-OSA module). Experimental results presented in
In this paper, we propose a real-time and lightweight detection model, Fusion Transformer YOLO, for grape disease detection. In the backbone, we integrate the GC-RE-OSA module into VoVNet, effectively improving the network's ability to extract feature information while keeping the network lightweight. In the neck component, an improved real-time Transformer with 2D position embedding and the SSTE is applied to the last feature map for accurate detection of small targets in natural environments. In the head component, the Decoupled Head based on the ITAP is adopted to optimize the detection strategy. Our proposed FTR-YOLO achieves 24.5M Params and 90.67% mAP@0.5 at 44 FPS, outperforming YOLOv5-v8 and PP-YOLOE. Although FTR-YOLO uses a real-time Transformer to improve performance, it still falls slightly behind DETR in accuracy due to DETR's multi-scale, multi-layer global Transformer architecture.
Future studies will explore the fusion of CNN and Transformer models, as well as the integration of multimodal features, to further enhance the model's performance. Additionally, while this paper focuses on grape disease detection, the FTR-YOLO algorithm could, in principle, achieve good performance when retrained on other datasets, and can be applied to tasks such as detecting plant traits and pest diseases in other plants.
The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.
YL: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing. QY: Data curation, Funding acquisition, Writing – review & editing. SG: Supervision, Validation, Visualization, Writing – review & editing.
The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research is supported by the National Natural Science Foundation of China under Grant No. ZZG0011806; Scientific research Project of Tianjin Science and Technology Commission under Grant No. 2022KJ108 and 2022KJ110; Tianjin University of Technology and Education Key Talent Project under Grant No. KYQD202104 and KYQD202106.
We are grateful for the reviewers’ hard work and constructive comments, which allowed us to improve the quality of this manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.