Edited by: Michele Pisante, University of Teramo, Italy
Reviewed by: Alexandr Muterko, Russian Academy of Sciences, Russia; Pouria Sadeghi-Tehran, Rothamsted Research, United Kingdom
This article was submitted to Technical Advances in Plant Science, a section of the journal Frontiers in Plant Science
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Wheat blast is a threat to global wheat production, and limited blast-resistant cultivars are available. Current estimations of wheat spike blast severity rely on human assessments, but this technique has limitations. Reliable visual disease estimations paired with Red Green Blue (RGB) images of wheat spike blast can be used to train deep convolutional neural networks (CNN) for disease severity (DS) classification. Inter-rater agreement analysis was used to measure the reliability of the rater who collected and classified data obtained under controlled conditions. We then trained CNN models to classify wheat spike blast severity. Inter-rater agreement analysis showed high accuracy and low bias before model training. Results showed that the trained CNN models provide a promising approach to classify images into the three wheat blast severity categories; the model trained only on non-matured spike images achieved the highest precision, recall, and F1 score when classifying the images. The high classification accuracy could serve as a basis to facilitate wheat spike blast phenotyping in the future.
Wheat blast is an emergent disease caused by the Ascomycetous fungus Magnaporthe oryzae pathotype Triticum (MoT)
MoT can infect leaves, stems, and seeds, although the most remarkable and studied symptoms are associated with the spike (Igarashi et al.,
Warm temperatures, excessive rain, long and frequent spike wetness, and limited fungicide efficacy exacerbate the intensity of wheat blast epidemics, especially in susceptible cultivars (Goulart et al.,
Since 1985, when wheat spike blast was first detected, intense efforts have been undertaken to identify resistance (Igarashi et al.,
Plant disease estimations, or phytopathometry, refer to the measurement and quantification of plant disease severity (DS) or incidence that is essential when studying and analyzing diseases at organ, plant, or population levels (Large,
A bottleneck in the identification of novel sources of resistance is measuring disease intensity (i.e., plant disease phenotyping), which is considered a limiting factor in the assessment of genotype performance in plant breeding programs (Mahlein,
Computer vision, machine learning, and deep learning methods have recently been adapted to agriculture due to increased knowledge of algorithms and model capabilities that can learn and make predictions from Red Green Blue (RGB), multispectral, or hyperspectral images (Barbedo,
A variety of CNN classification models are available for plant diseases. These include models for bacterial pustule (
Evaluate the agreement in data acquisition of the human rater who collected and classified datasets.
Develop an accurate deep CNN model to detect and classify wheat spike blast symptoms in three severity categories.
A written informed consent was obtained from the individual for the publication of any potentially identifiable images or data included in this article.
Two experiments were conducted under controlled conditions in a growth room at the Asociación de Productores de Oleaginosas y Trigo (ANAPO) research facility in Santa Cruz de la Sierra, Bolivia. Wheat cultivars were planted in pots of 15 cm diameter, filled with vermicast:silt (3:1 [v/v]), and grown at 18−25°C, a 14 h light/10 h dark photoperiod, and 50–60% relative humidity. Plants were fertilized, and insecticides were sprayed when needed. Plants were arranged in a randomized complete block design with wheat cultivars having various levels of resistance to MoT, two inoculation levels (inoculated and non-inoculated), and four replicates. Wheat cultivars with a range of sensitivity to wheat blast were used for the experiments. Experiment one included Bobwhite and the South American spring cultivars Atlax, BR-18, Motacú, Urubó, AN-120, Sossego, and San Pablo; experiment two included BR-18, San Pablo, Bobwhite, and Atlax (Baldelomar et al.,
Plants were inoculated at the growth-stage Feekes 10.5, when the spike had completely emerged, with MoT isolate 008-C (
Wheat blast image collection flow process:
Following phytopathometry terminology, we used the term “estimate” for visual disease estimations made by humans and the term “measurement” for estimations made by image analysis (Bock et al.,
Visual estimations of wheat spike blast symptoms were taken seven times after inoculation in each experiment. In experiment one, visual estimations and images were collected 4, 6, 9, 12, 14, 16, and 19 days after inoculation (DAI) and in experiment two, 0, 5, 7, 10, 12, 14, and 19 DAI. Each spike side (four sides total) was visually estimated for DS by Rater 1 (a plant pathologist with experience on wheat blast, rice blast, and other diseases). Simultaneously, an image of each spike side was captured perpendicular to the spike at a distance of approximately 50 cm with a Canon EOS 6D DSLR camera (Canon Inc., Tokyo, Japan) (
The total spike disease estimations of Rater 1 paired with the corresponding image were converted to a three-category scale according to the amount of severity, and these served to feed the training and testing datasets of the CNN model. The category selection was based on wheat blast results from published work conducted over the last decade (Baldelomar et al.,
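As an illustration of this conversion, the sketch below maps a visual severity estimate (0–100%) to the three categories used to label the images; the function name is hypothetical, and the thresholds follow the category definitions given for the confusion matrices later in this article (0%, 0.1–20%, 20.1–100%).

```python
def severity_to_category(severity_pct: float) -> int:
    """Convert a visual disease severity estimate (0-100%) to the
    three-category scale used to label images for CNN training."""
    if severity_pct == 0.0:
        return 1  # Category 1: 0% severity (no visible symptoms)
    elif severity_pct <= 20.0:
        return 2  # Category 2: 0.1-20% severity
    else:
        return 3  # Category 3: 20.1-100% severity
```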
Examples of images per category:
Rater 1 played a critical role in estimating DS and classifying into categories all the images belonging to Dataset 1 and Dataset 2 (the datasets are described in the section, generation of data sets according to wheat spike physiological changes). Therefore, an inter-rater analysis was needed to determine the reliability of the visual estimations of Rater 1. Inter-rater agreement assesses the degree of agreement between two or more raters who obtain independent ratings about the characteristics of a set of subjects. Subjects of interest include people, things, or events that are rated (Madden et al.,
To determine the agreement of the disease estimations of Rater 1, we performed an inter-rater analysis including a second rater, and ImageJ was used as an image-analysis baseline. Rater 2 is a plant pathologist and an expert in wheat blast. ImageJ is an image analysis software used to measure plant diseases from images.
We used a power analysis based on the Wilcoxon signed-rank test to determine the sample size for the inter-rater agreement studies of the two training datasets. The test consisted of DS estimations or measurements of 31 and 29 images from CNN training Dataset 1 and training Dataset 2, respectively. From now on, the 31 images selected from Dataset 1 will be called sample Dataset 1 and the 29 images from Dataset 2 will be referred to as sample Dataset 2. Rater 2, an experienced researcher with more than 4 years of work on wheat blast disease, visually estimated DS from sample Dataset 1 and sample Dataset 2. Additionally, disease measurements were obtained from sample Dataset 1 and sample Dataset 2 using ImageJ software as indicated above. Ultimately, the DS results of the visual disease estimations of the human raters and the ImageJ measurements were compared. The estimated and measured DS values from both samples were analyzed for inter-rater agreement in two scenarios: one with a scale of 0–100% DS (continuous data), and the other with the images divided into three categories of DS (ordinal data). We therefore computed Lin's concordance coefficient, Fleiss' kappa, and weighted kappa statistics.
Lin's concordance coefficient (ρc or CCC) is used to estimate the accuracy1 between two raters using continuous data. From the analysis, we obtained the estimation of accuracy1, precision1, and bias of the disease estimations and disease measurements between the two raters (Lin,
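For reference, a conventional form of Lin's coefficient for two sets of paired ratings $x$ and $y$, and its decomposition into the precision and bias-correction terms reported in the Results, is

$$\rho_c = \frac{2\,s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2} = r \cdot C_b,$$

where $r$ is the Pearson correlation coefficient (precision) and $C_b$ is the bias correction factor measuring how far the best-fit line deviates from the 45° line of perfect concordance.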
To determine the degree of association between the estimation of categorical information provided by the two raters (inter-rater agreement), the weighted kappa statistics were computed (Chmura,
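For ordinal categories, the weighted kappa is conventionally computed as

$$\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, o_{ij}}{\sum_{i,j} w_{ij}\, e_{ij}},$$

where $o_{ij}$ and $e_{ij}$ are the observed and chance-expected proportions of ratings in cell $(i, j)$ of the inter-rater contingency table, and $w_{ij}$ are disagreement weights that increase with the distance between categories $i$ and $j$.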
Wheat was inoculated at growth-stage Feekes 10.5 (spike completely emerged) of the host plant. Approximately every 2 days after inoculation, spike images were collected to capture the changes that developed. Indirectly, progressive physiological changes in the spikes were recorded, as maturation begins at wheat growth-stage Feekes 10.5.4 (kernels watery ripe) and continues through growth-stage Feekes 11.4 (mature kernels) (Large,
Two datasets were generated considering the (color) physiological changes that can lead to confusion when training the CNN model. Dataset 1 included maturing and non-matured wheat spikes, and Dataset 2 included only non-matured spikes (data available at:
Training and testing data distribution and the number of images used in Dataset 1 and Dataset 2.
Dataset | Set | Category 1 | Category 2 | Category 3 |
Dataset 1 | Training | 1,595 | 640 | 402 |
Dataset 1 | Augmented training | 1,595 | 1,920 | 1,608 |
Dataset 1 | Testing | 381 | 178 | 110 |
Dataset 2 | Training | 1,430 | 386 | 307 |
Dataset 2 | Augmented training | 1,430 | 1,544 | 1,535 |
Dataset 2 | Testing | 327 | 120 | 90 |
Data augmentation is a common technique providing a viable solution to data shortage issues by adding copies of original images with modification or noise (Boulent et al.,
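As a sketch of how such modified copies of the original images can be generated, a simple augmentation pipeline with torchvision is shown below; the specific transformations and parameter values are illustrative assumptions, not necessarily those used in this study.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline: each original training image can
# yield additional copies with small geometric and color perturbations.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # mirror the spike image
    T.RandomRotation(degrees=10),                  # small random rotation
    T.ColorJitter(brightness=0.2, contrast=0.2),   # mild color noise
    T.Resize((224, 224)),                          # network input size
    T.ToTensor(),
])
```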
In recent years, the feasibility of using artificial intelligence, in particular deep learning, has been expanded into a variety of applications (Atha and Jahanshahi,
In this study, wheat spike blast symptoms were automatically detected and classified into three severity categories using a pre-trained CNN model. This model may be more efficient than classifying images visually. To obtain a general and reliable CNN model, the network needed to be trained using a large labeled training dataset, since the performance of a CNN model is highly dependent on the number and quality of the training data. However, it was not feasible to collect a wheat blast dataset of a million images in a short time, and without a large training dataset the CNN model can easily under- or over-fit. To address this issue, transfer learning was used as a practical solution, in which a network is first trained on a different, typically much larger dataset such as ImageNet. A major advantage of transfer learning is that it adapts parameters already learned from an abundant number of images. Transfer learning starts with a pre-trained model (e.g., VGG16) and replaces the fully-connected (FC) layers of the model with new FC layers. A network trained on the ImageNet dataset was used to initialize the network parameters, and the whole network was fine-tuned since the nature of our dataset was very different from the ImageNet dataset. In this study, an FC layer consisting of three nodes, representing the three categories, was appended to the end of the network. A residual neural network architecture (ResNet101), a 101-layer CNN with residual (skip) connections trained on ImageNet data (He et al.,
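A minimal PyTorch sketch of this transfer-learning setup (an ImageNet-pretrained ResNet101 whose final fully-connected layer is replaced by a three-node layer, with all parameters left trainable for fine-tuning) could look as follows; details of the actual implementation may differ.

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet101 backbone pre-trained on ImageNet.
model = models.resnet101(pretrained=True)

# Replace the final fully-connected layer with a new three-node layer,
# one output node per wheat spike blast severity category.
model.fc = nn.Linear(model.fc.in_features, 3)

# Fine-tune the whole network: all parameters remain trainable.
for param in model.parameters():
    param.requires_grad = True
```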
The network was trained for 15 epochs using a stochastic gradient descent optimizer (Bottou,
The CNN model was trained on the two datasets under four cases of study, each assigning different weights in the loss function to each category (a sketch of a weighted loss follows the table below).
Case | Dataset 1 weights [Category 1, Category 2, Category 3] | Dataset 2 weights [Category 1, Category 2, Category 3] |
Case 1 | [1, 1, 1] | [1, 1, 1] |
Case 2 | [1, 10, 1] | [1, 10, 1] |
Case 3 | [2, 5, 1] | [2, 5, 1] |
Case 4 | [2, 1, 1] | [2, 1, 1] |
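To illustrate how the per-category weights in the table above can enter the training objective, the sketch below combines a weighted cross-entropy loss with the stochastic gradient descent optimizer mentioned earlier; the learning rate and momentum values are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Model as in the previous sketch: ResNet101 with a three-node output layer.
model = models.resnet101(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 3)

# Example: Case 2 weights [1, 10, 1] penalize errors on Category 2 more heavily.
class_weights = torch.tensor([1.0, 10.0, 1.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Stochastic gradient descent optimizer (hyperparameters are illustrative).
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```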
The performance of the CNN model was evaluated on the testing set using accuracy2, precision2, recall, and F1 score, computed from the numbers of true positive (TP), false positive (FP), and false negative (FN) instances in each category.
Accuracy2 was defined as the total number of TP instances among the three categories divided by the total number of predictions. Precision2 was defined as the total number of TP instances divided by the total number of predicted positive examples, which is the sum of the TP and FP instances in the binary classification task (Equation 3). Similarly, the precision2 of the multi-class task indicates the number of instances that were correctly predicted given all the predicted labels for a given category. Recall was defined as the number of TP instances divided by all the positive samples (TP and FN) (Equation 4). The F1 score is a single metric that encompasses both precision2 and recall (Equation 5). Accuracy2, precision2, recall, and F1 score metrics range from 0 to 1, where higher values indicate a higher predictive ability of the model.
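The referenced equations correspond to the standard definitions of these metrics. For a given category, with TP, FP, and FN the numbers of true positive, false positive, and false negative instances, and $N$ the total number of predictions over the three categories $k$:

$$\mathrm{Accuracy} = \frac{\sum_{k=1}^{3} \mathrm{TP}_k}{N}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$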
The final wheat spike blast severity was recorded at 19 days after inoculation, when cultivar Atlax reached 100% average DS, followed by Bobwhite (99.7%), San Pablo (32.9%), BR-18 (8.7%), Motacú (3.7%), AN-120 (3.31%), Urubó (1.9%), and Sossego (0.83%). Wheat spike blast symptoms developed on all tested cultivars, with reactions to MoT infection consistent with previous reports, except for cultivar San Pablo, which showed moderate susceptibility (Baldelomar et al.,
Lin's concordance correlation analysis showed higher accuracy1 (ρc = 0.89–0.91), higher precision1 (r = 0.91–0.94), and lower bias (Cb = 0.95–0.99) in the sample Dataset 2 than in the sample Dataset 1 (ρc = 0.77–0.85, precision1 r = 0.80–0.87, and bias Cb = 0.93–0.98) (
Regression analysis of wheat spike blast DS estimations made by Rater 1 (responsible for estimating the severity of the total image dataset)
The weighted kappa statistics (κ), used to quantify inter-rater agreement, were higher in the sample Dataset 1 than in the sample Dataset 2, with κ = 0.72–0.88 (
Values of weighted Kappa (κ) analysis for inter-rater agreement between raters and ImageJ in Dataset 1 (maturing and non-matured spikes) and Dataset 2 (non-matured spikes) of wheat spike blast under controlled environment.
Rater 1 | 0.882 | 4.93 | 0.822 | 4.45 |
Rater 2 | 0.727 | 4.13 | 0.776 | 4.32 |
Rater 1 | 0.747 | 4.32 | 0.849 | 4.65 |
The Fleiss kappa coefficient (Fκ), which compared the association of ordinal categorical information of two or more raters, showed an Fκ = 0.771 (
To train the proposed CNN model, two different datasets were used. As mentioned above in the section on the generation of datasets according to wheat spike physiological changes, Dataset 1 included maturing and non-matured spike images, whereas Dataset 2 included only non-matured spike images.
The testing accuracy2 of the model trained with Dataset 1 was 90.1% in Case 1, 90.4% in Case 2, 90.0% in Case 3, and 87.7% in Case 4. The testing accuracy2 of Dataset 2 was 98.4% in Case 1, 93.9% in Case 2, 95.0% in Case 3, and 94.2% in Case 4. Dataset 2 presented higher accuracy2 values compared to Dataset 1, suggesting that the model was accurate. However, it was not sufficient to claim that the model was reliable based on accuracy2 alone since the dataset in this study was unbalanced. In addition to accuracy2, other metrics can help evaluate the performance of the CNN model, such as precision2, recall, and F1 score.
Precision2 indicates the ability to correctly classify an instance among all predicted positive instances. The focus was on the performance of the CNN model in Category 2, as this is the category that breeders and pathologists will concentrate on for breeding purposes. Dataset 1 Case 2 showed the lowest precision2 (75.4%) among all cases (
Classification performance of the CNN model when classifying the testing set of Dataset 1 (maturing and non-matured spikes) and Dataset 2 (non-matured spikes) in the cases of the study presented different weights in the loss function [weight in Category 1, weight in Category 2, weight in Category 3].
Case | Metric | Dataset 1 Cat. 1 | Dataset 1 Cat. 2 | Dataset 1 Cat. 3 | Dataset 2 Cat. 1 | Dataset 2 Cat. 2 | Dataset 2 Cat. 3 |
Case 1 | Precision | 0.891 | 0.852 | 0.955 | 0.923 | 0.918 | 0.967 |
Case 1 | Recall | 0.945 | 0.742 | 0.955 | 0.985 | 0.750 | 0.967 |
Case 1 | F1 score | 0.917 | 0.793 | 0.955 | 0.953 | 0.826 | 0.967 |
Case 2 | Precision | 0.926 | 0.754 | 0.950 | 0.952 | 0.902 | 0.936 |
Case 2 | Recall | 0.890 | 0.860 | 0.864 | 0.963 | 0.842 | 0.978 |
Case 2 | F1 score | 0.908 | 0.803 | 0.905 | 0.957 | 0.871 | 0.957 |
Case 3 | Precision | 0.915 | 0.841 | 0.938 | 0.953 | 0.927 | 0.967 |
Case 3 | Recall | 0.929 | 0.803 | 0.955 | 0.985 | 0.842 | 0.967 |
Case 3 | F1 score | 0.922 | 0.822 | 0.946 | 0.968 | 0.882 | 0.967 |
Case 4 | Precision | 0.915 | 0.850 | 0.946 | 0.942 | 0.941 | 0.946 |
Case 4 | Recall | 0.937 | 0.798 | 0.964 | 0.991 | 0.792 | 0.967 |
Case 4 | F1 score | 0.926 | 0.823 | 0.955 | 0.966 | 0.860 | 0.956 |
Confusion matrix of the images of Dataset 1 (maturing and non-matured spikes) showing "true" categories by Rater 1 (y-axis) and predicted categories by the CNN model (x-axis). Category 1: contained images with 0% severity, Category 2: 0.1–20% severity, Category 3: 20.1–100% severity. The cases of study presented different weights in the loss function [weight in Category 1, weight in Category 2, weight in Category 3].
Recall, which indicates the ability of the model to correctly recognize instances of a category, was also used to evaluate the CNN model. In datasets 1 and 2, the recall of Category 2 was the lowest, illustrating the challenge for the model of classifying images of Category 2 (early disease stages and low levels of disease symptoms) (
F1 score is a common indicator of the overall performance of the CNN model. In datasets 1 and 2, the F1 score of Category 2 was the lowest, reaffirming the difficulty of classifying images of Category 2 by the model (
A comparison of outcomes revealed that Category 2 was the most difficult category to classify correctly (
Confusion matrix of the images of Dataset 2 (non-matured spikes only) showing “true” categories by Rater 1 (y-axis) and the predicted categories by the CNN model (x-axis). Category 1: contained images with 0% severity, Category 2: 0.1–20% severity, Category 3: 20.1–100% severity. The cases of study presented different weights in the loss function [weight in Category 1, weight in Category 2, and weight in Category 3].
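As an illustration of how such confusion matrices and the per-category precision, recall, and F1 scores can be computed from the model's predictions on a testing set, standard scikit-learn utilities could be used as sketched below; the label vectors shown are placeholders, not the study's data.

```python
from sklearn.metrics import confusion_matrix, classification_report

# y_true: categories assigned by Rater 1; y_pred: categories predicted by
# the CNN on the testing set (placeholder values for illustration only).
y_true = [1, 2, 3, 2, 1, 3, 2, 1]
y_pred = [1, 2, 3, 1, 1, 3, 2, 1]

print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3]))
print(classification_report(y_true, y_pred, labels=[1, 2, 3], digits=3))
```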
Wheat blast is spreading worldwide, and the identification of durable and broad-spectrum resistance is urgently needed (Valent et al.,
The results of this study demonstrated that the agreement between disease estimations and disease measurements was greater than what could be expected to occur by chance. Rater 1 (a pathologist with expertise in multiple diseases besides blast) consistently obtained higher kappa coefficients (substantial agreement), higher accuracy, and lower bias in all the performed analyses than the disease estimations of a wheat blast expert (Rater 2) and the disease measurements of ImageJ software. These results are relevant because Rater 1 estimated the DS and classified the entire image dataset into three categories. Therefore, the agreement analysis supports an accurate classification of the images before they were used to train and test the CNN model. The inter-rater agreement analysis also showed that accuracy, precision, and bias are highly dependent on the nature of the dataset. Dataset 1 included images showing disease symptoms and natural plant physiological changes. Although Dataset 2 was preferred due to its higher concordance, results showed that DS assessments among raters were never perfect.
In the present study, the applicability of CNNs for wheat spike blast severity classification from spring wheat images was investigated. Currently, the CNN approach can classify three severity levels (0%, 0.1–20%, and 20.1–100% severity) and was trained using a reliable wheat spike blast dataset. The advantage of this three-category CNN model is that it detects the infected wheat spike and provides further information on the corresponding blast severity level. Such a model is useful to classify different infection levels and to distinguish resistant cultivars from susceptible ones. Although the wheat blast dataset comprised imbalanced data that could have led to a biased CNN model, two techniques, data augmentation and a weighted loss function, were applied during the training process. The loss function maps the difference between the ground truth and the predicted output of the model. The importance of a category with a larger error can be enhanced by assigning it a larger weight in the loss function. The results indicate that the performance of the model improves significantly when the weighted loss function is applied. In particular, the model gained the ability to detect Category 2 when a weighted loss function was used. These encouraging results demonstrate that the proposed CNN model can distinguish Category 1 from Category 2 even though there is relatively little difference between the two categories. More importantly, the CNN could classify the images of Category 3, which contained infected spikes with severities higher than 20%, with low error.
The results showed that the CNN models trained on both datasets (Fernandez-Campos et al.,
Different software based on image analysis are currently available to measure DS (Lamari,
Researchers could benefit from the proposed approach, which is promising for wheat spike blast severity measurements under controlled environmental conditions. The results are supported by a substantial agreement of the "true" data obtained from Rater 1 with the disease estimations of Rater 2 and the disease measurements of ImageJ. In collaboration with data scientists, breeders could pre-select wheat cultivars under controlled environments by automatically analyzing and classifying images using the wheat spike blast CNN model, preferably trained with Dataset 2. Next, breeders can focus on the cultivars that fall into Categories 1 and 2, which, in general terms, are considered resistant or moderately resistant. This may reduce the high number of cultivars tested under field conditions, accelerating the cultivar screening process. A limitation of the study is that the CNN was trained to classify only images of wheat spike blast (spring wheat) under controlled conditions. Further research is required to improve the generalizability of the CNN model using a larger wheat spike blast dataset consisting of controlled-environment and field images. In addition, the results of this study point to an approach that could be applied similarly to other pathogens.
The next step in this research is to validate the model with other images with a similar background and deploy it in a Web application. This future option might allow breeders and pathologists to submit their images and have the model classify them by categories automatically. As more images of various cultivars infected with different isolates can be added to the dataset, increasing symptom variability, a more refined and robust model can be developed. To our knowledge, this is the first study presenting a deep CNN model trained to detect and classify wheat spike blast symptoms. The model might help in the pre-screening of wheat cultivars against the blast fungus under controlled conditions in the future.
The raw data supporting the conclusions of this article and corresponding models are available at:
MF-C, CDC, MJ, Y-TH, TW, and JJ contributed to the study's conception and design. MF-C and CDC conducted the experiment. MF-C collected data and wrote the first draft of the manuscript. MF-C and CG-C performed the statistical analysis. Y-TH and TW wrote the code for the model. Y-TH wrote sections of the manuscript. CDC, MJ, Y-TH, DT, and CG-C edited the manuscript. All authors approved the submitted version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We thank the Asociación de Productores de Oleaginosas y Trigo (ANAPO) and the Centro de Investigación Agrícola Tropical (CIAT) for the support provided with the experiments conducted in Bolivia and M. G. Rivadeneira from CIAT for the help with inoculum preparation. We acknowledge the Iyer-Pascuzzi Lab and A. P. Cruz for their guidance in plant phenotyping and D. F. Baldelomar, L. Calderón, J. Cuellar, F. Cortéz, and D. Coimbra from ANAPO for their help with research activities. We also thank Dr. Barbara Valent (Kansas State University) and Gary Peterson for their contribution and commitment to the wheat blast work in South America. Borlaug Fellows Drs. Mr. Kabir and S. Das participated and were trained while conducting experiments associated with this work.
The Supplementary Material for this article can be found online at: