Edited by: Hehuang Xie, Virginia Tech, United States
Reviewed by: Yukiori Goto, Kyoto University, Japan; Diana Tan, University of Western Australia, Australia
†Co-first authors
This article was submitted to Epigenomics and Epigenetics, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
香京julia种子在线播放
Children with autism spectrum disorder (ASD) are characterized by social/communication deficits and restricted/repetitive behaviors, but display marked variation at the genetic, phenotypic, and functional levels (
Efforts to identify ASD biomarkers have yielded much information about the biologic basis of ASD. For instance, children with ASD are typified by hyperserotonemia (
Why has this biologic information failed to yield an accurate biomarker? First, ASD heterogeneity makes it difficult to generalize a single measure to all children. Second, the evolving nature of brain function over the lifespan necessitates that biomarker discovery is performed within a narrow neurodevelopmental window. Third, ever-changing diagnostic criteria create a challenging landscape for patient characterization. Fourth, overlap between ASD and other cognitive/behavioral phenotypes necessitates comparisons with “control” participants exhibiting non-ASD developmental delay (DD). Finally, nearly all studies rely on single molecule types. To date, most ASD biomarker studies fail to overcome these challenges. Few employ sufficient sample sizes, focus on multiple molecule types, or include separate training and test sets, leading to poor generalizability and validity.
Experts have proposed that methods analyzing entire networks of biomarkers may increase the specificity of ASD testing (
Here, we interrogate levels of human and microbial saliva RNAs to train and then test a biomarker classification tool in 456 children, age 19–83 months. This study tests the hypothesis that oral transcriptome measurement provides a broad network perspective that can accurately identify ASD status in children.
This study was carried out in accordance with the recommendations of the Institutional Review Boards (IRBs) at the State University of New York (SUNY) Upstate Medical University, Penn State College of Medicine, and University of California, Irvine (UCI) with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the SUNY Upstate, Penn State, and UCI IRBs.
The study included 456 children, 19–83 months of age. To our knowledge this is the largest study of RNA expression in children with ASD. Note that the sample is smaller than some DNA-based studies of ASD because these studies rely on measurements of rare-occurring CNVs or SNPs, while the current study focuses on RNA transcripts present in the majority of children (at varying concentrations). Participants were recruited from Penn State (
General guidelines for interpretation of binomial classification analysis results using receiver operating characteristic (ROC) curves have established that values of 0.5–0.6 reflect nearly worthless classifiers, 0.6–0.7 reflect poor classifiers, 0.7–0.8 reflect fair classifiers, and 0.8–0.9 represent good classifiers. Following these criteria, the goals of our study were to distinguish good classifiers from poor classifiers. Using Power Analysis and Sample Size Software (v15; NCSS, LLC; Kaysville, UT, United States), we thus set the null area under the curve (AUC) upper limit to 0.7, and determined that the sample sizes used in our training set provided 85% power to detect an AUC of the ROC curve = 0.77 (based on a one-sided z-test, with an alpha = 0.05), 99% power to detect an AUC > 0.8, and 100% power to detect an AUC > 0.84. Similarly, the validation cohort (
Medical and demographic characterization was performed as follows: age (months), sex, race, gestational age at birth (weeks), and family history of ASD (first- or second-degree relatives) were collected through parent report. Sleep disorder (defined as difficulty initiating sleep, difficulty maintaining sleep, or use of melatonin), gastrointestinal diagnosis (defined as reflux, constipation, chronic diarrhea, or chronic abdominal pain), asthma, and attention deficit hyperactivity disorder (ADHD) were screened through parent report and verified through chart review. Body mass index (BMI; kg/m2) was measured at the time of sample collection, or obtained through chart review. Adaptive behavior was assessed with Vineland Adaptive Behavior Scales, 2nd edition (VABS-2) on most (77%) children (
At the time of enrollment, saliva was obtained in a non-fasting state with an Oracollect RNA swab (DNA Genotek; Ottawa, Canada) following water rinse. Pooled saliva was collected by applying the highly absorbent swab at two sites: (1) the base of the tongue (near the sublingual ducts); and (2) bilaterally between the gums and buccal mucosa (proximal to the parotid ducts). Saliva collection was completed in 5–10 s. Scraping of the buccal mucosa and teeth was generally avoided. RNA was extracted from whole saliva with a standard Trizol method (
Samples were randomly divided into a training set (
For each sample, five subtypes of RNA were quantified: (1) mature/precursor microRNA (miRNA); (2) piwi-interacting RNA (piRNA); (3) non-coding RNA, including small nucleolar RNA (snoRNA) and long intergenic non-coding RNA (lincRNA); (4) ribosomal RNA (rRNA); and (5) microbial RNA. Three samples did not meet quality criteria for inclusion (
RNA abundance levels utilized in this study were subjected to a systematic series of data transformations to improve sensitivity for classifier detection and reduce the influence of batch effects. Transformation steps and parameters were determined in the training dataset, and later applied in identical fashion to the hold-out test set (Supplementary Figure
In order to develop robust multivariate classifier models that could utilize RNAs in an unbiased manner across the full expression range, we first employed an inverse hyperbolic sine transformation of the read count data within each RNA category, according to the formula
Next, we employed global normalization in which the vector of RNA abundance (within each category) was divided by the norm of the vector (
To account for subject variability and demographic influences on the classifiers, continuous variables (age, birth age, and BMI) were also subjected to spatial sign transformation to ensure they were commensurate with other variables. Co-morbid medical conditions, history, and race were set to binary factors of 1 (positive/present) or 0 (negative/absent) and reduced to principal components that accounted for 80% of variance.
To select and rank the most predictive RNAs within each category, generalized stochastic gradient-boosted logistic models were fit to the training set data. In this method, multivariate logistic models were first trained in an iterative process on subsets of RNAs from subsets of training samples, and input features were given relative ranking based on their prevalence in the logistic models (
Second, to create a joint ranking of all features, the top ranked RNAs from each category were merged with the transformed demographic data and fit to a joint stochastic gradient-boosted model in the training set, as above. This combined model similarly ranked the input features in order of importance across all categories (RNA, biological, demographic, etc.).
Third, to build a predictive model based on these ranked features, radial kernel support vector machines (SVMs) (
As an additional step to help prevent overfitting the training data, 10-fold cross-validation was performed 10 times in each step. Additionally, model parameters (including gradient step size, minimum number of samples per iteration, maximum number of features per logistic model, size of the radial basis function, and cost budget) were carefully selected from reasonable ranges. Confidence intervals for ROC curve performance were determined with the
Transformation parameters and loadings calculated on the training set (
Data transformation, machine learning implementation, and statistical analyses were performed in R
Genomic loci for the predictive RNAs were determined using the University of California Santa Cruz Genome Browser
The analysis included 456 children in the training and test datasets (mean age 51 ± 16 months; 77% male; 66% Caucasian; 52% with ASD; Table
Participant characteristics.
Clinical characteristics | All ( |
Train set ( |
Test set ( |
||
---|---|---|---|---|---|
ASD (188) | Non-ASD (184) | ASD (50) | Non-ASD (34) | ||
Male sex, # (%) | 337 (76) | 156 (83)∗ | 122 (66) | 45 (90) | 26 (76) |
Mean age, mos (SD) | 51 (16) | 54 (15)∗ | 49 (16) | 53 (15) | 46 (16) |
White race, # (%) | 296 (67) | 122 (65)∗ | 126 (69) | 29 (58) | 23 (68) |
BMI, kg/m2 (SD) | 16.7 (2.5) | 16.6 (2.9) | 16.8 (2.3) | 16.6 (2.1) | 16.7 (2.4) |
Sleep disorder, # (%) | 141 (32) | 85 (45)∗ | 31 (17) | 21 (42) | 8 (24) |
ADHD, # (%) | 63 (14) | 41 (22)∗ | 18 (10) | 4 (8)∗ | 0 (0) |
GI diagnosis, # (%) | 57 (13) | 36 (19)∗ | 16 (9) | 7 (14) | 1 (3) |
Asthma, # (%) | 43 (10) | 16 (9) | 18 (10) | 6 (12) | 2 (6) |
Gestation, wks (SD) | 38.6 (2.6) | 39 (3) | 39 (3) | 38 (2) | 39 (2) |
fam hx, # (%) | 172 (39) | 93 (50)∗ | 51 (28) | 29 (58) | 10 (29) |
VABS Comm (SD) | 82.7 (22.8) | 72.2 (20.1)∗ | 93.5 (20.7) | 73.5 (20.8)∗ | 93.4 (18.2) |
VABS Social (SD) | 84.7 (22.7) | 72.2 (16.4)∗ | 97.4 (22.6) | 73.5 (18.0)∗ | 96.3 (15.0) |
VABS Adaptive (SD) | 84.8 (20.0) | 74.9 (15.2)∗ | 95.0 (20.0) | 73.6 (18.9)∗ | 96.6 (11.8) |
ADOS, mean (SD) | 6.1 (2.6) | 6.7 (2.4)∗ | 4.5 (2.9) | 6.7 (1.6)∗ | 3.2 (1.1) |
In training and test sets the ASD group displayed lower mean VABS-2 standard scores (
The feature selection algorithm resulted in a panel comprised of 32 diagnostic RNA features, including 12 microbial taxa, 7 mature miRNAs, 4 precursor miRNAs, 8 piRNAs, and 1 snoRNA (Figure
Abundances of 32 salivary RNAs selected for the diagnostic panel. Box and whisker plots show distributions (in the training set;
Algorithm Performance. In a training set (
To ensure the classifier algorithm performance was not systematically biased based on patient characteristics, distributions of misclassification errors across patient features were explored in the test set (Figure
Correctly and incorrectly classified children display similar medical and demographic features. Histograms display distributions of
Of the 20 human RNAs in the panel, 19 (95%) originated from loci with at least one ASD-associated CNV (Supplementary Table
Evaluation of putative miRNA targets revealed significant enrichment (FDR < 0.05) in 12 KEGG pathways (Supplementary Table
Hierarchical clustering analysis of the classifier features revealed two distinct RNA clusters (Supplementary Figure
This investigation identified 32 salivary RNA features that accurately distinguished ASD status in a training set of 372 children, and displayed 85% accuracy in a separate test set of 84 additional children. The RNA panel included human RNAs and microbial RNAs with putative functions converging on ASD-associated neurobiological pathways. It provides an accurate, objective, systems-based method for identifying ASD status.
The ability to clinically discern children with autistic from peers with non-ASD DD is more challenging than discriminating ASD from TD children. Yet, children with DD have typically not been included in ASD biomarker studies. In this study, the RNA algorithm identified 41/50 ASD children while differentiating 30/34 non-ASD children in a naïve hold-out sample set. Notably, test performance was similar for TD (18/21) and DD (12/13) children. The potential to accurately discriminate between ASD and DD lies at the crux of ASD diagnoses, and represents a significant potential contribution for this objective, biologic assay.
This study employed ASD and non-ASD groups of equal size, and the total non-ASD cohort included 39% DD participants (84/218). Thus, this algorithm is not designed as a screening tool (where ∼1/50 children might have ASD). Instead, our results are best viewed as an adjunct to positive MCHAT-R screening, or as an aid in ASD diagnosis. In these settings (e.g., after a positive MCHAT-R), nearly 50% of children would be expected to have ASD, and a significant proportion of the others would likely have DD.
Although our test differentiates ASD and non-ASD participants from multiple geographic regions, whether the algorithm performs accurately in populations with increasing geographic diversity remains to be determined. This is particularly important to consider given that our algorithm includes microbial RNAs, which could be influenced by dietary and environmental factors. The present algorithm was developed from saliva of children residing in New York and Pennsylvania. When applied to a test set that contained children from New York, Pennsylvania, and California, the test maintained diagnostic accuracy. Future investigations will need to validate these findings across broader geographic cohorts. Refinement of the model may improve performance further as we continue to sample increasingly diverse populations at high volumes and incorporate data into the training steps.
The potential to employ this test in younger toddlers and infants has yet to be assessed. While it is possible that the modest difference in age between the ASD and non-ASD groups may have confounded the analyses, the children incorrectly classified displayed similar age, sex, and race as those who were correctly classified. Thus, the algorithm showed no bias toward demographic or medical factors within the study cohort. It also suggests the test may be broadly applicable without exclusion of medical or demographic subgroups. Conditions that might impact salivary RNA (e.g., asthma, gastrointestinal disturbance, BMI) and conditions that are more common in children with ASD (sleep difficulties, ADHD) also did not appear to bias ASD prediction.
This study recruited a cohort generally representative of children receiving developmental referral, favoring robust statistical power over extensive phenotypic analyses. Group assignments were based on clinical assessments. Participant characterization was driven by a combination of parent report, chart review, and standardized VABS-2 and ADOS-2 assessments. Future studies employing extensive behavioral assessment alongside longitudinal RNA sampling and therapeutic interventions may identify nuances in salivary RNA profiles that correlate with ASD endophenotypes, respond to intervention, or prove useful for guiding personalized therapies.
This tool employs poly-omic RNA measures that link both physiologic and environmental factors implicated in ASD (Figure
The poly-omic diagnostic panel integrates genetic, neurobiologic, phenotypic, and environmental factors implicated in ASD. This concept diagram displays the putative function of human RNAs at the intersection between genetic/environmental risk factors and the neuro-behavioral traits associated with ASD. Factors, such as microRNA (miRNA) and piwi-interacting RNA (piRNA) interact with genes involved in chromatin organization, transcriptional regulation, synaptic function, and other critical neuronal pathways. Disruption of these pathways in children with ASD may lead to alterations in the levels of peripheral miRNAs and piRNAs. In addition, neuro-behavioral characteristics of ASD, which can lead to restricted diets and gastrointestinal (GI) disturbance, may be related to the microbial disruptions upon which the current algorithm is based.
Three RNAs in the current panel (miR-378a, miR-3916, piR-12423) arise from loci associated with ASD copy number variants (CNVs) (Figure
The miRNAs in this panel target transcripts that code for critical elements of neurotransmitter (miR-10a/BDNF; miR-92a/GABA), neurohormonal (miR-106a/serotonin), immune (miR-106a/TGF-beta; miR-106a/TNF-alpha) and xenobiotic (miR-361/GSTO2; miR-125a/GSTM2) pathways (Figure
Reliance on microbial measures for ASD identification will require accounting for features influenced by diet and geography. The current study enrolled children from multiple sites and relied on several microbes found in humans throughout the world (e.g., Lactobacillus). However, validation of RNA from less common bacteria (e.g.,
Numerous medical and demographic factors may influence RNA expression in the oropharynx. We have attempted to control for these factors (e.g., BMI, asthma) through matched recruitment and a statistical modeling approach controlling for medical/demographic factors. Inherent differences in rates of gastrointestinal disturbance between ASD and non-ASD groups (
We have developed an objective, quantitative algorithm based on salivary RNA abundance that accurately discriminates children with ASD from peers with DD or TD. This non-invasive test could augment the accuracy of current ASD assessment, as an adjunctive tool for children with positive MCHAT screening, or an objective aid in ASD diagnosis.
The datasets used and/or analyzed during the current study are not available to the public.
SH and FM conceived the study. SH, FM, and RC contributed to study design. KW and SB coordinated the participant enrollment and data collection. SH, AR, and FM carried out data analysis. SH and AR contributed equally to the initial draft of the manuscript which was critically reviewed by all authors prior to submission.
SH, FM, and AR are co-inventors of patent applications for RNA biomarkers in autism spectrum disorder that is assigned to The Research Foundation for the State University of New York, The Penn State Research Foundation and Quadrant Biosciences Inc., and licensed to Quadrant Biosciences Inc. FM and SH serve on the scientific and medical advisory boards of Quadrant Biosciences Inc., and SH is a paid consultant for Quadrant Biosciences Inc. These conflicts of interest are actively managed by the Penn State College of Medicine. AR, RC, KW, and SB are employees of Quadrant Biosciences Inc.
The authors thank Jessica Bieler, MPH (Penn State), Richard Uhlig and Cynthia Dowd Greene (Quadrant Biosciences Inc.) for assistance with study design. Thanks to Jeanette Ramer, MD and Cheryl Tierney, MD (Penn State), and Carroll Grant, Ph.D., Cynthia Brightman, MD, and Diane Montgomery, MD (SUNY Upstate) for assistance with participant identification. We acknowledge Eric Chin MD, Alexandra Confair, Andy Tarasiuk, Molly Carney, Falisha Gillman, MD, Julie Vallati, Nicole Verdiglione, Maria Chroneos, Rachel Pauley (Penn State), Angela Savage and Parisa Afshari, MD, Ph.D., (SUNY Upstate) and Jean Gehricke, Ph.D., and Sharina Alejo (UC Irvine) for assistance with participant recruitment and sample collection. We thank Dongliang Wang, Ph.D., (SUNY Upstate) and Jeremy Williams (Quadrant Biosciences) for guidance with data processing and statistical analysis.
The Supplementary Material for this article can be found online at:
Algorithm training and testing. The methodological pipeline used for RNA feature selection and model development in the training set is shown, along with direct application of the diagnostic algorithm to the naïve hold out test samples.
Associations between human and microbial salivary RNA features. Hierarchical clustering of the 32 RNA classifiers was performed using Pearson Correlation Analysis with complete linkages. There were two distinct RNA clusters: one involving solely human RNA, and a second containing 8 microbes alongside 2 microRNAs and 2 piRNAs. This second cluster may denote human and microbe elements with functional interactions.
Rates of medical conditions in correctly and incorrectly classified participants. The proportion of correctly and incorrectly classified children in the naïve validation set (
Human RNA loci with autism-associated copy number variants. The genomic location for each of the human RNA classifiers is shown, along with the number of autism-associated CNVs for this region, autism-related case reports, autism cases in the Simons Population, and total reported human cases in the Simons Foundation database.
ASD Candidate Genes Targeted by the 11 microRNAs. All autism-associated genes (Simons Foundation database) targeted with high confidence (based on microT-CDS score) by the 11 microRNA classifiers are listed. The strength of the microRNA-gene interaction is listed (Target Score), and if the interaction has been experimentally validated it is noted. Simons Foundation characteristics for each gene are noted, including gene score (strength of autism-association), implication in syndromic forms of autism, and total number of autistic individuals with a known variant in the gene.
KEGG Pathways over-represented by microRNA targets. The KEGG pathways over-represented (FDR < 0.05) by high-confidence gene targets of the microRNA classifiers are listed. The number of genes targeted within each pathway are shown along with the number of microRNAs targeting them. Note that several brain-related pathways have enriched numbers of targets, including axon guidance, neurotrophin signaling, and glioma.
Pathway interactions among putative microRNA targets. DIANA miRPATH software was used to identify KEGG pathways whose gene components had putative interacting relationships with the 11 microRNA classifiers. Notably, several pathways involved in metabolism and immune pathways previously implicated in autism were on the list.
attention deficit hyperactivity disorder
Autism Diagnostic Observation Schedule
analysis of variance
autism spectrum disorder
area under the curve
brain-derived neurotrophic factor
body mass index
copy number variant
developmental delay
diagnostics and statistics manual
gamma-amino butyric acid
long intergenic non-coding RNA
modified checklist for autism in toddlers – revised
microRNA
piw-ineracting RNA
receiver operating characteristics
ribosomal RNA
simons foundation autism research initiative
small nucleolar RNA
single nucleotide polymorphism
State University of New York
support vector machine
typical development
University of California Irvine
Vineland Adaptive Behavior Scales.