Edited by: Christian Haselgrove, UMass Chan Medical School, United States
Reviewed by: David Haynor, University of Washington, United States; Bo-yong Park, Inha University, Republic of Korea
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Collaborative neuroimaging research is often hindered by technological, policy, administrative, and methodological barriers, despite the abundance of available data. COINSTAC (The Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation) is a platform that tackles these challenges through federated analysis, allowing researchers to analyze datasets without publicly sharing their data. This paper presents a significant enhancement to the COINSTAC platform: COINSTAC Vaults (CVs). CVs are designed to further reduce barriers by hosting standardized, persistent, and highly available datasets, while integrating with COINSTAC's federated analysis capabilities. CVs offer a user-friendly interface for self-service analysis, streamlining collaboration and eliminating the need for manual coordination with data owners. Importantly, CVs can also be used with open data: one simply creates a CV hosting the open data to be included in the analysis, thus filling an important gap in the data sharing ecosystem. We demonstrate the impact of CVs through several functional and structural neuroimaging studies that use federated analysis, showcasing their potential to improve the reproducibility of research and increase sample sizes in neuroimaging studies.
In recent years, neuroimaging has seen a growing emphasis on data sharing and collaborative research, as evidenced by the development of new standards [e.g., Brain Imaging Data Structure (BIDS), Gorgolewski et al.,
In this section, we discuss in detail some of the challenges associated with collaborative analysis, particularly in centralized approaches, where the data need to be pooled in one location to perform an analysis. We also discuss COINSTAC, a tool built on the principles of federated analysis to enable analysis without the need to centralize data.
Technological constraints, such as storage space, download speed, and processing power, play a significant role in the feasibility of performing collaborative analyses on large datasets (Homer et al.,
Due to the potentially sensitive nature of neuroimaging datasets, their use in collaborative analysis is often restricted by policies intended to preserve privacy. Collaboration methods include aggregating data in a centralized repository or using Data Usage Agreements (DUAs) (Thompson et al.,
Administrative challenges can arise when collaborating on an analysis, as various steps demand researchers' time and attention. These steps may include communicating between agencies, formulating and signing data-sharing agreements, agreeing on data preparation and analysis processes, procuring technical resources, monitoring and auditing processes, performing data transfer, initiating computations, disseminating results of analyses, and so on.
The efficiency of collaborative analysis is influenced by how quickly these manual steps are executed. Synchronized availability of researchers can present a barrier to the collaboration process. When researchers work asynchronously, each step in a serial process requiring manual interaction introduces potential delays. This can be particularly challenging when researchers are distributed across multiple time zones or have limited time to perform manual tasks. Furthermore, researchers' availability may be constrained by the need for expertise and authority, such as having the authority to sign a data-sharing agreement or the technical expertise to run the appropriate Python script against a dataset. Often, these manual steps must be executed for each new analysis, which can slow down and even impede collaborative analysis. By addressing these administrative barriers, research teams can more effectively collaborate and streamline their analysis processes, ultimately contributing to the advancement of neuroimaging research.
Variability in methodological approaches to data processing and analysis can make reproducing studies challenging (Vogt,
To overcome these barriers, we introduce COINSTAC,
Federated analysis (also federated learning, or decentralized analysis) (Plis et al.,
The Collaborative Informatics and Neuroimaging Suite Toolkit for Anonymous Computation (COINSTAC) (see text footnote 4) (Plis et al.,
The COINSTAC desktop application provides an easy-to-use graphical user interface (GUI) for coordinating and executing federated analysis pipelines among multiple collaborators. Image preprocessing and a variety of univariate and multivariate approaches (e.g., VBM regression, group ICA) can be completed within the app.
For a comprehensive understanding of COINSTAC, its functionalities, and usage, readers are encouraged to refer to the following papers (Plis et al.,
One limitation of the original implementation of COINSTAC is that it requires synchronized coordination (Jwa and Poldrack,
In this paper, we address this limitation by showcasing a method for hosting both private and public datasets where the datasets are persistently accessible for analysis using COINSTAC without the need for synchronized effort from data owners. Analysis of public datasets is made more accessible by removing the need to find, download, preprocess, and prepare datasets for analysis. We provide curated data vaults for various openly available neuroimaging data which COINSTAC users can simply include in their analyses. Access to private datasets can be restricted to a list of computations approved by the vault owner. Standardizing access to data vaults in the COINSTAC system simplifies analysis, optimizes computational performance, and promotes the reusability of neuroimaging datasets.
In this section, we discuss COINSTAC and the extension of the COINSTAC framework with the addition of Vaults, their architecture, and the various use-cases they enable. All code for COINSTAC and COINSTAC Vaults can be found in the COINSTAC GitHub repository.
To understand how Vaults improve the workflow of federated analysis in COINSTAC, we will describe the COINSTAC system and how it is used.
The main components of the COINSTAC system are: the desktop application, the central server, and computation containers. The desktop application provides a graphical user interface (GUI) and manages local computation containers used to participate in federated analyses. The central server manages the central database and runs the containers that act as the inner node in federated analyses.
In the COINSTAC desktop application, users join collections of users called “consortia” to collaborate on an analysis pipeline. A consortium is a group of individual COINSTAC users, each with their own machine capable of serving as a node in a federated analysis pipeline. Each member of a consortium acts as a node in the federated analysis by running local computations inside a container on their system.
The following is how a researcher would use the COINSTAC user interface to create a consortium and run a federated analysis pipeline:
Log in as a user
Join (as a member) or create (as an owner) a consortium
Configure a set of computations (a pipeline) to be performed by a consortium
Map their local data to the pipeline
Initiate the pipeline (a run)
View the results of the pipeline run.
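The workflow above can be pictured as a small state model. The following Python sketch is illustrative only: the class and method names (`Consortium`, `join`, `set_pipeline`, `map_data`, `start_run`) and the paths are hypothetical and do not correspond to the COINSTAC API. It assumes that a run can begin only once every member has mapped local data, which is the point at which manual coordination between members matters.

```python
from dataclasses import dataclass, field


@dataclass
class Consortium:
    """Hypothetical model of a consortium; not the COINSTAC API."""
    owner: str
    pipeline: list = field(default_factory=list)      # ordered computation names
    mapped_data: dict = field(default_factory=dict)   # member -> data path or None

    def __post_init__(self):
        # The owner is also a member of the consortium.
        self.mapped_data.setdefault(self.owner, None)

    def join(self, user):
        self.mapped_data.setdefault(user, None)

    def set_pipeline(self, computations):
        self.pipeline = list(computations)

    def map_data(self, user, path):
        if user not in self.mapped_data:
            raise KeyError(f"{user} is not a member")
        self.mapped_data[user] = path

    def unmapped_members(self):
        return sorted(u for u, p in self.mapped_data.items() if p is None)

    def start_run(self):
        if not self.pipeline:
            raise RuntimeError("no pipeline configured")
        waiting = self.unmapped_members()
        if waiting:
            raise RuntimeError(f"waiting for data mapping from: {waiting}")
        return {"pipeline": self.pipeline, "nodes": sorted(self.mapped_data)}


consortium = Consortium(owner="alice")
consortium.join("bob")
consortium.set_pipeline(["vbm-regression"])
consortium.map_data("alice", "/data/site_a")
blocked = consortium.unmapped_members()   # members the run is still waiting on
consortium.map_data("bob", "/data/site_b")
run = consortium.start_run()
```

In this toy model the run is blocked until the second member maps their data, which mirrors the serial, manual nature of the steps listed above.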
The Vaults system is an extension of the COINSTAC platform that allows datasets to be persistently available for participation in federated analyses without requiring manual action from data owners apart from the initial setup. COINSTAC consortium owners can independently add Vault members to their consortia, allowing Vault datasets to participate in federated analyses without the need for coordination between consortium owners and Vault data owners. The Vault client makes datasets available to the larger COINSTAC ecosystem, enabling others to run pipelines using the Vault's data without the data ever leaving its host system.
Vault clients can be added to a consortium by a consortium owner without any action required from the owner of the Vault data, as shown in
Adding vault data to an analysis pipeline.
Making datasets available for federated analysis through COINSTAC is simple using Vaults. Vaults can be hosted in a variety of compute environments such as: on personal machines, on-premises servers, on a cluster of compute nodes, or in a virtual cloud. Both publicly available datasets and private datasets can be made available to the COINSTAC platform via Vaults. COINSTAC consortia can include any combination of diverse types of data: public and private datasets, data hosted on local machines, Vaults hosted by the Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), and third-party Vaults connected to COINSTAC as shown in
Different types of participants interacting with COINSTAC.
In addition to TReNDS-hosted vaults, data owners are able to host their own (public or private) data as Vaults (
Process of creating a vault in COINSTAC.
The process for hosting a dataset in a Vault is described below:
Install the Vault client: The user installs the Vault client on their host machine.
Request Vault integration: The user submits a request to the COINSTAC team for integrating the Vault into the COINSTAC ecosystem.
Receive API keys: The COINSTAC team provides the user with the necessary API keys for the user's Vault client.
Configure dataset directory: The user specifies the local directory containing the dataset in the Vault client configuration.
Select approved computations: The user chooses a list of computations, granting permission for these computations to be executed on their vault data.
After this process, the Vault becomes available for use in the COINSTAC system. Consortium owners can choose to include the Vault in their consortium and perform federated analysis using Vault data. The process is the same whether the data were downloaded from a public repository or collected privately, since in both cases the source data stay on the user's local machine.
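The outcome of these setup steps can be thought of as a small configuration record plus a check against the approved computation list. The sketch below is a minimal illustration under stated assumptions: the field names (`api_key`, `dataset_dir`, `approved_computations`), the computation names, and the path are hypothetical and are not the Vault client's actual configuration format.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class VaultConfig:
    """Hypothetical configuration record; field names are assumptions."""
    name: str
    api_key: str                       # issued by the COINSTAC team (placeholder below)
    dataset_dir: Path                  # local directory holding the dataset
    approved_computations: frozenset   # computations the owner permits

    def allows(self, computation):
        """Return True if the owner approved this computation for the Vault."""
        return computation in self.approved_computations


def validate_pipeline(config, pipeline):
    """Return the computations in a pipeline that the Vault would refuse."""
    return [c for c in pipeline if not config.allows(c)]


config = VaultConfig(
    name="example-vault",
    api_key="<API-KEY>",
    dataset_dir=Path("/data/example"),
    approved_computations=frozenset({"ridge-regression", "group-ica"}),
)
rejected = validate_pipeline(config, ["ridge-regression", "raw-export"])
```

Here a pipeline containing a computation outside the approved list is flagged before anything runs against the Vault data.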
The Vault client software package is built upon the same core code as the COINSTAC desktop application to manage containers and execute computation pipelines. However, it omits the user interface (UI) component and includes additional code that enables the client to be persistently online and available. The desktop application has been modified to allow consortium owners to add Vault clients to their consortium via the GUI.
The Vault client is a NodeJS server running on the local machine, responsible for maintaining a persistent connection with the COINSTAC system using the coinstac-vault-client package. The server communicates with the COINSTAC central server using websockets and HTTP protocols. It manages the life-cycle of containers (Docker, Singularity) through the coinstac-container-manager package, which is responsible for isolating and executing the computations within the federated analyses. The Vault client also utilizes other core COINSTAC libraries such as coinstac-client-core, coinstac-client-server, coinstac-pipeline, and coinstac-common, all of which are npm packages, to ensure seamless integration with the COINSTAC ecosystem. An overview is shown in
Architecture of vaults in COINSTAC.
Message passing, which is an integral part of federated analyses, is handled by the Vault client using MQTT (MQ Telemetry Transport) and HTTP protocols. MQTT is a lightweight messaging protocol optimized for high-latency or unreliable networks.
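To illustrate the publish/subscribe pattern that MQTT provides, the sketch below uses an in-memory stand-in for a broker. It is not an MQTT client and does not reflect how the Vault client is implemented; the topic names and payloads are hypothetical. It only shows the idea that a node subscribes to a topic, reacts to a message, and publishes its result on another topic.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a message broker; illustrates publish/subscribe only."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        # Deliver the payload to every subscriber of this exact topic.
        for callback in self._subscribers[topic]:
            callback(payload)
        return len(self._subscribers[topic])


broker = InMemoryBroker()
received = []   # messages seen by the vault
results = []    # messages seen by the coordinating side


def on_run_request(payload):
    """Vault-side handler: record the request and reply with a local result."""
    received.append(payload)
    broker.publish("run/42/results", {"vault": "example-vault", "status": "done"})


broker.subscribe("run/42/start", on_run_request)
broker.subscribe("run/42/results", results.append)
delivered = broker.publish("run/42/start", {"computation": "ridge-regression"})
```

Because publishers and subscribers only share topic names, neither side needs to know when the other came online, which suits a client that is meant to be persistently available.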
For pipeline runs in consortia that only use Vaults, the result data is uploaded to a secure Amazon S3 bucket, which can then be downloaded by consortium members using the desktop application. This ensures that the results are securely stored and easily accessible by authorized users.
In summary, the Vault architecture in COINSTAC improves the overall efficiency and user experience of performing federated analyses. By maintaining a persistent connection, the Vault client ensures that datasets are readily available for analysis without the need for manual intervention by data owners. Additionally, the integration of the Vault client within the COINSTAC ecosystem allows for seamless interaction between the desktop application and the Vaults, making it simple for consortium owners to include Vault data in their federated analyses.
In this section, we present various use-cases that highlight the benefits and versatility of Vaults in COINSTAC.
TReNDS actively curates and hosts public datasets, making them readily available for the COINSTAC community through the creation of Vaults. These curated Vaults ensure that the public datasets are vetted, of high quality, and easily accessible. Users can contribute to this initiative by hosting Vaults for other public datasets, further expanding the range of data resources available within COINSTAC.
A researcher with a local dataset can benefit from incorporating Vault datasets containing relevant variables into their analysis. Integrating multiple datasets is especially advantageous when the researcher's local data is inadequate for conducting a comprehensive analysis. Collaborating with other COINSTAC consortium members and leveraging data from Vaults enables researchers to enhance the sample size and statistical power of their study efficiently while preserving privacy and streamlining the process by eliminating manual collaboration steps.
For investigators who do not have their own data but want to analyze existing datasets, Vaults provide a valuable solution. The investigator can create a consortium, add selected Vaults using the COINSTAC UI, and initiate the analysis. This approach enables the investigator to obtain meaningful insights from existing datasets without needing to coordinate with the Vault data owners.
Vaults are also advantageous for researchers with limited storage or computing resources. For example, a researcher with a low-powered laptop and minimal storage capacity can still analyze large datasets by creating a consortium and running an analysis using only Vault clients. The data processing occurs on the respective Vault servers, and the results are sent back to the investigator, eliminating the need for high-capacity local hardware.
By addressing these diverse use-cases, COINSTAC Vaults offer a flexible and efficient solution for researchers to access, collaborate, and analyze datasets in a federated environment.
In this section, we conduct a series of analyses using multiple Vaults hosted by TReNDS, emphasizing the practical application and utility of the Vaults feature. We specifically focus on the TReNDS VBM COBRE, TReNDS FreeSurfer COBRE, Child Mind Institute (CMI) VBM, and TReNDS NeuroMark Group-ICA COBRE datasets. These datasets were chosen to be hosted in Vaults based on their relevance to the neuroimaging research community and their potential to demonstrate the diverse capabilities of COINSTAC Vaults. The hosting decisions were made in coordination with the respective data owners.
Our analyses highlight how the inclusion of Vault data can significantly increase sample size, thereby enhancing the statistical power of results. The diversity of datasets also underscores the flexibility and adaptability of COINSTAC Vaults, demonstrating how they can accommodate a wide range of research contexts and data types.
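One way to see how adding a site increases the effective sample size without moving raw data is regression from site-level summaries. The sketch below is a generic textbook illustration and is not necessarily the algorithm used by the COINSTAC regression computations: each site shares only its X^T X matrix, X^T y vector, and row count, and an aggregator solves the pooled (optionally ridge-penalized) normal equations. The data are synthetic, the function names are hypothetical, and for simplicity the ridge penalty is applied to every coefficient, including the intercept.

```python
def gram(X, y):
    """Site-level summaries: X^T X, X^T y, and the number of rows."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return xtx, xty, len(X)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def aggregate(site_summaries, ridge=0.0):
    """Sum site summaries and solve (sum X^T X + ridge * I) beta = sum X^T y."""
    p = len(site_summaries[0][1])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    n_total = 0
    for s_xtx, s_xty, n in site_summaries:
        for i in range(p):
            xty[i] += s_xty[i]
            for j in range(p):
                xtx[i][j] += s_xtx[i][j]
        n_total += n
    for i in range(p):
        xtx[i][i] += ridge
    return solve(xtx, xty), n_total


# Synthetic example: y = 1 + 2 * x exactly; columns are [intercept, x].
site_a = ([[1, 0.0], [1, 1.0], [1, 2.0]], [1.0, 3.0, 5.0])
site_b = ([[1, 3.0], [1, 4.0]], [7.0, 9.0])
beta, n_total = aggregate([gram(*site_a), gram(*site_b)])
```

With `ridge=0.0` the pooled solution equals the one a centralized least-squares fit on all five rows would give, even though no site reveals individual records.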
The TReNDS VBM COBRE Vault contains structural MRI images from 152 participants, approximately half healthy volunteers and half individuals diagnosed with schizophrenia, collected as part of the Mind Research Network COBRE study (Aine et al.,
The following section describes this use-case with structural MRI scans from 55 participants collected under the MCIC project (Gollub,
Using the MCIC dataset, we similarly see widespread reduction in brain volume for age, visual and gray/white boundary reductions in volume in females, and insular-temporal and medial frontal (as well as more widespread) reductions in schizophrenia patients.
The TReNDS VBM COBRE Vault was combined with the MCIC dataset, allowing for an increased sample size, in the same regression analysis to examine diagnostic effects while accounting for age and sex. The combined dataset was largely consistent with the individual site analysis, with the exception of the male/female effect which shows a more complex pattern of increases and decreases, though still largely conforming to reductions in white/gray matter boundary and primary visual area volumes (Gupta et al.,
This Vault contains data from 152 subjects, approximately half controls and half individuals with chronic schizophrenia, collected as part of the Mind Research Network COBRE study.
We ran ridge regression on the above Vault data using FreeSurfer volumetric and surface-based measurements for about 500 regions of interest. We observed the following differences between controls and patients.
Controls have higher values in temporal lobe, as shown in the thickness measurements of tables (
Global freesurfer stats for
Coefficient | 2.5552 | -0.0052 | 0.0323 | 0.1071 |
t stat | 44.8963 | -5.014 | 1.0642 | 4.1085 |
p value | 0 | 0 | 0.289 | 1.00E-04 |
R squared | 0.237444911 |
Degrees of freedom | 145 |
Global freesurfer stats for
Coefficient | 2.5331 | −0.0036 | 0.0127 | 0.1158 |
t stat | 38.8572 | −3.0471 | 0.3663 | 3.878 |
p value | 0 | 0.0027 | 0.7147 | 2.00E-04 |
R squared | 0.149884354 |
Degrees of freedom | 145 |
Global freesurfer stats for
Coefficient | 3.0038 | −0.0057 | −0.0134 | 0.0829 |
t stat | 60.2257 | −6.3161 | −0.5056 | 3.63 |
p value | 0 | 0 | 0.6139 | 4.00E-04 |
R squared | 0.275552216 |
Degrees of freedom | 145 |
Global freesurfer stats for
Coefficient | 2.9341 | −0.0067 | 0.0169 | 0.0682 |
t stat | 52.568 | −6.6199 | 0.5679 | 2.6659 |
p value | 0 | 0 | 0.571 | 0.0085 |
R squared | 0.266849477 |
Degrees of freedom | 145 |
Global freesurfer stats for
Coefficient | 428.1455 | 4.8982 | −144.8783 | −164.4922 |
t stat | 5.1293 | 3.2264 | −3.2585 | −4.3021 |
p value | 0 | 0.0015 | 0.0014 | 0 |
R squared | 0.22384763 |
Degrees of freedom | 145 |
This Vault contains data from 922 children and adolescents (ages 6–22, 603 male and 319 female), collected as part of the Healthy Brain Network study (Alexander et al.,
Group ICA (Calhoun et al.,
The Neuromark fMRI domains identified in Du et al. Briefly, these seven identified network templates were divided based on anatomical and functional properties (Du et al.,
The Neuromark fMRI 1.0 template with 53 intrinsic networks (components) from 7 major networks.
In recent decades, data sharing has driven substantial advancements in the field of neuroimaging and expanded opportunities for open science collaboration. Although data sharing has undeniable merits, it also faces inherent limitations, including technological, policy, administrative, and methodological barriers that can hinder progress. COINSTAC Vaults and the federated computing framework within COINSTAC uniquely address these challenges by enabling data analysis while maintaining privacy protection, specifically in the context of neuroimaging research. The “always-on” status of Vaults streamlines collaboration between institutions by eliminating the need for synchronized efforts across users. The accessibility and user-friendly interface of COINSTAC Vaults serve as powerful tools for reproducible research, an area that has faced significant criticism in recent years. By bolstering the collaborative capabilities of federated learning and addressing the limitations of traditional data sharing, COINSTAC Vaults provide a cutting-edge solution for the neuroimaging community, pushing the boundaries of data analysis and open science.
COINSTAC offers a user-friendly GUI for the neuroimaging field, enabling federated learning on neuroimaging data with ease. Its extensive library includes numerous algorithms and pipelines, facilitating efficient processing of large datasets. Currently, over twenty computations are available in open-source repositories, allowing users to create versatile analytic pipelines. The integration of Vaults further enhances the user experience by providing access to diverse datasets, enabling efficient analysis with robust data, and fostering collaboration across institutions asynchronously.
Compared to OpenNeuro,
In addition to being faster to use, because they are immediately available with no downloading or manual coordination, curated Vaults that follow documented standards make studies easier to design, execute, and reproduce. For example, neuroimaging datasets can contain a large number of variables that apply to each subject, such as demographic information and cognitive measures. The number of these variables can range from tens to hundreds. Standard naming conventions make it easier for researchers to understand what each variable tracks so that they can select the relevant variables for their study. Standard and predictable ways of handling missing data in Vaults likewise make it easier for researchers to design their analyses.
COINSTAC is unique in its commitment to open science, with its open-source platform promoting seamless integration of modular computations and streamlining federated analyses. The addition of COINSTAC Vaults reinforces this commitment by simplifying dataset inclusion in federated analyses, encouraging community contributions, and preserving privacy for private datasets. By offering easy access to public datasets and enabling secure contributions from private dataset owners, COINSTAC Vaults foster collaboration and dedication to open science.
COINSTAC Vaults offer numerous benefits, but there are also limitations and challenges to consider, particularly in the areas of data privacy and security, and resource usage.
One concern is that allowing arbitrary summary queries on a dataset might enable an attacker to reconstruct the data. To mitigate such risks, the system must be privacy-preserving from “end-to-end,” incorporating techniques like secure multiparty computation or differential privacy. Implementing these methods can be difficult due to floating point implementation issues (Mironov,
While differentially private algorithms can provide stronger privacy guarantees, sharing data derivatives without differential privacy might be adequate in some situations, depending on the trust model and privacy concerns of data holders. These issues should be addressed on a case-by-case basis.
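As a concrete picture of what a differentially private release looks like, the sketch below applies the textbook Laplace mechanism to a bounded mean. It is an illustration under stated assumptions, not a feature of COINSTAC: the function names and example values are hypothetical, the number of records is assumed to be public, and the noise is sampled with ordinary floating-point arithmetic, which is exactly the kind of implementation that the floating-point concerns mentioned above apply to. It should not be used as-is to protect real data.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of values clipped to [lower, upper] with epsilon-DP.

    Assumes the number of values is public. Replacing one record moves the
    clipped mean by at most (upper - lower) / n, which sets the noise scale.
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)          # fixed seed only to make the sketch repeatable
ages = [23, 35, 41, 29, 52]     # made-up values
released = private_mean(ages, lower=18, upper=65, epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` add more noise and give stronger privacy; larger sites need less noise for the same guarantee because the sensitivity shrinks with the number of records.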
Vault owners can currently restrict computations on their data to a pre-approved list. To enhance privacy protection, further improvements are recommended. Potential solutions include allowing Vault owners to:
Approve or deny individual analysis runs.
Specify users and consortia that are allowed to run analyses.
Limit the overall number of computation runs for a vault.
Set expiration dates for specific approval permissions.
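The four proposed controls could be combined into a single authorization check. The sketch below is hypothetical throughout: the class `VaultPolicy`, its fields, and the reason strings are illustrative names for a possible design and do not describe existing COINSTAC functionality.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class VaultPolicy:
    """Hypothetical policy covering the four proposed controls."""
    allowed_users: set
    max_runs: int
    expires_on: Optional[date] = None
    requires_manual_approval: bool = False
    runs_used: int = 0
    approved_run_ids: set = field(default_factory=set)

    def authorize(self, user, run_id, today):
        """Return (allowed, reason) and count the run if it is allowed."""
        if user not in self.allowed_users:
            return False, "user not allowed"
        if self.expires_on is not None and today > self.expires_on:
            return False, "permission expired"
        if self.runs_used >= self.max_runs:
            return False, "run limit reached"
        if self.requires_manual_approval and run_id not in self.approved_run_ids:
            return False, "awaiting owner approval"
        self.runs_used += 1
        return True, "ok"


policy = VaultPolicy(allowed_users={"alice"}, max_runs=1, expires_on=date(2030, 1, 1))
first = policy.authorize("alice", "run-1", today=date(2029, 6, 1))
second = policy.authorize("alice", "run-2", today=date(2029, 6, 1))
outsider = policy.authorize("mallory", "run-3", today=date(2029, 6, 1))
```

In this example the first run is allowed, the second is refused because the run limit is exhausted, and a user outside the allow-list is refused regardless of the other settings.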
Another challenge is handling slowdowns or crashes during resource-intensive analyses due to high compute usage. To address this issue, Vault owners can be given more control over resource usage and compute capacity. They could limit the number of concurrent computations and overall CPU usage. Improving compute capacity could involve strategies like deploying multiple instances behind a load balancer or dynamically scaling resources.
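Limiting the number of concurrent computations is commonly done with a counting semaphore. The sketch below shows that general technique using Python's standard library; the class name `ComputeLimiter` and the stand-in computation are hypothetical, and this is not how the Vault client currently schedules work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ComputeLimiter:
    """Caps how many computations run at once; names are hypothetical."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, for inspection

    def run(self, computation, *args):
        with self._slots:  # blocks while all slots are in use
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return computation(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_computation(x):
    time.sleep(0.01)  # stand-in for real work
    return x * x


limiter = ComputeLimiter(max_concurrent=2)
with ThreadPoolExecutor(max_workers=8) as pool:
    outputs = list(pool.map(lambda x: limiter.run(fake_computation, x), range(8)))
```

Eight requests arrive at once, but at most two run at any moment; the rest wait for a free slot rather than overloading the host.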
Additional challenges include data distribution, network bandwidth, and communication speed. Federated learning and open-source solutions can help address some of these problems, but further research and development are needed to optimize COINSTAC Vaults' performance in various research settings. Our “Decentralized Sparse Deep Artificial Neural Networks in COINSTAC (CPU and GPU enabled)” algorithm allows users to save network bandwidth when transferring thousands of derived data/machine learning parameters across nodes.
In summary, COINSTAC Vaults mark a significant advancement in federated neuroimaging research, data privacy preservation, and open science promotion. By tackling the existing limitations and challenges, COINSTAC Vaults can further improve collaboration and innovation within the field.
The neuroimaging field is experiencing rapid growth, generating substantial data volumes. However, access to this data is challenged by technological, privacy, administrative, and methodological constraints. In this study, we present COINSTAC Vaults as a solution that streamlines data access and analysis, specifically in the context of neuroimaging research. COINSTAC Vaults ensure continuous availability of high-quality data, promoting the advancement of open science and fostering efficient collaboration between researchers.
We invite researchers to use COINSTAC Vaults in their studies and to host their own datasets using COINSTAC Vaults. By adopting COINSTAC Vaults, the neuroimaging community can overcome the barriers associated with traditional data sharing and analysis methods, paving the way for groundbreaking discoveries.
The long-term vision for COINSTAC and COINSTAC Vaults includes:
Introducing new user interface features, such as the ability to search Vaults and filter by covariates, to improve user experience and efficiency.
Making new datasets available as Vaults, including those from OpenNeuro, the Autism Brain Imaging Data Exchange (ABIDE), the National Institute of Mental Health Data Archive (NDA), the Open Access Series of Imaging Studies (OASIS), and the Image and Data Archive (IDA), to enhance the diversity of Vaults.
Increasing BIDS (Brain Imaging Data Structure) support to cover all major neuroimaging modalities and Vault datasets, to ensure interoperability and ease of use.
Increasing compliance with programs such as the FAIR (Findability, Accessibility, Interoperability, and Reuse) Guiding Principles for scientific data management and stewardship, to enhance the overall data sharing ecosystem.
Exploring the integration of differential privacy techniques to further safeguard data privacy, while preserving the utility of data analysis.
Publicly available datasets were analyzed in this study. This data can be found here:
DM, RK, and VC: conceptualization. DM, SB, SPa, and PP: methodology. DM, SB, SPa, KR-M, PP, BB, and JR: writing—original draft preparation. SB and SPa: data analysis. SPl and VC: supervision. All authors: writing—review and editing, read, and agreed to the published version of the manuscript.
This work was funded by the National Institutes of Health (Grants: R01DA040487, R01DA049238, and R01MH121246).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.