The Microbial Clinical Atlas (MCA) is a curated knowledge base of Taxon Passports — structured records that summarise the clinically relevant biology, ecology, and evidence-linked associations of human-associated microorganisms. MCA is designed to make microbiome findings reproducible and comparable across studies by enforcing controlled vocabularies, stable identifiers, and explicit evidence grading.
Each Taxon Passport records whether the organism is considered a pathobiont — a commensal capable of causing disease under conditions such as immunosuppression, antibiotic disruption, or barrier dysfunction.
Every clinical association is graded by study design. Each grade reflects the strongest design reported for that finding.
Every passport field is anchored to a standard ontology or external database, making MCA entries interoperable with other bioinformatics resources.
Each Taxon Passport is assembled by a two-skill AI-assisted curation pipeline. An expert curator reviews every staging file before any changes are committed to the database.
A PDF (filename = PMID) is submitted to the Paper Curator skill. An analyst agent reads the full paper, extracts metadata (title, authors, journal, year, study design, population, sample size), and identifies all microbial taxa mentioned.
Per taxon, two agents run in parallel: a DB Fetch agent queries NCBI Taxonomy and BacDive (by TaxID) for biology and ecology fields; an Entity Extractor agent reads the paper for the clinical layer — pathobiont status, bloom triggers, AMR highlights, metabolites, and individual clinical associations with evidence type.
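The per-taxon fan-out described above can be sketched with a two-worker thread pool. This is only an illustration: `db_fetch`, `entity_extract`, and `process_taxon` are hypothetical names, and the two worker functions are placeholders that return dummy dictionaries rather than calling NCBI Taxonomy, BacDive, or any language model.

```python
from concurrent.futures import ThreadPoolExecutor


def db_fetch(taxid: int) -> dict:
    """Placeholder for the DB Fetch agent (biology and ecology fields)."""
    # Assumption: a real implementation would look up NCBI Taxonomy and BacDive here.
    return {"ncbi_taxid": taxid, "layers": ["biology", "ecology"]}


def entity_extract(paper_text: str, taxon_name: str) -> dict:
    """Placeholder for the Entity Extractor agent (clinical layer)."""
    # Assumption: a simple substring check stands in for reading the paper.
    return {"taxon": taxon_name, "mentioned": taxon_name.lower() in paper_text.lower()}


def process_taxon(paper_text: str, taxon_name: str, taxid: int) -> dict:
    """Run both placeholder agents concurrently and merge their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        db_future = pool.submit(db_fetch, taxid)
        clinical_future = pool.submit(entity_extract, paper_text, taxon_name)
        return {"db": db_future.result(), "clinical": clinical_future.result()}
```

Threads suit this shape of work because both branches would mostly wait on I/O; whether the actual pipeline uses threads, processes, or separate agent sessions is not stated here.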
A Routing agent checks whether the taxon already exists in the XML database (CREATE vs UPDATE). A Grading agent assigns an evidence grade (E1 / E2 / E3) for the paper based purely on study design. Three enrichment agents run in parallel: MeSH (NLM E-utilities), KEGG (local flat-file mirror), and ARO (CARD ontology).
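As a rough sketch of the routing decision, the check below parses an XML document and returns CREATE or UPDATE depending on whether a passport with the given TaxID is present. The element names `mca` and `passport`, the sample document, and the choice to match on `ncbi_taxid` are assumptions made for illustration; the real database layout and matching key may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature database; the element layout is an assumption.
SAMPLE_DB = """
<mca>
  <passport>
    <ncbi_taxid>562</ncbi_taxid>
    <preferred_name>Escherichia coli</preferred_name>
  </passport>
</mca>
"""


def route(db_xml: str, ncbi_taxid: str) -> str:
    """Return "UPDATE" if a passport with this TaxID exists, else "CREATE"."""
    root = ET.fromstring(db_xml)
    for passport in root.iter("passport"):
        # findtext returns None when the child element is missing.
        existing = (passport.findtext("ncbi_taxid") or "").strip()
        if existing == ncbi_taxid:
            return "UPDATE"
    return "CREATE"
```

With the sample above, `route(SAMPLE_DB, "562")` takes the UPDATE branch and any TaxID not in the document takes the CREATE branch.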
A structured JSON staging file is written per taxon (staging/PMID_YYYY-MM-DD_taxon-name.json). An expert curator reviews every field — proposed additions, evidence rationale, and ontology IDs — before approving.
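The staging filename pattern can be illustrated with a small helper. The function name `staging_path`, the slug rule (lowercase, runs of non-alphanumeric characters collapsed to hyphens), and the reading of the date as the curation date are assumptions; only the `staging/PMID_YYYY-MM-DD_taxon-name.json` shape comes from the description above. The PMID in the usage note is a made-up placeholder.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def staging_path(pmid: str, taxon_name: str, on: date) -> PurePosixPath:
    """Build staging/PMID_YYYY-MM-DD_taxon-name.json for one taxon."""
    if not pmid.isdigit():
        raise ValueError(f"PMID must be numeric, got {pmid!r}")
    # Assumed slug rule: lowercase, non-alphanumeric runs become single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", taxon_name.lower()).strip("-")
    if not slug:
        raise ValueError("taxon name produced an empty slug")
    return PurePosixPath("staging") / f"{pmid}_{on.isoformat()}_{slug}.json"
```

For example, `staging_path("12345678", "Escherichia coli", date(2026, 4, 2))` yields `staging/12345678_2026-04-02_escherichia-coli.json`.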
Approved staging files are applied to the versioned XML database by the XML Update skill. A content hash is computed per association for deduplication. Applied files are archived; a full SQL dump is generated automatically from the updated XML.
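One way to picture hash-based deduplication is shown below. The hash function (SHA-256), the text normalisation, and the decision to include the passport identifier in the hashed payload are all assumptions; the description above only states that a content hash is computed per association. The names `content_hash` and `dedupe` are hypothetical.

```python
import hashlib
import unicodedata


def content_hash(passport_id: str, association_text: str) -> str:
    """Hash an association after normalising case, Unicode form, and whitespace."""
    normalized = unicodedata.normalize("NFKC", association_text).casefold()
    normalized = " ".join(normalized.split())
    # A unit separator keeps the two fields from running into each other.
    payload = f"{passport_id}\x1f{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def dedupe(existing_hashes: set, passport_id: str, texts: list) -> list:
    """Return only the texts whose hash is new; existing_hashes is updated in place."""
    fresh = []
    for text in texts:
        digest = content_hash(passport_id, text)
        if digest not in existing_hashes:
            existing_hashes.add(digest)
            fresh.append(text)
    return fresh
```

Normalising before hashing means that two staging files differing only in spacing or capitalisation would map to the same hash, so the second one would be skipped rather than stored twice.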
The Taxon Passport is the central record; satellite tables link to it via a surrogate integer primary key. The canonical export format is versioned XML, and the MySQL schema is derived from it via xml2sql.py.
| Layer | Fields | Source |
|---|---|---|
| Identity | passport_id, preferred_name, taxon_rank, domain, lineage, ncbi_taxid, synonyms | NCBI Taxonomy |
| Biology | gram_status, oxygen_tolerance, morphology, key_traits, bacdive_url | BacDive |
| Ecology | primary_niches (+ MeSH anatomy ID), reservoirs, transmission_routes | BacDive / literature |
| Clinical Profile | is_pathobiont, clinical_roles, typical_specimens, bloom_triggers, risk_contexts, amr_highlights (+ ARO ID) | Curated literature |
| Metabolites | metabolite_name, relationship (produces / consumes / modifies), KEGG Compound ID, ChEBI ID | Curated literature; KEGG LIGAND |
| Clinical Associations | association_text, evidence_level (E1–E3), evidence_type, content_hash, assoc_refs (MeSH + KEGG Disease ID), PMIDs | Curated literature; NLM MeSH; KEGG MEDICUS |
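The central-record-plus-satellite layout can be sketched as follows. This uses an in-memory SQLite database purely so the example runs anywhere; MCA itself targets MySQL, and the table names, column types, and constraints here are assumptions rather than the output of xml2sql.py. The passport identifier, association text, and hash value are placeholders.

```python
import sqlite3

# Assumed, simplified schema: one central table and one satellite table.
SCHEMA = """
CREATE TABLE taxon_passport (
    id             INTEGER PRIMARY KEY,      -- surrogate integer key
    passport_id    TEXT NOT NULL UNIQUE,
    preferred_name TEXT NOT NULL,
    ncbi_taxid     INTEGER NOT NULL
);
CREATE TABLE clinical_association (
    id               INTEGER PRIMARY KEY,
    passport_fk      INTEGER NOT NULL REFERENCES taxon_passport(id),
    association_text TEXT NOT NULL,
    evidence_level   TEXT NOT NULL CHECK (evidence_level IN ('E1', 'E2', 'E3')),
    content_hash     TEXT NOT NULL UNIQUE
);
"""


def build_demo() -> sqlite3.Connection:
    """Create the sketch schema and insert one placeholder passport and association."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(SCHEMA)
    cur = conn.execute(
        "INSERT INTO taxon_passport (passport_id, preferred_name, ncbi_taxid) VALUES (?, ?, ?)",
        ("EXAMPLE-0001", "Escherichia coli", 562),
    )
    conn.execute(
        "INSERT INTO clinical_association "
        "(passport_fk, association_text, evidence_level, content_hash) VALUES (?, ?, ?, ?)",
        (cur.lastrowid, "placeholder association text", "E2", "placeholder-hash"),
    )
    conn.commit()
    return conn
```

Joining `clinical_association.passport_fk` to `taxon_passport.id` then returns each association alongside its passport, which is the access pattern a surrogate key is meant to keep cheap.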
The following extensions to MCA are under active development or under consideration.
| Feature | Description | Status |
|---|---|---|
| Pathway-level search | Search passports by KEGG Pathway name (e.g. "butanoate metabolism"). Pathway annotations are derived computationally from existing KEGG Compound links — not manually curated — and are used for search indexing only. | In development |
| Gene-level annotation (KEGG Orthology) | Link taxa to KEGG Orthology (KO) identifiers representing key functional genes (e.g. bile salt hydrolase bsh, butyrate kinase buk). Will enable searches such as "which taxa carry the bile salt hydrolase gene?" and connect MCA directly to nucleotide-sequence-based functional profiling (16S rRNA, shotgun metagenomics, metatranscriptomics). Requires integration of strain-level genomic data and KO assignment pipelines. | Planned |
MCA integrates data from the following publicly available resources. We gratefully acknowledge the teams that build and maintain them.
| Resource | Used For | Reference |
|---|---|---|
| NCBI Taxonomy | Taxon lineage, rank, preferred name, synonyms, TaxID | National Center for Biotechnology Information, U.S. National Library of Medicine |
| NLM MeSH | MeSH term annotations on clinical associations; anatomy IDs on body-site fields | Medical Subject Headings, U.S. National Library of Medicine |
| BacDive | Gram status, oxygen tolerance, morphology, key traits, primary niches | BacDive — the Bacterial Diversity Metadatabase, DSMZ (Leibniz Institute) |
| KEGG | KEGG Disease IDs on clinical associations; KEGG Drug IDs on bloom triggers; KEGG Compound IDs on metabolites | Kyoto Encyclopedia of Genes and Genomes, Kanehisa Laboratories |
| CARD / ARO | Antibiotic Resistance Ontology identifiers on AMR highlights | Comprehensive Antibiotic Resistance Database, McMaster University |
For curation speed and reproducibility, the pipeline maintains local snapshots of all reference databases used during enrichment. These mirrors are updated periodically and are not served publicly.
| Database | Used by | Source | Snapshot date |
|---|---|---|---|
| NCBI Taxonomy | TaxID lookup, name resolution | ncbi.nlm.nih.gov/taxonomy | 2026-04-02 |
| BacDive | Gram status, oxygen tolerance, morphology, isolation sources | bacdive.dsmz.de | 2026-04-02 |
| KEGG | Disease, drug, and compound ID enrichment | kegg.jp | 2025-10-26 |
| CARD / ARO | AMR resistance ontology IDs | card.mcmaster.ca | 2026-04-02 |
| ChEBI | Metabolite ID enrichment | ebi.ac.uk/chebi | 2026-04-02 |
| VFDB | Virulence factor annotations | mgc.ac.cn/VFs | 2026-03-27 |
| DO | Disease ontology cross-referencing | disease-ontology.org | 2026-04-02 |
The complete database is published as a versioned XML file and updated with each curation cycle.
Full list of all 24 papers curated into MCA, with authors, journal, year, and PubMed links.
Versioned XML and SQL snapshots for each official release. Use these for reproducible imports and downstream pipelines.
Source code, schema definitions, curation scripts, and issue tracking for the MCA project.
We would love to hear from you. If you would like to suggest specific papers for inclusion in MCA, or if you have spotted an error in any of the records, please reach out to us at bioinformatics@ucalgary.ca — we appreciate every contribution and will do our best to respond promptly.