
About MCA

The Microbial Clinical Atlas (MCA) is a curated knowledge base of Taxon Passports — structured records that summarise the clinically relevant biology, ecology, and evidence-linked associations of human-associated microorganisms. MCA is designed to make microbiome findings reproducible and comparable across studies by enforcing controlled vocabularies, stable identifiers, and explicit evidence grading.

24 Papers Curated
64 Taxon Passports
139 Clinical Associations
380 Ontology References

MCA Features

Pathobiont Status

Each Taxon Passport records whether the organism is considered a pathobiont — a commensal capable of causing disease under conditions such as immunosuppression, antibiotic disruption, or barrier dysfunction.

Yes: Recognised pathobiont
CD: Context dependent (pathobiont status depends on host factors or clinical setting)
No: Not considered a pathobiont
UK: Unknown (insufficient evidence to classify)
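
The four status codes form a small controlled vocabulary. A minimal sketch of how they could be represented follows; the `PathobiontStatus` and `parse_status` names are illustrative assumptions, not part of MCA's published schema.

```python
from enum import Enum

class PathobiontStatus(Enum):
    """Status codes as listed above; the member names are illustrative."""
    YES = "Yes"
    CONTEXT_DEPENDENT = "CD"
    NO = "No"
    UNKNOWN = "UK"

def parse_status(code):
    """Convert a status code to an enum member; unknown codes raise ValueError."""
    return PathobiontStatus(code)

print(parse_status("CD").name)  # CONTEXT_DEPENDENT
```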

Evidence Grading

Every clinical association is graded by study design. Each grade reflects the strongest design reported for that finding.

E3: Strong (systematic review, meta-analysis, or multiple independent human cohorts)
E2: Moderate (single human cohort, RCT, case-control study, or cross-sectional study)
E1: Limited (animal model, in vitro study, case report, or mechanistic work only)
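
The grading rule can be sketched as a lookup from study design to grade, keeping the strongest grade when a finding is backed by several designs. This is a minimal sketch: the design labels, the `DESIGN_GRADES` table, and the `grade_finding` function are illustrative names, not MCA's actual implementation.

```python
# Hypothetical design labels mapped to the grades defined above.
DESIGN_GRADES = {
    "systematic_review": "E3",
    "meta_analysis": "E3",
    "multiple_human_cohorts": "E3",
    "single_human_cohort": "E2",
    "rct": "E2",
    "case_control": "E2",
    "cross_sectional": "E2",
    "animal_model": "E1",
    "in_vitro": "E1",
    "case_report": "E1",
    "mechanistic": "E1",
}

def grade_finding(designs):
    """Return the strongest grade among the study designs reported for a finding."""
    grades = [DESIGN_GRADES[d] for d in designs]
    if not grades:
        raise ValueError("at least one study design is required")
    # "E1" < "E2" < "E3" already sorts correctly as plain strings.
    return max(grades)

print(grade_finding(["animal_model", "case_control"]))  # E2
```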

Cross-Database Linkage

Every passport field is anchored to a standard ontology or external database, making MCA entries interoperable with other bioinformatics resources.

NCBI: Taxon lineage, rank, preferred name, and stable TaxID
MeSH: Disease terms and anatomy on associations and body-site fields
KEGG: Disease, Drug, and Compound IDs on clinical conditions, bloom triggers, and metabolites
ARO: Antimicrobial resistance gene ontology on AMR highlights
VFDB: Virulence factor database on virulence annotations
ChEBI: Chemical entity identifiers on metabolites
BacDive: Morphology, physiology, and ecological niche reference data
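
Because each field carries an external identifier, a curator-side format check can catch mistyped IDs before review. The sketch below assumes the commonly used public identifier shapes (for example `H` plus five digits for KEGG Disease, `ARO:` plus seven digits for ARO). The patterns and the `is_valid_id` helper are assumptions for illustration and should be verified against each database's own documentation.

```python
import re

# Assumed identifier shapes; check each source's documentation before relying on them.
ID_PATTERNS = {
    "ncbi_taxid": r"[1-9]\d*",
    "mesh": r"D\d{6}(\d{3})?",
    "kegg_disease": r"H\d{5}",
    "kegg_drug": r"D\d{5}",
    "kegg_compound": r"C\d{5}",
    "aro": r"ARO:\d{7}",
    "chebi": r"CHEBI:\d+",
}

def is_valid_id(source, value):
    """Return True if value matches the assumed format for the given source."""
    return re.fullmatch(ID_PATTERNS[source], value) is not None

print(is_valid_id("kegg_compound", "C00246"))  # True
print(is_valid_id("kegg_disease", "C00246"))   # False
```

A check like this validates shape only; it does not confirm that the identifier exists in the source database.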

Curation Process

Each Taxon Passport is assembled by a two-skill AI-assisted curation pipeline. An expert curator reviews every staging file before any changes are committed to the database.

1. Paper Analysis

A PDF named after its PubMed ID (PMID) is submitted to the Paper Curator skill. An analyst agent reads the full paper, extracts metadata (title, authors, journal, year, study design, population, sample size), and identifies all microbial taxa mentioned.

2. Database Fetch & Entity Extraction

For each taxon, two agents run in parallel. A DB Fetch agent queries NCBI Taxonomy and BacDive (by TaxID) for biology and ecology fields. An Entity Extractor agent reads the paper for the clinical layer: pathobiont status, bloom triggers, AMR highlights, metabolites, and individual clinical associations with evidence type.

NCBI Taxonomy, BacDive, clinical_roles, bloom_triggers, amr_highlights, metabolites
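
The parallel fetch-and-extract step can be sketched with a thread pool. Both agent functions below are stand-ins that return empty structures; their names, signatures, and return shapes are assumptions, and the real agents are not shown in this document.

```python
from concurrent.futures import ThreadPoolExecutor

def db_fetch(taxid):
    """Stand-in for the DB Fetch agent (NCBI Taxonomy and BacDive lookups)."""
    return {"ncbi_taxid": taxid, "biology": {}, "ecology": {}}

def extract_entities(paper_text, taxon_name):
    """Stand-in for the Entity Extractor agent (clinical layer from the paper)."""
    return {"taxon": taxon_name, "clinical_roles": [], "bloom_triggers": [],
            "amr_highlights": [], "metabolites": []}

def process_taxon(taxid, taxon_name, paper_text):
    """Run both agents concurrently and merge their outputs into one record."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fetched = pool.submit(db_fetch, taxid)
        extracted = pool.submit(extract_entities, paper_text, taxon_name)
        return {**fetched.result(), **extracted.result()}

record = process_taxon(562, "Escherichia coli", "full paper text")
print(sorted(record))
```

The two agents touch disjoint fields, so merging their dictionaries is safe in this sketch.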

3. Routing, Grading & Ontology Enrichment

A Routing agent checks whether the taxon already exists in the XML database (CREATE vs UPDATE). A Grading agent assigns an evidence grade (E1 / E2 / E3) for the paper based purely on study design. Three enrichment agents run in parallel: MeSH (NLM E-utilities), KEGG (local flat-file mirror), and ARO (CARD ontology).

E1 / E2 / E3, MeSH IDs, KEGG Disease, KEGG Drug, KEGG Compound, ARO (AMR)
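
The CREATE vs UPDATE decision amounts to a lookup in the XML database. The sketch below assumes a hypothetical layout in which each `passport` element carries an `ncbi_taxid` attribute; the real MCA schema may differ, and `route` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real MCA schema may differ.
SAMPLE_DB = """
<passports>
  <passport ncbi_taxid="562"><preferred_name>Escherichia coli</preferred_name></passport>
</passports>
""".strip()

def route(root, taxid):
    """Return "UPDATE" if a passport with this TaxID exists, else "CREATE"."""
    for passport in root.iter("passport"):
        if passport.get("ncbi_taxid") == str(taxid):
            return "UPDATE"
    return "CREATE"

root = ET.fromstring(SAMPLE_DB)
print(route(root, 562))   # UPDATE
print(route(root, 1351))  # CREATE
```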

4. Staging File & Expert Review

A structured JSON staging file is written per taxon (staging/PMID_YYYY-MM-DD_taxon-name.json). An expert curator reviews every field — proposed additions, evidence rationale, and ontology IDs — before approving.
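
The staging file name can be built from the PMID, the date, and the taxon name. The sketch below assumes the `taxon-name` part is a lowercase, hyphen-separated slug, which the pattern suggests but does not state; `staging_path` and the PMID shown are placeholders.

```python
import re
from datetime import date
from pathlib import PurePosixPath

def staging_path(pmid, taxon_name, day):
    """Build staging/PMID_YYYY-MM-DD_taxon-name.json for one taxon."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", taxon_name.lower()).strip("-")
    return PurePosixPath("staging") / f"{pmid}_{day.isoformat()}_{slug}.json"

# "12345678" is a placeholder PMID, not a curated paper.
print(staging_path("12345678", "Escherichia coli", date(2026, 4, 2)))
# staging/12345678_2026-04-02_escherichia-coli.json
```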

5. XML Update & SQL Export

Approved staging files are applied to the versioned XML database by the XML Update skill. A content hash is computed per association for deduplication. Applied files are archived; a full SQL dump is generated automatically from the updated XML.

CREATE / UPDATE, content_hash, versioned XML, SQL dump
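
Hash-based deduplication of associations can be sketched as follows. The document does not say which fields or algorithm feed the content hash, so the choice of SHA-256 over the TaxID plus whitespace- and case-normalised association text is an assumption, as are the function names.

```python
import hashlib

def content_hash(taxid, association_text):
    """Hash a normalised association so case or spacing changes do not create duplicates."""
    normalised = " ".join(association_text.lower().split())
    payload = f"{taxid}|{normalised}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def add_association(store, taxid, association_text, pmid):
    """Insert a new association, or attach the PMID to an existing one.

    Returns True when a new association was created, False on a duplicate.
    """
    key = content_hash(taxid, association_text)
    if key in store:
        store[key]["pmids"].add(pmid)
        return False
    store[key] = {"taxid": taxid, "text": association_text, "pmids": {pmid}}
    return True

store = {}
print(add_association(store, 562, "Associated with  condition X", "pmid-1"))  # True
print(add_association(store, 562, "associated with condition x", "pmid-2"))   # False
```

In this sketch a duplicate still contributes its PMID, so supporting references accumulate on the single stored association.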

Data Model

Each Taxon Passport is the central record, linked to satellite tables via a surrogate integer primary key. The canonical export format is versioned XML; the MySQL schema is derived from it via xml2sql.py.

Layer | Fields | Source
Identity | passport_id, preferred_name, taxon_rank, domain, lineage, ncbi_taxid, synonyms | NCBI Taxonomy
Biology | gram_status, oxygen_tolerance, morphology, key_traits, bacdive_url | BacDive
Ecology | primary_niches (+ MeSH anatomy ID), reservoirs, transmission_routes | BacDive / literature
Clinical Profile | is_pathobiont, clinical_roles, typical_specimens, bloom_triggers, risk_contexts, amr_highlights (+ ARO ID) | Curated literature
Metabolites | metabolite_name, relationship (produces / consumes / modifies), KEGG Compound ID, ChEBI ID | Curated literature; KEGG LIGAND
Clinical Associations | association_text, evidence_level (E1–E3), evidence_type, content_hash, assoc_refs (MeSH + KEGG Disease ID), PMIDs | Curated literature; NLM MeSH; KEGG MEDICUS
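
The central-record-plus-satellites layout can be pictured with a few dataclasses. This is a partial sketch using a subset of the field names from the table; the class names, types, defaults, and the `kegg_compound_id` and `chebi_id` attribute names are assumptions rather than the actual XML or MySQL schema.

```python
from dataclasses import dataclass, field

@dataclass
class ClinicalAssociation:
    association_text: str
    evidence_level: str          # "E1", "E2", or "E3"
    evidence_type: str
    content_hash: str
    pmids: list = field(default_factory=list)

@dataclass
class Metabolite:
    metabolite_name: str
    relationship: str            # "produces", "consumes", or "modifies"
    kegg_compound_id: str = ""
    chebi_id: str = ""

@dataclass
class TaxonPassport:
    passport_id: int             # surrogate integer primary key
    preferred_name: str
    taxon_rank: str
    ncbi_taxid: int
    is_pathobiont: str = "UK"    # "Yes", "CD", "No", or "UK"
    metabolites: list = field(default_factory=list)
    associations: list = field(default_factory=list)

passport = TaxonPassport(1, "Escherichia coli", "species", 562)
passport.metabolites.append(Metabolite("example metabolite", "produces"))
print(passport.preferred_name, len(passport.metabolites))
```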

Roadmap

Planned extensions to MCA, either in active development or under consideration.

Feature | Description | Status
Pathway-level search | Search passports by KEGG Pathway name (e.g. "butanoate metabolism"). Pathway annotations are derived computationally from existing KEGG Compound links, not manually curated, and are used for search indexing only. | In development
Gene-level annotation (KEGG Orthology) | Link taxa to KEGG Orthology (KO) identifiers representing key functional genes (e.g. bile salt hydrolase bsh, butyrate kinase buk). This will enable searches such as "which taxa carry the bile salt hydrolase gene?" and connect MCA directly to nucleotide-sequence-based functional profiling (16S rRNA, shotgun metagenomics, metatranscriptomics). Requires integration of strain-level genomic data and KO assignment pipelines. | Planned
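
Deriving a pathway search index from compound links can be sketched as an inverted index. The mapping below is a one-entry illustration (KEGG Compound C00246, butanoic acid, under butanoate metabolism); the passport names are placeholders, and `build_pathway_index` is a hypothetical helper, not the feature under development.

```python
from collections import defaultdict

# Illustrative compound-to-pathway links; a real run would read these from KEGG.
COMPOUND_PATHWAYS = {"C00246": ["butanoate metabolism"]}

# Placeholder passports with their curated KEGG Compound IDs.
PASSPORT_COMPOUNDS = {"passport-A": ["C00246"], "passport-B": []}

def build_pathway_index(passport_compounds, compound_pathways):
    """Map each lowercase pathway name to the set of passports linked to it."""
    index = defaultdict(set)
    for passport, compounds in passport_compounds.items():
        for compound in compounds:
            for pathway in compound_pathways.get(compound, []):
                index[pathway.lower()].add(passport)
    return index

index = build_pathway_index(PASSPORT_COMPOUNDS, COMPOUND_PATHWAYS)
print(sorted(index["butanoate metabolism"]))  # ['passport-A']
```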

Acknowledgements

MCA integrates data from the following publicly available resources. We gratefully acknowledge the teams that build and maintain them.

Resource | Used For | Reference
NCBI Taxonomy | Taxon lineage, rank, preferred name, synonyms, TaxID | National Center for Biotechnology Information, U.S. National Library of Medicine
NLM MeSH | MeSH term annotations on clinical associations; anatomy IDs on body-site fields | Medical Subject Headings, U.S. National Library of Medicine
BacDive | Gram status, oxygen tolerance, morphology, key traits, primary niches | BacDive, the Bacterial Diversity Metadatabase, DSMZ (Leibniz Institute)
KEGG | KEGG Disease IDs on clinical associations; KEGG Drug IDs on bloom triggers; KEGG Compound IDs on metabolites | Kyoto Encyclopedia of Genes and Genomes, Kanehisa Laboratories
CARD / ARO | Antibiotic Resistance Ontology identifiers on AMR highlights | Comprehensive Antibiotic Resistance Database, McMaster University

Curation Data Mirrors

For curation speed and reproducibility, the pipeline maintains local snapshots of the reference databases listed below. These mirrors are updated periodically and are not served publicly.

Database | Used by | Source | Snapshot date
NCBI Taxonomy | TaxID lookup, name resolution | ncbi.nlm.nih.gov/taxonomy | 2026-04-02
BacDive | Gram status, oxygen tolerance, morphology, isolation sources | bacdive.dsmz.de | 2026-04-02
KEGG | Disease, drug, and compound ID enrichment | kegg.jp | 2025-10-26
CARD / ARO | AMR resistance ontology IDs | card.mcmaster.ca | 2026-04-02
ChEBI | Metabolite ID enrichment | ebi.ac.uk/chebi | 2026-04-02
VFDB | Virulence factor annotations | mgc.ac.cn/VFs | 2026-03-27
DO | Disease ontology cross-referencing | disease-ontology.org | 2026-04-02
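
Since the mirrors are refreshed on different dates, a small check can flag snapshots that have aged past a threshold. The dates below are copied from the table above; the 90-day threshold and the `stale_mirrors` helper are assumptions for illustration, not a stated MCA policy.

```python
from datetime import date

# Snapshot dates copied from the mirror table.
SNAPSHOTS = {
    "NCBI Taxonomy": date(2026, 4, 2),
    "BacDive": date(2026, 4, 2),
    "KEGG": date(2025, 10, 26),
    "CARD / ARO": date(2026, 4, 2),
    "ChEBI": date(2026, 4, 2),
    "VFDB": date(2026, 3, 27),
    "DO": date(2026, 4, 2),
}

def stale_mirrors(snapshots, today, max_age_days):
    """Return the names of mirrors whose snapshot is older than max_age_days."""
    return sorted(
        name for name, taken in snapshots.items()
        if (today - taken).days > max_age_days
    )

# Assumed threshold of 90 days, evaluated on the most recent snapshot date.
print(stale_mirrors(SNAPSHOTS, date(2026, 4, 2), 90))  # ['KEGG']
```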

Download & Source

The complete database is published as a versioned XML file and updated with each curation cycle.

Curated Literature

Full list of all 24 papers curated into MCA, with authors, journal, year, and PubMed links.

View Papers

Releases

Versioned XML and SQL snapshots for each official release. Use these for reproducible imports and downstream pipelines.

View Releases

GitHub

Source code, schema definitions, curation scripts, and issue tracking for the MCA project.

View on GitHub

Contact

We would love to hear from you. If you would like to suggest specific papers for inclusion in MCA, or if you have spotted an error in any of the records, please reach out to us at bioinformatics@ucalgary.ca — we appreciate every contribution and will do our best to respond promptly.