About MCA

The Microbial Clinical Atlas (MCA) is a curated knowledge base of Taxon Passports — structured records that summarise the clinically relevant biology, ecology, and evidence-linked associations of human-associated microorganisms. MCA is designed to make microbiome findings reproducible and comparable across studies by enforcing controlled vocabularies, stable identifiers, and explicit evidence grading.

Papers Curated

Taxon Passports

139

Clinical Associations

380

Ontology References

MCA Features

Pathobiont Status

Each Taxon Passport records whether the organism is considered a pathobiont — a commensal capable of causing disease under conditions such as immunosuppression, antibiotic disruption, or barrier dysfunction.

Yes: Recognised pathobiont

CD: Context dependent; pathobiont status depends on host factors or clinical setting

No: Not considered a pathobiont

UK: Unknown; insufficient evidence to classify
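
As a minimal sketch, the four status codes above could be modelled as an enumeration so that unexpected values are rejected early. The class and function names here are illustrative assumptions, not part of the MCA codebase.

```python
from enum import Enum


class PathobiontStatus(Enum):
    """The four codes listed above (class name is hypothetical)."""
    YES = "Yes"                 # recognised pathobiont
    CONTEXT_DEPENDENT = "CD"    # depends on host factors or clinical setting
    NO = "No"                   # not considered a pathobiont
    UNKNOWN = "UK"              # insufficient evidence to classify


def parse_status(code: str) -> PathobiontStatus:
    """Map a displayed code to its enum member; reject anything else."""
    try:
        return PathobiontStatus(code.strip())
    except ValueError:
        raise ValueError(f"unrecognised pathobiont status code: {code!r}") from None
```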

Evidence Grading

Every clinical association is graded by study design. Each grade reflects the strongest design reported for that finding.

E3 Strong — clinical guidelines, systematic review, pooled meta-analysis, or discovery + validation cohorts in one paper

E2 Moderate — single cohort (or multi-center observational without pooling), RCT, case-control, or cross-sectional

E1 Limited — animal model, in vitro, case report, or mechanistic work only
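
The grading rule can be illustrated with a small lookup. This is a sketch only: the study-design labels are hypothetical normalised strings, and the groupings simply mirror the three grades listed above.

```python
# Hypothetical normalised study-design labels, grouped as in the list above.
GRADE_BY_DESIGN = {
    "clinical_guideline": "E3",
    "systematic_review": "E3",
    "pooled_meta_analysis": "E3",
    "discovery_validation_cohorts": "E3",
    "single_cohort": "E2",
    "multi_center_observational": "E2",
    "rct": "E2",
    "case_control": "E2",
    "cross_sectional": "E2",
    "animal_model": "E1",
    "in_vitro": "E1",
    "case_report": "E1",
    "mechanistic": "E1",
}


def grade_association(designs):
    """Return the grade of the strongest design reported for a finding."""
    if not designs:
        raise ValueError("at least one study design is required")
    grades = [GRADE_BY_DESIGN[d] for d in designs]
    # Compare on the numeric part so that E3 > E2 > E1.
    return max(grades, key=lambda g: int(g[1:]))
```

For instance, a finding supported by both in vitro work and a case-control study would be graded on the case-control design.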

Cross-Database Linkage

Every passport field is anchored to a standard ontology or external database, making MCA entries interoperable with other bioinformatics resources.

NCBI: Taxon lineage, rank, preferred name, and stable TaxID

MeSH: Disease terms and anatomy on associations and body-site fields

KEGG: Disease, Drug, and Compound IDs on clinical conditions, bloom triggers, and metabolites

ARO: Antimicrobial resistance gene ontology on AMR highlights

VFDB: Virulence factor database on virulence annotations

ChEBI: Chemical entity identifiers on metabolites

BacDive: Morphology, physiology, and ecological niche reference data
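
Because each field carries an external identifier, a lightweight format check can catch mistyped IDs before review. The sketch below assumes the commonly published accession shapes for these resources. Whether MCA stores prefixes such as `ARO:` or `CHEBI:` is not stated, so the patterns and the function name are assumptions.

```python
import re

# Assumed accession shapes; adjust if MCA stores identifiers differently.
ID_PATTERNS = {
    "ncbi_taxid": re.compile(r"[1-9]\d*"),             # positive integer
    "mesh_descriptor": re.compile(r"D\d{6}(\d{3})?"),  # D + 6 or 9 digits
    "kegg_disease": re.compile(r"H\d{5}"),
    "kegg_drug": re.compile(r"D\d{5}"),
    "kegg_compound": re.compile(r"C\d{5}"),
    "aro": re.compile(r"ARO:\d{7}"),
    "chebi": re.compile(r"CHEBI:\d+"),
}


def looks_valid(kind: str, identifier: str) -> bool:
    """Check only the shape of an identifier, not whether it exists upstream."""
    pattern = ID_PATTERNS.get(kind)
    if pattern is None:
        raise KeyError(f"no pattern registered for {kind!r}")
    return pattern.fullmatch(identifier) is not None
```

A shape check like this does not confirm that an identifier resolves to the intended entity, so it complements expert review rather than replacing it.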

Curation Process

Each Taxon Passport is assembled by an AI-assisted curation pipeline built from two skills, Paper Curator and XML Update. An expert curator reviews every staging file before any change is committed to the database.

Paper Analysis

A PDF (filename = PMID) is submitted to the Paper Curator skill. An analyst agent reads the full paper, extracts metadata (title, authors, journal, year, study design, population, sample size), and identifies all microbial taxa mentioned.
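
Since the PDF filename carries the PMID, the identifier can be recovered from the path alone. A minimal sketch, with a hypothetical function name, assuming the filename is the bare numeric PMID plus a `.pdf` extension:

```python
from pathlib import Path


def pmid_from_pdf(path: str) -> str:
    """Return the PMID encoded in a PDF filename such as '<PMID>.pdf'."""
    p = Path(path)
    stem = p.stem
    if p.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF: {path!r}")
    if not (stem.isascii() and stem.isdigit()):
        raise ValueError(f"filename is not a numeric PMID: {path!r}")
    return stem
```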

Database Fetch & Entity Extraction

Per taxon, two agents run in parallel: a DB Fetch agent queries NCBI Taxonomy and BacDive (by TaxID) for biology and ecology fields; an Entity Extractor agent reads the paper for the clinical layer — pathobiont status, bloom triggers, AMR highlights, metabolites, and individual clinical associations with evidence type.

NCBI Taxonomy, BacDive, clinical_roles, bloom_triggers, amr_highlights, metabolites
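
The per-taxon fan-out described in this step could look roughly like the sketch below. The two agent functions are placeholders that return canned data. The real agents, their names, and their return shapes are not described in this document.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_db_fields(taxid):
    """Placeholder for the DB Fetch agent (NCBI Taxonomy and BacDive lookup)."""
    return {"ncbi_taxid": taxid, "gram_status": None, "oxygen_tolerance": None}


def extract_entities(paper_text, taxon_name):
    """Placeholder for the Entity Extractor agent (clinical layer)."""
    return {
        "taxon": taxon_name,
        "clinical_roles": [],
        "bloom_triggers": [],
        "amr_highlights": [],
        "metabolites": [],
    }


def process_taxon(taxid, taxon_name, paper_text):
    """Run both agents concurrently for one taxon and merge their results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        db_future = pool.submit(fetch_db_fields, taxid)
        entity_future = pool.submit(extract_entities, paper_text, taxon_name)
        return {"biology": db_future.result(), "clinical": entity_future.result()}
```

Threads suit this shape of work because both tasks are expected to spend most of their time waiting on I/O, which is an assumption about the real agents.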

Routing, Grading & Ontology Enrichment

A Routing agent checks whether the taxon already exists in the XML database (CREATE vs UPDATE). A Grading agent assigns an evidence grade (E1 / E2 / E3) for the paper based purely on study design. Three enrichment agents run in parallel: MeSH (NLM E-utilities), KEGG (local flat-file mirror), and ARO (CARD ontology).

E1 / E2 / E3, MeSH IDs, KEGG Disease, KEGG Drug, KEGG Compound, ARO (AMR)
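
The CREATE versus UPDATE decision amounts to a lookup in the existing XML. The sketch below assumes a hypothetical layout in which each passport is a `<passport>` element with an `ncbi_taxid` attribute; the actual MCA schema may differ.

```python
import xml.etree.ElementTree as ET


def route(db_root: ET.Element, ncbi_taxid: int) -> str:
    """Return 'UPDATE' if a passport with this TaxID exists, else 'CREATE'."""
    for passport in db_root.iter("passport"):  # hypothetical element name
        if passport.get("ncbi_taxid") == str(ncbi_taxid):
            return "UPDATE"
    return "CREATE"
```

Usage, with an in-memory document standing in for the database:

```python
root = ET.fromstring('<passports><passport ncbi_taxid="999999"/></passports>')
decision = route(root, 999999)
```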

Staging File & Expert Review

A structured JSON staging file is written per taxon (staging/PMID_YYYY-MM-DD_taxon-name.json). An expert curator reviews every field — proposed additions, evidence rationale, and ontology IDs — before approving.
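
The staging filename pattern can be generated mechanically. In this sketch the slug rule (lowercase, hyphen-separated) is an assumption inferred from the `taxon-name` placeholder, and the function name is hypothetical.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def staging_path(pmid: str, taxon_name: str, day: date) -> PurePosixPath:
    """Build staging/PMID_YYYY-MM-DD_taxon-name.json for one taxon."""
    # Assumed slug rule: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", taxon_name.lower()).strip("-")
    if not slug:
        raise ValueError("taxon name produced an empty slug")
    return PurePosixPath("staging") / f"{pmid}_{day.isoformat()}_{slug}.json"
```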

XML Update & SQL Export

Approved staging files are applied to the versioned XML database by the XML Update skill. A content hash is computed per association for deduplication. Applied files are archived; a full SQL dump is generated automatically from the updated XML.

CREATE / UPDATE, content_hash, versioned XML, SQL dump
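
Content-hash deduplication can be sketched as follows. The document does not state which fields feed the hash or which algorithm is used, so the field selection, the whitespace and case normalisation, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def content_hash(taxid: int, association_text: str, evidence_type: str) -> str:
    """Hash a normalised view of an association (assumed field selection)."""
    normalised_text = " ".join(association_text.lower().split())
    # A unit-separator character keeps field boundaries unambiguous.
    payload = "\x1f".join([str(taxid), normalised_text, evidence_type.lower()])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_if_new(seen: set, taxid: int, association_text: str, evidence_type: str) -> bool:
    """Record the association's hash; return False if it was already present."""
    digest = content_hash(taxid, association_text, evidence_type)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

Normalising before hashing means that trivially different spellings of the same association collapse to one record, at the cost of treating case-only differences as duplicates.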

Data Model

Each Taxon Passport is the central record, linked to satellite tables via a surrogate integer primary key. The canonical export format is versioned XML; the MySQL schema is derived from it via xml2sql.py.

Layer	Fields	Source
Identity	passport_id, preferred_name, taxon_rank, domain, lineage, ncbi_taxid, synonyms	NCBI Taxonomy
Biology	gram_status, oxygen_tolerance, morphology, key_traits, bacdive_url	BacDive
Ecology	primary_niches (+ MeSH anatomy ID), reservoirs, transmission_routes	BacDive / literature
Clinical Profile	is_pathobiont, clinical_roles, typical_specimens, bloom_triggers, risk_contexts, amr_highlights (+ ARO ID)	Curated literature
Metabolites	metabolite_name, relationship (produces / consumes / modifies), KEGG Compound ID, ChEBI ID	Curated literature; KEGG LIGAND
Clinical Associations	association_text, evidence_level (E1–E3), evidence_type, content_hash, assoc_refs (MeSH + KEGG Disease ID), PMIDs	Curated literature; NLM MeSH; KEGG MEDICUS
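
To make the passport-plus-satellite layout concrete, the sketch below builds a reduced version of two tables. It uses SQLite from the Python standard library as a stand-in for MySQL so that it runs anywhere. Only a few columns from the table above are included, and `association_id` and the table names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE passport (
    passport_id    INTEGER PRIMARY KEY,   -- surrogate integer key
    preferred_name TEXT NOT NULL,
    ncbi_taxid     INTEGER UNIQUE,
    is_pathobiont  TEXT
);
CREATE TABLE clinical_association (
    association_id   INTEGER PRIMARY KEY, -- hypothetical column name
    passport_id      INTEGER NOT NULL REFERENCES passport(passport_id),
    association_text TEXT NOT NULL,
    evidence_level   TEXT CHECK (evidence_level IN ('E1', 'E2', 'E3')),
    content_hash     TEXT UNIQUE          -- supports deduplication
);
"""


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with the reduced schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```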

Roadmap

The following extensions to MCA are in active development or under consideration.

Feature	Description	Status
Pathway-level search	Search passports by KEGG Pathway name (e.g. "butanoate metabolism"). Pathway annotations are derived computationally from existing KEGG Compound links — not manually curated — and are used for search indexing only.	In development
Gene-level annotation (KEGG Orthology)	Link taxa to KEGG Ortholog (KO) numbers representing key functional genes (e.g. bile salt hydrolase bsh, butyrate kinase buk). Will enable searches such as "which taxa carry the bile salt hydrolase gene?" and connect MCA directly to nucleotide-sequence-based functional profiling (16S rRNA, shotgun metagenomics, metatranscriptomics). Requires integration of strain-level genomic data and KO assignment pipelines.	Planned

Acknowledgements

MCA integrates data from the following publicly available resources. We gratefully acknowledge the teams that build and maintain them.

Resource	Used For	Reference
NCBI Taxonomy	Taxon lineage, rank, preferred name, synonyms, TaxID	National Center for Biotechnology Information, U.S. National Library of Medicine
NLM MeSH	MeSH term annotations on clinical associations; anatomy IDs on body-site fields	Medical Subject Headings, U.S. National Library of Medicine
BacDive	Gram status, oxygen tolerance, morphology, key traits, primary niches	BacDive — the Bacterial Diversity Metadatabase, DSMZ (Leibniz Institute)
KEGG	KEGG Disease IDs on clinical associations; KEGG Drug IDs on bloom triggers; KEGG Compound IDs on metabolites	Kyoto Encyclopedia of Genes and Genomes, Kanehisa Laboratories
CARD / ARO	Antibiotic Resistance Ontology identifiers on AMR highlights	Comprehensive Antibiotic Resistance Database, McMaster University

Curation Data Mirrors

For curation speed and reproducibility, the pipeline keeps local snapshots of the reference databases listed below. The snapshots are refreshed periodically and are not served publicly.

Database	Used by	Source	Snapshot date
NCBI Taxonomy	TaxID lookup, name resolution	ncbi.nlm.nih.gov/taxonomy	2026-04-02
BacDive	Gram status, oxygen tolerance, morphology, isolation sources	bacdive.dsmz.de	2026-04-02
KEGG	Disease, drug, and compound ID enrichment	kegg.jp	2025-10-26
CARD / ARO	AMR resistance ontology IDs	card.mcmaster.ca	2026-04-02
ChEBI	Metabolite ID enrichment	ebi.ac.uk/chebi	2026-04-02
VFDB	Virulence factor annotations	mgc.ac.cn/VFs	2026-03-27
DO	Disease ontology cross-referencing	disease-ontology.org	2026-04-02
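
Snapshot dates like those in the table above lend themselves to a simple staleness check. This is a sketch: the dates are copied from the table, while the function name and any age threshold passed to it are assumptions and not a documented MCA policy.

```python
from datetime import date

# Snapshot dates copied from the table above.
SNAPSHOTS = {
    "NCBI Taxonomy": date(2026, 4, 2),
    "BacDive": date(2026, 4, 2),
    "KEGG": date(2025, 10, 26),
    "CARD / ARO": date(2026, 4, 2),
    "ChEBI": date(2026, 4, 2),
    "VFDB": date(2026, 3, 27),
    "DO": date(2026, 4, 2),
}


def stale_mirrors(today: date, max_age_days: int) -> list:
    """List mirrors whose snapshot is older than max_age_days, sorted by name."""
    return sorted(
        name
        for name, snapshot in SNAPSHOTS.items()
        if (today - snapshot).days > max_age_days
    )
```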

Download & Source

The complete database is published as a versioned XML file and updated with each curation cycle.

Curated Literature

Full list of all 25 papers curated into MCA, with authors, journal, year, and PubMed links.

View Papers

Releases

Versioned XML and SQL snapshots for each official release. Use these for reproducible imports and downstream pipelines.

View Releases

GitHub

Source code, schema definitions, curation scripts, and issue tracking for the MCA project.

View on GitHub

Contact

We would love to hear from you. If you would like to suggest papers for inclusion in MCA, or have spotted an error in any record, please contact us at bioinformatics@ucalgary.ca. We appreciate every contribution and will do our best to respond promptly.