Genome AI: Cattle Breed Classification from SNP Data
This project provides a reproducible pipeline for cattle breed classification from SNP genotype data using two neural baselines:
- 1D CNN baseline for local SNP-pattern extraction
- Transformer encoder baseline for long-range SNP dependencies
The workflow is notebook-first with script-backed commands so you can run experiments from CLI and analyze outputs in Jupyter.
What Is Implemented
- Data acquisition commands:
  - Synthetic SNP dataset generation for quick sanity checks
  - GEO series downloader for real dataset bootstrapping
  - Schema validation for required fields (breed labels, SNP feature columns)
- Preprocessing pipeline:
  - Genotype normalization to 0/1/2 tokens
  - Missing-value imputation
  - Stratified train/val/test split generation
  - Class-weight estimation for imbalanced breeds
- Model training pipeline:
  - CNN and Transformer model implementations
  - Shared trainer with early stopping and best-checkpoint saving
- Benchmark report generation:
  - Accuracy, macro-F1, weighted-F1
  - Side-by-side CNN vs Transformer comparison artifact
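The preprocessing steps listed above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's actual code: the function names, the `-1` missing-value sentinel, the per-SNP mode imputation, and the 70/15/15 split fractions are all assumptions made for the example.

```python
import numpy as np


def impute_mode(genotypes: np.ndarray, missing: int = -1) -> np.ndarray:
    """Fill missing calls in each SNP column with that column's most common genotype."""
    out = genotypes.copy()
    for j in range(out.shape[1]):
        col = out[:, j]  # view into `out`, so assignments below modify `out`
        observed = col[col != missing]
        # Fall back to 0 if a column has no observed calls at all.
        fill = np.bincount(observed, minlength=3).argmax() if observed.size else 0
        col[col == missing] = fill
    return out


def stratified_split(labels: np.ndarray, fractions=(0.7, 0.15, 0.15), seed: int = 0):
    """Split sample indices per breed so each split keeps similar class proportions."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * idx.size))
        n_val = int(round(fractions[1] * idx.size))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to the test split
    return (np.array(train, dtype=int),
            np.array(val, dtype=int),
            np.array(test, dtype=int))


def balanced_class_weights(labels: np.ndarray, n_classes: int) -> np.ndarray:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = np.bincount(labels, minlength=n_classes)
    return labels.size / (n_classes * np.maximum(counts, 1))


if __name__ == "__main__":
    calls = np.array([[0, 2], [0, -1], [-1, 2], [1, 2]])  # 4 samples x 2 SNPs
    print(impute_mode(calls))
    breeds = np.array([0] * 20 + [1] * 10)  # imbalanced toy labels
    print(balanced_class_weights(breeds, n_classes=2))
```

The class-weight formula is the common "balanced" heuristic: rarer breeds receive proportionally larger weights in the loss.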
Dataset Options
The pipeline accepts cattle SNP datasets suitable for breed classification, either downloaded directly from GEO or supplied as your own pre-collected SNP matrix.
Suggested starting points:
1. NCBI GEO cattle SNP arrays (smaller and easier to iterate)
2. 1000 Bull Genomes subsets (larger and more realistic)
Notes:
- GEO accessions vary by study format; some provide directly usable SNP matrices, others require extraction from supplementary archives.
- For PLINK BED/BIM/FAM inputs, convert to .raw text first, then feed to preprocessing.
Environment Setup (uv)
```bash
uv sync
```
Optional notebook support:
```bash
uv sync --extra dev
```
End-to-End Quick Start
- Generate a synthetic SNP dataset (fast smoke test):
```bash
uv run python main.py download --source synthetic --output-dir data/raw --n-samples 1200 --n-snps 4096
```
- Validate schema:
```bash
uv run python main.py validate --input-csv data/raw/synthetic/synthetic_cattle_snp.csv
```
- Preprocess and split:
```bash
uv run python main.py preprocess --input-csv data/raw/synthetic/synthetic_cattle_snp.csv --output-npz data/processed/snp_dataset.npz --max-snps 4096
```
- Train one model:
```bash
uv run python main.py train --data-npz data/processed/snp_dataset.npz --model cnn --output-dir results/cnn
```
- Run CNN vs Transformer comparison:
```bash
uv run python main.py compare --data-npz data/processed/snp_dataset.npz --output-dir results/compare --epochs 20
```
Comparison summary is written to:
- results/compare/comparison.json
- results/compare/comparison.md
Comparison visuals (generated by the notebook section "CNN vs Transformer Visual Comparison and Selection Rationale"):
- results/compare/transformer_vs_cnn_metrics.png
- results/compare/transformer_vs_cnn_tradeoff.png
- results/compare/transformer_vs_cnn_context_coverage.png
Real GEO Data Download
To attempt download from a GEO series accession:
```bash
uv run python main.py download --source geo --geo-accession GSE50367 --output-dir data/raw
```
The downloader fetches common GEO artifacts (series matrix, SOFT, and common RAW archives) and stores a manifest file with provenance.
If your selected GEO series does not include a directly usable SNP matrix, unpack the supplementary files and convert them to a numeric SNP matrix with columns like:
- sample_id
- breed
- snp_00001 ... snp_N
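Before handing a hand-built matrix to preprocessing, a quick structural check along these lines may save a failed run. This is a hedged sketch using only the standard library; it is not the logic behind `main.py validate`. The function name, the `snp_` prefix test, and the choice to accept an empty cell as a missing genotype are assumptions for illustration.

```python
import csv
import io

ALLOWED_CALLS = {"0", "1", "2", ""}  # assumption: empty cell marks a missing genotype


def check_snp_csv(handle, id_col="sample_id", label_col="breed", snp_prefix="snp_"):
    """Return a list of human-readable problems; an empty list means the file looks usable."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    errors = [f"missing required column: {c}" for c in (id_col, label_col) if c not in header]
    snp_cols = [c for c in header if c.startswith(snp_prefix)]
    if not snp_cols:
        errors.append(f"no SNP columns starting with '{snp_prefix}'")
    if errors:
        return errors  # row checks are meaningless without the expected columns
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row[label_col] or "").strip():
            errors.append(f"line {line_no}: empty breed label")
        bad = [c for c in snp_cols if row[c] not in ALLOWED_CALLS]
        if bad:
            errors.append(f"line {line_no}: non 0/1/2 value in {bad[0]}")
    return errors


if __name__ == "__main__":
    sample = "sample_id,breed,snp_00001,snp_00002\ns1,Angus,0,2\ns2,,1,3\n"
    for problem in check_snp_csv(io.StringIO(sample)):
        print(problem)
```

Returning a list of problems rather than raising on the first one makes it easier to fix a converted file in a single pass.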
PLINK Conversion Helper
If you already have PLINK files:
- input prefix: mydata (mydata.bed, mydata.bim, mydata.fam)
- convert using PLINK:
```bash
plink --bfile mydata --recode A --out data/raw/mydata
```
Then transform the produced .raw file into the CSV format expected by preprocessing (sample_id + breed + SNP columns).
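One possible shape for that last transformation is sketched below. It relies on the layout PLINK writes for `--recode A`: a whitespace-delimited file whose first six columns are FID, IID, PAT, MAT, SEX, PHENOTYPE, followed by one allele-count column per SNP, with `NA` for missing calls. The function name, the `breed_by_sample` lookup (PLINK files do not carry breed labels in a fixed place), and the renaming of SNP columns to `snp_00001` style are assumptions for illustration, not part of this repository.

```python
import csv
import io

PLINK_META_COLS = 6  # FID, IID, PAT, MAT, SEX, PHENOTYPE


def plink_raw_to_csv(raw_handle, out_handle, breed_by_sample):
    """Convert a PLINK .raw file to sample_id + breed + SNP columns; returns rows written."""
    header = raw_handle.readline().split()
    n_snps = len(header) - PLINK_META_COLS
    width = max(5, len(str(n_snps)))  # zero-pad so column names sort in order
    writer = csv.writer(out_handle)
    writer.writerow(["sample_id", "breed"]
                    + [f"snp_{i:0{width}d}" for i in range(1, n_snps + 1)])
    n_rows = 0
    for line in raw_handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_id = fields[1]  # IID column
        # Missing genotypes ("NA") become empty cells for later imputation.
        calls = ["" if g == "NA" else g for g in fields[PLINK_META_COLS:]]
        writer.writerow([sample_id, breed_by_sample[sample_id]] + calls)
        n_rows += 1
    return n_rows


if __name__ == "__main__":
    raw = io.StringIO(
        "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G\n"
        "f1 a1 0 0 1 -9 0 NA\n"
        "f2 h1 0 0 2 -9 2 1\n"
    )
    out = io.StringIO()
    plink_raw_to_csv(raw, out, {"a1": "Angus", "h1": "Holstein"})
    print(out.getvalue())
```

If the original marker names matter downstream, keeping a separate mapping from `snp_00001`-style names back to the `.bim` identifiers would be a sensible addition.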
Notebook Workflow
Use the provided notebook in notebooks/ for:
- Data sanity checks
- Class balance plots
- CNN vs Transformer metric visualization
- Exported model-comparison figures for documentation
After running the comparison cells, the README figures below will point to the generated images.
Comparison Visuals

Why Transformer for SNP?
CNNs are strong at local haplotype-like pattern detection, while Transformers can model longer-range SNP dependencies through attention. This repository keeps preprocessing and splits identical so the model comparison remains fair.
Why we generally choose Transformer over CNN for this task:
- SNPs are sequential but not purely local; biologically relevant interactions can be far apart in genomic coordinates.
- Self-attention captures long-range dependencies that fixed convolution kernels may miss.
- Macro-F1 and weighted-F1 are emphasized to avoid selecting a model that only performs well on dominant breeds.
- The context-coverage plot in the notebook shows why this matters at scale: CNN context grows slowly with depth, while Transformer can access near-global context (within configured token/patch limits).
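The context-coverage argument can be made concrete with the standard receptive-field recurrence for stacked 1D convolutions: each layer adds `(kernel - 1) * dilation` positions, scaled by the product of all earlier strides. The sketch below is illustrative only; the layer configurations are hypothetical and are not the ones used by this repository's CNN.

```python
def receptive_field(layers):
    """Receptive field, in input positions, of stacked 1D conv layers.

    `layers` is a sequence of (kernel_size, stride, dilation) tuples, first layer first.
    """
    rf, jump = 1, 1  # jump = spacing, in input positions, between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    n_snps = 4096
    # Hypothetical stacks: four kernel-7 layers, without and with stride 2.
    for name, stack in [("stride 1", [(7, 1, 1)] * 4), ("stride 2", [(7, 2, 1)] * 4)]:
        rf = receptive_field(stack)
        print(f"{name}: sees {rf} of {n_snps} SNPs ({rf / n_snps:.1%})")
```

With these example stacks, each output position covers 25 or 91 input SNPs respectively, a small fraction of a 4096-SNP input, whereas a self-attention layer can relate any pair of tokens that fit in its configured sequence length.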
Important interpretation note:
- On smaller or easier datasets, CNN can still win on both speed and accuracy.
- We keep both models and compare every run; Transformer is preferred when long-range dependency modeling provides measurable quality gains.
Selection rule used in the notebook:
- Choose Transformer when macro-F1 or weighted-F1 improves by more than 0.01 over CNN.
- Choose CNN only when quality is similar and runtime/latency constraints dominate.
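As a rough sketch, the rule above could be encoded as a small function. The metric key names (`macro_f1`, `weighted_f1`) and the function name are hypothetical, and the sketch simplifies the second bullet by falling back to CNN whenever neither F1 gain exceeds the threshold; the notebook's actual handling of runtime constraints may differ.

```python
def select_model(cnn_metrics, transformer_metrics, min_gain=0.01):
    """Pick a model name from two metric dicts using the README's F1-gain rule."""
    gains = {
        name: transformer_metrics[name] - cnn_metrics[name]
        for name in ("macro_f1", "weighted_f1")
    }
    # Transformer wins if either F1 variant improves by more than the threshold.
    if any(gain > min_gain for gain in gains.values()):
        return "transformer"
    # Assumption: with similar quality, the cheaper CNN is the default.
    return "cnn"


if __name__ == "__main__":
    cnn = {"macro_f1": 0.80, "weighted_f1": 0.85}
    transformer = {"macro_f1": 0.83, "weighted_f1": 0.85}
    print(select_model(cnn, transformer))
```

Because the comparison is a strict inequality on floating-point differences, gains that sit right at the 0.01 boundary can fall on either side; rounding metrics before comparing is one way to make the decision reproducible.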