Status: Completed (Featured)

Genome AI: Cattle Breed Classification from SNP Data

A reproducible machine learning pipeline for cattle breed classification from SNP genotype data, featuring CNN and Transformer baselines. Includes data validation, preprocessing, training, and benchmarking workflows to compare local vs long-range genomic pattern modeling.

Tech Stack

Python, Jupyter Notebook, PLINK, NumPy, Pandas, Scikit-learn, Seaborn, Torch

Genome AI: Cattle Breed Classification from SNP Data

This project provides a reproducible pipeline for cattle breed classification from SNP genotype data using two neural baselines:

- 1D CNN baseline for local SNP-pattern extraction
- Transformer encoder baseline for long-range SNP dependencies

The workflow is notebook-first with script-backed commands, so you can run experiments from the CLI and analyze outputs in Jupyter.

What Is Implemented

- Data acquisition commands:
  - Synthetic SNP dataset generation for quick sanity checks
  - GEO series downloader for real dataset bootstrapping
  - Schema validation for required fields (breed labels, SNP feature columns)
- Preprocessing pipeline:
  - Genotype normalization to 0/1/2 tokens
  - Missing-value imputation
  - Stratified train/val/test split generation
  - Class-weight estimation for imbalanced breeds
- Model training pipeline:
  - CNN and Transformer model implementations
  - Shared trainer with early stopping and best-checkpoint saving
- Benchmark report generation:
  - Accuracy, macro-F1, weighted-F1
  - Side-by-side CNN vs Transformer comparison artifact
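The preprocessing steps listed above (imputation, stratified splitting, class weighting) can be illustrated with a small NumPy sketch. This is not the repository's code: the function names, the `-1` missing-value sentinel, the per-SNP mode imputation strategy, and the 70/15/15 split fractions are all assumptions chosen for illustration.

```python
import numpy as np

MISSING = -1  # assumed sentinel for a missing genotype call


def impute_mode(genotypes):
    """Fill MISSING entries in each SNP column with that column's most common genotype."""
    out = genotypes.copy()
    for j in range(out.shape[1]):
        col = out[:, j]  # view into `out`, so assignments below modify `out`
        observed = col[col != MISSING]
        # With no observed calls, fall back to 0 (an arbitrary assumption).
        fill = np.bincount(observed, minlength=3).argmax() if observed.size else 0
        col[col == MISSING] = fill
    return out


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Return train/val/test index arrays that roughly preserve per-breed proportions."""
    rng = np.random.default_rng(seed)
    parts = ([], [], [])
    for breed in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == breed))
        n_train = int(round(fractions[0] * idx.size))
        n_val = int(round(fractions[1] * idx.size))
        parts[0].extend(idx[:n_train])
        parts[1].extend(idx[n_train:n_train + n_val])
        parts[2].extend(idx[n_train + n_val:])
    return tuple(np.array(sorted(p), dtype=int) for p in parts)


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count), so rare breeds weigh more."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


if __name__ == "__main__":
    X = np.array([[0, 2, -1], [0, -1, 1], [1, 2, 1], [0, 2, 1]])
    y = np.array([0, 0, 0, 1])
    print(impute_mode(X))
    print(class_weights(y))
```

Splitting per breed before concatenating is what keeps a rare breed from vanishing out of the validation or test set; with very small classes the rounding can still leave a split empty, so a real pipeline would likely enforce a minimum count.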

Dataset Options

For cattle SNP datasets suitable for breed classification, the implementation supports either a direct GEO download or your own pre-collected SNP matrix. Suggested starting points:

1. NCBI GEO cattle SNP arrays (smaller and easier to iterate on)
2. 1000 Bull Genomes subsets (larger and more realistic)

Notes:

- GEO accessions vary by study format; some provide directly usable SNP matrices, others require extraction from supplementary archives.
- For PLINK BED/BIM/FAM inputs, convert to .raw text first, then transform the result into the CSV layout expected by preprocessing (see PLINK Conversion Helper).

Environment Setup (uv)

    uv sync

Optional notebook support:

    uv sync --extra dev

End-to-End Quick Start

1. Generate a synthetic SNP dataset (fast smoke test):

    uv run python main.py download --source synthetic --output-dir data/raw --n-samples 1200 --n-snps 4096

2. Validate schema:

    uv run python main.py validate --input-csv data/raw/synthetic/synthetic_cattle_snp.csv

3. Preprocess and split:

    uv run python main.py preprocess --input-csv data/raw/synthetic/synthetic_cattle_snp.csv --output-npz data/processed/snp_dataset.npz --max-snps 4096

4. Train one model:

    uv run python main.py train --data-npz data/processed/snp_dataset.npz --model cnn --output-dir results/cnn

5. Run CNN vs Transformer comparison:

    uv run python main.py compare --data-npz data/processed/snp_dataset.npz --output-dir results/compare --epochs 20

The comparison summary is written to:

- results/compare/comparison.json
- results/compare/comparison.md

Comparison visuals (generated by the notebook section "CNN vs Transformer Visual Comparison and Selection Rationale"):

- results/compare/transformer_vs_cnn_metrics.png
- results/compare/transformer_vs_cnn_tradeoff.png
- results/compare/transformer_vs_cnn_context_coverage.png
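The train and compare steps rely on a shared trainer with early stopping and best-checkpoint saving. A minimal, framework-free sketch of that bookkeeping is shown below. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the `run` helper are hypothetical names for illustration and are not taken from the repository.

```python
class EarlyStopping:
    """Track a validation metric (higher is better) and signal when to stop."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum gain that counts as improvement
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch; return True when this epoch is the new best."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        return False

    @property
    def should_stop(self):
        return self.bad_epochs >= self.patience


def run(scores, patience):
    """Replay a list of per-epoch validation scores; return (best_epoch, last_epoch_run)."""
    stopper = EarlyStopping(patience=patience)
    last_epoch = -1
    for epoch, score in enumerate(scores):
        last_epoch = epoch
        if stopper.step(epoch, score):
            pass  # a real trainer would save the best checkpoint here
        if stopper.should_stop:
            break
    return stopper.best_epoch, last_epoch


if __name__ == "__main__":
    print(run([0.60, 0.72, 0.71, 0.70, 0.69], patience=2))
```

Monitoring macro-F1 rather than accuracy as the score would be consistent with the emphasis on imbalanced breeds, though which metric the trainer actually monitors is not stated here.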

Real GEO Data Download

To attempt a download from a GEO series accession:

    uv run python main.py download --source geo --geo-accession GSE50367 --output-dir data/raw

The downloader fetches common GEO artifacts (series matrix, SOFT, and common RAW archives) and stores a manifest file with provenance. If your selected GEO series does not include a directly usable SNP matrix, unpack the supplementary files and convert them to a numeric SNP matrix with columns like:

- sample_id
- breed
- snp_00001 ... snp_N
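Before feeding a hand-built matrix to preprocessing, it can help to check it against the column layout above. The sketch below shows one way to do that with the standard library. It is not the repository's `validate` command: the `snp_<digits>` naming pattern, the rule that genotype cells must be 0, 1, 2, or blank, and all function names are assumptions.

```python
import csv
import io
import re

SNP_COLUMN = re.compile(r"^snp_\d+$")  # assumed naming, e.g. snp_00001


def validate_header(header):
    """Return a list of problems found in the CSV header (empty means valid)."""
    problems = []
    for required in ("sample_id", "breed"):
        if required not in header:
            problems.append(f"missing required column: {required}")
    if not any(SNP_COLUMN.match(c) for c in header):
        problems.append("no SNP feature columns found")
    if len(set(header)) != len(header):
        problems.append("duplicate column names")
    return problems


def validate_csv(handle):
    """Validate header, then check breed labels and genotype values row by row."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = validate_header(header)
    if problems:
        return problems
    snp_cols = [c for c in header if SNP_COLUMN.match(c)]
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not row.get("breed"):
            problems.append(f"line {line_no}: empty breed label")
        for col in snp_cols:
            if row[col] not in ("0", "1", "2", ""):
                problems.append(f"line {line_no}: bad genotype {row[col]!r} in {col}")
    return problems


if __name__ == "__main__":
    sample = "sample_id,breed,snp_00001,snp_00002\ns1,Angus,0,2\ns2,Holstein,1,\n"
    print(validate_csv(io.StringIO(sample)))
```

Returning a list of problems instead of raising on the first one makes it easier to fix a converted GEO file in a single pass.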

PLINK Conversion Helper

If you already have PLINK files:

input prefix: mydata (mydata.bed, mydata.bim, mydata.fam)- convert using PLINK: bashplink--bfilemydata--recodeA--outdata/raw/mydata Then transform the produced .raw file into the CSV format expected by preprocessing (sample_id + breed + SNP columns).
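The .raw-to-CSV step could look roughly like the sketch below. It relies on the standard PLINK additive-recode layout (six leading columns FID, IID, PAT, MAT, SEX, PHENOTYPE, then one allele-count column per SNP, with NA for missing calls). Everything else is an assumption: the `raw_to_csv` function, the `breed_of` lookup from sample ID to breed, the renaming of SNP columns to `snp_00001` style, and writing missing calls as empty cells.

```python
import csv
import io

N_META = 6  # FID IID PAT MAT SEX PHENOTYPE precede the genotype columns


def raw_to_csv(raw_lines, out_handle, breed_of):
    """Convert PLINK .raw lines into a sample_id/breed/snp_* CSV.

    breed_of maps IID -> breed label (hypothetical; supply your own lookup).
    Returns the number of SNP columns written.
    """
    lines = iter(raw_lines)
    header = next(lines).split()
    n_snps = len(header) - N_META
    width = max(5, len(str(n_snps)))  # zero-pad to at least five digits
    writer = csv.writer(out_handle, lineterminator="\n")
    snp_names = [f"snp_{i:0{width}d}" for i in range(1, n_snps + 1)]
    writer.writerow(["sample_id", "breed"] + snp_names)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        iid = fields[1]
        # 'NA' marks a missing call in .raw output; write it as an empty cell.
        genotypes = ["" if g == "NA" else g for g in fields[N_META:]]
        writer.writerow([iid, breed_of[iid]] + genotypes)
    return n_snps


if __name__ == "__main__":
    raw = [
        "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G",
        "F1 cow1 0 0 2 -9 0 NA",
        "F2 cow2 0 0 2 -9 2 1",
    ]
    out = io.StringIO()
    raw_to_csv(raw, out, {"cow1": "Angus", "cow2": "Holstein"})
    print(out.getvalue())
```

Renaming columns discards the original SNP identifiers, so a real conversion would probably also save the mapping from `snp_*` names back to the .raw header.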

Notebook Workflow

Use the provided notebook in notebooks/ for:

- Data sanity checks
- Class balance plots
- CNN vs Transformer metric visualization
- Exported model-comparison figures for documentation

After running the comparison cells, the README figures below will point to the generated images.

Comparison Visuals

- Figure: CNN vs Transformer Metrics
- Figure: CNN vs Transformer Tradeoff
- Figure: Transformer vs CNN Context Coverage
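The idea behind the context-coverage figure can be approximated with simple arithmetic: a stack of convolutions sees a window that grows with kernel size and depth, while a Transformer sees every token it is given. The sketch below uses the standard receptive-field recurrence for stacked 1D convolutions. The kernel sizes, strides, `patch_size`, and `max_tokens` values are hypothetical and do not reflect the repository's configuration.

```python
def cnn_receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input positions) of stacked 1D convolutions without dilation."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1  # jump = distance in input positions between adjacent outputs
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


def transformer_context(n_snps, patch_size, max_tokens):
    """SNPs visible to full self-attention when SNPs are grouped into patches (assumed scheme)."""
    return min(n_snps, patch_size * max_tokens)


if __name__ == "__main__":
    n_snps = 4096
    for depth in (2, 4, 8):
        rf = cnn_receptive_field([7] * depth)
        print(f"CNN depth {depth}: sees {rf} of {n_snps} SNPs")
    print("Transformer:", transformer_context(n_snps, patch_size=16, max_tokens=256))
```

With stride-1 layers the CNN window grows only linearly in depth, which is why pooling or strided layers are usually needed before a CNN covers thousands of SNPs.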

Why Transformer for SNP?

CNNs are strong at local haplotype-like pattern detection, while Transformers can model longer-range SNP dependencies through attention. This repository keeps preprocessing and splits identical so the model comparison remains fair.

Why we generally choose Transformer over CNN for this task:

- SNPs are sequential but not purely local; biologically relevant interactions can be far apart in genomic coordinates.
- Self-attention captures long-range dependencies that fixed convolution kernels may miss.
- Macro-F1 and weighted-F1 are emphasized to avoid selecting a model that only performs well on dominant breeds.
- The context-coverage plot in the notebook shows why this matters at scale: CNN context grows slowly with depth, while Transformer can access near-global context (within configured token/patch limits).

Important interpretation note:

- On smaller or easier datasets, CNN can still win on both speed and accuracy.
- We keep both models and compare every run; Transformer is preferred when long-range dependency modeling provides measurable quality gains.

Selection rule used in the notebook:

- Choose Transformer when macro-F1 or weighted-F1 improves by more than 0.01 over CNN.
- Choose CNN only when quality is similar and runtime/latency constraints dominate.
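The selection rule can be written down as a small function, together with the two F1 averages it compares. This is a sketch rather than the notebook's code: the function names, the metrics-dictionary keys, and the `runtime_constrained` flag are assumptions, and the "undecided" outcome covers cases the stated rule does not address.

```python
from collections import Counter

MARGIN = 0.01  # threshold from the selection rule


def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) computed from per-class F1."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    # Macro: every breed counts equally. Weighted: breeds count by sample support.
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted


def select_model(cnn, transformer, runtime_constrained):
    """Apply the rule: Transformer on a clear F1 gain, CNN when runtime dominates."""
    macro_gain = transformer["macro_f1"] - cnn["macro_f1"]
    weighted_gain = transformer["weighted_f1"] - cnn["weighted_f1"]
    if macro_gain > MARGIN or weighted_gain > MARGIN:
        return "transformer"
    if runtime_constrained:
        return "cnn"
    return "undecided"  # not covered by the stated rule; decide case by case


if __name__ == "__main__":
    print(f1_scores([0, 0, 0, 1], [0, 0, 1, 1]))
    cnn = {"macro_f1": 0.80, "weighted_f1": 0.85}
    tfm = {"macro_f1": 0.83, "weighted_f1": 0.85}
    print(select_model(cnn, tfm, runtime_constrained=False))
```

Macro-F1 averages per-breed scores equally, so a model that ignores a rare breed is penalized even when its overall accuracy looks good; that is the reason the rule is phrased in terms of F1 rather than accuracy.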


January 2026
