SNP genotypes of Eastern oysters (Crassostrea virginica) from wild and hatchery populations along U.S. Atlantic and Gulf coasts (collected 2022), used as broodstock in a common garden field experiment (2023-2025)

Website: https://www.bco-dmo.org/dataset/998738

Data Type: experimental

Version: 1

Version Date: 2026-05-18

Project

» CAREER: Evaluation of machine learning algorithms for understanding and predicting adaptation to multivariate environments with a Model Validation Program (MVP) (Model Validation Program)

Contributors	Affiliation	Role
Lotterhos, Katie	Northeastern University	Principal Investigator
Small, Jessica	Virginia Institute of Marine Science (VIMS)	Co-Principal Investigator
Carnegie, Ryan	Virginia Institute of Marine Science (VIMS)	Scientist
Bajaj, Kiran	Northeastern University	Student
Eppley, Madeline	Northeastern University	Student
Katsuki, Shelley	Virginia Institute of Marine Science (VIMS)	Student
Mongillo, Nicole	Northeastern University	Student
Rumberger, Camille	Northeastern University	Student
Segnitz, Zea	Northeastern University	Student
York, Amber D.	Woods Hole Oceanographic Institution (WHOI BCO-DMO)	BCO-DMO Data Manager

Abstract

This dataset contains SNP genotypes and metadata for broodstock (parent) Eastern oysters (Crassostrea virginica) used in a multi-year common garden field experiment (2023–2025) as part of the project "CAREER: Evaluation of machine learning algorithms for understanding and predicting adaptation to multivariate environments with a Model Validation Program (MVP)" (NSF award #2043905 to Dr. Katie Lotterhos). The goal of the broader project is to assess the ability of machine learning algorithms (MLAs) to predict adaptation of organisms from their DNA sequences. Broodstock were sourced from six wild populations along the U.S. Atlantic and Gulf coasts and two hatchery selection lines. These parents were used in spawning treatments that included monocultures of each population or line, as well as two polyculture (mixed) treatments. The resulting juvenile oysters were deployed at two common garden sites in the Chesapeake Bay (Lewisetta, VA and York River, VA) and monitored over two years. Mortality and phenotype data were collected at six-month intervals to assess fitness across spawn treatments and garden sites, providing a field-based validation of MLA predictions. Broodstock were genotyped on a 200K ThermoFisher Affymetrix Axiom SNP array derived from a 600K array (Gómez-Chiarri et al., 2015; Guo et al., 2023; Modak et al., 2021; Puritz et al., 2024) and aligned to the haplotig-masked C. virginica reference genome (C_virginica-3.0, GCA_002022765.4). The data are archived as two files: (1) a SNP genotype matrix with processed, imputed genotype calls per individual in 0/1/2 integer format (homozygous reference, heterozygous, homozygous alternate), and (2) a SNP metadata file with per-locus information including SNP identifiers, genomic positions, gene annotations, Gene Ontology terms, and population genetic statistics from OutFLANK outlier detection (FST, expected heterozygosity, p-values, q-values, and outlier flags).

Coverage

Location: Atlantic and Gulf coasts of the United States

Spatial Extent: N:43.986 E:-69.55 S:28.084 W:-97.201

Temporal Extent: 2023-05-01 - 2025-05-01

Dataset Description

Acronyms:

SNP = Single-nucleotide Polymorphism
MLA = Machine Learning Algorithm
LD = Linkage Disequilibrium
NCBI = National Center for Biotechnology Information
DNA = Deoxyribonucleic acid
MVP = Model Validation Program
ABC = Aquaculture Genetics and Breeding Technology Center
VIMS = Virginia Institute of Marine Science
GO = Gene Ontology

Crassostrea virginica, LSID (urn:lsid:marinespecies.org:taxname:140657)

Methods & Sampling

Source populations for monoculture treatments included six wild populations and two proprietary broodstock lines (DEBY and LOLA) from the Aquaculture Genetics and Breeding Technology Center (ABC), Virginia Institute of Marine Science (VIMS). Wild populations spanned the species’ native range, which is structured into distinct genetic clusters separated by the Florida peninsula (Reeb & Avise, 1990; Puritz et al., 2022). Populations included oysters from: a variable salinity site in Texas (W1-TX), a low salinity site in Louisiana (W2-LA), a high salinity site on the east coast of Florida (W3-FL), a moderate salinity site in the James River, Virginia (W4-VA), a variable salinity site in New Hampshire (W5-NH), and a high salinity site in Maine (W6-ME). W4-VA was considered local to both common garden sites due to its geographic proximity and intermediate salinity.

Gill tissue was collected from each parent (n = 160) after spawning and stored in 95% ethanol at -80°C. DNA was extracted using the Qiagen DNeasy Blood and Tissue Kit and shipped to Neogen Genomics (Lincoln, NE) for genotyping on a 200K ThermoFisher Affymetrix Axiom SNP array derived from a 600K array (Gómez-Chiarri et al., 2015; Guo et al., 2023; Modak et al., 2021; Puritz et al., 2024) aligned to the haplotig-masked Crassostrea virginica reference genome (C_virginica-3.0, GCA_002022765.4) under NCBI BioProject PRJNA376014.

Data Processing Description

Raw genotypes were processed and filtered in R v4.4.2 (R Core Team 2025). Missing SNPs were imputed using LEA v3.6.0 (Frichot & François, 2015) with K = 2 ancestral groups to produce a full SNP set. We thinned SNPs for linkage disequilibrium (LD) for population structure analysis and neutral parameterization of genome scans (Lotterhos, 2019) using bigsnpr v1.11.3 (Privé et al., 2018). The archived genotype matrix contains the processed, imputed genotype calls per individual encoded in 0/1/2 integer format (homozygous reference, heterozygous, and homozygous alternate, respectively). The accompanying SNP metadata file provides per-locus descriptions including genomic position, gene identifiers, functional descriptions, Gene Ontology terms, and population genetic statistics derived from OutFLANK (FST, expected heterozygosity, p-values, q-values, and outlier flags), as well as a flag indicating inclusion in the LD-thinned dataset, enabling straightforward filtering for only LD-thinned SNPs.
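As a minimal sketch of the filtering step described above, the following Python example selects LD-thinned SNPs from the metadata table using only the standard library. The column names AX_ID and thinned_dataset come from the parameter list for this dataset; the sample rows, the accepted spellings of a logical TRUE, and the helper names are assumptions made for illustration. The authors' own workflow is in R and is not reproduced here.

```python
import csv
import io

# Hypothetical excerpt of the SNP metadata table; real values will differ.
SAMPLE_METADATA = """AX_ID,Chromosome,Position,OutlierFlag,thinned_dataset
AX-0001,1,1500,F,TRUE
AX-0002,1,1620,F,FALSE
AX-0003,2,98000,T,TRUE
"""

# Assumed spellings of a logical TRUE; check the archived file before relying on this.
TRUE_STRINGS = {"true", "t", "1"}


def is_true(value):
    """Interpret a logical flag that is stored as text."""
    return value.strip().lower() in TRUE_STRINGS


def ld_thinned_ids(handle):
    """Return the AX_ID of every SNP flagged as part of the LD-thinned set."""
    reader = csv.DictReader(handle)
    return [row["AX_ID"] for row in reader if is_true(row["thinned_dataset"])]


thinned = ld_thinned_ids(io.StringIO(SAMPLE_METADATA))
print(thinned)
```

To use this on the archived file, pass an open handle to the metadata CSV instead of the in-memory sample.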

The genotype data are provided as Exp_parents_full_SNP_matrix.rds, an RDS (R Data Serialization) binary file readable in R via readRDS(). The same matrix is also provided as a CSV file, Exp_parents_full_SNP_matrix.csv. These files contain a processed, imputed SNP genotype matrix with individuals as rows and loci as columns, encoded in 0/1/2 integer format. Genotyping, quality filtering, imputation, and formatting were performed using the code in the associated GitHub repository (https://github.com/DrK-Lo/MVP-H2F-HatcheryField). Code used to generate these files can be found in the /src/parental_genetics directory. The genotype data were generated on June 6, 2025.
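To illustrate the 0/1/2 encoding, the sketch below computes the alternate-allele frequency at each locus from a matrix with individuals as rows and loci as columns. It assumes a diploid genotype, so each individual contributes two allele copies per locus. The presence of an individual ID in the first column, the locus names, and the function name are assumptions; the layout of the archived CSV should be checked before adapting this.

```python
import csv
import io

# Hypothetical excerpt: the first column is assumed to hold an individual ID,
# and the remaining columns are loci with 0/1/2 alternate-allele counts.
SAMPLE_MATRIX = """individual,snp_a,snp_b,snp_c
ind1,0,1,2
ind2,0,1,2
ind3,1,0,2
ind4,1,2,2
"""


def alt_allele_frequencies(handle):
    """Return {locus: alternate-allele frequency} for a 0/1/2 genotype matrix."""
    reader = csv.reader(handle)
    header = next(reader)
    loci = header[1:]  # skip the assumed ID column
    totals = [0] * len(loci)
    n_individuals = 0
    for row in reader:
        n_individuals += 1
        for i, value in enumerate(row[1:]):
            totals[i] += int(value)  # count of alternate alleles (0, 1, or 2)
    # Each diploid individual carries two allele copies per locus.
    return {locus: total / (2 * n_individuals) for locus, total in zip(loci, totals)}


freqs = alt_allele_frequencies(io.StringIO(SAMPLE_MATRIX))
print(freqs)
```

Because the archived genotypes are imputed, there are no missing calls to handle in this calculation; a matrix with missing values would need a per-locus count of non-missing individuals instead.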

BCO-DMO Processing Description

- Loaded data from "Exp_parents_SNP_metadata.csv" into table "998738_v1_oyster-genotype-annotations" using CSV format with row 1 as header; treated empty strings and "nd" as missing values
- Renamed column "AX-ID" to "AX_ID"
- Set data types for all 29 columns: AX_ID, Affx_ID, Group, SNPType, Sequence, cust_id, gene_description, gene_id, go_ids, go_terms, mutID, on_oystercv, organism, scaffold, OutlierFlag, thinned_dataset as string; Chromosome, Position, Rank, Replicates, Tile_Std, Tile_max, Tile_v3 as integer; FST, He, chrom_position, pvalues, pvaluesRightTail, qvalues as number
- Output final table as "998738_v1_oyster-genotype-annotations.csv"

- Provided matrix file Exp_parents_full_SNP_matrix.rds attached as a data file directly. [not imported as a table]
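The import steps listed above can be sketched roughly as follows. This is an illustration in Python, not the tooling BCO-DMO actually used: it treats empty strings and "nd" as missing, renames "AX-ID" to "AX_ID", and casts a few columns to the types named in the list. The sample rows, the subset of columns shown, and the function name are assumptions.

```python
import csv
import io

# Hypothetical raw rows; only a few of the 29 columns are shown.
SAMPLE_RAW = """AX-ID,Chromosome,Position,FST,gene_id
AX-0001,1,1500,0.12,LOC0001
AX-0002,nd,,nd,
"""

MISSING = {"", "nd"}                          # values treated as missing
INTEGER_COLUMNS = {"Chromosome", "Position"}  # subset of the integer columns
NUMBER_COLUMNS = {"FST"}                      # subset of the number columns


def clean_row(row):
    """Rename AX-ID, map missing markers to None, and cast typed columns."""
    cleaned = {}
    for name, value in row.items():
        name = "AX_ID" if name == "AX-ID" else name
        if value in MISSING:
            cleaned[name] = None
        elif name in INTEGER_COLUMNS:
            cleaned[name] = int(value)
        elif name in NUMBER_COLUMNS:
            cleaned[name] = float(value)
        else:
            cleaned[name] = value  # remaining columns stay as strings
    return cleaned


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_RAW))]
print(rows)
```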

Problem Description

Related Publications

Bajaj, K. E., Mongillo, N., Eppley, M. G., Rumberger, C. A., Segnitz, Z., Katsuki, S., Carnegie, R., Small, J., & Lotterhos, K. E. (2026). Contrasting effects of geographic distance, environmental distance, and intraspecific diversity on the performance of a marine invertebrate in common gardens. https://doi.org/10.64898/2026.04.02.716183

Frichot, E., & François, O. (2015). LEA: An R package for landscape and ecological association studies. Methods in Ecology and Evolution, 6(8), 925–929. https://doi.org/10.1111/2041-210X.12382

Guo, X., Puritz, J. B., Wang, Z., Proestou, D., Allen, S., Small, J., Verbyla, K., Zhao, H., Haggard, J., Chriss, N., Zeng, D., Lundgren, K., Allam, B., Bushek, D., Gomez-Chiarri, M., Hare, M., Hollenbeck, C., La Peyre, J., Liu, M., et al. (2023). Development and Evaluation of High-Density SNP Arrays for the Eastern Oyster Crassostrea virginica. Marine Biotechnology, 25(1), 174–191. https://doi.org/10.1007/s10126-022-10191-3

Gómez-Chiarri, M., Warren, W. C., Guo, X., & Proestou, D. (2015). Developing tools for the study of molluscan immunity: The sequencing of the genome of the eastern oyster, Crassostrea virginica. Fish & Shellfish Immunology, 46(1), 2–4. https://doi.org/10.1016/j.fsi.2015.05.004

Lotterhos, K. E. (2019). The Effect of Neutral Recombination Variation on Genome Scans for Selection. G3 Genes|Genomes|Genetics, 9(6), 1851–1867. https://doi.org/10.1534/g3.119.400088

Modak, T. H., Literman, R., Puritz, J. B., Johnson, K. M., Roberts, E. M., Proestou, D., Guo, X., Gomez-Chiarri, M., & Schwartz, R. S. (2021). Extensive genome-wide duplications in the eastern oyster (Crassostrea virginica). Philosophical Transactions of the Royal Society B, 376(1825). https://doi.org/10.1098/rstb.2020.0164

Privé, F., Aschard, H., Ziyatdinov, A., & Blum, M. G. B. (2018). Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics, 34(16), 2781–2787. https://doi.org/10.1093/bioinformatics/bty185

Puritz, J. B., Guo, X., Hare, M., He, Y., Hillier, L. W., Jin, S., Liu, M., Lotterhos, K. E., Minx, P., Modak, T., Proestou, D., Rice, E. S., Tomlinson, C., Warren, W. C., Witkop, E., Zhao, H., & Gomez‐Chiarri, M. (2023). A second unveiling: Haplotig masking of the eastern oyster genome improves population‐level inference. Molecular Ecology Resources, 24(1). https://doi.org/10.1111/1755-0998.13801

Puritz, J. B., Zhao, H., Guo, X., Hare, M. P., He, Y., LaPeyre, J., Lotterhos, K. E., Lundgren, K. M., Modak, T., Proestou, D., Rawson, P., Fernandez Robledo, J. A., Weedop, K. B., Witkop, E., & Gomez-Chiarri, M. (2022). Nucleotide and structural polymorphisms of the eastern oyster genome paint a mosaic of divergence, selection, and human impacts. https://doi.org/10.1101/2022.08.29.505629

Reeb, C. A., & Avise, J. C. (1990). A genetic discontinuity in a continuously distributed species: mitochondrial DNA in the American oyster, Crassostrea virginica. Genetics, 124(2), 397–406. https://doi.org/10.1093/genetics/124.2.397

Whitlock, M. C., & Lotterhos, K. E. (2015). Reliable Detection of Loci Responsible for Local Adaptation: Inference of a Null Model through Trimming the Distribution of FST. The American Naturalist, 186(S1), S24–S36. https://doi.org/10.1086/682949

Related Datasets

Software

(n.d.). MVP-H2F-HatcheryField: MVP experimental data hatchery to field [Software repository]. GitHub. https://github.com/DrK-Lo/MVP-H2F-HatcheryField

IsRelatedTo

McDonnell Genome Institute - Washington University School of Medicine (2017). Crassostrea virginica Genome sequencing and assembly. 2017/03. NCBI:BioProject: PRJNA376014 [Internet]. Bethesda, MD: National Library of Medicine (US), National Center for Biotechnology Information; Available from: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA376014

Parameters

Parameter	Description	Units
Affx_ID	Thermofisher SNP identifiers that refer to the specific probe or probe set as it was designed for the array	unitless
AX_ID	Thermofisher analysis-ready SNP identifiers that are assigned after the array is finalized	unitless
Chromosome	Chromosome where SNP occurs on assembly GCF_002022765.2_C_virginica-3.0_genomic	unitless
Position	Chromosomal position where SNP occurs on assembly GCF_002022765.2_C_virginica-3.0_genomic	unitless
FST	Fixation index for the locus calculated by OutFLANK (Whitlock and Lotterhos 2015)	unitless
He	Expected heterozygosity for the locus calculated by OutFLANK (Whitlock and Lotterhos 2015)	unitless
pvalues	Two-tailed p-value for the test of neutrality for the locus calculated by OutFLANK (Whitlock and Lotterhos 2015)	unitless
pvaluesRightTail	Right-tailed p-value for the test of neutrality for the locus calculated by OutFLANK (Whitlock and Lotterhos 2015)	unitless
qvalues	q-value for the test of neutrality for the locus based on the right-tailed p-value calculated by OutFLANK (Whitlock and Lotterhos 2015)	unitless
OutlierFlag	Whether the SNP was detected as an outlier by OutFLANK (Whitlock and Lotterhos 2015): True (T) or False (F)	unitless
gene_id	Generic locus IDs (NCBI) for gene(s) where SNP occurs	unitless
gene_description	Functional description(s) of gene(s) where SNP occurs	unitless
go_ids	Unique seven-digit identifier for a gene ontology term	unitless
go_terms	Description for a gene ontology term	unitless
SNPType	IUPAC ambiguity code for the two SNP alleles: K = G/T; M = A/C; R = A/G (most common); S = C/G (rarest); W = A/T; Y = C/T (most common)	unitless
Tile_Std	Affymetrix parameter for SNP chip design	unitless
Tile_max	Affymetrix parameter for SNP chip design	unitless
Tile_v3	SNPs tiled on the 200K chip	unitless
Replicates	Affymetrix parameter for SNP chip design	unitless
Rank	Affymetrix parameter for SNP chip design	unitless
Group	pathogen detection CN probesets, content of Axiom_OysterCV, additional well-performing markers from screen	unitless
organism	Genus species	unitless
cust_id	Annotation	unitless
on_oystercv	Whether the SNP is on Axiom_OysterCV, developed by the Breeding Consortium	unitless
mutID	Mutation ID; Affymetrix parameter for SNP chip design	unitless
chrom_position	Chromosome location and position of SNP on the 200K array; separated by a '.'	unitless
scaffold	Scaffold on assembly GCF_002022765.2_C_virginica-3.0_genomic	unitless
Sequence	71mer sequence	unitless
thinned_dataset	Whether the SNP occurs in the linkage-disequilibrium thinned SNP dataset (True or False)	unitless
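The single-letter SNPType codes in the table above follow the standard IUPAC nucleotide ambiguity codes. As a small illustrative sketch, the Python lookup below maps each code to its allele pair and labels it as a transition (purine to purine or pyrimidine to pyrimidine: R and Y) or a transversion (all others). The function and variable names are hypothetical, and only the six codes listed for this dataset are covered.

```python
# Allele pairs for the six IUPAC codes listed under SNPType.
IUPAC_ALLELES = {
    "K": ("G", "T"),
    "M": ("A", "C"),
    "R": ("A", "G"),
    "S": ("C", "G"),
    "W": ("A", "T"),
    "Y": ("C", "T"),
}

# A<->G and C<->T changes are transitions; every other pair is a transversion.
TRANSITIONS = {"R", "Y"}


def describe_snp_type(code):
    """Return (alleles, kind) for a SNPType code, ignoring letter case."""
    key = code.strip().upper()
    alleles = IUPAC_ALLELES[key]  # raises KeyError for codes not listed above
    kind = "transition" if key in TRANSITIONS else "transversion"
    return alleles, kind


print(describe_snp_type("R"))
print(describe_snp_type("S"))
```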

Project Information

CAREER: Evaluation of machine learning algorithms for understanding and predicting adaptation to multivariate environments with a Model Validation Program (MVP) (Model Validation Program)

Coverage: East coast of North America

NSF Award Abstract:
Environmental change can be rapid and involve multiple aspects of the environment changing at the same time, such as warming and increased disease pressure. Rapid environmental change threatens the productivity of aquaculture and crops on which humans depend. Predicting organisms' vulnerabilities to rapid and multifactor environmental change, however, is a major scientific challenge. A hurdle to addressing this challenge arises from the complex and non-intuitive ways that organisms adapt, through changes at the level of the DNA sequence, to many environmental stresses at the same time. Thus, there is a need for new approaches to understand and predict adaptation in multivariate environments. To address this need, this project integrates research and education with a Model Validation Program (MVP). The research is developing and evaluating Machine Learning Algorithms (MLAs) for understanding and predicting adaptation of organisms to multivariate environments from their DNA sequences. To evaluate MLAs, this research combines both data simulation and an empirical test in the field with the Eastern Oyster, which provide important ecosystem services and support a multi-million dollar industry. For oysters, this research is studying how temperature, disease pressure, and salinity interact with evolutionary history to determine fitness in the field. This research advances efforts toward addressing the major scientific challenge of predicting adaptation in complex environments by integrating concepts across the frontiers of marine, evolutionary, and statistical sciences in a new way. Machine learning and model validation are not traditionally taught in the marine and environmental sciences, but are becoming increasingly relevant to these fields. 
As part of a broader education program, this research is developing MVP Learning Modules for high school students and undergraduates, which help students build the foundational knowledge they need to critically evaluate and apply models. Modules are being disseminated to hundreds of students in the greater Boston area and are being made available online for widespread use. The MVP mentoring program is training graduate students, undergraduates, and high school students in marine evolutionary ecology, statistical genomics, and machine learning. This research addresses a pressing societal need to more informatively match genotypes to environments for restoration, farming, and assisted gene flow efforts. Results are being disseminated to stakeholders in the oyster industry.

The goal of this research is to evaluate if MLAs, which can model non-linearities, can be used to understand and predict adaptation to multivariate environments under a wide range of scenarios. In Objective 1, the Principal Investigator (PI) is creating simulated datasets with different aspects of realism, and using them to evaluate and refine the MLAs. This novel set of simulations is studying genome evolution under high gene flow in complex, multivariate environments. In Objective 2, the PI is building on their expertise with the Eastern oyster to evaluate the MLAs in a field setting. The PI is first developing a comprehensive seascape genomic dataset and using it to train MLAs to predict an individual's multivariate environment based on a single nucleotide polymorphism genotype. Then, the PI is testing if the MLA prediction can predict the fitness of different genotypes from across the species range when raised in common garden field conditions. In Objective 3, the PI is integrating research and education by using the data obtained from Objs. 1 and 2 to develop a series of original "MVP Learning Modules" with interactive web apps for persons at different levels of understanding, using the relatable example of an oyster restoration project. This research lays the foundation for future studies by producing datasets that could become classical examples for developing and benchmarking innovative modeling approaches.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding

Funding Source	Award
NSF Division of Ocean Sciences (NSF OCE)	OCE-2043905