| Contributors | Affiliation | Role |
|---|---|---|
| White, Crow | California Polytechnic State University (Cal Poly) | Principal Investigator |
| Christie, Mark | Purdue University | Co-Principal Investigator |
| Toonen, Robert J. | University of Hawaiʻi at Mānoa (HIMB) | Co-Principal Investigator |
| Davidson, Jean | California Polytechnic State University (Cal Poly) | Scientist |
| Daniels, Benjamin | Oregon State University (OSU) | Student |
| Lee, Andy | Purdue University | Student |
| López, Cataixa | Hawaii Pacific University (HPU) | Student |
| York, Amber D. | Woods Hole Oceanographic Institution (WHOI BCO-DMO) | BCO-DMO Data Manager |
"KW" or "kw" in this dataset's files, sample IDs, or metadata indicate the organism of interest:
Kellet’s whelk, gastropod, Kelletia kelletii, LSID (urn:lsid:marinespecies.org:taxname:491054)
Field Collections
Using SCUBA, we collected adult and recruit Kellet’s whelks by hand from subtidal locations (approximately 15 m depth) across the species’ entire biogeographic range, from Isla San Roque in Baja California, Mexico, to Monterey Bay in California, USA. Collections took place over three years, from 2015 to 2017. We used the Kellet’s whelk growth function to classify the ages of recruit whelks based on shell length (White et al., 2025).
DNA extraction
DNA was extracted from whelk tissue using a salting-out protocol (described in detail by Daniels et al., 2023), cleaned using the ZR-96 DNA Clean-Up Kit (Zymo Research, USA), and sequenced using the GT-Seq panel described under Sequencing.
Sequencing
We developed a novel GT-Seq panel (Campbell et al., 2015) using SNPs located on genes that were differentially expressed between Kellet’s whelk expanded and historical range sites (MON and NAP, respectively) (Lee et al., 2024). Individuals were genotyped by GTSeek (Twin Falls, ID).
Briefly, RNA reads from the expanded and historical ranges were aligned to a de novo reference transcriptome (Daniels et al., 2023) using bowtie2/2.4.2 (Langmead & Salzberg, 2012), sorted using samtools/1.9 (Li et al., 2009), and merged using stringtie2/2.1.1 (Kovaka et al., 2019). The count matrix was created using the featureCounts tool of subread/2.0.1 (Liao et al., 2014). Differentially expressed genes (DEGs) were identified with the R package DESeq2/1.34 (Love et al., 2014), applying a significance threshold of 0.05 after false discovery rate correction via the Benjamini–Hochberg method (Benjamini & Hochberg, 1995). SNPs were identified on DEG contigs using the GATK pipeline (Van der Auwera & O’Connor, 2020), following best practices for RNA-Seq. We then selected the 1,000 SNPs with the highest pairwise FST values for GT-Seq multiplex primer design by GTSeek (Twin Falls, ID).
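
The Benjamini–Hochberg correction mentioned above can be illustrated with a minimal Python sketch. This is not the DESeq2 implementation used in the study; the function name `benjamini_hochberg`, the example p-values, and the strict `< 0.05` comparison are assumptions made only for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes (not data from this dataset).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)

# Assumed decision rule: a gene is called differentially expressed
# when its adjusted p-value is below 0.05.
significant = [p < 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

In this toy input, the third and fourth genes have raw p-values below 0.05 but adjusted values above it, which is why the correction matters when many genes are tested at once.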
- Loaded two CSV files: "kw_gtseq_allinds_samplemeta v3.csv" as resource "995189_v1_kw_gtseq_allinds_samplemeta" and "kw_gtseq_genotypes_combined v3.csv" as resource "kw_gtseq_genotypes"; both with missing values defined as empty strings and "NA"
- Applied extensive field-level metadata (descriptions, standard name IDs, supplied units) to all fields in the sample metadata resource
- Renamed four columns in the sample metadata resource: "OriginalCalculatedConcentration_ng_ul_" → "OriginalCalculatedConcentration_ng_ul", "ExtractionCap_" → "ExtractionCap", "Elution_TEBuffer_" → "Elution_TEBuffer", "DNAYield_ug_" → "DNAYield_ug"
- Removed trailing " 0:00" time padding from six date fields (DateOfExtraction, SurveyDate, tfert_date, thatch_date, tmiddisp_date, tsettle_date) because no time of day was provided
- Converted all six date fields from "%m/%d/%Y" format to ISO 8601 "%Y-%m-%d" format
- Set data types for all columns in the sample metadata resource: numeric fields (Age_year, Lat, Lon, etc.), integer fields (CLNo, Cohort_year, GTseq_On_Target_Reads, etc.), string fields (CapLabel, GTseq_Sample, Region, etc.), and date fields with "%Y-%m-%d" format
- Fixed longitude values in the sample metadata resource by prepending a negative sign, as these decimal degree locations should be negative (West)
- Output two CSV files: "995189_v1_kw_gtseq_allinds_samplemeta.csv" and supplemental table "kw_gtseq_genotypes.csv"
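
The renaming, date, and longitude steps listed above could be sketched in Python roughly as follows. This is an illustration using only the standard library, not the pipeline BCO-DMO actually ran; the function `clean_row`, the example row values, and the guard that skips longitudes already carrying a minus sign are assumptions.

```python
from datetime import datetime

# Field names and renames are taken from the processing notes above.
DATE_FIELDS = ["DateOfExtraction", "SurveyDate", "tfert_date",
               "thatch_date", "tmiddisp_date", "tsettle_date"]
RENAMES = {
    "OriginalCalculatedConcentration_ng_ul_": "OriginalCalculatedConcentration_ng_ul",
    "ExtractionCap_": "ExtractionCap",
    "Elution_TEBuffer_": "Elution_TEBuffer",
    "DNAYield_ug_": "DNAYield_ug",
}
MISSING = {"", "NA"}  # values treated as missing data
TIME_PADDING = " 0:00"


def clean_row(row):
    """Return a cleaned copy of one sample-metadata row (dict of strings)."""
    out = {RENAMES.get(key, key): value for key, value in row.items()}
    for field in DATE_FIELDS:
        value = out.get(field)
        if value is None or value in MISSING:
            continue
        # Drop the trailing time padding, then convert to ISO 8601.
        if value.endswith(TIME_PADDING):
            value = value[:-len(TIME_PADDING)]
        out[field] = datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")
    lon = out.get("Lon")
    # Assumption: only prepend the sign if it is not already present.
    if lon is not None and lon not in MISSING and not lon.startswith("-"):
        out["Lon"] = "-" + lon
    return out


# Hypothetical row; the values are invented for illustration.
row = {"Whelk_ID": "example", "SurveyDate": "6/5/2016 0:00",
       "DateOfExtraction": "NA", "Lon": "120.5", "DNAYield_ug_": "1.2"}
cleaned = clean_row(row)
print(cleaned)
```

Rows read with `csv.DictReader` could be passed through `clean_row` one at a time before being written back out with `csv.DictWriter`.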
| Parameter | Description | Units |
|---|---|---|
| CLNo | DNA extraction cap label number | unitless |
| CapLabel | DNA extraction cap label | unitless |
| SiteDescription | Name of Kelletia kelletii tissue sample collection site | unitless |
| SiteCode | Code name of Kelletia kelletii tissue sample collection site | unitless |
| Lat | Latitude of Kelletia kelletii tissue sample collection site, North is positive | decimal degrees |
| Lon | Longitude of Kelletia kelletii tissue sample collection site, West is negative | decimal degrees |
| SiteCodeLetter_cap | DNA extraction cap label letter | unitless |
| Whelk_ID | Unique sample ID | unitless |
| TissueType | Sample tissue type (adult or recruit) | unitless |
| Year | Year of Kelletia kelletii tissue sample collection | unitless |
| ExtractionCap | DNA extraction cap label number | unitless |
| DateOfExtraction | Date DNA extraction was performed | unitless |
| PerformedBy | Researcher number who performed DNA extraction | unitless |
| OriginalCalculatedConcentration | DNA extraction concentration (ng/ul) if measured | ug/ul or ng/ul (conflicting units description) |
| Elution_TEBuffer | TE elution buffer used in DNA extraction protocol | microliters (uL) |
| DNAYield_ug | DNA extraction yield (ug), if measured | micrograms (ug) |
| Measurement_mm | Kelletia kelletii maximum shell length (mm), if measured (recruits only) | millimeters (mm) |
| GTseq_Sample | Unique sample ID used by GTSeek for sequencing (same meaning in GTseq_Sample in supplemental kw_gtseq_genotypes.csv) | unitless |
| GTseq_Raw_Reads | Raw number of reads sequenced | unitless |
| GTseq_On_Target_Reads | Number of reads containing in-silico probe sequences | unitless |
| GTseq_Percent_On_Target | Percentage of raw reads containing in-silico probe sequences. | percent (%) |
| GTseq_Percent_GT | Percentage of loci that were genotyped | percent (%) |
| GTseq_IFI | Individual fuzziness index of each sample. This is a measure of DNA cross-contamination and is calculated using read counts from the background signal at homozygous and No-Call loci. Low scores are better than high scores | unitless |
| GTseq_Sample_reprep | Unique sample ID for resequenced individuals | unitless |
| GTseq_Raw_Reads_reprep | Raw number of reads sequenced for resequenced individuals | count |
| GTseq_On_Target_Reads_reprep | Number of reads containing in-silico probe sequences for resequenced individuals | count |
| GTseq_Percent_On_Target_reprep | Percentage of raw reads containing in-silico probe sequences for resequenced individuals | percent (%) |
| GTseq_Percent_GT_reprep | Percentage of loci that were genotyped for resequenced individuals | percent (%) |
| GTseq_IFI_reprep | Individual fuzziness index of each sample for resequenced individuals. This is a measure of DNA cross-contamination and is calculated using read counts from the background signal at homozygous and No-Call loci. Low scores are better than high scores | unitless |
| GTseq_Run_reprep | Whether or not a sample was re-run, 0 = no, 2 = yes. | unitless |
| PoolRADseq | Flag if sample used in Pooled RADseq (1=yes, 0=no) | unitless |
| Recruit_label | CLNo for just recruits | unitless |
| SL_mm | Recruit shell length | millimeters (mm) |
| Age_year | Recruit age calculated using growth function from White et al. (2025) | years |
| SurveyDate | Date sample was collected in the wild | unitless |
| tfert_date | Date recruit was estimated to have been fertilized | unitless |
| thatch_date | Date recruit was estimated to have hatched from its egg capsule | unitless |
| tmiddisp_date | Date recruit was estimated to be midpoint in its dispersal period | unitless |
| tsettle_date | Date recruit was estimated to have settled post-dispersal | unitless |
| Cohort_year | Year recruit was estimated to be midpoint in its dispersal period | unitless |
| Region | Region recruit was collected (species' historical or expanded range region) | unitless |
| Recruit_type | Categorical age group of recruits | unitless |
| Lee_etal | Flag if sample used in GTseq (included=yes, excluded=no) | unitless |
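
As a reading aid for the read-count parameters above, the following Python sketch shows how GTseq_Percent_On_Target would relate to GTseq_Raw_Reads and GTseq_On_Target_Reads if it is simply on-target reads divided by raw reads. That relationship is inferred from the parameter descriptions and is an assumption, as are the function name and the example counts.

```python
def percent_on_target(raw_reads, on_target_reads):
    """Percentage of raw reads that are on target, or None if there are no reads."""
    if raw_reads <= 0:
        return None  # avoid division by zero for samples without reads
    return 100.0 * on_target_reads / raw_reads


# Hypothetical counts, not values from this dataset.
pct = percent_on_target(raw_reads=50000, on_target_reads=40000)
print(pct)
```

A check like this can flag rows where the reported percentage and the two read counts disagree, for example after merging the original and `_reprep` columns.
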
| Dataset-specific Instrument Name | |
| Generic Instrument Name | Self-Contained Underwater Breathing Apparatus |
| Dataset-specific Description | Tissue samples collected via SCUBA and small vessel |
| Generic Instrument Description | The self-contained underwater breathing apparatus or scuba diving system is the result of technological developments and innovations that began almost 300 years ago. Scuba diving is the most extensively used system for breathing underwater by recreational divers throughout the world and in various forms is also widely used to perform underwater work for military, scientific, and commercial purposes. Reference: https://oceanexplorer.noaa.gov/technology/technical/technical.html |
NSF Award Abstract:
Where do young marine fish and shellfish come from? This project aims to improve our understanding of how coastal marine populations are connected in space and time. Coastal populations are replenished through the arrival of minuscule larvae that have been dispersed for weeks to months in the open ocean after spawning at remote sites. The combination of the long dispersal period of marine fish and shellfish larvae and the varying ocean currents results in complex patterns of "connectivity" among populations near and far. Identifying these patterns of connectivity is fundamental to marine science and critical for effective fisheries management and conservation, yet it remains an unresolved component of marine ecology. The study species is currently expanding its biogeographic range up the U.S. west coast. By genetically analyzing individuals from across the species' range, including offspring spawned in the laboratory by experimentally-crossed individuals collected in the field from throughout the species historical and expanded range, certain genes can serve to differentiate populations along the coast. The team leverages the statistical power of these geographically-informative genes to assign thousands of young collected in the field to the source populations that spawned them (across the species' range and over multiple years). The team then quantifies patterns of connectivity over multiple years, and tests fundamental hypotheses on the spatial scale, temporal variability, biogeographic patterns, and biophysical drivers of population connectivity. The project trains approximately two dozen U.S. university students in molecular ecology and marine science, as well as creating intellectual linkages among Ph.D.-granting and non-Ph.D.-granting universities. 
The project also supports further development of a K-12 education program that uses SCUBA diving and videography to teach elementary school students Next Generation Science Standards and train them for careers in science, technology, engineering and mathematics.
Using a kelp forest gastropod and fisheries species (Kellet's whelk, Kelletia kelletii), this project combines genome-wide Restriction site Associated DNA (RAD) loci with transcriptomic loci identified from common-garden laboratory crosses of individuals from the species' historical and expanded range to identify geographically-informative loci that maximize power for individual assignment testing. Leveraging the combined power of these loci, genetic assignment of approximately three thousand recruit samples to 20 putative source populations allows the team to construct three independent years of connectivity matrices and test some of the most fundamental questions in marine ecology, including: 1) Are marine populations open or closed and at what scales? 2) To what degree is the evolutionary pattern of gene flow represented by single versus multiple generations of connectivity events? And, 3) How spatially heterogeneous and temporally variable is population connectivity? Can one year of connectivity data predict anything about the next? Additionally, by focusing on a range-expanding species with common life history traits, the team addresses a number of questions with broad applicability and significant ecological and societal implications: 4) How much is population connectivity influenced by post-recruitment demographic and evolutionary processes? 5) How well-connected are historic- and expanded-range populations? And, of particular relevance to climate change, 6) Are El Nino oceanographic conditions, which are predicted to increase in frequency and intensity this century, driving the poleward range expansion of this coastal marine species? By coupling common-garden experimental crosses to identify maximally-informative transcriptomic loci with genomic RAD analysis of field samples, this project aims to accurately and precisely quantify marine population connectivity in high gene flow species with large population sizes.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
| Funding Source | Award |
|---|---|
| NSF Division of Ocean Sciences (NSF OCE) | |
| NSF Division of Ocean Sciences (NSF OCE) | |
| NSF Division of Ocean Sciences (NSF OCE) | |