GTseq DNA sequencing of Kelletia kelletii collected in California, USA and Baja, Mexico from 2015 to 2017

Website: https://www.bco-dmo.org/dataset/995189

Data Type: Other Field Results

Version: 1

Version Date: 2026-03-19

Project

» Collaborative Research: RUI: Combined spatial and temporal analyses of population connectivity during a northern range expansion (KW connectivity)

Contributors	Affiliation	Role
White, Crow	California Polytechnic State University (Cal Poly)	Principal Investigator
Christie, Mark	Purdue University	Co-Principal Investigator
Toonen, Robert J.	University of Hawaiʻi at Mānoa (HIMB)	Co-Principal Investigator
Davidson, Jean	California Polytechnic State University (Cal Poly)	Scientist
Daniels, Benjamin	Oregon State University (OSU)	Student
Lee, Andy	Purdue University	Student
López, Cataixa	Hawaii Pacific University (HPU)	Student
York, Amber D.	Woods Hole Oceanographic Institution (WHOI BCO-DMO)	BCO-DMO Data Manager

Abstract

Climate-driven warming and changes in major ocean currents enable poleward transport and range expansions of many marine species. Here, we report the population genetic assignment of post-dispersal recruits to pre-dispersal natal source locations for the gastropod Kellet’s whelk (Kelletia kelletii), a commercial fisheries species and subtidal predator with top-down food web effects, whose populations have recently undergone climate-driven northward range expansion. This dataset includes sample ID, collection site and year, tissue type, length, and extraction information for the samples sequenced. Samples were collected between 2015 and 2017. We genotyped 2,874 individuals from 24 locations across the species’ entire biogeographic range using 305 genotyping-in-thousands by sequencing (GT-seq) loci. Analysis shows a large contribution of 1-year-old recruits from the historical range to the expanded range, variable post-settlement survival of those recruits in the expanded range in relation to their natal origin, and that the El Niño Southern Oscillation may play a role in long-distance dispersal.

Coverage
Dataset Description
- Methods & Sampling
- BCO-DMO Processing Description
Data Files
Supplemental Files
Related Publications
Related Datasets
Parameters
Instruments
Project Information
Funding

Coverage

Location: Subtidal coastal waters of Southern and Central California, USA, and Baja California, Mexico

Spatial Extent: N:36.618167 E:-114.36197 S:27.15326 W:-121.939167

Temporal Extent: 2015-06-19 - 2017-08-19

Dataset Description

"KW" or "kw" in this dataset's files, sample IDs, or metadata indicate the organism of interest:
Kellet’s whelk, gastropod, Kelletia kelletii, LSID (urn:lsid:marinespecies.org:taxname:491054)

Methods & Sampling

Field Collections

Using SCUBA, we collected adult and recruit Kellet’s whelks by hand from subtidal locations (approximately 15 m depth) across the species’ entire biogeographic range, from Isla San Roque in Baja California, Mexico to Monterey Bay in California, USA. Collections occurred across three years, from 2015 to 2017. We used the Kellet’s whelk growth function to classify the ages of recruit whelks based on shell length (White et al., 2025).

DNA extraction

DNA was extracted from whelk tissue using a salting-out protocol (described in detail by Daniels et al., 2023), cleaned using the ZR-96 DNA Clean-Up Kit (Zymo Research, USA), and sequenced using the GT-seq panel described under Sequencing.

Sequencing

We developed a novel GT-seq panel (Campbell et al., 2014) using SNPs located in genes that were differentially expressed between Kellet’s whelks from expanded and historical range sites (MON and NAP, respectively) (Lee et al., 2024). Individuals were genotyped by GTseek (Twin Falls, ID).

Briefly, RNA reads from the expanded and historical ranges were aligned to a de novo reference transcriptome (Daniels et al., 2023) using bowtie2/2.4.2 (Langmead & Salzberg, 2012), sorted using samtools/1.9 (Li et al., 2009), and merged using stringtie2/2.1.1 (Kovaka et al., 2019). The count matrix was created using the featureCounts tool of subread/2.0.1 (Liao et al., 2014). Differentially expressed genes (DEGs) were identified with the R package DESeq2/1.34 (Love et al., 2014), using a significance threshold of 0.05 after false discovery rate correction via the Benjamini–Hochberg method (Benjamini & Hochberg, 1995). SNPs were identified on DEG contigs using the GATK pipeline (Van der Auwera & O’Connor, 2020), following best practices for RNA-Seq. We then selected the 1,000 SNPs with the highest pairwise FST values for GT-seq multiplex primer design by GTseek (Twin Falls, ID).
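The false discovery rate correction described above can be illustrated with a minimal Python sketch of the Benjamini-Hochberg adjustment. This is not the DESeq2 implementation (which also applies additional filtering steps), and the p-values below are made up for illustration; the function name `benjamini_hochberg` is an assumption of this sketch.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five contigs (illustrative only).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj_p = benjamini_hochberg(raw_p)

# Flag contigs that pass a 0.05 threshold after correction.
is_significant = [p < 0.05 for p in adj_p]
```

In this toy example the third and fourth p-values are below 0.05 before correction but not after it, which is the situation the correction is designed to catch.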

GT-seq Primer Design and Panel Optimization

GT-seq primers were designed using GTseek’s proprietary primer design pipeline, which screens candidate primers for thermodynamic stability, genomic specificity, and multiplex compatibility. Primer candidates were required to have melting temperatures of 58–63°C, lengths of 17–25 bp, and 20–80% GC content. Secondary structure filters were applied to remove primers exhibiting predicted hairpin Tm values >50°C. To minimize undesired cross-reactivity within the multiplex, each candidate primer was evaluated for 3′ heterodimer formation against all previously accepted primers, using the 3′-most 10 bases of both sequences. Because accepted primers already include their Illumina overhang sequences, heterodimer screening accounts for interactions involving both the target-specific region and the attached Illumina tag. Any candidate exhibiting a predicted 3′ heterodimer melting temperature of ≥12°C with any previously accepted primer was excluded. Primers passing all criteria were appended with standard Illumina tags and assembled into draft multiplex pools.
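The screening thresholds listed above can be sketched in Python as follows. GTseek's pipeline is proprietary, so this is only an illustration of the stated criteria, not their implementation. Melting temperatures are assumed to be predicted elsewhere and passed in; `PrimerCandidate`, `accept_primers`, and the `heterodimer_tm` callback are hypothetical names, and the sketch does not model the Illumina overhangs carried by accepted primers.

```python
from dataclasses import dataclass


@dataclass
class PrimerCandidate:
    sequence: str        # target-specific primer sequence, 5'->3'
    tm_c: float          # predicted melting temperature (deg C), computed elsewhere
    hairpin_tm_c: float  # predicted hairpin Tm (deg C), computed elsewhere


def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def passes_basic_filters(p):
    """Apply the Tm, length, GC-content, and hairpin thresholds from the text."""
    return (58.0 <= p.tm_c <= 63.0
            and 17 <= len(p.sequence) <= 25
            and 20.0 <= gc_percent(p.sequence) <= 80.0
            and p.hairpin_tm_c <= 50.0)


def accept_primers(candidates, heterodimer_tm):
    """Greedily accept candidates, screening each against all accepted primers.

    heterodimer_tm(a, b) is a caller-supplied placeholder that returns a
    predicted 3' heterodimer Tm (deg C) for two 10-base 3' ends.
    """
    accepted = []
    for cand in candidates:
        if not passes_basic_filters(cand):
            continue
        tail = cand.sequence[-10:]
        if any(heterodimer_tm(tail, prev.sequence[-10:]) >= 12.0
               for prev in accepted):
            continue
        accepted.append(cand)
    return accepted


# Hypothetical candidates (sequences and Tm values are illustrative only).
demo_candidates = [
    PrimerCandidate("ACGTACGTACGTACGTACGT", 60.0, 30.0),  # passes
    PrimerCandidate("ACGT", 60.0, 30.0),                  # too short
    PrimerCandidate("ATATATATATATATATATAT", 60.0, 30.0),  # GC content too low
    PrimerCandidate("ACGTACGTACGTACGTACGA", 60.0, 55.0),  # hairpin Tm too high
]
demo_accepted = accept_primers(demo_candidates, lambda a, b: 0.0)
```

Because each candidate is compared only with primers accepted before it, the result depends on the order in which candidates are evaluated.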

Draft multiplexes were synthesized and tested using the standard two-PCR GT-seq workflow modified by using the Nate’s Plates GT-seq tagging and normalization kits for PCR2 (GTseek, Twin Falls, ID). All test libraries were sequenced on an Illumina platform. Amplicon performance was evaluated based on target-capture efficiency, read-depth uniformity, allelic balance, and absence of secondary or off-target products. Loci exhibiting low efficiency, excessive background, or unstable allele profiles were omitted, and revised panels were iteratively re-tested. Optimization continued until the multiplex achieved ≥50% target-capture efficiency with consistent amplification across samples, at which point a finalized primer pool was prepared for production genotyping.

BCO-DMO Processing Description

- Loaded two CSV files: "kw_gtseq_allinds_samplemeta v3.csv" as resource "995189_v1_kw_gtseq_allinds_samplemeta" and "kw_gtseq_genotypes_combined v3.csv" as resource "kw_gtseq_genotypes"; both with missing values defined as empty strings and "NA"
- Applied extensive field-level metadata (descriptions, standard name IDs, supplied units) to all fields in the sample metadata resource
- Renamed four columns in the sample metadata resource: "OriginalCalculatedConcentration_ng_ul_" → "OriginalCalculatedConcentration" (the data submitter confirmed that the correct unit is ng/uL, not ug/uL, so ng/uL was entered in the column metadata), "ExtractionCap_" → "ExtractionCap", "Elution_TEBuffer_" → "Elution_TEBuffer", "DNAYield_ug_" → "DNAYield_ug"
- Removed trailing " 0:00" time padding from six date fields (DateOfExtraction, SurveyDate, tfert_date, thatch_date, tmiddisp_date, tsettle_date) because no time of day was provided
- Converted all six date fields from "%m/%d/%Y" format to ISO 8601 "%Y-%m-%d" format
- Set data types for all columns in the sample metadata resource: numeric fields (Age_year, Lat, Lon, etc.), integer fields (CLNo, Cohort_year, GTseq_On_Target_Reads, etc.), string fields (CapLabel, GTseq_Sample, Region, etc.), and date fields with "%Y-%m-%d" format
- Fixed longitude values in the sample metadata resource by prepending a negative sign, as these decimal degree locations should be negative (West)
- Output two CSV files: "995189_v1_kw_gtseq_allinds_samplemeta.csv" and supplemental table "kw_gtseq_genotypes.csv"
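The date and longitude steps above could be reproduced along the following lines. This is a minimal Python sketch, not the BCO-DMO processing pipeline itself; the helper names `clean_date` and `fix_longitude` are assumptions, and `fix_longitude` uses `-abs()` rather than literally prepending a sign so that it is safe to apply twice.

```python
from datetime import datetime


def clean_date(value):
    """Strip a trailing ' 0:00' pad and convert m/d/Y to ISO 8601.

    Empty strings and 'NA' are treated as missing and returned as None.
    """
    if value is None or value.strip() in ("", "NA"):
        return None
    text = value.strip()
    if text.endswith(" 0:00"):
        text = text[: -len(" 0:00")]
    return datetime.strptime(text, "%m/%d/%Y").strftime("%Y-%m-%d")


def fix_longitude(value):
    """Return a decimal-degree longitude as a negative (West) number.

    Assumes every site lies in the western hemisphere, as in this dataset.
    """
    return -abs(float(value))
```

For example, `clean_date("6/19/2015 0:00")` returns `"2015-06-19"`, and `fix_longitude("121.939167")` returns `-121.939167`.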


Data Files

File
995189_v1_kw_gtseq_allinds_samplemeta.csv (Comma Separated Values (.csv), 726.74 KB) MD5:629ffe64d8d67440af81a93eaf2aa494 Primary data file for dataset ID 995189, version 1. Contains sample metadata and collection information.
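The MD5 checksum listed above can be used to confirm that a downloaded copy of the file is intact. A minimal Python sketch, assuming the file has been saved locally (the function name and any local path are assumptions of this example):

```python
import hashlib

# Checksum published for the primary data file in this record.
EXPECTED_MD5 = "629ffe64d8d67440af81a93eaf2aa494"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download matches the published file if `md5_of_file("995189_v1_kw_gtseq_allinds_samplemeta.csv") == EXPECTED_MD5`.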


Supplemental Files

File

kw_gtseq_genotypes.csv

(Comma Separated Values (.csv), 2.84 MB)
MD5:9a167ce06190d18435807620886493c1

Table of sample genotypes. The column "GTseq_Sample" is a unique sample ID used by GTseek for sequencing (same meaning as GTseq_Sample in the main data table of this dataset, "995189_v1_kw_gtseq_allinds_samplemeta.csv"). The subsequent columns are named using the following convention:

NODE_AAAAAA_length_BBBB_cov_DDD.DDDDDD_gEEE_iFFF_GGG

AAAAAA = the number of the transcript
BBBB = sequence length in nucleotides
DDD.DDDDDD = k-mer coverage
EEE = gene number (g)
FFF = isoform number (i)
GGG = starting position of the GT-seq locus
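The column naming convention above can be split into its parts with a regular expression. The following Python sketch assumes every field is numeric and that names follow the pattern exactly; the example name is made up for illustration and is not taken from the genotype file, and `parse_locus_name` is a hypothetical helper.

```python
import re

# Pattern for NODE_AAAAAA_length_BBBB_cov_DDD.DDDDDD_gEEE_iFFF_GGG
LOCUS_NAME = re.compile(
    r"^NODE_(?P<transcript>\d+)"
    r"_length_(?P<length_nt>\d+)"
    r"_cov_(?P<kmer_coverage>\d+(?:\.\d+)?)"
    r"_g(?P<gene>\d+)"
    r"_i(?P<isoform>\d+)"
    r"_(?P<start>\d+)$"
)


def parse_locus_name(column_name):
    """Split a locus column name into its fields, or return None if it does not match."""
    match = LOCUS_NAME.match(column_name.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "transcript": int(fields["transcript"]),
        "length_nt": int(fields["length_nt"]),
        "kmer_coverage": float(fields["kmer_coverage"]),
        "gene": int(fields["gene"]),
        "isoform": int(fields["isoform"]),
        "start": int(fields["start"]),
    }


# Hypothetical column name used only to demonstrate the parser.
example_locus = parse_locus_name("NODE_12345_length_2048_cov_37.512345_g678_i0_412")
```

Non-locus columns such as "GTseq_Sample" do not match the pattern and return `None`, which makes it easy to separate the ID column from the genotype columns.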

Whelk_GTseq345_GenomeCap_Amplicons.fa

(FASTA, 47.04 KB)
MD5:cbb1bb334e4506da43f5fb90d0ec1120

FASTA file of the amplicon sequences used for GTseq (see Methods & Sampling and Data Processing for context).


Related Publications

Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x

Campbell, N. R., Harmon, S. A., & Narum, S. R. (2014). Genotyping‐in‐Thousands by sequencing (GT‐seq): A cost effective SNP genotyping method based on custom amplicon sequencing. Molecular Ecology Resources, 15(4), 855–867. https://doi.org/10.1111/1755-0998.12357

Daniels, B. N., Nurge, J., Sleeper, O., Lee, A., López, C., Christie, M. R., Toonen, R. J., White, C., & Davidson, J. M. (2023). Genomic DNA extraction optimization and validation for genome sequencing using the marine gastropod Kellet’s whelk. PeerJ, 11, e16510. https://doi.org/10.7717/peerj.16510

Kovaka, S., Zimin, A. V., Pertea, G. M., Razaghi, R., Salzberg, S. L., & Pertea, M. (2019). Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology, 20(1). https://doi.org/10.1186/s13059-019-1910-1

Lee, A., Daniels, B. N., Hemstrom, W., López, C., Kagaya, Y., Kihara, D., Davidson, J. M., Toonen, R. J., White, C., & Christie, M. R. (2024). Genetic adaptation despite high gene flow in a range‐expanding population. Molecular Ecology. https://doi.org/10.1111/mec.17511

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., … Homer, N. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12). https://doi.org/10.1186/s13059-014-0550-8

Van der Auwera, G. A., & O'Connor, B. D. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (1st Edition). O'Reilly Media. ISBN 9781491975190

White, C., Tett, P., Kushner, D. J., Beas, R., Zacherl, D., Lonhart, S. I., Lorda, J., Roy, S., Toonen, R. J., Christie, M., Daniels, B. N., Lee, A., & Lopez, C. (2025). Cohort tracking using size‐frequency population survey data to estimate individual growth. Ecosphere, 16(10). https://doi.org/10.1002/ecs2.70436


Related Datasets

IsRelatedTo

White, C. (2025) Kelletia kelletii size-frequency population survey data collected by scientific SCUBA divers at 36 kelp forest habitat sites across the species’ biogeographic range in 2015, 2016 and 2017. Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2025-03-12 doi:10.26008/1912/bco-dmo.955710.1
Relationship Description: Related datasets from the same study

White, C., Christie, M., Toonen, R. J. (2022) Wild adult and recruit Kelletia kelletii samples from 2015 to 2017 (KW connectivity project). Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2022-05-17 doi:10.26008/1912/bco-dmo.874458.1
Relationship Description: Related datasets from the same study

White, C., Christie, M., Toonen, R. J., López, C., Lee, A., Davidson, J., Daniels, B., Evan, F. (2025) Restriction site-associated DNA sequence metadata of Kelletia kelletii collected in California, USA and Baja, Mexico in 2015 to 2017. Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2025-04-09 doi:10.26008/1912/bco-dmo.958359.1
Relationship Description: Related datasets from the same study

White, C., Toonen, R. J., Christie, M., Davidson, J., Anderson, P., Daniels, B., Lee, A., López, C. (2024) Full genome and transcriptome sequence assembly of the non-model organism Kellet’s whelk, Kelletia kelletii. Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2024-12-04 doi:10.26008/1912/bco-dmo.945292.1
Relationship Description: Related datasets from the same study


Parameters

Parameter	Description	Units
CLNo	DNA extraction cap label number	unitless
CapLabel	DNA extraction cap label	unitless
SiteDescription	Name of Kelletia kelletii tissue sample collection site	unitless
SiteCode	Code name of Kelletia kelletii tissue sample collection site	unitless
Lat	Latitude of Kelletia kelletii tissue sample collection site, North is positive	decimal degrees
Lon	Longitude of Kelletia kelletii tissue sample collection site, West is negative	decimal degrees
SiteCodeLetter_cap	DNA extraction cap label letter	unitless
Whelk_ID	Unique sample ID	unitless
TissueType	Sample tissue type (adult or recruit)	unitless
Year	Year of Kelletia kelletii tissue sample collection	unitless
ExtractionCap	DNA extraction cap label number	unitless
DateOfExtraction	Date DNA extraction was performed	unitless
PerformedBy	Researcher number who performed DNA extraction	unitless
OriginalCalculatedConcentration	DNA extraction concentration, if measured	nanograms per microliter (ng/uL)
Elution_TEBuffer	TE elution buffer used in DNA extraction protocol	microliters (uL)
DNAYield_ug	DNA extraction yield (ug), if measured	micrograms (ug)
Measurement_mm	Kelletia kelletii maximum shell length (mm), if measured (recruits only)	millimeters (mm)
GTseq_Sample	Unique sample ID used by GTseek for sequencing (same meaning as GTseq_Sample in supplemental file kw_gtseq_genotypes.csv)	unitless
GTseq_Raw_Reads	Raw number of reads sequenced	unitless
GTseq_On_Target_Reads	Number of reads containing in-silico probe sequences	unitless
GTseq_Percent_On_Target	Percentage of raw reads containing in-silico probe sequences.	percent (%)
GTseq_Percent_GT	Percentage of loci that were genotyped	percent (%)
GTseq_IFI	Individual fuzziness index of each sample. This is a measure of DNA cross-contamination and is calculated using read counts from the background signal at homozygous and No-Call loci. Low scores are better than high scores	unitless
GTseq_Sample_reprep	Unique sample ID for resequenced individuals	unitless
GTseq_Raw_Reads_reprep	Raw number of reads sequenced for resequenced individuals	count
GTseq_On_Target_Reads_reprep	Number of reads containing in-silico probe sequences for resequenced individuals	count
GTseq_Percent_On_Target_reprep	Percentage of raw reads containing in-silico probe sequences for resequenced individuals	percent (%)
GTseq_Percent_GT_reprep	Percentage of loci that were genotyped for resequenced individuals	percent (%)
GTseq_IFI_reprep	Individual fuzziness index of each sample for resequenced individuals. This is a measure of DNA cross-contamination and is calculated using read counts from the background signal at homozygous and No-Call loci. Low scores are better than high scores	unitless
GTseq_Run_reprep	Whether or not a sample was re-run, 0 = no, 2 = yes.	unitless
PoolRADseq	Flag if sample used in Pooled RADseq (1=yes, 0=no)	unitless
Recruit_label	CLNo for just recruits	unitless
SL_mm	Recruit shell length	millimeters (mm)
Age_year	Recruit age calculated using growth function from White et al. (2025)	years
SurveyDate	Date sample was collected in the wild	unitless
tfert_date	Date recruit was estimated to have been fertilized	unitless
thatch_date	Date recruit was estimated to have hatched from its egg capsule	unitless
tmiddisp_date	Date recruit was estimated to be midpoint in its dispersal period	unitless
tsettle_date	Date recruit was estimated to have settled post-dispersal	unitless
Cohort_year	Year recruit was estimated to be midpoint in its dispersal period	unitless
Region	Region recruit was collected (species' historical or expanded range region)	unitless
Recruit_type	Categorical age group of recruits	unitless
Lee_etal	Flag if sample used in GTseq (included=yes, excluded=no)	unitless


Instruments

Dataset-specific Instrument Name
Generic Instrument Name	Self-Contained Underwater Breathing Apparatus
Dataset-specific Description	Tissue samples collected via SCUBA and small vessel
Generic Instrument Description	The self-contained underwater breathing apparatus or scuba diving system is the result of technological developments and innovations that began almost 300 years ago. Scuba diving is the most extensively used system for breathing underwater by recreational divers throughout the world and in various forms is also widely used to perform underwater work for military, scientific, and commercial purposes. Reference: https://oceanexplorer.noaa.gov/technology/technical/technical.html


Project Information

Collaborative Research: RUI: Combined spatial and temporal analyses of population connectivity during a northern range expansion (KW connectivity)

Coverage: California, USA and Baja, Mexico coast

NSF Award Abstract:
Where do young marine fish and shellfish come from? This project aims to improve our understanding of how coastal marine populations are connected in space and time. Coastal populations are replenished through the arrival of minuscule larvae that have been dispersed for weeks to months in the open ocean after spawning at remote sites. The combination of the long dispersal period of marine fish and shellfish larvae and the varying ocean currents results in complex patterns of "connectivity" among populations near and far. Identifying these patterns of connectivity is fundamental to marine science and critical for effective fisheries management and conservation, yet it remains an unresolved component of marine ecology. The study species is currently expanding its biogeographic range up the U.S. west coast. By genetically analyzing individuals from across the species' range, including offspring spawned in the laboratory by experimentally-crossed individuals collected in the field from throughout the species historical and expanded range, certain genes can serve to differentiate populations along the coast. The team leverages the statistical power of these geographically-informative genes to assign thousands of young collected in the field to the source populations that spawned them (across the species' range and over multiple years). The team then quantifies patterns of connectivity over multiple years, and tests fundamental hypotheses on the spatial scale, temporal variability, biogeographic patterns, and biophysical drivers of population connectivity. The project trains approximately two dozen U.S. university students in molecular ecology and marine science, as well as creating intellectual linkages among Ph.D.-granting and non-Ph.D.-granting universities. 
The project also supports further development of a K-12 education program that uses SCUBA diving and videography to teach elementary school students Next Generation Science Standards and train them for careers in science, technology, engineering and mathematics.

Using a kelp forest gastropod and fisheries species (Kellet's whelk, Kelletia kelletii), this project combines genome-wide Restriction site Associated DNA (RAD) loci with transcriptomic loci identified from common-garden laboratory crosses of individuals from the species' historical and expanded range to identify geographically-informative loci that maximize power for individual assignment testing. Leveraging the combined power of these loci, genetic assignment of approximately three thousand recruit samples to 20 putative source populations allows the team to construct three independent years of connectivity matrices and test some of the most fundamental questions in marine ecology, including: 1) Are marine populations open or closed and at what scales? 2) To what degree is the evolutionary pattern of gene flow represented by single versus multiple generations of connectivity events? And, 3) How spatially heterogeneous and temporally variable is population connectivity? Can one year of connectivity data predict anything about the next? Additionally, by focusing on a range-expanding species with common life history traits, the team addresses a number of questions with broad applicability and significant ecological and societal implications: 4) How much is population connectivity influenced by post-recruitment demographic and evolutionary processes? 5) How well-connected are historic- and expanded-range populations? And, of particular relevance to climate change, 6) Are El Nino oceanographic conditions, which are predicted to increase in frequency and intensity this century, driving the poleward range expansion of this coastal marine species? By coupling common-garden experimental crosses to identify maximally-informative transcriptomic loci with genomic RAD analysis of field samples, this project aims to accurately and precisely quantify marine population connectivity in high gene flow species with large population sizes.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.


Funding

Funding Source	Award
NSF Division of Ocean Sciences (NSF OCE)	OCE-1924537
NSF Division of Ocean Sciences (NSF OCE)	OCE-1924505
NSF Division of Ocean Sciences (NSF OCE)	OCE-1924604
