GTseq DNA sequencing of Kelletia kelletii collected in California, USA and Baja, Mexico from 2015 to 2017

Website: https://www.bco-dmo.org/dataset/995189
Data Type: Other Field Results
Version: 1
Version Date: 2026-03-19

Project
» Collaborative Research: RUI: Combined spatial and temporal analyses of population connectivity during a northern range expansion (KW connectivity)
Contributors | Affiliation | Role
White, Crow | California Polytechnic State University (Cal Poly) | Principal Investigator
Christie, Mark | Purdue University | Co-Principal Investigator
Toonen, Robert J. | University of Hawaiʻi at Mānoa (HIMB) | Co-Principal Investigator
Davidson, Jean | California Polytechnic State University (Cal Poly) | Scientist
Daniels, Benjamin | Oregon State University (OSU) | Student
Lee, Andy | Purdue University | Student
López, Cataixa | Hawaii Pacific University (HPU) | Student
York, Amber D. | Woods Hole Oceanographic Institution (WHOI BCO-DMO) | BCO-DMO Data Manager

Abstract
Climate-driven warming and changes in major ocean currents enable poleward transport and range expansions of many marine species. Here, we report the population genetic assignment of post-dispersal recruits to pre-dispersal natal source locations for the gastropod Kellet’s whelk (Kelletia kelletii), a commercial fisheries species and subtidal predator with top-down food web effects, whose populations have recently undergone climate-driven northward range expansion. This dataset includes sample ID, collection site and year, tissue type, length, and extraction information for the sequenced samples. Samples were collected between 2015 and 2017. We genotyped 2,874 individuals from 24 locations across the species’ entire biogeographic range using 305 genotyping-in-thousands by sequencing (GT-Seq) loci. Analysis shows a large contribution of 1-year-old recruits from the historical range to the expanded range, variable post-settlement survival of those recruits in the expanded range in relation to their natal origin, and that the El Niño Southern Oscillation may play a role in long-distance dispersal.


Coverage

Location: Southern and Central California, USA subtidal coastal waters 
Spatial Extent: N:36.618167 E:-114.36197 S:27.15326 W:-121.939167
Temporal Extent: 2015-06-19 - 2017-08-19

Dataset Description

"KW" or "kw" in this dataset's files, sample IDs, or metadata indicate the organism of interest:
Kellet’s whelk, gastropod, Kelletia kelletii, LSID (urn:lsid:marinespecies.org:taxname:491054)


Methods & Sampling

Field Collections 

Using SCUBA, we collected adult and recruit Kellet's whelks by hand from subtidal locations (approximately 15 m depth) across the species’ entire biogeographic range, from Isla San Roque in Baja California, Mexico to Monterey Bay in California, USA. Collections occurred across three years, from 2015 to 2017. We used the Kellet’s whelk growth function to classify the ages of recruit whelks based on length (White et al., 2025).

DNA extraction

DNA was extracted from whelk tissue using a salting-out protocol (described in detail by Daniels et al., 2023), cleaned using the ZR-96 DNA Clean-Up Kit (Zymo Research, USA), and sequenced using the GT-Seq panel developed for this study.

Sequencing

We developed a novel GT-Seq panel (Campbell et al., 2014) using SNPs found on differentially expressed genes between Kellet’s whelks’ expanded and historical range sites (MON and NAP, respectively) (Lee et al., 2024). Individuals were genotyped by GTSeek (Twin Falls, ID). 

Briefly, RNA reads from the expanded and historical ranges were aligned to a de novo reference transcriptome (Daniels et al., 2023) using bowtie2/2.4.2 (Langmead & Salzberg, 2012), sorted using samtools/1.9 (Li et al., 2009), and merged using stringtie2/2.1.1 (Kovaka et al., 2019). The count matrix was created using the featureCounts tool of subread/2.0.1 (Liao et al., 2014). Differentially expressed genes (DEGs) were identified with the R package DESeq2/1.34 (Love et al., 2014), applying a significance threshold of 0.05 after false discovery rate correction via the Benjamini–Hochberg method (Benjamini & Hochberg, 1995). SNPs were identified on DEG contigs using the GATK pipeline (Van der Auwera & O’Connor, 2020), following best practices for RNA-Seq. We then selected the 1,000 SNPs with the highest pairwise FST values for GT-Seq multiplex primer design by GTSeek (Twin Falls, ID).
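As a rough illustration of the false discovery rate step described above, the following minimal Python sketch applies the Benjamini–Hochberg adjustment to a list of p-values. It is not the DESeq2 implementation used for this dataset; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for five genes (not taken from the dataset)
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
padj = benjamini_hochberg(pvals)
# Genes whose adjusted p-value falls below the 0.05 threshold
significant = [p < 0.05 for p in padj]
```

In this hypothetical example only the first two genes pass the 0.05 threshold after adjustment, even though four raw p-values are below 0.05.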

GT-seq Primer Design and Panel Optimization

GT-seq primers were designed using GTseek’s proprietary primer design pipeline, which screens candidate primers for thermodynamic stability, genomic specificity, and multiplex compatibility. Primer candidates were required to have melting temperatures of 58–63°C, lengths of 17–25 bp, and 20–80% GC content. Secondary structure filters were applied to remove primers exhibiting predicted hairpin Tm values >50°C. To minimize undesired cross-reactivity within the multiplex, each candidate primer was evaluated for 3′ heterodimer formation against all previously accepted primers, using the 3′-most 10 bases of both sequences. Because accepted primers already include their Illumina overhang sequences, heterodimer screening accounts for interactions involving both the target-specific region and the attached Illumina tag. Any candidate exhibiting a predicted 3′ heterodimer melting temperature of ≥12°C with any previously accepted primer was excluded. Primers passing all criteria were appended with standard Illumina tags and assembled into draft multiplex pools.
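The numeric thresholds above can be summarized in a short Python sketch. GTseek's pipeline is proprietary, so this is only an illustration of the stated cutoffs: the function names are hypothetical, and the melting temperatures are assumed to come from an external thermodynamic tool rather than being calculated here.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a primer sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_basic_filters(seq, tm_c, hairpin_tm_c, max_heterodimer_tm_c):
    """Apply the thresholds listed in the text to one candidate primer.

    tm_c, hairpin_tm_c and max_heterodimer_tm_c are assumed inputs (degrees C)
    produced by a separate thermodynamic calculation.
    """
    return (
        17 <= len(seq) <= 25                 # primer length in bp
        and 58.0 <= tm_c <= 63.0             # melting temperature window
        and 20.0 <= gc_percent(seq) <= 80.0  # GC content window
        and hairpin_tm_c <= 50.0             # hairpin Tm > 50 C is rejected
        and max_heterodimer_tm_c < 12.0      # 3' heterodimer Tm >= 12 C is rejected
    )
```

For example, a hypothetical 20 bp candidate with 50% GC content, a Tm of 60°C, a hairpin Tm of 35°C, and a worst-case heterodimer Tm of 5°C would pass, while the same primer with a heterodimer Tm of 12°C would be excluded.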

Draft multiplexes were synthesized and tested using the standard two-PCR GT-seq workflow modified by using the Nate’s Plates GT-seq tagging and normalization kits for PCR2 (GTseek, Twin Falls, ID). All test libraries were sequenced on an Illumina platform. Amplicon performance was evaluated based on target-capture efficiency, read-depth uniformity, allelic balance, and absence of secondary or off-target products. Loci exhibiting low efficiency, excessive background, or unstable allele profiles were omitted, and revised panels were iteratively re-tested. Optimization continued until the multiplex achieved ≥50% target-capture efficiency with consistent amplification across samples, at which point a finalized primer pool was prepared for production genotyping.
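The ≥50% target-capture criterion can be sketched as follows. This assumes that target-capture efficiency is the percentage of raw reads that are on target, pooled across samples; the text does not define the calculation, so both that definition and the function names are assumptions.

```python
def percent_on_target(on_target_reads, raw_reads):
    """Percentage of raw reads that are on target (0.0 if there are no reads)."""
    if raw_reads <= 0:
        return 0.0
    return 100.0 * on_target_reads / raw_reads


def panel_meets_threshold(read_counts, threshold=50.0):
    """Check pooled on-target percentage against a threshold.

    read_counts is a list of (on_target_reads, raw_reads) tuples, one per sample.
    """
    total_on_target = sum(on_target for on_target, _ in read_counts)
    total_raw = sum(raw for _, raw in read_counts)
    return percent_on_target(total_on_target, total_raw) >= threshold
```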


BCO-DMO Processing Description

- Loaded two CSV files: "kw_gtseq_allinds_samplemeta v3.csv" as resource "995189_v1_kw_gtseq_allinds_samplemeta" and "kw_gtseq_genotypes_combined v3.csv" as resource "kw_gtseq_genotypes"; both with missing values defined as empty strings and "NA"
- Applied extensive field-level metadata (descriptions, standard name IDs, supplied units) to all fields in the sample metadata resource
- Renamed four columns in the sample metadata resource: "OriginalCalculatedConcentration_ng_ul_" → "OriginalCalculatedConcentration" (the data submitter confirmed that the correct unit is ng/uL, not ug/uL, so ng/uL was entered in the column metadata), "ExtractionCap_" → "ExtractionCap", "Elution_TEBuffer_" → "Elution_TEBuffer", "DNAYield_ug_" → "DNAYield_ug"
- Removed the trailing " 0:00" time padding from six date fields (DateOfExtraction, SurveyDate, tfert_date, thatch_date, tmiddisp_date, tsettle_date) because no time of day was provided
- Converted all six date fields from "%m/%d/%Y" format to ISO 8601 "%Y-%m-%d" format
- Set data types for all columns in the sample metadata resource: numeric fields (Age_year, Lat, Lon, etc.), integer fields (CLNo, Cohort_year, GTseq_On_Target_Reads, etc.), string fields (CapLabel, GTseq_Sample, Region, etc.), and date fields with "%Y-%m-%d" format
- Fixed longitude values in the sample metadata resource by prepending a negative sign, as these decimal degree locations should be negative (West)
- Output two CSV files: "995189_v1_kw_gtseq_allinds_samplemeta.csv" and supplemental table "kw_gtseq_genotypes.csv"
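The processing steps above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the BCO-DMO pipeline itself: the helper name clean_row and the single example row are hypothetical, and only the column renames, missing-value markers, date reformatting, and longitude sign fix described in the list are reproduced.

```python
import csv
import io
from datetime import datetime

# Column renames and date fields as listed in the processing description
RENAMES = {
    "OriginalCalculatedConcentration_ng_ul_": "OriginalCalculatedConcentration",
    "ExtractionCap_": "ExtractionCap",
    "Elution_TEBuffer_": "Elution_TEBuffer",
    "DNAYield_ug_": "DNAYield_ug",
}
DATE_FIELDS = ["DateOfExtraction", "SurveyDate", "tfert_date",
               "thatch_date", "tmiddisp_date", "tsettle_date"]
MISSING = {"", "NA"}  # values treated as missing


def clean_row(row):
    """Rename columns, blank out missing values, and fix dates and longitude."""
    out = {RENAMES.get(k, k): (None if v in MISSING else v) for k, v in row.items()}
    for field in DATE_FIELDS:
        value = out.get(field)
        if value is not None:
            # Drop the " 0:00" time padding, then convert m/d/Y to ISO 8601
            if value.endswith(" 0:00"):
                value = value[: -len(" 0:00")]
            out[field] = datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")
    lon = out.get("Lon")
    # Prepend a negative sign (West); the startswith guard is an added safety check
    if lon is not None and not lon.startswith("-"):
        out["Lon"] = "-" + lon
    return out


# Hypothetical input row for illustration only
raw = io.StringIO(
    "Lon,SurveyDate,DNAYield_ug_\n"
    "121.939167,6/19/2015 0:00,NA\n"
)
cleaned = [clean_row(r) for r in csv.DictReader(raw)]
```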



Data Files

File
995189_v1_kw_gtseq_allinds_samplemeta.csv
(Comma Separated Values (.csv), 726.74 KB)
MD5:629ffe64d8d67440af81a93eaf2aa494
Primary data file for dataset ID 995189, version 1. Contains sample metadata and collection information.
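A downloaded copy of a file can be checked against its published MD5 value with a short Python sketch such as the one below. The function names are hypothetical, and the local file path is whatever location the file was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected):
    """Compare a file's MD5 digest with a published value, ignoring case and spaces."""
    return md5_of_file(path) == expected.strip().lower()
```

For the primary data file, the expected value passed to matches_published_md5 would be 629ffe64d8d67440af81a93eaf2aa494, as listed above.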


Supplemental Files

File
kw_gtseq_genotypes.csv
(Comma Separated Values (.csv), 2.84 MB)
MD5:9a167ce06190d18435807620886493c1
Table of sample genotypes. The column "GTseq_Sample" is a unique sample ID used by GTSeek for sequencing (same meaning as GTseq_Sample in the main data table of this dataset, "995189_v1_kw_gtseq_allinds_samplemeta.csv"). The subsequent columns are named according to the following convention:

NODE_AAAAAA_length_BBBB_cov_DDD.DDDDDD_gEEE_iFFF_GGG

AAAAAA= the number of the transcript
BBBB= sequence length in nucleotides
DDD.DDDDDD= k-mer coverage
EEE= gene number (g)
FFF= isoform number (i)
GGG= starting position of the GT-seq locus
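The naming convention above can be unpacked with a regular expression, as in this minimal Python sketch. The function name, the returned field names, and the example column name are hypothetical; the pattern assumes every field except coverage is a run of digits.

```python
import re

# Pattern following NODE_AAAAAA_length_BBBB_cov_DDD.DDDDDD_gEEE_iFFF_GGG
LOCUS_PATTERN = re.compile(
    r"^NODE_(\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)_g(\d+)_i(\d+)_(\d+)$"
)


def parse_locus_name(column_name):
    """Split a genotype column name into its parts, or return None if it does not match."""
    match = LOCUS_PATTERN.match(column_name)
    if match is None:
        return None
    transcript, length, cov, gene, isoform, start = match.groups()
    return {
        "transcript": int(transcript),
        "length_nt": int(length),
        "kmer_coverage": float(cov),
        "gene": int(gene),
        "isoform": int(isoform),
        "locus_start": int(start),
    }


# Hypothetical column name for illustration only
parsed = parse_locus_name("NODE_12345_length_2310_cov_87.654321_g678_i0_412")
```

Columns that do not follow the convention, such as GTseq_Sample, return None and can be skipped.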
Whelk_GTseq345_GenomeCap_Amplicons.fa
(FASTA, 47.04 KB)
MD5:cbb1bb334e4506da43f5fb90d0ec1120
FASTA file containing each of the amplicon sequences used for GTseq (see Methods & Sampling and Data Processing for context).
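The amplicon file can be read with a few lines of Python. The sketch below is a minimal FASTA reader under the assumption that the file uses standard ">" header lines followed by one or more sequence lines; the function name and example records are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join wrapped sequence lines into one string per record
    return {key: "".join(parts) for key, parts in records.items()}


# Hypothetical records for illustration only
example = read_fasta([">amp1 example", "ACGT", "TT", "", ">amp2", "GG"])
```

With a local copy of the file, the same function can be called as read_fasta(open("Whelk_GTseq345_GenomeCap_Amplicons.fa")).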


Related Publications

Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
Methods
Campbell, N. R., Harmon, S. A., & Narum, S. R. (2014). Genotyping‐in‐Thousands by sequencing (GT‐seq): A cost effective SNP genotyping method based on custom amplicon sequencing. Molecular Ecology Resources, 15(4), 855–867. https://doi.org/10.1111/1755-0998.12357
Methods
Daniels, B. N., Nurge, J., Sleeper, O., Lee, A., López, C., Christie, M. R., Toonen, R. J., White, C., & Davidson, J. M. (2023). Genomic DNA extraction optimization and validation for genome sequencing using the marine gastropod Kellet’s whelk. PeerJ, 11, e16510. https://doi.org/10.7717/peerj.16510
Methods
Kovaka, S., Zimin, A. V., Pertea, G. M., Razaghi, R., Salzberg, S. L., & Pertea, M. (2019). Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology, 20(1). https://doi.org/10.1186/s13059-019-1910-1
Methods
Lee, A., Daniels, B. N., Hemstrom, W., López, C., Kagaya, Y., Kihara, D., Davidson, J. M., Toonen, R. J., White, C., & Christie, M. R. (2024). Genetic adaptation despite high gene flow in a range‐expanding population. Molecular Ecology. https://doi.org/10.1111/mec.17511
Results
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., … Homer, N. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
Methods
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12). https://doi.org/10.1186/s13059-014-0550-8
Methods
Van der Auwera, G. A., & O'Connor, B. D. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (1st Edition). O'Reilly Media. ISBN 9781491975190
Software
White, C., Tett, P., Kushner, D. J., Beas, R., Zacherl, D., Lonhart, S. I., Lorda, J., Roy, S., Toonen, R. J., Christie, M., Daniels, B. N., Lee, A., & Lopez, C. (2025). Cohort tracking using size‐frequency population survey data to estimate individual growth. Ecosphere, 16(10). https://doi.org/10.1002/ecs2.70436
Methods


Related Datasets

IsRelatedTo
White, C. (2025) Kelletia kelletii size-frequency population survey data collected by scientific SCUBA divers at 36 kelp forest habitat sites across the species’ biogeographic range in 2015, 2016 and 2017. Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2025-03-12 doi:10.26008/1912/bco-dmo.955710.1
Relationship Description: Related datasets from the same study
White, C., Christie, M., Toonen, R. J. (2022) Wild adult and recruit Kelletia kelletii samples from 2015 to 2017 (KW connectivity project). Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2022-05-17 doi:10.26008/1912/bco-dmo.874458.1
Relationship Description: Related datasets from the same study
White, C., Christie, M., Toonen, R. J., López, C., Lee, A., Davidson, J., Daniels, B., Evan, F. (2025) Restriction site-associated DNA sequence metadata of Kelletia kelletii collected in California, USA and Baja, Mexico in 2015 to 2017. Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2025-04-09 doi:10.26008/1912/bco-dmo.958359.1
Relationship Description: Related datasets from the same study
White, C., Toonen, R. J., Christie, M., Davidson, J., Anderson, P., Daniels, B., Lee, A., López, C. (2024) Full genome and transcriptome sequence assembly of the non-model organism Kellet’s whelk, Kelletia kelletii. Biological and Chemical Oceanography Data Management Office (BCO-DMO). (Version 1) Version Date 2024-12-04 doi:10.26008/1912/bco-dmo.945292.1
Relationship Description: Related datasets from the same study


Parameters

Parameter | Description | Units
CLNo | DNA extraction cap label number | unitless
CapLabel | DNA extraction cap label | unitless
SiteDescription | Name of Kelletia kelletii tissue sample collection site | unitless
SiteCode | Code name of Kelletia kelletii tissue sample collection site | unitless
Lat | Latitude of Kelletia kelletii tissue sample collection site, North is positive | decimal degrees
Lon | Longitude of Kelletia kelletii tissue sample collection site, West is negative | decimal degrees
SiteCodeLetter_cap | DNA extraction cap label letter | unitless
Whelk_ID | Unique sample ID | unitless
TissueType | Sample tissue type (adult or recruit) | unitless
Year | Year of Kelletia kelletii tissue sample collection | unitless
ExtractionCap | DNA extraction cap label number | unitless
DateOfExtraction | Date DNA extraction was performed | unitless
PerformedBy | Number identifying the researcher who performed the DNA extraction | unitless
OriginalCalculatedConcentration | DNA extraction concentration, if measured | ng/uL
Elution_TEBuffer | TE elution buffer used in DNA extraction protocol | microliters (uL)
DNAYield_ug | DNA extraction yield (ug), if measured | micrograms (ug)
Measurement_mm | Kelletia kelletii maximum shell length (mm), if measured (recruits only) | millimeters (mm)
GTseq_Sample | Unique sample ID used by GTSeek for sequencing (same meaning as GTseq_Sample in supplemental kw_gtseq_genotypes.csv) | unitless
GTseq_Raw_Reads | Raw number of reads sequenced | unitless
GTseq_On_Target_Reads | Number of reads containing in-silico probe sequences | unitless
GTseq_Percent_On_Target | Percentage of raw reads containing in-silico probe sequences | percent (%)
GTseq_Percent_GT | Percentage of loci that were genotyped | percent (%)
GTseq_IFI | Individual fuzziness index of each sample. This is a measure of DNA cross-contamination and is calculated using read counts from the background signal at homozygous and No-Call loci. Low scores are better than high scores | unitless
GTseq_Sample_reprep | Unique sample ID for resequenced individuals | unitless
GTseq_Raw_Reads_reprep | Raw number of reads sequenced for resequenced individuals | count
GTseq_On_Target_Reads_reprep | Number of reads containing in-silico probe sequences for resequenced individuals | count
GTseq_Percent_On_Target_reprep | Percentage of raw reads containing in-silico probe sequences for resequenced individuals | percent (%)
GTseq_Percent_GT_reprep | Percentage of loci that were genotyped for resequenced individuals | percent (%)
GTseq_IFI_reprep | Individual fuzziness index of each sample for resequenced individuals. This is a measure of DNA cross-contamination and is calculated using read counts from the background signal at homozygous and No-Call loci. Low scores are better than high scores | unitless
GTseq_Run_reprep | Whether or not a sample was re-run, 0 = no, 2 = yes | unitless
PoolRADseq | Flag if sample used in Pooled RADseq (1=yes, 0=no) | unitless
Recruit_label | CLNo for just recruits | unitless
SL_mm | Recruit shell length | millimeters (mm)
Age_year | Recruit age calculated using growth function from White et al. (2025) | years
SurveyDate | Date sample was collected in the wild | unitless
tfert_date | Date recruit was estimated to have been fertilized | unitless
thatch_date | Date recruit was estimated to have hatched from its egg capsule | unitless
tmiddisp_date | Date recruit was estimated to be at the midpoint of its dispersal period | unitless
tsettle_date | Date recruit was estimated to have settled post-dispersal | unitless
Cohort_year | Year recruit was estimated to be at the midpoint of its dispersal period | unitless
Region | Region where recruit was collected (species' historical or expanded range region) | unitless
Recruit_type | Categorical age group of recruits | unitless
Lee_etal | Flag if sample used in GTseq (included=yes, excluded=no) | unitless
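Because GTseq_Sample has the same meaning in the sample metadata and in the supplemental genotype table, the two files can be joined on that column. The Python sketch below illustrates one way to do this with the standard library; the function names, the example rows, and the genotype column name and value are hypothetical, and samples without a GTseq_Sample value are simply skipped.

```python
import csv
import io


def index_by_sample(handle, key="GTseq_Sample"):
    """Index CSV rows by sample ID, skipping rows with a missing ID."""
    return {
        row[key]: row
        for row in csv.DictReader(handle)
        if row.get(key) not in (None, "", "NA")
    }


def join_tables(meta_handle, geno_handle, key="GTseq_Sample"):
    """Attach genotype columns to each metadata row that has a matching sample ID."""
    genotypes = index_by_sample(geno_handle, key)
    joined = []
    for row in csv.DictReader(meta_handle):
        geno = genotypes.get(row.get(key))
        if geno is not None:
            joined.append({**row, **geno})
    return joined


# Hypothetical miniature tables for illustration only
meta = io.StringIO("Whelk_ID,SiteCode,GTseq_Sample\nW1,MON,S001\nW2,NAP,NA\n")
geno = io.StringIO("GTseq_Sample,locus_1\nS001,AG\n")
joined = join_tables(meta, geno)
```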



Instruments

Dataset-specific Instrument Name:
Generic Instrument Name: Self-Contained Underwater Breathing Apparatus
Dataset-specific Description: Tissue samples collected via SCUBA and small vessel
Generic Instrument Description: The self-contained underwater breathing apparatus or scuba diving system is the result of technological developments and innovations that began almost 300 years ago. Scuba diving is the most extensively used system for breathing underwater by recreational divers throughout the world and in various forms is also widely used to perform underwater work for military, scientific, and commercial purposes. Reference: https://oceanexplorer.noaa.gov/technology/technical/technical.html



Project Information

Collaborative Research: RUI: Combined spatial and temporal analyses of population connectivity during a northern range expansion (KW connectivity)

Coverage: California, USA and Baja, Mexico coast


NSF Award Abstract:
Where do young marine fish and shellfish come from? This project aims to improve our understanding of how coastal marine populations are connected in space and time. Coastal populations are replenished through the arrival of minuscule larvae that have been dispersed for weeks to months in the open ocean after spawning at remote sites. The combination of the long dispersal period of marine fish and shellfish larvae and the varying ocean currents results in complex patterns of "connectivity" among populations near and far. Identifying these patterns of connectivity is fundamental to marine science and critical for effective fisheries management and conservation, yet it remains an unresolved component of marine ecology. The study species is currently expanding its biogeographic range up the U.S. west coast. By genetically analyzing individuals from across the species' range, including offspring spawned in the laboratory by experimentally-crossed individuals collected in the field from throughout the species historical and expanded range, certain genes can serve to differentiate populations along the coast. The team leverages the statistical power of these geographically-informative genes to assign thousands of young collected in the field to the source populations that spawned them (across the species' range and over multiple years). The team then quantifies patterns of connectivity over multiple years, and tests fundamental hypotheses on the spatial scale, temporal variability, biogeographic patterns, and biophysical drivers of population connectivity. The project trains approximately two dozen U.S. university students in molecular ecology and marine science, as well as creating intellectual linkages among Ph.D.-granting and non-Ph.D.-granting universities. 
The project also supports further development of a K-12 education program that uses SCUBA diving and videography to teach elementary school students Next Generation Science Standards and train them for careers in science, technology, engineering and mathematics.

Using a kelp forest gastropod and fisheries species (Kellet's whelk, Kelletia kelletii), this project combines genome-wide Restriction site Associated DNA (RAD) loci with transcriptomic loci identified from common-garden laboratory crosses of individuals from the species' historical and expanded range to identify geographically-informative loci that maximize power for individual assignment testing. Leveraging the combined power of these loci, genetic assignment of approximately three thousand recruit samples to 20 putative source populations allows the team to construct three independent years of connectivity matrices and test some of the most fundamental questions in marine ecology, including: 1) Are marine populations open or closed and at what scales? 2) To what degree is the evolutionary pattern of gene flow represented by single versus multiple generations of connectivity events? And, 3) How spatially heterogeneous and temporally variable is population connectivity? Can one year of connectivity data predict anything about the next? Additionally, by focusing on a range-expanding species with common life history traits, the team addresses a number of questions with broad applicability and significant ecological and societal implications: 4) How much is population connectivity influenced by post-recruitment demographic and evolutionary processes? 5) How well-connected are historic- and expanded-range populations? And, of particular relevance to climate change, 6) Are El Nino oceanographic conditions, which are predicted to increase in frequency and intensity this century, driving the poleward range expansion of this coastal marine species? By coupling common-garden experimental crosses to identify maximally-informative transcriptomic loci with genomic RAD analysis of field samples, this project aims to accurately and precisely quantify marine population connectivity in high gene flow species with large population sizes.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.




Funding

Funding Source | Award
NSF Division of Ocean Sciences (NSF OCE)
NSF Division of Ocean Sciences (NSF OCE)
NSF Division of Ocean Sciences (NSF OCE)
