Q1:   How do I use TRExplorer?
Q2:   What datasets are included in TRExplorer?
Q3:   Why do many TR loci have population allele frequencies from some datasets but not from others?
Q4:   How do I interpret differences in allele distributions across datasets?
Q5:   How do I interpret the (O/E) constraint percentile score?
Q6:   How do I use the VC, mappability, and nearby TR annotations?
Q7:   How do I cite TRExplorer?
Q8:   Can I contribute data to TRExplorer?

Q1: How do I use TRExplorer?

TRExplorer is similar to the gnomAD browser, but for tandem repeats (TRs). It allows users to search for TRs by genomic interval, gene name, gene region, motif, and other properties. It then displays TR allele size distributions, constraint metrics, and many other annotations based on different short-read and long-read datasets (see Q2 for details).

Example use cases:
Example #1: View a table of all STR and VNTR loci in the CACNA1C gene, sorted by motif size.
Example #2: View a table of all coding CAG loci throughout the genome, sorted by their degree of polymorphism in 100 long-read genomes from the HPRC, then export them to an ExpansionHunter catalog by selecting Export To: ExpansionHunter.
Example #3: See other TR loci and the sequence context around the disease-associated NOTCH2NLC locus in IGV.js, along with the segmental duplication and mappability tracks.
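
To give a sense of what an exported catalog contains, here is a minimal Python sketch that builds one ExpansionHunter-style catalog entry. The field names (LocusId, LocusStructure, ReferenceRegion, VariantType) follow the ExpansionHunter variant catalog format as commonly documented; the helper name, the locus ID scheme, and the example coordinates are assumptions made for illustration, and the file TRExplorer actually exports may differ.

```python
import json


def to_expansionhunter_entry(chrom, start_0based, end, motif):
    """Build one catalog entry for a simple, single-motif repeat locus.

    Hypothetical helper: the LocusId naming scheme is an assumption,
    not necessarily the one TRExplorer uses in its export.
    """
    return {
        "LocusId": f"{chrom}-{start_0based}-{end}-{motif}",
        "LocusStructure": f"({motif})*",
        "ReferenceRegion": f"{chrom}:{start_0based}-{end}",
        "VariantType": "Repeat",
    }


# Illustrative coordinates only; check them against the catalog you export.
entry = to_expansionhunter_entry("chr4", 3074876, 3074933, "CAG")
catalog_json = json.dumps([entry], indent=2)
print(catalog_json)
```

A catalog is a JSON list of such entries, one per locus to genotype.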

Q2: What datasets are included in TRExplorer?

Population allele frequencies are currently available from the following short-read, long-read, and assembly-based datasets:

Assembly-based datasets
Long-read datasets
Short-read datasets

Q3: Why do many TR loci have population allele frequencies from some datasets but not from others?

Tandem repeat (TR) genotyping depends on the catalog used: most short-read and long-read TR genotyping tools require the user to specify the list (a.k.a. catalog) of loci to genotype, including their exact genomic coordinates and repeat motifs. The different datasets in TRExplorer were generated at different times by different groups and were genotyped using different subsets of loci, so a given locus may have allele frequencies in some datasets but not in others. That said, we only included datasets in TRExplorer if they were genotyped using the same or compatible locus definitions (i.e. start and end coordinates and motifs).
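
The idea of matching locus definitions across catalogs can be sketched in Python as follows. This is an illustration only: the dataset names and coordinates are made up, and treating cyclic rotations of a motif (e.g. CAG and AGC) as "compatible" is an assumption of this sketch, not a statement of the rule TRExplorer applies.

```python
def canonical_motif(motif):
    """Return the alphabetically smallest cyclic rotation of a motif."""
    motif = motif.upper()
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def locus_key(chrom, start, end, motif):
    """Key a locus by its coordinates and rotation-normalized motif."""
    return (chrom, start, end, canonical_motif(motif))


def datasets_with_locus(key, catalogs):
    """List the datasets whose catalog contains the given locus key."""
    return sorted(name for name, keys in catalogs.items() if key in keys)


# Hypothetical catalogs: each dataset genotyped its own subset of loci.
catalogs = {
    "dataset_A": {locus_key("chr4", 3074876, 3074933, "CAG")},
    "dataset_B": {
        locus_key("chr4", 3074876, 3074933, "AGC"),
        locus_key("chr1", 1000, 1040, "AT"),
    },
}

shared = datasets_with_locus(locus_key("chr4", 3074876, 3074933, "CAG"), catalogs)
only_b = datasets_with_locus(locus_key("chr1", 1000, 1040, "AT"), catalogs)
print(shared, only_b)
```

In this toy example the first locus has allele frequencies from both datasets, while the second was genotyped in only one of them.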

Q4: How do I interpret differences in allele distributions across datasets?

Differences in tandem repeat (TR) allele distributions between datasets can arise from a variety of technical and biological factors, including but not limited to:

Q5: How do I interpret the (O/E) constraint percentile score?

The (O/E) constraint percentile reflects how much the observed variation in a tandem repeat (TR) differs from what is expected under neutral evolution. [Danzi et al. 2025] define observed-to-expected (O/E) constraint as the ratio of the observed variability in longest pure segment (LPS) length to the variability expected from sequence context. The expected values are predicted by a deep learning model trained on TRs located more than 10 kb from the nearest gene, which are presumed to be mostly evolving neutrally. Unlike genic constraint models, which focus primarily on identifying loci under purifying selection, TR constraint captures both ends of the evolutionary spectrum:

Low O/E values (constrained): TRs with less variability than expected may be functionally important and under negative selection. For example, repeats in coding regions where expansion or contraction could disrupt protein structure or dosage.

High O/E values (excess variation): TRs with greater variability than expected may reflect positive selection, recent evolutionary origin, or genomic instability. Notably, many known pathogenic TRs (those causing repeat expansion disorders) fall into this high-variance category.

The percentile score contextualizes this measure across the genome to help prioritize loci for follow-up.
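
As a rough illustration of how a ratio and a genome-wide percentile relate, here is a minimal Python sketch. It assumes the observed and expected variability values are already available as numbers; it does not reproduce the deep learning model or the exact variability measure from [Danzi et al. 2025], and the function names and toy values are hypothetical.

```python
from bisect import bisect_right


def oe_ratio(observed_variability, expected_variability):
    """Observed-to-expected ratio for one locus."""
    if expected_variability <= 0:
        raise ValueError("expected variability must be positive")
    return observed_variability / expected_variability


def percentile_ranks(values):
    """Percent of loci whose value is less than or equal to each value."""
    ordered = sorted(values)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, v) / n for v in values]


# Toy O/E values for four loci (made-up numbers).
oe_values = [0.2, 1.0, 3.0, 0.5]
ranks = percentile_ranks(oe_values)
print(ranks)
```

Under this convention, a locus near the bottom of the ranking has less variation than expected (constrained), and one near the top has excess variation.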

Q6: How do I use the VC, mappability, and nearby TR annotations?

These annotations describe the genomic context of a tandem repeat (TR) locus and can inform interpretation of genotyping accuracy and biological relevance.

Variation cluster (VC): As described in [Weisburd, Dolzhenko et al., 2024], variation clusters represent broader polymorphic regions composed of adjacent or overlapping TRs and possibly other variation. If a TR is flagged as part of a VC, we recommend reviewing neighboring TRs in the same cluster, as the region may harbor complex or composite repeat structures.

Nearby TRs: This flag indicates whether a given locus is located within a dense region of adjacent TRs. Loci in TR-dense regions may be prone to ambiguous alignments or compound variation patterns that may impact analysis accuracy and interpretation.

Low mappability: TRs with low sequence mappability may be difficult to genotype accurately using short-read sequencing due to read alignment uncertainty.

In combination, these annotations can help flag loci that may be challenging to resolve with short reads, and highlight regions where long-read data or broader contextual analysis may be warranted.
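
One way to combine the three annotations when triaging loci is sketched below in Python. The class, field names, and the rule of suggesting long-read data when two or more flags are set are all assumptions made for this example; they are not TRExplorer fields or an official recommendation threshold.

```python
from dataclasses import dataclass


@dataclass
class LocusFlags:
    """Hypothetical container for the three annotations of one locus."""
    in_variation_cluster: bool
    has_nearby_trs: bool
    low_mappability: bool


def review_notes(flags):
    """Turn the flags for one locus into a list of follow-up notes."""
    notes = []
    if flags.in_variation_cluster:
        notes.append("review neighboring TRs in the same variation cluster")
    if flags.has_nearby_trs:
        notes.append("check for ambiguous alignments or compound variation")
    if flags.low_mappability:
        notes.append("treat short-read genotypes with caution")
    # Arbitrary cutoff chosen for this sketch: two or more flags set.
    if len(notes) >= 2:
        notes.append("consider long-read data for this locus")
    return notes


clean = review_notes(LocusFlags(False, False, False))
complex_locus = review_notes(LocusFlags(True, False, True))
print(complex_locus)
```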

Q7: How do I cite TRExplorer?

If you use TRExplorer in your work, please cite the following preprint:

Weisburd B, Dolzhenko E, Bennett MF, Danzi MC, English A, Hiatt L, Tanudisastro H, Kurtas NE, Jam HZ, Brand H, Sedlazeck FJ, Gymrek M, Dashnow H, Eberle MA, Rehm HL. Defining a tandem repeat catalog and variation clusters for genome-wide analyses and population databases. bioRxiv 2024.10.04.615514; https://doi.org/10.1101/2024.10.04.615514

Q8: Can I contribute data to TRExplorer?

If you are interested in contributing tandem repeat genotypes or annotations to the TRExplorer resource, please email weisburd@broadinstitute.org