UCSC RepeatMasker annotations are available as Bioconductor AnnotationHub resources. The UCSCRepeatMasker annotation package stores the metadata for these resources and provides this vignette to illustrate how to use them.
UCSCRepeatMasker 3.22.0
The UCSCRepeatMasker package provides metadata for
AnnotationHub resources associated with UCSC RepeatMasker
annotations. The original data can be found through UCSC download URLs
https://hgdownload.soe.ucsc.edu/goldenPath/XXXX/database/rmsk.txt.gz,
where XXXX is the corresponding code to a UCSC genome version.
Details about how those original data were processed into
AnnotationHub resources can be found in the source
file:
UCSCRepeatMasker/scripts/make-data_UCSCRepeatMasker.R
while details on how the metadata for those resources has been generated can be found in the source file:
UCSCRepeatMasker/scripts/make-metadata_UCSCRepeatMasker.R
UCSC RepeatMasker annotations can be retrieved using the AnnotationHub, which is a web resource that provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard (e.g., UCSC, Ensembl) and distributed sites, can be found. A Bioconductor AnnotationHub web resource creates and manages a local cache of files retrieved by the user, helping with quick and reproducible access.
For example, to list the available UCSC RepeatMasker annotations for the human genome, we should first load the AnnotationHub package:
library(AnnotationHub)
and then query the annotation hub as follows:
ah <- AnnotationHub()
query(ah, c("UCSC", "RepeatMasker", "Homo sapiens"))
## AnnotationHub with 3 records
## # snapshotDate(): 2025-10-28
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH99002"]]'
##
## title
## AH99002 | UCSC RepeatMasker annotations (Mar2020) for Human (hg19)
## AH99003 | UCSC RepeatMasker annotations (Sep2021) for Human (hg38)
## AH111333 | UCSC RepeatMasker annotations (Oct2022) for Human (hg38)
We can retrieve the desired resource, e.g., UCSC RepeatMasker annotations for hg38, using the following syntax:
rmskhg38 <- ah[["AH99003"]]
rmskhg38
## GRanges object with 5633664 ranges and 11 metadata columns:
## seqnames ranges strand | swScore milliDiv
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr1 10001-10468 + | 463 13
## [2] chr1 15798-15849 + | 18 232
## [3] chr1 16713-16744 + | 18 137
## [4] chr1 18907-19048 + | 239 338
## [5] chr1 19972-20405 + | 994 312
## ... ... ... ... . ... ...
## [5633660] chrX_KV766199v1_alt 179150-179234 - | 255 248
## [5633661] chrX_KV766199v1_alt 184474-184785 - | 2039 173
## [5633662] chrX_KV766199v1_alt 186964-187271 - | 386 283
## [5633663] chrX_KV766199v1_alt 187486-187569 - | 270 321
## [5633664] chrX_KV766199v1_alt 187597-187822 - | 1301 102
## milliDel milliIns genoLeft repName repClass
## <numeric> <numeric> <integer> <character> <character>
## [1] 6 17 -248945954 (TAACCC)n Simple_repeat
## [2] 0 19 -248940573 (TGCTCC)n Simple_repeat
## [3] 0 0 -248939678 (TGG)n Simple_repeat
## [4] 129 0 -248937374 L2a LINE
## [5] 60 25 -248936017 L3 LINE
## ... ... ... ... ... ...
## [5633660] 69 28 -8770 MIR1_Amn SINE
## [5633661] 0 0 -3219 AluJb SINE
## [5633662] 44 78 -733 MLT1G3 LTR
## [5633663] 23 0 -435 MLT1G3 LTR
## [5633664] 47 4 -182 L1MA8 LINE
## repFamily repStart repEnd repLeft
## <character> <integer> <integer> <integer>
## [1] Simple_repeat 1 471 0
## [2] Simple_repeat 1 52 0
## [3] Simple_repeat 1 32 0
## [4] L2 2942 3104 -322
## [5] CR1 2680 3129 -970
## ... ... ... ... ...
## [5633660] MIR -80 150 63
## [5633661] Alu 0 312 1
## [5633662] ERVL-MaLR -65 526 232
## [5633663] ERVL-MaLR -414 141 56
## [5633664] L1 -5 6286 6051
## -------
## seqinfo: 711 sequences (1 circular) from hg38 genome
Note that the data is returned using a GRanges object, please consult the
vignettes from the GenomicRanges package for details on how to
manipulate this type of object. The contents of the 11 metadata columns are
described at the UCSC Genome Browser web page for the
RepeatMasker database schema.
Please consult the credits and references sections on that page for information
on how to cite these data.
The GRanges object contains further metadata accessible with the metadata()
method as follows:
metadata(rmskhg38)
## $srcurl
## [1] "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz"
##
## $srcVersion
## [1] "Sep2021"
##
## $citation
## A. Smit, R. Hubley, P. Green (1996-2010). _RepeatMasker Open-3.0_.
## <https://www.repeatmasker.org>.
##
## $gdesc
## | organism: Homo sapiens (Human)
## | provider: UCSC
## | genome: hg38
## | release date: Dec. 2013
## | ---
## | seqlengths:
## | chr1 chr2 ... chrX_MU273397v1_alt
## | 248956422 242193529 ... 330493
sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GenomicRanges_1.61.8 Seqinfo_0.99.4 IRanges_2.43.8
## [4] S4Vectors_0.47.6 AnnotationHub_3.99.6 BiocFileCache_2.99.6
## [7] dbplyr_2.5.1 BiocGenerics_0.55.4 generics_0.1.4
## [10] BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 sass_0.4.10 BiocVersion_3.22.0
## [4] RSQLite_2.4.3 digest_0.6.37 magrittr_2.0.4
## [7] evaluate_1.0.5 bookdown_0.45 fastmap_1.2.0
## [10] blob_1.2.4 jsonlite_2.0.0 AnnotationDbi_1.71.2
## [13] GenomeInfoDb_1.45.13 DBI_1.2.3 BiocManager_1.30.26
## [16] httr_1.4.7 purrr_1.1.0 UCSC.utils_1.5.1
## [19] Biostrings_2.77.2 httr2_1.2.1 jquerylib_0.1.4
## [22] cli_3.6.5 crayon_1.5.3 rlang_1.1.6
## [25] XVector_0.49.3 Biobase_2.69.1 bit64_4.6.0-1
## [28] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [31] tools_4.5.1 memoise_2.0.1 dplyr_1.1.4
## [34] filelock_1.0.3 curl_7.0.0 vctrs_0.6.5
## [37] R6_2.6.1 png_0.1-8 lifecycle_1.0.4
## [40] KEGGREST_1.49.2 bit_4.6.0 pkgconfig_2.0.3
## [43] pillar_1.11.1 bslib_0.9.0 glue_1.8.0
## [46] xfun_0.53 tibble_3.3.0 tidyselect_1.2.1
## [49] knitr_1.50 htmltools_0.5.8.1 rmarkdown_2.30
## [52] compiler_4.5.1