Bioc.gff 0.99.17
The Bioc.gff package provides support for the General Feature Format (GFF) file format, which is widely used for representing genomic features and annotations. This package allows users to read, write, and manipulate GFF versions 1, 2 (GTF), and 3 in R, making it easier to work with genomic data.
While the rtracklayer package offers robust GFF support, it is a
large package with many features beyond file import. Bioc.gff
fills a specific
niche in the Bioconductor ecosystem by providing a lightweight, focused solution
with minimal dependencies. This modularity is beneficial for developers of other
packages who need to parse GFF files without inheriting the extensive dependency
footprint of rtracklayer
. For the end user, it offers a simple and direct
interface for GFF manipulation.
Note that much of the code in the package is ported and adapted from the
rtracklayer Bioconductor package. The intention is that the GFF
functionality residing in rtracklayer will be removed from
that package in favor of using Bioc.gff
, thereby reducing its size and
complexity.
You can install the Bioc.gff
package from Bioconductor using the following:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Bioc.gff")
library(Bioc.gff)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: Seqinfo
The Bioc.gff
package provides functions to read and write GFF files, as well
as to manipulate GFF data structures. The following sections provide some
examples of how to use the package.
You can read GFF files using the readGFF
function. This function supports
GFF versions 1, 2, and 3. The following example demonstrates how to read a GFF
file:
gff_file <- system.file("extdata", "genes.gff3", package = "Bioc.gff")
import
functionThe import
function deduces that the input string is a GFF file type and calls
the appropriate method for reading the GFF file.
import(gff_file)
#> GRanges object with 31 ranges and 10 metadata columns:
#> seqnames ranges strand | source type score phase
#> <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
#> [1] chr10 92828-95504 - | rtracklayer gene 5 <NA>
#> [2] chr10 92828-95178 - | rtracklayer mRNA NA <NA>
#> [3] chr10 92828-95504 - | rtracklayer mRNA NA <NA>
#> [4] chr10 92828-94054 - | rtracklayer exon NA <NA>
#> [5] chr10 92997-94054 - | rtracklayer CDS NA <NA>
#> ... ... ... ... . ... ... ... ...
#> [27] chr12 89675-89827 + | rtracklayer CDS NA <NA>
#> [28] chr12 90587-90655 + | rtracklayer exon NA <NA>
#> [29] chr12 90587-90655 + | rtracklayer CDS NA <NA>
#> [30] chr12 90796-91263 + | rtracklayer exon NA <NA>
#> [31] chr12 90796-91263 * | rtracklayer CDS NA <NA>
#> ID Name geneName Alias genome
#> <character> <character> <character> <list> <character>
#> [1] GeneID:347688 TUBB8 tubulin, beta 8 FLJ40100,TUBB8 hg19
#> [2] 873 TUBB8 <NA> <NA>
#> [3] 872 TUBB8 <NA> <NA>
#> [4] <NA> <NA> <NA> <NA>
#> [5] <NA> <NA> <NA> <NA>
#> ... ... ... ... ... ...
#> [27] <NA> <NA> <NA> <NA>
#> [28] <NA> <NA> <NA> <NA>
#> [29] <NA> <NA> <NA> <NA>
#> [30] <NA> <NA> <NA> <NA>
#> [31] <NA> <NA> <NA> <NA>
#> Parent
#> <list>
#> [1]
#> [2] GeneID:347688
#> [3] GeneID:347688
#> [4] 872,873
#> [5] 872,873
#> ... ...
#> [27] 4644
#> [28] 4644
#> [29] 4644
#> [30] 4644
#> [31] 4644
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
Here the output object is a GRanges
class, which contains genomic coordinates
and associated metadata. The GRanges
object indicates the ranges with the
seqnames
, ranges
, and strand
, columns. The associated metadata is
stored in the mcols
slot, which (in this example) includes attributes like
source
, type
, phase
, ID
, Name
, geneName
, Alias
and genome
.
You can also selectively import ranges from the GFF file using the which
argument. This allows you to filter the data based on specific criteria, such as
chromosome or strand. For example, to import only the ranges on chromosome 1:
which <- GRanges("chr10:90000-93000")
import(gff_file, which = which, genome = "hg19")
#> GRanges object with 5 ranges and 10 metadata columns:
#> seqnames ranges strand | source type score phase
#> <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
#> [1] chr10 92828-95504 - | rtracklayer gene 5 <NA>
#> [2] chr10 92828-95178 - | rtracklayer mRNA NA <NA>
#> [3] chr10 92828-95504 - | rtracklayer mRNA NA <NA>
#> [4] chr10 92828-94054 - | rtracklayer exon NA <NA>
#> [5] chr10 92997-94054 - | rtracklayer CDS NA <NA>
#> ID Name geneName Alias genome
#> <character> <character> <character> <list> <character>
#> [1] GeneID:347688 TUBB8 tubulin, beta 8 FLJ40100,TUBB8 hg19
#> [2] 873 TUBB8 <NA> <NA>
#> [3] 872 TUBB8 <NA> <NA>
#> [4] <NA> <NA> <NA> <NA>
#> [5] <NA> <NA> <NA> <NA>
#> Parent
#> <list>
#> [1]
#> [2] GeneID:347688
#> [3] GeneID:347688
#> [4] 872,873
#> [5] 872,873
#> -------
#> seqinfo: 298 sequences (2 circular) from hg19 genome
Note that you can indicate the genome build using the genome
argument, which
is useful for ensuring that the imported data is compatible with other genomic
data you may be working with.
readGFF
functionAs an alternative, you can use the readGFF
function directly, which is
more explicit about the GFF version:
readGFF(gff_file, version = "3")
#> DataFrame with 31 rows and 14 columns
#> seqid source type start end score strand
#> <factor> <factor> <factor> <integer> <integer> <numeric> <character>
#> 1 chr10 rtracklayer gene 92828 95504 5 -
#> 2 chr10 rtracklayer mRNA 92828 95178 NA -
#> 3 chr10 rtracklayer mRNA 92828 95504 NA -
#> 4 chr10 rtracklayer exon 92828 94054 NA -
#> 5 chr10 rtracklayer CDS 92997 94054 NA -
#> ... ... ... ... ... ... ... ...
#> 27 chr12 rtracklayer CDS 89675 89827 NA +
#> 28 chr12 rtracklayer exon 90587 90655 NA +
#> 29 chr12 rtracklayer CDS 90587 90655 NA +
#> 30 chr12 rtracklayer exon 90796 91263 NA +
#> 31 chr12 rtracklayer CDS 90796 91263 NA *
#> phase ID Name geneName Alias
#> <integer> <character> <character> <character> <list>
#> 1 NA GeneID:347688 TUBB8 tubulin, beta 8 FLJ40100,TUBB8
#> 2 NA 873 TUBB8 NA
#> 3 NA 872 TUBB8 NA
#> 4 NA NA NA NA
#> 5 NA NA NA NA
#> ... ... ... ... ... ...
#> 27 NA NA NA NA
#> 28 NA NA NA NA
#> 29 NA NA NA NA
#> 30 NA NA NA NA
#> 31 NA NA NA NA
#> genome Parent
#> <character> <list>
#> 1 hg19
#> 2 NA GeneID:347688
#> 3 NA GeneID:347688
#> 4 NA 872,873
#> 5 NA 872,873
#> ... ... ...
#> 27 NA 4644
#> 28 NA 4644
#> 29 NA 4644
#> 30 NA 4644
#> 31 NA 4644
This function returns a DataFrame
object, which contains the same information
as the GRanges
object but in a tabular format. Note that one can use the
makeGRangesFromDataFrame
function to convert the DataFrame
returned by
readGFF
into a GRanges
object, which is often more convenient for downstream
analysis:
readGFF(gff_file, version = "3") |>
makeGRangesFromDataFrame(
keep.extra.columns = TRUE
)
#> GRanges object with 31 ranges and 10 metadata columns:
#> seqnames ranges strand | source type score phase
#> <Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
#> [1] chr10 92828-95504 - | rtracklayer gene 5 <NA>
#> [2] chr10 92828-95178 - | rtracklayer mRNA NA <NA>
#> [3] chr10 92828-95504 - | rtracklayer mRNA NA <NA>
#> [4] chr10 92828-94054 - | rtracklayer exon NA <NA>
#> [5] chr10 92997-94054 - | rtracklayer CDS NA <NA>
#> ... ... ... ... . ... ... ... ...
#> [27] chr12 89675-89827 + | rtracklayer CDS NA <NA>
#> [28] chr12 90587-90655 + | rtracklayer exon NA <NA>
#> [29] chr12 90587-90655 + | rtracklayer CDS NA <NA>
#> [30] chr12 90796-91263 + | rtracklayer exon NA <NA>
#> [31] chr12 90796-91263 * | rtracklayer CDS NA <NA>
#> ID Name geneName Alias genome
#> <character> <character> <character> <list> <character>
#> [1] GeneID:347688 TUBB8 tubulin, beta 8 FLJ40100,TUBB8 hg19
#> [2] 873 TUBB8 <NA> <NA>
#> [3] 872 TUBB8 <NA> <NA>
#> [4] <NA> <NA> <NA> <NA>
#> [5] <NA> <NA> <NA> <NA>
#> ... ... ... ... ... ...
#> [27] <NA> <NA> <NA> <NA>
#> [28] <NA> <NA> <NA> <NA>
#> [29] <NA> <NA> <NA> <NA>
#> [30] <NA> <NA> <NA> <NA>
#> [31] <NA> <NA> <NA> <NA>
#> Parent
#> <list>
#> [1]
#> [2] GeneID:347688
#> [3] GeneID:347688
#> [4] 872,873
#> [5] 872,873
#> ... ...
#> [27] 4644
#> [28] 4644
#> [29] 4644
#> [30] 4644
#> [31] 4644
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
You can also read GFF files from remote URLs. The import
function can handle
URLs directly, allowing you to read GFF files hosted on remote servers.
In this example, we read a GFF3 file from the miRBase database:
remote_gff_url <- "https://www.mirbase.org/download/hsa.gff3"
import(remote_gff_url, version = "3")
To show the example output in the vignette, a more advanced approach is used
below. This is done to avoid repeated downloads of the remote file when the
vignette is re-built on the Bioconductor Build System (BBS). We use the
BiocFileCache
package to cache the file locally.
library(BiocFileCache)
#> Loading required package: dbplyr
bfc <- BiocFileCache::BiocFileCache()
remote_gff_url <- "https://www.mirbase.org/download/hsa.gff3"
bquery <- bfcquery(bfc, remote_gff_url, "rname", exact = TRUE)
if (!nrow(bquery))
bfcadd(x = bfc, rname = remote_gff_url, rtype = "web", download = TRUE)
gff_local <- bfcrpath(
bfc, rnames = remote_gff_url, exact = TRUE, download = FALSE, rtype = "web"
)
Finally, the relevant function remains the same for reading a GFF file i.e.,
via import()
:
import(gff_local, version = "3")
#> GRanges object with 4801 ranges and 8 metadata columns:
#> seqnames ranges strand | source type
#> <Rle> <IRanges> <Rle> | <factor> <factor>
#> [1] chr1 17369-17436 - | NA miRNA_primary_transcript
#> [2] chr1 17409-17431 - | NA miRNA
#> [3] chr1 17369-17391 - | NA miRNA
#> [4] chr1 30366-30503 + | NA miRNA_primary_transcript
#> [5] chr1 30438-30458 + | NA miRNA
#> ... ... ... ... . ... ...
#> [4797] chrY 2609229-2609252 + | NA miRNA
#> [4798] chrY 4606120-4606228 + | NA miRNA_primary_transcript
#> [4799] chrY 4606140-4606158 + | NA miRNA
#> [4800] chrY 13479177-13479266 + | NA miRNA_primary_transcript
#> [4801] chrY 13479189-13479214 + | NA miRNA
#> score phase ID Alias Name
#> <numeric> <integer> <character> <list> <character>
#> [1] NA <NA> MI0022705 MI0022705 hsa-mir-6859-1
#> [2] NA <NA> MIMAT0027618 MIMAT0027618 hsa-miR-6859-5p
#> [3] NA <NA> MIMAT0027619 MIMAT0027619 hsa-miR-6859-3p
#> [4] NA <NA> MI0006363 MI0006363 hsa-mir-1302-2
#> [5] NA <NA> MIMAT0005890 MIMAT0005890 hsa-miR-1302
#> ... ... ... ... ... ...
#> [4797] NA <NA> MIMAT0023714_1 MIMAT0023714 hsa-miR-6089
#> [4798] NA <NA> MI0032313 MI0032313 hsa-mir-9985
#> [4799] NA <NA> MIMAT0039763 MIMAT0039763 hsa-miR-9985
#> [4800] NA <NA> MI0039722 MI0039722 hsa-mir-12120
#> [4801] NA <NA> MIMAT0049014 MIMAT0049014 hsa-miR-12120
#> Derives_from
#> <character>
#> [1] <NA>
#> [2] MI0022705
#> [3] MI0022705
#> [4] <NA>
#> [5] MI0006363
#> ... ...
#> [4797] MI0023563
#> [4798] <NA>
#> [4799] MI0032313
#> [4800] <NA>
#> [4801] MI0039722
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
While TxDb
objects are highly efficient for querying transcript annotations
within R, it is often necessary to export this data to a standard, portable
format for use with external tools or for sharing with collaborators. The GFF3
format is a widely accepted standard for this purpose. Converting a TxDb
or a
derivative object (like a GRangesList
of exons) into a GFF3-compatible
GRanges
object allows for easy export. This is particularly useful for
visualizing annotations in genome browsers like IGV or UCSC, or for input into
downstream analysis pipelines that expect GFF3 files.
library(GenomicFeatures)
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exonsBy(txdb, by = "tx") |>
asGFF()
#> GRanges object with 825453 ranges and 7 metadata columns:
#> seqnames ranges strand | type ID
#> <Rle> <IRanges> <Rle> | <character> <character>
#> [1] chr1 11874-14409 + | mRNA mRNA1
#> [2] chr1 11874-14409 + | mRNA mRNA2
#> [3] chr1 11874-14409 + | mRNA mRNA3
#> [4] chr1 69091-70008 + | mRNA mRNA4
#> [5] chr1 321084-321115 + | mRNA mRNA5
#> ... ... ... ... . ... ...
#> [825449] chrUn_gl000241 22732-22846 - | exon exon742489
#> [825450] chrUn_gl000241 20433-20481 - | exon exon742490
#> [825451] chrUn_gl000243 11501-11530 + | exon exon742491
#> [825452] chrUn_gl000243 13608-13637 + | exon exon742492
#> [825453] chrUn_gl000247 5787-5816 - | exon exon742493
#> Name exon_id exon_name exon_rank Parent
#> <character> <integer> <character> <integer> <character>
#> [1] 1 <NA> <NA> <NA> <NA>
#> [2] 2 <NA> <NA> <NA> <NA>
#> [3] 3 <NA> <NA> <NA> <NA>
#> [4] 4 <NA> <NA> <NA> <NA>
#> [5] 5 <NA> <NA> <NA> <NA>
#> ... ... ... ... ... ...
#> [825449] <NA> 289961 <NA> 6 mRNA82957
#> [825450] <NA> 289960 <NA> 7 mRNA82957
#> [825451] <NA> 289967 <NA> 1 mRNA82958
#> [825452] <NA> 289968 <NA> 1 mRNA82959
#> [825453] <NA> 289969 <NA> 1 mRNA82960
#> -------
#> seqinfo: 93 sequences (1 circular) from hg19 genome
You can convert a GFF
object to a TxDb
object using the makeTxDbFromGFF
in the txdbmaker package. This is useful for creating a
transcript database from GFF annotations, which can then be used for various
genomic analyses.
library(txdbmaker)
#>
#> Attaching package: 'txdbmaker'
#> The following objects are masked from 'package:GenomicFeatures':
#>
#> UCSCFeatureDbTableSchema, browseUCSCtrack, getChromInfoFromBiomart,
#> makeFDbPackageFromUCSC, makeFeatureDbFromUCSC, makePackageName,
#> makeTxDb, makeTxDbFromBiomart, makeTxDbFromEnsembl,
#> makeTxDbFromGFF, makeTxDbFromGRanges, makeTxDbFromUCSC,
#> makeTxDbPackage, makeTxDbPackageFromBiomart,
#> makeTxDbPackageFromUCSC, supportedMiRBaseBuildValues,
#> supportedUCSCFeatureDbTables, supportedUCSCFeatureDbTracks,
#> supportedUCSCtables
txdb <- makeTxDbFromGFF(
file = gff_local,
format = "gff3",
dataSource = "https://www.mirbase.org/download/hsa.gff3",
organism = "Homo sapiens",
taxonomyId = 9606
)
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ...
#> Warning in .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, : the transcript names ("tx_name" column in the TxDb object) imported from the
#> "Name" attribute are not unique
#> Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information
#> is not available for this TxDb object
#> OK
genome <- grepv("genome-build-id", readLines(gff_local)) |>
strsplit("# genome-build-id:\\s+") |>
unlist() |>
tail(1L)
genome(txdb) <- genome
txdb
#> TxDb object:
#> # Db type: TxDb
#> # Supporting package: GenomicFeatures
#> # Data source: https://www.mirbase.org/download/hsa.gff3
#> # Organism: Homo sapiens
#> # Taxonomy ID: 9606
#> # Genome: NA
#> # Nb of transcripts: 4801
#> # Db created by: txdbmaker package from Bioconductor
#> # Creation time: 2025-09-25 16:28:58 -0400 (Thu, 25 Sep 2025)
#> # txdbmaker version at creation time: 1.5.6
#> # RSQLite version at creation time: 2.4.3
#> # DBSCHEMAVERSION: 1.2
sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] txdbmaker_1.5.6
#> [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [3] GenomicFeatures_1.61.6
#> [4] AnnotationDbi_1.71.1
#> [5] Biobase_2.69.1
#> [6] BiocFileCache_2.99.6
#> [7] dbplyr_2.5.1
#> [8] GenomicRanges_1.61.5
#> [9] Seqinfo_0.99.2
#> [10] IRanges_2.43.2
#> [11] S4Vectors_0.47.2
#> [12] BiocGenerics_0.55.1
#> [13] generics_0.1.4
#> [14] Bioc.gff_0.99.17
#> [15] BiocStyle_2.37.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4
#> [3] blob_1.2.4 filelock_1.0.3
#> [5] Biostrings_2.77.2 bitops_1.0-9
#> [7] fastmap_1.2.0 RCurl_1.98-1.17
#> [9] GenomicAlignments_1.45.4 XML_3.99-0.19
#> [11] digest_0.6.37 lifecycle_1.0.4
#> [13] KEGGREST_1.49.1 RSQLite_2.4.3
#> [15] magrittr_2.0.4 compiler_4.5.1
#> [17] rlang_1.1.6 sass_0.4.10
#> [19] progress_1.2.3 tools_4.5.1
#> [21] yaml_2.3.10 rtracklayer_1.69.1
#> [23] knitr_1.50 prettyunits_1.2.0
#> [25] S4Arrays_1.9.1 bit_4.6.0
#> [27] curl_7.0.0 DelayedArray_0.35.3
#> [29] abind_1.4-8 BiocParallel_1.43.4
#> [31] withr_3.0.2 purrr_1.1.0
#> [33] grid_4.5.1 biomaRt_2.65.15
#> [35] SummarizedExperiment_1.39.2 cli_3.6.5
#> [37] rmarkdown_2.29 crayon_1.5.3
#> [39] httr_1.4.7 rjson_0.2.23
#> [41] BiocBaseUtils_1.11.2 DBI_1.2.3
#> [43] cachem_1.1.0 stringr_1.5.2
#> [45] parallel_4.5.1 BiocManager_1.30.26
#> [47] XVector_0.49.1 restfulr_0.0.16
#> [49] matrixStats_1.5.0 vctrs_0.6.5
#> [51] Matrix_1.7-4 jsonlite_2.0.0
#> [53] bookdown_0.44 hms_1.1.3
#> [55] bit64_4.6.0-1 jquerylib_0.1.4
#> [57] glue_1.8.0 codetools_0.2-20
#> [59] stringi_1.8.7 GenomeInfoDb_1.45.11
#> [61] BiocIO_1.19.0 UCSC.utils_1.5.0
#> [63] tibble_3.3.0 pillar_1.11.1
#> [65] rappdirs_0.3.3 htmltools_0.5.8.1
#> [67] GenomeInfoDbData_1.2.14 R6_2.6.1
#> [69] httr2_1.2.1 evaluate_1.0.5
#> [71] lattice_0.22-7 png_0.1-8
#> [73] Rsamtools_2.25.3 memoise_2.0.1
#> [75] bslib_0.9.0 SparseArray_1.9.1
#> [77] xfun_0.53 MatrixGenerics_1.21.0
#> [79] pkgconfig_2.0.3