How to Extract Coordinates and Sequence from a FASTA File
2024-06-04
getConsensus.Rmd
Overview
This vignette demonstrates how to extract contiguous consensus
sequence blocks from a FASTA file using rustytools
, and how
to annotate them with genomic features from various sources such as
UCSC, GENCODE, and Ensembl.
This is particularly useful for processing FASTA files derived from
consensus calling pipelines (e.g., samtools consensus
) and
for downstream variant or feature annotation.
Load Required Packages
rm(list = ls())
suppressPackageStartupMessages({
library(rustytools)
library(GenomicRanges)
})
Extracting Consensus Sequences
The get_consensus()
function extracts sequence blocks
containing A/C/T/G from a FASTA file and returns a GRanges
object with embedded sequence metadata.
root <- file.path(.libPaths()[1], "rustytools/extdata")
tfasta <- file.path(root, "test.fa")
results <- get_consensus(tfasta, cores = 1, genome = "toy")
results[[1]]
## GRanges object with 7 ranges and 1 metadata column:
## seqnames ranges strand | mcols.sequence
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 238-251 * | ACTGACCTTAAGGT
## [2] chr12 102-115 * | ACTGACCTTAAGGT
## [3] chr12 595-608 * | ACTGACCTTAAGGT
## [4] chr3 317-330 * | ACTGACCTTAAGGT
## [5] chr4 168-181 * | ACTGACCTTAAGGT
## [6] chr5 171-184 * | ACTGACCTTAAGGT
## [7] chr6 97-110 * | ACTGACCTTAAGGT
## -------
## seqinfo: 7 sequences from toy genome
The cores
parameter defines how many threads to use
(typically one per chromosome). For toy examples, multi-core use is
unnecessary.
Running on Large FASTA Files
For real genomes, such as those from HLA typing or clinical samples:
fasta <- "~/Desktop/Junk/GENDX_HLA_Mix1_S1_L001_t1000.fa"
To test on only the first few contigs before running a full genome:
results <- get_consensus(fasta, cores = 16, genome = "hg38", test_with_n = 1)
Reducing and Merging Consensus Ranges
You can merge nearby regions using reduce()
:
reduced_results <- reduce(results[[1]], min.gapwidth = 2L)
Annotation Methods
1. UCSC KnownGene (TxDb)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(VariantAnnotation)
txdb_hg38 <- TxDb.Hsapiens.UCSC.hg38.knownGene
si <- seqinfo(txdb_hg38)
# Ensure correct seqinfo
reduced_results@seqinfo <- GenomeInfoDb::Seqinfo(
seqnames = si@seqnames[1:22],
seqlengths = si@seqlengths[1:22],
isCircular = rep(NA, 22),
genome = "hg38"
)
annotated_results <- locateVariants(reduced_results, txdb_hg38, AllVariants())
table(annotated_results$LOCATION)
annotated_results[annotated_results$LOCATION == "coding"]
2. GENCODE GTF Annotation with txdbmaker
library(txdbmaker)
gtf_file <- "/Users/sfurlan/Library/CloudStorage/OneDrive-SharedLibraries-FredHutchinsonCancerCenter/Furlan_Lab - General/resources/refs/gencode.v46.annotation.gtf.gz"
txdb <- makeTxDbFromGFF(gtf_file, format = "gtf", dbxrefTag = "gene_name")
annotated_results2 <- locateVariants(reduced_results, txdb, AllVariants())
head(annotated_results2)
3. Ensembl-based Annotation with ChIPpeakAnno
library(ChIPpeakAnno)
library(ensembldb)
library(EnsDb.Hsapiens.v86)
annoGR <- toGRanges(EnsDb.Hsapiens.v86)
seqlevelsStyle(reduced_results) <- seqlevelsStyle(annoGR)
annotated_results3 <- annoPeaks(reduced_results, annoGR, bindingType = "fullRange")
annotated_results3[grepl("^HLA", annotated_results3$gene_name)]$gene_name
Summary
-
get_consensus()
extracts contiguous sequence blocks from FASTA files. - Results are returned as
GRanges
objects that are easily annotated. - Annotations can be performed using UCSC TxDb, GENCODE GTFs, or Ensembl databases.
This workflow is ideal for consensus-based variant discovery, isoform localization, or follow-up annotation after long-read assemblies.