How to Extract Coordinates and Sequence from a FASTA File

Overview

This vignette demonstrates how to extract contiguous consensus sequence blocks from a FASTA file using rustytools, and how to annotate them with genomic features from various sources such as UCSC, GENCODE, and Ensembl.

This is particularly useful for processing FASTA files derived from consensus calling pipelines (e.g., samtools consensus) and for downstream variant or feature annotation.

Load Required Packages

rm(list = ls())
suppressPackageStartupMessages({
  library(rustytools)
  library(GenomicRanges)
})

Extracting Consensus Sequences

The get_consensus() function extracts sequence blocks containing A/C/T/G from a FASTA file and returns a GRanges object with embedded sequence metadata.

root <- file.path(.libPaths()[1], "rustytools/extdata")
tfasta <- file.path(root, "test.fa")
results <- get_consensus(tfasta, cores = 1, genome = "toy")
results[[1]]

## GRanges object with 7 ranges and 1 metadata column:
##       seqnames    ranges strand | mcols.sequence
##          <Rle> <IRanges>  <Rle> |    <character>
##   [1]     chr1   238-251      * | ACTGACCTTAAGGT
##   [2]    chr12   102-115      * | ACTGACCTTAAGGT
##   [3]    chr12   595-608      * | ACTGACCTTAAGGT
##   [4]     chr3   317-330      * | ACTGACCTTAAGGT
##   [5]     chr4   168-181      * | ACTGACCTTAAGGT
##   [6]     chr5   171-184      * | ACTGACCTTAAGGT
##   [7]     chr6    97-110      * | ACTGACCTTAAGGT
##   -------
##   seqinfo: 7 sequences from toy genome

The cores parameter defines how many threads to use (typically one per chromosome). For toy examples, multi-core use is unnecessary.

Running on Large FASTA Files

For real genomes, such as those from HLA typing or clinical samples:

fasta <- "~/Desktop/Junk/GENDX_HLA_Mix1_S1_L001_t1000.fa"

To test on only the first few contigs before running a full genome:

results <- get_consensus(fasta, cores = 16, genome = "hg38", test_with_n = 1)

Reducing and Merging Consensus Ranges

You can merge nearby regions using reduce():

reduced_results <- reduce(results[[1]], min.gapwidth = 2L)

Annotation Methods

1. UCSC KnownGene (TxDb)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(VariantAnnotation)

txdb_hg38 <- TxDb.Hsapiens.UCSC.hg38.knownGene
si <- seqinfo(txdb_hg38)

# Ensure correct seqinfo
reduced_results@seqinfo <- GenomeInfoDb::Seqinfo(
  seqnames = si@seqnames[1:22],
  seqlengths = si@seqlengths[1:22],
  isCircular = rep(NA, 22),
  genome = "hg38"
)

annotated_results <- locateVariants(reduced_results, txdb_hg38, AllVariants())
table(annotated_results$LOCATION)
annotated_results[annotated_results$LOCATION == "coding"]

2. GENCODE GTF Annotation with `txdbmaker`

library(txdbmaker)

gtf_file <- "/Users/sfurlan/Library/CloudStorage/OneDrive-SharedLibraries-FredHutchinsonCancerCenter/Furlan_Lab - General/resources/refs/gencode.v46.annotation.gtf.gz"

txdb <- makeTxDbFromGFF(gtf_file, format = "gtf", dbxrefTag = "gene_name")
annotated_results2 <- locateVariants(reduced_results, txdb, AllVariants())
head(annotated_results2)

3. Ensembl-based Annotation with `ChIPpeakAnno`

library(ChIPpeakAnno)
library(ensembldb)
library(EnsDb.Hsapiens.v86)

annoGR <- toGRanges(EnsDb.Hsapiens.v86)
seqlevelsStyle(reduced_results) <- seqlevelsStyle(annoGR)

annotated_results3 <- annoPeaks(reduced_results, annoGR, bindingType = "fullRange")
annotated_results3[grepl("^HLA", annotated_results3$gene_name)]$gene_name

Summary

get_consensus() extracts contiguous sequence blocks from FASTA files.
Results are returned as GRanges objects that are easily annotated.
Annotations can be performed using UCSC TxDb, GENCODE GTFs, or Ensembl databases.

This workflow is ideal for consensus-based variant discovery, isoform localization, or follow-up annotation after long-read assemblies.

2024-06-04