How to extract coordinates and sequence from a fasta file
2024-06-04
getConsensus.Rmd
Installing Rust
First you need to have an updated Rust installation. Go to this site to learn how to install Rust.
Installing rustytools
You will need to have the devtools package installed…
devtools::install_github("furlan-lab/rustytools", force=T)
Running a tool to extract consensus sequence data
Loading libraries
rm(list=ls())
suppressPackageStartupMessages({
library(rustytools)
})
Get consensus from fasta
This function will process a fasta file that has been processed using a tool such as samtools consensus. The function will find all regions (subsequences) of the fasta that have sequence (A, C, T, G) and output a granges object containing the features and their sequences. It is invoked as follows:
root<-file.path(.libPaths()[1], "rustytools/extdata")
tfasta=file.path(root, "test.fa")
results<-get_consensus(tfasta, cores=1, genome="toy")
results[[1]]
## GRanges object with 7 ranges and 1 metadata column:
## seqnames ranges strand | mcols.sequence
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 238-251 * | ACTGACCTTAAGGT
## [2] chr12 102-115 * | ACTGACCTTAAGGT
## [3] chr12 595-608 * | ACTGACCTTAAGGT
## [4] chr3 317-330 * | ACTGACCTTAAGGT
## [5] chr4 168-181 * | ACTGACCTTAAGGT
## [6] chr5 171-184 * | ACTGACCTTAAGGT
## [7] chr6 97-110 * | ACTGACCTTAAGGT
## -------
## seqinfo: 7 sequences from toy genome
In the above toy example we use the cores argument set to 1. get_consensus works by splitting work for each contig (chromosome) across cores. In the above example, it doesn’t make sense to split the contigs across cores because the contig has about 500-600 bases. When running this on real fasta sequence from a large genome, one would want to leverage parallel processing and process each contig using a core. In this next block we show how to run get_consensus on a big fasta.
fasta="~/Desktop/GENDX_HLA_Mix1_S1_L001_t1000.fa"
results<-get_consensus(fasta, cores=16, genome="hg38")
Before running a full fasta on all contigs, you can check the first n contigs of a full sized fasta using the argument test_with_n. Note that the full fasta will be read into memory to ensure appropriate seqinfo data
results<-get_consensus(fasta, cores=16, genome="hg38", test_with_n = 1)
Merge
library(GenomicRanges)
reduced_results<-reduce(results[[1]], min.gapwidth = 2L)
Annotation 1
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(VariantAnnotation)
txdb_hg38 <- TxDb.Hsapiens.UCSC.hg38.knownGene
si<-seqinfo(txdb_hg38)
reduced_results@seqinfo<-GenomeInfoDb::Seqinfo(seqnames = si@seqnames[1:22], seqlengths = si@seqlengths[1:22], isCircular = rep(NA, 22), genome = "hg38")
annotated_results <- locateVariants(reduced_results, txdb_hg38, AllVariants())
table(annotated_results$LOCATION)
##
## spliceSite intron fiveUTR threeUTR coding intergenic promoter
## 438 102026 181 810 985 8989 4482
annotated_results[annotated_results$LOCATION=="coding"]
## GRanges object with 985 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## [1] chr1 10639562-10639580 - | coding 4642 4660
## [2] chr1 11027568-11027586 - | coding 1162 1180
## [3] chr1 11027568-11027586 - | coding 1339 1357
## [4] chr1 11027568-11027586 - | coding 1336 1354
## [5] chr1 11027568-11027586 - | coding 1360 1378
## ... ... ... ... . ... ... ...
## [981] chr22 29991526-29991544 + | coding 316 334
## [982] chr22 29991526-29991544 + | coding 316 334
## [983] chr22 29991526-29991544 + | coding 316 334
## [984] chr22 37710509-37710527 + | coding 197 215
## [985] chr22 37710509-37710527 + | coding 197 215
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## [1] 90 12536 14804 54897
## [2] 96 12554 14845 10747
## [3] 96 12555 14845 10747
## [4] 96 12556 14845 10747
## [5] 96 12560 14845 10747
## ... ... ... ... ... ...
## [981] 22248 239582 263033,263032 8897
## [982] 22248 239584 263033,263032 8897
## [983] 22248 239586 263033,263032 8897
## [984] 22310 240165 263652 11078
## [985] 22310 240167 263652 11078
## FOLLOWID
## <CharacterList>
## [1]
## [2]
## [3]
## [4]
## [5]
## ... ...
## [981]
## [982]
## [983]
## [984]
## [985]
## -------
## seqinfo: 22 sequences from hg38 genome
Annotation 2
# devtools::install_github("Bioconductor/UCSC.utils")
# devtools::install_github("Bioconductor/GenomeInfoDb")
# devtools::install_github("grimbough/biomaRt")
# devtools::install_github("Bioconductor/txdbmaker")
# BiocManager::install("EnsDb.Hsapiens.v86")
library(txdbmaker)
gtf_file<-"/Users/sfurlan/Library/CloudStorage/OneDrive-SharedLibraries-FredHutchinsonCancerCenter/Furlan_Lab - General/resources/refs/gencode.v46.annotation.gtf.gz"
txdb <- makeTxDbFromGFF(gtf_file, format="gtf", dbxrefTag = "gene_name")
annotated_results2 <- locateVariants(reduced_results, txdb, AllVariants())
annotated_results2
## GRanges object with 120299 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## [1] chr1 832502-832521 - | intron 8576 8595
## [2] chr1 832502-832521 - | intron 8023 8042
## [3] chr1 832502-832521 - | intron 7470 7489
## [4] chr1 832502-832521 - | intron 29775 29794
## [5] chr1 832502-832521 - | intron 6950 6969
## ... ... ... ... . ... ... ...
## [120295] chr22 40018446-40018463 + | intron 24286 24303
## [120296] chr22 40018446-40018463 + | intron 45916 45933
## [120297] chr22 40018446-40018463 + | intron 22915 22932
## [120298] chr22 40018446-40018463 + | intron 11525 11542
## [120299] chr22 40018446-40018463 + | intron 8322 8339
## QUERYID TXID CDSID GENEID
## <integer> <character> <IntegerList> <character>
## [1] 1 227999 ENSG00000167384.11
## [2] 1 228000 ENSG00000167384.11
## [3] 1 228001 ENSG00000167384.11
## [4] 1 228001 ENSG00000167384.11
## [5] 1 228002 ENSG00000167384.11
## ... ... ... ... ...
## [120295] 22329 217276 ENSG00000127616.22
## [120296] 22329 217276 ENSG00000127616.22
## [120297] 22329 217277 ENSG00000127616.22
## [120298] 22329 217278 ENSG00000127616.22
## [120299] 22329 217279 ENSG00000127616.22
## PRECEDEID FOLLOWID
## <CharacterList> <CharacterList>
## [1]
## [2]
## [3]
## [4]
## [5]
## ... ... ...
## [120295]
## [120296]
## [120297]
## [120298]
## [120299]
## -------
## seqinfo: 22 sequences from hg38 genome
Annotation 3
library(ChIPpeakAnno)
library(ensembldb)
library(EnsDb.Hsapiens.v86)
annoGR <- toGRanges(EnsDb.Hsapiens.v86)
seqlevelsStyle(reduced_results) <- seqlevelsStyle(annoGR)
annotated_results3<-annoPeaks(reduced_results, annoGR, bindingType = "fullRange")
annotated_results3[grepl("^HLA", annotated_results3$gene_name)]$gene_name
## [1] "HLA-F-AS1" "HLA-G" "HLA-L" "HLA-L" "HLA-DRB9" "HLA-DRB9"
## [7] "HLA-DRB9" "HLA-DRB9" "HLA-DRB9" "HLA-DRB6" "HLA-DRB1" "HLA-DRB6"
## [13] "HLA-DRB1" "HLA-Z" "HLA-Z" "HLA-DOA" "HLA-DPA1" "HLA-DPA3"
Appendix
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] EnsDb.Hsapiens.v86_2.99.0
## [2] ensembldb_2.26.0
## [3] AnnotationFilter_1.26.0
## [4] ChIPpeakAnno_3.36.1
## [5] txdbmaker_1.1.0
## [6] VariantAnnotation_1.48.1
## [7] Rsamtools_2.18.0
## [8] Biostrings_2.70.2
## [9] XVector_0.42.0
## [10] SummarizedExperiment_1.32.0
## [11] MatrixGenerics_1.14.0
## [12] matrixStats_1.2.0
## [13] TxDb.Hsapiens.UCSC.hg38.knownGene_3.18.0
## [14] GenomicFeatures_1.54.3
## [15] AnnotationDbi_1.64.1
## [16] Biobase_2.62.0
## [17] GenomicRanges_1.54.1
## [18] GenomeInfoDb_1.41.1
## [19] IRanges_2.36.0
## [20] S4Vectors_0.40.2
## [21] BiocGenerics_0.48.1
## [22] rustytools_0.0.1
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.15.0 jsonlite_1.8.8 magrittr_2.0.3
## [4] rmarkdown_2.25 fs_1.6.3 BiocIO_1.12.0
## [7] zlibbioc_1.48.0 ragg_1.2.7 vctrs_0.6.5
## [10] multtest_2.58.0 memoise_2.0.1 RCurl_1.98-1.14
## [13] htmltools_0.5.7 S4Arrays_1.2.0 progress_1.2.3
## [16] lambda.r_1.2.4 curl_5.2.0 SparseArray_1.2.4
## [19] sass_0.4.8 bslib_0.6.1 desc_1.4.3
## [22] httr2_1.0.0 futile.options_1.0.1 cachem_1.0.8
## [25] GenomicAlignments_1.38.2 lifecycle_1.0.4 pkgconfig_2.0.3
## [28] Matrix_1.6-5 R6_2.5.1 fastmap_1.1.1
## [31] GenomeInfoDbData_1.2.11 digest_0.6.34 colorspace_2.1-0
## [34] regioneR_1.34.0 textshaping_0.3.7 RSQLite_2.3.5
## [37] filelock_1.0.3 fansi_1.0.6 httr_1.4.7
## [40] abind_1.4-5 compiler_4.3.2 bit64_4.0.5
## [43] BiocParallel_1.36.0 DBI_1.2.2 biomaRt_2.61.0
## [46] MASS_7.3-60 rappdirs_0.3.3 DelayedArray_0.28.0
## [49] rjson_0.2.21 tools_4.3.2 glue_1.7.0
## [52] VennDiagram_1.7.3 restfulr_0.0.15 InteractionSet_1.30.0
## [55] grid_4.3.2 ade4_1.7-22 generics_0.1.3
## [58] seqinr_4.2-36 gtable_0.3.4 BSgenome_1.70.2
## [61] hms_1.1.3 xml2_1.3.6 utf8_1.2.4
## [64] pillar_1.9.0 stringr_1.5.1 splines_4.3.2
## [67] dplyr_1.1.4 BiocFileCache_2.10.1 lattice_0.21-9
## [70] survival_3.5-7 rtracklayer_1.62.0 bit_4.0.5
## [73] tidyselect_1.2.0 RBGL_1.78.0 knitr_1.45
## [76] ProtGenerics_1.34.0 futile.logger_1.4.3 xfun_0.42
## [79] stringi_1.8.3 UCSC.utils_1.1.0 lazyeval_0.2.2
## [82] yaml_2.3.8 evaluate_0.23 codetools_0.2-19
## [85] tibble_3.2.1 graph_1.80.0 cli_3.6.2
## [88] pbmcapply_1.5.1 systemfonts_1.0.5 munsell_0.5.0
## [91] jquerylib_0.1.4 Rcpp_1.0.12 dbplyr_2.4.0
## [94] png_0.1-8 XML_3.99-0.16.1 parallel_4.3.2
## [97] pkgdown_2.0.7 ggplot2_3.5.0 blob_1.2.4
## [100] prettyunits_1.2.0 bitops_1.0-7 scales_1.3.0
## [103] purrr_1.0.2 crayon_1.5.2 rlang_1.1.3
## [106] KEGGREST_1.42.0 formatR_1.14