Skip to contents

Installing Rust

First you need to have an updated Rust installation. Go to this site to learn how to install Rust.

Installing rustytools

You will need to have the devtools package installed…

devtools::install_github("furlan-lab/rustytools", force=T)

Running a tool to extract consensus sequence data

Loading libraries

Get consensus from fasta

This function will process a fasta file that has been processed using a tool such as samtools consensus. The function will find all regions (subsequences) of the fasta that have sequence (A, C, T, G) and output a granges object containing the features and their sequences. It is invoked as follows:

root<-file.path(.libPaths()[1], "rustytools/extdata")
tfasta=file.path(root, "test.fa")
results<-get_consensus(tfasta, cores=1, genome="toy")
results[[1]]
## GRanges object with 7 ranges and 1 metadata column:
##       seqnames    ranges strand | mcols.sequence
##          <Rle> <IRanges>  <Rle> |    <character>
##   [1]     chr1   238-251      * | ACTGACCTTAAGGT
##   [2]    chr12   102-115      * | ACTGACCTTAAGGT
##   [3]    chr12   595-608      * | ACTGACCTTAAGGT
##   [4]     chr3   317-330      * | ACTGACCTTAAGGT
##   [5]     chr4   168-181      * | ACTGACCTTAAGGT
##   [6]     chr5   171-184      * | ACTGACCTTAAGGT
##   [7]     chr6    97-110      * | ACTGACCTTAAGGT
##   -------
##   seqinfo: 7 sequences from toy genome

In the above toy example we use the cores argument set to 1. get_consensus works by splitting work for each contig (chromosome) across cores. In the above example, it doesn’t make sense to split the contigs across cores because the contig has about 500-600 bases. When running this on real fasta sequence from a large genome, one would want to leverage parallel processing and process each contig using a core. In this next block we show how to run get_consensus on a big fasta.

fasta="~/Desktop/GENDX_HLA_Mix1_S1_L001_t1000.fa"
results<-get_consensus(fasta, cores=16, genome="hg38")

Before running a full fasta on all contigs, you can check the first n contigs of a full sized fasta using the argument test_with_n. Note that the full fasta will be read into memory to ensure appropriate seqinfo data

results<-get_consensus(fasta, cores=16, genome="hg38", test_with_n = 1)

Merge

library(GenomicRanges)
reduced_results<-reduce(results[[1]], min.gapwidth = 2L)

Annotation 1

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(VariantAnnotation)
txdb_hg38 <- TxDb.Hsapiens.UCSC.hg38.knownGene
si<-seqinfo(txdb_hg38)
reduced_results@seqinfo<-GenomeInfoDb::Seqinfo(seqnames = si@seqnames[1:22], seqlengths = si@seqlengths[1:22], isCircular = rep(NA, 22), genome = "hg38")
annotated_results <- locateVariants(reduced_results, txdb_hg38, AllVariants())
table(annotated_results$LOCATION)
## 
## spliceSite     intron    fiveUTR   threeUTR     coding intergenic   promoter 
##        438     102026        181        810        985       8989       4482
annotated_results[annotated_results$LOCATION=="coding"]
## GRanges object with 985 ranges and 9 metadata columns:
##         seqnames            ranges strand | LOCATION  LOCSTART    LOCEND
##            <Rle>         <IRanges>  <Rle> | <factor> <integer> <integer>
##     [1]     chr1 10639562-10639580      - |   coding      4642      4660
##     [2]     chr1 11027568-11027586      - |   coding      1162      1180
##     [3]     chr1 11027568-11027586      - |   coding      1339      1357
##     [4]     chr1 11027568-11027586      - |   coding      1336      1354
##     [5]     chr1 11027568-11027586      - |   coding      1360      1378
##     ...      ...               ...    ... .      ...       ...       ...
##   [981]    chr22 29991526-29991544      + |   coding       316       334
##   [982]    chr22 29991526-29991544      + |   coding       316       334
##   [983]    chr22 29991526-29991544      + |   coding       316       334
##   [984]    chr22 37710509-37710527      + |   coding       197       215
##   [985]    chr22 37710509-37710527      + |   coding       197       215
##           QUERYID        TXID         CDSID      GENEID       PRECEDEID
##         <integer> <character> <IntegerList> <character> <CharacterList>
##     [1]        90       12536         14804       54897                
##     [2]        96       12554         14845       10747                
##     [3]        96       12555         14845       10747                
##     [4]        96       12556         14845       10747                
##     [5]        96       12560         14845       10747                
##     ...       ...         ...           ...         ...             ...
##   [981]     22248      239582 263033,263032        8897                
##   [982]     22248      239584 263033,263032        8897                
##   [983]     22248      239586 263033,263032        8897                
##   [984]     22310      240165        263652       11078                
##   [985]     22310      240167        263652       11078                
##                FOLLOWID
##         <CharacterList>
##     [1]                
##     [2]                
##     [3]                
##     [4]                
##     [5]                
##     ...             ...
##   [981]                
##   [982]                
##   [983]                
##   [984]                
##   [985]                
##   -------
##   seqinfo: 22 sequences from hg38 genome

Annotation 2

# devtools::install_github("Bioconductor/UCSC.utils")
# devtools::install_github("Bioconductor/GenomeInfoDb")
# devtools::install_github("grimbough/biomaRt")
# devtools::install_github("Bioconductor/txdbmaker")
# BiocManager::install("EnsDb.Hsapiens.v86")

library(txdbmaker)
gtf_file<-"/Users/sfurlan/Library/CloudStorage/OneDrive-SharedLibraries-FredHutchinsonCancerCenter/Furlan_Lab - General/resources/refs/gencode.v46.annotation.gtf.gz"
txdb <- makeTxDbFromGFF(gtf_file, format="gtf", dbxrefTag = "gene_name")
annotated_results2 <- locateVariants(reduced_results, txdb, AllVariants())
annotated_results2
## GRanges object with 120299 ranges and 9 metadata columns:
##            seqnames            ranges strand | LOCATION  LOCSTART    LOCEND
##               <Rle>         <IRanges>  <Rle> | <factor> <integer> <integer>
##        [1]     chr1     832502-832521      - |   intron      8576      8595
##        [2]     chr1     832502-832521      - |   intron      8023      8042
##        [3]     chr1     832502-832521      - |   intron      7470      7489
##        [4]     chr1     832502-832521      - |   intron     29775     29794
##        [5]     chr1     832502-832521      - |   intron      6950      6969
##        ...      ...               ...    ... .      ...       ...       ...
##   [120295]    chr22 40018446-40018463      + |   intron     24286     24303
##   [120296]    chr22 40018446-40018463      + |   intron     45916     45933
##   [120297]    chr22 40018446-40018463      + |   intron     22915     22932
##   [120298]    chr22 40018446-40018463      + |   intron     11525     11542
##   [120299]    chr22 40018446-40018463      + |   intron      8322      8339
##              QUERYID        TXID         CDSID             GENEID
##            <integer> <character> <IntegerList>        <character>
##        [1]         1      227999               ENSG00000167384.11
##        [2]         1      228000               ENSG00000167384.11
##        [3]         1      228001               ENSG00000167384.11
##        [4]         1      228001               ENSG00000167384.11
##        [5]         1      228002               ENSG00000167384.11
##        ...       ...         ...           ...                ...
##   [120295]     22329      217276               ENSG00000127616.22
##   [120296]     22329      217276               ENSG00000127616.22
##   [120297]     22329      217277               ENSG00000127616.22
##   [120298]     22329      217278               ENSG00000127616.22
##   [120299]     22329      217279               ENSG00000127616.22
##                  PRECEDEID        FOLLOWID
##            <CharacterList> <CharacterList>
##        [1]                                
##        [2]                                
##        [3]                                
##        [4]                                
##        [5]                                
##        ...             ...             ...
##   [120295]                                
##   [120296]                                
##   [120297]                                
##   [120298]                                
##   [120299]                                
##   -------
##   seqinfo: 22 sequences from hg38 genome

Annotation 3

library(ChIPpeakAnno)
library(ensembldb)
library(EnsDb.Hsapiens.v86)
annoGR <- toGRanges(EnsDb.Hsapiens.v86)
seqlevelsStyle(reduced_results) <- seqlevelsStyle(annoGR)
annotated_results3<-annoPeaks(reduced_results, annoGR, bindingType = "fullRange")
annotated_results3[grepl("^HLA", annotated_results3$gene_name)]$gene_name
##  [1] "HLA-F-AS1" "HLA-G"     "HLA-L"     "HLA-L"     "HLA-DRB9"  "HLA-DRB9" 
##  [7] "HLA-DRB9"  "HLA-DRB9"  "HLA-DRB9"  "HLA-DRB6"  "HLA-DRB1"  "HLA-DRB6" 
## [13] "HLA-DRB1"  "HLA-Z"     "HLA-Z"     "HLA-DOA"   "HLA-DPA1"  "HLA-DPA3"

Appendix

## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] EnsDb.Hsapiens.v86_2.99.0               
##  [2] ensembldb_2.26.0                        
##  [3] AnnotationFilter_1.26.0                 
##  [4] ChIPpeakAnno_3.36.1                     
##  [5] txdbmaker_1.1.0                         
##  [6] VariantAnnotation_1.48.1                
##  [7] Rsamtools_2.18.0                        
##  [8] Biostrings_2.70.2                       
##  [9] XVector_0.42.0                          
## [10] SummarizedExperiment_1.32.0             
## [11] MatrixGenerics_1.14.0                   
## [12] matrixStats_1.2.0                       
## [13] TxDb.Hsapiens.UCSC.hg38.knownGene_3.18.0
## [14] GenomicFeatures_1.54.3                  
## [15] AnnotationDbi_1.64.1                    
## [16] Biobase_2.62.0                          
## [17] GenomicRanges_1.54.1                    
## [18] GenomeInfoDb_1.41.1                     
## [19] IRanges_2.36.0                          
## [20] S4Vectors_0.40.2                        
## [21] BiocGenerics_0.48.1                     
## [22] rustytools_0.0.1                        
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.15.0        jsonlite_1.8.8           magrittr_2.0.3          
##   [4] rmarkdown_2.25           fs_1.6.3                 BiocIO_1.12.0           
##   [7] zlibbioc_1.48.0          ragg_1.2.7               vctrs_0.6.5             
##  [10] multtest_2.58.0          memoise_2.0.1            RCurl_1.98-1.14         
##  [13] htmltools_0.5.7          S4Arrays_1.2.0           progress_1.2.3          
##  [16] lambda.r_1.2.4           curl_5.2.0               SparseArray_1.2.4       
##  [19] sass_0.4.8               bslib_0.6.1              desc_1.4.3              
##  [22] httr2_1.0.0              futile.options_1.0.1     cachem_1.0.8            
##  [25] GenomicAlignments_1.38.2 lifecycle_1.0.4          pkgconfig_2.0.3         
##  [28] Matrix_1.6-5             R6_2.5.1                 fastmap_1.1.1           
##  [31] GenomeInfoDbData_1.2.11  digest_0.6.34            colorspace_2.1-0        
##  [34] regioneR_1.34.0          textshaping_0.3.7        RSQLite_2.3.5           
##  [37] filelock_1.0.3           fansi_1.0.6              httr_1.4.7              
##  [40] abind_1.4-5              compiler_4.3.2           bit64_4.0.5             
##  [43] BiocParallel_1.36.0      DBI_1.2.2                biomaRt_2.61.0          
##  [46] MASS_7.3-60              rappdirs_0.3.3           DelayedArray_0.28.0     
##  [49] rjson_0.2.21             tools_4.3.2              glue_1.7.0              
##  [52] VennDiagram_1.7.3        restfulr_0.0.15          InteractionSet_1.30.0   
##  [55] grid_4.3.2               ade4_1.7-22              generics_0.1.3          
##  [58] seqinr_4.2-36            gtable_0.3.4             BSgenome_1.70.2         
##  [61] hms_1.1.3                xml2_1.3.6               utf8_1.2.4              
##  [64] pillar_1.9.0             stringr_1.5.1            splines_4.3.2           
##  [67] dplyr_1.1.4              BiocFileCache_2.10.1     lattice_0.21-9          
##  [70] survival_3.5-7           rtracklayer_1.62.0       bit_4.0.5               
##  [73] tidyselect_1.2.0         RBGL_1.78.0              knitr_1.45              
##  [76] ProtGenerics_1.34.0      futile.logger_1.4.3      xfun_0.42               
##  [79] stringi_1.8.3            UCSC.utils_1.1.0         lazyeval_0.2.2          
##  [82] yaml_2.3.8               evaluate_0.23            codetools_0.2-19        
##  [85] tibble_3.2.1             graph_1.80.0             cli_3.6.2               
##  [88] pbmcapply_1.5.1          systemfonts_1.0.5        munsell_0.5.0           
##  [91] jquerylib_0.1.4          Rcpp_1.0.12              dbplyr_2.4.0            
##  [94] png_0.1-8                XML_3.99-0.16.1          parallel_4.3.2          
##  [97] pkgdown_2.0.7            ggplot2_3.5.0            blob_1.2.4              
## [100] prettyunits_1.2.0        bitops_1.0-7             scales_1.3.0            
## [103] purrr_1.0.2              crayon_1.5.2             rlang_1.1.3             
## [106] KEGGREST_1.42.0          formatR_1.14