v1.0 Manual


DISSECT

Transcript-to-genome aligner for detection of structural alteration events in transcripts.

USAGE

Within a properly setup Dissect folder, call:
./dissect configFile

Read below for how to set up the Dissect folder and config file.

Important Note: For each first time run with a new reference genome file, the indexing should be done by turning on STEP_0 flag (described below in config file description). For any subsequent run with the same reference genome file (and with the same indexing window-size parameter), STEP_0 flag can be turned off to gain speed by skipping the indexing stage of mrsFAST.

How To Set-up Dissect Folder
  1. Prepare the configuration file and genome sequence list file as described below.
  2. Make sure that mrsFAST and mrsFAST extension binaries and genomic region inference binary (anchInfer) are within "binary/" folder and dissect binary is in the main folder.
  3. Place the input query file containing long transcript reads or transcriptome assembly contigs in the data/ folder in FASTA format.
  4. Place FASTA formatted genome file in a folder with fast access (preferably same disk) from Dissect main folder. Note that mrsFAST index file will be generated in the same folder, so allow ~7-8x disk space (w.r.t. the reference file) in the same folder.

Genome Sequence List File

This file in each line should contain the name of a reference sequence (e.g. chromosomes) in the reference genome file and its size separated by white-space. Order of the sequences is not important.

A sample sequence list file for mm9 build mouse reference genome is as follows:

chr1 197195432
chr2 181748087
chr3 159599783
...
...
chr16_random 3994
chr17_random 628739
chrM 16299
chrUn_random 5900358
chrX_random 1785075
chrY_random 58682461

Config File
The configuration file needs to be formatted as follows in each line:
[ConfigNo]_[ConfigName]=[ConfigValue]

ConfigName can be changed by the user as desired as long as configNo is correct.

A sample config file with default alignment parameters is given below for mm9 build mouse genome:

01_refFile=./mm9_data/mm9.fa
02_configRun=11111
03_FileList=chrExtended.mus.files.list
04_querySeq=./data/querySeq.fa
05_anchIndexSize=12
06_anchSize=24
07_numAnchors=20
08_anchErrRate=1
09_maxAnchorHits=150
10_cLength=400000
11_anchInferMergeDist=400000
13_anchGracePerc=90
14_maxRegions=3
15_rnaSeedSize=7
16_rnaGapLimit=50
17_fusionPenalty=50
18_scoreGracePerc=50

Description of the configuration parameters:

01_refFile: path for the reference genome file used.

02_configRun: Bit Flags for Dissect Run STEP 0, STEP 1, STEP 2, STEP 3, STEP 4, and STEP 5 respectively.
Note that not all combinations of configRun flags will be valid (e.g. 10101, 11011). For the first time run with any reference genome 11111 will perform the full Dissect run from indexing to chaining. Indexing step can be skipped for any subsequent run with the same reference genome by setting configRun=01111.
Furthermore, if the same run is made with different chaining parameters (e.g. rnaGapLimit, fusionPenalty), all other steps other than chaining can be skipped by setting configRun=00001. Similarly any modification to genomic region inference step would require only the last three steps to be rerun with configRun=00111.

03_FileList: Path for genomic sequence lisr file described above

04_querySeq: Path for query input FASTA file with a list of transcript sequences.

05_anchIndexSize: mrsFAST index size for anchor mapping (used in STEP 0). Larger values speed up mapping in STEP 1 and increase the index size of the genome. Smaller values will reduce index size, but increase mapping in STEP 1.

06_anchSize: Size of the anchors to be mapped (used in STEP 1). Longer anchors can improve genomic region inference accuracy due to increased number of unique mappings, yet can cause the correct region to be missed for transcripts with multiple short exons.

07_numAnchors: Number of anchors to be sampled from the query sequence (used in STEP 1). Larger number of anchors will increase sensitivity of genomic region inference, but will slow down mapping step by linear and region inference step by a quadratic factor.

08_anchErrRate: Number of allowed mismatched when mapping anchor sequences to the genome (used in STEP 1). This value should be <= floor(anchSize/anchIndexSize) - 1. Smaller values of anchErrRate will result in more accurate genomic region inference calls for highly similar transcript sequences, but might cause the correct region to be missed for transcripts with relatively high level of substitution errors.

09_maxAnchorHits: Maximum number of mappings of an anchor to the genome for the anchor to be taken into account in region inference step. (Used in STEP 2). This value should be <255. The more mappings an anchor has, the less amount of information gained as to which genomic region the transcript should be aligned. Therefore, anchors with number of mappings larger than this threshold is eliminated altogether from the genomic region inference step.

11_anchInferMergeDist: Highest distance threshold for merging two regions into one (used in STEP 2). This can either be applied in order to merge two fusion regions into one or merge two single regions in the high scoring list into one.

13_anchGracePerc: The ratio of a region to the highest scoring region, which a region needs to maintain in order to get into or stay in the highest scoring list. (used in STEP 2)

14_maxRegions: The maximum number of disjoint regions reported by genomic region inference step (used in STEP 2).

15_rnaSeedSize: Seed size used for construction homologous fragments between query transcript and inferred genomic region sequences. (used in STEP 3). This value indicates the minimum required length of error-free sequence for a homologous fragment to be detected. Increasing this length will speed up mapping in STEP 3, yet might potentially result in smaller fragment set construction.

16_rnaGapLimit: The maximum length of contiguous transcript gap that can be chained through in the fragment chaining step (STEP 4). This value should be equal to the maximum expected novel insertion length in the transcript and preferably be small. Very large values of rnaGapLimit may result in higher rate of false positive and false negatives in event detection during fragment chaining.

17_fusionPenalty: Constant fusion transition penalty used in fragment chaining (STEP 4).

18_scoreGracePerc: Similar to anchGracePerc, but used for eliminating relatively low scoring alignments at the end of STEP 4.

OUTPUT

Output of Dissect consists of the following files:

Important Note: Remember to save Dissect output with different names (or in a different folder), since each subsequent run with the same query name will overwrite previously existing output files.

Output Format
The main output of Dissect is ([QUERY].blockOutput) indicates alignment blocks of transcripts to multiple inferred regions.

A sample output is as follows:

>uc009iwu.1
S 1383 chr7 rc 1398 3 646(647),71,681(680), 0,646,717, 111800765,111801414,111801494, +,+,+, 1,1,1, 1..2,, G=0
@
>uc009kbr.1
N anch
@
>uc009iin.1
D 872 chr7/chr1 fw 1034 2 227,794, 0,240, 101191200,101854713, +,+, 1,2, , G=13
>uc009fcu.1
S 3018 chr7 fw 3078 6 312,292,813,228,124,1309, 0,312,604,1417,1645,1769, 8954256,8957503,8959536,8961824,8967814,8971048, +,+,+,+,+,+, 1,1,1,1,1,1, ,,,,, G=0
S 3008 chr7 fw 3078 6 312,292,813,228,124,1309, 0,312,604,1417,1645,1769, 7642223,7638950,7636917,7634599,7628523,7625293, -,-,-,-,-,-, 1,1,1,1,1,1, ,,,,, G=0
S 2986 chr7 fw 3078 6 312,292,813,228,124,1309, 0,312,604,1417,1645,1769, 9638195,9641437,9643471,9645767,9651813,9655046, +,+,+,+,+,+, 1,1,1,1,1,1, ,,,,, G=0
@

In this output file each query transcript name is printed separately, which is then followed by a number of tab-delimited alignment lines until '@'.

Alignment line fields are described as follows:

1) Type of alignment:

Remainder of the fields are described for S and D cases:

2) Alignment score: Overall match score - [insertion penalties + genomic transition penalties - splice junction scores].

3) Genomic sequence name(s) for the genomic region(s). For single region alignment, this field will simply be the sequence name (e.g. chr7). For double region alignment, the field will have two sequence names separated by '/ (e.g. chr7/chr1). Denote first genomic region sequence as SEQ1 and the second as SEQ2.

4) Junction directionality (fw/rc): Direction of splice junctions used for aligning the transcript sequence. 'fw' alignments use GT-AG splice junction scoring scheme, 'rc' alignments use CT-AC. If there is no junction present in the alignment or both cases are equivalent 'fw' is printed. (Note that, aligning assembled contigs, this field indicates the likely to indicate the 5'-3' ends of the assembled transcript).

5) Length of the query transcript sequence.

6) Number of blocks in alignment.

7) Length of each block in the transcript (and in the genome if different). This field is separated by ',' characters indicating the length of each alignment block in downstream direction of the transcript. If the block length is equal in the transcript and the genome the sub-field contains a positive integer indicating the length. If the length of the transcript and genome are different due to an indel, the sub-field contains a positive integer indicating the length of the block in the transcript and a separate positive integer in parentheses indicating the length of the block in the genome (e.g. 156(157)). Even though not very common, different sized transcript/genome block pairs can be introduced by the post-processing step of Dissect.

8) Starting position of each block in the transcript (indices are 0-based).

9) Starting position of each block in the genome (indices are 0-based). Unlike PSL output format, these indices always indicate the actual starting position of the block (i.e. if alignment direction is reverse, this index will be larger than all other positions in the block).

10) Direction of each block in the alignment (for alignments containing inversion events some blocks will have different direction values than others).

11) Equal scoring junction position information for each block-block junction. When the sequences on two sides of a junction is similar and there is a lack of a canonical splice junction or multiple canonical splice signals in the neighborhood of the junction, Dissect may not select the correct junction with confidence. In these cases, the offset for alternative splice junction positions are printed in this field. (Not that number of sub-fields is one less than other, since number of junctions is one less than the number of blocks).

12) Metadata used for debugging and various other extensions to the output, this field is subject to change with upcoming updates to the alignment tool.

Temporary Files

During its execution, Dissect creates a number of temporary files. These files are created to ensure the program can be run in multiple steps without nay issues. For this reason these files should not be removed until the last stage of the Dissect run is completed.

Temporary file can also give additional info as to how the alignment is done (or why not being aligned as desired).

These temporary files (in the order of their creation) are described below:
  • anchorReads.fa : Set of anchor sequences sampled from the query transcripts
  • anchorReads.fa.output (.sorted/.unmapped) : Mapping results of anchor sequences to the reference genome
  • anchorInferenceOutput: Inferred genomic regions (or region pairs for fused transcripts) for each query in the transcript according to anchor mappings (There can be more than one region for each query transcripts)
  • batchQueryList (List of queries to be sampled and mapped with mrsFAST_HS )
  • batchTileOutput (.nohit) (Resulting fragments for queries and regions)
Valid XHTML :: Valid CSS: :: Powered by WikkaWiki