Tutorial
The easiest way to perform a WOLAND analysis is through a single batch submission using woland-batch.pl. The first run involves a four-step preparation, but the next time you use WOLAND with other samples (and we believe you will!) you will only need to repeat the first step. Easy, no? Each step prepares inputs for this script:
$ perl woland-batch.pl -i <input.table file> -c <chromosome.profile file> -g <genomes.folder> -n <genome.version> -r <refseq.file> -w <hotspot.window length> -t <number.of.threads> -o <target.output folder>
First Step - Preparing input-table
Filtering
In most cases, a raw .vcf file containing SNVs from a resequencing pipeline is not suitable for a point mutation analysis. First, you have to filter out polymorphisms and false positives from each sample using, for example, vcftools (<http://vcftools.sourceforge.net/>) and/or ANNOVAR (<http://annovar.openbioinformatics.org/en/>).
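To make the filtering idea concrete, here is a minimal Python sketch. It is not a replacement for vcftools or ANNOVAR, and the criteria are assumptions chosen for illustration only: keep records whose FILTER column is PASS (or unset), drop records that already carry a known-variant ID, keep single-nucleotide changes only, and apply a hypothetical minimum QUAL of 30. The function names are also hypothetical.

```python
def keep_record(line, min_qual=30.0):
    """Return True if a VCF data line passes the illustrative filters."""
    fields = line.rstrip("\n").split("\t")
    # Standard VCF column order: CHROM POS ID REF ALT QUAL FILTER INFO ...
    _chrom, _pos, var_id, ref, alt, qual, flt = fields[:7]
    if flt not in ("PASS", "."):
        return False  # rejected by an upstream caller filter
    if var_id != ".":
        return False  # has a known-variant ID (assumed here to be a polymorphism)
    if len(ref) != 1 or len(alt) != 1 or alt == ".":
        return False  # not a single-nucleotide variant
    if qual != "." and float(qual) < min_qual:
        return False  # below the (assumed) quality threshold
    return True


def filter_vcf_lines(lines, min_qual=30.0):
    """Keep header lines plus the data lines accepted by keep_record."""
    return [ln for ln in lines if ln.startswith("#") or keep_record(ln, min_qual)]
```

The thresholds that make sense for a real experiment depend on the variant caller and the study design, so treat these values as placeholders.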
Annotating
Several tools are available to annotate .vcf files. However, WOLAND accepts only ANNOVAR (<http://annovar.openbioinformatics.org/en/>) gene annotation. ANNOVAR is easy to use, and its website explains how to download, install and run it. Here is an example of gene-based annotation with annotate_variation.pl from ANNOVAR (the input here is in ANNOVAR's own .avinput format):
$ perl annotate_variation.pl -geneanno -buildver hg19 example/ex1.avinput humandb/
Warning
WOLAND accepts ONLY .variant_function files from ANNOVAR. It is not possible to use the exonic_variant_function output.
At this point you have a .variant_function file for each sample to be analyzed. You can manually annotate a file (think twice before doing so) or force annotation when gene information is not available (or not necessary). Let’s take a look at a .variant_function file from ANNOVAR:
exonic | Lrp1b | chr2 | 3432131 | 3432131 | A | G |
intergenic | Rbpj | chr5 | 25465 | 25465 | T | A |
intronic | Cmklr1 | chr5 | 4234231 | 4234231 | C | T |
intronic | Setd8 | chr5 | 8423415 | 8423415 | G | C |
... | ... | ... | ... | ... | . | . |
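As a quick sanity check on a .variant_function file, the following Python sketch tallies the base substitutions it contains. It assumes the tab-separated column order shown above (region, gene, chromosome, start, end, reference allele, alternate allele); the function name is hypothetical and this is not part of WOLAND.

```python
from collections import Counter


def count_substitutions(lines):
    """Count single-base changes (e.g. 'A>G') in .variant_function lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        ref, alt = fields[5].upper(), fields[6].upper()
        # Only count simple substitutions between the four canonical bases.
        if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT" and ref != alt:
            counts[f"{ref}>{alt}"] += 1
    return counts
```

Reading a file is then a matter of `count_substitutions(open("Sample1.txt.variant_function"))`, where the file name is only an example.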
Now you have to build a tabular input-table file to assign each sample to a group, for example a “Control” or a “Treated” group.
Grouping samples
At this step you must create a simple tabular file (input-table). Each line corresponds to one sample: as in the example below, the group name is in the first column and the sample file name in the second. Let’s see an example:
Control | Sample1.txt.variant_function |
Control | Sample2.txt.variant_function |
Treated | Sample3.txt.variant_function |
Treated | Sample4.txt.variant_function |
Treated | Sample5.txt.variant_function |
This input-table file must be saved as a tabular text file; it is passed as the first argument (-i) to the woland-batch.pl script.
Note
You can provide a path for each file in input-table if the file is not located in the WOLAND $install_dir.
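If you prefer to generate the input-table rather than type it by hand, a small Python sketch such as the one below could do it. It assumes a tab-separated layout with the group in the first column and the sample file in the second, following the example above; the helper name is hypothetical. It also rejects files that do not end in .variant_function, in line with the warning about accepted ANNOVAR output.

```python
import csv


def write_input_table(rows, handle):
    """Write (group, sample_file) pairs as a tab-separated input-table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for group, sample_file in rows:
        if not sample_file.endswith(".variant_function"):
            raise ValueError(f"not a .variant_function file: {sample_file}")
        writer.writerow([group, sample_file])
```

For example, `write_input_table([("Control", "Sample1.txt.variant_function")], open("input.table", "w", newline=""))` writes a one-line table; the output file name is arbitrary.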
Second Step - Chromosome profile
At this step you must check your chromosome profile file. This file contains the length of each chromosome of your genome and is used to calculate the frequency of mutational changes. You can manually create your own chromosome profile or use the woland-bed.pl script if you have a .BED file from your targeted resequencing experiment (an exome, for example). Let’s see an example of a chr_profile file:
chr1 | 195471971 |
chr2 | 182113224 |
chr3 | 160039680 |
chr4 | 156508116 |
... | ... |
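For a whole-genome analysis, one way to create such a profile manually is to measure each chromosome in the genome FASTA file. The Python sketch below does this; the function names are hypothetical, and the tab-separated two-column output is an assumption based on the example above, so check it against a profile that WOLAND already accepts.

```python
def chromosome_lengths(fasta_lines):
    """Return {sequence name: length} for the records in a FASTA file."""
    lengths = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the chromosome name.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def format_profile(lengths):
    """Render lengths as 'name<TAB>length' lines, in FASTA order."""
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())
```

Because the file is read line by line, this also works for large genomes without loading the whole sequence into memory.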
Note
If you have a .BED file from your experiment you can use woland-bed.pl. For example:
$ perl woland-bed.pl hg19-exome-enrichment.bed
This will create a WOLAND-BED-PROFILE-hg19-exome-enrichment.bed file, which can be used as the chr_profile argument.
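To illustrate what a profile for a targeted experiment represents, the Python sketch below sums the targeted bases per chromosome in a BED file. This is not the woland-bed.pl implementation, whose exact behaviour is not described here; it only assumes standard BED coordinates (0-based start, exclusive end) and does not merge overlapping intervals. The function name is hypothetical.

```python
def targeted_bases_per_chromosome(bed_lines):
    """Sum interval lengths (end - start) per chromosome from BED lines."""
    totals = {}
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, comments, and 'track'/'browser' header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        totals[chrom] = totals.get(chrom, 0) + (end - start)
    return totals
```

If the BED file contains overlapping targets, the totals will count the shared bases more than once, so merge the intervals first if that matters for your analysis.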
Third Step - Genome information
WOLAND uses genome sequences in FASTA format to extract context sequences, and RefSeq annotation to obtain gene and transcriptional information. So you must provide two files for each genome. They are easy to obtain, and you MUST rename them according to the <genome_version> parameter and move them to the $install_dir/genomes/ folder.
Many genome sequences are available nowadays. We advise you to use the UCSC genome database to obtain your genome sequence and your RefSeq annotation file. Please check http://hgdownload.cse.ucsc.edu/downloads.html.
Genome sequence in FASTA format
The genome sequence must contain all chromosomes, named in chr format. For example:
>chr1
AGCATCGATCGGCATGCATGCTAGCTAGCTACGATGCTAGCAT (...)
>chr2
GCATGCATCGTACGTACGATCGATCGATCGATCGATCGATCGA (...)
(...)
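Some genome downloads name their sequences "1", "2", ... instead of "chr1", "chr2", .... A short Python sketch like the following can flag such headers before you run WOLAND; the function name is hypothetical.

```python
def non_chr_headers(fasta_lines):
    """Return the FASTA sequence names that do not start with 'chr'."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            if not name.startswith("chr"):
                bad.append(name)
    return bad
```

An empty result means every header already follows the chr naming used in the example above.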
Please rename the FASTA file to genome_<genome_version>.fa and move it to $install_dir/genomes/. For example:
$ mv hg19.fa $install_dir/genomes/genome_hg19.fa
RefSeq annotation
The RefSeq annotation can be obtained through http://hgdownload.cse.ucsc.edu/downloads.html.
Note
You MUST download the RefGene file - usually provided as refGene.txt.
Please rename the RefGene file to refseq_<genome_version>.txt and move it to $install_dir/genomes/. For example:
$ mv refGene.txt $install_dir/genomes/refseq_hg19.txt
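Before running the analysis, it can help to confirm that both genome files are in place under the names described in this step. The Python sketch below checks for genome_<genome_version>.fa and refseq_<genome_version>.txt inside $install_dir/genomes/; the function names are hypothetical and the check is only a convenience, not something WOLAND requires.

```python
from pathlib import Path


def expected_genome_files(install_dir, genome_version):
    """Return the two file paths this tutorial asks you to create."""
    genomes = Path(install_dir) / "genomes"
    return [
        genomes / f"genome_{genome_version}.fa",
        genomes / f"refseq_{genome_version}.txt",
    ]


def missing_genome_files(install_dir, genome_version):
    """Return the expected files that do not exist yet."""
    return [p for p in expected_genome_files(install_dir, genome_version) if not p.is_file()]
```

An empty list from `missing_genome_files(install_dir, "hg19")` means both files were found for that genome version.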
Fourth Step - Choosing hotspot window length and running!
Now you can choose a natural number >1 for the hotspot window length <hotspot_window>, for example 1000. Now, voilà, you can run woland-batch.pl:
$ perl woland-batch.pl -i input.table.tgca.csv -c profiles/chromosome.profile.hg19.bed.exons.txt -w 1000 -g genomes/ -n hg19 -r genomes/refseq_hg19.txt -o .
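If you launch WOLAND from a script, you may want to assemble the command programmatically. The Python sketch below builds the argument list using only the flags shown in this tutorial (-i, -c, -g, -n, -r, -w, -t, -o) and checks that the hotspot window is a natural number greater than 1. The function name is hypothetical, and treating -t as optional is an assumption based on the example command above, which omits it.

```python
def build_woland_command(input_table, chr_profile, genomes_dir, genome_version,
                         refseq, window, output_dir, threads=None):
    """Return the woland-batch.pl invocation as an argument list."""
    if not isinstance(window, int) or window <= 1:
        raise ValueError("hotspot window must be a natural number > 1")
    cmd = [
        "perl", "woland-batch.pl",
        "-i", input_table,
        "-c", chr_profile,
        "-g", genomes_dir,
        "-n", genome_version,
        "-r", refseq,
        "-w", str(window),
    ]
    if threads is not None:
        cmd += ["-t", str(threads)]  # assumed optional; omitted when not given
    cmd += ["-o", output_dir]
    return cmd
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the WOLAND installation directory.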