Tutorial

The easiest way to perform a WOLAND analysis is through a single batch submission using woland-batch.pl. This will envolve initial 4-step preparation but in the next-time that you will use WOLAND with other samples (and we believe that you do!) you will use only the first step. It is easy no? Each step will prepare the inputs for this script:

$ perl woland-batch.pl -i <input.table file> -c <chromosome.profile file> -g <genomes.folder> -n <genome.version> -r <refseq.file> -w <hotspot.window length> -t <number.of.threads> -o <target.output folder>

First Step - Preparing input-table

Filtering

In most cases, a raw .vcf file containing SNVs from a resequencing pipeline is not suitable for a point mutation analysis. First, you have to filter polymorphisms and false-positives from each sample using, for example, vcftools (<http://vcftools.sourceforge.net/>) and/or ANNOVAR (<http://annovar.openbioinformatics.org/en/>).

Annotating

Several tools are available to annotate .vcf files. However, WOLAND accepts only ANNOVAR (<http://annovar.openbioinformatics.org/en/>) gene annotation. It is easy to use ANNOVAR and you can find information about downloading, installing and using it at its website. Here is an example how to use annovar to annotate a .vcf file using annotate_variation.pl from ANNOVAR:

$ perl annotate_variation.pl -geneanno -buildver hg19 example/ex1.avinput humandb/

Warning

WOLAND accepts ONLY .variant_function files from ANNOVAR. It is not possible to use exonic_variant_function output.

At this time you have a .variant_function for each sample to be analyzed. You can manually annotate a file (think twice before) or force annotation when gene information are not avaialble (or not necessary). Let’s take a look at a variant-function file from annovar:

exonic Lrp1b chr2 3432131 3432131 A G
intergenic Rbpj chr5 25465 25465 T A
intronic Cmklr1 chr5 4234231 4234231 C T
intronic Setd8 chr5 8423415 8423415 G C
... ... ... ... ... . .

Now you have to build a tabular input-table file to assign samples into a group name - a “Control” or a “Treated” group, for example.

Grouping samples

At this step you must create a simple tabular file (input-table). Each line must corresponds to each file sample name in the first column and its group in the second column. Let’s see an example:

Control Sample1.txt.variant_function
Control Sample2.txt.variant_function
Treated Sample3.txt.variant_function
Treated Sample4.txt.variant_function
Treated Sample5.txt.variant_function

This file input-table must be saved as a tabular text file and it will be used as the first argument in woland-batch.pl script.

Note

You can provide a path for each file in input-table if it does not rely on WOLAND $install_dir.

Second Step - Chromosome profile

At this step you must check you chromosome profile file. This file contains the length of each chromosome of your genome and it is used to calculate frequency of mutational changes. You can manually create your own chromosome profile or use the woland-bed.pl script if you have a .BED file from you targeted resequencing experiment (exome, for example). Let’s see an example of a chr_profile file:

chr1 195471971
chr2 182113224
chr3 160039680
chr4 156508116
... ...

Note

If you have a .BED file from you experiment you can use woland-bed.pl. For example:

$ perl woland-bed.pl hg19-exome-enrichment.bed

This will create a WOLAND-BED-PROFILE-hg19-exome-enrichment.bed file which can be used as chr_profile argument.

Third Step - Genome information

WOLAND uses genome sequences in FASTA format to extract context sequences and RefSeq annotation to obtain gene and transcriptional information. So you must provide two files for each genome. It is easy to obtain them and you MUST rename the according to <genome_version> parameter and move them to $install_dir/genomes/ folder.

A lot of genome sequences are avaialble nowadays. We advise you to use UCSC genome database to obtain your genome sequence and your RefSeq annotation file. Please check http://hgdownload.cse.ucsc.edu/downloads.html.

Genome sequence in FASTA format

The genome sequence must contain all chromosomes in chr format. For example:

>chr1
AGCATCGATCGGCATGCATGCTAGCTAGCTACGATGCTAGCAT (...)
>chr2
GCATGCATCGTACGTACGATCGATCGATCGATCGATCGATCGA (...)
(...)

Please rename the FASTA file to genome_<genome_version>.fa and move it to $install_dir/genomes/. For example:

$ mv hg19.fa $install_dir/genomes/hg19.fa

RefSeq annotation

The RefSeq annotation can be obtained through http://hgdownload.cse.ucsc.edu/downloads.html .

Note

You MUST download the RefGene file - usually provided as refGene.txt.

Please rename the RefGene file to refseq_<genome_version>.txt` and move it to ``$install_dir/genomes/. For example:

$ mv RefGene $install_dir/genomes/refseq_hg19.txt

Fourth Step - Choosing hotspot window length and running!

Now you can choose a natural number >1 for the hotspot window length <hotspot_window>, for example: 1000. Now, voilà, you can run woland-batch.pl!:

$ perl woland-batch.pl -i input.table.tgca.csv -c profiles/chromosome.profile.hg19.bed.exons.txt -w 1000 -g genomes/ -n hg19 -r genomes/refseq_hg19.txt -o .