Create an empty directory to work in the exercise :
mkdir quality_control cd quality_control
Donwload the raw data to it (you can use the next command or download the file from this link)
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/quality_control/f010_raw_mirna.fastq
The file f010_raw_mirna.fastq contains reads from a microRNA sequencing experiment. Use the command head to have a view of the first lines of the file:
head f010_raw_mirna.fastq
Use the command wc to count how many reads there are in the file (remember you have to divide by 4)
wc -l f010_raw_mirna.fastq
First create a directory to store the results of the fastqc analysis:
mkdir f020_res_fastqc
Then execute fastqc storing the results in the created directory (option -o):
fastqc -o f020_res_fastqc f010_raw_mirna.fastq
Find the results in the fastqc_report.html file and discus them. There are many Overrepresented sequences. Explore whether some of them correspond to miRNAs using the miRBase (Search → By sequence).
There are 2 known adapters used in this experiment:
CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC
Use the command grep to see whether they are still present in your data:
grep "CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA" f010_raw_mirna.fastq grep "CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC" f010_raw_mirna.fastq
Do the sequences appear systematically at the beginning or at the end of the reads?
But the adapters could also appear in the reverse, complementary or reverse complementary mode.
Compute the reverse, complementary and the reverse complementary sequences of the two adapters, and find out which of them appear in your data.
To compute those sequences you can use some online resources as the one in:
Use grep form Linux shell to find out which of the versions of the adapter is in your data.
grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f010_raw_mirna.fastq | wc -l ## present in the sample (at the beginning of the reads) grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCTAAACCCTTAGAATATTCAAGACATACTCTGGTGAGATTTTT f010_raw_mirna.fastq | wc -l grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f010_raw_mirna.fastq | wc -l ## present in the sample (at the end of the read) ... but not so numerous grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAAGGGTTTAGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f010_raw_mirna.fastq | wc -l
But sometimes the adapter does not appear complete. It may be there just the first part:
grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGA f010_raw_mirna.fastq | wc -l grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCT f010_raw_mirna.fastq | wc -l grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATT f010_raw_mirna.fastq | wc -l grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAA f010_raw_mirna.fastq | wc -l
or the end part of it:
grep AATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f010_raw_mirna.fastq | wc -l grep TTAGAATATTCAAGACATACTCTGGTGAGATTTTT f010_raw_mirna.fastq | wc -l grep TCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f010_raw_mirna.fastq | wc -l grep AGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f010_raw_mirna.fastq | wc -l
NOTE: in the code above I did cut just the 35 first or last nucleotides of the primer in its different arrangements, but this is an arbitrary length. We are just trying to discover which of the arrangements are present in our sample and whether there are allocated in the 5’ or in the 6’ end.
grep CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f010_raw_mirna.fastq | wc -l ## present in the sample (at the end of the read) ... but not so numerous grep GAAAAAAAGCAGGAAAGGTGTTCTATATATTTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f010_raw_mirna.fastq | wc -l grep GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f010_raw_mirna.fastq | wc -l ## present in the sample (at the beginning of the reads) grep CGAATGGCATTGAACTTTCATAAAGCTAAAGAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f010_raw_mirna.fastq | wc -l
As before, sometimes the adapter does not appear complete. It may be there just the first part:
grep CTTTTTTTCGTCCTTTCCACAAGATATATA f010_raw_mirna.fastq | wc -l grep GAAAAAAAGCAGGAAAGGTGTTCTATATAT f010_raw_mirna.fastq | wc -l grep GCTTACCGTAACTTGAAAGTATTTCGATTT f010_raw_mirna.fastq | wc -l grep CGAATGGCATTGAACTTTCATAAAGCTAAA f010_raw_mirna.fastq | wc -l
or the end part of it:
grep AAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f010_raw_mirna.fastq | wc -l grep TTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f010_raw_mirna.fastq | wc -l grep CTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f010_raw_mirna.fastq | wc -l grep GAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f010_raw_mirna.fastq | wc -l
Check the options:
You can find the help of the program typing in the shell:
cutadapt -h
To get read of the the adapters found in our data we run cutadapt several times:
cutadapt -g CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA -o f030_mirna_trim1.fastq f010_raw_mirna.fastq cutadapt -a TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG -o f030_mirna_trim2.fastq f030_mirna_trim1.fastq cutadapt -g GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG -o f030_mirna_trim3.fastq f030_mirna_trim2.fastq cutadapt -a CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC -o f030_mirna_trim4.fastq f030_mirna_trim3.fastq
Now you can grep again searching for the adapters
grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f030_mirna_trim4.fastq | wc -l #grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCTAAACCCTTAGAATATTCAAGACATACTCTGGTGAGATTTTT f030_mirna_trim4.fastq | wc -l grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f030_mirna_trim4.fastq | wc -l #grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAAGGGTTTAGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f030_mirna_trim4.fastq | wc -l
grep CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f030_mirna_trim4.fastq | wc -l #grep GAAAAAAAGCAGGAAAGGTGTTCTATATATTTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f030_mirna_trim4.fastq | wc -l grep GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f030_mirna_trim4.fastq | wc -l #grep CGAATGGCATTGAACTTTCATAAAGCTAAAGAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f030_mirna_trim4.fastq | wc -l
Check the data quality again using fastqc:
mkdir f040_res_fastqc_trimmed fastqc -o f040_res_fastqc_trimmed f030_mirna_trim4.fastq
Some of the reads seems to be too short and some others may not have enough quality.
Check the options:
You can find the help of the program typing in the shell:
cutadapt -h
Run cutadapt for length and quality purge of the reads.
cutadapt -m 17 -M 30 -q 10 -o f040_mirna_cut.fastq f030_mirna_trim4.fastq
Check the data quality again using fastqc:
mkdir f050_res_fastqc_trimmed_purged fastqc -o f050_res_fastqc_trimmed_purged f040_mirna_cut.fastq
Explore again the Overrepresented sequences in miRBase (Search → By sequence).
Count how many reads are left for the analysis (divide by 4)
wc -l f010_raw_mirna.fastq wc -l f040_mirna_cut.fastq
NOTE: Before continuing with the next step you should go back to your working directory:
cd ..
In this hands-on will learn how to align DNA and RNA-seq data with most widely used software today. Building a whole genome index requires a lot of RAM memory and almost one hour in a typical workstation, for this reason in this tutorial we will work with chromosome 21 to speed up the exercises. The same steps would be done for a whole genome alignment. Two different datasets, high and low quality have been simulated for DNA, high quality contains 0.1% of mutations and low quality contains 1%. For RNA-seq a 100bp and 150bp datasets have been simulated.
Create a mapping folder in your working directory and download the reference genome sequence to be used (human chromosome 21) and simulated datasets from HERE.
mkdir mapping cd mapping
Working with NGS data requires a high-end workstations and time for building the reference genome indexes and alignment. During this tutorial we will work only with chromosome 21 to speed up the runtimes. You can download it from the Download link at the top of Ensembl website and then to Download data via FTP. We are going to work with Esembl GRCh37 revision 75 and you can download the chromosome 21 from ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/. The next command will download the chr 21.
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/f000_chr21_ref_genome_sequence.fa
Or you can use the next commmand to download this file from
NOTE: For working with the whole reference genome the file to be downloaded is Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
For this hands-on we are going to use small DNA datasets simulated from chr 21. Data has been already simulated using dwgsim softeare from SAMtools. You can download all the files with the next commands. The gunzip * command will decompress all files.
mkdir data cd data wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_hq_read1.fastq.gz wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_hq_read2.fastq.gz wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_lq_read1.fastq.gz wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_lq_read2.fastq.gz gunzip * cd ..
Notice that the name of the folders and files describe the dataset, ie. dna_chr21_100_hq stands for: DNA type of data from chromosome 21 with 100nt read lengths of high quality. Where hq quality means 0.1% mutations and lq quality 1% mutations. Take a few minutes to understand the different files.
For those with access to high-end nodes clusters you can index and simulated whole genome datasets or download real datasets from this sources:
In this exercise we’ll learn how to download, install, build the reference genome index and align in single-end and paired-end mode with BWA. But first, we need to create a folder to store the alignment results and an index folder to store the genome index, you can create both folders by executing:
mkdir results mkdir index
NOTE: Now your working directory must contain 3 folders: data, index and results. Your working directory should be similar to this, execute ‘tree’:
tree
.
├── data
│ ├── dna_chr21_100_hq_read1.fastq
│ ├── dna_chr21_100_hq_read2.fastq
│ ├── dna_chr21_100_lq_read1.fastq
│ └── dna_chr21_100_lq_read2.fastq
├── f000_chr21_ref_genome_sequence.fa
├── index
└── results
BWA is probably the most used aligner for DNA. AS the documentation states it consists of three different algorithms: BWA, BWA-SW and BWA-MEM. The first algorithm, which is the oldest, is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA for 70-100bp Illumina reads.
First we have to move our reference genome to the index folder by executing:
mv f000_chr21_ref_genome_sequence.fa index/
Now we can create the index by executing:
bwa index index/f000_chr21_ref_genome_sequence.fa
Some files will be created in the index folder, those files constitute the index that BWA uses. NOTE: The index must created only once, it will be used for all the different alignments with BWA.
BWA-MEM is the recommended algorithm to use now. You can check the options by executing:
bwa mem
To align SE (Single-end) with BWA-MEM execute:
bwa mem -t 4 -R "@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1" index/f000_chr21_ref_genome_sequence.fa data/dna_chr21_100_hq_read1.fastq > results/dna_chr21_100_hq_se.sam
Now you can use SAMtools to create the BAM file. First we move to results folder and then we execute samtools command to export SAM file to BAM file:
cd results samtools view -bS dna_chr21_100_hq_se.sam -o dna_chr21_100_hq_se.bam
You can explore the SAM file with the alignments using the next command (if you press up/down keys you can navigate through the file.
more dna_chr21_100_hq_se.sam
Before the next command we should go back to main directory.
cd ..
To align PE (Paired-end) with BWA-MEM just execute the same command line with the two FASTQ files:
bwa mem -t 4 -R "@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1" index/f000_chr21_ref_genome_sequence.fa data/dna_chr21_100_hq_read1.fastq data/dna_chr21_100_hq_read2.fastq > results/dna_chr21_100_hq_pe.sam
Now you can use SAMtools to create the BAM file (as we did before):
cd results samtools view -bS dna_chr21_100_hq_pe.sam -o dna_chr21_100_hq_pe.bam
If you want to explore the new SAM file with paired-end alignments you can execute:
more dna_chr21_100_hq_pe.sam
Now you can do the same for the low quality datasets.
NOTE: Before continuing with the next step you should go back to your working directory:
cd ../..
Create a folder visualization and download de required files:
mkdir visualization cd visualization wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12878_child.bam wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12891_dad.bam wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12892_mom.bam
These datasets contain reads only for the GABBR1 gene.
Use samtools to index the bam files:
samtools index NA12878_child.bam samtools index NA12891_dad.bam samtools index NA12892_mom.bam
Notice that three files have been created (with extension .bai):
ls -1 NA12878_child.bam NA12878_child.bam.bai NA12891_dad.bam NA12891_dad.bam.bai NA12892_mom.bam NA12892_mom.bam.bai
You can run this command from the terminal using igv:
igv
Run just in case you do not have downloaded Human hg10 genome before:
Steps:
NOTE: Before continuing with the next step you should go back to your working directory:
cd ..
Create a directory for this exercise:
mkdir calling cd calling
Create a new directory called genome:
mkdir genome
Now we can copy the genome from the previous step (mapping):
cp ../mapping/index/f000_chr21_ref_genome_sequence.fa genome/
Or we can download the file with the next command:
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/f000_chr21_ref_genome_sequence.fa -P genome/
NOTE: we used option -P this time with wget, this option allows us to indicate the output folder.
Use samtools to generate the fasta file index:
samtools faidx genome/f000_chr21_ref_genome_sequence.fa
This creates a file called samtools faidx f000_chr21_ref_genome_sequence.fa.fai, with one record per line for each of the contigs in the FASTA reference file.
Generate the sequence dictionary using Picard.
picard CreateSequenceDictionary REFERENCE=genome/f000_chr21_ref_genome_sequence.fa OUTPUT=genome/f000_chr21_ref_genome_sequence.dict
Create a data folder and download the required BAM by executing the next commands:
mkdir data cd data wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/000-dna_chr21_100_hq_pe.bam
NOTE: this BAM file is the same file we created in the maping section.
We must sort and index the BAM file before processing it with Picard and GATK. To sort the bam file we use samtools
samtools sort 000-dna_chr21_100_hq_pe.bam 001-dna_chr21_100_hq_pe_sorted
Index the BAM file:
samtools index 001-dna_chr21_100_hq_pe_sorted.bam
Run the following Picard command to mark duplicates:
picard MarkDuplicates INPUT=001-dna_chr21_100_hq_pe_sorted.bam OUTPUT=002-dna_chr21_100_hq_pe_sorted_noDup.bam METRICS_FILE=002-metrics.txt
This creates a sorted BAM file called 002-dna_chr21_100_hq_pe_sorted_noDup.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt.
Run the following Picard command to index the new BAM file:
picard BuildBamIndex INPUT=002-dna_chr21_100_hq_pe_sorted_noDup.bam
There are 2 steps to the realignment process:
First, create a target list of intervals which need to be realigned
gatk -T RealignerTargetCreator -R ../genome/f000_chr21_ref_genome_sequence.fa -I 002-dna_chr21_100_hq_pe_sorted_noDup.bam -o 003-indelRealigner.intervals
Second, perform realignment of the target intervals
gatk -T IndelRealigner -R ../genome/f000_chr21_ref_genome_sequence.fa -I 002-dna_chr21_100_hq_pe_sorted_noDup.bam -targetIntervals 003-indelRealigner.intervals -o 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam
This creates a file called 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam containing all the original reads, but with better local alignments in the regions that were realigned.
Two steps:
First, analyse patterns of covariation in the sequence dataset.
It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.
Download the known sites file:
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/000-dbSNP_chr21.vcf -P ..
Execute tye GATK command:
gatk -T BaseRecalibrator -R ../genome/f000_chr21_ref_genome_sequence.fa -I 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam -knownSites ../000-dbSNP_chr21.vcf -o 004-recalibration_data.table
This creates a GATKReport file called 004-recalibration_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.
NOTE: this step could take several minutes, if you do not want to wait until the comand finish you can download the result file by executing the next command:
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-recalibration_data.table
Second, apply the recalibration to your sequence data
gatk -T PrintReads -R ../genome/f000_chr21_ref_genome_sequence.fa -I 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam -BQSR 004-recalibration_data.table -o 004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam
NOTE: this step could take several minutes, if you do not want to wait until the comand finish you can download the result files by executing the next commands:
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bai wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam
This creates a file called 004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag –emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.
SNPs are called using the next command:
gatk -T UnifiedGenotyper -R ../genome/f000_chr21_ref_genome_sequence.fa -I 004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam -o 005-dna_chr21_100_he_pe.vcf
NOTE: this step could take several minutes, if you do not want to wait until the comand finish you can download the result files by executing the next commands:
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/005-dna_chr21_100_he_pe.vcf wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/005-dna_chr21_100_he_pe.vcf.idx
Example: filter SNPs with low confidence calling (QD < 12.0) and flag them as “LowConf”.
gatk -T VariantFiltration -R ../genome/f000_chr21_ref_genome_sequence.fa -V 005-dna_chr21_100_he_pe.vcf --filterExpression "QD < 12.0" --filterName "LowConf" -o 006-dna_chr21_100_he_pe_filtered.vcf
The command –filterExpression will read the INFO field and check whether variants satisfy the requirement. If a variant does not satisfy your filter expression, the field FILTER will be filled with the indicated –filterName. These commands can be called several times indicating different filtering expression (i.e: –filterName One –filterExpression “X < 1” –filterName Two –filterExpression “X > 2”).
QUESTION: How many “LowConf” variants have we obtained?
grep LowConf 006-dna_chr21_100_he_pe_filtered.vcf | wc -l