CRISPResso2

Analysis of genome editing outcomes from deep sequencing data

What is CRISPResso2?

CRISPResso2 is a software pipeline for the analysis of genome editing experiments. It is designed to enable rapid and intuitive interpretation of results produced by amplicon sequencing.
Briefly, CRISPResso2
  • aligns sequencing reads to a reference sequence
  • quantifies insertions, mutations and deletions to determine whether a read is modified or unmodified by genome editing
  • summarizes editing results in intuitive plots and datasets

What can I do with CRISPResso2?

CRISPResso2 can be used to analyze genome editing outcomes using cleaving nucleases (e.g. Cas9 or Cpf1) or noncleaving nucleases (e.g. base editors). The following operations can be automatically performed:
  • filtering of low-quality reads
  • adapter trimming
  • alignment of reads to one or multiple reference sequences (in the case of multiple alleles)
  • quantification of HDR and NHEJ outcomes (if the HDR sequence is provided)
  • quantification frameshift/inframe mutations and identification affected splice sites (if an exon sequence is provided)
  • visualization of the indel distribution and position (for cleaving nucleases)
  • visualization of distribution and position of substitutions (for base editors)
  • visualization of alleles and their frequencies

How is CRISPResso2 different from CRISPResso?

CRISPResso2 introduces five key innovations for the analysis of genome editing data:
  • Comprehensive analysis of sequencing data from base editors. We have added additional analysis and visualization capabilities especially for experiments using base editors. For example, see CRISPResso2 output from base-editor analysis here.
  • Allele specific quantification of heterozygous references. If the targeted editing region has more than one allele, reads arising from each allele can be deconvoluted. See an example CRISPResso2 output here.
  • A novel biologically-informed alignment algorithm. This algorithm incorporates knowledge about the mutations produced by gene editing tools to create more biologically-likely alignments.
  • Ultra-fast processing time.
  • Batch mode for processing and comparing multiple samples. See an example CRISPResso2 batch output here.

Usage

Required parameters

CRISPResso2 requires four inputs:
  1. Editing tool specification: Users should select 'Cas9', 'Cpf1' or 'Base editors' depending on the editing tool used in the experiment. This sets the quantification window center and window size for measuring genome editing events.
    For Cas9, the quantification window center is set to -3bp from the 5' end of the sgRNA with a window width of 1bp on both sides of the predicted cleavage site.
    For Cpf1 the quantification window center is set to +1bp from the 5' end of the sgRNA with a window width of 1bp on both sides of the predicted cut site.
    For base editors, the editing window center is set to -10bp and the editing window is set to 10bp to quantify edits across a 20bp sgRNA.
    The 'Custom' editing tool can be selected to provide a custom quantification window center and window size.
  2. Input sequences: Sequencing reads should be provided in FASTQ format (fastq.gz files are also accepted) as paired-end reads (two files), single-end reads (single file), or interleaved reads from paired end sequencing (single file)
  3. Amplicon sequence: A reference amplicon sequence to assess and quantify the efficiency of genome editing. If multiple alleles are present at the editing site, multiple sequences can be given, separated by a comma.
  4. sgRNA sequence: The sequence of the sgRNA used in the experiment is used to infer the predicted position of editing activity. The sgRNA sequence must be given without the PAM sequence.

CRISPResso processing

Quality filtering

Input reads are first filtered based on the quality score (phred33) in order to remove potentially false positive indels. The filtering based on the phred33 quality score can be modulated by adjusting the optimal parameters (see additional notes below).

Adapter trimming

Next, adapters are trimmed from the reads. If no adapter are present, select 'No Trimming' under the 'Trimming adapter' heading in the optional parameters. If reads contain adapter sequences that need to be trimmed, select the adapters used for trimming under the ‘Trimming adapter’ heading in the optional parameters. Possible adapters include Nextera PE, TruSeq3 PE, TruSeq3 SE, TruSeq2 PE, and TruSeq2 SE. The adapters are trimmed from the reads using Trimmomatic.

Read merging

If paired-end reads are provided, reads are merged using FLASh . This produces a single read for alignment to the amplicon sequence, and reduces sequencing errors that may be present at the end of sequencing reads.

Alignment

The preprocessed reads are then aligned to the reference sequence with a global sequence alignment algorithm that takes into account our biological knowledge of nuclease function. If multiple alleles are present at the editing site, each allele can be passed to CRISPResso2 and sequenced reads will be assigned to the reference sequence or origin.

Visualization and analysis

Finally, after analyzing the aligned reads, a set of informative graphs are generated, allowing for the quantification and visualization of the position and type of outcomes within the amplicon sequence.

Optional parameters

  • Amplicon name/s: A name for the reference amplicon can be given. If multiple amplicons are given, multiple names can be specified here.
  • Sample name: This name can be used to identify a CRISPResso2 run.
  • Minimum homology for alignment to an amplicon: After reads are aligned to a reference sequence, the homology is calculated as the number of bp they have in common. If the aligned read has a homology less than this parameter, it is discarded. This is useful for filtering erroneous reads that do not align to the target amplicon, for example arising from alternate primer locations.
  • Base editing: If base editor output is selected, plots showing the frequency of substitutions in the quantification window are generated. The target and result bases can also be set to measure the rate of on-target conversion at bases in the quantification window.
  • HDR: The expected HDR amplicon sequence can be provided to quantify the number of reads showing a successful HDR repair. Note that the entire amplicon sequence must be provided, not just the donor template. CRISPResso2 will quantify identified instances of NHEJ, HDR, or mixed editing events.
  • Exon specification: Sequences of exons within the amplicon sequence can be provided to enable frameshift analysis and splice site analysis by CRISPResso2. Users should provide the subsequences of the reference amplicon sequence that correspond to coding sequences (not the whole exon sequence(s)!). If your amplicon sequences contain more than one coding exonic subsequence, please provide them separated by commas.

Troubleshooting

  • Please check that your input file(s) are in FASTQ format (compressed fastq.gz also accepted).
  • If you get an empty report, please double check that your amplicon sequence is correct and in the right orientation. It can be helpful to inspect the first few lines of your FASTQ file - the start of the amplicon sequence should match the start of your sequences. If not, check to see if the files are trimmed (see point below).
  • It is important to check if your reads are trimmed or not. CRISPResso2 assumes that the reads ARE ALREADY TRIMMED! If reads are not already trimmed, select the adapters used for trimming under the ‘Trimming Adapter’ heading under the ‘Optional Parameters’. This is FUNDAMENTAL to CRISPResso analysis. Failure to trim adaptors may result in false positives. This will result in a report where you will observe an unrealistic 100% edited reads and a sharp peak at the edges of the reference amplicon in figure 4.
  • It is possible to use CRISPResso with single end reads. In this case, select ‘Single end reads’ under the ‘Experimental Design’ heading.
  • The quality filter assumes that your reads uses the Phred33 scale, and it should be adjusted for each user’s specific application. A reasonable value for this parameter is 30.
  • If your amplicon sequence is longer than your sequenced read length, the R1 and R2 reads should overlap by at least 10bp. For example, if you sequence using 150bp reads, the maximum amplicon length should be 290 bp.
  • Especially in repetitive regions, multiple alignments may have the best score. If you want to investigate alternate best-scoring alignments, you can view all alignments using this tool: http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Gotoh. As input, sequences from the 'Alleles_frequency_table.txt' can be used. Specifically, for a given row, the value in the 'Aligned_Sequence' should be entered into the 'Sequence a' box after removing any dashes, and the value in the 'Reference_Sequence' should be entered into the 'Sequence b' box after removing any dashes. The alternate alignments can be selected in the 'Results' panel in the Output section.

CRISPResso2 output

The output of CRISPResso2 consists of a set of informative graphs that allow for the quantification and visualization of the position and type of outcomes within an amplicon sequence.
Example Output
Data file descriptions
  • CRISPResso2_report.html is a summary report that can be viewed in a web browser containing all of the output plots and summary statistics.
  • Alleles_frequency_table.zip can be unzipped to a tab-separated text file that shows all reads and alignments to references. The first column shows the aligned sequence of the sequenced read. The second column shows the aligned sequence of the reference sequence. Gaps in each of these columns represent insertions and deletions. The next column 'Reference_Name' shows the name of the reference that the read aligned to. The fourth column, 'Read_Status' shows whether the read was modified or unmodified. The fifth through seventh columns ('n_deleted', 'n_inserted', 'n_substituted') show the number of bases deleted, inserted, and substituted as compared to the reference sequence. The eighth column shows the number of reads having that sequence, and the ninth column shows the percentage of all reads having that sequence.
  • CRISPResso_mapping_statistics.txt is a tab-delimited text file showing the number of reads in the input ('READS IN INPUTS') the number of reads after filtering, trimming and merging (READS AFTER PREPROCESSING), the number of reads aligned (READS ALIGNED) and the number of reads for which the alignment had to be computed vs read from cache.
  • CRISPResso_quantification_of_editing_frequency.txt is a tab-delimited text file showing the number of reads aligning to each reference amplicon, as well as the status (modified/unmodified, number of insertions, deletions, and/or substitutions) of those reads.
  • CRISPResso_RUNNING_LOG.txt is a text file and shows a log of the CRISPResso run.
  • CRISPResso2_info.pickle can be read by other CRISPResso tools and contains information about the run and results.
The remainder of the files are produced for each amplicon, and each file is prefixed by the name of the amplicon if more than one amplicon is given.
  • Alleles_frequency_table_around_sgRNA_NNNNN.txt is a tab-separated text file that shows alleles and alignments to the specified reference for a subsequence around the sgRNA (here, shown by 'NNNNN'). This data report is produced for each amplicon when a guide is found in the amplicon sequence. A report is generated for each guide. The number of nucleotides shown in this report can be modified by changing the `--plot_window_size` parameter.
  • Substitution_frequency_table_around_sgRNA_NNNNN.txt is a tab-separated text file that shows the frequency of substitutions in the amplicon sequence around the sgRNA (here, shown by 'NNNNN'). The first row shows the reference sequence. The following rows show the number of substitutions to each base. For example, the first numeric value in the second row (marked ‘A’) shows the number of bases that have a substitution resulting in an A at the first basepair of the amplicon sequence. The number of unmodified bases at each position is now shown in this table (because they aren’t substitutions). Thus, if the first basepair of the amplicon sequence is an A, the first value in the first row will show 0. A report is generated for each guide. The number of nucleotides shown in this report can be modified by changing the `--plot_window_size` parameter.
  • Substitution_frequency_table.txt is a tab-separated text file that shows the frequency of substitutions in the amplicon sequence across the entire amplicon. The first row shows the reference sequence. The following rows show the number of substitutions to each base. For example, the first numeric value in the second row (marked ‘A’) shows the number of bases that have a substitution resulting in an A at the first basepair of the amplicon sequence. The number of unmodified bases at each position is now shown in this table (because they aren’t substitutions). Thus, if the first basepair of the AMPLICON sequence is an A, the first value in the first row will show 0.
  • Insertion_histogram.txt is a tab-separated text file that shows a histogram of the insertion sizes in the amplicon sequence in the quantification window. Insertions outside of the quantification window are not included. The ins_size column shows the insertion length, and the fq column shows the number of reads having that insertion size.
  • Deletion_histogram.txt is a tab-separated text file that shows a histogram of the deletion sizes in the amplicon sequence in the quantification window. Deletions outside of the quantification window are not included. The del_size column shows length of the deletion, and the fq column shows the number of reads having that number of substitutions.
  • Substitution_histogram.txt is a tab-separated text file that shows a histogram of the number of substitutions in the amplicon sequence in the quantification window. Substitutions outside of the quantification window are not included. The sub_count column shows the number of substitutions, and the fq column shows the number of reads having that number of substitutions.
  • Effect_vector_insertion.txt is a tab-separated text file with a one-row header that shows the percentage of reads with an insertion at each base in the reference sequence. The first column shows the 1-based position of the amplicon, and the second column shows the percentage of reads with a insertion at that location.
  • Effect_vector_deletion.txt is a tab-separated text file with a one-row header that shows the percentage of reads with a deletion at each base in the reference sequence. The first column shows the 1-based position of the amplicon, and the second column shows the percentage of reads with a deletion at that location.
  • Effect_vector_substitution.txt is a tab-separated text file with a one-row header that shows the percentage of reads with a substitution at each base in the reference sequence. The first column shows the 1-based position of the amplicon, and the second column shows the percentage of reads with a substitution at that location.
  • Effect_vector_combined.txt is a tab-separated text file with a one-row header that shows the percentage of reads with any modification (insertion, deletion, or substitution) at each base in the reference sequence. The first column shows the 1-based position of the amplicon, and the second column shows the percentage of reads with a modification at that location.
  • Modification_count_vectors.txt is a tab-separated file showing the number of modifications for each position in the amplicon. The first row shows the amplicon sequence, and successive rows show the number of reads with insertions (row 2), insertions_left (row 3), deletions (row 4), substitutions (row 5) and the sum of all modifications (row 6). Additionally, the last row shows the number of reads aligned.
    If an insertion occurs between bases 5 and 6, the insertions vector will be incremented at bases 5 and 6. However, the insertions_left vector will only be incremented at base 5 so the sum of the insertions_left row represents an accurate count of the number of insertions, whereas the sum of the insertions row will yield twice the number of insertions.
  • Quantification_window_modification_count_vectors.txt is a tab-separated file showing the number of modifications for positions in the quantification window of the amplicon. The first row shows the amplicon sequence in the quantification window, and successive rows show the number of reads with insertions (row 2), insertions_left (row 3), deletions (row 4), substitutions (row 5) and the sum of all modifications (row 6). Additionally, the last row shows the number of reads aligned.
  • Nucleotide_frequency_table.txt is a tab-separated file showing the number of each residue at each position in the amplicon. The first row shows the amplicon sequence, and successive rows show the number of reads with an A (row 2), C (row 3), G (row 4), T (row 5), N (row 6), or a deletion (-) (row 7) at each position.
  • Quantification_window_nucleotide_frequency_table.txt is a tab-separated file showing the number of each residue at positions in the quantification window of the amplicon. The first row shows the amplicon sequence in the quantification window, and successive rows show the number of reads with an A (row 2), C (row 3), G (row 4), T (row 5), N (row 6), or a deletion (-) (row 7) at each position.
  • Nucleotide_percentage_table.txt is a tab-separated file showing the percentage of each residue at each position in the amplicon. The first row shows the amplicon sequence, and successive rows show the percentage of reads with an A (row 2), C (row 3), G (row 4), T (row 5), N (row 6), or a deletion (-) (row 7) at each position.
  • Quantification_window_nucleotide_percentage_table.txt is a tab-separated file showing the percentage of each residue at positions in the quantification window of the amplicon. The first row shows the amplicon sequence in the quantification window, and successive rows show the percentage of reads with an A (row 2), C (row 3), G (row 4), T (row 5), N (row 6), or a deletion (-) (row 7) at each position.
The following report files are produced when the base editor mode is enabled:
  • Selected_nucleotide_percentage_table_around_sgRNA_NNNNN.txt is a tab-separated text file that shows the percentage of each base at selected nucleotides in the amplicon sequence around the sgRNA (here, shown by 'NNNNN'). If the base editing experiment targets cytosines (as set by the --base_editor_from parameter), each C in the quantification window will be numbered (e.g. C5 represents the cytosine at the 5th position in the selected nucleotides). The percentage of each base at these selected target cytosines is reported, with the first row showing the numbered cytosines, and the remainder of the rows showing the percentage of each nucleotide present at these locations. This file shows nucleotides within '--plot_window_size' bp of the position specified by the parameter '--quantification_window_center' relative to the 3' end of each guide.
  • Selected_nucleotide_frequency_table_around_sgRNA_NNNNN.txt is a tab-separated text file that shows the frequency of each base at selected nucleotides in the amplicon sequence around the sgRNA (here, shown by 'NNNNN'). If the base editing experiment targets cytosines (as set by the --base_editor_from parameter), each C in the quantification window will be numbered (e.g. C5 represents the cytosine at the 5th position in the selected nucleotides). The frequency of each base at these selected target cytosines is reported, with the first row showing the numbered cytosines, and the remainder of the rows showing the frequency of each nucleotide present at these locations. This file shows nucleotides within '--plot_window_size' bp of the position specified by the parameter '--quantification_window_center' relative to the 3' end of each guide.
The following report files are produced when the amplicon contains a coding sequence:
  • Frameshift_analysis.txt is a text file describing the number of noncoding, in-frame, and frameshift mutations. This report file is produced when the amplicon contains a coding sequence.
  • Splice_sites_analysis.txt is a text file describing the number of splicing sites that are unmodified and modified. This file report is produced when the amplicon contains a coding sequence.
  • Effect_vector_insertion_noncoding.txt is a tab-separated text file with a one-row header that shows the percentage of reads with a noncoding insertion at each base in the reference sequence. The first column shows the 1-based position of the amplicon, and the second column shows the percentage of reads with a noncoding insertion at that location. This report file is produced when amplicon contains a coding sequence.
  • Effect_vector_deletion_noncoding.txt is a tab-separated text file with a one-row header that shows the percentage of reads with a noncoding deletion at each base in the reference sequence. The first column shows the 1-based position of the amplicon, and the second column shows the percentage of reads with a noncoding deletion at that location. This report file is produced when amplicon contains a coding sequence.
  • Effect_vector_substitution_noncoding.txt is a tab-separated text file with a one-row header that shows the percentage of reads with a noncoding substitution at each base in the reference sequence. The first column shows the 1-based position of the amplicon, and the second column shows the percentage of reads with a noncoding substitution at that location. This report file is produced when amplicon contains a coding sequence.