Hodgkin’s Lymphoma, Dissociated Tumor: Targeted, Gene Signature Panel#

Dataset: Hodgkin’s Lymphoma, Dissociated Tumor: Targeted, Gene Signature Panel

The detailed description of this dataset can be found here.


Preparation#

Download fastq files.

$ wget https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_fastqs.tar

$ tar xvf Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_fastqs.tar

Download cell barcode info.

These are the cell-associated barcodes determined in the parent Hodgkin’s Lymphoma, Dissociated Tumor: Whole Transcriptome Analysis dataset.

$ wget https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_NGSC3_DI_HodgkinsLymphoma/Parent_NGSC3_DI_HodgkinsLymphoma_filtered_feature_bc_matrix.tar.gz

$ tar zxvf Parent_NGSC3_DI_HodgkinsLymphoma_filtered_feature_bc_matrix.tar.gz

Inspect cell barcodes.

$ gzip -dc filtered_feature_bc_matrix/barcodes.tsv.gz | head

AAACCCACAGGTTCGC-1
AAACCCACATCGATCA-1
AAACCCACATTGAGCT-1
AAACCCAGTATTTCCT-1
AAACGAACACGGTCTG-1
AAACGAAGTGGGTATG-1
AAACGAATCCTTATCA-1
AAACGCTAGTCTTCGA-1
AAACGCTGTTCTTGTT-1
AAAGAACAGATTCGAA-1

Prepare feature sequences.

The map subcommand is designed to process secondary libraries created on top of the (whole) transcriptome assays. These libraries enrich transcripts of interest through hybridization or PCR amplification, allowing for potentially more sensitive detection. The transcripts may be from endogenous genes or ectopic constructs, such as eGFP, as long as they were captured in the parent libraries.

In this tutorial, we use the 10x Genomics’ “Targeted Gene Expression” library as an example. The bait sequences serve as references for read 2 mapping. Ideally, the reference sequences should include all possible captured transcribed regions.

Read 1 contains cell barcodes and UMIs, while read 2 should contain the captured transcribed regions. The feature references should only include transcribed parts that are non-overlapping and have no introns for endogenous genes.

$ wget https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.csv

Inspect feature reference info.

$ head Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.csv

#panel_name=Human Gene Signature Panel
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
#target_panel_file_format=1.0
gene_id,bait_seq,bait_id
ENSG00000000003,AGTTGTGGACGCTCGTAAGTTTTCGGCAGTTTCCGGGGAGACTCGGGGACTCCGCGTCTCGCTCTCTGTGTTCCAATCGCCCGGTGCGGTGGTGCAGGGTCTCGGGCTAGTCATGGCGTC,ENSG00000000003|TSPAN6|1
ENSG00000000003,CCCGTCTCGGAGACTGCAGACTAAACCAGTCATTACTTGTTTCAAGAGCGTTCTGCTAATCTACACTTTTATTTTCTGGATCACTGGCGTTATCCTTCTTGCAGTTGGCATTTGGGGCAA,ENSG00000000003|TSPAN6|2
ENSG00000000003,GGTGAGCCTGGAGAATTACTTTTCTCTTTTAAATGAGAAGGCCACCAATGTCCCCTTCGTGCTCATTGCTACTGGTACCGTCATTATTCTTTTGGGCACCTTTGGTTGTTTTGCTACCTG,ENSG00000000003|TSPAN6|3
ENSG00000000003,CCGAGCTTCTGCATGGATGCTAAAACTGTATGCAATGTTTCTGACTCTCGTTTTTTTGGTCGAACTGGTCGCTGCCATCGTAGGATTTGTTTTCAGACATGAGATTAAGAACAGCTTTAA,ENSG00000000003|TSPAN6|4

Re-format.

$ grep -v '#' Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.csv | wc -l

53720

$ cut -d',' -f1,2 Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.csv | gsed 's/,/\t/g' | grep -v '#' | head -53719 > Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.tsv

$ head Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.tsv

ENSG00000000003 AGTTGTGGACGCTCGTAAGTTTTCGGCAGTTTCCGGGGAGACTCGGGGACTCCGCGTCTCGCTCTCTGTGTTCCAATCGCCCGGTGCGGTGGTGCAGGGTCTCGGGCTAGTCATGGCGTC
ENSG00000000003 CCCGTCTCGGAGACTGCAGACTAAACCAGTCATTACTTGTTTCAAGAGCGTTCTGCTAATCTACACTTTTATTTTCTGGATCACTGGCGTTATCCTTCTTGCAGTTGGCATTTGGGGCAA
ENSG00000000003 GGTGAGCCTGGAGAATTACTTTTCTCTTTTAAATGAGAAGGCCACCAATGTCCCCTTCGTGCTCATTGCTACTGGTACCGTCATTATTCTTTTGGGCACCTTTGGTTGTTTTGCTACCTG
ENSG00000000003 CCGAGCTTCTGCATGGATGCTAAAACTGTATGCAATGTTTCTGACTCTCGTTTTTTTGGTCGAACTGGTCGCTGCCATCGTAGGATTTGTTTTCAGACATGAGATTAAGAACAGCTTTAA
ENSG00000000003 GAATAATTATGAGAAGGCTTTGAAGCAGTATAACTCTACAGGAGATTATAGAAGCCATGCAGTAGACAAGATCCAAAATACGTTGCATTGTTGTGGTGTCACCGATTATAGAGATTGGAC
ENSG00000000003 AGATACTAATTATTACTCAGAAAAAGGATTTCCTAAGAGTTGCTGTAAACTTGAAGATTGTACTCCACAGAGAGATGCAGACAAAGTAAACAATGAAGGTTGTTTTATAAAGGTGATGAC
ENSG00000000003 CATTATAGAGTCAGAAATGGGAGTCGTTGCAGGAATTTCCTTTGGAGTTGCTTGCTTCCAACTGATTGGAATCTTTCTCGCCTACTGCCTCTCTCGTGCCATAACAAATAACCAGTATGA
ENSG00000000003 GATAGTGTAACCCAATGTATCTGTGGGCCTATTCCTCTCTACCTTTAAGGACATTTAGGGTCCCCCCTGTGAATTAGAAAGTTGCTTGGCTGGAGAACTGACAACACTACTTACTGATAG
ENSG00000000003 ACCAAAAAACTACACCAGTAGGTTGATTCAATCAAGATGTATGTAGACCTAAAACTACACCAATAGGCTGATTCAATCAAGATCCGTGCTCGCAGTGGGCTGATTCAATCAAGATGTATG
ENSG00000000003 TTTGCTATGTTCTAAGTCCACCTTCTATCCCATTCATGTTAGATCGTTGAAACCCTGTATCCCTCTGAAACACTGGAAGAGCTAGTAAATTGTAAATGAAGTAATACTGTGTTCCTCTTG

Matrix generation#

To begin with, we search all read 1 sequences against a set of reference cell-associated barcodes. You can adjust the search range with -r1_c (default 0,16), and the mismatching threshold with -cb_m (default 1). Next, we map read 2 sequences with correct cell barcodes (found in read 1) to provided reference sequences using an aligner such as bwa (default, set with -al, Li, H. (2013). arXiv:1303.3997.) or bowtie2 (Langmead, B., and Salzberg, S.L. (2012). Nat. Methods 9, 357–359.). We keep only alignments with a mapping quality above a threshold, which you can set with --mapq (default 10), for downstream feature counting.

For UMI deduplication, we use the UMI-tools package (Smith, T., Heger, A., and Sudbery, I. (2017). Genome Res. 27, 491–499.). You can specify the UMI starting position on read 1 with -us (default 16), and the UMI length with -ul (default 12). Fragments with a UMI length less than this value are discarded. The UMI deduplication method is set with -ud (default directional), and the UMI deduplication mismatch threshold is set with -um (default 1).

Finally, the resulting feature count matrix can be easily imported into well-established single cell analysis packages such as Seurat and Scanpy.

$ fba map \
    -1 Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_fastqs/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_S1_L003_R1_001.fastq.gz \
    -2 Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_fastqs/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_S1_L003_R2_001.fastq.gz \
    -w filtered_feature_bc_matrix/barcodes.tsv.gz \
    -f Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.tsv \
    -o matrix_featurecount.csv.gz \
    -r1_c 0,16 \
    -cb_m 1 \
    -al bwa \
    --mapq 10 \
    -us 16 \
    -ul 12 \
    -um 1 \
    -ud directional \
    --output_directory barcode_mapping

Result summary.

7.67% of total read pairs (2,405,998 of 31,372,024) contribute to the final expression matrix after UMI deduplication. Sequenced quite deep.

2021-02-17 23:33:59,615 - fba.__main__ - INFO - fba version: 0.0.7
2021-02-17 23:33:59,615 - fba.__main__ - INFO - Initiating logging ...
2021-02-17 23:33:59,615 - fba.__main__ - INFO - Python version: 3.7
2021-02-17 23:33:59,615 - fba.__main__ - INFO - Using map subcommand ...
2021-02-17 23:33:59,863 - fba.map - INFO - bwa version: 0.7.17
2021-02-17 23:34:02,116 - fba.map - INFO - samtools version: 1.3
2021-02-17 23:34:02,145 - fba.map - INFO - Number of reference cell barcodes: 3,394
2021-02-17 23:34:02,145 - fba.map - INFO - Read 1 coordinates to search: [0, 16)
2021-02-17 23:34:02,145 - fba.map - INFO - Cell barcode maximum number of mismatches: 1
2021-02-17 23:34:02,145 - fba.map - INFO - Read 1 maximum number of N allowed: 3
2021-02-17 23:34:02,145 - fba.map - INFO - Matching cell barcodes, read 1 ...
2021-02-17 23:47:07,994 - fba.map - INFO - number of read pairs processed: 31,372,024
2021-02-17 23:47:07,995 - fba.map - INFO - Number of read pairs w/ valid cell barcodes: 28,336,049
2021-02-17 23:47:08,024 - fba.map - INFO - Number of reference features: 1,142
2021-02-17 23:47:08,024 - fba.map - INFO - Number of threads: 56
2021-02-17 23:47:08,024 - fba.map - INFO - Aligning read 2 ...
2021-02-17 23:52:34,225 - fba.map - INFO -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 6222224 sequences (560000160 bp)...
[M::mem_process_seqs] Processed 6222224 reads in 1411.678 CPU sec, 87.647 real sec
[M::process] read 6222224 sequences (560000160 bp)...
[M::mem_process_seqs] Processed 6222224 reads in 484.034 CPU sec, 11.666 real sec
[M::process] read 6222224 sequences (560000160 bp)...
[M::mem_process_seqs] Processed 6222224 reads in 487.450 CPU sec, 11.070 real sec
[M::process] read 6222224 sequences (560000160 bp)...
[M::mem_process_seqs] Processed 6222224 reads in 457.438 CPU sec, 8.857 real sec
[M::process] read 3447153 sequences (310243770 bp)...
[M::mem_process_seqs] Processed 3447153 reads in 273.114 CPU sec, 8.418 real sec
[main] Version: 0.7.17-r1198-dirty
[main] CMD: /home2/s166631/bin/bwa mem -t 56 -C barcode_mapping/feature_ref.fasta barcode_mapping/modified.fq.gz
[main] Real time: 187.399 sec; CPU: 3189.278 sec
2021-02-17 23:52:34,226 - fba.map - INFO - Generating matrix (UMI deduplication) ...
2021-02-17 23:52:34,226 - fba.map - INFO - UMI-tools version: 1.1.1
2021-02-17 23:52:34,226 - fba.map - INFO - Mapping quality threshold: 10
2021-02-17 23:52:34,226 - fba.map - INFO - UMI starting position on read 1: 16
2021-02-17 23:52:34,226 - fba.map - INFO - UMI length: 12
2021-02-17 23:52:34,226 - fba.map - INFO - UMI-tools deduplication threshold: 1
2021-02-17 23:52:34,226 - fba.map - INFO - UMI-tools deduplication method: directional
2021-02-17 23:54:06,700 - fba.map - INFO - Number of cell barcodes detected: 3,379
2021-02-17 23:54:06,700 - fba.map - INFO - Number of features detected: 1,129
2021-02-17 23:54:06,704 - fba.map - INFO - Total UMIs after deduplication: 2,405,998
2021-02-17 23:54:06,713 - fba.map - INFO - Median number of UMIs per cell: 507.0
2021-02-17 23:54:11,085 - fba.__main__ - INFO - Done.