Hodgkin’s Lymphoma, Dissociated Tumor: Targeted, Gene Signature Panel#
Dataset: Hodgkin’s Lymphoma, Dissociated Tumor: Targeted, Gene Signature Panel
The detailed description of this dataset can be found here.
Preparation#
Download fastq files.
$ wget https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_fastqs.tar
$ tar xvf Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_fastqs.tar
Download cell barcode info.
These are the cell-associated barcodes determined in the parent Hodgkin’s Lymphoma, Dissociated Tumor: Whole Transcriptome Analysis dataset.
$ wget https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_NGSC3_DI_HodgkinsLymphoma/Parent_NGSC3_DI_HodgkinsLymphoma_filtered_feature_bc_matrix.tar.gz
$ tar zxvf Parent_NGSC3_DI_HodgkinsLymphoma_filtered_feature_bc_matrix.tar.gz
Inspect cell barcodes.
$ gzip -dc filtered_feature_bc_matrix/barcodes.tsv.gz | head
AAACCCACAGGTTCGC-1
AAACCCACATCGATCA-1
AAACCCACATTGAGCT-1
AAACCCAGTATTTCCT-1
AAACGAACACGGTCTG-1
AAACGAAGTGGGTATG-1
AAACGAATCCTTATCA-1
AAACGCTAGTCTTCGA-1
AAACGCTGTTCTTGTT-1
AAAGAACAGATTCGAA-1
Prepare feature sequences.
The map
subcommand is designed to process secondary libraries
created on top of the (whole) transcriptome assays. These libraries
enrich transcripts of interest through hybridization or PCR
amplification, allowing for potentially more sensitive detection. The
transcripts may be from endogenous genes or ectopic constructs, such as
eGFP, as long as they were captured in the parent libraries.
In this tutorial, we use the 10x Genomics’ “Targeted Gene Expression” library as an example. The bait sequences serve as references for read 2 mapping. Ideally, the reference sequences should include all possible captured transcribed regions.
Read 1 contains cell barcodes and UMIs, while read 2 should contain the captured transcribed regions. The feature references should only include transcribed parts that are non-overlapping and have no introns for endogenous genes.
$ wget https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.csv
Inspect feature reference info.
$ head Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.csv
#panel_name=Human Gene Signature Panel
#panel_type=predesigned
#reference_genome=GRCh38
#reference_version=2020-A
#target_panel_file_format=1.0
gene_id,bait_seq,bait_id
ENSG00000000003,AGTTGTGGACGCTCGTAAGTTTTCGGCAGTTTCCGGGGAGACTCGGGGACTCCGCGTCTCGCTCTCTGTGTTCCAATCGCCCGGTGCGGTGGTGCAGGGTCTCGGGCTAGTCATGGCGTC,ENSG00000000003|TSPAN6|1
ENSG00000000003,CCCGTCTCGGAGACTGCAGACTAAACCAGTCATTACTTGTTTCAAGAGCGTTCTGCTAATCTACACTTTTATTTTCTGGATCACTGGCGTTATCCTTCTTGCAGTTGGCATTTGGGGCAA,ENSG00000000003|TSPAN6|2
ENSG00000000003,GGTGAGCCTGGAGAATTACTTTTCTCTTTTAAATGAGAAGGCCACCAATGTCCCCTTCGTGCTCATTGCTACTGGTACCGTCATTATTCTTTTGGGCACCTTTGGTTGTTTTGCTACCTG,ENSG00000000003|TSPAN6|3
ENSG00000000003,CCGAGCTTCTGCATGGATGCTAAAACTGTATGCAATGTTTCTGACTCTCGTTTTTTTGGTCGAACTGGTCGCTGCCATCGTAGGATTTGTTTTCAGACATGAGATTAAGAACAGCTTTAA,ENSG00000000003|TSPAN6|4
Re-format.
$ grep -v '#' Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.csv | wc -l
53720
$ cut -d',' -f1,2 Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.csv | gsed 's/,/\t/g' | grep -v '#' | head -53719 > Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.tsv
$ head Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.tsv
ENSG00000000003 AGTTGTGGACGCTCGTAAGTTTTCGGCAGTTTCCGGGGAGACTCGGGGACTCCGCGTCTCGCTCTCTGTGTTCCAATCGCCCGGTGCGGTGGTGCAGGGTCTCGGGCTAGTCATGGCGTC
ENSG00000000003 CCCGTCTCGGAGACTGCAGACTAAACCAGTCATTACTTGTTTCAAGAGCGTTCTGCTAATCTACACTTTTATTTTCTGGATCACTGGCGTTATCCTTCTTGCAGTTGGCATTTGGGGCAA
ENSG00000000003 GGTGAGCCTGGAGAATTACTTTTCTCTTTTAAATGAGAAGGCCACCAATGTCCCCTTCGTGCTCATTGCTACTGGTACCGTCATTATTCTTTTGGGCACCTTTGGTTGTTTTGCTACCTG
ENSG00000000003 CCGAGCTTCTGCATGGATGCTAAAACTGTATGCAATGTTTCTGACTCTCGTTTTTTTGGTCGAACTGGTCGCTGCCATCGTAGGATTTGTTTTCAGACATGAGATTAAGAACAGCTTTAA
ENSG00000000003 GAATAATTATGAGAAGGCTTTGAAGCAGTATAACTCTACAGGAGATTATAGAAGCCATGCAGTAGACAAGATCCAAAATACGTTGCATTGTTGTGGTGTCACCGATTATAGAGATTGGAC
ENSG00000000003 AGATACTAATTATTACTCAGAAAAAGGATTTCCTAAGAGTTGCTGTAAACTTGAAGATTGTACTCCACAGAGAGATGCAGACAAAGTAAACAATGAAGGTTGTTTTATAAAGGTGATGAC
ENSG00000000003 CATTATAGAGTCAGAAATGGGAGTCGTTGCAGGAATTTCCTTTGGAGTTGCTTGCTTCCAACTGATTGGAATCTTTCTCGCCTACTGCCTCTCTCGTGCCATAACAAATAACCAGTATGA
ENSG00000000003 GATAGTGTAACCCAATGTATCTGTGGGCCTATTCCTCTCTACCTTTAAGGACATTTAGGGTCCCCCCTGTGAATTAGAAAGTTGCTTGGCTGGAGAACTGACAACACTACTTACTGATAG
ENSG00000000003 ACCAAAAAACTACACCAGTAGGTTGATTCAATCAAGATGTATGTAGACCTAAAACTACACCAATAGGCTGATTCAATCAAGATCCGTGCTCGCAGTGGGCTGATTCAATCAAGATGTATG
ENSG00000000003 TTTGCTATGTTCTAAGTCCACCTTCTATCCCATTCATGTTAGATCGTTGAAACCCTGTATCCCTCTGAAACACTGGAAGAGCTAGTAAATTGTAAATGAAGTAATACTGTGTTCCTCTTG
Matrix generation#
To begin with, we search all read 1 sequences against a set of reference
cell-associated barcodes. You can adjust the search range with -r1_c
(default 0,16
), and the mismatching threshold with -cb_m
(default 1
). Next, we map read 2 sequences with correct cell
barcodes (found in read 1) to provided reference sequences using an
aligner such as bwa
(default, set with -al
, Li, H. (2013).
arXiv:1303.3997.) or bowtie2
(Langmead, B., and Salzberg, S.L.
(2012). Nat. Methods 9, 357–359.). We keep only alignments with a
mapping quality above a threshold, which you can set with --mapq
(default 10
), for downstream feature counting.
For UMI deduplication, we use the UMI-tools package (Smith, T., Heger,
A., and Sudbery, I. (2017). Genome Res. 27, 491–499.). You can specify
the UMI starting position on read 1 with -us
(default 16
), and
the UMI length with -ul
(default 12
). Fragments with a UMI
length less than this value are discarded. The UMI deduplication method
is set with -ud
(default directional
), and the UMI deduplication
mismatch threshold is set with -um
(default 1
).
Finally, the resulting feature count matrix can be easily imported into well-established single cell analysis packages such as Seurat and Scanpy.
$ fba map \
-1 Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_fastqs/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_S1_L003_R1_001.fastq.gz \
-2 Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_fastqs/Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_S1_L003_R2_001.fastq.gz \
-w filtered_feature_bc_matrix/barcodes.tsv.gz \
-f Targeted_NGSC3_DI_HodgkinsLymphoma_GeneSignature_target_panel.tsv \
-o matrix_featurecount.csv.gz \
-r1_c 0,16 \
-cb_m 1 \
-al bwa \
--mapq 10 \
-us 16 \
-ul 12 \
-um 1 \
-ud directional \
--output_directory barcode_mapping
Result summary.
7.67% of total read pairs (2,405,998 of 31,372,024) contribute to the final expression matrix after UMI deduplication. Sequenced quite deep.
2021-02-17 23:33:59,615 - fba.__main__ - INFO - fba version: 0.0.7
2021-02-17 23:33:59,615 - fba.__main__ - INFO - Initiating logging ...
2021-02-17 23:33:59,615 - fba.__main__ - INFO - Python version: 3.7
2021-02-17 23:33:59,615 - fba.__main__ - INFO - Using map subcommand ...
2021-02-17 23:33:59,863 - fba.map - INFO - bwa version: 0.7.17
2021-02-17 23:34:02,116 - fba.map - INFO - samtools version: 1.3
2021-02-17 23:34:02,145 - fba.map - INFO - Number of reference cell barcodes: 3,394
2021-02-17 23:34:02,145 - fba.map - INFO - Read 1 coordinates to search: [0, 16)
2021-02-17 23:34:02,145 - fba.map - INFO - Cell barcode maximum number of mismatches: 1
2021-02-17 23:34:02,145 - fba.map - INFO - Read 1 maximum number of N allowed: 3
2021-02-17 23:34:02,145 - fba.map - INFO - Matching cell barcodes, read 1 ...
2021-02-17 23:47:07,994 - fba.map - INFO - number of read pairs processed: 31,372,024
2021-02-17 23:47:07,995 - fba.map - INFO - Number of read pairs w/ valid cell barcodes: 28,336,049
2021-02-17 23:47:08,024 - fba.map - INFO - Number of reference features: 1,142
2021-02-17 23:47:08,024 - fba.map - INFO - Number of threads: 56
2021-02-17 23:47:08,024 - fba.map - INFO - Aligning read 2 ...
2021-02-17 23:52:34,225 - fba.map - INFO -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 6222224 sequences (560000160 bp)...
[M::mem_process_seqs] Processed 6222224 reads in 1411.678 CPU sec, 87.647 real sec
[M::process] read 6222224 sequences (560000160 bp)...
[M::mem_process_seqs] Processed 6222224 reads in 484.034 CPU sec, 11.666 real sec
[M::process] read 6222224 sequences (560000160 bp)...
[M::mem_process_seqs] Processed 6222224 reads in 487.450 CPU sec, 11.070 real sec
[M::process] read 6222224 sequences (560000160 bp)...
[M::mem_process_seqs] Processed 6222224 reads in 457.438 CPU sec, 8.857 real sec
[M::process] read 3447153 sequences (310243770 bp)...
[M::mem_process_seqs] Processed 3447153 reads in 273.114 CPU sec, 8.418 real sec
[main] Version: 0.7.17-r1198-dirty
[main] CMD: /home2/s166631/bin/bwa mem -t 56 -C barcode_mapping/feature_ref.fasta barcode_mapping/modified.fq.gz
[main] Real time: 187.399 sec; CPU: 3189.278 sec
2021-02-17 23:52:34,226 - fba.map - INFO - Generating matrix (UMI deduplication) ...
2021-02-17 23:52:34,226 - fba.map - INFO - UMI-tools version: 1.1.1
2021-02-17 23:52:34,226 - fba.map - INFO - Mapping quality threshold: 10
2021-02-17 23:52:34,226 - fba.map - INFO - UMI starting position on read 1: 16
2021-02-17 23:52:34,226 - fba.map - INFO - UMI length: 12
2021-02-17 23:52:34,226 - fba.map - INFO - UMI-tools deduplication threshold: 1
2021-02-17 23:52:34,226 - fba.map - INFO - UMI-tools deduplication method: directional
2021-02-17 23:54:06,700 - fba.map - INFO - Number of cell barcodes detected: 3,379
2021-02-17 23:54:06,700 - fba.map - INFO - Number of features detected: 1,129
2021-02-17 23:54:06,704 - fba.map - INFO - Total UMIs after deduplication: 2,405,998
2021-02-17 23:54:06,713 - fba.map - INFO - Median number of UMIs per cell: 507.0
2021-02-17 23:54:11,085 - fba.__main__ - INFO - Done.