PHAGE-ATAC; Anti-CD8 Phage Hashing Single-cell ATAC-seq Using CD8 T Cells from Four Human Donors#

Dataset: PHAGE-ATAC: four anti-CD8 phage hashtags and a subsequent hashing experiment using CD8 T cells from four human donors

Fiskin, E., Lareau, C.A., Ludwig, L.S., Eraslan, G., Liu, F., Ring, A.M., Xavier, R.J., and Regev, A. (2021). Single-cell profiling of proteins and chromatin accessibility using PHAGE-ATAC. Nat. Biotechnol. 40, 374–381.


Preparation#

Download fastq files from Gene Expression Omnibus.

Inspect fastq files (This is a phage-derived tag library built on top of the single-cell ATAC-seq library, we will need all 3 end reads).

$ zcat SRR12588752_1.fastq.gz | head

@SRR12588752.1 NB501583:726:HMCKKBGXF:1:11101:16677:1031 length=34
CAGCTNTTCCTGCGCTCGACACCGTTACTTGTGT
+SRR12588752.1 NB501583:726:HMCKKBGXF:1:11101:16677:1031 length=34
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEE
@SRR12588752.2 NB501583:726:HMCKKBGXF:1:11101:14655:1033 length=34
CAGCTNTTCCTGCGCTGCTTACCGTAACTTGTGT
+SRR12588752.2 NB501583:726:HMCKKBGXF:1:11101:14655:1033 length=34
AAAAA#EEEEEEEAEEEEEEEEEEEEEEEEEEEE
@SRR12588752.3 NB501583:726:HMCKKBGXF:1:11101:17414:1034 length=34
CAGCTNTTCCTGCGCTCGACACCGTTACTTGTGT

$ zcat SRR12588752_2.fastq.gz | head

@SRR12588752.1 NB501583:726:HMCKKBGXF:1:11101:16677:1031 length=16
NNNNAGNNNNNATGNN
+SRR12588752.1 NB501583:726:HMCKKBGXF:1:11101:16677:1031 length=16
####AE#####EEE##
@SRR12588752.2 NB501583:726:HMCKKBGXF:1:11101:14655:1033 length=16
NNAACTNNNNNAAAAN
+SRR12588752.2 NB501583:726:HMCKKBGXF:1:11101:14655:1033 length=16
##//A6#####////#
@SRR12588752.3 NB501583:726:HMCKKBGXF:1:11101:17414:1034 length=16
NNTCAGTNNNTCAGGN

$ zcat SRR12588752_3.fastq.gz | head

@SRR12588752.1 NB501583:726:HMCKKBGXF:1:11101:16677:1031 length=34
GATACCGNGGNNNNNNNNNNNNNNNNNNNNNNNN
+SRR12588752.1 NB501583:726:HMCKKBGXF:1:11101:16677:1031 length=34
AAAAAEE#EE########################
@SRR12588752.2 NB501583:726:HMCKKBGXF:1:11101:14655:1033 length=34
GATACCGCGGTGTNTNANNNCNNNNNAGNNNCNN
+SRR12588752.2 NB501583:726:HMCKKBGXF:1:11101:14655:1033 length=34
AAAAAEEEEEEEE#E#E###E#####EE###E##
@SRR12588752.3 NB501583:726:HMCKKBGXF:1:11101:17414:1034 length=34
GATACCGCGGTGTATNANNNGNNNNNAANNNGNN

Prepare cell barcodes (downloaded from the manuscript’s GitHub repository). These are the cell-associated barcodes, they are generated based on the single-cell ATAC-seq library.

$ wget https://raw.githubusercontent.com/evgenijfiskin/phage-atac/master/cd8_hashing/data/barcodes.tsv

Inspect cell barcodes.

$ wc -l barcodes.tsv

8366

$ head barcodes.tsv

AAACGAAAGACCATAA-1
AAACGAAAGGAGGCGA-1
AAACGAAAGGCGATTG-1
AAACGAACAAGAGATT-1
AAACGAACACAGTTAC-1
AAACGAACACTGATTG-1
AAACGAACACTTACAG-1
AAACGAACAGCAGGTA-1
AAACGAACAGCGTACC-1
AAACGAAGTTGCAGAG-1

Prepare feature barcodes. CDR3 barcode sequences (phage-derived tags, PDTs) can be found in Supplementary Table 4 and are truncated to only keep the variable parts.

CDR3 barcode sequences#

CD8Nb PH-A

GATACCGCGGTGTATTATTGCGCAAAGGACGCGG

CD8Nb PH-B

GATACCGCGGTGTATTATTGCGCTAAAGACGCGG

CD8Nb PH-C

CAGCTCTTCCTGCGCTGCTTACCGTAACTTGTGT

CD8Nb PH-D

CAGCTCTTCCTGCGCTGCTTACAGTGACCTGTGT

$ cat feature_barcodes_R3_truncated.tsv

CD8Nb_PH-A      CAAAGGACGCGG
CD8Nb_PH-B      CTAAAGACGCGG

$ cat feature_barcodes_R1_truncated.tsv

CD8Nb_PH-C      CGTAACTTGTGT
CD8Nb_PH-D      AGTGACCTGTGT

First, we screened reads for the constant sequence (GATACCGCGGTGTATTATTGCG) at the beginning of the CDR3 barcode sequences on read 3 using cutadapt (version 3.7).

$ cutadapt \
    --cores 0 \
    --front GATACCGCGGTGTATTATTGCG \
    --minimum-length 12:16 \
    --trimmed-only \
    --output SRR12588752_3_trimmed.fq.gz --paired-output SRR12588752_2_trimmed.fq.gz \
    SRR12588752_3.fastq.gz SRR12588752_2.fastq.gz

Preview the filtering result: 51,140,637 out of 54,274,791 (94.2%) read pairs are kept for phage-derived tag (PDT) identification.

== Read fate breakdown ==
Pairs that were too short:             652,917 (1.2%)
Pairs discarded as untrimmed:        2,481,237 (4.6%)
Pairs written (passing filters):    51,140,637 (94.2%)

Then, for read 1 (CAGCTCTTCCTGCGCTGCTTAC).

$ cutadapt \
    --cores 0 \
    --front CAGCTCTTCCTGCGCTGCTTAC \
    --minimum-length 12:16 \
    --trimmed-only \
    --output SRR12588752_1_trimmed.fq.gz --paired-output SRR12588752_2_trimmed.fq.gz \
    SRR12588752_1.fastq.gz SRR12588752_2.fastq.gz

Preview the filtering result: 25,988,762 out of 54,274,791 (47.9%) read pairs are kept for phage-derived tag (PDT) identification.

== Read fate breakdown ==
Pairs that were too short:              16,644 (0.0%)
Pairs discarded as untrimmed:       28,269,385 (52.1%)
Pairs written (passing filters):    25,988,762 (47.9%)

QC#

The first 10,000 read pairs are sampled (set by -n, default 100,000) for quality control. The -t option can be used to set the number of threads. By default, diagnostic results and plots are generated in the qc directory (set by --output_directory), and the full length of read 1 and read 2 are searched against reference cell and feature barcodes, respectively. The per base content of both read pairs and the distribution of matched barcode positions are summarized. Use -r1_c and/or -r2_c to limit the search range, and -cb_n and/or -fb_n to set the mismatch tolerance for cell and/or feature barcode matching (default 3).

This library was constructed using 10x Genomics’ Chromium Single Cell ATAC Reagent Kits, where the 16-base pair cell barcode is sequenced during the i5 index read. In Cell Ranger ATAC, the raw 16 bp sequences may be transformed into their reverse-complement counterparts as cell barcodes in the outputs. To achieve the same result in fba, use -cb_rc to reverse-complement the cell barcode sequences.

To achieve the same result in fba,

R3#

$ fba qc \
    -1 SRR12588752_2_trimmed.fq.gz \
    -2 SRR12588752_3_trimmed.fq.gz \
    -w barcodes.tsv \
    -f feature_barcodes_R3_truncated.txt \
    -cb_rc \
    -n 10000

This library is constructed using the Chromium Single Cell ATAC Reagent Kits and sequenced on Illumina NextSeq 500. The GC content of cell barcodes (read 2) are quite even.

../../../_images/Pyplot_read1_per_base_seq_content_trimmed_r3.webp ../../../_images/Pyplot_read1_barcodes_starting_ending_trimmed_r3.webp

As for read 3, based on the per base content, it suggests low complexity.

pic1 pic2

The detailed qc results are stored in the feature_barcoding_output.tsv.gz file. The matching_pos columns indicate the matched positions on reads, while the matching_description columns indicate mismatches in the format of substitutions:insertions:deletions.

$ gzip -dc feature_barcoding_output.tsv.gz | head

read1_seq       cell_barcode    cb_num_mismatches       read2_seq       feature_barcode fb_num_mismatches
NTGTTGCTGGTTAGAA        CTGTTGCTGGTTAGAA        1       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
NTCGACCGATTGCGTA        GTCGACCGATTGCGTA        1       CTAAAGACGCGG    CD8Nb_PH-B_CTAAAGACGCGG 0
GCCGAACTGTTAGAAG        GCCGAACTGTTAGAAG        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
TGAGCGCACACCTTGA        TGAGCGCACACCTTGA        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
AATTCTGCTTGGCTGC        AATTCTGCTTGGCTGC        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
GGAATGGTGACCGTGC        GGAATGGTGACCGTGC        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
AGGAATTGATTCGCCT        AGGAATTGATTCGCCT        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
CCAAGTTGATAATAGG        CCAAGTTGATAATAGG        0       CTAAAGACGCGG    CD8Nb_PH-B_CTAAAGACGCGG 0
CCGCAAGTGAATCCAC        CCGCAAGTGAATCCAC        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0

R1#

$ fba qc \
    -1 SRR12588752_2_trimmed.fq.gz \
    -2 SRR12588752_1_trimmed.fq.gz \
    -w barcodes.tsv \
    -f feature_barcodes_R1_truncated.txt \
    -cb_rc \
    -n 10000

For read 1, the per base content plot suggests low complexity, with a high proportion of constant bases at the beginning of the reads.

pic3 pic4

The detailed qc results are stored in the feature_barcoding_output.tsv.gz file. The matching_pos columns indicate the matched positions on reads, while the matching_description columns indicate mismatches in the format of substitutions:insertions:deletions.

$ gzip -dc feature_barcoding_output.tsv.gz | head

read1_seq       cell_barcode    cb_num_mismatches       read2_seq       feature_barcode fb_num_mismatches
NCTCGGGACGTCTGGC        ACTCGGGACGTCTGGC        1       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
NCTAAGACTTTATGGC        GCTAAGACTTTATGGC        1       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
NACGGAAGATCGTAAC        CACGGAAGATCGTAAC        1       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
NTGTTGTGAGTCCCGA        GTGTTGTGAGTCCCGA        1       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
CCTCCTGCTATCAGGG        CCTCCTGCTATCAGGG        0       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
GTTGATTCTCGAAGCA        GTTGATTCTCGAAGCA        0       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
TGGTTAGACTCCGTAA        TGGTTAGACTCCGTAA        0       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
GCCTCTTGACTGGGTC        GCCTCTTGACTGGGTC        0       CGTAACTTGTGT    CD8Nb_PH-C_CGTAACTTGTGT 0
AGGTAGCGAGAGTAAT        AGGTAGCGAGAGTAAT        0       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0

Barcode extraction#

R3#

Search ranges are set to 0,16 on read 2 and 0,12 on read 3. One mismatch for cell and feature barcodes (-cb_m, -cf_m) are allowed. Use -cb_rc to reverse-complement cell barcode sequences in the output if needed.

$ fba extract \
    -1 SRR12588752_2_trimmed.fq.gz \
    -2 SRR12588752_3_trimmed.fq.gz \
    -w barcodes.tsv \
    -f feature_barcodes_R3_truncated.txt \
    -o feature_barcoding_output_R3.tsv.gz \
    -r1_c 0,16 \
    -r2_c 0,12 \
    -cb_m 1 \
    -fb_m 1 \
    -cb_rc

Preview of result.

$ gzip -dc feature_barcoding_output_R3.tsv.gz | head

read1_seq       cell_barcode    cb_num_mismatches       read2_seq       feature_barcode fb_num_mismatches
NTGTTGCTGGTTAGAA        CTGTTGCTGGTTAGAA        1       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
NTCGACCGATTGCGTA        GTCGACCGATTGCGTA        1       CTAAAGACGCGG    CD8Nb_PH-B_CTAAAGACGCGG 0
GCCGAACTGTTAGAAG        GCCGAACTGTTAGAAG        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
TGAGCGCACACCTTGA        TGAGCGCACACCTTGA        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
AATTCTGCTTGGCTGC        AATTCTGCTTGGCTGC        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
GGAATGGTGACCGTGC        GGAATGGTGACCGTGC        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
AGGAATTGATTCGCCT        AGGAATTGATTCGCCT        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0
CCAAGTTGATAATAGG        CCAAGTTGATAATAGG        0       CTAAAGACGCGG    CD8Nb_PH-B_CTAAAGACGCGG 0
CCGCAAGTGAATCCAC        CCGCAAGTGAATCCAC        0       CAAAGGACGCGG    CD8Nb_PH-A_CAAAGGACGCGG 0

Result summary.

10,543,901 out of 51,140,637 read pairs have valid cell and feature barcodes.

2022-03-13 00:13:02,564 - fba.__main__ - INFO - fba version: 0.0.12
2022-03-13 00:13:02,564 - fba.__main__ - INFO - Initiating logging ...
2022-03-13 00:13:02,564 - fba.__main__ - INFO - Python version: 3.10
2022-03-13 00:13:02,564 - fba.__main__ - INFO - Using extract subcommand ...
2022-03-13 00:13:02,589 - fba.levenshtein - INFO - Number of reference cell barcodes: 8,366
2022-03-13 00:13:02,590 - fba.levenshtein - INFO - Number of reference feature barcodes: 2
2022-03-13 00:13:02,590 - fba.levenshtein - INFO - Read 1 coordinates to search: [0, 16)
2022-03-13 00:13:02,590 - fba.levenshtein - INFO - Read 2 coordinates to search: [0, 12)
2022-03-13 00:13:02,590 - fba.levenshtein - INFO - Cell barcode maximum number of mismatches: 1
2022-03-13 00:13:02,590 - fba.levenshtein - INFO - Feature barcode maximum number of mismatches: 1
2022-03-13 00:13:02,590 - fba.levenshtein - INFO - Read 1 maximum number of N allowed: 3
2022-03-13 00:13:02,590 - fba.levenshtein - INFO - Read 2 maximum number of N allowed: 3
2022-03-13 00:13:02,809 - fba.levenshtein - INFO - Matching ...
2022-03-13 00:16:00,978 - fba.levenshtein - INFO - Read pairs processed: 10,000,000
2022-03-13 00:18:58,488 - fba.levenshtein - INFO - Read pairs processed: 20,000,000
2022-03-13 00:21:55,956 - fba.levenshtein - INFO - Read pairs processed: 30,000,000
2022-03-13 00:24:53,698 - fba.levenshtein - INFO - Read pairs processed: 40,000,000
2022-03-13 00:27:51,819 - fba.levenshtein - INFO - Read pairs processed: 50,000,000
2022-03-13 00:28:12,045 - fba.levenshtein - INFO - Number of read pairs processed: 51,140,637
2022-03-13 00:28:12,045 - fba.levenshtein - INFO - Number of read pairs w/ valid barcodes: 10,543,901
2022-03-13 00:28:12,060 - fba.__main__ - INFO - Done.

R1#

Search ranges are set to 0,16 on read 2 and 0,12 on read 1. One mismatch for cell and feature barcodes (-cb_m, -cf_m) are allowed. Use -cb_rc to reverse-complement cell barcode sequences in the output if needed.

$ fba extract \
    -1 SRR12588752_2_trimmed.fq.gz \
    -2 SRR12588752_1_trimmed.fq.gz \
    -w barcodes.tsv \
    -f feature_barcodes_R1_truncated.txt \
    -o feature_barcoding_output_R1.tsv.gz \
    -r1_c 0,16 \
    -r2_c 0,12 \
    -cb_m 1 \
    -fb_m 1 \
    -cb_rc

Preview of result.

$ gzip -dc feature_barcoding_output_R1.tsv.gz | head

read1_seq       cell_barcode    cb_num_mismatches       read2_seq       feature_barcode fb_num_mismatches
NCTCGGGACGTCTGGC        ACTCGGGACGTCTGGC        1       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
NCTAAGACTTTATGGC        GCTAAGACTTTATGGC        1       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
NACGGAAGATCGTAAC        CACGGAAGATCGTAAC        1       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
NTGTTGTGAGTCCCGA        GTGTTGTGAGTCCCGA        1       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
CCTCCTGCTATCAGGG        CCTCCTGCTATCAGGG        0       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
GTTGATTCTCGAAGCA        GTTGATTCTCGAAGCA        0       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
TGGTTAGACTCCGTAA        TGGTTAGACTCCGTAA        0       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0
GCCTCTTGACTGGGTC        GCCTCTTGACTGGGTC        0       CGTAACTTGTGT    CD8Nb_PH-C_CGTAACTTGTGT 0
AGGTAGCGAGAGTAAT        AGGTAGCGAGAGTAAT        0       AGTGACCTGTGT    CD8Nb_PH-D_AGTGACCTGTGT 0

Result summary.

11,128,546 out of 25,988,762 read pairs have valid cell and feature barcodes.

2022-03-12 23:29:33,460 - fba.__main__ - INFO - fba version: 0.0.12
2022-03-12 23:29:33,460 - fba.__main__ - INFO - Initiating logging ...
2022-03-12 23:29:33,460 - fba.__main__ - INFO - Python version: 3.10
2022-03-12 23:29:33,460 - fba.__main__ - INFO - Using extract subcommand ...
2022-03-12 23:29:33,488 - fba.levenshtein - INFO - Number of reference cell barcodes: 8,366
2022-03-12 23:29:33,488 - fba.levenshtein - INFO - Number of reference feature barcodes: 2
2022-03-12 23:29:33,488 - fba.levenshtein - INFO - Read 1 coordinates to search: [0, 16)
2022-03-12 23:29:33,488 - fba.levenshtein - INFO - Read 2 coordinates to search: [0, 12)
2022-03-12 23:29:33,488 - fba.levenshtein - INFO - Cell barcode maximum number of mismatches: 1
2022-03-12 23:29:33,488 - fba.levenshtein - INFO - Feature barcode maximum number of mismatches: 1
2022-03-12 23:29:33,488 - fba.levenshtein - INFO - Read 1 maximum number of N allowed: 3
2022-03-12 23:29:33,488 - fba.levenshtein - INFO - Read 2 maximum number of N allowed: 3
2022-03-12 23:29:33,707 - fba.levenshtein - INFO - Matching ...
2022-03-12 23:33:10,471 - fba.levenshtein - INFO - Read pairs processed: 10,000,000
2022-03-12 23:36:47,019 - fba.levenshtein - INFO - Read pairs processed: 20,000,000
2022-03-12 23:38:56,544 - fba.levenshtein - INFO - Number of read pairs processed: 25,988,762
2022-03-12 23:38:56,544 - fba.levenshtein - INFO - Number of read pairs w/ valid barcodes: 11,128,546
2022-03-12 23:38:56,558 - fba.__main__ - INFO - Done.

Matrix generation#

Only fragments with correctly matched cell and feature barcodes are included. Use -ul to set the UMI length (default 12). Setting to 0 means no UMIs and read counts are summarized instead. Use -cb_rc to reverse-complement cell barcode sequences in the output matrix if needed. The generated feature count matrix can be easily imported into well-established single cell analysis packages such as Seurat and Scanpy.

$ fba count \
    -i feature_barcoding_output_R1.tsv.gz \
    -i feature_barcoding_output_R3.tsv.gz \
    -o matrix_featurecount.csv.gz \
    -ul 0

Result summary.

39.9 % (21,672,447 out of 54,274,791) of total read pairs have valid cell and feature barcodes. The median number of reads per cell for this phage-derived tag library is 2,261.0.

2022-03-13 00:36:01,502 - fba.__main__ - INFO - fba version: 0.0.12
2022-03-13 00:36:01,502 - fba.__main__ - INFO - Initiating logging ...
2022-03-13 00:36:01,502 - fba.__main__ - INFO - Python version: 3.9
2022-03-13 00:36:01,502 - fba.__main__ - INFO - Using count subcommand ...
2022-03-13 00:36:02,348 - fba.count - INFO - UMI-tools version: 1.1.1
2022-03-13 00:36:02,348 - fba.count - INFO - UMI length set to 0, ignoring UMI information. Skipping arguments: "-us/--umi_start".
2022-03-13 00:36:02,348 - fba.count - INFO - Header: read1_seq cell_barcode cb_num_mismatches read2_seq feature_barcode fb_num_mismatches
2022-03-13 00:36:20,914 - fba.count - INFO - Number of read pairs processed: 21,672,447
2022-03-13 00:36:20,917 - fba.count - INFO - Number of cell barcodes detected: 8,366
2022-03-13 00:36:20,917 - fba.count - INFO - Number of features detected: 4
2022-03-13 00:36:20,917 - fba.count - INFO - Counting ...
2022-03-13 00:36:21,009 - fba.count - INFO - Total reads: 21,672,447
2022-03-13 00:36:21,016 - fba.count - INFO - Median number of reads per cell: 2,261.0
2022-03-13 00:36:21,103 - fba.__main__ - INFO - Done.

Demultiplexing#

Negative binomial distribution#

Cells are demultiplexed based on the feature count matrix using demultiplexing method 1 (set by -dm), which is implemented based on the method described by Stoeckius, M., et al. (2018) with some modifications. The output directory for demultiplexing is set by --output_directory (default demultiplexed). A cell identity matrix is generated, where 0 indicates negative and 1 indicates positive. To adjust the quantile threshold for demultiplexing, use -q (default 0.9999). To generate visualization plots, set -v.

$ fba demultiplex \
    -i matrix_featurecount.csv.gz \
    -q 0.99 \
    -v
2022-03-13 00:47:41,569 - fba.__main__ - INFO - fba version: 0.0.12
2022-03-13 00:47:41,569 - fba.__main__ - INFO - Initiating logging ...
2022-03-13 00:47:41,569 - fba.__main__ - INFO - Python version: 3.10
2022-03-13 00:47:41,569 - fba.__main__ - INFO - Using demultiplex subcommand ...
2022-03-13 00:47:49,145 - fba.__main__ - INFO - Skipping arguments: "-p/--prob"
2022-03-13 00:47:49,145 - fba.demultiplex - INFO - Output directory: demultiplexed
2022-03-13 00:47:49,145 - fba.demultiplex - INFO - Demultiplexing method: 1
2022-03-13 00:47:49,146 - fba.demultiplex - INFO - UMI normalization method: clr
2022-03-13 00:47:49,146 - fba.demultiplex - INFO - Visualization: On
2022-03-13 00:47:49,146 - fba.demultiplex - INFO - Visualization method: tsne
2022-03-13 00:47:49,146 - fba.demultiplex - INFO - Loading feature count matrix: matrix_featurecount.csv.gz ...
2022-03-13 00:47:49,324 - fba.demultiplex - INFO - Number of cells: 8,366
2022-03-13 00:47:49,324 - fba.demultiplex - INFO - Number of positive cells for a feature to be included: 200
2022-03-13 00:47:49,327 - fba.demultiplex - INFO - Number of features: 4 / 4 (after filtering / original in the matrix)
2022-03-13 00:47:49,327 - fba.demultiplex - INFO - Features: CD8Nb_PH-A CD8Nb_PH-B CD8Nb_PH-C CD8Nb_PH-D
2022-03-13 00:47:49,327 - fba.demultiplex - INFO - Total UMIs/reads: 21,672,447 / 21,672,447
2022-03-13 00:47:49,328 - fba.demultiplex - INFO - Median number of UMIs/reads per cell: 2,261.0 / 2,261.0
2022-03-13 00:47:49,328 - fba.demultiplex - INFO - Demultiplexing ...
2022-03-13 00:48:53,685 - fba.demultiplex - INFO - Generating heatmap ...
2022-03-13 00:48:59,759 - fba.demultiplex - INFO - Embedding ...
2022-03-13 00:49:20,128 - fba.__main__ - INFO - Done.

Heatmap of the relative abundance of features (phage-derived tags, PDTs) across all cells. Each column represents a single cell. This is a re-creation of Fig. 3b in Fiskin, E., et al. (2021).

Heatmap

Preview the demultiplexing result: the numbers of singlets, multiplets and negatives are 6,373 (76.2%), 639 (7.6%), and 1,354 (16.2%), respectively.

In [1]: import pandas as pd

In [2]: m = pd.read_csv("demultiplexed/matrix_cell_identity.csv.gz", index_col=0)

In [3]: m.loc[:, m.sum(axis=0) == 1].sum(axis=1)
Out[3]:
CD8Nb_PH-A    1640
CD8Nb_PH-B    1601
CD8Nb_PH-C    1564
CD8Nb_PH-D    1568
dtype: int64

In [4]: sum(m.sum(axis=0) == 1)
Out[4]: 6373

In [5]: sum(m.sum(axis=0) > 1)
Out[5]: 639

In [6]: sum(m.sum(axis=0) == 0)
Out[6]: 1354

In [7]: m.shape
Out[7]: (4, 8366)

t-SNE embedding of cells based on the abundance of features (phage-derived tags, no transcriptome information used). Colors indicate the hashtag status for each cell, as called by FBA. This is a re-creation of Fig. 3d in Fiskin, E., et al. (2021).

t-SNE embedding

Gaussian mixture model#

The implementation of demultiplexing method 2 (set by -dm) is inspired by the method described on the 10x Genomics’ website. To set the probability threshold for demultiplexing, use -p (default 0.9). To generate visualization plots, set -v.

$ fba demultiplex \
    -i matrix_featurecount.csv.gz \
    -dm 2 \
    -v
2022-03-13 11:27:47,035 - fba.__main__ - INFO - fba version: 0.0.12
2022-03-13 11:27:47,035 - fba.__main__ - INFO - Initiating logging ...
2022-03-13 11:27:47,035 - fba.__main__ - INFO - Python version: 3.9
2022-03-13 11:27:47,035 - fba.__main__ - INFO - Using demultiplex subcommand ...
2022-03-13 11:27:49,515 - fba.__main__ - INFO - Skipping arguments: "-q/--quantile", "-cm/--clustering_method"
2022-03-13 11:27:49,515 - fba.demultiplex - INFO - Output directory: demultiplexed
2022-03-13 11:27:49,515 - fba.demultiplex - INFO - Demultiplexing method: 2
2022-03-13 11:27:49,515 - fba.demultiplex - INFO - UMI normalization method: clr
2022-03-13 11:27:49,515 - fba.demultiplex - INFO - Visualization: On
2022-03-13 11:27:49,515 - fba.demultiplex - INFO - Visualization method: tsne
2022-03-13 11:27:49,515 - fba.demultiplex - INFO - Loading feature count matrix: matrix_featurecount.csv.gz ...
2022-03-13 11:27:49,595 - fba.demultiplex - INFO - Number of cells: 8,366
2022-03-13 11:27:49,595 - fba.demultiplex - INFO - Number of positive cells for a feature to be included: 200
2022-03-13 11:27:49,607 - fba.demultiplex - INFO - Number of features: 4 / 4 (after filtering / original in the matrix)
2022-03-13 11:27:49,607 - fba.demultiplex - INFO - Features: CD8Nb_PH-A CD8Nb_PH-B CD8Nb_PH-C CD8Nb_PH-D
2022-03-13 11:27:49,607 - fba.demultiplex - INFO - Total UMIs: 21,672,447 / 21,672,447
2022-03-13 11:27:49,614 - fba.demultiplex - INFO - Median number of UMIs/reads per cell: 2,261.0 / 2,261.0
2022-03-13 11:27:49,614 - fba.demultiplex - INFO - Demultiplexing ...
2022-03-13 11:27:51,392 - fba.demultiplex - INFO - Generating heatmap ...
2022-03-13 11:27:53,158 - fba.demultiplex - INFO - Embedding ...
2022-03-13 11:28:08,072 - fba.__main__ - INFO - Done.

Heatmap of the relative abundance of features (phage-derived tags, PDTs) across all cells. Each column represents a single cell. This is a re-creation of Fig. 3b in Fiskin, E., et al. (2021).

Heatmap

Preview the demultiplexing result: the numbers of singlets, multiplets and negatives are 6,510 (77.8%), 709 (8.5%), and 1,147 (13.7%), respectively.

In [1]: import pandas as pd

In [2]: m = pd.read_csv("demultiplexed/matrix_cell_identity.csv.gz", index_col=0)

In [3]: m.loc[:, m.sum(axis=0) == 1].sum(axis=1)
Out[3]:
CD8Nb_PH-A    1680
CD8Nb_PH-B    1637
CD8Nb_PH-C    1646
CD8Nb_PH-D    1547
dtype: int64

In [4]: sum(m.sum(axis=0) == 1)
Out[4]: 6510

In [5]: sum(m.sum(axis=0) > 1)
Out[5]: 709

In [6]: sum(m.sum(axis=0) == 0)
Out[6]: 1147

In [7]: m.shape
Out[7]: (4, 8366)

t-SNE embedding of cells based on the abundance of features (phage-derived tags, no transcriptome information used). Colors indicate the hashtag status for each cell, as called by FBA. This is a re-creation of Fig. 3d in Fiskin, E., et al. (2021).

t-SNE embedding

Read distribution and model fitting threshold:

UMI distribution UMI distribution UMI distribution UMI distribution