15k HEK293 and 40k HMECs Multiplexed by Lipid- and Cholesterol-tagged Indices

Dataset: MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices

McGinnis, C.S., Patterson, D.M., Winkler, J., Conrad, D.N., Hein, M.Y., Srivastava, V., Hu, J.L., Murrow, L.M., Weissman, J.S., Werb, Z., et al. (2019). MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods 16, 619–626.

Preparation

Download the pre-processed feature count matrix files from the Gene Expression Omnibus (accession GSE129578).

$ wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129578/suppl/GSE129578_processed_data_files.csv.tar.gz

$ tar zxvf GSE129578_processed_data_files.csv.tar.gz

$ ls -1

GSE129578_processed_data_files.csv.tar.gz
HMEC_orig_MULTI_matrix.csv
HMEC_techrep_MULTI_matrix.csv
PDX_MULTI_matrix.csv
POC_MULTI_matrix.csv
POC_nuc_MULTI_matrix.csv

15k human embryonic kidney 293 cells (HEK293)

Pre-processing

Inspect the feature count matrix.

$ head POC_MULTI_matrix.csv

CellID,LaneID,Bar1,Bar2,Bar3,nUMI_Bar
AAACCTGAGAAGGTTT-1,LMO,1,35,0,36
AAACCTGAGATGTTAG-1,LMO,3,1,4,8
AAACCTGAGGAATGGA-1,LMO,257,0,0,257
AAACCTGAGGGATCTG-1,LMO,2,0,1,3
AAACCTGAGGTAAACT-1,LMO,98,0,0,98
AAACCTGAGTGTTAGA-1,LMO,87,0,0,87
AAACCTGCACACAGAG-1,LMO,114,1,2,117
AAACCTGCACAGCCCA-1,LMO,1,1,25,27
AAACCTGCATCCAACA-1,LMO,4,8,0,12

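Pre-process the feature count matrix: transpose it so that rows are features and columns are cells, drop the LaneID and nUMI_Bar entries, and write the result to a gzip-compressed CSV file.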

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: m = pd.read_csv(filepath_or_buffer="POC_MULTI_matrix.csv",
                        index_col=0)

In [4]: m.iloc[1:5, 1:5]
Out[4]:
                    Bar1  Bar2  Bar3  nUMI_Bar
CellID
AAACCTGAGATGTTAG-1     3     1     4         8
AAACCTGAGGAATGGA-1   257     0     0       257
AAACCTGAGGGATCTG-1     2     0     1         3
AAACCTGAGGTAAACT-1    98     0     0        98

In [5]: m = m.T.drop(["LaneID", "nUMI_Bar"])

In [6]: m.to_csv(path_or_buf="matrix_featurecount_POC_MULTI.csv.gz",
                 compression="infer")

In [7]: m.sum(axis=0)
Out[7]:
CellID
AAACCTGAGAAGGTTT-1     36
AAACCTGAGATGTTAG-1      8
AAACCTGAGGAATGGA-1    257
AAACCTGAGGGATCTG-1      3
AAACCTGAGGTAAACT-1     98
                    ...
TTTGTCACAAGCCTAT-3      0
TTTGTCAGTATAGTAG-3      0
TTTGTCAGTCTGATCA-3      0
TTTGTCAGTGCGCTTG-3      0
TTTGTCAGTGGTCCGT-3      0
Length: 15482, dtype: object

In [8]: np.median(m.sum(axis=0))
Out[8]: 20.0
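The dtype: object in the output above arises because the matrix is transposed while the text column LaneID is still present. The following is a minimal sketch of an alternative order of operations, dropping the metadata columns before transposing so the counts stay integers. The helper name to_feature_by_cell is hypothetical and not part of fba, and the inline CSV only reuses the first three rows shown earlier.

```python
import io

import pandas as pd

# First rows of POC_MULTI_matrix.csv, copied from the listing above.
CSV_HEAD = """CellID,LaneID,Bar1,Bar2,Bar3,nUMI_Bar
AAACCTGAGAAGGTTT-1,LMO,1,35,0,36
AAACCTGAGATGTTAG-1,LMO,3,1,4,8
AAACCTGAGGAATGGA-1,LMO,257,0,0,257
"""


def to_feature_by_cell(source):
    """Return a features x cells count matrix (hypothetical helper)."""
    table = pd.read_csv(source, index_col=0)
    # Drop the metadata columns first so only integer count columns remain.
    counts = table.drop(columns=["LaneID", "nUMI_Bar"])
    # Transpose: rows become features (barcodes), columns become cells.
    return counts.T


matrix = to_feature_by_cell(io.StringIO(CSV_HEAD))
umis_per_cell = matrix.sum(axis=0)
print(matrix.dtypes.unique())
print(umis_per_cell.median())
```

For the real file, pass the path POC_MULTI_matrix.csv instead of the in-memory buffer. The per-cell sums should then match the nUMI_Bar column of the original table.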

Demultiplexing

Cells are demultiplexed based on feature abundance. Demultiplexing method 4 (-dm 4) follows the approach described in McGinnis et al. (2019), with some modifications. The resulting cell identity matrix is written to the output directory; in it, 0 marks a cell as negative for a feature and 1 marks it as positive. Set -v to also generate visualization plots.

$ fba demultiplex -i matrix_featurecount_POC_MULTI.csv.gz -dm 4 -v

2021-12-20 14:54:45,248 - fba.__main__ - INFO - fba version: 0.0.x
2021-12-20 14:54:45,248 - fba.__main__ - INFO - Initiating logging ...
2021-12-20 14:54:45,248 - fba.__main__ - INFO - Python version: 3.9
2021-12-20 14:54:45,249 - fba.__main__ - INFO - Using demultiplex subcommand ...
2021-12-20 14:54:47,474 - fba.__main__ - INFO - Skipping arguments: "-q/--quantile", "-cm/--clustering_method", "-p/--prob"
2021-12-20 14:54:47,474 - fba.demultiplex - INFO - Output directory: demultiplexed
2021-12-20 14:54:47,474 - fba.demultiplex - INFO - Demultiplexing method: 4
2021-12-20 14:54:47,474 - fba.demultiplex - INFO - UMI normalization method: clr
2021-12-20 14:54:47,474 - fba.demultiplex - INFO - Visualization: On
2021-12-20 14:54:47,474 - fba.demultiplex - INFO - Visualization method: tsne
2021-12-20 14:54:47,474 - fba.demultiplex - INFO - Loading feature count matrix: matrix_featurecount_POC_MULTI.csv.gz ...
2021-12-20 14:54:48,677 - fba.demultiplex - INFO - Number of cells: 15,482
2021-12-20 14:54:48,677 - fba.demultiplex - INFO - Number of positive cells for a feature to be included: 200
2021-12-20 14:54:48,701 - fba.demultiplex - INFO - Number of features: 3 / 3 (after filtering / original in the matrix)
2021-12-20 14:54:48,701 - fba.demultiplex - INFO - Features: Bar1 Bar2 Bar3
2021-12-20 14:54:48,701 - fba.demultiplex - INFO - Total UMIs: 705,913 / 705,913
2021-12-20 14:54:48,713 - fba.demultiplex - INFO - Median number of UMIs per cell: 20.0 / 20.0
2021-12-20 14:54:48,713 - fba.demultiplex - INFO - Demultiplexing ...
2021-12-20 14:54:52,347 - fba.demultiplex - INFO - Generating heatmap ...
2021-12-20 14:54:54,168 - fba.demultiplex - INFO - Embedding ...
2021-12-20 14:55:12,277 - fba.__main__ - INFO - Done.
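The log above reports clr as the UMI normalization method. As an illustration only, here is a minimal sketch of a centered log-ratio transform applied within each cell. The pseudocount of 1 and the choice to center across features per cell are assumptions; the source does not state how fba computes it, so this is not a description of fba's implementation.

```python
import numpy as np


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of each column (cell).

    Assumed form: log(count + pseudocount) minus the per-cell mean of
    those log values, i.e. the log of the per-cell geometric mean.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=0, keepdims=True)


# Three features (Bar1..Bar3) x three cells, from the first rows shown earlier.
counts = np.array([[1, 3, 257],
                   [35, 1, 0],
                   [0, 4, 0]])
normalized = clr(counts)
print(normalized.round(2))
```

After this transform the values in each column sum to zero, so a cell's dominant barcode stands out relative to that cell's own background rather than its raw sequencing depth.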

Heatmap of the relative abundance of features across all cells. Each column represents a single cell.

Preview the demultiplexing result: the number of singlets per barcode, followed by the numbers of multiplets and negative cells.

In [1]: import pandas as pd

In [2]: m = pd.read_csv("demultiplexed/matrix_cell_identity.csv.gz", index_col=0)

In [3]: m.loc[:, m.sum(axis=0) == 1].sum(axis=1)
Out[3]:
Bar1    5909
Bar2    2016
Bar3    2083
dtype: int64

In [4]: sum(m.sum(axis=0) > 1)
Out[4]: 875

In [5]: sum(m.sum(axis=0) == 0)
Out[5]: 4599
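The three queries above can be wrapped into one summary. The sketch below assumes the cell identity matrix has features as rows and cells as columns with 0/1 entries, as in the session above. The helper name summarize_identity is hypothetical, and the toy matrix contains made-up values for illustration.

```python
import pandas as pd


def summarize_identity(identity):
    """Count singlets per feature, multiplets, and negatives.

    Expects a features x cells matrix of 0/1 calls (hypothetical helper).
    """
    positives_per_cell = identity.sum(axis=0)
    # Singlets: cells positive for exactly one feature, tallied per feature.
    singlets = identity.loc[:, positives_per_cell == 1].sum(axis=1)
    return {
        "singlets": singlets.to_dict(),
        "multiplets": int((positives_per_cell > 1).sum()),
        "negatives": int((positives_per_cell == 0).sum()),
    }


# Toy matrix: 3 features x 5 cells (made-up values).
toy = pd.DataFrame(
    [[1, 0, 1, 0, 0],
     [0, 1, 1, 0, 0],
     [0, 0, 0, 0, 1]],
    index=["Bar1", "Bar2", "Bar3"],
    columns=["c1", "c2", "c3", "c4", "c5"],
)
summary = summarize_identity(toy)
print(summary)
```

The three categories partition the cells, which gives a quick consistency check. For the result above, 5909 + 2016 + 2083 + 875 + 4599 = 15,482, matching the number of cells reported in the log.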

40k primary human mammary epithelial cells (HMECs)