Last updated: 2023-08-30

Checks: 6 passed, 1 failed

Knit directory: lab-notes/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(1) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Recording the operating system, R version, and package versions is critical for reproducibility. To record the session information, add sessioninfo: “sessionInfo()” to _workflowr.yml. Alternatively, you could use devtools::session_info() or sessioninfo::session_info(). Lastly, you can manually add a code chunk to this file to run any one of these commands and then disable the automatic insertion by changing the workflowr setting to sessioninfo: “”.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ba8cc1d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    analysis/doc_prefix.html

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/cellranger.Rmd) and HTML (docs/cellranger.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html d9c3393 1onic 2023-08-28 Build site.
html d4afa3b 1onic 2023-07-12 Build site.
html 289324b 1onic 2023-07-12 Build site.
html be52451 1onic 2023-07-12 Build site.
html a73ac21 1onic 2023-06-13 Build site.
html 41874f0 1onic 2023-06-13 Build site.
html 75989aa 1onic 2023-06-13 Build site.
html 14b2f73 1onic 2023-06-13 Build site.
html 6f3f6a4 1onic 2023-05-17 Build site.
html 5192404 1onic 2023-05-17 Build site.
html 93a8936 1onic 2023-05-10 Build site.
html 454b232 1onic 2023-05-10 Build site.
html 6a1bef0 1onic 2023-05-10 Build site.
html b8fbbd1 1onic 2023-05-10 Build site.
html e5648db 1onic 2023-05-10 Build site.
html c0c4bf0 1onic 2023-04-26 Build site.
html 9fbd5e6 1onic 2023-04-26 Build site.
html 1478db1 1onic 2023-04-19 Build site.
html e1b57ff 1onic 2023-03-29 Build site.
html 7288d9c 1onic 2023-03-29 Build site.
html 3aef7a8 1onic 2023-03-29 Build site.
html 3f796b6 1onic 2023-03-22 Build site.
html cdae3ea 1onic 2023-03-10 Build site.
html 99420b0 1onic 2023-03-10 Build site.
html 0f70ffe 1onic 2023-03-10 Build site.
html e4922c1 1onic 2023-03-02 Build site.
html 2d991e9 1onic 2023-03-02 Build site.
html c350c44 1onic 2023-02-06 Build site.
html f12e92e 1onic 2023-02-06 Build site.
html 0b2cc20 1onic 2023-02-06 Build site.
html fdd0022 1onic 2023-02-06 Build site.
html 21d42f1 1onic 2023-01-27 Build site.
html e47cac4 1onic 2023-01-27 Build site.
Rmd cd608b0 1onic 2023-01-27 update
html bce6cff 1onic 2023-01-11 Build site.
html c51b055 1onic 2023-01-11 Build site.
html baf3f16 1onic 2023-01-11 Build site.
html f1c0cc6 1onic 2023-01-11 Build site.
Rmd 40c30e4 1onic 2023-01-11 update

Improving CellRanger Feature Assignment During Aggregation

One core issue I am currently trying to solve is aggregating the Replogle et al. 2022 filtered barcode matrices after counting and alignment. This step is necessary to combine the results from different batches. The problem lies in the complexity of the aggregation, the size of the data, and the inefficiency of the current CellRanger code.

Approach

To understand the problem I first focused on the existing code, the data structures involved, and Python's limitations. I also timed the current aggregation, essentially benchmarking it on the count data I currently have. For this I chose the Xie et al. dataset, because it is fairly large and therefore puts sufficient stress on the current code.

Details of the Problem

The CellRanger code currently available on the official website takes around 10 h to aggregate the Xie et al. count data. Of these 10 h, between 8 h 50 m and 9 h 10 m are spent exclusively on a single step of the aggregation: the calling of the protospacers. This narrows the problem I am trying to solve, since everything except the protospacer step can be done in roughly 1 h, making that step the clear bottleneck in using CellRanger for CRISPR data.
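To locate this kind of bottleneck, each pipeline stage can be wrapped with a small timing helper that emits records in the same DD:HH:MM:SS "Finished … / Datetime … / Took …" style as the run log shown further down. This is an illustrative sketch of the instrumentation, not CellRanger's actual code:

```python
import time
from datetime import datetime


def fmt_dhms(seconds: float) -> str:
    """Format an elapsed time as DD:HH:MM:SS, matching the run log style."""
    d, rem = divmod(int(seconds), 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    return f"{d:02d}:{h:02d}:{m:02d}:{s:02d}"


def log_step(label: str, start: float) -> float:
    """Print a 'Finished <label>' record and return a fresh start time."""
    now = time.monotonic()
    print(f"Finished {label}")
    print(f"Datetime: {datetime.now()}")
    print(f"Took: {fmt_dhms(now - start)}")
    return time.monotonic()
```

Each stage then becomes a call like t = log_step("guide_assigner.assignments", t) after the work completes, so the slow step stands out immediately.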

for cell in self.features_per_cell:
    calls = self.features_per_cell.get(cell)
    num_features = len(calls)
    features = sep.join(calls)

    # These three statements are a fast way of populating the UMIs
    cell_bc_index = self.matrix.bc_to_int(cell)
    bc_data = np.ravel(
        self.matrix.m.getcol(cell_bc_index).todense()
    )  # get data for a single bc, densify it, then flatten it
    umis = sep.join(
        str(bc_data[self.matrix.feature_id_to_int(f_id)]).encode() for f_id in calls
    )

    features_per_cell_table.loc[cell] = (num_features, features, umis)
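Part of the cost here is the per-cell getcol(...).todense() round trip through Python. A cheaper pattern, assuming the underlying matrix is a SciPy sparse matrix, is to slice all needed barcode columns in one operation and densify once. This is a sketch of the idea, not the shipped code:

```python
import numpy as np
from scipy import sparse


def dense_columns(m, col_indices):
    """Return the selected columns of a sparse matrix as one dense
    (features x cells) array, instead of densifying column by column."""
    # CSC storage makes column slicing cheap; one conversion, one densify.
    return np.asarray(m.tocsc()[:, col_indices].todense())
```

The per-cell lookup then becomes plain NumPy indexing into one array, which moves the loop from Python into compiled code.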

There is also this:

for feature_id in feature_ids_this_type:
    if not isinstance(feature_id, bytes):
        raise ValueError(
            f"Feature ID {feature_id} must be bytes but was {type(feature_id)}"
        )
    umi_counts = self.matrix.get_subselected_counts(
        log_transform=False, library_type=self.feature_type, list_feature_ids=[feature_id]
    )

    in_high_umi_component = GuideAssigner._call_presence(umi_counts, self.method)
    assignments[feature_id] = FeatureAssignments(
        np.flatnonzero(np.array(in_high_umi_component)), sum(umi_counts), False, None
    )
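The per-guide presence call here separates background cells from a high-UMI population; per the GMM method noted in the comments of the proposed code further down, it can be sketched as a plain two-component EM fit on log-transformed counts. The function name and details below are illustrative, not CellRanger's implementation:

```python
import numpy as np


def call_presence_gmm(umi_counts, n_iter=50):
    """Call cells 'present' for a guide by fitting a two-component 1-D
    Gaussian mixture (plain EM) on log1p(UMI counts) and assigning each
    cell to the higher-mean component. Illustrative sketch only."""
    x = np.log1p(np.asarray(umi_counts, dtype=float))
    # initialize the two components at the extremes of the data
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var() + 1e-6, x.var() + 1e-6])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each cell
        lik = pi / np.sqrt(2 * np.pi * var) * np.exp(
            -((x[:, None] - mu) ** 2) / (2 * var)
        )
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    # cells assigned to the higher-mean component are in the high-UMI group
    return resp.argmax(axis=1) == int(np.argmax(mu))
```

Crucially, each guide's fit is independent of every other guide's, which is what makes the second loop embarrassingly parallel.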

From this we see that there are currently two major loops, for feature_id in feature_ids_this_type: and for cell in self.features_per_cell:, both of which iterate over the very large feature or cell axis of our data.

From this I proposed the following two solutions:

from itertools import repeat
from multiprocessing import Pool

import numpy as np
import pandas as pd

cell_list = list(self.features_per_cell.keys())  # defaultdict[bytes, list[bytes]] where key=cell and val=list of assigned feature IDs (bytes)
# list of calls for all cells; gives us features and num_features = len(calls)
calls_list = [self.features_per_cell.get(cell) for cell in cell_list]
# to populate the UMIs we need bc_data plus a lookup table mapping feature ID -> index (int);
# getting the bc data in turn requires mapping each barcode ID -> index (int)
cell_bc_index_list = [self.matrix.bc_to_int(cell) for cell in cell_list]
# list of scipy sparse (m x 1) column vectors from scipy.sparse.csr_matrix.getcol();
# densification (.todense()) is deferred to the workers
bc_data = [self.matrix.m.getcol(cell_bc_index) for cell_bc_index in cell_bc_index_list]
# finally, the map from feature IDs to row indices
feature_ids_map = self.matrix.feature_ids_map
        
res = None
with Pool(processes=NUM_PROC) as pool:
    res = pool.starmap(
        calculate_get_features_per_cell_table,
        zip(calls_list, bc_data, repeat(feature_ids_map)),
    )

if res is None:
    raise RuntimeError("pool.starmap should not return None here; this stage is in an error state")

columns = ["num_features", FEATURE_CALL_COL, "num_umis"]
features_per_cell_table = pd.DataFrame.from_records(res, columns=columns, index=cell_list)
feature_indx_list = [self.matrix.feature_id_to_int(fid) for fid in feature_ids_this_type]
umi_counts_list = [self.matrix.m.getrow(fidx) for fidx in feature_indx_list]

res = None
with Pool(processes=NUM_PROC) as pool:
    res = pool.map(calculate_feature_assignments, umi_counts_list)
    # calculate_feature_assignments takes an ndarray input and calls presence for that guide using a GMM
    
if res is None:
    raise RuntimeError("pool.map should not return None here; this stage is in an error state")

# Pool.map returns ordered results
for feature_id, assignment in zip(feature_ids_this_type, res):  # res is a list of FeatureAssignments
    assignments[feature_id] = assignment
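For completeness, here is a hypothetical body for calculate_get_features_per_cell_table, consistent with the sequential per-cell loop earlier in this note. The separator constant and the exact signature are assumptions, since the real helper isn't shown:

```python
import numpy as np

SEP = b"|"  # assumed separator; the real pipeline defines its own


def calculate_get_features_per_cell_table(calls, bc_col, feature_ids_map):
    """Build one (num_features, features, umis) record for a single cell.

    calls           -- list[bytes] of feature IDs assigned to this cell
    bc_col          -- sparse (m x 1) column for this cell's barcode
    feature_ids_map -- dict mapping feature ID (bytes) to row index (int)
    """
    num_features = len(calls)
    features = SEP.join(calls)
    bc_data = np.ravel(bc_col.todense())  # densify inside the worker
    umis = SEP.join(
        str(bc_data[feature_ids_map[f_id]]).encode() for f_id in calls
    )
    return (num_features, features, umis)
```

Because only pre-extracted sparse columns and a plain dict cross the process boundary, the workers never touch self and pickling stays cheap.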

This was then re-run with the following log-level results:

Finished loading protospacer_call_metrics from JSON
Datetime: 2023-01-10 22:28:19.998053
Took: 00:00:00:00
Finished loading+select filtered_feature_counts_matrix
Datetime: 2023-01-10 22:28:31.198950
Took: 00:00:00:11
Finished guide_assigner.assignments
Datetime: 2023-01-11 00:11:55.885282
Took: 00:01:43:24
Finished guide_assigner.features_per_cell_table
Datetime: 2023-01-11 00:12:10.069009
Took: 00:00:00:14
Finished guide_assigner.assignment_metadata from JSON
Datetime: 2023-01-11 02:29:32.747213
Took: 00:02:17:22
Finished guide_assigner.feature_calls_summary
Datetime: 2023-01-11 02:39:15.609817
Took: 00:00:09:42
Finished guide_assigner.cells_per_feature
Datetime: 2023-01-11 02:39:15.804628
Took: 00:00:00:00
Finished protospacer_call_metrics.update
Datetime: 2023-01-11 02:39:15.804763
Took: 00:00:00:00
Finished sanitize+write CSV
Datetime: 2023-01-11 02:39:16.798977
Took: 00:00:00:00
Finished feature_utils.write_json_from_dict cells_per_feature
Datetime: 2023-01-11 02:39:18.122902
Took: 00:00:00:01
Finished feature_utils.write_json_from_dict protospacer_call_metrics
Datetime: 2023-01-11 02:39:18.127919
Took: 00:00:00:00

From this we see notable improvements and a total running time of ~4 h. This version is also safer on memory than a previous solution I had, which was faster (~)
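On the memory point: peak usage in the version above comes from materializing bc_data for every cell before the Pool call. One way to cap it, sketched here with a toy worker rather than the real one, is to stream chunks of inputs through the pool so only one chunk of inputs and outputs is resident at a time:

```python
from itertools import islice
from multiprocessing import Pool


def _square(x):  # stand-in for the real per-cell worker
    return x * x


def chunked(iterable, size):
    """Yield successive lists of `size` items from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def run_bounded(items, num_proc=4, chunk_size=10_000):
    """Process items chunk by chunk: memory is bounded by one chunk of
    inputs plus the accumulated results, not the full input list."""
    results = []
    with Pool(processes=num_proc) as pool:
        for chunk in chunked(items, chunk_size):
            results.extend(pool.map(_square, chunk))
    return results
```

Since pool.map preserves order within each chunk and chunks are processed in order, the results line up with the inputs exactly as in the all-at-once version.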