Last updated: 2023-08-30

Checks: 6 passed, 1 failed

Knit directory: lab-notes/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(1) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Recording the operating system, R version, and package versions is critical for reproducibility. To record the session information, add sessioninfo: “sessionInfo()” to _workflowr.yml. Alternatively, you could use devtools::session_info() or sessioninfo::session_info(). Lastly, you can manually add a code chunk to this file to run any one of these commands and then disable the automatic insertion by changing the workflowr setting to sessioninfo: “”.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ba8cc1d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    analysis/doc_prefix.html

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/cellranger.Rmd) and HTML (docs/cellranger.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html d9c3393 1onic 2023-08-28 Build site.
html d4afa3b 1onic 2023-07-12 Build site.
html 289324b 1onic 2023-07-12 Build site.
html be52451 1onic 2023-07-12 Build site.
html a73ac21 1onic 2023-06-13 Build site.
html 41874f0 1onic 2023-06-13 Build site.
html 75989aa 1onic 2023-06-13 Build site.
html 14b2f73 1onic 2023-06-13 Build site.
html 6f3f6a4 1onic 2023-05-17 Build site.
html 5192404 1onic 2023-05-17 Build site.
html 93a8936 1onic 2023-05-10 Build site.
html 454b232 1onic 2023-05-10 Build site.
html 6a1bef0 1onic 2023-05-10 Build site.
html b8fbbd1 1onic 2023-05-10 Build site.
html e5648db 1onic 2023-05-10 Build site.
html c0c4bf0 1onic 2023-04-26 Build site.
html 9fbd5e6 1onic 2023-04-26 Build site.
html 1478db1 1onic 2023-04-19 Build site.
html e1b57ff 1onic 2023-03-29 Build site.
html 7288d9c 1onic 2023-03-29 Build site.
html 3aef7a8 1onic 2023-03-29 Build site.
html 3f796b6 1onic 2023-03-22 Build site.
html cdae3ea 1onic 2023-03-10 Build site.
html 99420b0 1onic 2023-03-10 Build site.
html 0f70ffe 1onic 2023-03-10 Build site.
html e4922c1 1onic 2023-03-02 Build site.
html 2d991e9 1onic 2023-03-02 Build site.
html c350c44 1onic 2023-02-06 Build site.
html f12e92e 1onic 2023-02-06 Build site.
html 0b2cc20 1onic 2023-02-06 Build site.
html fdd0022 1onic 2023-02-06 Build site.
html 21d42f1 1onic 2023-01-27 Build site.
html e47cac4 1onic 2023-01-27 Build site.
Rmd cd608b0 1onic 2023-01-27 update
html bce6cff 1onic 2023-01-11 Build site.
html c51b055 1onic 2023-01-11 Build site.
html baf3f16 1onic 2023-01-11 Build site.
html f1c0cc6 1onic 2023-01-11 Build site.
Rmd 40c30e4 1onic 2023-01-11 update

Improving CellRanger Feature Assignment During Aggregation

One core issue I am currently trying to solve is aggregating the Replogle et al. 2022 filtered barcode matrices after counting and alignment. This step is necessary to combine the results from different batches. The problem lies in the complexity of the aggregation, the size of the data, and the inefficiency of the current CellRanger code.

Approach

To understand the problem I first focused on the existing code, the data structures involved, and Python's limitations. I also timed the current aggregation, essentially benchmarking it on the count data I currently have. For this I chose the Xie et al. dataset, because it is fairly large and therefore puts sufficient stress on the current code.

Details of the Problem

The CellRanger code currently available on the official website takes around 10 h to aggregate the Xie et al. count data. Of these 10 h, between 8 h 50 m and 9 h 10 m are spent exclusively on a single step of the aggregation: the calling of the protospacers. This narrows the problem I am trying to solve, since everything except the protospacer step can be done in roughly 1 h, making that step the clear bottleneck in using CellRanger for CRISPR data.
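To locate this kind of bottleneck, each pipeline stage can be wrapped with a small timing helper that emits records in the same DD:HH:MM:SS "Finished … / Datetime … / Took …" style as the run log shown further down. This is an illustrative sketch of the instrumentation, not CellRanger's actual code:

```python
import time
from datetime import datetime


def fmt_dhms(seconds: float) -> str:
    """Format an elapsed time as DD:HH:MM:SS, matching the run log style."""
    d, rem = divmod(int(seconds), 86400)
    h, rem = divmod(rem, 3600)
    m, s = divmod(rem, 60)
    return f"{d:02d}:{h:02d}:{m:02d}:{s:02d}"


def log_step(label: str, start: float) -> float:
    """Print a 'Finished <label>' record and return a fresh start time."""
    now = time.monotonic()
    print(f"Finished {label}")
    print(f"Datetime: {datetime.now()}")
    print(f"Took: {fmt_dhms(now - start)}")
    return time.monotonic()
```

Each stage then becomes a call like t = log_step("guide_assigner.assignments", t) after the work completes, so the slow step stands out immediately.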

for cell in self.features_per_cell:
    calls = self.features_per_cell.get(cell)
    num_features = len(calls)
    features = sep.join(calls)

    # These three statements are a fast way of populating the UMIs
    cell_bc_index = self.matrix.bc_to_int(cell)
    bc_data = np.ravel(
        self.matrix.m.getcol(cell_bc_index).todense()
    )  # get data for a single bc, densify it, then flatten it
    umis = sep.join(
        str(bc_data[self.matrix.feature_id_to_int(f_id)]).encode() for f_id in calls
    )

    features_per_cell_table.loc[cell] = (num_features, features, umis)
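Part of the cost here is the per-cell getcol(...).todense() round trip through Python. A cheaper pattern, assuming the underlying matrix is a SciPy sparse matrix, is to slice all needed barcode columns in one operation and densify once. This is a sketch of the idea, not the shipped code:

```python
import numpy as np
from scipy import sparse


def dense_columns(m, col_indices):
    """Return the selected columns of a sparse matrix as one dense
    (features x cells) array, instead of densifying column by column."""
    # CSC storage makes column slicing cheap; one conversion, one densify.
    return np.asarray(m.tocsc()[:, col_indices].todense())
```

The per-cell lookup then becomes plain NumPy indexing into one array, which moves the loop from Python into compiled code.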

There is also this:

for feature_id in feature_ids_this_type:
    if not isinstance(feature_id, bytes):
        raise ValueError(
            f"Feature ID {feature_id} must be bytes but was {type(feature_id)}"
        )
    umi_counts = self.matrix.get_subselected_counts(
        log_transform=False, library_type=self.feature_type, list_feature_ids=[feature_id]
    )

    in_high_umi_component = GuideAssigner._call_presence(umi_counts, self.method)
    assignments[feature_id] = FeatureAssignments(
        np.flatnonzero(np.array(in_high_umi_component)), sum(umi_counts), False, None
    )
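The per-guide presence call here separates background cells from a high-UMI population; per the GMM method noted in the comments of the proposed code further down, it can be sketched as a plain two-component EM fit on log-transformed counts. The function name and details below are illustrative, not CellRanger's implementation:

```python
import numpy as np


def call_presence_gmm(umi_counts, n_iter=50):
    """Call cells 'present' for a guide by fitting a two-component 1-D
    Gaussian mixture (plain EM) on log1p(UMI counts) and assigning each
    cell to the higher-mean component. Illustrative sketch only."""
    x = np.log1p(np.asarray(umi_counts, dtype=float))
    # initialize the two components at the extremes of the data
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var() + 1e-6, x.var() + 1e-6])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each cell
        lik = pi / np.sqrt(2 * np.pi * var) * np.exp(
            -((x[:, None] - mu) ** 2) / (2 * var)
        )
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    # cells assigned to the higher-mean component are in the high-UMI group
    return resp.argmax(axis=1) == int(np.argmax(mu))
```

Crucially, each guide's fit is independent of every other guide's, which is what makes the second loop embarrassingly parallel.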

From this we see that there are currently two major loops, for feature_id in feature_ids_this_type: and for cell in self.features_per_cell:, both of which iterate over the very large feature or cell axis of our data.

From this I proposed the following two solutions:

from itertools import repeat
from multiprocessing import Pool

import numpy as np
import pandas as pd

cell_list = list(self.features_per_cell.keys())  # defaultdict[bytes, list[bytes]] where key=cell and val=list of assigned feature IDs (bytes)
# list of calls for all cells; gives us features and num_features = len(calls)
calls_list = [self.features_per_cell.get(cell) for cell in cell_list]
# to populate the UMIs we need bc_data plus a lookup table mapping feature ID -> index (int);
# getting the bc data in turn requires mapping each barcode ID -> index (int)
cell_bc_index_list = [self.matrix.bc_to_int(cell) for cell in cell_list]
# list of scipy sparse (m x 1) column vectors from scipy.sparse.csr_matrix.getcol();
# densification (.todense()) is deferred to the workers
bc_data = [self.matrix.m.getcol(cell_bc_index) for cell_bc_index in cell_bc_index_list]
# finally, the map from feature IDs to row indices
feature_ids_map = self.matrix.feature_ids_map
        
res = None
with Pool(processes=NUM_PROC) as pool:
    res = pool.starmap(
        calculate_get_features_per_cell_table,
        zip(calls_list, bc_data, repeat(feature_ids_map)),
    )

if res is None:
    raise RuntimeError("pool.starmap should not return None here; this stage is in an error state")

columns = ["num_features", FEATURE_CALL_COL, "num_umis"]
features_per_cell_table = pd.DataFrame.from_records(res, columns=columns, index=cell_list)
feature_indx_list = [self.matrix.feature_id_to_int(fid) for fid in feature_ids_this_type]
umi_counts_list = [self.matrix.m.getrow(fidx) for fidx in feature_indx_list]

res = None
with Pool(processes=NUM_PROC) as pool:
    res = pool.map(calculate_feature_assignments, umi_counts_list)
    # calculate_feature_assignments takes an ndarray input and calls presence for that guide using a GMM
    
if res is None:
    raise RuntimeError("pool.map should not return None here; this stage is in an error state")

# Pool.map returns ordered results
for feature_id, assignment in zip(feature_ids_this_type, res):  # res is a list of FeatureAssignments
    assignments[feature_id] = assignment
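For completeness, here is a hypothetical body for calculate_get_features_per_cell_table, consistent with the sequential per-cell loop earlier in this note. The separator constant and the exact signature are assumptions, since the real helper isn't shown:

```python
import numpy as np

SEP = b"|"  # assumed separator; the real pipeline defines its own


def calculate_get_features_per_cell_table(calls, bc_col, feature_ids_map):
    """Build one (num_features, features, umis) record for a single cell.

    calls           -- list[bytes] of feature IDs assigned to this cell
    bc_col          -- sparse (m x 1) column for this cell's barcode
    feature_ids_map -- dict mapping feature ID (bytes) to row index (int)
    """
    num_features = len(calls)
    features = SEP.join(calls)
    bc_data = np.ravel(bc_col.todense())  # densify inside the worker
    umis = SEP.join(
        str(bc_data[feature_ids_map[f_id]]).encode() for f_id in calls
    )
    return (num_features, features, umis)
```

Because only pre-extracted sparse columns and a plain dict cross the process boundary, the workers never touch self and pickling stays cheap.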

This was then re-run with the following log-level results:

Finished loading protospacer_call_metrics from JSON
Datetime: 2023-01-10 22:28:19.998053
Took: 00:00:00:00
Finished loading+select filtered_feature_counts_matrix
Datetime: 2023-01-10 22:28:31.198950
Took: 00:00:00:11
Finished guide_assigner.assignments
Datetime: 2023-01-11 00:11:55.885282
Took: 00:01:43:24
Finished guide_assigner.features_per_cell_table
Datetime: 2023-01-11 00:12:10.069009
Took: 00:00:00:14
Finished guide_assigner.assignment_metadata from JSON
Datetime: 2023-01-11 02:29:32.747213
Took: 00:02:17:22
Finished guide_assigner.feature_calls_summary
Datetime: 2023-01-11 02:39:15.609817
Took: 00:00:09:42
Finished guide_assigner.cells_per_feature
Datetime: 2023-01-11 02:39:15.804628
Took: 00:00:00:00
Finished protospacer_call_metrics.update
Datetime: 2023-01-11 02:39:15.804763
Took: 00:00:00:00
Finished sanitize+write CSV
Datetime: 2023-01-11 02:39:16.798977
Took: 00:00:00:00
Finished feature_utils.write_json_from_dict cells_per_feature
Datetime: 2023-01-11 02:39:18.122902
Took: 00:00:00:01
Finished feature_utils.write_json_from_dict protospacer_call_metrics
Datetime: 2023-01-11 02:39:18.127919
Took: 00:00:00:00

From this we see notable improvements and a total running time of ~4 h. This version is also safer on memory than a previous solution I had, which was faster (~)
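On the memory point: peak usage in the version above comes from materializing bc_data for every cell before the Pool call. One way to cap it, sketched here with a toy worker rather than the real one, is to stream chunks of inputs through the pool so only one chunk of inputs and outputs is resident at a time:

```python
from itertools import islice
from multiprocessing import Pool


def _square(x):  # stand-in for the real per-cell worker
    return x * x


def chunked(iterable, size):
    """Yield successive lists of `size` items from any iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def run_bounded(items, num_proc=4, chunk_size=10_000):
    """Process items chunk by chunk: memory is bounded by one chunk of
    inputs plus the accumulated results, not the full input list."""
    results = []
    with Pool(processes=num_proc) as pool:
        for chunk in chunked(items, chunk_size):
            results.extend(pool.map(_square, chunk))
    return results
```

Since pool.map preserves order within each chunk and chunks are processed in order, the results line up with the inputs exactly as in the all-at-once version.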