Last updated: 2023-08-30
Checks: 6 1
Knit directory: lab-notes/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(1)
was run prior to running the
code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ba8cc1d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: analysis/doc_prefix.html
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/FRperturb.Rmd
) and HTML
(docs/FRperturb.html
) files. If you’ve configured a remote
Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | d9c3393 | 1onic | 2023-08-28 | Build site. |
html | d4afa3b | 1onic | 2023-07-12 | Build site. |
html | 289324b | 1onic | 2023-07-12 | Build site. |
html | be52451 | 1onic | 2023-07-12 | Build site. |
html | a73ac21 | 1onic | 2023-06-13 | Build site. |
html | 41874f0 | 1onic | 2023-06-13 | Build site. |
html | 75989aa | 1onic | 2023-06-13 | Build site. |
html | 14b2f73 | 1onic | 2023-06-13 | Build site. |
html | 6f3f6a4 | 1onic | 2023-05-17 | Build site. |
html | 5192404 | 1onic | 2023-05-17 | Build site. |
html | 93a8936 | 1onic | 2023-05-10 | Build site. |
html | 454b232 | 1onic | 2023-05-10 | Build site. |
html | 6a1bef0 | 1onic | 2023-05-10 | Build site. |
html | b8fbbd1 | 1onic | 2023-05-10 | Build site. |
html | e5648db | 1onic | 2023-05-10 | Build site. |
html | c0c4bf0 | 1onic | 2023-04-26 | Build site. |
html | 9fbd5e6 | 1onic | 2023-04-26 | Build site. |
html | 1478db1 | 1onic | 2023-04-19 | Build site. |
html | e1b57ff | 1onic | 2023-03-29 | Build site. |
html | 7288d9c | 1onic | 2023-03-29 | Build site. |
html | 3aef7a8 | 1onic | 2023-03-29 | Build site. |
html | 3f796b6 | 1onic | 2023-03-22 | Build site. |
html | cdae3ea | 1onic | 2023-03-10 | Build site. |
html | 99420b0 | 1onic | 2023-03-10 | Build site. |
html | 0f70ffe | 1onic | 2023-03-10 | Build site. |
html | e4922c1 | 1onic | 2023-03-02 | Build site. |
html | 2d991e9 | 1onic | 2023-03-02 | Build site. |
html | c350c44 | 1onic | 2023-02-06 | Build site. |
html | f12e92e | 1onic | 2023-02-06 | Build site. |
html | 0b2cc20 | 1onic | 2023-02-06 | Build site. |
html | fdd0022 | 1onic | 2023-02-06 | Build site. |
html | 21d42f1 | 1onic | 2023-01-27 | Build site. |
html | e47cac4 | 1onic | 2023-01-27 | Build site. |
Rmd | cd608b0 | 1onic | 2023-01-27 | update |
FR perturb is an alternative method for analyzing signal in
single-cell pertubation datasets. It features a ‘compressing sensing’
inspired approach, whereby we take advantage of the fact that single
cell data is sparse. So FR-perturb can leverage sparse promoting
algorithyms to accurately LEARN perturbations from
random combinations of pertubations (random composite samples) in the
pertubation matrix dataset rather than directly from pertubation data.
The main advantage of this is that we are not taking the gRNA data and
assaying it individually, instead we are infering pertubations using
LASSO then using those to measure the effect of each pertubation on gene
expression. More specifically a transformed gene expression matrix (Y)
is factorized using sparse PCA, to produces (2) LEFT/RIGHT factor
matrices. We then form a regression model with the LEFT factor matrix
~U: ~U =
XU. By applying LASSO to each column of matrix ~U, we can learn a column of matrix U. X
is the pertubation matrix X with each row normalized to sum 1. The
matrix Beta (estimate of Betas) can then be recovered by multiplying the
resulting learned U matrix by the right factor matrix W.
Processed data is availible on GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE221321
# Downloading from GEO directly to Midway can be tricky since the FTP server is not easy to reach
# its hard to get download the supplementary files without creating the link yourself
# copy the FTP site link from 'Series Matrix File(s)' OR any of the links that give you the FTP prefix:
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE221nnn/GSE221321/matrix/
# edit the URL to target the supplementary TAR file you want by adding: '/suppl/FILE.EXT'
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE221nnn/GSE221321/suppl/GSE221321_RAW.tar
# run wget
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE221nnn/GSE221321/suppl/GSE221321_RAW.tar
The environment currently on github is incorrectly/unusable at the moment. Use the following .yaml env file with conda:
name: frperturb
channels:
- conda-forge
- bioconda
- anaconda
- defaults
dependencies:
- anndata=0.8.0
- bottleneck
- certifi
- colorama
- cycler
- decorator
- h5py
- hdf5
- ipykernel
- ipython
- jpeg
- jupyter_client
- jupyter_core
- kiwisolver
- matplotlib-base
- matplotlib-inline
- numpy
- pandas
- parso
- patsy
- pexpect
- pickleshare
- pillow
- pip
- prompt-toolkit
- pygments
- pynndescent
- pyparsing
- python=3.8
- pytz
- pyzmq
- python-spams
- readline
- scanpy=1.9.1
- scikit-learn
- scipy
- seaborn
- seaborn-base
- six
- sqlite
- stack_data
- statsmodels
- tk
- tornado
- tqdm
- traitlets
- typing_extensions
- umap-learn
conda env create -f env.yaml
Lets first check the FR-pertub input that we downloaded earlier in a jupyter notebook.
# PYTHON
import numpy as np
import pandas as pd
import scipy
import spams
import scanpy
import time, sys, traceback, argparse
import os
import tqdm
import statsmodels.api as sma
import statsmodels.stats as sms
import functools
import anndata
from tqdm.contrib.concurrent import thread_map
# list groups in H5 object
!h5ls '/project/xuanyao/nikita/SCEPTRE/data/Yao_2023/GSM6858447_KO_conventional.h5ad'
# read in and check expression H5; check obj from above
input_h5ad = '/project/xuanyao/nikita/SCEPTRE/data/Yao_2023/GSM6858447_KO_conventional.h5ad'
dat = scanpy.read_h5ad(input_h5ad)
cell_names = dat.obs.index # cell barcodes
cov_mat = dat.obs # covariate matrix
cov_mat.columns.tolist() # list of availible covariates to use
features_table = dat.var # features table (GENE symbols)
# now read in the pertubation data:
input_perturbation_matrix = '/project/xuanyao/nikita/SCEPTRE/data/Yao_2023/GSM6858449_KD_conventional_perturbations.txt'
p_mat_pd = pd.read_csv(input_perturbation_matrix, index_col = 0, delim_whitespace=True)
guides_df = pd.DataFrame(list(p_mat_pd.index), columns=['guides'])
# explore guides for NTC and Safe targeting keywords
ntcs = [guide for guide in p_mat_pd.index if guide in ['non-targeting'] ]
stcs = [guide for guide in p_mat_pd.index if guide in ['safe-targeting'] ]
print(ntcs) # there is 1 NTC in the dataset: ['non-targeting']
print(stcs) # there is 1 STC in the dataset: ['safe-targeting']
Using the code provided on the github page for FRperturb: https://github.com/douglasyao/FR-Perturb
We need to provide parameters and input files like so:
source activate /home/nbabushkin/miniconda3/envs/frperturb
time python /home/nbabushkin/yao_perturb/FR-Perturb-main/run_FR_Perturb.py --input-h5ad /project/xuanyao/nikita/SCEPTRE/data/Yao_2023/GSM6858449_KD_conventional.h5ad \
--input-perturbation-matrix /project/xuanyao/nikita/SCEPTRE/data/Yao_2023/GSM6858449_KD_conventional_perturbations.txt \
--control-perturbation-name non-targeting \
--covariates Total_RNA_count,Total_unique_genes,10X_channel,Percent_mitochondrial_reads,Guides,Guides_collapsed_by_gene,Total_number_of_guides \
--compute-pval --fit-zero-pval --multithreaded --out KD_conventional --output-factor-matrices --num-perms 500 --guide-pooled
The current factorize and recovery anaylsis approach uses a
transformed expression matrix Y(and normalized to sum 1 X
pertubation matrix) in order to learn the Beta matrix. One idea I have
is whether or not it makes sense to use the non-targeting controls in
order to transform the expression matrix Y. Specifically, in practice
the Y expression matrix is first transformed by taking the log(TP10K +
1) of all gene expression counts and subtracting log(c) from each row of
Y (where log(c) represents the average log(TP10K + 1) of all genes in
cells containing only non-targeting control guides). I would be curious
to see the impact of avoiding using very sparse NTC data on the sparse
factorization of Y` and how that effects the Beta estimates.