Last updated: 2023-09-20
Checks: 6 1
Knit directory: lab-notes/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(1)
was run prior to running the
code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 2d1cbd9. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: analysis/doc_prefix.html
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/sceptre.Rmd
) and HTML
(docs/sceptre.html
) files. If you’ve configured a remote
Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 2d1cbd9 | 1onic | 2023-09-20 | update |
html | 5313f6e | 1onic | 2023-09-19 | Build site. |
html | 56c1b96 | 1onic | 2023-09-19 | Build site. |
Rmd | a44a325 | 1onic | 2023-09-19 | update |
html | 039565f | 1onic | 2023-09-19 | Build site. |
Rmd | 9de327d | 1onic | 2023-09-19 | update |
html | bde3a61 | 1onic | 2023-09-19 | Build site. |
Rmd | 639c954 | 1onic | 2023-09-19 | update |
html | b26555a | 1onic | 2023-09-18 | Build site. |
Rmd | 86b383c | 1onic | 2023-09-18 | update |
html | 6a65859 | 1onic | 2023-09-18 | Build site. |
Rmd | d99966c | 1onic | 2023-09-18 | update |
html | 6ce0511 | 1onic | 2023-09-18 | Build site. |
Rmd | 21b2081 | 1onic | 2023-09-18 | update |
html | 51764ad | 1onic | 2023-09-18 | Build site. |
Rmd | f72b8f5 | 1onic | 2023-09-18 | update |
html | 5df5417 | 1onic | 2023-09-11 | Build site. |
Rmd | 3fc87b6 | 1onic | 2023-09-11 | update |
html | 9286368 | 1onic | 2023-09-06 | Build site. |
Rmd | 93764d6 | 1onic | 2023-09-06 | update |
html | f0b8541 | 1onic | 2023-09-05 | Build site. |
html | b499a62 | 1onic | 2023-08-30 | Build site. |
html | f760d8b | 1onic | 2023-08-30 | Build site. |
html | 9cd8470 | 1onic | 2023-08-30 | Build site. |
Rmd | deb2992 | 1onic | 2023-08-30 | update |
html | d9c3393 | 1onic | 2023-08-28 | Build site. |
Rmd | e203ada | 1onic | 2023-08-28 | update |
html | d4afa3b | 1onic | 2023-07-12 | Build site. |
Rmd | 8a6dbfa | 1onic | 2023-07-12 | update |
html | 289324b | 1onic | 2023-07-12 | Build site. |
Rmd | 73fe9d7 | 1onic | 2023-07-12 | update |
html | be52451 | 1onic | 2023-07-12 | Build site. |
Rmd | 248a74d | 1onic | 2023-07-12 | update |
html | a73ac21 | 1onic | 2023-06-13 | Build site. |
Rmd | 27818c3 | 1onic | 2023-06-13 | update |
html | 41874f0 | 1onic | 2023-06-13 | Build site. |
Rmd | b66bbb5 | 1onic | 2023-06-13 | update |
html | 6f3f6a4 | 1onic | 2023-05-17 | Build site. |
Rmd | 135e88f | 1onic | 2023-05-17 | update |
html | 5192404 | 1onic | 2023-05-17 | Build site. |
Rmd | d4ecae8 | 1onic | 2023-05-17 | update |
html | 93a8936 | 1onic | 2023-05-10 | Build site. |
Rmd | e58e200 | 1onic | 2023-05-10 | update |
html | 454b232 | 1onic | 2023-05-10 | Build site. |
Rmd | cfedb2b | 1onic | 2023-05-10 | update |
html | 6a1bef0 | 1onic | 2023-05-10 | Build site. |
Rmd | 6e6664c | 1onic | 2023-05-10 | update |
html | b8fbbd1 | 1onic | 2023-05-10 | Build site. |
Rmd | 54b4de8 | 1onic | 2023-05-10 | update |
html | e5648db | 1onic | 2023-05-10 | Build site. |
Rmd | 273d6ef | 1onic | 2023-05-10 | update |
html | c0c4bf0 | 1onic | 2023-04-26 | Build site. |
Rmd | 2dba149 | 1onic | 2023-04-26 | update |
html | 9fbd5e6 | 1onic | 2023-04-26 | Build site. |
Rmd | 58bb0a1 | 1onic | 2023-04-26 | update |
html | 1478db1 | 1onic | 2023-04-19 | Build site. |
Rmd | 7345dbd | 1onic | 2023-04-19 | update |
html | 13ace72 | 1onic | 2023-03-29 | Build site. |
html | 9b3a894 | 1onic | 2023-03-29 | Build site. |
Rmd | 126bbf7 | 1onic | 2023-03-29 | update |
html | e1b57ff | 1onic | 2023-03-29 | Build site. |
Rmd | 5e606bd | 1onic | 2023-03-29 | update |
html | 7288d9c | 1onic | 2023-03-29 | Build site. |
Rmd | f81ed86 | 1onic | 2023-03-29 | update |
html | 3aef7a8 | 1onic | 2023-03-29 | Build site. |
Rmd | d9310d1 | 1onic | 2023-03-29 | update |
html | 3f796b6 | 1onic | 2023-03-22 | Build site. |
html | d9de43c | 1onic | 2023-03-22 | Build site. |
Rmd | d6f2632 | 1onic | 2023-03-22 | update |
html | 730a8bd | 1onic | 2023-03-22 | Build site. |
Rmd | 081f8b1 | 1onic | 2023-03-22 | update |
html | e7e33e4 | 1onic | 2023-03-22 | Build site. |
Rmd | 06566be | 1onic | 2023-03-22 | update |
html | a5a613b | 1onic | 2023-03-22 | Build site. |
Rmd | 290efcf | 1onic | 2023-03-22 | update |
html | cdae3ea | 1onic | 2023-03-10 | Build site. |
html | 99420b0 | 1onic | 2023-03-10 | Build site. |
Rmd | 0b86434 | 1onic | 2023-03-10 | update |
html | 0f70ffe | 1onic | 2023-03-10 | Build site. |
Rmd | 889c495 | 1onic | 2023-03-10 | update |
html | e4922c1 | 1onic | 2023-03-02 | Build site. |
Rmd | 93b9af8 | 1onic | 2023-03-02 | update |
html | 2d991e9 | 1onic | 2023-03-02 | Build site. |
Rmd | 06d4b8d | 1onic | 2023-03-02 | update |
html | 3971f7d | 1onic | 2023-02-15 | Build site. |
Rmd | 07b5e06 | 1onic | 2023-02-15 | update |
html | c60273d | 1onic | 2023-02-15 | Build site. |
Rmd | ba82814 | 1onic | 2023-02-15 | update |
html | 1d473f7 | 1onic | 2023-02-08 | Build site. |
Rmd | b9945b4 | 1onic | 2023-02-08 | update |
html | c350c44 | 1onic | 2023-02-06 | Build site. |
html | f12e92e | 1onic | 2023-02-06 | Build site. |
html | 0b2cc20 | 1onic | 2023-02-06 | Build site. |
html | fdd0022 | 1onic | 2023-02-06 | Build site. |
Rmd | e24db96 | 1onic | 2023-02-06 | update |
html | 21d42f1 | 1onic | 2023-01-27 | Build site. |
html | e47cac4 | 1onic | 2023-01-27 | Build site. |
Rmd | cd608b0 | 1onic | 2023-01-27 | update |
html | bce6cff | 1onic | 2023-01-11 | Build site. |
html | c51b055 | 1onic | 2023-01-11 | Build site. |
html | baf3f16 | 1onic | 2023-01-11 | Build site. |
html | f1c0cc6 | 1onic | 2023-01-11 | Build site. |
Rmd | 40c30e4 | 1onic | 2023-01-11 | update |
html | 56f1718 | 1onic | 2022-12-14 | Build site. |
html | e99aeb1 | 1onic | 2022-12-14 | Build site. |
Rmd | a7161f1 | 1onic | 2022-12-14 | update |
html | 4913d20 | 1onic | 2022-12-14 | Build site. |
Rmd | ad1f078 | 1onic | 2022-12-14 | update |
html | 9fde784 | 1onic | 2022-12-09 | Build site. |
Rmd | 0a8dde0 | 1onic | 2022-12-09 | update |
html | 976ccd0 | 1onic | 2022-12-09 | Build site. |
Rmd | fc4e55c | 1onic | 2022-12-09 | update |
html | b40a0db | 1onic | 2022-12-09 | Build site. |
Rmd | 716574f | 1onic | 2022-12-09 | fix plots |
html | 0b07039 | 1onic | 2022-12-09 | Build site. |
Rmd | e4c68ad | 1onic | 2022-12-09 | update |
html | bef0821 | 1onic | 2022-12-09 | Build site. |
Rmd | 47f4485 | 1onic | 2022-12-09 | update |
html | cbbb209 | 1onic | 2022-12-09 | Build site. |
Rmd | 82103da | 1onic | 2022-12-09 | update |
html | fec1e8a | 1onic | 2022-12-09 | Build site. |
html | 2be3983 | 1onic | 2022-12-09 | Build site. |
Rmd | 80f2ffc | 1onic | 2022-12-09 | update |
html | 6885f3d | 1onic | 2022-10-05 | Build site. |
Rmd | 39105a4 | 1onic | 2022-10-05 | update |
html | d3d078e | 1onic | 2022-10-05 | Build site. |
html | 519a799 | 1onic | 2022-10-05 | Build site. |
Rmd | d2f2b02 | 1onic | 2022-10-05 | update |
html | e131c98 | 1onic | 2022-10-04 | Build site. |
Rmd | 82a40ca | 1onic | 2022-10-04 | update |
html | e644b7c | 1onic | 2022-10-04 | Build site. |
Rmd | 7d0a2fe | 1onic | 2022-10-04 | update |
html | c95047e | 1onic | 2022-10-04 | Build site. |
Rmd | a382b31 | 1onic | 2022-10-04 | update |
html | 2387785 | 1onic | 2022-09-21 | Build site. |
Rmd | 8a86ec9 | 1onic | 2022-09-21 | update |
html | db3db83 | 1onic | 2022-09-07 | Build site. |
Rmd | 28cf9c9 | 1onic | 2022-09-07 | update |
html | 97513f7 | 1onic | 2022-08-30 | Build site. |
Rmd | df953d6 | 1onic | 2022-08-30 | update |
html | f7c986f | 1onic | 2022-08-30 | Build site. |
Rmd | d2013cc | 1onic | 2022-08-30 | update |
html | 05b39d9 | 1onic | 2022-08-30 | Build site. |
Rmd | 9047831 | 1onic | 2022-08-30 | update |
html | 4799da1 | 1onic | 2022-08-30 | Build site. |
Rmd | 9cc9271 | 1onic | 2022-08-30 | update |
html | 048a513 | 1onic | 2022-08-30 | Build site. |
Rmd | 62bdf7e | 1onic | 2022-08-30 | update |
html | 237f562 | 1onic | 2022-08-30 | Build site. |
Rmd | 726571f | 1onic | 2022-08-30 | update |
html | 69a4646 | 1onic | 2022-08-26 | Build site. |
Rmd | 39b8f0d | 1onic | 2022-08-26 | update |
html | 5aa7dc2 | 1onic | 2022-08-24 | Build site. |
Rmd | 1f58a0c | 1onic | 2022-08-24 | update |
html | 9cf39ab | 1onic | 2022-08-24 | Build site. |
Rmd | 2aea1e2 | 1onic | 2022-08-24 | update |
html | 00096ee | 1onic | 2022-08-24 | Build site. |
Rmd | 1c5014c | 1onic | 2022-08-24 | update |
html | 375d206 | 1onic | 2022-08-24 | Build site. |
Rmd | e620871 | 1onic | 2022-08-24 | update |
html | f52ad4d | 1onic | 2022-08-24 | Build site. |
Rmd | 30aceb3 | 1onic | 2022-08-24 | update |
html | 598570e | 1onic | 2022-08-24 | Build site. |
Rmd | 9a584c1 | 1onic | 2022-08-24 | update |
html | b4756c4 | 1onic | 2022-08-24 | Build site. |
Rmd | fbab5cb | 1onic | 2022-08-24 | update |
html | a55752e | 1onic | 2022-08-24 | Build site. |
Rmd | da54c09 | 1onic | 2022-08-24 | update |
html | c97e2d8 | 1onic | 2022-08-24 | Build site. |
Rmd | d5b953e | 1onic | 2022-08-24 | update |
html | cece443 | 1onic | 2022-08-10 | Build site. |
Rmd | f3579f6 | 1onic | 2022-08-10 | MORE PLOTS |
html | 9955d67 | 1onic | 2022-08-10 | Build site. |
Rmd | 59aa61b | 1onic | 2022-08-10 | update |
html | 617a397 | 1onic | 2022-08-10 | Build site. |
Rmd | 3a438e1 | 1onic | 2022-08-10 | update |
html | 5208fa3 | 1onic | 2022-08-10 | Build site. |
Rmd | 6976d77 | 1onic | 2022-08-10 | update |
html | 7a4197e | 1onic | 2022-08-10 | Build site. |
Rmd | 9c89a1b | 1onic | 2022-08-10 | update |
html | bd41509 | 1onic | 2022-08-10 | Build site. |
Rmd | 1465957 | 1onic | 2022-08-10 | update |
html | 32d7f10 | 1onic | 2022-08-04 | Build site. |
Rmd | 8a49a6f | 1onic | 2022-08-04 | update |
html | 5006f39 | 1onic | 2022-08-04 | Build site. |
Rmd | e56770a | 1onic | 2022-08-04 | update |
html | a6f860c | 1onic | 2022-08-04 | Build site. |
Rmd | fac1ac4 | 1onic | 2022-08-04 | update |
html | 57e862a | 1onic | 2022-08-04 | Build site. |
Rmd | 8c00df6 | 1onic | 2022-08-04 | update |
html | d9c1a09 | 1onic | 2022-08-04 | Build site. |
Rmd | c4cdc99 | 1onic | 2022-08-04 | update |
html | 8a38159 | 1onic | 2022-08-04 | Build site. |
Rmd | e2d078b | 1onic | 2022-08-04 | update |
html | 12331d2 | 1onic | 2022-08-04 | Build site. |
Rmd | 2504faa | 1onic | 2022-08-04 | update |
Rmd | f9e1827 | 1onic | 2022-08-04 | grna comparison |
html | c814105 | 1onic | 2022-08-04 | Build site. |
Rmd | 15536d1 | 1onic | 2022-08-04 | update |
html | c92e0bf | 1onic | 2022-08-03 | Build site. |
Rmd | 7e88c52 | 1onic | 2022-08-03 | many plots today |
html | 16b143f | 1onic | 2022-07-20 | Build site. |
html | 3393b21 | 1onic | 2022-07-20 | Build site. |
html | a0de791 | 1onic | 2022-07-20 | Build site. |
html | bdf7cc6 | 1onic | 2022-07-20 | Build site. |
html | 02d90b5 | 1onic | 2022-07-20 | Build site. |
html | 0c1180e | 1onic | 2022-07-20 | Build site. |
html | 52e007d | 1onic | 2022-07-20 | Build site. |
html | 1b61926 | 1onic | 2022-07-20 | Build site. |
html | fd6de4c | 1onic | 2022-07-20 | Build site. |
html | 635a240 | 1onic | 2022-07-15 | Build site. |
Rmd | 43b30a7 | 1onic | 2022-07-15 | update |
html | d13dab9 | 1onic | 2022-07-13 | Build site. |
Rmd | 9d20f03 | 1onic | 2022-07-13 | update |
html | 695caed | 1onic | 2022-07-13 | Build site. |
Rmd | 4f3ce77 | 1onic | 2022-07-13 | update |
html | ff7c8b1 | 1onic | 2022-07-13 | Build site. |
Rmd | 439a8b1 | 1onic | 2022-07-13 | geo morris qc |
html | a8c96b0 | 1onic | 2022-07-07 | Build site. |
html | b06249e | 1onic | 2022-07-07 | Build site. |
Rmd | 317351e | 1onic | 2022-07-07 | plots and update |
html | 04de0ae | 1onic | 2022-06-29 | Build site. |
Rmd | 21ab34f | 1onic | 2022-06-29 | update |
html | 2da0805 | 1onic | 2022-06-29 | Build site. |
Rmd | 3ddb896 | 1onic | 2022-06-29 | update |
html | 5eee051 | 1onic | 2022-06-29 | Build site. |
Rmd | ba5d409 | 1onic | 2022-06-29 | update |
html | 1516b6d | 1onic | 2022-06-29 | Build site. |
Rmd | 8b0a3fb | 1onic | 2022-06-29 | update |
html | 28a9475 | 1onic | 2022-06-28 | Build site. |
Rmd | c6f8019 | 1onic | 2022-06-28 | update |
html | 58646b4 | 1onic | 2022-06-15 | Build site. |
Rmd | ead2122 | 1onic | 2022-06-15 | update |
html | f4b1305 | 1onic | 2022-06-15 | Build site. |
Rmd | 9432cfd | 1onic | 2022-06-15 | update |
html | cc4be00 | N | 2022-05-18 | Build site. |
html | 1401122 | N | 2022-05-18 | Build site. |
html | 557a2a9 | N | 2022-05-18 | Build site. |
Rmd | 910178f | N | 2022-05-18 | update |
html | c41d4f9 | N | 2022-05-12 | Build site. |
Rmd | 916fafa | N | 2022-05-12 | update |
html | 1261222 | N | 2022-05-12 | Build site. |
Rmd | 2fa70a3 | N | 2022-05-12 | update |
html | 90df802 | N | 2022-05-11 | Build site. |
html | 6e985df | N | 2022-05-11 | Build site. |
html | a844654 | N | 2022-05-11 | Build site. |
html | aa46898 | N | 2022-05-11 | Build site. |
Rmd | 168f3a4 | N | 2022-05-11 | update |
Rmd | 6170e7a | N | 2022-05-11 | fix plots |
html | d79e35c | N | 2022-05-11 | Build site. |
Rmd | 8987b99 | N | 2022-05-11 | update |
html | bfc20a5 | N | 2022-05-04 | Build site. |
Rmd | 06a6d40 | N | 2022-05-04 | update |
html | ab1c893 | N | 2022-05-04 | Build site. |
Rmd | 9791d29 | N | 2022-05-04 | update |
html | af6c895 | N | 2022-05-04 | Build site. |
Rmd | 884eb8e | N | 2022-05-04 | update |
html | 001acfd | N | 2022-04-29 | Build site. |
Rmd | 3915adc | N | 2022-04-29 | update |
html | 52bb457 | N | 2022-04-29 | Build site. |
Rmd | 3258c7b | N | 2022-04-29 | update |
html | 947de76 | N | 2022-04-29 | Build site. |
Rmd | 6206f48 | N | 2022-04-29 | update |
html | ea30a16 | N | 2022-04-29 | Build site. |
Rmd | cb1cd60 | N | 2022-04-29 | update |
html | 299e079 | N | 2022-04-29 | Build site. |
Rmd | 71c9ef9 | N | 2022-04-29 | update |
html | 30904e4 | N | 2022-04-29 | Build site. |
Rmd | d29957e | N | 2022-04-29 | update |
html | 3540605 | N | 2022-04-29 | Build site. |
Rmd | a223787 | N | 2022-04-29 | update |
html | f681be8 | N | 2022-04-29 | Build site. |
Rmd | bca1b72 | N | 2022-04-29 | update |
html | a110a99 | N | 2022-04-27 | Build site. |
html | 3592f27 | N | 2022-04-27 | Build site. |
html | 37aa3b8 | N | 2022-04-27 | Build site. |
html | 5bad46b | N | 2022-04-27 | Build site. |
html | 7edf2ca | N | 2022-04-27 | Build site. |
Rmd | 562a18e | N | 2022-04-27 | update |
html | 9f87eb0 | N | 2022-04-22 | Build site. |
Rmd | dfb9a9c | N | 2022-04-22 | update |
html | 0390b35 | N | 2022-04-22 | Build site. |
Rmd | d0f70f0 | N | 2022-04-22 | update |
html | f9b5df5 | N | 2022-04-22 | Build site. |
Rmd | 2d69f23 | N | 2022-04-22 | update |
html | 15fc29e | N | 2022-04-22 | Build site. |
html | 80b8695 | N | 2022-04-22 | Build site. |
Rmd | 420e795 | N | 2022-04-22 | update |
html | f8f7402 | N | 2022-04-22 | Build site. |
Rmd | c71434c | N | 2022-04-22 | update |
html | f05d18b | N | 2022-04-21 | Build site. |
html | 976ffc2 | N | 2022-04-21 | Build site. |
html | 63c99d5 | N | 2022-04-21 | Build site. |
html | df02413 | N | 2022-04-21 | Build site. |
Rmd | b6f3e2d | N | 2022-04-21 | update |
html | a518146 | N | 2022-04-21 | Build site. |
Rmd | adc623f | N | 2022-04-21 | update |
html | fc9f1e2 | N | 2022-04-21 | Build site. |
Rmd | 73b0488 | N | 2022-04-21 | update |
html | 59b54df | N | 2022-04-21 | Build site. |
Rmd | dd2554b | N | 2022-04-21 | update |
html | 8a988bf | N | 2022-04-21 | Build site. |
Rmd | ed59668 | N | 2022-04-21 | update |
html | a00bb82 | N | 2022-04-21 | Build site. |
Rmd | 03d63e8 | N | 2022-04-21 | update |
html | 999b0f8 | N | 2022-04-21 | Build site. |
Rmd | f669a9f | N | 2022-04-21 | update |
html | 2ab6cab | N | 2022-04-21 | Build site. |
Rmd | a54127c | N | 2022-04-21 | update |
html | 0e98435 | N | 2022-04-21 | Build site. |
Rmd | 6fe4c33 | N | 2022-04-21 | update |
html | 8002074 | N | 2022-04-21 | Build site. |
Rmd | 7468e93 | N | 2022-04-21 | update |
html | 9e5811f | N | 2022-04-21 | Build site. |
Rmd | 5856174 | N | 2022-04-21 | update |
html | ddc3b40 | N | 2022-04-21 | Build site. |
Rmd | d9cf97c | N | 2022-04-21 | update |
html | e1459d3 | N | 2022-04-21 | Build site. |
Rmd | cd8e08b | N | 2022-04-21 | update |
html | 54c3ae8 | N | 2022-04-21 | Build site. |
Rmd | e336022 | N | 2022-04-21 | update |
html | 8442797 | N | 2022-04-21 | Build site. |
html | 209b9f3 | N | 2022-04-21 | Build site. |
Rmd | d2a65d2 | N | 2022-04-21 | update |
html | 695da69 | N | 2022-04-21 | Build site. |
html | b79474c | N | 2022-04-21 | Build site. |
Rmd | eda3361 | N | 2022-04-21 | update |
html | 8e47783 | N | 2022-04-21 | Build site. |
html | 81bb79f | N | 2022-04-21 | Build site. |
Rmd | e0b1eca | N | 2022-04-21 | update |
html | 26cb950 | N | 2022-04-21 | Build site. |
Rmd | 423829f | N | 2022-04-21 | update |
html | df01e8d | N | 2022-04-21 | Build site. |
Rmd | e6e1fbe | N | 2022-04-21 | update |
html | b14e2a0 | N | 2022-04-21 | Build site. |
Rmd | 8774fc3 | N | 2022-04-21 | update |
html | 5bb0238 | N | 2022-04-21 | Build site. |
Rmd | db52d3e | N | 2022-04-21 | update |
html | 95fbb97 | N | 2022-04-21 | Build site. |
Rmd | e24e3fe | N | 2022-04-21 | update |
html | c84b530 | N | 2022-04-21 | Build site. |
Rmd | 0fa186a | N | 2022-04-21 | update |
html | 61df3bb | N | 2022-04-21 | Build site. |
Rmd | fe3ae7b | N | 2022-04-21 | update |
html | 8f3d857 | N | 2022-04-21 | Build site. |
Rmd | ca83c5b | N | 2022-04-21 | update |
html | 8033d82 | N | 2022-04-21 | Build site. |
Rmd | f38d164 | N | 2022-04-21 | update |
html | 977ea3e | N | 2022-04-21 | Build site. |
Rmd | 1d06774 | N | 2022-04-21 | update |
html | b72fe91 | N | 2022-04-21 | Build site. |
Rmd | 6fcba2a | N | 2022-04-21 | update |
html | edc54ba | N | 2022-04-21 | Build site. |
Rmd | 7510177 | N | 2022-04-21 | update |
html | dd8b725 | N | 2022-04-21 | Build site. |
Rmd | 672cac9 | N | 2022-04-21 | update |
html | e2733c6 | N | 2022-04-21 | Build site. |
Rmd | 1d34f25 | N | 2022-04-21 | update |
html | 32987ea | N | 2022-04-21 | Build site. |
Rmd | 6e2940a | N | 2022-04-21 | update |
html | d1c363f | N | 2022-04-21 | Build site. |
Rmd | c2614f9 | N | 2022-04-21 | update |
html | 984b514 | N | 2022-04-21 | Build site. |
Rmd | 0703fdd | N | 2022-04-21 | update |
html | a1aa819 | N | 2022-04-21 | Build site. |
Rmd | cc8198c | N | 2022-04-21 | update |
html | 725c775 | N | 2022-04-21 | Build site. |
html | d31a7f8 | N | 2022-04-21 | Build site. |
html | 47842f8 | N | 2022-04-21 | Build site. |
html | c997e70 | N | 2022-04-21 | Build site. |
html | 2f8b4a5 | N | 2022-04-21 | Build site. |
html | 3012325 | N | 2022-04-21 | Build site. |
html | 99fa823 | N | 2022-04-21 | Build site. |
html | 53e5c0a | N | 2022-04-21 | Build site. |
html | e2c4450 | N | 2022-04-21 | Build site. |
html | 67323bc | N | 2022-04-21 | Build site. |
html | 47fc19a | N | 2022-04-21 | Build site. |
html | 958dea7 | N | 2022-04-21 | Build site. |
html | a2b7524 | N | 2022-04-21 | Build site. |
html | 4b231e9 | N | 2022-04-21 | Build site. |
html | ea6da48 | N | 2022-04-21 | Build site. |
html | 56c8aba | N | 2022-04-21 | Build site. |
html | 649b842 | N | 2022-04-21 | Build site. |
html | be6979f | N | 2022-04-21 | Build site. |
Rmd | a7a4b7c | N | 2022-04-21 | update |
html | be51883 | N | 2022-04-21 | Build site. |
Rmd | 4ef7304 | N | 2022-04-21 | update |
html | 448b536 | N | 2022-04-21 | Build site. |
Rmd | b3505d1 | N | 2022-04-21 | update |
html | e28279e | N | 2022-04-20 | Build site. |
Rmd | db98773 | N | 2022-04-20 | update |
html | bb22208 | N | 2022-04-20 | Build site. |
Rmd | ee0b559 | N | 2022-04-20 | update |
html | 255e600 | N | 2022-04-20 | Build site. |
Rmd | 120049d | N | 2022-04-20 | update |
html | a3359c9 | N | 2022-04-20 | Build site. |
Rmd | f8911b5 | N | 2022-04-20 | update |
html | ae060f1 | N | 2022-04-20 | Build site. |
Rmd | df8da57 | N | 2022-04-20 | update |
html | 2f039ee | N | 2022-04-20 | Build site. |
Rmd | a328d9c | N | 2022-04-20 | update |
html | c3c2bd2 | N | 2022-04-20 | Build site. |
Rmd | b54ef53 | N | 2022-04-20 | update |
html | 4c4e773 | N | 2022-04-20 | Build site. |
Rmd | 28b02b6 | N | 2022-04-20 | update |
html | fa921b4 | N | 2022-04-20 | Build site. |
Rmd | 464dcde | N | 2022-04-20 | update |
html | 22b6266 | N | 2022-04-20 | Build site. |
Rmd | b93489c | N | 2022-04-20 | update |
html | df35a34 | N | 2022-04-20 | Build site. |
Rmd | 8dd0b8e | N | 2022-04-20 | update |
html | c0d283c | N | 2022-04-20 | Build site. |
Rmd | 6d2f0fe | N | 2022-04-20 | update |
html | 64a309d | N | 2022-04-20 | Build site. |
Rmd | caceead | N | 2022-04-20 | update |
html | 57904e2 | N | 2022-04-20 | Build site. |
Rmd | 3a1b6e4 | N | 2022-04-20 | update |
html | 3a63d56 | N | 2022-04-20 | Build site. |
Rmd | de06b14 | N | 2022-04-20 | update |
html | 6d20288 | N | 2022-04-20 | Build site. |
Rmd | ad3cbc0 | N | 2022-04-20 | update |
html | 75a6a87 | N | 2022-04-14 | Build site. |
Rmd | b664cd6 | N | 2022-04-14 | update |
html | ad5acfb | N | 2022-04-14 | Build site. |
Rmd | 4f0db98 | N | 2022-04-14 | update |
html | 63cf604 | N | 2022-04-14 | Build site. |
Rmd | c99d312 | N | 2022-04-14 | update |
html | 77a1d7d | N | 2022-04-14 | Build site. |
Rmd | d83ddae | N | 2022-04-14 | update |
html | f68ef97 | N | 2022-04-14 | Build site. |
Rmd | a5afdfa | N | 2022-04-14 | update |
html | 0facbda | N | 2022-04-14 | Build site. |
Rmd | 232558c | N | 2022-04-14 | update |
html | cd1f137 | N | 2022-04-14 | Build site. |
Rmd | cd8a1aa | N | 2022-04-14 | update |
html | a38d635 | N | 2022-04-14 | Build site. |
Rmd | 8b36190 | N | 2022-04-14 | update |
html | fb85bb8 | N | 2022-04-14 | Build site. |
Rmd | edee7a3 | N | 2022-04-14 | update |
html | 0330576 | N | 2022-04-14 | Build site. |
Rmd | 4178c6c | N | 2022-04-14 | update |
html | fb7b374 | N | 2022-04-14 | Build site. |
Rmd | 6e98e9e | N | 2022-04-14 | update |
html | ab210ec | N | 2022-04-14 | Build site. |
Rmd | cdcb0c6 | N | 2022-04-14 | update |
html | 84e9458 | N | 2022-04-14 | Build site. |
Rmd | d623e2c | N | 2022-04-14 | update |
html | 4f14197 | N | 2022-04-13 | Build site. |
Rmd | 62bbf8a | N | 2022-04-13 | update |
html | fd4742b | N | 2022-04-07 | Build site. |
html | fa53b92 | N | 2022-04-07 | Build site. |
html | b472f9a | N | 2022-04-07 | Build site. |
html | 7240cfb | N | 2022-04-07 | Build site. |
html | ecf9247 | N | 2022-04-07 | Build site. |
html | e865c25 | N | 2022-04-07 | Build site. |
html | 3f5f68c | N | 2022-04-07 | Build site. |
html | dd23b48 | N | 2022-04-07 | Build site. |
html | dca45ed | N | 2022-04-07 | Build site. |
html | 2b73c03 | N | 2022-04-07 | Build site. |
html | 047c6b5 | N | 2022-04-07 | Build site. |
html | b2d88e7 | N | 2022-04-07 | Build site. |
html | 32e9594 | N | 2022-04-07 | Build site. |
html | 65a9a67 | N | 2022-04-07 | Build site. |
html | 75c5e7a | N | 2022-04-07 | Build site. |
html | 5610a89 | 1onic | 2022-03-30 | Build site. |
html | 106b4ba | 1onic | 2022-03-29 | Build site. |
Rmd | 866499b | 1onic | 2022-03-29 | update |
html | 7a94181 | 1onic | 2022-03-29 | Build site. |
Rmd | d3150dd | 1onic | 2022-03-29 | update |
html | 020099a | 1onic | 2022-03-29 | Build site. |
Rmd | a2e1231 | 1onic | 2022-03-29 | update |
html | 2d10d76 | 1onic | 2022-03-29 | Build site. |
Rmd | fdccb6a | 1onic | 2022-03-29 | update |
html | 98c5232 | 1onic | 2022-03-29 | Build site. |
Rmd | 6933781 | 1onic | 2022-03-29 | update |
html | f9e51e5 | 1onic | 2022-03-29 | Build site. |
html | 261d91b | 1onic | 2022-03-29 | Build site. |
Rmd | b5e11b4 | 1onic | 2022-03-29 | update |
html | db4f9ab | 1onic | 2022-03-29 | Build site. |
Rmd | aedf0e8 | 1onic | 2022-03-29 | update |
html | f3de280 | 1onic | 2022-03-29 | Build site. |
Rmd | d637a2f | 1onic | 2022-03-29 | update |
html | bd1d7eb | 1onic | 2022-03-29 | Build site. |
Rmd | 3363c79 | 1onic | 2022-03-29 | update |
html | 187eeda | 1onic | 2022-03-29 | Build site. |
Rmd | 2598b1a | 1onic | 2022-03-29 | update |
html | ac6498b | 1onic | 2022-03-29 | Build site. |
Rmd | 9a52488 | 1onic | 2022-03-29 | update |
html | 6281523 | 1onic | 2022-03-29 | Build site. |
Rmd | 7517bd7 | 1onic | 2022-03-29 | update |
html | 717d3ef | 1onic | 2022-03-29 | Build site. |
Rmd | 196bd88 | 1onic | 2022-03-29 | update |
html | 9985d85 | 1onic | 2022-03-29 | Build site. |
html | 4b359f9 | 1onic | 2022-03-29 | Build site. |
Rmd | 9e43b27 | 1onic | 2022-03-29 | update |
Rmd | b1b36f8 | 1onic | 2022-03-29 | figure folder |
html | b4bc8b4 | 1onic | 2022-02-28 | Build site. |
Rmd | e64e65f | 1onic | 2022-02-28 | update |
html | 2a2c70c | 1onic | 2022-02-14 | Build site. |
Rmd | 36622b5 | 1onic | 2022-02-14 | update |
html | de0d136 | 1onic | 2022-02-14 | Build site. |
Rmd | 7f988ff | 1onic | 2022-02-14 | update |
html | 706bb31 | 1onic | 2022-02-14 | Build site. |
Rmd | 40bbb38 | 1onic | 2022-02-14 | update |
html | 26577ad | 1onic | 2022-02-07 | Build site. |
Rmd | 6c3b33b | 1onic | 2022-02-07 | update |
Rmd | b73671a | 1onic | 2022-02-07 | fix |
Rmd | 1d82c17 | 1onic | 2022-02-07 | fix |
Rmd | 986a058 | 1onic | 2022-02-07 | fix |
html | a1f7199 | 1onic | 2022-02-07 | Build site. |
Rmd | da1e0f8 | 1onic | 2022-02-07 | update |
Rmd | cd76904 | GitHub | 2022-01-31 | Update sceptre.Rmd |
html | b3a3765 | 1onic | 2022-01-24 | Build site. |
Rmd | 5b00f20 | 1onic | 2022-01-24 | update |
html | ba8a711 | 1onic | 2022-01-24 | Build site. |
Rmd | 98103dc | 1onic | 2022-01-24 | update |
html | 3b8a155 | 1onic | 2022-01-24 | Build site. |
Rmd | e52c035 | 1onic | 2022-01-24 | update |
html | d2f418a | 1onic | 2022-01-24 | Build site. |
Rmd | 773e4de | 1onic | 2022-01-24 | update |
html | da72676 | 1onic | 2022-01-24 | Build site. |
Rmd | 9171a12 | 1onic | 2022-01-24 | update |
html | 80e6d10 | 1onic | 2022-01-19 | Build site. |
Rmd | e6da37b | 1onic | 2022-01-19 | update |
html | 916fe74 | 1onic | 2022-01-19 | Build site. |
Rmd | 81baae2 | 1onic | 2022-01-19 | update |
html | 139f0b0 | 1onic | 2022-01-19 | Build site. |
html | b6da5d7 | 1onic | 2022-01-19 | Build site. |
Rmd | b7b9894 | 1onic | 2022-01-19 | update |
In this analysis, we aim to split the Xie et al. dataset and Morris et al. v2 dataset into 2 distinct subsets of varying size, and running SCEPTRE analysis on each subset. To this end, we selected the first half of the cells (subset #1) and the last quarter of the cells (subset #2), after undergoing QC, in order to create two unique sets of high quality cells. Neither the 50% subset nor the 25% subset share any cells in common, all of the cells are unique to each split.
Below are two sets of UPsetR plots showing the overlap of signficant (0.2 FDR) trans- (>10Mb) pooled guide-gene pairs for 3 SCEPTRE runs. All 3 runs were performed using mappability adjusted count data.
The goal of this section is to visualize the intersection of significant signals in the SCEPTRE cis- analysis of the Gasperini et al. dataset against the published manuscript results of 664 site-gene pairs of which 470 are high-confidence.
This section repeats the analysis done below but on the Morris v1 dataset and uses a modified mappability correction method, it also includes average mappability of the high and low ratio selections.
One question we asked was where will we find the highest and lowest mappability genes in the above scatter plot?
The purpose of this section is to analyze the effect of mappability correction with intron counting turned on in CellRanger software. For single nucleus data it is necesary to incude introns during quantification of expression counts from the raw underlying data. This means that including introns is a required setting for processing this data. This is due to the fact that reads from RNA captured in the nuclei will have a higher proportion of unspliced transcipts, as shown in K562 cells in a deep-sequencing study by Tilgner, H. et al. with some of the earlier work done by Ameur et al. (2011), who proposed co-transcriptional dominance in analysis of total RNA-Seq in cells from the human brain. While most splicing occurs during transcription in accoradance to the ‘co-transcriptional splicing’ model, some transcipts will be spliced later and will contain sequences mapping to intronic regions on the genome (introns). For example, Tilgner, H. et al. found alternative exons and long noncoding RNAs to have their splicing occur at a later time, with the latter sometimes not being spliced at all.
The Figure above shows that in K562 cells fractional analysis of deep RNA-Seq has shown that most splicing is initiated before completion of transcription due to the time the poly A site is reached in transcription and the markedly high coSI (author introduced score which represents the completed splicing index) of polyA- nuclear RNA shown in the above figure. This means that for single nucleus RNA-Seq we should expect a higher ratio of intron-aligning reads to exon-aligning reads.
To start with a broad comparison, we plot and compare the shape of the distribution of non-zero cells per gene between the two datasets. The plot below shows that both the mappability adjusted and normal datasets have similarly shaped distributions.
In order to evaluate how the mappabiity adjustment performs we plotted the counts of non-zero cells per gene in both datasets, with each point corresponding to a single gene. We chose to compare non-zero counts becuase decreases in UMIs in a single gene accross a matched number of cells are easier to define by using a binary definition of change. We hypothesize that genes strongly affected by low mappability reads will likely contain a greater proportion of zero counts (UMIs).
In the plot above we highlight two different sub groups in the plot. The ‘high ratio’ genes in gold represent genes where the ratio of non-zero cells per gene was between 0.01 of 1.0 (0.99<ratio<1.01) or nearly identical between datasets. While the ‘low ratio’ genes highlighed in red represent genes where the ratio of non-zero cells per gene decreased between mappability adjustedment and was less than 0.75, the genes also must have had non-zero counts above certain dataset specific thresholds. This was done in order to capture a good ‘triangular shaped’ sample of genes that are well expressed in both datasets but contain deffering non-zero cells between datasets due to mappability adjustment.
One question that we wanted to answer was wether the low ratio or high ratio genes highlighted in the plot above had differing gene mappabiity scores. It would follow that differences in non-zero cells per gene in genes affected by low mappability reads would affect genes with lower mappability scores. We therefore expect to see a decrease in gene mappability when comparing ‘high ratio’ genes to ‘low ratio’ genes. In order to demonstrate this we randomely sampled 100 of each of the highlighted subgroups and plotted their gene mappability scores as defined by Alexis Battle and Ashis Saha 2018. The gene mappability scores used were defined as “weighted average of its exon- and UTR-mappability, weights being proportional to the total length of exonic regions and UTRs, respectively” (Alexis Battle and Ashis Saha 2018). A gene-mappabiliy score of 1 represents a gene where are the k-mers are uniquely mapped to that gene, and no other regions on the genome.
One question we asked was where will we find the highest and lowest mappability genes in the above scatter plot?
Repeating this analysis using QCed matrices for both datasets shows similar results. Below, the datasets were subject to cell filtering and gene filtering to identify and compare common well-expressed genes against matching cells in post-qc data.
This section presents the results of running SCETPRE trans- analysis on random subsets of the Gasperini data. Specifically we are interested in comparing the results from the 50% random cell subset and the 70% random subset. Here we perform normal cell QC but we then randomely sample 50 or 70 percent of cells prior to perfoming gene and guide QC. This results in a downsampled dataset where genes and guides are affected by the decrease in total cells availible for analysis.
This is the cis- pair analysis.
This is the trans- pair analysis.
There may be an issue with these results, I am in the process of running a few checks.
The purpose of this section is to report on SCETPRE replication and replication pi1 stat in order to look at scRNA Seq and SCEPTRE’s ability to replicate eQTL signals found in the DGN study, bulk tissue eQTL vs scRNAseq w/ CRISPR perturbationsa at the eGene site.
To evaluate SCETPRE between set in scRNA datasets we compared trans- signal replication between Xie et al. and Gasperini et al. The % of replicated significant trans- signals is shown on the y-axis while the direction of the replication comparision is indicated on the x-axis of the barplot.
This section will showcase the results of comparing FRPerturb to SCETPRE in datasets with respect to the cis analysis. The UPSETR plots below show the intersection between significant (FDR < 0.1) FRPerturb results and SCETPRE.
For the Morris v2 dataset there are 4,438 cis pairs and SCETPRE found 42 signals, 41 regulators at an cis FDR of 0.05. At the same FDR, FRPerturb found 16 signals, 8 regulators. Below is an UPsetR plot between the 41 sig. cis regulators in SCETPRE and 8 regulators in FRPerturb.
For the Xie et al. dataset there are 2,727 cis pairs and SCETPRE found 95 signals, 72 regulators were significant at an cis FDR of 0.05. At the same time, FRperturb found 54 signals and 22 regulators.
One thing that has come to attention is the fact that after normalization the amount of genes per cell and overall count depth (total reads) per cell is severely low. This makes increasing the QC parameters of the entire dataset after count depth normnalization not possible.
For example here is the overall shape of the distribution of both genes per cell and count depth after normalization in th ecurrent dataset. The red vertical lines inicate current QC parameters.
In order to investigate the cause of the post-normalization criteria we can try plotting pre-normalization stats per gemgroup (aka per batch):
Also showing the same patterns in the feature reads per cell (per gemgroup/batch):
Overall it seems that some batches in the KD8 dataset differ significantly from other batches before normalization. This could be the cause of the low gene per cell count/ low overall count depth after normalization.
In this section we will examine the FDR significant regulators within the trans SCEPTRE results of 4 runs (Morris v1, Morris v2, Xie, and Gasperini).
In order to test for significant enrichment of GO terms associated with biological processes, we need to first annotated our significant results since our trans- regulators in these 3 datasets refer to sgRNAs targeting both enhancers and SNPs. To address this we will link each sgRNA in the trans- gRNA-gene pairs to the closest protein coding or lncRNA gene based on the gencode v32 reference.
# the references used
protein_coding_gencode32 = readRDS('./gencode.v32.primary_assembly.annotation.protein_coding.rds')
protein_coding_lncRNA_gencode32 = readRDS('./gencode.v32.primary_assembly.annotation.protein_coding_lncRNA.rds')
Both of these references are sourced from the GENCODE V32 GTF reference file. The R code to replicate and create these references is availible REFERENCE CODE
For all sig. trans- regulators the results can be seen here: All trans- regulators (n=993)
For all sig. trans- regulators the results can be seen here: All trans- regulators (n=922)
This section will breifly show the basic QC parameters set to filter the mappabiity adjusted version of two of our primary datasets.
This section goes over the QC that was performed in order to prepare the Gasperini dataset for SCEPTER analysis using our QC pipelione.
One of the key modifications to the pipeline inludes the usage of the Suerat in order to read the 10X H5 output of CellRanger Aggr. It is also important to note that in order to aggregate the Gasperini dataset both the lab node and the multicore CellRanger software needed to be used.
This section rehashes what was briefly discussed above: which is that the Gasperini dataset is already large enough to warrant two major changes:
By changing the Gasperini QC to use H5 matrices we can achieve reading GEX/GDO matrices + grouping the guide matrix in 8 minutes.
[1] "combined matrix size: "
[1] 2447 246076
473.602 sec elapsed
For example, when running with the modified 10X Cellranager we can achieve aggregation in 25hr of compute time. And importantly, I found that the processing time was spend almost entirely on the multicore protospacer count step:
2023-04-11 13:25:42 [runtime] (update) ID.AGG_Gasperini.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR._CRISPR_ANALYZER.CALL_PROTOSPACERS.fork0 chunks_running
2023-04-12 13:09:30 [runtime] (chunks_complete) ID.AGG_Gasperini.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR._CRISPR_ANALYZER.CALL_PROTOSPACERS
From these logs we can see that the protospacer time is ALMOST 24 hours! The total runtime is just larger than 25 hours so we can eviudently see that protospacer counting and aggregation takes a super long time even for a smaller dataset with a lot less cells than Replogle (3.9m cells, 8 days aggregate time)
One thing to note is that these plots are now standard for our pipeline with every dataset subject to these views/plots to set QC thresholds (dataset specific). These plots show the data prior to filtering but place RED lines where the current QC thresholds will be applied as a result of QC.
This ensures that the gene we test is well expressed in the cell population we are studying (K562). We are using a QC threshold of 0.525% because this was originally defined for this dataset.
“K562 expressed genes were defined as at least one read in 0.525% of cells in the same dataset.” (Gasperini et al. 2019)
This setion will show the primary results of the Morris trans- analysis.
Here I will show the results of processing Xie et al trans- sceptre results to show calibration on the NTCs and the % Cross-Mappable Results:
type | num_fdr_sig |
---|---|
SNP | 0.6775956 |
SNP-Adjusted | 0.3518519 |
Here I will show the results of processing Morris et al trans- sceptre results to show a table of % Cross-Mappable Results:
type | num_fdr_sig |
---|---|
SNP | 0.6735537 |
SNP-Adjusted | 0.5739437 |
Due to the fast moving nature of single cell research lately some of the older single cell datasets suffer from organizational issues when it related to reprocessing the raw data.
Normally when we browse scRNA seq/CRISPRi pertubation datasets on GEO we will see an option to download the raw data from NCBI SRA repository via something like sratools. For datasets such as Morris et al, Replogle et al, and Xie et al, the rtaw data for the sequencing is availible like so. The general process for downloading and preparing this data would follow the standard sra tools process for unencrypted data. We would download with prefectch and then dump the FASTQ data from the SRA archive. We would then trim reads and perform alignement on the raw FASTQ data with a configured cellranger (configured to the specific RNA-seq/CRISPR perturbation dataset).
But, I have recently encountered a slightly different dataset from Gasperini et al, where the raw data is reported as BAM file. In order to process this data we must first revert it. Why? Becuase actually the BAM is a post-alignment BAM output made by the CellRanger pipeline. In order to recover the original FASTQ files we need to revert this very specific BAM file with the provided Cellranger tool and make sure we do so with the standard metadata. If you were to download and extract the SRA achive the resulting BAM cannot be processed due to missing metadata.
We need to ensure that the version of raw data that we have contains the original metadata encoded into the BAM output by cellranger. To do so, we need to make sure to download the originbal sumbitted data to NCBI not the SRA archive.
1 Download original data [not the usual SRA archive!, key step]
# we will use the standard prefech method for downloading the data
# this command is now modified to include: --type TenX which will download the availible original BAM file with metadata.
/home/nbabushkin/sratoolkit.3.0.0-ubuntu64/bin/prefetch --verbose --verify yes --resume yes --type TenX --option-file sra.txt
Notice that we will skip the extraction step: fastqdump becuase we are not working with SRA archives anymore since the above code will download the original BAM file.
2 Next, we will download the cellranger bam2fastq tool which is used to recover the original FASTQ files from a cellranger BAM output fromt the cellranger github:
wget https://github.com/10XGenomics/bamtofastq/releases/CURRENTRELEASE
3 Run the bam2fastq tool to recover original FASTQ input:
/home/nbabushkin/cellranger_bam2fastq/bamtofastq_linux --nthreads=4 at_scale_screen.2A_1_gRNA_AAGTAGCT.grna.bam 2A_1_gRNA_AAGTAGCT
This will in turn recover each of the 4 lanes used the example batch: “2A_1” into the folder: 2A_1_gRNA_AAGTAGCT
bamtofastq_S1_L001_I1_001.fastq.gz
bamtofastq_S1_L001_R2_001.fastq.gz
bamtofastq_S1_L002_R1_001.fastq.gz
bamtofastq_S1_L003_I1_001.fastq.gz
bamtofastq_S1_L003_R2_001.fastq.gz
bamtofastq_S1_L004_R1_001.fastq.gz
bamtofastq_S1_L001_R1_001.fastq.gz
bamtofastq_S1_L002_I1_001.fastq.gz
bamtofastq_S1_L002_R2_001.fastq.gz
bamtofastq_S1_L003_R1_001.fastq.gz
bamtofastq_S1_L004_I1_001.fastq.gz
bamtofastq_S1_L004_R2_001.fastq.gz
We can then re-run a newer (or changed) version of CellRanger count on these FASTQ files, generating another BAM and a set of matrices for this batch. Lastly, we would run cellranger aggregate on all of the counts.
4 Run CellRanger Count (all batches)
5 Run CellRanger Aggregate
This section will explore running SCEPTRE analysis of the already processed Morris data from GEO. This section aims to replicate the original reported results and compare the results of SCEPTRE being run on existing data with respect to NTC calibration, cis signals, and distribution of the pairs made by Morris et al originally.
comparisons | fdr_sig | bonferroni_sig |
---|---|---|
original SCEPTRE results | 54 | 36 |
nextflow SCEPTRE | 50 | 24 |
snakemake retroSCEPTRE | 55 | 35 |
This section will explore running SCEPTRE analysis of scRNAseq data from Xie et al., with an emphasis on the p values and distribution of SCEPTRE results for the in silico non-targetting controls used in the original manuscript. To this end, this section will show that for a 5,000 gene-gRNA pair subset of the total 84,500 NTC pairs, SCEPTRE has differing results conditional on the SCEPTRE method used. In summary, we believe we are seeing a skew in the distribution of p values associated with the non targetting controls when the newer, version 1.0.0 of SCEPTRE package is used alongside its nextflow pipeline. In contrast, the original results do now show this skew for the 5,000 gene gRNA pair subset that was selected for evaluation tests. When analyzing these pairs using an in house SCEPTRE-snakemake package built upon the original manuscript SCPETRE package, we see that the p values produced are roughly equivalent by Pearson R correlation accross all 5,000 p values to those p values produced in the original manuscript analysis. Notably this SCEPTRE version produces p values for the in silico NTCs that are not skewed, as seen in panel C.
print('loading historic insilico results: ')
hist_results = read_fst('/project/xuanyao/nikita/SCEPTRE/data/Xie_2019/Xie_manuscript_downloaded_process/results_BOXdl/all_results_annotated.fst') %>% as.data.table()
hist_insilico_ntc_results = hist_results %>% filter(type == 'negative_control') # pair_str is already there
print('loading QC results')
# our manu re-run: QC0: Yx ~ Xx + Cx ====> yx_xx_cx
qc0_check_5kNTC = readRDS('/scratch/midway3/nbabushkin/Xie_QC0_SCEPTRE_CHECK3/sceptre_results.rds') # 5k results here
qc0_check_5kNTC = qc0_check_5kNTC %>% rowwise %>% mutate(pair_str = paste0(grna_group,'+',gene_id)) %>% ungroup()
qc0_retro_sceptre = read_fst('/scratch/midway3/nbabushkin/Xie_retroSCEPTRE/results/all_results.fst')
qc0_retro_sceptre = qc0_retro_sceptre %>% rowwise %>% mutate(pair_str = paste0(gRNA_id,'+',gene_id)) %>% ungroup()
qc0_retro_sceptre_snakemake = read_fst('/scratch/midway3/nbabushkin/Xie_QC0_SCEPTRE2_test2/results/all_results.fst')
qc0_retro_sceptre_snakemake = qc0_retro_sceptre_snakemake %>% rowwise %>% mutate(pair_str = paste0(gRNA_id,'+',gene_id)) %>% ungroup()
print('comparing 5k NTC results...')
qc0_w_historic_5k = qc0_check_5kNTC %>% left_join(hist_insilico_ntc_results %>% select(pair_str, hist_p_value = p_value, gene_name), by = 'pair_str')
qc0_w_historic_5kretro = qc0_retro_sceptre %>% select(gene_id, gRNA_id, p_value,z_value, pair_str) %>% left_join(hist_insilico_ntc_results %>% select(pair_str, hist_p_value = p_value, gene_name), by = 'pair_str')
qc0_w_historic_5kretrosnakemake = qc0_retro_sceptre_snakemake %>% select(gene_id, gRNA_id, p_value,z_value, pair_str) %>% left_join(hist_insilico_ntc_results %>% select(pair_str, hist_p_value = p_value, gene_name), by = 'pair_str')
check1 = cor(qc0_w_historic_5k$p_value, qc0_w_historic_5k$hist_p_value) # [1] -0.2058252 3-3-3
check2 = cor(qc0_w_historic_5kretro$p_value, qc0_w_historic_5kretro$hist_p_value) # [1] 0.9999586 now! 3!
check3 = cor(qc0_w_historic_5kretrosnakemake$p_value, qc0_w_historic_5kretrosnakemake$hist_p_value) # 0.977 [1] ...
print('(based on a random sample of 5k in silico NTCs made by manuscript, total: 84.5K NTCs)')
print(paste0('[QC0 calibration; old SCEPTRE v0.1.0] QC0 Pearson R based on historical, reported results: ', check2))
print(paste0('[QC0 calibration; nextflow SCEPTRE v1.0.0] QC0 Pearson R based on historical, reported results: ', check1))
print(paste0('[QC0 calibration; retro SCEPTRE (aka snakemake SCEPTRE)] QC0 Pearson R based on historical, reported results: ', check3))
This code outputs the follow pearson R correlation results:
[1] "[QC0 calibration; old SCEPTRE v0.1.0] QC0 Pearson R based on historical, reported results: 0.999958613825321"
[1] "[QC0 calibration; nextflow SCEPTRE v1.0.0] QC0 Pearson R based on historical, reported results: -0.20592926049342"
[1] "[QC0 calibration; retro SCEPTRE (aka snakemake SCEPTRE)] QC0 Pearson R based on historical, reported results: 0.977527051289071"
From this we can expect the nextflow SCEPTRE results to be quite different, while the snakemake SCEPTRE / retroSCEPTRE results seem to be right on top and possibly differing only due to a randomness mismatch due to the extremely high Pearson R accross the raw p values output by .
To further really explore how the p values are distributed we can make QQ plots and histograms of the raw p values produced by each method: 1) original results 2) nextflow SCEPTRE 1.0.0 package 3) retroSCEPTRE/snakemake SCEPTRE.
Then we can also plot scatter plots too see another pattern evident only in the nextflow or v1.0.0 SCEPTRE.
This is accentuated if we look at 20k in silico NTCs using the nextflow SCEPTRE.
This section will show the results of various different QC configurations aimed at recalculating manuscript results, and to compare the processed matrices input into SCEPTRE against our own re-processed with CellRanger matrices and SCEPTRE input.
This shows the original manuscript results for all 84.5K in silico NTC pairs. Here the results are read directly from those reported in the mansuscript and re-plotted.
This is essentially Figure 3 from the manuscript.
This is the original manuscript NTC results subset to a set of 5k NTCs that we will later compare against (both by visual-plotting and pearson R of p values).
This is roughly the same as above since this is a subset of those shown in the first, control qq plot.
This is the modern rendition of the manuscipt results. Specifically, here we are utilizing the Nextflow SCEPTRE pipeline to evaluate the original manuscirpt pairs. Here we are using a gene expression and gRNA pertubation matrix from the manuscipt (GEO-sourced+manuscript code), including their covariates. The goal here is to use all inputs that are identical to the original inputs used in the manuscript. The expectation is that the p values for the subset of 5k in silico NTCs will be the same. We expect the qqplot to also be similar for these pairs due to the exactedness of the input.
Here we utilize the old SCEPTRE version.
Here we are now changing the gene expression input matrix to be sourced from our re-processing (CellRanger count+aggr output). But we are still using the original covaraites and gRNA matrix.
Here we are now using our own gRNA input matrix to be sourced from our re-processing (CellRanger count+aggr output).
Here we are now using our covariates but we are sourcing the matrices from the manuscript input.
In this section we are trying explore the NTC pairs in Xie et al. dataset used in the SCEPTRE manuscript. In the manuscript the authors run SCEPTRE on the Xie et al. dataset, analyzing candidate CIS pairs and in silico NTC pairs, as well as bulk validations pairs. By re-running the entire analysis we can take a look at the effects of grouping guides together and the usage of in silico controls as NTC, versus the traditional method of designing non-targeting sgRNAs within the CRISPRi experiment.
We start by re-plotting Fig. 3 from the manuscript:
This is strongly similar to Figure 3 shown in the manuscript paper.
Next we can plot our results of the same in silico gene-gRNA NTC pairs:
Lastly we can plot just sgRNA-NTC-1 and sgRNA-NTC-3 NTC guides, paired with all availible genes:
The general idea is visualize Trans- signal in a dataset as a network. As an test, we will make network plots using the signal from Morris et al Trans- results using igraph and network libraries in R.
# read in results, extract gRNAs
results = read_fst('/project/xuanyao/nikita/SCEPTRE/results/Morris_cis_and_trans_analysis/Morris_trans_SCEPTRE_results_10_13/Morris_SCEPTRE_results_short.fst') %>% as.data.table()
gRNAs = results %>% pull(gRNA_id) %>% unique()
# select FDR sig. results, create edge table
# here 'gRNA'-'gene' pairs > populate 'source'-'target' columns in table
# importance = -log10(results_fdr$bh_adjusted_pvalue) could be used to set width of connections later
results_fdr = results %>% filter(bh_rejected)
links = data.table(source = results_fdr$gRNA_id, target = results_fdr$gene_id)
# make vertices list from edge-link table source-target columns
vertices <- unique(unlist(c(links$source,links$target)))
# constuct color vector + node table using vertices list
netnodes = data.table(name = vertices) %>% mutate(colors = ifelse(name %in% gRNAs, sample(colors, replace = TRUE), '#888888'), )
# init. network w/ link table + node table
net <- graph_from_data_frame(d=links, vertices=netnodes, directed=F)
# save + output plot w/ colors
color_vector <- netnodes$colors
png(file=paste0(out_dir, '/node.png'), width=5000, height=5000, res=300) # width=1000, height=1000, res=300
plot(net, edge.arrow.size=.3, vertex.size=3, vertex.color=color_vector, vertex.label.cex=0.8, vertex.label = ifelse(V(net)$name %in% gRNAs, V(net)$name, NA))
title(main = 'Network Plot using FDR gRNA-gene Pairs as Source-Target', cex.main = 2, sub = NULL, xlab = NULL, ylab = NULL)
legend( "bottomright", legend = c('gRNA', 'Gene'), pt.bg = c('#332288','#888888' ), pch = 21, cex = 3.3, bty = "n", title = "Node Type:" )
dev.off()
Here is the plot result. The grey nodes represent genes found in FDR signficant gRNA-gene pairs, while nodes in a random (colorblind-safe) color represent gRNAs.
One way we can modify this further is to use the adjusted p-values as weights in order to show the connection strengh between a given gRNA and gene pair within our plot.
color_vector <- netnodes$colors
png(file=paste0(out_dir, '/node2.png'), width=5000, height=5000, res=300) # width=1000, height=1000, res=300
# we add edge.width=E(net)$importance*.7 to scale the node links using the importance column
plot(net, edge.arrow.size=.3, vertex.size=3, edge.width=E(net)$importance*.7, vertex.color=color_vector, vertex.label.cex=0.8, vertex.label = ifelse(V(net2)$name %in% gRNAs, V(net2)$name, NA),)
title(main = 'Network Plot using FDR gRNA-gene Pairs as Source-Target \n Node connections are weighted by -log10(FDR-adjusted-p-value)', cex.main = 2, sub = NULL, xlab = NULL, ylab = NULL)
legend( "bottomright", legend = c('gRNA', 'Gene'), pt.bg = c('#332288','#888888' ), pch = 21, cex = 3, bty = "n", title = "Node Type:" )
dev.off()
Final plot for FDR sig. Trans- gRNA-gene pairs in Morris et al.
In order to run SCEPTRE we need to first perform QC, generate covariate matrices, and place the matrices into an ondisc H5 multimodal matrix, which is then the primary input to SCETPRE. We also need to generate gene-gRNA pairs based on information after QC is performed.
Below is an outline of the general QC parameters used
We will also generate a initial covariate matrix based on the cDNA/GDO data. But it is important to note that this matrix is only used to create the final covariate matrix, which is stored within the multimodal ondisc H5 object.
The following covariates are necessary for SCEPTRE:
Two important things must be noted: these are based on filtered outputs, from CellRanger, specific to each cell barcode and these utilize information from QC. Notably the percent mitochondrial reads must be derived based on the reference used during QC. The number of sgRNAs that are unique or non-zero, is also based on the thresholded gRNA matrix. For example, we would not count a sgRNA in a cell with a count of 4 becuase this is below our threshold of 5, and the binary matrix would not indicate a presence of that sgRNA, so the covariate matrix must not count this sgRNA towards that cell’s total unique sgRNA count.
These QC plots are meant to show the dataset after it has been filtered by Cellranger, but before we apply the QCs above.
These QC plots serve to highlight the same plots shown above applied to the final, filtered and prepared dataset.
In order to run SCEPTRE we also need to generate gene-grna pairs based on both the filtered genes and gRNAs above. This is done in a seperate script because it takes a lot of memory. One other thing that is important to note, is that we need information about the sgRNAs in order to calculate trans- distances. Notice that we build on the CSV reference made during the cellranger prep.
gRNA_location_ref = data.table::fread('/home/nbabushkin/xie_mappability_sceptre/xie_feature_reference_m1_INFO.csv', stringsAsFactors = F, header = T, sep=',')
# header: id,name,read,pattern,sequence,feature_type,region_pos_hg38,spacer_pos_hg38
# chr1:11671358-11671758, chr1:11671483-11671502
# extract chromosome info
gRNA_location_ref = gRNA_location_ref %>% rowwise() %>% dplyr::mutate(chr = str_replace( strsplit(spacer_pos_hg38, ":")[[1]][1], 'chr', '' ))
# extract start and stop
gRNA_location_ref = gRNA_location_ref %>% rowwise() %>% dplyr::mutate(start = strsplit( strsplit(spacer_pos_hg38, ":")[[1]][2], '-' )[[1]][1] )
gRNA_location_ref = gRNA_location_ref %>% rowwise() %>% dplyr::mutate(end = strsplit( strsplit(spacer_pos_hg38, ":")[[1]][2], '-' )[[1]][2] )
# fix names used in matrices vs those used in the reference
gRNA_location_ref = gRNA_location_ref %>% rowwise() %>% dplyr::mutate(id = str_replace_all(id, '_', '-'))
# clean up the frame: rm non-autosomal gRNAs
temp = nrow(gRNA_location_ref)
gRNA_location_ref = gRNA_location_ref %>% dplyr::filter( chr %in% approved_chrs) # should be only 1-22
print(paste0('Removed ', (temp - nrow(gRNA_location_ref)), ' gRNAs from gRNA reference due to non-autosomal chr location'))
# clean up the frame NAs; expected result: 0 removed
temp = nrow(gRNA_location_ref)
gRNA_location_ref = gRNA_location_ref %>% na.omit() # cannot be NA
print(paste0('Removed ', (temp - nrow(gRNA_location_ref)), ' gRNAs from gRNA reference due NA values'))
# clean up filtered out gRNAs
temp = nrow(gRNA_location_ref)
gRNA_location_ref = gRNA_location_ref %>% dplyr::filter( id %in% ondisc_gRNAs) # should be only 1-22
print(paste0('Removed ', (temp - nrow(gRNA_location_ref)), ' gRNAs from gRNA reference due filtered out gRNAs'))
data.table::fwrite(gRNA_location_ref, file = paste0(out_dir, '/gRNA_location_reference.csv'), quote=F,row.names=F, sep=',') ##### ||||| write
Afterwards we can construct the pairs by calculating distances between the sgRNAs and a given gene. In order to do this efficiently, I load the reference into the a GRanges object, then perform an overlap search for each given sgRNA. Overlaps are then inverted to identify genes outside of 10 Mb.
With the arrival of a new cellranger version recently it became posssible to perform both single cell transcriptome sequencing alignment/counting alongside CRISPR feature barcode extraction. As a result, it is not longer necessary to strictly use tools such as UMI-tools to count and extract CRISPR feature barcode data from the sgRNA amplified raw data.
This section will go over the neccessities for setting up and generating cDNA/GDO matrices for the Xie et al dataset, starting from the raw data on GEO all the way through to final, filtered and aggregated cDNA/GDO matrices that are ready for further analysis such as QC and preperation for SCEPTRE.
Before this section begins, theres another section below on 7/7/2022 that discussed the batches and cellranger reagents used in generating the raw sequencing data.
In order to set up CellRanger we need to understand two main aspects about the dataset:
For 1), we know that Xie et al. is structured to use 5 batches which were sequenced using 10 runs (1_1, 1_2, 2_1…5_2) and based on CellRanger documents we know that each sequencing run should be counted seperately: “first step is to run cellranger count…on each individual GEM well prepared”. Then each run is combined using the aggregation function. This is found on the cellranger documentation here: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/aggregate
This means that we need to set up library CSV files specific to each GEM well as we need CSV inputs for each of the cellranger count commmands we make. An example for batch1_1 is shown below:
fastqs,sample,library_type
.../data/Xie_raw_data/,cDNA1_1,Gene Expression
.../data/Xie_raw_data/,GDO1_1,CRISPR Guide Capture
For 2) you must get to know a bit about the cellranger CRISPR feature barcode extraction and the data you are working with. Firstly, cellranger feature barcode extraction requeres not only the sgRNA protospacer sequence but also its position in READ 2 of the sgRNA amplified GDO raw data. What this means is that we need to understand were exactly within R2 of the sequencing data the barcode is being captured since Xie et al. is using MOSIAC-Seq, an inhouse CRISPR barcoding solution. It is important to realize that although it may appear that Xie et al. is using 10xv2 reagents in order to generate the sequencing data, that only applies to the cDNA data - and not at all to the CRISPR feature barcode data. It would thus be incorrrect to use the CRISPR feature baracode patterns suggested on the Cellranger website for 10xv2 reagents. That pattern is only relevant to datasets where the feature barcode data also uses 10xv2 technology.
In order to determine this using only the data provided I reccomend performing a FBA QC analysis using the FBA tool.
In order to setup FBA we need to first gather a set of barcodes and a sgRNA sequences table. The barcodes we can be taken from filtered data sepecific to the batch. I choose to run on batch 1_1 so I used filtered barcodes from GEO. For the sgRNA input for FBA we need a TSV file, no header, detailing id and protospacer sequence.
Here is the first few lines of my FBA sgRNA input:
GHSGRNALIB_1702_00001 TTCAGTTTGCCTTACTCGT
GHSGRNALIB_1702_00002 TGAACCTACCAGGCTTGCG
GHSGRNALIB_1702_00003 CTTGCGTGGAGGGGACAGA
GHSGRNALIB_1702_00004 GTCCTCTCTGCGAAGCTCG
GHSGRNALIB_1702_00005 TCCTCTCTGCGAAGCTCGA
GHSGRNALIB_1702_00006 TCCCTCGAGCTTCGCAGAG
GHSGRNALIB_1702_00007 GGGATCACTGAAGCAGGCC
GHSGRNALIB_1702_00008 GGGCAGTTAAGATAAGCCA
GHSGRNALIB_1702_00009 AGGCATCTGTCCGTAAGCT
And the barcodes (borrowed from filtered data, make sure to trim OFF the batch number since FBA expects no cellranger batch id (example: AAACCTGAGGAGTACC-1)):
AAACCTGAGAGCTTCT
AAACCTGAGATCCCAT
AAACCTGAGCGTCTAT
AAACCTGAGCTAGTCT
AAACCTGAGGACACCA
AAACCTGAGGAGTACC
AAACCTGAGGAGTTTA
AAACCTGAGGCCATAG
AAACCTGAGTATCGAA
Then we can run FBA like so:
fba qc -t 12 -n 333000 \
-1 /project/xuanyao/nikita/SCEPTRE/data/Xie_raw_data/GDO1_S1_L001_R1_001.fastq.gz \
-2 /project/xuanyao/nikita/SCEPTRE/data/Xie_raw_data/GDO1_S1_L001_R2_001.fastq.gz \
-w /home/nbabushkin/xie_mappability_sceptre/count1_filtered_barcodes_clean.tsv \
-f /home/nbabushkin/xie_mappability_sceptre/xie_feature_sgRNAs_FBA_input.tsv \
--output_directory /home/nbabushkin/xie_mappability_sceptre/Xie_GDO_fbaQC_large \
-cb_m 1 \
-fb_m 1
Here you can see that we provide the raw data in the form of gziped FASTQ files, the baracodes, the sgRNAs sequences, set a hamming distance of 1, and sample only 333,000 random R1/R2 reads in order to make this a bit faster.
From the results we want to focus our attention to the base content of R2 since that is the read that will contain our sgRNA protospacer sequence within the sgRNA amplified data.
TEST
Inspecting this further should show that the our barcodes are likely located between position 19 and position 38 in read 2. In order to generate the sgRNA reference for CellRanger we can thus do the following:
# CELLRANGER REFERENCE
excel_XIE_sgRNA_data = pandas.read_excel('/project/xuanyao/nikita/SCEPTRE/data/Xie_2019/TABLES1.xlsx', sheet_name=0)
ids = excel_XIE_sgRNA_data['ID'].tolist()
cids = [idd.replace(':', '_') for idd in ids]
seqs = excel_XIE_sgRNA_data['spacer sequence'].tolist()
pattern = '5P'+'N'*19+'(BC)'
patterns = [pattern]*len(ids)
reads = ['R2'] * len(ids)
ftypes = ['CRISPR Guide Capture']* len(ids)
cellranger_cols = ['id', 'name', 'read', 'pattern', 'sequence', 'feature_type']
df = pandas.DataFrame(list(zip(cids, cids, reads, patterns, seqs, ftypes)), columns=cellranger_cols)
ref_cols = ['id', 'name', 'read', 'pattern', 'sequence', 'feature_type', 'region_pos_hg38','spacer_pos_hg38']
df_ref = pandas.DataFrame(list(zip(cids, cids, reads, patterns, seqs, ftypes, excel_XIE_sgRNA_data['region pos (hg38)'].tolist(), excel_XIE_sgRNA_data['spacer pos (hg38)'].tolist())), columns=ref_cols)
ss = df.size
temp = df.drop_duplicates(subset=['sequence'], keep=False) # drop both duplicates, keep neither sgRNA
if temp.size < ss:
print('duplicates removed: {}'.format(ss - temp.size))
dups = df[df.duplicated(['sequence'], keep=False)]['sequence'].tolist()
df[df.duplicated(['sequence'], keep=False)]
ref = df.drop_duplicates(subset=['sequence'], keep='first')
ref_ref = df_ref.drop_duplicates(subset=['sequence'], keep='first')
ref.to_csv('/home/nbabushkin/xie_mappability_sceptre/xie_feature_reference_m19.csv', sep=',', header=True, index=False)
ref_ref.to_csv('/home/nbabushkin/xie_mappability_sceptre/xie_feature_reference_m19_INFO.csv', sep=',', header=True, index=False)
This code does several things:
Something really important is the pattern:
pattern = '5P'+'N'*19+'(BC)'
This can be tricky becuase we need to remember that cellranger is written and implemented in Python! Why is this important?
Well, Python is 0 based while R is 1 based. What this means is that positional data in python is offset due to 0 being the first index position. For example, the Nth element of a list is accessed by list[N-1]. In R, the Nth element is simply list[N]. This difference means that if you are using a custom feature barcode experiment, such as MOSIAC-Seq used by Xie et al., and not 10xv2/10xv2 reagents for feature barcodes you must take this into account since Cellranger will interpret and create a REGEX expression within python based on pythonic definition of positions.
As a result, if our sgRNA protospacers must match with a hamming distance of 1 to positions (19:38) in READ 2, the 19th position would have 19 nucleotides before it placing those nucleotides pythonically at positions 0 through 18 (19 total nucleotides). Then the 19th position starts the actual barcode.
pattern = '5P'+'N'*19+'(BC)'
Another way of looking at this is to realize that we need to pad the pattern by 1 nucleotides to offset the fact that 19:38 is actually zero based both in FBA and in Cellranger (but not in R).
Now that we have addressed both 1) and 2) from above, we are ready to run cellranger count:
Cellranger Count cmd example:
/home/nbabushkin/cellranger-7.0.1/cellranger count --disable-ui --include-introns false --no-bam --nosecondary --localcores=48 --localmem=128 --id=gemgroup1_1 \
--transcriptome=/project/xuanyao/nikita/SCEPTRE/cellranger_refs/GRCh38-2020-A_build/cellranger_GRCh38reference \
--libraries=/home/nbabushkin/xie_mappability_sceptre/gemgroup_libraries/1_1libraries.csv --feature-ref=/home/nbabushkin/xie_mappability_sceptre/xie_feature_reference_m19.csv
One thing to note is that while this is very straightforward and easy, cellranger aggregate has two downsides at the moment:
Aggregation CSV (aggr_paths_m19_batched.csv):
sample_id,molecule_h5
gemgroup1_1,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup1_1/outs/molecule_info.h5
gemgroup1_2,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup1_2/outs/molecule_info.h5
gemgroup2_1,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup2_1/outs/molecule_info.h5
gemgroup2_2,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup2_2/outs/molecule_info.h5
gemgroup3_1,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup3_1/outs/molecule_info.h5
gemgroup3_2,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup3_2/outs/molecule_info.h5
gemgroup4_1,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup4_1/outs/molecule_info.h5
gemgroup4_2,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup4_2/outs/molecule_info.h5
gemgroup5_1,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup5_1/outs/molecule_info.h5
gemgroup5_2,/scratch/midway3/nbabushkin/Xie_gemgroup_count_m19/gemgroup5_2/outs/molecule_info.h5
CellRanger Aggregate cmd example:
time /home/nbabushkin/cellranger-7.0.1/cellranger aggr --nosecondary --disable-ui --id=AGG_MOSIAC_SEQ --localcores=48 --localmem=160 \
--csv=/home/nbabushkin/xie_mappability_sceptre/aggr_paths_m19_batched.csv
In this section I will go over how to configure cellranger to submit jobs to the Midway3/Midway2 clusters (make your system Admin happy).
Briefly, as an introduction it is worth mentioning that this arose out to neccessity becuase of the sheer compute power necessary to process the KD8 dataset published by Replogle. Specifically, I was able to generate counts for each gemgroup by spreading my CellRanger counts accross the sever with ample space for the temporary file usage (lots of small files, check your file capacity).
Here is the issue that I ran into specifically after about 12hrs into the aggregation function:
2022-10-04 18:59:40 [runtime] (update) ID.AGGKD8.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH.fork0 chunks running (71/129 completed)
2022-10-04 19:05:53 [runtime] (update) ID.AGGKD8.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH.fork0 chunks running (72/129 completed)
2022-10-04 19:11:53 [runtime] (update) ID.AGGKD8.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH.fork0 chunks running (72/129 completed)
2022-10-04 19:17:54 [runtime] (update) ID.AGGKD8.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH.fork0 chunks running (72/129 completed)
2022-10-04 19:24:20 [runtime] (update) ID.AGGKD8.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH.fork0 chunks running (73/129 completed)
2022-10-04 19:30:18 [runtime] (update) ID.AGGKD8.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH.fork0 chunks running (73/129 completed)
2022-10-04 19:36:18 [runtime] (update) ID.AGGKD8.SC_RNA_AGGREGATOR_CS.COUNT_AGGR.SC_RNA_AGGREGATOR.NORMALIZE_DEPTH.fork0 chunks running (73/129 completed)
From these CellRanger Aggregate logs we see that we are getting “stuck” at the NORMALIZE_DEPTH stage - which in theory may be expected. From this we see that it takes a lot of time and there are over 129 steps.
We can configure cellranger to use a template creating a SLURM template like so:
#!/bin/bash
#SBATCH --job-name=__MRO_JOB_NAME__
#SBATCH --account=pi-\xuanyao
#SBATCH --partition=caslake
#SBATCH -N 1
#SBATCH -c __MRO_THREADS__
#SBATCH --mem=__MRO_MEM_GB__G
#SBATCH -o __MRO_STDOUT__
#SBATCH -e __MRO_STDERR__
#SBATCH --time=32:00:00
echo $CONDA_PREFIX
source activate /home/nbabushkin/miniconda3/envs/stingseq9
__MRO_CMD__
We can then submit jobs like this:
/home/nbabushkin/replogle_mappability_sceptre/cellranger-7.0.1/cellranger aggr --disable-ui --id=AGGKD8 \
--csv=/home/nbabushkin/replogle_mappability_sceptre/cellranger_kd8_agg.csv \
--jobmode=/home/nbabushkin/replogle_mappability_sceptre/slurm.template \
--mempercore=24 --maxjobs=80 --jobinterval=1000
The idea now is to double check the Morris cis results by comparing GEO sourced data vs CellRanger re-processed data (HTO, GDO, cDNA) reads.
The following is a summary of the SCEPTRE runs by a new naming convention (numbered), the prefix “GEO” indicates the processed data matrix was downloaded from GEO.
Below are panaled scatterplots for each of the SCEPTRE runs described above from left to right (3 > 4 > 5 > 6):
Over the past week or so, I have writing a combination of snakemake and shell scripts in order to handle a fairly large scRNA-Seq dataset (with CRISPR pertubations).
We can actually see this admission on the official github:
There is no --gzip|--bizp2 option. You have to compress your files explicitly after they have been written.
There is actually a problem related to this missing feature (removed in parrallel fastq dump), the problem is that there is a spike in data usage due to the uncompressed FASTQ data. This data can become massive, with each GEM GROUP taking up 100-120 gb of data. This means that to process all of the gem groups in a single location (SCRATCH, PROJECT) you would need about 273100 gb or 27 tb of free space. This is way too much so we must stagger our extraction* and compression of the data prior to any further processing in order to reduce this temporary footprint to something more manageable.
An example of code that extracts and matches names accross conventions/data manifests. This code wildcard funtions linking rules to custom functions (we use wildcard functions to link, custom python functions):
######################################################################################################
# WILDCARD FUNCTIONS
def get_srrlist_fromgroupid_wildcards(wildcards):
gemgroup = int(wildcards.groupid)
found_srrs = gemgroup_srrs(gemgroup, kd8_manifest, kd8_sraruntable, ret_table = False)
return found_srrs
######################################################################################################
# FUNCTON
# finds srr ids based on a 'sra_runtable', dowloaded from SRA RUN SELECTOR website
# input: list of libraries (gemgroup)
def find_srr(liblist, sra_table, check = True):
if liblist is None or sra_table is None:
return 'check inputs'
try:
srrs = [str(sra_table.loc[sra_table['Library Name'] == lib]['Run'].item()) for lib in liblist] if type(liblist) == list else kd8_sra.loc[kd8_sra['Library Name'] == liblist]['Run'].item() # try-catch list comp
if check:
return (srrs if len(srrs) == len(liblist) else 'cannot find certain runs') if type(liblist) == list else srrs
return srrs
except ValueError:
return 'func error'
# FUNCTON
# finds and matches SRR id to a library name in runtable, return file names
# 1) SRR id, ie: SRR00000
# 2) sra_runtable pandas dataframe, dowloaded from SRA RUN SELECTOR website, which comes from GEO or if you have the 'BioSample' ID code:
# https://www.ncbi.nlm.nih.gov/Traces/study/?acc='BioSample' OR https://www.ncbi.nlm.nih.gov/biosample/'BioSample'
def get_library_name(sra_id, sraruntable):
if not type(sra_id) == str: return 'sra_id must be str ie: SRR000000'
try:
library_name = sraruntable.loc[sraruntable['Run'] == sra_id]['Library Name'].item()
if not len(library_name) != 1: return 'error multiple library names matched'
return library_name
except ValueError:
return 'func error'
# FUNCTON
# finds list of srr ids based on GEMGROUP
# 1) gemgroup, int, corresponding to GEMGROUP in the download manifest, preffered over alternative: library name ex: 'KD8_p1_0'
# 2) kd8_manifest pandas dataframe, downloaded from https://gwps.wi.mit.edu/ lab data website, which comes from the published paper
# 3) sra_runtable pandas dataframe, dowloaded from SRA RUN SELECTOR website, which comes from GEO or if you have the 'BioSample' ID code:
# https://www.ncbi.nlm.nih.gov/Traces/study/?acc='BioSample' OR https://www.ncbi.nlm.nih.gov/biosample/'BioSample'
def gemgroup_srrs(gemgroup, kd8_manifest, kd8_sraruntable, ret_table = False):
if not type(gemgroup) == int: return 'gemgroup must be int'
inputs = kd8_manifest.loc[kd8_manifest['gemgroup'] == gemgroup]
inputs = inputs.iloc[0, 2:34].tolist()
gemgroup_libs = [inpt.strip().replace('_R1_001.fastq.gz', '').replace('_R2_001.fastq.gz', '') for inpt in inputs]
uniq_gemgroup_libs = list(set(gemgroup_libs))
gemgroup_dl_srrlist = find_srr(uniq_gemgroup_libs, kd8_sraruntable, check = True)
if ret_table:
look_up_table = pandas.DataFrame(inputs, columns=['File Names'])
look_up_table['Library Name'] = look_up_table['File Names'].apply(lambda filename: filename.strip().replace('_R1_001.fastq.gz', '').replace('_R2_001.fastq.gz', ''))
look_up_table['SRR'] = look_up_table['Library Name'].apply(lambda library: find_srr(library, kd8_sraruntable, check = True ))
return gemgroup_dl_srrlist, look_up_table
return gemgroup_dl_srrlist
An example of code that creates the libraries CSV input needed to orient CellRanger to location/organization of data
# MAKE LIBRARIES CSV FILE
# this rule does two things:
# 1) reads in gemgroup info from manifest; forming libraries CSV file
# 2) verifies R1/R2 '.fastq.gz' data concordance/integrity
# an error here indicated issue with downloaded files / structure / names
rule make_libraries_csv:
priority: 10
input:
gemgroup_folder_done = config['KD8_gemgroup_download_folder'] + '/GEM_Group_{gemgroupID}/fastqdump.done' # output from fastqdump_.smk
output:
gemgroup_libraries_csv = config['gemgroup_libraries_folder'] + '/{gemgroupID}_libraries.csv'
params:
gemgroup = '{gemgroupID}',#lambda wildcards, output: wildcards.gemgroupID,
kd8_manifest_file = config['KD8_manifest'], # manifest file from paper
gemgroup_folder = config['KD8_gemgroup_download_folder'] + '/GEM_Group_{gemgroupID}'
resources: time_min=10, mem='2G', nodes=1, cpus=1, partition='caslake' # UNTIMED ATM
run:
import os
gemgroup = params.gemgroup # this comes from the snakemake wildcards per gemgroup
gemgroup_size = 4 # 2 cDNA samples (2x4 lanes = 8*2 R1/R2 =16 ), 2 sgRNA samples (2x4 lanes = 8*2 R1/R2 =16 )
gemgroup_seq = [1,2]
manifest = pandas.read_csv(params.kd8_manifest_file, sep = ',')
print('current gemgroup: {}'.format(gemgroup))
library_name = manifest.loc[manifest['gemgroup'] == int(gemgroup)]['library'].tolist()[0]#.item()
print('current library_name: {}'.format(library_name))
library_part = library_name.split('_')[1]
library_run = library_name.split('_')[2]
library_filename_cDNA = 'KD8_seq{{seq}}_{library_part}_mRNA_{library_run}'.format(library_part=library_part, library_run=library_run)
library_filename_sgRNA = 'KD8_seq{{seq}}_{library_part}_sgRNA_{library_run}'.format(library_part=library_part, library_run=library_run)
library_filename_seqs = [library_filename_cDNA.format(seq=seq) for seq in gemgroup_seq] + [library_filename_sgRNA.format(seq=seq) for seq in gemgroup_seq]
original_files = manifest.loc[manifest['gemgroup'] == int(gemgroup)]
original_files = original_files.iloc[0, 2:34].tolist()
for expected_file in original_files:
fpath = os.path.join(params.gemgroup_folder, expected_file)
if os.path.isfile(fpath): # check file exists
print('file path exists for: {} FILE: {}'.format(expected_file, fpath))
else:
error_file = os.path.join(params.gemgroup_folder, 'file_error_{}.err'.format(expected_file))
print('error with {}'.format(expected_file))
shell('touch {error_file}')
out_table = pandas.DataFrame([params.gemgroup_folder]*gemgroup_size, columns=['fastqs'])
out_table['sample'] = library_filename_seqs
out_table['library_type'] = out_table['sample'].apply(lambda name: 'Gene Expression' if name != None and 'mRNA' in name else 'CRISPR Guide Capture' )
out_table.to_csv(output.gemgroup_libraries_csv,index=False)
Notably, here is the processing time for GEM GROUP: 4 (part of the completed trail run):
real 844m20.770s ----> ~ 14hr PER GEMWELL (!!!)
user 3322m53.806s
sys 167m10.436s
this is shocking becuase we are limited to about processing 20-25 gemwell in scratch
(we can sumbit all 20-25 at a time for 14 hrs run )
another thing is that we can break this down further via logs:
how does cDNA time compare to GDO time?
of the 14 hrs 12.5 were spent on processing GDO counts via: SC_MULTI_CORE.COUNT_ANALYZER._CRISPR_ANALYZER.CALL_PROTOSPACERS.fork0 chunks_running
only: 0:33 - 1:30 so the cDNA took almost 1 hr and thats it
One of the things I have been recently working on was moving all allignment and counting over to CellRanger. This comes after after analyzing a lot of the Morris data with Kallisto-bustools alignment and counting for the GDO and HTO matrices, two important modalities capturing pertubation and improving data quality in single cells experiments, respectively. The concrete reason behind is a recent email with one of the authors were it was made clear to us that other leaders in the field have noticed that Kallisto-Bustools makes mistakes during alignment and counting, leading to issues resulting in missing counts for CRISPR sgRNAs bearing similar sequences - which is ALWAYS the case becuase generally single cell pertuybation experiments, in recent literature, always target regions with at least 2 sgRNAs. So using Kallisto-bustools seems to offer only a disadvantage and a source of noise if some reads are thrown out and others are miscounted.
In order to depart from using Kallisto-bustools, running a custom CellRanger is necessary for two reasons:
These are two tenants of using CellRanger on datasets outside of the 10X universe. If you are using 10X chemistry and strictly using their reagents per their protocol then not much additional information/modification is necessary becuase CellRanger comes more or less preconfigured. On the CellRanger website there are concrete and succinct examples demonstrating alignment and counting for multimodal scRNAseq, such as pertubation experiments.
This is not the case for Morris et al data; where ECCITE-Seq was used.
To adapt CellRanger we need to address the two points above:
HTO and GDO sequences within ECCITE-Seq are not oriented the same way, the HTO sequence is captured on the forwards strand, while the GDO sequence captured in ECCITE-Seq is on the reverse strand. To adjust for this we take the reverse compliment of the designed sgRNA, using this as CellRanger input (for GDO sequences).
CellRanger must be given a search pattern specific to the chemistry, for ECCITE-Seq the HTO pattern is simply the the 5’ prime end followed by the HTO sequence. This would inputted to CellRanger as “5P(BC)”.
For the GDO sequences this is not the case becuase there is a CRISPR specific adjacent sequence which is used as an anchor during sgRNA capture. Thus we need a specfic sequence relevant to the primer used to form the pattern we need to input into CellRanger.
One way to approach this is to brute force and discover the adjacent sequence yourself. In order to do this we can take a random sampling of R2 reads - where the sgRNA capture took place, selecting only the firs 40nt becuase we dont need to look past that. Then we can plot the base content percentages at each nt position within our sampled R2 reads.
This yields two important peices of information:
Using this we can then form the GDO pattern: 5PGCATAGCTCTTAAAC(BC). Which tells CellRanger that the 5 prime end is followed by the adjacent sequence and then the barcode (sgRNA, reverse compliment).
This section of notes will be for taking a first look at the data availibility for Replogle et al (Cell 2022).
Briefly, there are 4 sets of cells sequenced.
In this dataset, all of the expressed genes (~9.9k) were targetted at day 8 after transduction.
In this dataset, DEPMAP essential genes (~2.3k) were targetted at day 7 after transduction in RPE1 cells.
In this dataset, DEPMAP essential genes (~2.3k) were targetted at day 6 after transduction in K562 cells.
The experiment is actually the same as cell set #1 or the K562 cells but the difference is that a different primer was introduced prior to sequencing in order to adapt the Perturb-seq Illumina strategy to be compatable with the Ultima sequencing platform. The scRNA seq libraries from set 1 (KD8) were then sequenced on Ultima platform in order to test its viability over the more popular Illumina sequencing platform.
All of the data for the 4 sets above (KD8, RD7, KD6, and KD8_ultima) are availible in 3 formats:
FASTQ/BAM: this is essentially the raw data option. The format of the data and organization is provided on figshare, but generally, the FASTQs are named and correspond to each 10x lane / GEM well group that was used. The strategy would be to download and run CellRanger on each GEM group / 10x lane and then run CellRanger aggregate on the resulting counts.
MEX: this is the output of CellRanger aggregate in the pipeline described above. Specifically, this is the output stored in filtered feature bc matrix folder after the CellRanger has finished running. These matrices should just be CellRanger ‘processed and filtered’, with no additional filtering beyond STAR alignment, counting, and removing empty cells/doublets via the CellRanger method (analogous and derived from EmptyDrops method by Lun et al. 2019)
Processed data: this is the output of the processing described in the methods section. Data from above was 1) adjusted, the counts scaled w/ factors to account for variable sequencing depth accross gem groups 2) fitlered based on QC metrics (total adjusted UMI count aka depth and mitochondrial percent) 3) subset to cells bearing sgRNA(s) targetting a single gene and 4) UMI count normalized and relative z normalized.
The raw data pertaining to 1) above is actually well organized becuase the authors included a file manifest. This is important becuase it makes running CellRanger count quicker and straightforward even if there are many files to deal with.
Looking at the file manifest and GEO, we can see that each library (ie: KD8) is split into 3 parts (1, 2, and 3) and each part has a numbered run. Put together, the run number and library part correspond to a single GEM group number (for the KD8 experiment there are 1-273 GEM group numbers). This is important becuase each FASTQ file is named according to the GEM group (1-273) but uses the part based/run number naming in the manifest for the FASTQ file themselves. For example, the file name for KD8 part 3 run 5, aka gem group #235, would be KD8_seq1_p3_mRNA_5 followed by 10x naming convention. The easiest way to incorporate this into a pipeline would be to extract the file names corresponding to each GEM group / 10x lane from the manifest and match that to files downloaded from SRA, keeping the trimmed filenames, seperately, as the sample names for later. Each group of filenames from this process would then correspond to the input files, and input sample names, necessary for each CellRanger count command needed to process the data (1-273 for KD8). In other words, each line in the manifest is the input for the CellRanger count pipeline + each line contains the sample names needed for telling CellRanger how to find the FASTQs.
This is included in the file manifest discussed above, so the process for handling raw FASTQ files pertaining to the sgRNA capture is the same. The main difference is the naming for run number and library part now contains ‘sgRNA’ in place of ‘cDNA’ (see the example file name above). You would merely append the sample names derived from the sgRNA FASTQ files to the cDNA sample names in the CellRanger command libraries file (–libraries config file), indicating that the sgRNA sample names correspond to CRISPR guide capture.
The following notes are regarding the covariates used in the Morris STING-Seq analysis within SCEPTRE.
Briefly, there are 5 covariates: total gene UMIs, unique gene UMIs, total gRNA UMIs, unique gRNA UMIs, and lastely: mitochondrial ratio. Batch is normally included as well as a 6th covariate but in the case of the Morris dataset, there is only a single batch throughout the entire experiment becuase Morris et al ran a single lane/GEM well during sequencing. For other analyses, for example, the 2.5M Replogle dataset, there are 287 GEM Wells that were used to capture all of those 2.5M cells. Each cell barcode will actually be tagged with a number (1-287) indicating the batch that cell arose from. From this we can form the batch covariate per cell and then add it to the coviarate matrix. More details are provided below for handling and assigning batch correctly.
This covariate is calculated by taking the total UMI count (sum) across all detected genes, prior to filtering genes. An example of a filter would be genes expressed in at least 0.525% of the cell population.
This covariate is calculated by taking the amount of unique, or nonzero UMI counts (sum) across all detected genes, prior to filtering genes.
This covariate is calculated by taking the total UMI counts (sum) across all assayed gRNAs, prior to filtering the gRNAs for SCEPTRE. An example of a gRNA filter would be percentage or count based threshold defining well-captured/well-represented gRNAs within the cell population. Similar to gene defintion above.
This covariate is calculated by taking the amount of unique, or nonzero, UMI counts (sum) across all assayed gRNAs, prior to filtering the gRNAs for SCEPTRE, AFTER a threshold (later used in SCEPTRE) is applied. For example, to calculate these we would take the raw unfiltered GDO matrix (from GEO or from our own re-processing pipeline), apply the threshold we will later use (ie 5 UMIs counts), then take the column sum of nonzero elements (colSum(gRNA_thresholded != 0)).
This covariate is calculated by finding the percentage of reads mapping to mitochontrial genes per cell. Typically, I would do this by first extracting all mitochondrial encoded genes from the reference GTF annotation (in the case of Morris: ENSEMBL v96). This vector of ENSEMBL ids is passed as an arguement to Suerats FeaturePercentage function, which can calculate the percentage of reads corresponding to a particular gene set. In our case its mito genes so the result is the percentage of reads mapping to mitochondrial genes. This value for all cells is then taken out of Seurat and saved to a CSV file, so that it can be later used in generating the covariate matrix.
This imporatant covariate is not used in the Morris preprint, BUT I
still want to talk about it becuase it is important to correct for batch
effects arising from different GEM Wells used during sequencing, since
each 10x GEM well can only produce a finite set of valid barcodes while
the amount of cells in contemporary studies is ever increasing. The
easiest way to address this is to pull GEM well numbers from cellbarcode
extensions (cellbarcode-X) since part of the CellRanger pipeline
includes seperating CellRanger count calls by GEM well / 10x lane so
each set of barcodes included in the final matrix will actually have an
extension corresponding to this step in CellRanger count. The output of
cellranger aggregate used during QC, filtering, and preparing for
SCETPRE thus already has the batch number for each cell baked into the
cellbarcode. We would thus extract this in practice by calling:
batch_n = stringr::sapply(strsplit(colnames(cDNA_matrix), '-'), "[", 2); names(batch_n)=colnames(cDNA_matrix)
One of the original questions I was investigating was p-value inflation in the SCEPTRE results and non-concordance between our cis results and Morris results at the SNP raw p-value level (overlap based on FDR is strong). In order to investigate the covars, I decided to compare the old covariates provided in the Morris supplement with those derived as part of the SCEPTRE processing pipeline. I chose to start my comparison by plotting scatter plots between the original covariates and the GEO-QC derived covariates. As shown in previous notes, we have seperate SCETPRE runs dedicated to both the GEO sourced data (GEO QC-X) and the reprocessed raw data (QC-X).
From this gridplot we can see there are two very problematic covariates; with unique gRNAs being the most extreme. I also think these show two seperate issues going on at once. For example, we can immediately see that there was a mistake in calculating the unique, nonzero gRNA UMIs. They appear uniformly larger which suggests a procedural error. This turned out to be the exact case: I was originally calculating the unique nonzero gRNAs based on the final, filtered, un-thresholded gRNA matrix. In other words, I was using the final, filtered, gRNA UMI count matrix then performing a colSums on each cell. This lead to larger than expected counts becuase cells containing a gRNA will low UMIs would still be nonzero. But Morris clearly used a covariate here calculated 1) on the thresholded gRNA matrix 2) on the unfitlered matrix. As a result, the covariate calculated in our analysis was much higher as seen on the scatterplot. Lastly, the total gRNA counts exhibit the opposite behaviour, they seem to be larger in the GEO gRNA matrix. This, along with a similar, weaker, pattern can be seen in the total gene UMIs as well, and it suggests that the covariates are actually calculated prior to filtering. Why? Becuase the sums have feature counts across genes and gRNAs that may have been potentially fitlered out/removed due to QC. This would result in the original covariates being slightly larger, which is the pattern we see in both the total gene UMIs and total gRNA UMIs. By calculating the covariates prior to filtering the matrices (cDNA + GDO) we can correct this mistake.
After correcting the issues seen above, I re-ran the paper cell barcode GEO-QC SCEPTRE run, which is a GEO-QC dedicated to running on the original paper cell barcodes (9343 cells, original genes, no custom filtering, aimed at replicating the cis results). I then replotted the covariate matrices.
From this we can see that addressing both issue #1 and issue #2 explained above lead to an alignment of the covaraites (1:1 with the original, supplement-reported covariates per cell).
In order to evaluate if this change had an impact on the concordance of SNP (raw) p-values in the cis results, I ran the GEO-QC dedicated to the original paper barcodes and generated scatter plots between the p values seen in the SNP results reported in the supplement (1, x-axis) vs. p-values from the GEO-QC SCEPTRE run (2, y-axis). I then made a panel plot showing the original, weak concordance in PANEL A. This was actually orignally from a previous meeting, in the meeting we made this exact scatter plot, showing a large problem with the SNP p-values.
Then, from left to right, in PANEL B and in PANEL C we see two progressive changes corresponding to each issue identified above. Panel B shows the effect of correcting the unique nonzero gRNA count to be based on the thresholded matrix and Panel C shows the effect of correcting both the nonzero gRNA counts AND performing the covariate computation on the unfiltered matrices (as done by Morris). Panel C corresponds to the corrected, recalculated covariates plotted above.
Reminder In each of the scatter plots above, the baseline reference termed “1” refers to the original results (the p value reported in the supplement) while “2” refers to the GEO-QC SCEPTRE run dedicated to original cellbarcodes.
From this I think I can conclude, to some degree, that the lack of concordance we saw in the left-most panel was likely due to the difference in covariates illustrated by the two covariate scatter-panels above. The panel on the right seems to show raw cis SNP p values that are more concordant with the original results, specifically in the moderate p-value range corresponding to log10s 2-4 (so the p value is between 0.01 - 0.0001).
This raised questions for me:
This is interesting to me becuase gene UMIs and gRNA UMIs for genes and gRNAs, repectively, that are not included in the final analysis (low expressed genes and low capture gRNAs) are included in the covariates. Maybe I am just not used to this. But another thing to note is the difference between the middle and right panels wherein the difference in covariates was wether or not they were calculated based on filtered or unfiltered matrices, respectively for the middle and right panels. But the difference is subtle compared to progression seen from left to middle (correcting noznero gRNA sums to be thresholded).
The purpose of this section/notes is to elaborate and explore the library design choices used to create the sgRNAs NTC used in the Morris experiment. This was borne mostly from an interest to explore a question asked by Dr. Im during journal club regarding the exact nature of the sgRNAs being used as controls in the experiment.
The goal is to compare the Morris gRNA matrix in GEO against the re-processed Kallisto bustools gRNA matrix of the same data, subset to the 9343 cells used in the orginal analysis. We want to answer the question: is there a significant difference between the two same gRNA matrixes? This is important because these matrixes are used, respectively, for the GEO-QC and QC RUNS below.
Our general expectation is that the two matrixes are near identical or at least contain concordant UMI counts between the identical cell barcodes in each matrix. We expect this becuase both gRNA matrixes were generated through the Kallisto-Bustools KITE workflow. Briefly, this workflow consists of two steps: 1) making a novel fasta reference for each possible gRNA sequence (hammiing distance = 1bp) 2) counting appearences of sequences from (1).
Both matrixes contain counts for 210 Morris gRNAs, but differ in the total amount of cells (unfiltered) lighty. The GEO gRNA matrix size is: 210 X 137.3K and the re-processed matrix size is: 210 X 137K, a ~300 cell difference. We subset the gRNA matrixes to the original analysis cells, using their cell barcodes reported in the supplement.
In order to explore the comparison, we plot scatter plots and histograms showing the distribution of UMI counts per gRNA across all 210 gRNAs subset to 9343 cells used in the original analysis. Below we look at a few specific examples:
Here we see that the UMI counts between matrixes are strongly concordant with most of the 9343 cells containing between 0-50 UMIs for this gRNA (histogram 0-100 clearly shows the bulk of the cells containing UMIs greater than 5).
We can see based on these two gRNA that sometimes a gRNA is clearly better in terms of the number of cells that gRNA is later detected in. But in both cases the cell-specific UMI count was mostly concordant despite on gRNA being clearly better in the context of the experiment.
For example, lets look at one of the problem gRNA we identified earlier again:
In another case, vice versa behaviour, the GEO matrix seems to contain UMI counts while the reprocessed matrix contains almost no cells with these two PTC gRNA targetting the same PTC gene.
The goal here is look closer into various QC methods and their affect on overlap in terms of signal (significant pairs). Briefly, the Morris data was either taken from GEO (GEO-QC) or reprocessed from raw data (QC) and then the orginal cis pairs were tested for association with the expectation that there will be a strong overlap in signal and concordant raw p values amongnst cis signals.
RUN | cell detectability threshold | min unique UMIs | min total UMIs | dataset desc. |
---|---|---|---|---|
GEO original | ~5% | 850 | X | (original pairs + original results as reported in manuscript/supplement) |
GEO paper QC | X | X | X | original pairs + data QC: subset to cell barcodes/genes/gRNAs reported in results |
GEO-QC1 | 0.5% | 1200 | 1200 | original pairs + original NTCs |
GEO-QC2 | 5% | 1200 | 1200 | original pairs + original NTCs |
GEO-QC3 | 5% | 1200 | 1200 | cis pairs based on 500 kb distance + all NTCs |
QC1 | 0.5% | 1200 | 1200 | original pairs + original NTCs |
QC2 | 5% | 1200 | 1200 | original pairs + original NTCs |
QC3 | 5% | 1200 | 1200 | cis pairs based on 500 kb distance + all NTCs |
Its important to point out that the first two RUNS are control-esque. The first run entered is simply the results reported in the supplement and QC metrics taken from the paper (850). The second RUN in the table is aimed at re-creating the paper results as accurately as possible by 1) No new QC 2) original pairs. The idea here is that when we take data from GEO and rerun the orginal cis pairs we should get nearly identical p values and equal signals since its as close to the manuscript results as possible.
The other RUNS are seperated into GEO-QC or QC categories to indicate where the data is sourced from and the table further idicates which QC thresholds were used on those cis SCEPTRE runs.
Here we compare the original results to the results from GEO-QC2. In both cases, FDR and Bonferroni corrections were applied to BOTH the SNP and PTC pairs, becuase there are only a few PTC pairs that we test for association (11).
Here we plot each gRNA (210 total: SNP, PTC, NTC subtypes) as a point, and we define the axiis as the proportion of cells (out of 9343 cells reported in supplment) that contain a gRNA after a 5 UMI threshold is applied. Cells with UMI counts lower than the threshold are considered zeroes/dont contain a gRNA. This is done to simulate the threshold later used to binarize the gRNA matrix during SCEPTRE.
Here we plot the same set of gRNAs (210) across the same set of cells (9343) as above but we change our definition of detectability to be simply non zero UMI counts (>= 1 UMI). For example, a gRNA is ‘detected’ if a cell has at least 1 UMI corresponding to that gRNA.
Below is a table of each of the proportions per gRNA based on the >= 5 UMI threshold (plot #1 above).
gRNAs | geo_matrix_gRNAs | reprocessed_matrix_gRNAs |
---|---|---|
SNP-11_1 | 0.0322166 | 0.0324307 |
SNP-11_2 | 0.0072782 | 0.0072782 |
SNP-1_1 | 0.0089907 | 0.0090977 |
SNP-26_2 | 0.1140961 | 0.1164508 |
SNP-26_1 | 0.0225838 | 0.0226908 |
SNP-1_2 | 0.0138071 | 0.0143423 |
SNP-28_1 | 0.1249063 | 0.1268329 |
SNP-28_2 | 0.0396018 | 0.0401370 |
SNP-20_1 | 0.0068500 | 0.0071711 |
SNP-20_2 | 0.1597988 | 0.1693246 |
SNP-82_1 | 0.0370331 | 0.0389596 |
SNP-13_2 | 0.0364979 | 0.0369260 |
SNP-13_1 | 0.0421706 | 0.0432409 |
SNP-82_2 | 0.0649684 | 0.0655036 |
SNP-55_2 | 0.0176603 | 0.0176603 |
SNP-80_2 | 0.1107781 | 0.0175532 |
SNP-80_1 | 0.0711763 | 0.0451675 |
SNP-75_2 | 0.0653966 | 0.0158407 |
SNP-75_1 | 0.0001070 | 0.0001070 |
SNP-22_2 | 0.0313604 | 0.0317885 |
SNP-22_1 | 0.0111313 | 0.0113454 |
SNP-86_1 | 0.0338221 | 0.0356417 |
SNP-74_2 | 0.0248314 | 0.0252596 |
SNP-86_2 | 0.2995826 | 0.3065397 |
SNP-49_2 | 0.0024617 | 0.0029969 |
SNP-58_1 | 0.0080274 | 0.0081344 |
SNP-58_2 | 0.0460238 | 0.0463449 |
SNP-49_1 | 0.0416354 | 0.0420636 |
SNP-7_2 | 0.0474152 | 0.0338221 |
SNP-7_1 | 0.0050305 | 0.0034250 |
SNP-36_1 | 0.0433480 | 0.0073852 |
SNP-36_2 | 0.0000000 | 0.0001070 |
SNP-53_2 | 0.0971851 | 0.0469870 |
SNP-53_1 | 0.0568340 | 0.0000000 |
SNP-74_1 | 0.1135610 | 0.1152735 |
SNP-34_2 | 0.1540191 | 0.1557316 |
SNP-34_1 | 0.0439902 | 0.0443112 |
SNP-5_1 | 0.0033180 | 0.0038532 |
SNP-5_2 | 0.0741732 | 0.0778123 |
SNP-55_1 | 0.0601520 | 0.0610082 |
SNP-24_2 | 0.0600450 | 0.0605801 |
SNP-85_2 | 0.0021406 | 0.0021406 |
SNP-85_1 | 0.1891255 | 0.1931928 |
HSPA8-2 | 0.1223376 | 0.0299690 |
SNP-32_1 | 0.0120946 | 0.0012844 |
SNP-32_2 | 0.0000000 | 0.0001070 |
HSPA8-1 | 0.1440651 | 0.0922616 |
SNP-15_1 | 0.0021406 | 0.0022477 |
SNP-3_2 | 0.0378893 | 0.0381034 |
SNP-3_1 | 0.1768169 | 0.1790645 |
SNP-15_2 | 0.0452745 | 0.0453816 |
SNP-51_1 | 0.0000000 | 0.0147704 |
CD55-8 | 0.0009633 | 0.0005352 |
SNP-17_2 | 0.0376753 | 0.0187306 |
SNP-17_1 | 0.0459167 | 0.0220486 |
CD55-9 | 0.0101680 | 0.0103821 |
SNP-41_2 | 0.0082415 | 0.0085626 |
SNP-57_2 | 0.0002141 | 0.0003211 |
SNP-57_1 | 0.0492347 | 0.0355346 |
SNP-41_1 | 0.0064219 | 0.0067430 |
SNP-51_2 | 0.0000000 | 0.0016055 |
SNP-30_2 | 0.0184095 | 0.0186236 |
SNP-30_1 | 0.0132720 | 0.0132720 |
SNP-73_2 | 0.0295408 | 0.0305041 |
SNP-25_1 | 0.0000000 | 0.0599379 |
SNP-25_2 | 0.0000000 | 0.0066360 |
SNP-19_1 | 0.0253666 | 0.0254736 |
SNP-8_1 | 0.0409933 | 0.0420636 |
SNP-8_2 | 0.1845232 | 0.1899818 |
SNP-19_2 | 0.0019266 | 0.0019266 |
RPL22-2 | 0.0351065 | 0.0366049 |
RPL22-1 | 0.0369260 | 0.0374612 |
SNP-27_2 | 0.0187306 | 0.0187306 |
SNP-27_1 | 0.1090656 | 0.1108852 |
CD55-3 | 0.0000000 | 0.0000000 |
SNP-65_1 | 0.0004281 | 0.0003211 |
SNP-65_2 | 0.0002141 | 0.0000000 |
CD55-10 | 0.0000000 | 0.0014984 |
SNP-21_1 | 0.0216205 | 0.0217275 |
SNP-21_2 | 0.1367869 | 0.1420315 |
SNP-12_2 | 0.0969710 | 0.0992187 |
SNP-83_1 | 0.0032110 | 0.0002141 |
SNP-83_2 | 0.0814514 | 0.0075993 |
SNP-12_1 | 0.0099540 | 0.0100610 |
NTC-12 | 0.0933319 | 0.0956866 |
SNP-81_2 | 0.0718185 | 0.0236541 |
SNP-81_1 | 0.0001070 | 0.0000000 |
SNP-38_2 | 0.0614364 | 0.0621856 |
NTC-11 | 0.0070641 | 0.0070641 |
SNP-38_1 | 0.1068179 | 0.1082094 |
NTC-10 | 0.0369260 | 0.0374612 |
SNP-23_2 | 0.0000000 | 0.0000000 |
SNP-23_1 | 0.0268650 | 0.0048164 |
SNP-71_2 | 0.0014984 | 0.0014984 |
SNP-71_1 | 0.0212994 | 0.0215134 |
SNP-6_2 | 0.0459167 | 0.0467730 |
SNP-35_2 | 0.0157337 | 0.0161618 |
SNP-35_1 | 0.0648614 | 0.0671091 |
SNP-6_1 | 0.0345713 | 0.0347854 |
SNP-77_1 | 0.0085626 | 0.0086696 |
SNP-77_2 | 0.0369260 | 0.0372471 |
SNP-33_1 | 0.0221556 | 0.0222627 |
SNP-33_2 | 0.0058868 | 0.0058868 |
SNP-4_1 | 0.0559777 | 0.0224767 |
SNP-4_2 | 0.0209783 | 0.0012844 |
SNP-24_1 | 0.0069571 | 0.0070641 |
SNP-2_2 | 0.1718934 | 0.1734989 |
SNP-14_1 | 0.0174462 | 0.0174462 |
SNP-14_2 | 0.0590817 | 0.0613293 |
SNP-2_1 | 0.0366049 | 0.0384245 |
SNP-31_2 | 0.0001070 | 0.0002141 |
SNP-31_1 | 0.0011774 | 0.0011774 |
SNP-16_2 | 0.0490207 | 0.0055657 |
SNP-16_1 | 0.0000000 | 0.0124157 |
SNP-61_1 | 0.1446002 | 0.1474901 |
SNP-61_2 | 0.0477363 | 0.0480574 |
SNP-9_1 | 0.0031039 | 0.0032110 |
SNP-62_2 | 0.0061008 | 0.0003211 |
SNP-62_1 | 0.0033180 | 0.0000000 |
SNP-9_2 | 0.0387456 | 0.0400300 |
SNP-45_2 | 0.1008241 | 0.0215134 |
SNP-45_1 | 0.0245103 | 0.0171251 |
SNP-64_1 | 0.0168040 | 0.0171251 |
SNP-18_1 | 0.0428128 | 0.0433480 |
SNP-18_2 | 0.0291127 | 0.0295408 |
SNP-64_2 | 0.0447394 | 0.0449534 |
CD52-1 | 0.0105962 | 0.0117735 |
NTC-9 | 0.0598309 | 0.0602590 |
CD52-2 | 0.1172000 | 0.0620786 |
CD55-2 | 0.0000000 | 0.0077063 |
SNP-43_1 | 0.0000000 | 0.0338221 |
SNP-43_2 | 0.0000000 | 0.0078133 |
CD55-1 | 0.0462378 | 0.0474152 |
CD55-6 | 0.0000000 | 0.0183025 |
CD55-7 | 0.1205180 | 0.0718185 |
CD55-4 | 0.2636198 | 0.0810232 |
CD55-5 | 0.0000000 | 0.0002141 |
NTC-7 | 0.0009633 | 0.0010703 |
NTC-6 | 0.0842342 | 0.0858397 |
NTC-5 | 0.0709622 | 0.0725677 |
NTC-4 | 0.0296479 | 0.0306112 |
NTC-3 | 0.2083913 | 0.2126726 |
NTC-2 | 0.1292947 | 0.1396768 |
NTC-1 | 0.0072782 | 0.0078133 |
SNP-39_2 | 0.0111313 | 0.0113454 |
SNP-39_1 | 0.2826715 | 0.2926255 |
NTC-8 | 0.2679011 | 0.2755004 |
SNP-63_2 | 0.0987905 | 0.1026437 |
SNP-73_1 | 0.0872311 | 0.0894788 |
SNP-10_1 | 0.0000000 | 0.0000000 |
CD46-2 | 0.0375682 | 0.0080274 |
SNP-37_1 | 0.0162689 | 0.0163759 |
SNP-10_2 | 0.0270791 | 0.0217275 |
PPIA-1 | 0.0048164 | 0.0049235 |
SNP-37_2 | 0.0220486 | 0.0223697 |
SNP-63_1 | 0.0938671 | 0.0953655 |
NMU-2 | 0.3113561 | 0.3267687 |
SNP-70_2 | 0.0186236 | 0.0012844 |
SNP-70_1 | 0.0000000 | 0.0074922 |
SNP-78_2 | 0.1085305 | 0.1099219 |
SNP-78_1 | 0.0168040 | 0.0169111 |
SNP-76_1 | 0.0359628 | 0.0387456 |
SNP-76_2 | 0.0900139 | 0.0942952 |
SNP-66_2 | 0.0017125 | 0.0054586 |
SNP-66_1 | 0.0001070 | 0.0779193 |
PPIA-2 | 0.4319812 | 0.4685861 |
SNP-60_1 | 0.1437440 | 0.0230119 |
SNP-60_2 | 0.0498769 | 0.0001070 |
SNP-68_1 | 0.0265439 | 0.0268650 |
SNP-47_1 | 0.0571551 | 0.0580113 |
SNP-47_2 | 0.0064219 | 0.0064219 |
SNP-68_2 | 0.1665418 | 0.1700739 |
SNP-44_2 | 0.0002141 | 0.2578401 |
SNP-44_1 | 0.0027828 | 0.0363909 |
SNP-69_1 | 0.0204431 | 0.0205501 |
SNP-69_2 | 0.0278283 | 0.0288986 |
SNP-87_1 | 0.0396018 | 0.0406722 |
SNP-87_2 | 0.0340362 | 0.0341432 |
SNP-42_1 | 0.0002141 | 0.0000000 |
SNP-42_2 | 0.0113454 | 0.0012844 |
SNP-29_1 | 0.0363909 | 0.0366049 |
SNP-29_2 | 0.1875201 | 0.1912662 |
SNP-88_2 | 0.0057797 | 0.0057797 |
SNP-88_1 | 0.0012844 | 0.0012844 |
SNP-59_1 | 0.0501980 | 0.0507332 |
SNP-48_2 | 0.0222627 | 0.0229048 |
SNP-48_1 | 0.0841272 | 0.0890506 |
SNP-59_2 | 0.0222627 | 0.0222627 |
SNP-67_2 | 0.1027507 | 0.0287916 |
SNP-72_1 | 0.0299690 | 0.0238681 |
SNP-72_2 | 0.0229048 | 0.0116665 |
SNP-50_1 | 0.0357487 | 0.0364979 |
SNP-50_2 | 0.0400300 | 0.0408862 |
SNP-79_2 | 0.0094188 | 0.0099540 |
SNP-79_1 | 0.1937279 | 0.2000428 |
SNP-67_1 | 0.0829498 | 0.0522316 |
SNP-52_2 | 0.0191587 | 0.0198009 |
SNP-52_1 | 0.0092048 | 0.0094188 |
SNP-84_2 | 0.0000000 | 0.0000000 |
SNP-84_1 | 0.0001070 | 0.0000000 |
CD46-1 | 0.0004281 | 0.0001070 |
SNP-54_1 | 0.0270791 | 0.0050305 |
SNP-54_2 | 0.0000000 | 0.0002141 |
SNP-40_2 | 0.0070641 | 0.0018195 |
SNP-40_1 | 0.0002141 | 0.0004281 |
NMU-1 | 0.0062079 | 0.0066360 |
SNP-46_1 | 0.0330729 | 0.0093118 |
SNP-46_2 | 0.0157337 | 0.0090977 |
SNP-56_2 | 0.0127368 | 0.0063149 |
SNP-56_1 | 0.0079204 | 0.0031039 |
From these scatter plots we see that several gRNAs are starkly different between datasets. While most gRNAs are found in an equal proportion of the same 9343 cells, some do not fall on the 1:1 line (black line above).
For example: SNP 44-2 appears to be barely detected (~ 0 proportion => ~ 0 cells) while in the reprocessed gRNA matrix that same gRNA was detected with a threshold >= 5 UMIs in almost ~0.25 or 25% of cells (9343 cells => ~ 2225 cells). This is strange because we are comparing the same group of cells (9343 cell barcodes) so the cells should have roughly the same gRNA “content” by raw UMI counts and by detectability (thresholded UMI counts).
Taking a closer look at particularly which cells had SNP-44-2 and what exactly are the UMI counts:
# First we are going to inspect the GEO gRNA matrix:
# load original paper matrix
GDO = '/project/xuanyao/nikita/SCEPTRE/data/Morris_2021/GSM5225858_GDO.mtx.gz'
GDO_cbc = '/project/xuanyao/nikita/SCEPTRE/data/Morris_2021/GSM5225858_GDO.barcodes.txt'
GDO_features = '/project/xuanyao/nikita/SCEPTRE/data/Morris_2021/GSM5225858_GDO.features.txt'
f_grna_matrix = ReadMtx(mtx = GDO, features = GDO_features, cells = GDO_cbc, cell.column=1, feature.column=1, mtx.transpose=T)
# pull out those >= 5 UMI counts in GEO
THRESHOLD_cells = (f_grna_matrix['SNP-44_2', paper_cbcs])[ (f_grna_matrix['SNP-44_2', paper_cbcs]) >= 5] # 2 total
# pull out those >= 1 UMI counts in GEO
non_zero_cells = (f_grna_matrix['SNP-44_2', paper_cbcs])[ (f_grna_matrix['SNP-44_2', paper_cbcs]) >= 1] # 87 total
Here are the cell-level UMI counts for SNP 44-2 in the GEO gRNA dataset:
r$> THRESHOLD_cells # showing 2 / 2
CAGATCAGTTCAGGCC TCATTACGTAACGACG
5 11
r$> non_zero_cells # showing 31 / 87
umis barcodes
1: 1 AACCGCGGTACAGTTC
2: 4 AACCGCGTCAGTCAGT
3: 1 AAGCCGCGTTTGTGTG
4: 1 AAGTCTGAGGACGAAA
5: 1 AATCCAGCATTCTCAT
6: 1 ACACCCTCAAAGTCAA
7: 1 ACATCAGTCGAGAACG
8: 1 ACCCACTCATGCCTAA
9: 1 ACCCACTGTTATCCGA
10: 1 ACCCACTGTTGCGTTA
11: 1 ACGAGCCAGAAACCTA
12: 1 ACGCAGCCAGGACGTA
13: 1 ACGGAGACATCAGTAC
14: 1 ACGGCCACACAACGTT
15: 1 ACTTGTTGTATTACCG
16: 1 AGAATAGCACAGTCGC
17: 1 AGGGTGAAGTCTCCTC
18: 1 AGTGTCAAGTTACCCA
19: 1 ATCATCTAGCCAGGAT
20: 1 ATCCGAAGTCCGCTGA
21: 1 ATGAGGGAGCGTGTCC
22: 1 ATGGGAGCACGGATAG
23: 1 ATTACTCAGTTAACGA
24: 1 CACACCTCAATGTAAG
25: 1 CACCTTGTCGGCGCAT
26: 1 CACTCCATCCCTCAGT
27: 1 CAGAATCTCATTTGGG
28: 1 CAGAATCTCTGGCGAC
29: 5 CAGATCAGTTCAGGCC
30: 1 CAGCTGGAGCACCGTC
31: 1 CAGTAACGTCTTCGTC
Now we look to the reprocessed gRNA dataset, showing just the results from THRESHOLD_cells (>= 5 UMIs) we see a completely different story for this gRNA.
THRESHOLD_cells = (gRNA_matrix['SNP-44_2', paper_cbcs])[ (gRNA_matrix['SNP-44_2', paper_cbcs]) >= 5] # (2409 total) kallisto-bustools reprocessed gRNA matrix
r$> THRESHOLD_cells[1:40,] # first 40 cells / 2409
umis barcodes
1: 8 AAACCTGAGCCCAGCT
2: 7 AAACCTGCATGCAACT
3: 5 AAACCTGGTCATATCG
4: 5 AAACCTGTCAAAGTAG
5: 5 AAACGGGAGACCACGA
6: 22 AAACGGGCAATCTACG
7: 6 AAACGGGCACATGTGT
8: 73 AAACGGGCACATTCGA
9: 9 AAACGGGGTAAACCTC
10: 7 AAACGGGGTCGGATCC
11: 290 AAACGGGTCGCACTCT
12: 5 AAAGATGAGAGGTAGA
13: 5 AAAGATGAGCGATATA
14: 109 AAAGATGCAGCTCGCA
15: 5 AAAGATGCATAAAGGT
16: 5 AAAGATGGTCTGCGGT
17: 20 AAAGATGTCGTATCAG
18: 5 AAAGATGTCGTTACGA
19: 7 AAAGCAAAGAGTGAGA
20: 5 AAAGCAACAGCGTCCA
21: 204 AAAGCAACAGGACCCT
22: 9 AAAGCAAGTTACTGAC
23: 31 AAAGCAATCACTTCAT
24: 5 AAAGCAATCAGCAACT
25: 5 AAAGCAATCTTCTGGC
26: 8 AAAGTAGAGGCATGGT
27: 309 AAAGTAGCACATGACT
28: 134 AAAGTAGCAGTGGGAT
29: 93 AAAGTAGGTCTGCAAT
30: 7 AAAGTAGGTGTGGCTC
31: 172 AAAGTAGTCCAGATCA
32: 344 AAATGCCAGATGCGAC
33: 469 AAATGCCAGTATCGAA
34: 7 AAATGCCCAAGTAGTA
35: 319 AAATGCCCACACCGAC
36: 5 AAATGCCCAGATAATG
37: 9 AAATGCCGTCCTAGCG
38: 6 AACACGTCATCGGACC
39: 6 AACACGTGTCGCATCG
40: 107 AACACGTGTGAGGGAG
From this we see that most GEO UMI counts for SNP-44-2 are close to zero, with 85 (/87 cells nonzero) cells containing between 1-4 UMIs. For the reprocessed dataset this is the opposite: most UMI counts for SNP 44-2 are >= 5, with 2409/9343 cells containing >=5 UMIs.
Plotting the histograms of all non zero cells for SNP 44-2:
Here we plot the QQplots and p value histrograms associated with the SCEPTRE results from GEO vs those from the re-processed dataset. We specifically investigate the NTC qqplots and histograms in order to look at the behaviour/detectability of problem NTCs which refers to when NTCs produce a phenotype signal by chance (Replogle et al. Cell. 2022)
NTC pairs removing NTC-3, 8, and 11.
Here we do the same but for the re-processed dataset (raw data > UMI count matrixes > QC > SCEPTRE).
NTC pairs removing NTC-3, 8, and 11.
Here we do the same but for the re-processed dataset that used a mappability adjusted GTF reference (raw data > UMI count matrixes > QC > SCEPTRE).
NTC pairs removing NTC-3, 8, and 11.
Here we apply 3 different definitions for genes detected in the Morris dataset from GEO (cDNA matrix) and the re-processed Morris dataset output from CellRanger (filtered, cDNA matrix). We define genes as being well expressed or detectably expressed, including them in our downstream SCEPTRE analysis, if a read is present in at least 0.525%, 2% and 5% of all cells, respective to their datasets (GEO cDNA or pre-processed, filtered cDNA normal/mappablity adjusted). The size of each dataset in genes by cell and the amount of genes detected is shown below.
Dataset | Size (genes x cell) | 0.525% of cell pop. | 2% of cell pop. | 5% of cell pop. |
---|---|---|---|---|
GEO cDNA dataset | 35606 x 680k | 15329 | 12052 | 10325 |
re-processed cDNA dataset (normal) | 36601 x 15391 | 14579 | 11968 | 10345 |
re-processed cDNA dataset (mappability adjusted) | 35580 x 13902 | 13689 | 11152 | 9462 |
This dataset is the Morris raw data for the cDNA, gRNA, and HTO matrixes that was processed accordingly: the cDNA reads were processed with CellRanger 7.0.0, the gRNA and HTO matrixes were processed according to the Kallisto-Bustools kite workflow. In both cases matrixes are generated and passed onto pre-processing in order to apply QC and prepare the data for SCEPTRE analysis.
Plot of distribution of cells per gene as a histogram. A red line is
placed at 5% of 15,391 cells (770) as the minumum amount of cells a gene
must be found in, in order for it to be kept in the analysis. This is
used to filter out genes that are not found in most of the dataset and
that gene is not used later in association testing.
Plot of the distribution of genes per cell as a histogram. A red line
is placed at the current QC threshold of 850 unique genes. This
threshold does NOT appear to be appropriate for this dataset at the
moment.
Plot of the count depth vs rank. The plot itself was introduced
originally in the Drop-seq paper (Macosko et al., Highly parallel
genome-wide expression profiling of individual cells using nanoliter
droplets, 2015.) Here you rank cells based on count depth to produce a
count depth (total UMIs) vs barcode rank plot and place a horizontal
threshold where the line curves down steeply.
This is meant to be used in supplment to the plot above, where we
seek to define a lower bound for total UMI counts (count depth) based on
the distribution present in the data.
This plot is a scatter plot of count depth vs number of genes per
cell. This plot was taken from “Current best practices in scRNA-seq
analysis: a tutorial” Luecken & Theis. 2019 and to shows all cells
in the datasets as points.
The following plots mirror those above but are made using the final, (input) expression and gRNA matrixes after QC. These plots serve to double check and visualize the final SCEPTRE ready dataset and shouldnt contain any concerns. Addional to the plots above, a MOI plot or gRNA per cell histogram is added to this section.
Plot of distribution of cells per gene as a histogram. A red line is
placed at 5% of the original 15,391 cells (770).
Plot of the distribution of genes per cell as a histogram. A red line
is placed at the current QC threshold of 850 unique genes. This
threshold does NOT appear to be appropriate for this dataset at the
moment.
Plot of the count depth vs rank. The plot itself was introduced
originally in the Drop-seq paper (Macosko et al., Highly parallel
genome-wide expression profiling of individual cells using nanoliter
droplets, 2015.) Here you rank cells based on count depth to produce a
count depth (total UMIs) vs barcode rank plot and place a horizontal
threshold where the line curves down steeply.
This is meant to be used in supplment to the plot above, where we
seek to define a lower bound for total UMI counts (count depth) based on
the distribution present in the data.
This plot is a scatter plot of count depth vs number of genes per
cell. This plot was taken from “Current best practices in scRNA-seq
analysis: a tutorial” Luecken & Theis. 2019 and to shows all cells
in the datasets as points.
This is the canonical MOI plot or gRNA per cell plot. A red line is
placed on the median gRNAs per cell.
When looking at the UpsetR plots showing the intersection between significant gene-gRNA pairs between the Morris GEO dataset based SCEPTRE analysis and the re-processed raw data based analysis we saw that there was little overlap. This was cause for concern since we expect most of the results between the two to be the same. In order to investigate what is going on we will look at the QC thresholds and QC plots in both the GEO dataset and the re-processed dataset.
This dataset is taken directly from GEO: Morris GEO Dataset. GEO describes the dataset there as: “Processed data provided in standard UMI count matrix format for cDNA, GDO and HTO. Each sample has three files separated into features, barcodes and a sparse matrix file.”. Afterwards the data matrixes are pre-processed in order to apply QC thresholds and prepare the data for SCEPTRE analysis (input) within the general Snakemake pipeline.
Plot of distribution of cells per gene as a histogram.
Plot of the distribution of genes per cell as a histogram. The y-axis
is in log10 becuase of the large amount of zeros when all cells are
considered. A red line is placed at the current QC threshold of 850
unique genes.
Plot of the count depth vs rank. The plot itself was introduced
originally in the Drop-seq paper (Macosko et al., Highly parallel
genome-wide expression profiling of individual cells using nanoliter
droplets, 2015.) Here you rank cells based on count depth to produce a
count depth (total UMIs) vs barcode rank plot and place a horizontal
threshold where the line curves down steeply.
This is meant to be used in supplment to the plot above, where we
seek to define a lower bound for total UMI counts (count depth) based on
the distribution present in the data. This plot also features a log10
y-axis becuase of the sheer amount of zeros.
This plot is a scatter plot of count depth vs number of genes per
cell. This plot was taken from “Current best practices in scRNA-seq
analysis: a tutorial” Luecken & Theis. 2019 and to shows all cells
in the datasets as points.
The point of this section is to illustrate that current QC thresholds as specified in the Morris manuscript are insuficient for the dataset on GEO. Here we replot the plots from above, adding in MOI, in order to show the distribution after QC. In places were applicable red lines have been added to show the location of a QC threshold.
For both the plots featuring count depth (total UMIs) there is no current (lower) threshold placed as it was not included in the manuscript. For that reason, these plots have no red line. I suspect that reason behind this was the use of the HTO matrix, which was used in conjuction with Suerats HTODemux function in order to identify singlets/multiplets/empty cells.
Plot of distribution of cells per gene as a histrogram. A red line is
placed at 5% of 9343 cells (467.15) as the minumum amount of cells a
gene must be found in, in order for it to be kept in the analysis. This
is used to filter out genes that are not found in most of the dataset
and that gene is not used later in association testing.
Plot of the distribution of genes per cell as a histogram. The y-axis
is frequency or count. A red line is placed at the current QC threshold
of 850 unique genes.
Plot of the count depth vs rank. The plot itself was introduced
originally in the Drop-seq paper (Macosko et al., Highly parallel
genome-wide expression profiling of individual cells using nanoliter
droplets, 2015.) Here you rank cells based on count depth to produce a
count depth (total UMIs) vs barcode rank plot.
This is meant to be used in supplment to the plot above, where we
seek to define a lower bound for total UMI counts (count depth) based on
the distribution present in the data.
This plot is a scatter plot of count depth vs number of genes per
cell. This plot was taken from “Current best practices in scRNA-seq
analysis: a tutorial” Luecken & Theis. 2019 and to shows all cells
in the datasets as points.
This is the canonical MOI plot or gRNA per cell plot.
Firstly, the Xie et al data set contains roughly 105k cells. The experiment is a CRISPR based scRNA seq by design. For this dataset, there are 5,170 sgRNAs targetting 516 enchancers. There are 10 sgRNAs per enchancer for a total of ~ 5,160 enchancer-sgRNAs and 10 control sgRNAs (6 NTCs, 4 PTCs?).
The chemistry used for the library prep was 10X 3’ prime version 2. Prior to running cellranger on the raw data, we first look at the structure of the libraries sequenced to generate the data in the first place. Looking at the S2 Supplement table we see:
We can see that there are a total of 10 scRNAseq libraries over 5 batches. Initially, cellranger count was run on each batch (joining libraries) and then cellranger aggregate was used to correct for batch effects and merge UMI counts across all samples (single filtered matrix out). This was later changed to performing cellranger count on each library and then aggregating all counts into a single matrix.
Below is a table of significant results. The adjusted p value threshold used was 5%.
type | num_fdr_sig | num_bonf_sig |
---|---|---|
SNP | 802 | 196 |
PTC | 76 | 38 |
SNP-Mappability | 507 | 162 |
PTC-Mappability | 32 | 15 |
Below is a table showing the percent of crossmapping significant results in each category. Each significant gene-gRNA pair is compared to a list of cross mapping genes (symmetric cross mappability score > 0). See 5/12 notes for details on how this was done.
type | num_fdr_sig | num_bonf_sig |
---|---|---|
SNP | 0.7206983 | 0.6887755 |
PTC | 0.8421053 | 0.8947368 |
SNP-Mappability | 0.5621302 | 0.5493827 |
PTC-Mappability | 0.8125000 | 0.8000000 |
The general idea for this section is to check the NTC qqplots and histograms for the normal or un-adjusted SCEPTRE trans-analysis of the Morris dataset. The QCs used are shown below and SCEPTRE was run on pairs generated from the genes/gRNAs in the dataset. Each pair has a distance of at least 10Mb.
The general idea for this section is to check the NTC qqplots and histograms for the mappability-adjusted SCEPTRE trans-analysis of the Morris dataset. This is a mirror run of the normal analysis with the only change being the GTF reference.
The main objective of this section is to describe how raw data from scRNA seq experiments, such as Morris 2021, Gasperini 2019, and Xie 2019, are run through the CellRanger pipeline and QC in the context of our mappability paper/test.
Preprocessing or alignment is relatively straightforward. cDNA reads (usually in R2 of the fastq files) from transcripts are split into groups according to cells, based on cell barcodes (usually R1), and then aligned to a reference to determine molecules of origin. The aligned reads are then collapsed based on UMI (usually R1). Collapsing is done because during PCR there might be PCR duplicates that are from the original transcript. In order to only count each unique transcript once, a collapsing strategy must be used prior to counting. Typically, this is done solely based on UMI and cDNA aligned to genes. For example, the collapsing strategy used by kallisto | bustools and applicable to droplet-based methods is the following:
This is known as the naïve approach to collapsing UMIs and allows for computational efficiency since it forms the problem of collapsing into a NP problem, as opposed to NP complete alternative. This is partially why Kallisto is near-optimal. Cellranger uses a similar workflow, and its only major difference is that alignment is not. In Cellranger, alignment is base-pair to the genome via STAR/STARsolo while Kallisto uses pseudoalignment to the refence transcriptome to generate count matrixes. Cellranger also uses the same hamming distance of 1 to correct cell barcodes in step 1) above, in fact, Kallisto borrowed this step from Cellranger.
Raw data: Typically, data from scRNA seq perturbation experiments (scRNAseq + CRISPR) using 10x Chromium technology is reported in paired fastq files, labeled based on lane arising from each Chromium device (GEM wells). This is because each lane in a Chromium Device is expected to process up to 10,000 cells per lane. For example, data from Morris 2021 is reported on GEO as the following:
From this we see that there are 4 samples using 2 lanes of a single
GEM well. We can aggregate samples within a GEM well for quantification.
We can then run the cellranger count pipeline (assuming we have valid
cellranger references made). For larger datasets each GEM well needs to
be run separately and the final cellranger count output aggregated using
cellranger’s aggregation function (cellranger agg -h).
After running cellranger count pipeline, we can proceed to performing QC
on the outputted filtered_feature_bc_matrix folder containing our gene
by cell matrix in MTX format.
Summary of QC: We apply a few QC thresholds.
The resulting gRNA (perturbation) and cDNA (expression) matrixes are feature by cell, and are used as inputs for SCEPTRE analysis.
How does this process fit into the overall experiment?
The point of having a unified, snakemake, pipeline for processing raw scRNAseq cDNA expression data into UMI count matrixes is because our method for improving trans gene regulation with SCEPTRE relies on carefully modifying the quantification step. As outlined above, we can see that aligning reads to the reference is done prior to collapsing UMIs and counting them. Our method introduces a modified reference into this step specifically. We modify the reference GTF annotation used to construct the cellranger reference by removing low mappability regions. The adjusted reference is then used to quantify gene expression in a more conservative way, as we hope to improve detecting trans regulatory signals by removing regions that are problematic in the alignment step due to the multi mapping issue. Thus, the snakemake pipeline insures that our treatment of the cDNA expression data remains the same throughout the experiment with respect to all quantification and quality control steps.
In order to begin investigating which signals from SCEPTRE trans analysis are cross mappable, we took the signifcant gene-gRNA pairs and intersected them with genes identified as being cross-mappable based on a Symmetric Cross-Mappabiity score from False positives in trans-eQTL analysis arising from RNA seq alignment errors.
First data from the above paper was downloaded; notably two versions corresponding to the hg19 and hg38 references. Both sets use a kmer length = 76 for exonic regions and a kmer=36 for UTR regions. This represents a region of 76 mers from the genes exonic regions and 36 mers from its untranslated regions with a maximum tolerance of 2 mismatches.
Terms:
The data for symmetric cross-mappability is a 3 column table, with the first two columns defining gene A and gene B, and the last column being the symmetric cross-mappabiity. The genes are originally reported using ENSEMBL ids, which match our gene_id column in SCEPTRE results.
gene A gene B symmetric cross-mappabiity
ENSG00000000003.10 ENSG00000100058.8 25
ENSG00000000003.10 ENSG00000101017.9 5.5
ENSG00000000003.10 ENSG00000102595.14 8
After intersection we can classify our results using a binary flag: cross_maps; which indicates if a gene in a gene-gRNA pair has a non-zero symmetric cross-mappability score (TRUE/FALSE). The table below is based on the significant table from 4/28 (below).
site_type | bh_reject (FDR=<0.05) pairs w/ symmetric cross-mapping | bh_reject (FDR=<0.05) pairs (cross-map = FALSE) | bonferroni_reject (=<0.05) pairs w/ symmetric cross-mapping | bonferroni_reject (=<0.05) pairs (cross-map = FALSE) | total bh_reject (FDR=<0.05) | total bonferroni_reject (=<0.05) |
---|---|---|---|---|---|---|
SNP | 445 | 169 | 124 | 52 | 614 | 176 |
PTC | 131 | 22 | 53 | 4 | 153 | 57 |
site_type | bh_reject (FDR=<0.05) pairs w/ symmetric cross-mapping | bh_reject (FDR=<0.05) pairs (cross-map = FALSE) | bonferroni_reject (=<0.05) pairs w/ symmetric cross-mapping | bonferroni_reject (=<0.05) pairs (cross-map = FALSE) | total bh_reject (FDR=<0.05) | total bonferroni_reject (=<0.05) |
---|---|---|---|---|---|---|
DHS | 2245 | 1243 | 878 | 448 | 3488 | 1326 |
TSS | 8494 | 3738 | 1370 | 572 | 12232 | 1942 |
We found that some gRNAs, NTCs, appeared to be inflated with p values deviating from expectation in the Morris tran-wide analysis. This is shown below on ‘4/20/2022 Morris trans-wide SCEPTRE NTC QQPlots’ wherein a few of the NTCs showed inflation while others did not. In order to investigate this further in another dataset, here we will plot NTC p values from the Gasperini 2019 trans-wide SCEPTRE analysis.
This update includes additional investigation of the resampled statistics (z-scores) that are used to construct the emperical null distribution in SCEPTRE analysis. In order to further investigate we now plot a histogram of their p values (Z scores forming the null converted to p values using original SCEPTRE fit_t_skew(), which uses R selm in order to fit linear models with skew-elliptical error term)
The idea is to inspect the z scores of NTC pairs which are both normal-looking (non-significant, on expected line in qpplot) vs NTC pairs that showed p-value inflation.
First, we identify the NTC pairs we are interested in looking into. This is done by inspecting NTC pairs ordered by -log10(p_value) specific to a gRNA (NTC). From NTC plots by NTC on 4/20/2022, we see that NTC-8 shows a lot of inflation while NTC-4 and NTC-1 are relatively normal. Looking at just the expanded grid plot (which contains NTCs adjusted all-together) we see that by descending order points #5 on both NTC-1 and NTC-4 are almost exactly on the expected line with an observed -log10(p_value) value of about ~3.34/3.07, while descending point #1 on NTC-8 (as well as a few others) is inflated with an observed -log10(p_value) value of ~8.11.
We can pull out these 3 pairs with the corresponding information. The index column corresponds to rank by descending -log10(p_value). Based on above reasoning, we select #1 from NTC-8 and #5 from both NTC-1 and NTC-4.
idx gene_id gRNA_id p_value neglog10_obs_pval
5 ENSG00000171045 NTC-1 0.000457 3.34
5 ENSG00000110063 NTC-4 0.000849 3.07
1 ENSG00000108298 NTC-8 0.00000000782 8.11
other pairs:
2 ENSG00000171863 NTC-8 0.0000000154 7.81
3 ENSG00000161016 NTC-8 0.0000000197 7.70
4 ENSG00000110700 NTC-8 0.0000000235 7.63
5 ENSG00000178741 NTC-8 0.0000000257 7.59
1 ENSG00000101182 NTC-4 0.000202 3.70
2 ENSG00000099901 NTC-4 0.000320 3.50
3 ENSG00000145191 NTC-4 0.000343 3.47
4 ENSG00000172977 NTC-4 0.000403 3.40
1 ENSG00000100865 NTC-1 0.0000251 4.60
2 ENSG00000215154 NTC-1 0.000214 3.67
3 ENSG00000254093 NTC-1 0.000308 3.51
4 ENSG00000173559 NTC-1 0.000309 3.51
Then we can then plot the resampled test statistics of these 3 gene-gRNA pairs.
Additional plots:
Briefly, in order to check trans pairs, the distance between each gene-gRNA pair was calculated for both the Morris and Gasperini pairs run in the SCEPTRE analysis. This was done by gettting the gene position infomation from the reference genome, and the position of gRNA target. This is availible for all gene-gRNA pairs tested. A additional flag value of -1 is used to indicate pairs across chromosomes. Below are the plotted distances seperated by site_type (SNP, PTC (TSS), TSS, and DHS pairs)
Originally made on 4/14/2022, this was updated to reflect a major correction. See day below.
Trans gene-gRNA pairs are defined as genes at least >=10Mb from the gRNA target.
site_type | bh_reject (=<0.05) | bonferroni_reject (=<0.05) |
---|---|---|
SNP | 614 | 176 |
PTC | 153 | 57 |
site_type | bh_reject (=<0.05) | bonferroni_reject (=<0.05) |
---|---|---|
DHS | 3488 | 1326 |
TSS | 12232 | 1942 |
This was updated in order to reflect changes/correction of mistate. 1) The trans pairs were not correct. 2) Gene-gRNA pairs included non-autosomal genes on MT,X,Y.
Here are links to tables containing significant results using two different corrections: + FDR (bh) <= 0.05 + Bonferroni <= 0.05
The Morris transcriptome-wide SCEPTRE results contain two types of pairs, SNP and PTC. The PTC pairs are made by pairing the positive controls targeting the TSS of genes such as CD42, CD52, and CD55.
The tables also contain addition columns.
The Gasperini transcriptome-wide SCEPTRE results contain two types of pairs too, DHS and TSS. These can be accessed through the site_type column.
The tables also contain addition columns.
Using GenomicRanges in order to find target gene(s), both by proximity (closest gene problem) and by window size (ie: sliding 1MB window or 10MB window).
# FIRST prepare the lookup/reference table from our reference (b37 in this example)
# prepare genomic ranges object using b37 reference
annotations = data.table::fread(paste0('path/ensembl.Homo.sapiens.GRCh37.p13.tsv'), stringsAsFactors = F, header = T, sep='\t') %>% as_tibble()
library(GenomicRanges)
granges_genes_info = annotations
colnames(granges_genes_info) <- c('ensembl_gene_id', "GeneSymbol","Chr","Start","End")
genes_GR <- makeGRangesFromDataFrame(granges_genes_info,keep.extra.columns = TRUE)
genes_GR <- keepStandardChromosomes(genes_GR, pruning.mode="coarse")
# our example query
tchr = result_x[1, 'target_site.chr'] %>% pull() %>% as.numeric() # target chromosome (>numeric, excludes MT, X, Y chromosomes)
ts = result_x[1, 'target_site.start'] %>% pull() %>% as.numeric() # target gene/location/SNP/site (DHS, TSS) start position
te = result_x[1, 'target_site.stop'] %>% pull() %>% as.numeric() # target gene/location/SNP/site (DHS, TSS) end position
ttype = result_x[1, 'site_type'] %>% pull() %>% as.character() # a categorical variable/character value, can be arbitrary
site_size = te - ts
# now search
GENE_query_1Mb <- GRanges(tchr, IRanges(ifelse(ts > (1e6+site_size), ts - (1e6+site_size), 1), te + (1e6+site_size), id = ttype)) # this creates a 1MB (1e6) window around our site
GENE_query <- GRanges(tchr, IRanges(ts, te, id = ttype))
in1Mb_range = genes_GR[overlapsAny(genes_GR, GENE_query_1Mb[1])] # this returns a GRanges object with all genes within the query window defined above
gene_ids_in1Mb_range = in1Mb_range$ensembl_gene_id # pull ids to list
gene_symbols_in1Mb_range = in1Mb_range$GeneSymbol # pull symbols to list
index_nearest_gene = nearest(GENE_query, genes_GR) # this finds the nearest gene, returning its index in the reference table: genes_GR
nearest_gene_id = genes_GR[index_nearest_gene, ]$ensembl_gene_id # get that gene's id
nearest_gene_symbol = genes_GR[index_nearest_gene, ]$GeneSymbol # get that gene's HGNC symbol
Updated on 5/9/2022 - in order to fix plot issues and remove problematic NTCs from the Morris trans analysis.
Below are histograms of p-values in the Morris trans-wide analysis. The Morris trans-wide analysis included cis pairs which were removed before plotting below using information from the Ensembl 96 reference. In order to achieve better calibration, several problematic NTCs were removed. This decrease in gRNAs/NTCs lowers the total amount from 11x10325 (113K) to 8x10325 (83K). This is reflected in the NTC hisogram/QQplot below.
In the Gasperini trans-wide SCEPTRE analysis, we tested a total of ~77M pairs of which ~72M were DHS pairs where the target gRNA was >= 10Mb from the gene. The pairs were generated with the code found below: See ‘Notes on preparing Gasperini trans-wide analysis’ right click open image to zoom
Below are plots of the Morris cis-SCEPTRE analysis with additional NTCs. See results on 4/13/2022 for a table of the additional NTC pairs.
Below are histograms of p-values for the original Gasp. cis analysis. Original is used to highlight analysis done by Morris on Gasp. data in the preprint. The preprint found 538 signals at FDR <=0.1, while our SCEPTRE run returned 586 signals at FDR <= 0.1.
Statement of Problem: When running the original cis Morris analysis, NTC pairs were taken from the results reported in S3 of the Morris 2021 preprint. This resulted in 15.6k NTC gene-gRNA pairs. This plot is shown below for 3/23. Later, additional NTCs were spiked into the SCEPTRE, resulting in 26,009 total NTC pairs and plotted again on 4/13. See 4/13 for a table of gene-gRNA pairs run in cis.
During the transcriptome-wide SCEPTRE analysis of the Morris 2021 dataset, the NTC gRNAs were paired across all possible genes. See the first table in 4/18, which reports all gRNAs in the transcriptome-wide analysis for more information.
r$> table(morris_trans_results %>% filter(site_type == 'NTC') %>% pull(gRNA_id))
NTC-1 NTC-10 NTC-11 NTC-12 NTC-2 NTC-3 NTC-4 NTC-5 NTC-6 NTC-8 NTC-9
10325 10325 10325 10325 10325 10325 10325 10325 10325 10325 10325
With NTCs, the general idea is to plot them in two ways, seperatly making the expected p values and together.
We make a qqplot for all the NTCs, setting our expected p value based on the aggregate of all the NTCs, and seperate them out into seperate plots (grid plots, cowplot).
# here we specifically make our neglog10_exp_pval based on ALL 113K NTCs (see table above, 11x10,325 pairs)
results_NTC_all = results_NTC_all %>% arrange(p_value) %>% mutate(quantile=1:nrow(results_NTC_all)/(nrow(results_NTC_all) + 1)) %>% mutate(neglog10_obs_pval = -log10(p_value), neglog10_exp_pval = -log10(quantile))
Now, we make a qqplot for each individual NTCs, setting our expected p value based on that specific NTC, then plot.
ntc9 = results %>% filter(gRNA_id == 'NTC-9')
ntc9 = ntc9 %>% arrange(p_value) %>% mutate(quantile=1:nrow(ntc9)/(nrow(ntc9) + 1)) %>% mutate(neglog10_obs_pval = -log10(p_value), neglog10_exp_pval = -log10(quantile))
The following is a barplot of gene-gRNA pairs run during the transcriptome wide analysis of Morris dataset. These include cis and trans pairs.
The following is a table of gene-gRNA pairs that are located trans to the target gRNA (>= 10Mb distance). The flag was made by checking the distance between gRNA and gene for each pair.
Below are qqplots for the Morris 2021 trans-wide analysis.
To generate this plot we selected the NTCs tested in the cis analysis below and took those pairs from the transcriptome-wide analysis of Morris 2021.
This builds upon the results reported below on 3/23 by including additional NTCs.
In order to add additional NTCs, I spiked in extra by assigning my favorite NTCs to all availible genes in the dataset (10325). This results in a new amount of NTCs: 26,009.
NTC-1 NTC-10 NTC-11 NTC-12 NTC-2 NTC-3 NTC-4 NTC-5 NTC-6 NTC-7 NTC-8 NTC-9
475 10325 456 536 579 601 511 523 583 10325 578 517
The is the trans portion of the preprint: defined as trans-effects upon cCRE perturbation. Notes on preparing the trans-pairs for this analysis:
gRNA_gene_pairs_SCEPTRE = data.table()
for (gRNA in ondisc_gRNAs){
df = data.table(gene_id = ondisc_genes) %>% mutate(gRNA_id = gRNA)
gRNA_gene_pairs_SCEPTRE = rbindlist(list(gRNA_gene_pairs_SCEPTRE, df))
}
These NTCs were originally tested in the cis results (see 3/23/2022 below). To generate this plot we selected the NTCs tested in the cis analysis below and took those pairs from the transcriptome-wide analysis of Morris 2021. In other words, this plot does not include all NTCs tested in the trans analysis of Morris 2021 data.
The is the cis portion of the preprint: defined as cis-effects upon cCRE perturbation. There are 730 gene gRNA pairs that correspond to the SNP targeting results, while there are 15.6k NTC gene-gRNA pairs. For the combined plot, non-targeting gRNA-to-gene pairs were randomly sampled from the same set of genes tested in cis for targeting gRNAs.
The goal here is to prepare a set of gene gRNA pairs to be used for testing for trans gene regulation within the Gasperini 2019 dataset.
library(biomaRt)
library(dplyr)
#grch37 = useEnsembl(biomart="ensembl",GRCh=37)
#listDatasets(grch37) # what is availible rn (the last release aka 37.p13)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
listAttributes(ensembl) # list what attributes we can pull ie: gene, chromosome, position, etc
b37_reference <- getBM(attributes=c('ensembl_gene_id','ensembl_transcript_id','hgnc_symbol', 'gene_biotype', 'strand', 'chromosome_name','start_position','end_position', 'entrezgene_id','description'), mart = ensembl)
# original_results_raw <- suppressWarnings(read_tsv(paste0('/project2/xuanyao/nikita/SCEPTRE/data_matrixes/Gasperini_2019_dataset/at_scale_data/GSE120861_all_deg_results.at_scale.txt'), col_types = 'cddddddccccciiciiccl'))
# dhs_enhancer_pairs = original_results %>% filter(site_type == "DHS" & quality_rank_grna == "top_two") %>% select(gene_id, site_type, grna_group, target_site, target_site.start, target_site.stop, chr) %>% distinct(grna_group, .keep_all = TRUE)
# Notice that dhs_enhancer_pairs is a selection of results for DHS + top two AND we perform distinct(grna_group) to isolate unique gRNAs to make the pairs from.
dhs_enhancer_pairs = as_tibble(data.table::fread(paste0('/home/nbabushkin/stingseq_scripts/Gasperini_78kDHS_pairs_table.csv'), stringsAsFactors = F, header = T, sep=','))
tss_pairs = as_tibble(data.table::fread(paste0('/home/nbabushkin/stingseq_scripts/Gasperini_6kTSS_pairs_table.csv'), stringsAsFactors = F, header = T, sep=','))
approved_chrs = paste0('chr', seq(1,22))
library(GenomicRanges)
annotated_genes = annotations %>% select(hgnc_symbol, chromosome_name, start_position, end_position, ensembl_gene_id) %>% filter(ensembl_gene_id %in% approved_genes) %>% mutate(chromosome_name = paste0('chr', chromosome_name))
annotated_genes = annotated_genes %>% distinct(hgnc_symbol, chromosome_name, start_position, end_position, ensembl_gene_id)
colnames(annotated_genes) <- c("GeneSymbol","Chr","Start","End",'ensembl_gene_id')
annotated_genes
genes_GR <- makeGRangesFromDataFrame(annotated_genes,keep.extra.columns = TRUE)
genes_GR <- keepStandardChromosomes(genes_GR, pruning.mode="coarse")
print(paste0('starting tss_pairs loops...'))
tic()
df1 = foreach (x=1:nrow(tss_pairs),
.combine=function(...) rbindlist(list(...)),
.multicombine=TRUE) %dopar% {
gRNA = tss_pairs[x, 'grna_group'] %>% pull()
s = tss_pairs[x, 'target_site.start'] %>% pull()
e = tss_pairs[x, 'target_site.stop'] %>% pull()
c = tss_pairs[x, 'chr'] %>% pull()
query <- GRanges(c, IRanges(ifelse(s > 1e7, s - 1e7, 1), e + 1e7, id = gRNA))
genes_on_chr = annotated_genes %>% filter(Chr == c) %>% pull(ensembl_gene_id) %>% unique()
genes_not_on_chr = annotated_genes %>% filter(Chr %in% approved_chrs & Chr != c) %>% pull(ensembl_gene_id) %>% unique()
#f = findOverlaps(query, genes_GR, ignore.strand=TRUE)
inrange = genes_GR[overlapsAny(genes_GR, query[1])]
inrange_genes = inrange$ensembl_gene_id %>% unique()
out_of_range_genes = setdiff(genes_on_chr, inrange_genes)
genes = c(out_of_range_genes, genes_not_on_chr) %>% unique()
#cat(paste0(gRNA ,' :: # genes found: ', length(genes), ' \n'))
df = data.table(gene_id = genes) %>% mutate(gRNA_id = gRNA)
# if (x %% 100 == 0){cat(paste0('status: ', x, '/', nrow(tss_pairs), ' - ',Sys.time(),' \n'))}
return(df)
}
toc()
print(paste0('starting dhs_enhancer_pairs loops... '))
tic()
df2 = foreach (x=1:nrow(dhs_enhancer_pairs),
.combine=function(...) rbindlist(list(...)),
.multicombine=TRUE) %dopar% {
gRNA = dhs_enhancer_pairs[x, 'grna_group'] %>% pull()
s = dhs_enhancer_pairs[x, 'target_site.start'] %>% pull()
e = dhs_enhancer_pairs[x, 'target_site.stop'] %>% pull()
c = dhs_enhancer_pairs[x, 'chr'] %>% pull()
query <- GRanges(c, IRanges(ifelse(s > 1e7, s - 1e7, 1), e + 1e7, id = gRNA))
genes_on_chr = annotated_genes %>% filter(Chr == c) %>% pull(ensembl_gene_id) %>% unique()
genes_not_on_chr = annotated_genes %>% filter(Chr %in% approved_chrs & Chr != c) %>% pull(ensembl_gene_id) %>% unique()
#f = findOverlaps(query, genes_GR, ignore.strand=TRUE)
inrange = genes_GR[overlapsAny(genes_GR, query[1])]
inrange_genes = inrange$ensembl_gene_id %>% unique()
out_of_range_genes = setdiff(genes_on_chr, inrange_genes)
genes = c(out_of_range_genes, genes_not_on_chr) %>% unique()
#cat(paste0(gRNA ,' :: # genes found: ', length(genes), ' \n'))
df = data.table(gene_id = genes) %>% mutate(gRNA_id = gRNA)
#if (x %% 100 == 0){cat(paste0('status: ', x, '/', nrow(dhs_enhancer_pairs), ' - ',Sys.time(),' \n'))}
return(df)
}
toc()
gRNA_gene_pairs_SCEPTRE = rbindlist(list(df1, df2)) # used in the %dopar% above, as well
# final cleaning/sanity function
gRNA_gene_pairs_SCEPTRE = gRNA_gene_pairs_SCEPTRE %>% filter(gRNA_id %in% onidsc_gRNAs, gene_id %in% ondisc_genes)
Note: make sure to set up registerDoParallel(cores=XX) correctly prior otherwise it will take >1hr to bind the 76M pairs together.
Notes on preparing SCEPTRE for Morris 2021 data:
Single Cell Data Processing within STINGseq:
cDNA/GDO matrixs processing:
# cDNA QC FILTER
sobject_expression_subset <- subset(sobject_expression, subset = nFeature_RNA >= 850 & percent_mito < 20, cells = common_cellbarcodes)
# 35606 features across 14813 samples within 1 assay
# this is 28 cells OFF from the manuscript
median_genes_per_cell = median(sobject_expression_subset@meta.data$nFeature_RNA)
# result: 3899
# paper: 3875
cDNA_cells = colnames(sobject_expression_subset) # get the cells that passed expression QC above
HTO matrix:
GDO matrix:
sobject_expression_subset <- NormalizeData(sobject_expression_subset, normalization.method = 'CLR') # normalize data w/ CLR transform
# "We used the HTODemux function and determined a threshold of 0.91 maximized the number of
# singlets detected to 9,520 while reducing the number of multiplets (multiple cells per droplet) to
# 2,482 and empty cells to 2,726."
sobject_expression_subset <- HTODemux(sobject_expression_subset, assay = 'HTO', positive.quantile = 0.91)
These are the results on 78k DHS enhancers in Gasperini 2019 across multiple conditions.
The goal was to downsample and evaluate SCEPTRE performance on NTCs/SNPs across downsampled MOI, cell count, and gRNAs in Gasperini 2019. The plots below show some of the QC results from the downsampled datasets prior to running SCEPTRE.
Plan:
1. Download 3 datasets: Gasperini 2019, Datlinger 2017, and Morris 2021
2. Convert SCEPTRE into snakemake pipeline (preprocess to ondisc + sceptre analysis)
3. Replicate findings on trans-regulation
Current Steps:
1. Process Gasperini et al 2019 data for SCEPTRE
2. Run SCEPTRE to test gRNA-to-gene pairs in cis
3. Generate qqplot analogous to Figure 2.C in Morris 2021 bioxiv paper