sgrna_modeler package
Submodules
sgrna_modeler.datasets module
Training and testing datasets for modeling guide activity.
Includes a class for generating new dataset objects, so that new data can be easily modeled and tested.
Example:
>>> import pandas as pd
>>> import sgrna_modeler.enzymes as en
>>> from sgrna_modeler.datasets import ActivityData
>>> new_data = pd.read_csv('new dataset')
>>> new_dataset = ActivityData(new_data, en.cas9, '30mer', 'activity', 'new data')
-
class sgrna_modeler.datasets.ActivityData(data, enzyme, kmer_column, activity_column, name, group_column='')
Bases: object
Store information about activity data
Parameters: - data (pandas dataframe) – data to model
- enzyme (dict) – cas9 or cas12a
- kmer_column (str) – column of sequences to model
- activity_column (str) – column of activity values to model
- name (str) – name of the dataset
- group_column (str) – column to include in prediction output
-
sgrna_modeler.datasets.load_doench_2016()
Data from:
Doench, John G., et al. “Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9.” Nature biotechnology 34.2 (2016): 184.
Example:
>>> import sgrna_modeler.datasets as da
>>> doench = da.load_doench_2016()
>>> doench.data
      Unnamed: 0                           30mer  ...       drug  predictions
0              0  CAGAAAAAAAAACACTGCAACAAGAGGGTA  ...     nodrug     0.544412
1              1  TTTTAAAAAACCTACCGTAAACTCGGGTCA  ...    PLX_2uM     0.617512
2              2  TCAGAAAAAGCAGCGTCAGTGGATTGGCCC  ...     nodrug     0.476232
3              3  AATAAAAAATAGGATTCCCAGCTTTGGAAG  ...    PLX_2uM     0.459882
4              4  GATGAAAAATATGTAAACAGCATTTGGGAC  ...    PLX_2uM     0.290841
...          ...                             ...  ...        ...
5305        5305  GCACTTTGGTGTGGCTGACTGAGTGGGCCA  ...    PLX_2uM     0.586758
5306        5306  TTCTTTTGTAAGAACCCGCTGTGTTGGTTT  ...    PLX_2uM     0.492066
5307        5307  GCCCTTTGTCATCGTAGGAAGATATGGCTG  ...  AZD_200nM     0.479728
5308        5308  CAAATTTGTTCTTTAAATGGCTACAGGAGG  ...  AZD_200nM     0.478952
5309        5309  CAAATTTGTTCTTTAAATGGCTACAGGAGG  ...    PLX_2uM     0.478952
[5310 rows x 9 columns]
-
sgrna_modeler.datasets.load_kim_2018_test()
Indel frequencies from:
Kim, Hui Kwon, et al. “Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity.” Nature biotechnology 36.3 (2018): 239.
Example:
>>> import sgrna_modeler.datasets as da
>>> kim_2018_test = da.load_kim_2018_test()
>>> kim_2018_test.data
      50 bp synthetic target and target context sequence  ...  Indel frequency
0      GCAATTTGGTTTTAAAACAGAATATACAGTCTAAAAAACCAGCTTG...  ...        71.580711
1      CTGATGGCCATTTAAACAACTCTTTGAGCTCTCCAGTTCAAGCTTG...  ...        19.672949
2      TTTAGATGATTTTAAACCAGCATCTATAGACACTTCCTGTAGCTTG...  ...        75.641026
3      ACATTTGGACTTTAAACCCAAACTACTTGTCCAACGGTACAGCTTG...  ...        46.920217
4      CTCTACCAGGTTTAAACGCTTCCACACTTGTGTCAGTAATAGCTTG...  ...        54.981550
...                                                  ...  ...
2958   AGTTTGGAATTTTTTTTACACTGATCCTCAGCACATCTCAAGCTTG...  ...        -0.378500
2959   CAGGCTTTCTTTTTTTTCCTTTCCTAGTTGGTTCATTCCCAGCTTG...  ...         0.189438
2960   AACAGTGGCTTTTTTTTGCTGCTAGCACATATGTATGGGTAGCTTG...  ...        -2.857143
2961   CAGCCTCATGTTTTTTTGGGAACCAATCGATAATCACATTAGCTTG...  ...        11.275673
2962   TTGGATTGTGTTTTTTTTTAGCACCTTATTTTCCTTGAAGAGCTTG...  ...        -1.675978
[2963 rows x 10 columns]
-
sgrna_modeler.datasets.load_kim_2018_train()
Indel frequencies from:
Kim, Hui Kwon, et al. “Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity.” Nature biotechnology 36.3 (2018): 239.
Example:
>>> import sgrna_modeler.datasets as da
>>> kim_2018_train = da.load_kim_2018_train()
>>> kim_2018_train.data
       50 bp synthetic target and target context sequence 10 bp + PAM + 23 bp protospacer + 17 bp)  ...  Indel frequency
0                                                GCGCGAGCGTTTAAAAAACATCGAACGCATCTGCTGCCTAGCTTG...  ...        14.711302
1                                                CTAAAGAAACTTTAAAAATCTTTTCTGCCAGATCTCCAGAAGCTTG...  ...         0.238095
2                                                TTGCCATTGTTTTAAAACAGGTTCTGTACTTGATCTCTCCAGCTTG...  ...        88.079746
3                                                TTGCACATATTTTAAAACTGAGTTCAAAGACCACTCTTCCAGCTTG...  ...        75.392670
4                                                TAGACTAATGTTTAAAAGCAAGTGCAAGTCTTTGGAATCTAGCTTG...  ...        63.320080
...                                                                                            ...  ...
14995                                            TCCATCTTCATTTTTTTTGTAGAGTAGGGCTTTATTTCCAAGCTTG...  ...        -0.467290
14996                                            CCTTCTCTCCTTTTTTTTTCAAGATCTGATTCTTCTTGCAAGCTTG...  ...         0.000000
14997                                            CCAGGACTTGTTTTTTTTTCAATCTGTTCATCTTGGACCAAGCTTG...  ...         0.239006
14998                                            ACCATCATAATTTTTTTTTGCAACATAGCCATTTCTTTTTAGCTTG...  ...        -0.272826
14999                                            GAGCGCTTCTTTTTTTTTTTCGGGGTCTCGTTGCTGGGCGAGCTTG...  ...        -2.766164
[15000 rows x 10 columns]
-
sgrna_modeler.datasets.load_kim_2019_test()
Indel frequencies from:
Kim, Hui Kwon, et al. “SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance.” Science advances 5.11 (2019): eaax9249.
Example:
>>> import sgrna_modeler.datasets as da
>>> kim_2019_test = da.load_kim_2019_test()
>>> kim_2019_test.data
     Target context sequence (4+20+3+3)  ...  Background subtracted indel frequencies (average, %)
0        AAAACTGTGAGTGTGGGACCTGCTGGGGGC  ...                                             44.125755
1        AAACACAACCAATCCGAGGCCTTCTGGGTC  ...                                             12.163189
2        AAACTGTGAGTGTGGGACCTGCTGGGGGCT  ...                                             68.901263
3        AAACTTGAGAGCTTTCATAAAGCTTGGCAA  ...                                             13.135690
4        AAAGAAGCGGACTTTAAAGTTCGAGGGAGA  ...                                             48.355156
..                                  ...  ...                                                   ...
537      TTTGCAGCGCGTTGACTTATTCATGGGTCA  ...                                             36.249050
538      TTTGCTAGGAATATTGAAGGGGGCAGGGGA  ...                                             38.622947
539      TTTGTGGTGGTTGCTATGGTAATCCGGCAC  ...                                             12.246218
540      TTTTTACAATTCTGTGAGTTAGAGTGGGCA  ...                                              0.385915
541      TTTTTGAGGTGCACTAATAGAGGGTGGAGT  ...                                             41.100730
[542 rows x 5 columns]
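The column name "Target context sequence (4+20+3+3)" suggests that each 30-nt sequence is laid out as 4 nt of upstream context, a 20-nt protospacer, a 3-nt PAM, and 3 nt of downstream context. The sketch below splits a sequence along those widths. The helper name `split_context` is hypothetical and not part of sgrna_modeler, and the reading of the four widths is an assumption based on the column name alone.

```python
def split_context(seq, widths=(4, 20, 3, 3)):
    """Split a context sequence into consecutive segments of the given widths.

    Hypothetical helper, not part of sgrna_modeler. The default widths assume
    the layout: upstream context, protospacer, PAM, downstream context.
    """
    if len(seq) != sum(widths):
        raise ValueError(f"expected {sum(widths)} nt, got {len(seq)}")
    parts = []
    start = 0
    for width in widths:
        parts.append(seq[start:start + width])
        start += width
    return tuple(parts)


# First sequence shown in the example output above
upstream, protospacer, pam, downstream = split_context(
    "AAAACTGTGAGTGTGGGACCTGCTGGGGGC"
)
print(upstream, protospacer, pam, downstream)
```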
-
sgrna_modeler.datasets.load_kim_2019_train()
Indel frequencies from:
Kim, Hui Kwon, et al. “SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance.” Science advances 5.11 (2019): eaax9249.
Example:
>>> import sgrna_modeler.datasets as da
>>> kim_2019_train = da.load_kim_2019_train()
>>> kim_2019_train.data
                   Barcode  ...  Background subtracted indel (%)
0      TTTGACACACACGCACTAG  ...                        24.287805
1      TTTGACACACACTCGTATG  ...                        69.500438
2      TTTGACACACACTCTCGTC  ...                        25.994760
3      TTTGACACACACTCTGCTG  ...                        57.964590
4      TTTGACACACACTGCATAT  ...                        39.355020
...                    ...  ...
12827  TTTGTGTGTCTCGTATCAC  ...                        40.853256
12828  TTTGTGTGTCTCTACACGC  ...                        11.480880
12829  TTTGTGTGTCTCTCACGTA  ...                        63.861469
12830  TTTGTGTGTCTCTCTAGTC  ...                        51.650932
12831  TTTGTGTGTCTCTCTCAGA  ...                        40.019124
[12832 rows x 9 columns]
-
sgrna_modeler.datasets.load_meyers_2017_test()
Essential genes from GeckoV2 Achilles screens:
Meyers, Robin M., et al. “Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells.” Nature genetics 49.12 (2017): 1779-1784.
Mean activity is averaged across screens after Z-scoring by non-essentials
Example:
>>> import sgrna_modeler.datasets as da
>>> meyers_2017_test = da.load_meyers_2017_test()
>>> meyers_2017_test.data
     Species   Build  Chromosome Number  ...  Percent Protein  Notes  mean_activity
0      human  GRCh38                1.0  ...            22.12    NaN       3.325952
1      human  GRCh38                1.0  ...             2.30    NaN       2.645421
2      human  GRCh38                1.0  ...            56.87    NaN       2.040191
3      human  GRCh38                1.0  ...            40.38    NaN       3.356250
4      human  GRCh38                1.0  ...            40.11    NaN       1.602670
..       ...     ...                ...  ...              ...    ...            ...
667    human  GRCh38                NaN  ...             8.56    NaN       1.240547
668    human  GRCh38                NaN  ...             8.95    NaN       1.078080
669    human  GRCh38                NaN  ...            30.93    NaN      -0.364154
670    human  GRCh38                NaN  ...            34.24    NaN       2.605412
671    human  GRCh38                NaN  ...            41.25    NaN       2.620977
[672 rows x 20 columns]
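The normalization mentioned for this dataset (Z-scoring by non-essentials, then averaging across screens) can be illustrated with a small standard-library sketch. This is not the code used to build the dataset: the function names, the dict-based input layout, the use of the sample standard deviation, and the sign convention are all assumptions made for illustration.

```python
from statistics import mean, stdev


def zscore_by_controls(scores, control_ids):
    """Z-score one screen against its control (non-essential) guides.

    scores: dict mapping guide id -> raw score for a single screen.
    control_ids: ids of guides targeting non-essential genes.
    Hypothetical helper; uses the sample standard deviation of the controls.
    """
    controls = [scores[guide] for guide in control_ids]
    mu, sd = mean(controls), stdev(controls)
    return {guide: (value - mu) / sd for guide, value in scores.items()}


def mean_activity(screens, control_ids):
    """Average each guide's control-based Z-score across screens."""
    z_screens = [zscore_by_controls(s, control_ids) for s in screens]
    return {
        guide: mean(z[guide] for z in z_screens)
        for guide in z_screens[0]
    }


# Toy data (made up): two screens, one guide of interest, two control guides
screens = [
    {"g1": 2.0, "ne1": -1.0, "ne2": 1.0},
    {"g1": 3.0, "ne1": 0.0, "ne2": 2.0},
]
activity = mean_activity(screens, control_ids=["ne1", "ne2"])
print(activity["g1"])
```

Scaling each screen by its own controls before averaging keeps a screen with a wider score range from dominating the mean.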
-
sgrna_modeler.datasets.load_meyers_2017_train()
Essential genes from GeckoV2 Achilles screens:
Meyers, Robin M., et al. “Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells.” Nature genetics 49.12 (2017): 1779-1784.
Mean activity is averaged across screens after Z-scoring by non-essentials
Example:
>>> import sgrna_modeler.datasets as da
>>> meyers_2017_train = da.load_meyers_2017_train()
>>> meyers_2017_train.data
      Species   Build  Chromosome Number  ...  Percent Protein  Notes  mean_activity
0       human  GRCh38                1.0  ...            24.87    NaN      -0.230160
1       human  GRCh38                1.0  ...            23.26    NaN       3.045755
2       human  GRCh38                1.0  ...            18.60    NaN       1.307097
3       human  GRCh38                1.0  ...            18.07    NaN      -1.307698
4       human  GRCh38                1.0  ...            13.95    NaN       1.278670
...       ...     ...                ...  ...              ...    ...
7897    human  GRCh38                NaN  ...            11.78    NaN       1.959897
7898    human  GRCh38                NaN  ...            13.14    NaN      -0.429659
7899    human  GRCh38                NaN  ...            16.01    NaN       1.187820
7900    human  GRCh38                NaN  ...            19.03    NaN       1.573194
7901    human  GRCh38                NaN  ...            39.62    NaN       2.044455
[7902 rows x 20 columns]
sgrna_modeler.enzymes module
sgrna_modeler.features module
-
sgrna_modeler.features.featurize_guides(kmers, features=None, guide_start=9, guide_length=20)
Take guides and encode them for modeling
Parameters: - kmers – vector of context sequences
- features – boolean dictionary of which feature types to include: [‘Pos. Ind. 1mer’, ‘Pos. Ind. 2mer’, ‘Pos. Ind. 3mer’, ‘Pos. Ind. Zipper’, ‘Pos. Dep. 1mer’, ‘Pos. Dep. 2mer’, ‘Pos. Dep. 3mer’, ‘Pos. Dep. Zipper’, ‘Pos. Ind. Rep.’, ‘GC content’, ‘Tm’, ‘Physio’, ‘Double Zipper’]
- guide_start – int
- guide_length – int
Returns: featurized matrix
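To make a few of the feature names concrete, the sketch below shows one plausible reading of "Pos. Ind." (position-independent k-mer counts), "Pos. Dep." (position-dependent k-mer indicators), and "GC content". It is not the implementation behind featurize_guides: the function names, the feature naming scheme, and the choice to report GC content as a fraction are assumptions.

```python
from itertools import product

NTS = "ACGT"


def position_independent(seq, k):
    """Count how often each k-mer occurs anywhere in seq.

    Assumes seq contains only A, C, G and T.
    """
    counts = {"".join(p): 0 for p in product(NTS, repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts


def position_dependent(seq, k):
    """Indicator (0/1) for each k-mer at each 1-indexed start position."""
    feats = {}
    for i in range(len(seq) - k + 1):
        for p in product(NTS, repeat=k):
            kmer = "".join(p)
            feats[f"{i + 1}{kmer}"] = int(seq[i:i + k] == kmer)
    return feats


def gc_content(seq):
    """Fraction of G and C nucleotides in seq."""
    return (seq.count("G") + seq.count("C")) / len(seq)


guide = "ACGTGC"  # toy sequence, made up for illustration
print(position_independent(guide, 1))
print(position_dependent(guide, 2)["1AC"])
print(gc_content(guide))
```

Position-independent counts capture overall composition, while position-dependent indicators let a model learn that a nucleotide matters only at a particular site.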
-
sgrna_modeler.features.get_context_order(k)
Parameters: k – length of kmer
Returns: list of characters of each nt position
sgrna_modeler.models module
-
class sgrna_modeler.models.KerasSgrnaModel(random_state=7, val_frac=0.1, base_arc=None)
Bases: object
This class is for creating, training, and predicting guide activity with a Keras model
Parameters: - random_state – set random state in train/test split for reproducibility
- val_frac (float) – fraction of data to use for early stopping
- base_arc (function, which takes an input shape and returns a keras model) – base architecture to build neural network, defaults to build_kim2018
Example:
>>> from sgrna_modeler import datasets as da
>>> from sgrna_modeler import models as sg
>>> train_data = da.load_kim_2018_train()
>>> train_model = sg.KerasSgrnaModel()
>>> train_model.fit(train_data)
>>> test_data = da.load_kim_2018_test()
>>> test_predictions = train_model.predict(test_data)
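The roles of random_state and val_frac can be pictured with a seeded hold-out split: a fixed seed makes the shuffle reproducible, and val_frac sets how many items are set aside for monitoring early stopping. The sketch below uses only the standard library and is not how KerasSgrnaModel performs its split; `holdout_split` is a hypothetical helper.

```python
import random


def holdout_split(items, val_frac=0.1, random_state=7):
    """Shuffle items with a fixed seed and hold out a validation fraction.

    Hypothetical helper for illustration. Returns (train, validation).
    """
    rng = random.Random(random_state)  # private RNG, global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]


train, val = holdout_split(range(100))
print(len(train), len(val))
```

Calling the function again with the same random_state returns the same split, which is what makes a training run repeatable.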
-
fit(train_dataset)
Fit a model to the training data
Parameters: train_dataset (sgrna_modeler.datasets.ActivityData) – training data
Returns: self
-
load_weights(weights, enzyme, name)
Load previously trained weights
Parameters: - enzyme – cas9 or cas12a
- weights (str) – filepath to weights
- name (str) – name of the model
-
predict(test_dataset)
Predict activity of test data
Parameters: test_dataset (sgrna_modeler.datasets.ActivityData) – testing data
Returns: dataframe of predictions and other meta information
Return type: pandas dataframe
-
class sgrna_modeler.models.SklearnSgrnaModel(random_state=7, val_frac=0.1, model=None, features=None)
Bases: object
scikit-learn gradient boosting for modeling sgRNA activity
Parameters: - random_state (int) – set random state in train/test split for reproducibility
- val_frac (float) – fraction of data to use for early stopping
- model (sklearn GradientBoostingRegressor) – base model
- features (list) – features to model
Example:
>>> from sgrna_modeler import datasets as da
>>> from sgrna_modeler import models as sg
>>> train_model = sg.SklearnSgrnaModel()
>>> rs2_data = da.load_doench_2016()
>>> train_model.fit(rs2_data)
-
fit(train_dataset)
Fit a model to the training data
Parameters: train_dataset (sgrna_modeler.datasets.ActivityData) – training data
Returns: self
-
load_model(model, enzyme, name)
Load previously trained model
Parameters: - enzyme – cas9 or cas12a
- model (str (*.joblib)) – filepath to trained model
- name (str) – name of the model
-
predict(test_dataset)
Predict activity of test data
Parameters: test_dataset (sgrna_modeler.datasets.ActivityData) – testing data
Returns: dataframe of predictions and other meta information
Return type: pandas dataframe
-
sgrna_modeler.models.build_kim2018(input_shape=(34, 4))
Build a convolutional neural network
From: Kim, Hui Kwon, et al. “Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity.” Nature biotechnology 36.3 (2018): 239.
Parameters: input_shape (tuple) – guide length by nts (4)
Returns: CNN architecture
Return type: keras Model object
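The default input_shape of (34, 4) corresponds to a 34-nt sequence with one channel per nucleotide. A minimal one-hot encoding that produces a matrix of that shape might look like the sketch below. The channel order A, C, G, T and the helper name `one_hot` are assumptions, and the example sequence is made up; the package's own encoding may differ.

```python
NT_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed channel order


def one_hot(seq):
    """Encode a nucleotide sequence as a len(seq) x 4 matrix of 0s and 1s.

    Hypothetical helper; raises KeyError on characters other than A, C, G, T.
    """
    matrix = []
    for nt in seq.upper():
        row = [0, 0, 0, 0]
        row[NT_INDEX[nt]] = 1
        matrix.append(row)
    return matrix


# Made-up 34-nt sequence, used only to show the resulting shape
example = "ACGT" * 8 + "AC"
encoded = one_hot(example)
print(len(encoded), len(encoded[0]))
```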
sgrna_modeler.mutagenesis module
Module contents
Top-level package for sgRNA Modeler.