A graph-based approach for the genome-wide prediction of conditionally essential genes

Major goals of the project

This project aims to answer a fundamental question in biology: how can we identify the genes that are essential for an organism’s survival under a given set of conditions? To this end, our first goal is to build computational models that predict essential and conditionally essential genes using machine learning methods and network analyses. Gene networks inferred from different high-throughput data sets will be integrated, and genes with known phenotypes will be labeled. A diffusion-based model will then be developed to infer phenotypes for the remaining genes in the network. This model uses both the functional data of a gene and its topological context in the network to predict its phenotype. Our second goal is to further refine the predicted essential genes computationally with spatial and temporal transcriptomic data, using a state-space model. Finally, we aim to experimentally validate the refined gene set in model organisms. The validation results will be fed back to further optimize our computational pipeline. In the end, we will make genome-wide predictions for different species, and all prediction results and the pipeline will be shared online.
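As an illustration of the diffusion idea described above, the following is a minimal sketch of label propagation on a gene network. It is not the project's actual model: the function name `diffuse_labels`, the restart weight `alpha`, the symmetric degree normalisation, and the four-gene toy network are all assumptions chosen for illustration.

```python
import numpy as np

def diffuse_labels(adjacency, seed_scores, alpha=0.5, tol=1e-9, max_iter=1000):
    """Propagate known phenotype labels over a network (illustrative sketch).

    Iterates f <- alpha * W f + (1 - alpha) * y, where W is the symmetrically
    normalised adjacency matrix and y holds the known labels.
    """
    A = np.asarray(adjacency, dtype=float)
    degree = A.sum(axis=1)
    degree[degree == 0] = 1.0                      # avoid division by zero for isolated genes
    inv_sqrt = 1.0 / np.sqrt(degree)
    W = A * inv_sqrt[:, None] * inv_sqrt[None, :]  # D^{-1/2} A D^{-1/2}

    y = np.asarray(seed_scores, dtype=float)
    f = y.copy()
    for _ in range(max_iter):
        f_next = alpha * (W @ f) + (1.0 - alpha) * y
        converged = np.max(np.abs(f_next - f)) < tol
        f = f_next
        if converged:
            break
    return f

# Hypothetical toy network: a chain of four genes, 0 - 1 - 2 - 3.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
# Gene 0 is labeled essential (1); the other genes are unlabeled (0).
seed_scores = np.array([1.0, 0.0, 0.0, 0.0])

scores = diffuse_labels(adjacency, seed_scores)
print(scores)
```

In this toy chain, the score decreases with network distance from the labeled gene, so unlabeled genes that sit close to known essential genes receive higher predicted scores. A real application would also need to incorporate the functional data of each gene, which this sketch omits.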

Significant results so far:

Starting from our previously developed method, which uses state-space models to analyze the dynamics of gene expression driven by external and internal regulatory networks, we identified common temporal expression patterns during embryonic development in worm and fly.
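To make the state-space idea concrete, the sketch below assumes a linear discrete-time form, x(t+1) = A x(t) + B u(t), where x holds the expression of a gene group, u holds the expression of external regulators, A captures internal regulation, and B captures external regulation. This form, the function name `fit_linear_state_space`, the least-squares fit, and the simulated data are assumptions for illustration only and are not taken from the project's method.

```python
import numpy as np

def fit_linear_state_space(X, U):
    """Estimate A and B in x[t+1] = A x[t] + B u[t] by ordinary least squares.

    X: array of shape (T, n), internal gene-group expression over T time points.
    U: array of shape (T, m), external regulator expression over T time points.
    """
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    n = X.shape[1]
    Z = np.hstack([X[:-1], U[:-1]])                    # predictors at time t
    coef, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)   # rows: [A^T; B^T]
    A = coef[:n].T
    B = coef[n:].T
    return A, B

# Simulated, noise-free toy data (hypothetical values).
A_true = np.array([[0.9, 0.1],
                   [-0.2, 0.8]])
B_true = np.array([[0.5],
                   [0.3]])
rng = np.random.default_rng(0)
T = 20
U = rng.standard_normal((T, 1))
X = np.zeros((T, 2))
X[0] = [1.0, 0.0]
for t in range(T - 1):
    X[t + 1] = A_true @ X[t] + B_true @ U[t]

A_est, B_est = fit_linear_state_space(X, U)
print(np.round(A_est, 3))
print(np.round(B_est, 3))
```

With noise-free simulated data the estimates match the matrices used to generate it. Real expression time series are short and noisy, so a practical analysis would typically need dimensionality reduction or regularisation before fitting.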

We have observed that our diffusion method outperforms other state-of-the-art disease gene prediction methods. The method can be extended to use other attributes for the phenotypic characterisation of genes and to integrate other biological networks, such as those derived from gene expression, for the prediction of conditionally essential genes.

Participants

PI: Dr. Mark Gerstein

Participants from the Gerstein lab at Yale University

Dr. Shuang Liu plays a leading role in the project and coordinates progress among its participants. She mainly focuses on developing the statistical models and assembling them into the pipeline for essential gene prediction. She is also designing computational experiments to validate the prediction results. Besides her research role, she is responsible for training students and junior postdocs.

Dr. Prashant Siva Emani has a strong background in physics and mathematics. He is applying his expertise to develop statistical models that measure the importance of genes in network contexts. He contributed to the design of the project and is helping with its upcoming validation phase. He also guided the graduate students and the non-student research assistant in data collection and data processing.

Dr. Gamze Gursoy is also working on the development of statistical models and measures of gene importance. She directly mentored Molly Green, the non-student research assistant, in collecting useful data from databases and the literature.

Dr. Shantao Li contributed to the development of statistical models of gene importance. More specifically, he is applying diffusion models to gene regulatory networks for this purpose.

Molly Green is a non-student research assistant for this project. She worked closely with Dr. Gamze Gursoy on developing novel statistical models to predict gene importance. She has also been collecting data from databases and the literature for model training and testing.

Dr. Mihali Felipe is the system administrator in the Gerstein lab. He helped set up computational pipelines, manage computing resources, and provide data storage.

Michael R. Schoenberg is a graduate student in the Gerstein lab. Guided by the postdocs on the project, he has been helping to develop statistical models for gene importance prediction. He has also been working on data collection and processing.

Other collaborators

From Royal Holloway, University of London

Alberto Paccanaro, Professor in the Computer Science Department at Royal Holloway

Juan Caceres, PhD student in the Computer Science Department at Royal Holloway

From Brunel University London

Dr. Cristina Sisu, Lecturer, Genomic Data Analytics

Co-PI: Dr. Haiyuan Yu

Participants from the Yu lab at Cornell University

Dr. Yuan Liu has extensive experience in cellular, molecular, and biochemistry experiments. She joined the lab last year and has been leading the effort to set up the experimental validation pipeline.

Dr. Dapeng Xiong has extensive experience in computational biology. He joined the lab last year and has been involved in setting up the computational pipeline for designing and processing the validation experiments.

Snigdha Mahapatra is the lab programmer and system administrator. She maintains the lab MySQL database, where all raw and processed experimental results are stored, and performs large-scale repetitive data analysis (e.g., sequencing analysis).