A general and flexible method for signal extraction from single-cell RNA-seq data.

Single-cell RNA-sequencing (scRNA-seq) is a powerful high-throughput technique that enables researchers to measure genome-wide transcription levels at the resolution of single cells. Because of the low amount of RNA present in a single cell, some genes may fail to be detected even though they are expressed; these genes are usually referred to as dropouts. Here, we present a general and flexible zero-inflated negative binomial model (ZINB-WaVE), which leads to low-dimensional representations of the data that account for zero inflation (dropouts), over-dispersion, and the count nature of the data. We demonstrate, with simulated and real data, that the model and its associated estimation procedure are able to give a more stable and accurate low-dimensional representation of the data than principal component analysis (PCA) and zero-inflated factor analysis (ZIFA), without the need for a preliminary normalization step.
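
To make the zero-inflated negative binomial distribution concrete, here is a minimal sketch of its probability mass function for a single count. It is an illustration only, not the ZINB-WaVE implementation: the function names and the scalar parameters `mu` (mean), `theta` (inverse dispersion) and `pi` (probability of an excess zero) are assumptions introduced here, and the abstract does not describe how the model links these quantities to covariates or latent factors.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial probability of count y (mean mu > 0, inverse dispersion theta > 0)."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, theta, pi):
    """Zero-inflated negative binomial: extra point mass pi at zero, NB otherwise."""
    point_mass = pi if y == 0 else 0.0
    return point_mass + (1.0 - pi) * nb_pmf(y, mu, theta)
```

With mu = 2, theta = 1 and pi = 0.5, the probability of observing a zero rises from 1/3 under the plain negative binomial to 2/3 under the zero-inflated version, which is the kind of excess of zeros that dropouts produce.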

Kernel Multitask Regression for Toxicogenetics.

The development of high-throughput in vitro assays to study quantitatively the toxicity of chemical compounds on genetically characterized human-derived cell lines paves the way to predictive toxicogenetics, where one would be able to predict the toxicity of any particular compound on any particular individual. In this paper we present a machine learning-based approach for that purpose, kernel multitask regression (KMR), which combines chemical characterizations of molecular compounds with genetic and transcriptomic characterizations of cell lines to predict the toxicity of a given compound on a given cell line. We demonstrate the relevance of the method on the recent DREAM8 Toxicogenetics challenge, where it ranked among the best state-of-the-art models, and discuss the importance of choosing good descriptors for cell lines and chemicals.

The inconvenience of data of convenience: computational research beyond post-mortem analyses.

The Functional Impact of Alternative Splicing in Cancer.

Alternative splicing changes are frequently observed in cancer and are starting to be recognized as important signatures for tumor progression and therapy. However, their functional impact and relevance to tumorigenesis remain mostly unknown. We carried out a systematic analysis to characterize the potential functional consequences of alternative splicing changes in thousands of tumor samples. This analysis revealed that a subset of alternative splicing changes affect protein domain families that are frequently mutated in tumors and potentially disrupt protein-protein interactions in cancer-related pathways. Moreover, there was a negative correlation between the number of these alternative splicing changes in a sample and the number of somatic mutations in drivers. We propose that a subset of the alternative splicing changes observed in tumors may represent independent oncogenic processes that could be relevant to explain the functional transformations in cancer, and some of them could potentially be considered alternative splicing drivers (AS drivers).

ARHGEF17 is an essential spindle assembly checkpoint factor that targets Mps1 to kinetochores.

To prevent genome instability, mitotic exit is delayed until all chromosomes are properly attached to the mitotic spindle by the spindle assembly checkpoint (SAC). In this study, we characterized the function of ARHGEF17, identified in a genome-wide RNA interference screen for human mitosis genes. Through a series of quantitative imaging, biochemical, and biophysical experiments, we showed that ARHGEF17 is essential for SAC activity, because it is the major targeting factor that controls localization of the checkpoint kinase Mps1 to the kinetochore. This mitotic function is mediated by direct interaction of the central domain of ARHGEF17 with Mps1, which is autoregulated by the activity of Mps1 kinase, for which ARHGEF17 is a substrate. This mitosis-specific role is independent of ARHGEF17’s RhoGEF activity in interphase. Our study thus assigns a new mitotic function to ARHGEF17 and reveals the molecular mechanism for a key step in SAC establishment.

NetNorM: Capturing cancer-relevant information in somatic exome mutation data with gene networks

Genome-wide somatic mutation profiles of tumours can now be assessed efficiently and promise to move precision medicine forward. Statistical analysis of mutation profiles is, however, challenging due to the low frequency of most mutations, the varying mutation rates across tumours, and the presence of a majority of passenger events that hide the contribution of driver events. Here we propose a method, NetNorM, to represent whole-exome somatic mutation data in a form that enhances cancer-relevant information using a gene network as background knowledge. We evaluate its relevance for two tasks: survival prediction and unsupervised patient stratification. Using data from 8 cancer types from The Cancer Genome Atlas (TCGA), we show that it improves over the raw binary mutation data and network diffusion for these two tasks. In doing so, we also provide a thorough assessment of the prognostic power of somatic mutations, which has been overlooked by previous studies because of the sparse and binary nature of mutations.

MetaVW: Large-Scale Machine Learning for Metagenomics Sequence Classification.

Metagenomics is the study of microbial community diversity, especially of uncultured microorganisms, by shotgun sequencing of environmental samples. As sequencer throughput and data volume increase, it becomes challenging to develop scalable bioinformatics tools that reconstruct microbiome structure by binning sequencing reads to reference genomes. Standard alignment-based methods, such as BWA-MEM, provide state-of-the-art performance, but we demonstrate in Vervier et al. (2016) that compositional approaches using nucleotide motifs have faster analysis times for comparable accuracy. In this work, we describe how to use MetaVW, a scalable machine learning implementation for binning short sequencing reads based on their k-mer profiles. We provide a step-by-step guideline on how we trained the classification models and how the approach can easily generalize to user-defined reference genomes and specific applications. We also give additional details on the effect of the algorithm's parameters on performance.
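
As a rough illustration of what a k-mer profile is, the sketch below turns a read into a fixed-length vector of k-mer counts. It is not MetaVW's own code: the function name `kmer_profile`, the fixed `ACGT` alphabet and the lexicographic ordering of the vector are assumptions made for this example, and the feature encoding actually used by the tool may differ.

```python
from collections import Counter
from itertools import product


def kmer_profile(read, k):
    """Count every overlapping k-mer in a read and return the counts
    as a vector ordered lexicographically over the ACGT alphabet."""
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    # k-mers containing other symbols (e.g. N) are counted above but
    # have no slot in the output vector, so they are effectively ignored.
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]
```

For a read of length n made only of A, C, G and T, the vector has 4^k entries that sum to n - k + 1; vectors of this kind can then be passed to any standard classifier.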

Job Offers


We propose several projects for M2 or PhD students. Please contact us if you are interested in joining our team.

Maxime Woringer

Julia Ronsch

Post-doctoral position in genomics

We are looking for motivated and productive post-doctoral fellows (both experimental and computational biologists) to join our ATIP-Avenir team Replication Program and Genome Instability at the Institut Curie, Paris, France. The team focuses on using cutting-edge high-throughput genomic approaches and genome-wide data analyses to study the spatio-temporal replication program of the human genome and its impact on genome stability in normal and cancer cells, at the population level as well as at the single-molecule and single-cell level.

The Institut Curie, located in the heart of Paris, is the largest French cancer research center and one of the world’s leading institutions in the field. It is an excellent location for multidisciplinary research, supported by the strong network of the PSL Research University. The project will be performed in close collaboration between our genome analysis team, the bioinformatics/sequencing platforms and leading experimental biologists worldwide. The candidate will benefit from an excellent multi-disciplinary environment.


Applicants should hold, or be in the process of completing, a PhD degree in molecular/cell biology, bioinformatics or related areas. The candidate should have solid experimental and/or computational skills and a strong interest in genome biology. Experience with third-generation sequencing, high-throughput imaging or single-cell data analysis is a plus. The candidate should be highly motivated, curious and enthusiastic about working in a collaborative team.

Duration: 1-year renewable, up to 3 years.

Interested candidates should send a detailed CV (education, work/research experience, language skills and other skills relevant for the position, list of publications) along with a cover letter and contact details of 3 referees to C.L. Chen ().



Key publications:

– Brison O*, EL-Hilali S*, et al. Chen CL (2019). Transcription-Mediated Organization of The Replication Initiation Program Across Large Genes Sets Up Common Fragile Sites Genome-Wide. bioRxiv. doi:https://doi.org/10.1101/714717.

– Shi M J, et al. (2019) APOBEC-mediated Mutagenesis as a Likely Cause of FGFR3 S249C Mutation Over-representation in Bladder Cancer. Eur. Urol. 76, 9–13.

– Klein K*, Wang W* (*co-first authors), et al. (2017) Genome-Wide Identification of Early-Firing Human Replication Origins by Optical Replication Mapping. bioRxiv. doi:https://doi.org/10.1101/214841.

– Petryk N, et al., Chen CL* and Hyrien O* (*co-last authors). (2016) Replication landscape of the human genome. Nat. Commun.  7:10208.

– Van Dijk EL*, Chen CL* (co-first authors), et al. (2011) XUT, a novel class of antisense regulatory ncRNA in yeast. Nature 475:114-7.

A Family of Vertebrate-Specific Polycombs Encoded by the LCOR/LCORL Genes Balance PRC2 Subtype

The polycomb repressive complex 2 (PRC2) consists of core subunits SUZ12, EED, RBBP4/7, and EZH1/2 and is responsible for mono-, di-, and tri-methylation of lysine 27 on histone H3. Whereas two distinct forms exist, PRC2.1 (containing one polycomb-like protein) and PRC2.2 (containing AEBP2 and JARID2), little is known about their differential functions. Here, we report the discovery of a family of vertebrate-specific PRC2.1 proteins, “PRC2 associated LCOR isoform 1” (PALI1) and PALI2, encoded by the LCOR and LCORL gene loci, respectively. PALI1 promotes PRC2 methyltransferase activity in vitro and in vivo and is essential for mouse development. Pali1 and Aebp2 define mutually exclusive, antagonistic PRC2 subtypes that exhibit divergent H3K27-tri-methylation activities. The balance of these PRC2.1/PRC2.2 activities is required for the appropriate regulation of polycomb target genes during differentiation. PALI1/2 potentially link polycombs with transcriptional co-repressors in the regulation of cellular identity during development and in cancer.

Histone variants H2A.Z and H3.3 coordinately regulate PRC2-dependent H3K27me3 deposition and gene

The hierarchical organization of eukaryotic chromatin plays a central role in gene regulation, by controlling the extent to which the transcription machinery can access DNA. The histone variants H3.3 and H2A.Z have recently been identified as key regulatory players in this process, but the underlying molecular mechanisms by which they permit or restrict gene expression remain unclear. Here, we investigated the regulatory function of H3.3 and H2A.Z on chromatin dynamics and Polycomb-mediated gene silencing.

Genome-Wide Identification of Early-Firing Human Replication Origins by Optical Replication Mapping

The timing of DNA replication is largely regulated by the location and timing of replication origin firing. Therefore, much effort has been invested in identifying and analyzing human replication origins. However, the heterogeneous nature of eukaryotic replication kinetics and the low efficiency of individual origins in metazoans have made mapping the location and timing of replication initiation in human cells difficult. We have mapped early-firing origins in HeLa cells using Optical Replication Mapping, a high-throughput single-molecule approach based on Bionano Genomics genomic mapping technology. The single-molecule nature and 290-fold coverage of our dataset allowed us to identify origins that fire with as little as 1% efficiency. We find that sites of human replication initiation in early S phase are not confined to well-defined efficient replication origins, but are instead distributed across broad initiation zones consisting of many inefficient origins. These early-firing initiation zones co-localize with initiation zones inferred from Okazaki-fragment-mapping analysis and are enriched in ORC1 binding sites. Although most early-firing origins fire in early-replicating regions of the genome, a significant number fire in late-replicating regions, suggesting that the major difference between origins in early- and late-replicating regions is their probability of firing in early S phase, as opposed to qualitative differences in their firing-time distributions. This observation is consistent with stochastic models of origin timing regulation, which explain the regulation of replication timing in yeast.

Fragile X Mental Retardation Protein regulates R-loop formation and prevents global chromosome

Fragile X syndrome (FXS) is the most prevalent inherited intellectual disability, caused by mutations in the Fragile X Mental Retardation gene (FMR1) and deficiency of its product, FMRP. FMRP is a predominantly cytoplasmic protein thought to bind specific mRNA targets and regulate protein translation. Its potential role in the nucleus is not well understood. We are interested in the global impact of FMRP loss on chromosome stability. Here we report that, compared to an FMRP-proficient normal cell line, cells derived from FXS patients exhibit increased chromosome breaks upon DNA replication stress induced by a DNA polymerase inhibitor, aphidicolin. Moreover, cells from FXS individuals fail to protect genomic regions containing R-loops (co-transcriptional DNA:RNA hybrids) from aphidicolin-induced chromosome breaks. We demonstrate that FMRP is important for abating R-loop accumulation during transcription, particularly in the context of head-on collision with a replication fork, thereby preventing chromosome breakage. By identifying those FMRP-bound chromosomal loci with overlapping R-loops and fragile sites, we report a list of novel FMRP target loci, many of which have been implicated in neurological disorders. We show that cells from FXS patients have reduced expression of xenobiotic-metabolizing enzymes, suggesting that defective xenobiotic metabolism/excretion might contribute to disease development. Our study provides new insights into the etiological basis of FXS and enables the discovery of new therapeutic targets for it.

APOBEC-mediated Mutagenesis as a Likely Cause of FGFR3 S249C Mutation Over-representation in

FGFR3 is one of the most frequently mutated genes in bladder cancer and a driver of an oncogenic dependency. Here we report that only the most common recurrent FGFR3 mutation, S249C (TCC→TGC), represents an APOBEC-type motif and is probably caused by the APOBEC-mediated mutagenic process, accounting for its over-representation. We observed significant enrichment of the APOBEC mutational signature and overexpression of AID/APOBEC gene family members in bladder tumors with S249C compared to tumors with other recurrent FGFR3 mutations. Analysis of replication fork directionality suggests that the coding strand of FGFR3 is predominantly replicated as a lagging strand template that could favor the formation of hairpin structures, facilitating mutagenic activity of APOBEC enzymes. In vitro APOBEC deamination assays confirmed S249 as an APOBEC target. We also found that the FGFR3 S249C mutation was common in three other cancer types with an APOBEC mutational signature, but rare in urothelial tumors without APOBEC mutagenesis and in two diseases probably related to aging. PATIENT SUMMARY: We propose that APOBEC-mediated mutagenesis can generate clinically relevant driver mutations even within suboptimal motifs, such as in the case of FGFR3 S249C, one of the most common mutations in bladder cancer. Knowledge about the etiology of this mutation will improve our understanding of the molecular mechanisms of bladder cancer.

Transcription-Mediated Organization of The Replication Initiation Program Across Large Genes Sets Up Common Fragile Sites Genome-Wide.


Common Fragile Sites (CFSs) are chromosome regions prone to breakage under replication stress, known to drive chromosome rearrangements during oncogenesis. Most CFSs nest in large expressed genes, suggesting that transcription elicits their instability, but the underlying mechanisms remain elusive. Analyses of genome-wide replication timing of human lymphoblasts here show that stress-induced delayed/under-replication is the hallmark of CFSs. Extensive genome-wide analyses of nascent transcripts, replication origin positioning and fork directionality reveal that 80% of CFSs nest in large transcribed domains poor in initiation events, and are thus replicated by long-traveling forks. Rather than the formation of sequence-dependent fork barriers or head-on transcription-replication conflicts, it is this long travel of forks in late S phase that explains CFS replication features. We further show that transcription inhibition during S phase, which excludes the setting of new replication origins, fails to rescue CFS stability. Altogether, our results show that transcription-dependent suppression of initiation events delays replication of large gene bodies, committing them to instability.

A variant erythroferrone disrupts iron homeostasis in -mutated myelodysplastic syndrome.

Myelodysplastic syndromes (MDS) with ring sideroblasts are hematopoietic stem cell disorders with erythroid dysplasia and mutations in the splicing factor gene. Patients with MDS with mutations often accumulate excessive tissue iron, even in the absence of transfusions, but the mechanisms that are responsible for their parenchymal iron overload are unknown. Body iron content, tissue distribution, and the supply of iron for erythropoiesis are controlled by the hormone hepcidin, which is regulated by erythroblasts through secretion of the erythroid hormone erythroferrone (ERFE). Here, we identified an alternative transcript in patients with MDS with the mutation. Induction of this transcript in primary -mutated bone marrow erythroblasts generated a variant protein that maintained the capacity to suppress hepcidin transcription. Plasma concentrations of ERFE were higher in patients with MDS with an gene mutation than in patients with wild-type MDS. Thus, hepcidin suppression by a variant ERFE is likely responsible for the increased iron loading in patients with -mutated MDS, suggesting that ERFE could be targeted to prevent iron-mediated toxicity. The expression of the variant transcript that was restricted to -mutated erythroblasts decreased in lenalidomide-responsive anemic patients, identifying variant ERFE as a specific biomarker of clonal erythropoiesis.

The Polycomb protein Ezl1 mediates H3K9 and H3K27 methylation to repress transposable elements in

In animals and plants, the H3K9me3 and H3K27me3 chromatin silencing marks are deposited by different protein machineries. H3K9me3 is catalyzed by the SET-domain SU(VAR)3-9 enzymes, while H3K27me3 is catalyzed by the SET-domain Enhancer-of-zeste enzymes, which are the catalytic subunits of Polycomb Repressive Complex 2 (PRC2). Here, we show that the Enhancer-of-zeste-like protein Ezl1 from the unicellular eukaryote Paramecium tetraurelia, which exhibits significant sequence and structural similarities with human EZH2, catalyzes methylation of histone H3 in vitro and in vivo with an apparent specificity toward K9 and K27. We find that H3K9me3 and H3K27me3 co-occur at multiple families of transposable elements in an Ezl1-dependent manner. We demonstrate that loss of these histone marks results in global transcriptional hyperactivation of transposable elements with modest effects on protein-coding gene expression. Our study suggests that although often considered functionally distinct, H3K9me3 and H3K27me3 may share a common evolutionary history as well as a common ancestral role in silencing transposable elements.

Plasmodium myosin A drives parasite invasion by an atypical force generating mechanism.

Plasmodium parasites are obligate intracellular protozoa and causative agents of malaria, responsible for half a million deaths each year. The lifecycle progression of the parasite is reliant on cell motility, a process driven by myosin A, an unconventional single-headed class XIV molecular motor. Here we demonstrate that myosin A from Plasmodium falciparum (PfMyoA) is critical for red blood cell invasion. Further, using a combination of X-ray crystallography, kinetics, and in vitro motility assays, we elucidate the non-canonical interactions that drive this motor’s function. We show that PfMyoA motor properties are tuned by heavy chain phosphorylation (Ser19), with unphosphorylated PfMyoA exhibiting enhanced ensemble force generation at the expense of speed. Regulated phosphorylation may therefore optimize PfMyoA for enhanced force generation during parasite invasion or for fast motility during dissemination. The three PfMyoA crystallographic structures presented here provide a blueprint for discovery of specific inhibitors designed to prevent parasite infection.