Generative AI Visualizes Thousands of 3D Genome Structures in Just Minutes

Researchers at the Massachusetts Institute of Technology (MIT; MA, U.S.) have developed ChromoGen, a deep learning model that generates 3D genomic structures. These structures can reveal crucial insights into gene expression and deepen scientists’ understanding of health and disease.
DNA’s 3D Structure Influences Gene Expression
Although all cells in an organism carry a full copy of its DNA, they perform remarkably different functions. This is due to the unique pattern of genes that are either expressed or switched off; a neuronal cell’s gene expression differs greatly from that of a liver cell.
Across different cellular populations, the sets of genes that are active and switched off vary, allowing cells to differentiate and carry out specialized functions in the body. This on/off function is largely carried out by transcription factors, which are a group of proteins that bind to genomic sequences to promote or repress gene expression.
The 3-dimensional organization of DNA plays a vital role in gene expression. Structural proteins called histones condense and fold DNA into chromatin. This arrangement creates regions that are either physically accessible or inaccessible to transcription factors, which in turn determines which genes are expressed and which are switched off.
For example, a neuronal cell’s chromatin will only expose the genes relevant for brain function, while genes important for liver cell functioning are tucked away and ‘hidden’ from transcription factors.
Notably, chromatin structures are dynamically affected by dietary or behavioral factors – a phenomenon known as epigenetic modification. Certain foods, or behaviors like smoking, can attach molecular tags to the genome that impact how DNA folds.
In turn, these epigenetic changes can cause genes to become active or switched off over time, which could have a significant impact on health. For instance, if a ‘switched off’ gene that normally promotes cell growth becomes active, it may trigger uncontrolled cell division, potentially leading to tumor formation.
Genomic Structures are Hard to Capture
Studying genomic structures can deepen scientists’ understanding of how distinct expression patterns emerge, offering valuable insights into developmental biology. Moreover, exploring the role of epigenetics could also reveal key information about how lifestyle choices contribute to disease development.
Currently, researchers know only a little about the molecular interactions that drive this DNA folding.
Scientists use experimental techniques to model chromatin structures; one method, Hi-C, captures the average chromatin conformation across a population of cells by:
- Cross-linking DNA to ‘lock’ interacting regions together with a strong chemical bond.
- Cutting the genome into fragments using restriction enzymes.
- Rejoining or ‘gluing’ the fragments back together with a labelling molecule. Cross-linked regions are more likely to rejoin because of their physical proximity.
- Sequencing the rejoined fragments, which are picked out by the labelling molecule that helped ‘glue’ them back together.
Data obtained from Hi-C tell scientists which areas of the genome are physically interacting. This indicates the 3D chromatin structure and helps them understand how DNA is spatially organized within a population of cells.
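To make this concrete, the sketch below shows one way sequenced read pairs could be tallied into a binned contact matrix, the kind of table from which interacting regions are read off. It is a minimal illustration only: the function name `build_contact_matrix`, the bin size, and the example coordinates are invented for this example and do not come from any real Hi-C pipeline.

```python
def build_contact_matrix(read_pairs, region_length, bin_size):
    """Count rejoined read pairs for every pair of genomic bins.

    read_pairs: iterable of (pos_a, pos_b) base-pair coordinates on one region.
    Returns a symmetric list-of-lists matrix of contact counts.
    """
    n_bins = -(-region_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror the count so the matrix stays symmetric
    return matrix


# Hypothetical read pairs (base-pair coordinates), for illustration only.
pairs = [(100, 1950), (120, 1900), (400, 450), (150, 2900)]
contacts = build_contact_matrix(pairs, region_length=3000, bin_size=1000)
for row in contacts:
    print(row)
```

High counts between two bins suggest that those regions sit close together in 3D space, even if they are far apart along the linear sequence.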
A variation of this technique, called Dip-C, applies a similar approach to a single cell instead of a population. It extracts and amplifies the DNA from a single cell to visualize individual chromatin structures.
Although effective, these methods require large amounts of data, which is labor-intensive, costly, and time-consuming to obtain. They also do not fully account for epigenetic modifications, which can significantly alter DNA folding.
ChromoGen: Using AI to Predict 3D DNA Structures
There have been several attempts to predict the 3D structures of DNA from their sequences using deep learning techniques. However, these approaches remain limited in their applications since, like the more traditional techniques, they do not factor in epigenetic changes to chromatin conformation.
Researchers are looking to overcome these limitations, and one promising strategy is the development of more powerful AI models.
One research group at MIT in the U.S. recently published a paper in Science Advances on ChromoGen, a generative AI model capable of visualizing 3D chromatin structures at both the single-cell and population levels.
ChromoGen integrates two components into one workflow, the EPCOT model and a diffusion model built on a U-Net neural network, and predicts 3D DNA structures in two key stages:
- The EPCOT component embeds DNA sequence and chromatin accessibility data into a low-dimensional space, simplifying the information to extract only key features. This allows ChromoGen to capture patterns in sequencing data that influence overall structure.
- The low-dimensional EPCOT embeddings are used as inputs to a generative diffusion model, here the U-Net neural network, which learns to denoise the data step by step and generates high-resolution chromatin structures.
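The data flow of these two stages can be sketched in a few lines of Python. This is a toy under heavy assumptions, not ChromoGen itself: `embed` uses simple chunk averages in place of EPCOT, `toy_noise_estimate` is a hand-written stand-in for the trained U-Net, the loop is deterministic (a real diffusion sampler follows a learned noise schedule), and the output is a short vector rather than a 3D structure. All names and values are hypothetical.

```python
import random


def embed(sequence, accessibility, dim=4):
    """Stage 1 stand-in: compress per-base inputs into `dim` numbers.

    Hypothetical simplification: chunk averages of (GC flag + accessibility),
    not the EPCOT model. Assumes len(sequence) >= dim.
    """
    gc_flags = [1.0 if base in "GC" else 0.0 for base in sequence]
    features = [g + a for g, a in zip(gc_flags, accessibility)]
    chunk = len(features) // dim
    return [sum(features[k * chunk:(k + 1) * chunk]) / chunk for k in range(dim)]


def toy_noise_estimate(sample, embedding):
    """Stage 2 stand-in for the learned network.

    Treats the gap between the current sample and the embedding as the
    noise to remove; a trained model would predict this instead.
    """
    return [x - e for x, e in zip(sample, embedding)]


def generate(embedding, steps=50, step_size=0.2, seed=0):
    """Start from random noise and remove a fraction of the estimated noise per step."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in embedding]
    for _ in range(steps):
        noise = toy_noise_estimate(sample, embedding)
        sample = [x - step_size * n for x, n in zip(sample, noise)]
    return sample


# Made-up inputs: a 12-base sequence and one accessibility score per base.
sequence = "ACGTGGCCATAT"
accessibility = [0.1, 0.2, 0.9, 0.8, 0.7, 0.9, 0.3, 0.2, 0.1, 0.0, 0.1, 0.0]
embedding = embed(sequence, accessibility)
generated = generate(embedding)
print(embedding)
print(generated)
```

The point of the sketch is the shape of the workflow: the inputs are compressed once, and the generator then refines random noise over many small steps while being guided by that compressed representation.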
Results and Future Perspectives
The research team trained the model with over 11 million 3D chromatin conformations previously obtained through Dip-C. They then used ChromoGen to generate said chromatin structures from over 2,000 DNA sequences, and the AI did it in just a matter of minutes – much faster than Hi-C or Dip-C, which can take up to a week to complete.
Impressively, ChromoGen’s predictions of the physical distances between different genomic regions aligned strongly with experimental observations, with a reported correlation of 97% between the AI-predicted data and the data obtained through Hi-C/Dip-C.
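The article does not describe how this correlation was calculated. As one plausible illustration, the sketch below compares a predicted and an experimental structure by computing all pairwise distances between genomic regions and taking the Pearson correlation of the two sets of distances. The coordinates and function names are made up for the example.

```python
import math


def pairwise_distances(coords):
    """Distance between every pair of 3D points (one point per genomic region)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def upper_triangle(matrix):
    """Flatten the entries above the diagonal, so each pair is counted once."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical 3D coordinates for four genomic regions.
experimental = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 1)]
predicted = [(0, 0, 0), (1.1, 0, 0), (0.9, 1.0, 0.1), (0.1, 1.0, 0.9)]

r = pearson(
    upper_triangle(pairwise_distances(predicted)),
    upper_triangle(pairwise_distances(experimental)),
)
print(f"correlation between predicted and experimental distances: {r:.3f}")
```

A value close to 1 means the predicted structure places regions near or far from each other in much the same pattern as the experimental one.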
Such a study has widespread implications for future research. AI models, such as ChromoGen, could radically cut time spent on costly experiments while allowing scientists to visualize chromatin structures at a high resolution.
Exploring these structures at the single-cell or population level could uncover the molecular forces behind DNA folding and potentially reveal the mechanisms controlling gene expression that are associated with certain diseases.
Wider Perspective: AI in the Chromatin Structural Space
Outside of this paper, the field of AI-powered chromatin structural analysis is advancing quickly. One review, published in Briefings in Bioinformatics, provides a comprehensive overview of the benefits and drawbacks of several deep learning techniques developed for chromatin structural predictions.
Most deep learning techniques currently use one of two strategies to transform data, a crucial process for extracting the key patterns that inform machine predictions:
- Stripe Methods. Large datasets are ‘striped’ across multiple storage units; in this case, chromatin interaction data is divided into ‘bins’ that are further organized into structured ‘stripes.’ The model compares pairs of bins to extract patterns that follow the structure of these stripes, which may take forms such as V-shaped or zig-zag patterns.
- 1D to 2D Conversion Methods. Encoders convert 1D data (linear sequences) into grid-like data to identify patterns and important relationships. They do this using two core methods, tiling and transposition, which expand data into a larger format and change the order of data arrangement, respectively.
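Tiling and transposition can be illustrated with a short Python sketch. It shows one common way to turn a 1D list of per-bin features into a 2D grid in which every cell describes a pair of bins; whether any particular model, ChromoGen included, uses exactly this construction is an assumption. The function names and feature values are hypothetical.

```python
def tile(features, n):
    """Tiling: repeat the 1D list n times, so grid[i][j] == features[j]."""
    return [list(features) for _ in range(n)]


def transpose(grid):
    """Transposition: swap rows and columns, so new[i][j] == grid[j][i]."""
    return [list(column) for column in zip(*grid)]


def to_pairwise_grid(features):
    """Build an n x n grid where cell (i, j) holds (features[i], features[j])."""
    n = len(features)
    by_column = tile(features, n)    # cell (i, j) holds features[j]
    by_row = transpose(by_column)    # cell (i, j) holds features[i]
    return [
        [(by_row[i][j], by_column[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical per-bin features, for example accessibility scores for three bins.
features = [0.1, 0.7, 0.3]
grid = to_pairwise_grid(features)
for row in grid:
    print(row)
```

Each cell of the grid now carries information about two positions at once, which is the form a model needs in order to predict pairwise quantities such as contacts or distances.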
ChromoGen, which integrates an EPCOT framework with a diffusion model, uses a form of the second method: conversion of one-dimensional data (the DNA sequence and open chromatin information). This 1D data is transformed into a grid-like structure to extract the key patterns that describe the 3D chromatin structure.
However, reducing data dimensionality can remove some high-level interactions in chromatin structure, whereas AI frameworks that utilize stripe methods may capture more complex spatial structures much more effectively.
There are also some limitations surrounding the type of input data that ChromoGen uses.
AI models developed for chromatin structural visualization use a combinatorial approach to inform predictions – for example, ChromoGen infers chromatin accessibility from DNase-seq data, an experimental technique using a specialized enzyme to cut DNA bonds.
Although effective, DNase-seq is less efficient at detecting chromatin accessibility than alternative experimental methods such as the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq). ATAC-seq requires fewer processing steps than DNase-seq, leaving less room for possible errors, and uses an enzyme specific to open chromatin regions, helping to reduce random noise.
ChromoGen also ignores parameters that are important for gene expression. One crucial factor is the influence of DNA-binding proteins, which define boundaries within the chromatin and indicate enhancement or suppression of certain genes.
While the results of ChromoGen are exciting, it is important to consider the study in a wider context, acknowledging that there are still hurdles to using deep learning for genomic structural visualizations.