Paul's Projects: Genome3D Research

A Viewer-model Framework for Visualizing Multi-scale Three-dimensional Genome with Online Integration

ABSTRACT

This research was a six-week effort to develop a 3D web browser for human genomic data. The goal was to create a working 3D browser with streaming capability in the Unity game engine. A working 3D browser was successfully programmed in C#, a language that Unity supports natively. Unity was chosen as the development environment because of its established web support: it allowed the program to be published and run on the web with little effort, and its real-time playback of the program made testing quick and convenient.

The 3D browser was able to display genomic material at three levels of resolution. The data for these levels were stored in separate folders named, from lowest to highest resolution, gloop, fiber, and nucleo. Each dataset was stored in either .bin or .xml format. The .bin files were encrypted, which made them very difficult to read, and they were simply compressed versions of their .xml counterparts. Because the Unity game engine already provided basic parsing support for .xml files, developing a parser for the .bin files was judged unnecessary, and a fully functional parser for the .xml files was developed instead. This project established the foundations of the program. Control systems, basic parsers for loading XML data, and basic client-to-server communication systems were developed for the web browser. Although not yet a fully featured program, it has the basics necessary to serve as a platform for further development.

INTRODUCTION

Fundamental studies in molecular biology often involve the study of the human genome. The Human Genome Project was an international effort dedicated to mapping the human genome. Mapping the entire human genome, however, required a massive sequencing effort: three billion base pairs had to be organized, identified, and cloned repeatedly by technicians (Cooper, 1994). The amount of effort required made the mapping of the human genome a very expensive and time-consuming process. Greg Sutcliffe’s sequencing of the first plasmid, pBR322, in the late 1970s took three years and cost over $100,000 for 4,361 base pairs (Cooper, 1994). Many techniques to automate or simplify this process have been developed since.

One of the critical advances that simplified the mapping of the human genome was the development of the polymerase chain reaction (PCR) in the 1980s. PCR repeatedly copies a small segment of DNA, making it possible to obtain large amounts of that specific segment. The technique yields enough DNA for mapping and sequencing the human genome without having to repeatedly isolate billions of DNA segments. The invention of PCR is credited to Kary Mullis, who was awarded the Nobel Prize in Chemistry in 1993 for this work and is considered a pioneer of modern molecular biology. Modern techniques such as PCR have improved the process of mapping DNA sequences so much that all of pBR322 can now be sequenced within a few days for under $1,000. Researchers today generate thousands of machine-readable sequences every month.

In order for computers to read genomic information, a dedicated program is required. Cooper (1994) states that “the human genome may be considered a biological ‘program’ written in a largely unknown programming language.” Writing software that assembles this largely unknown language into a format that researchers around the world can interpret requires innovative programming. The Human Genome Project started simply by feeding images of the mappings into computers. While this approach was easy to implement, it was too rudimentary to be of great use to researchers: the mappings showed the positions and patterns of the DNA, but identifying relationships between segments of DNA was very difficult. A more visual and intuitive representation of the human genome was required.

In an effort to provide a visual framework for a model of the human genome, the Department of Computer Science and Engineering of the University of South Carolina produced a program called Genome3D. Genome3D, “a GUI-based C++ program which runs on Windows platforms”, is an application of object-oriented technology and was the first program to integrate and visualize human genomic information in three dimensions (Asbury, Mitman, Tang, & Zheng, 2010). A physical model of the genome requires the three-dimensional position of each atom, and with over one billion atoms in a human genome, the size of such a model can easily exceed 600 gigabytes. By compressing the data into XML format and computing atom positions on demand, the size of a human genomic model is reduced drastically, to around 1.5 gigabytes.

A model, however, retains four different resolutions of its structure, each carrying a different set of information, which multiplies the total size of the model by a factor of four. This is still too large to be downloaded and used easily by researchers around the globe, especially when working with multiple human genomic models.

One solution to this problem is streaming. This technique splits the human genomic model into small pieces and allows the program to run with only partial information, so that clients can download the program and the genomic information a piece at a time and run it without first downloading the entire file. The download time required before the program can be used is thereby significantly reduced. By making Genome3D more accessible and size-efficient, this project strives to produce an example of a pioneering application of computer science and engineering in the field of molecular biology.
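The streaming idea can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the project's C#, and it is not the Genome3D implementation: the function `split_into_chunks`, the class `StreamingClient`, the chunk size, and the newline-delimited record format are all assumptions made for this example. The point it shows is that a client can start using complete records before the whole file has arrived.

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Server side (hypothetical): yield the model file in fixed-size pieces."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


class StreamingClient:
    """Client side (hypothetical): accepts chunks as they arrive and exposes
    every record that is already complete, so work can start early."""

    def __init__(self):
        self._buffer = b""   # bytes of a record that is still incomplete
        self.records = []    # records that are ready to be used

    def receive(self, chunk: bytes):
        self._buffer += chunk
        # Assumed format: one record per line. Everything before the last
        # newline is complete; the remainder stays in the buffer.
        *complete, self._buffer = self._buffer.split(b"\n")
        self.records.extend(line.decode("utf-8") for line in complete if line)


# Usage: after two of five chunks, the first record is already available.
model = b"gloop 0 0 0\nfiber 1 1 1\nnucleo 2 2 2\n"
chunks = list(split_into_chunks(model, 8))
client = StreamingClient()
client.receive(chunks[0])
client.receive(chunks[1])
early_records = list(client.records)
for chunk in chunks[2:]:
    client.receive(chunk)
```

A real system would send the chunks over a network connection and would likely prioritize the pieces nearest the camera; the sketch only covers the buffering logic.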

METHODS AND MATERIALS

This project was carried out entirely on a computer running the Windows operating system. All programming was done in C#. The Unity 3D game engine was used to build the code into an executable. It was chosen for its native support for web-based applications, and it also simplified the testing process because it could instantly load and run the currently active project.

Genome3D was written in C++, which the Unity 3D engine did not support, so the initial decision was to port the code from C++ to C#. However, because the original code was undocumented, deciphering it and converting it to C# proved extremely difficult and time-consuming. The decision was therefore changed to programming from scratch in C#.

As the first step, a control scheme was designed and programmed. Next, a system was coded that renders the atomic points of a genome model at specific three-dimensional coordinates. The coordinates come from the XML files, text-based markup documents that contain all the information for a genomic model, including the coordinates of every atomic point. An XML parser was programmed to load these files and feed the coordinates of each point to the rendering system. The three types of data point, gloop, fiber, and nucleo, which correspond to different levels of resolution, were coded as three distinct classes. The XML files specify which type of data point to load; the rendering system receives those parameters and calls for the corresponding data point to be rendered. After each step was finished, a test was run to make sure the program was running as intended.
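The parser-to-renderer flow described above can be sketched as follows. The sketch is in Python for brevity, not the project's C#, and the XML layout (a `point` element with `type`, `x`, `y`, and `z` attributes) is purely an assumption, since the actual Genome3D schema is not shown in this report. The class names mirror the three levels named in the text; `parse_model` and `render` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical file layout; the real Genome3D XML schema may differ.
SAMPLE_XML = """
<model>
  <point type="gloop" x="0.0" y="1.0" z="2.0"/>
  <point type="fiber" x="3.5" y="4.0" z="-1.0"/>
  <point type="nucleo" x="0.25" y="0.5" z="0.75"/>
</model>
"""


@dataclass
class DataPoint:
    x: float
    y: float
    z: float


class Gloop(DataPoint):
    """Lowest-resolution level."""


class Fiber(DataPoint):
    """Intermediate-resolution level."""


class Nucleo(DataPoint):
    """Highest-resolution level."""


# The type named in the file selects which class is instantiated.
POINT_CLASSES = {"gloop": Gloop, "fiber": Fiber, "nucleo": Nucleo}


def parse_model(xml_text):
    """Read every point element and build the matching data point object."""
    root = ET.fromstring(xml_text.strip())
    points = []
    for element in root.iter("point"):
        point_class = POINT_CLASSES[element.get("type")]
        points.append(point_class(float(element.get("x")),
                                  float(element.get("y")),
                                  float(element.get("z"))))
    return points


def render(points):
    """Stand-in for the rendering system: describes each point as text."""
    return [f"{type(p).__name__} at ({p.x}, {p.y}, {p.z})" for p in points]


points = parse_model(SAMPLE_XML)
descriptions = render(points)
```

Mapping the type string to a class through a dictionary keeps the parser unchanged if further resolution levels are added later.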

RESULTS

By the end of the project, the basic functions of Genome3D had been ported to C#. The program was able to load a genome model file encoded in XML format and populate the browser world with data points. Any data point within the frustum of the camera, the truncated-pyramid volume that represents the user’s field of view, was rendered as one of the three types of data points. Data points outside the frustum were not rendered, to conserve memory. The control scheme was tested by two testers, both of whom reported that navigating with the browser was intuitive and easy.

Due to time limitations, the streaming technology was not incorporated. Porting the basic functionality of the program required so much time that most of the goals were not met. Future work should implement a server-to-client communication system so that users can stream genome model files to their computers from a server that hosts them. In addition, although the program was able to read and load data points from the model files, the files contain extra information that the program cannot yet interpret, including the points at which the DNA folds and twists. Handling this information requires skillful vertex manipulation.
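The frustum test mentioned in the results can be illustrated with a small geometric sketch, again in Python rather than C#. It is not how Unity performs culling internally. It assumes a camera at the origin looking along the positive z axis with a symmetric perspective projection, and the default field of view, aspect ratio, and clipping distances are illustrative values only.

```python
import math


def in_frustum(point, fov_y_deg=60.0, aspect=16 / 9, near=0.1, far=100.0):
    """Return True if a point (x, y, z) in camera space lies inside the
    view frustum of a camera at the origin looking along +z (assumed setup)."""
    x, y, z = point
    # Reject points in front of the near plane or beyond the far plane.
    if not (near <= z <= far):
        return False
    # The visible cross-section grows linearly with distance from the camera,
    # which is what makes the volume a truncated pyramid rather than a box.
    half_height = z * math.tan(math.radians(fov_y_deg) / 2.0)
    half_width = half_height * aspect
    return abs(x) <= half_width and abs(y) <= half_height


def visible_points(points, **camera):
    """Keep only the points that would be handed to the renderer."""
    return [p for p in points if in_frustum(p, **camera)]


candidates = [(0.0, 0.0, 10.0), (0.0, 0.0, -5.0), (100.0, 0.0, 10.0),
              (0.0, 7.0, 10.0), (5.0, 0.0, 10.0), (0.0, 0.0, 200.0)]
shown = visible_points(candidates)
```

Only the first and fifth candidates pass: the others are behind the camera, too far to the side or too high for their distance, or beyond the far plane.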

Although the goals for this project were not met, the basics of the program have been ported to C#. With this foundation in place, future work can more readily incorporate new features and eventually achieve the goals that were set for this project.