The Saudi Human Genome Program

IEEE Pulse

Oil wells, endless deserts, stifling heat, masses of pilgrims, and wealthy-looking urban areas still dominate the widespread mental image of Saudi Arabia. This image is now being extended to include a recent endeavor that is claiming a share of the global limelight as one of the top ten genomics projects underway: the Saudi Human Genome Program (SHGP). With sound funding, dedicated resources, and national determination, the SHGP aims to sequence 100,000 human genomes over the next five years to conduct world-class genomics-based biomedical research in the Saudi population. Why this project was conceived and thought to be feasible, what its ultimate target is, and how it operates are the questions we answer in this article.
Saudi Arabia has a high burden of genetic diseases, mostly due to the high rate of marriage between relatives (around 60% of marriages). These diseases appear as severe inherited diseases, which manifest early in life and affect 8% of births in the kingdom, and as common genetic diseases, such as diabetes, which manifest later in life and affect over 20% of the population. They heavily impact quality of life for affected individuals and are a huge burden on the national health care system. It is estimated that the annual cost of these diseases is about US$27 billion. A substantial reduction in children born with genetic disabilities would immediately save over US$270 million, and similar or greater savings may result from a small delay in the age of onset of diabetes or other common disorders.
Genetic diseases are caused by mutations in the DNA, specifically in regions that contain genes. A mutation in a gene translates to a change in the respective protein. If the mutation changes the protein structure and related physicochemical properties, the function of the protein in the cell will be affected. The severity of the disease depends on the importance of the affected protein and its role in human physiology. According to disease databases, the number of genetic disorders ranges between 7,000 and 8,000; for approximately 3,500 of these, the causative mutation is still unknown.
The first step toward eliminating the burden of genetic diseases is to find the mutations and respective genes that predispose individuals to the diseases. Then, proper preventative counseling can be planned, or a scheme of rational therapies can be devised on an individual basis, in what may be regarded as personalized medicine. In many cases, the disease-causing genes and gene variants are specific to the Saudi population and are unlikely to be discovered by research conducted outside the region. Hence, the establishment of the SHGP was a must to provide the necessary infrastructure to solve cases and understand disease in the Saudi population.
Interestingly, the abundance of genetic diseases in the Saudi population combined with large family sizes makes it easier to identify the gene and mutation underlying a particular disease because one can compare different disease carriers to healthy people, which gives stronger evidence. Furthermore, the studies done as part of the SHGP can be used to verify results obtained by other similar studies that draw conclusions from fewer cases. Thus, the national nature of the project can still benefit the global endeavor in fighting the diseases.

The Mission

The SHGP mission is to identify the genetic basis of severe and common inherited disease in the Saudi population utilizing state-of-the-art genome sequencing, bioinformatics, and validation techniques. It aims to establish the complete foundation for genomic medicine—lab infrastructure, technical capacity, and a genomic knowledge database. The database is planned to be a major output of the project to serve the whole medical community, in Saudi Arabia and worldwide. It will help in understanding the genetic bases of diseases and identifying better treatments, which will effectively contribute to the future developments of personalized medicine and genomic sciences.
“The SHGP will position the kingdom at the forefront of personalized medicine and will empower our citizens to help them make informed decisions for their health plans. It’s hoped other global academic institutions will use the impressive facilities the King Abdulaziz City for Science and Technology (KACST) is launching in the near future,” said Prince Dr. Turki bin Saud Al Saud, KACST president.

The Setup

Figure 1: A KACST research building that houses genome sequencing labs.

The SHGP is funded and organized by the KACST and involves the creation of a national network of ten genome centers to recruit subjects and undertake the sequencing required (Figure 1). It also involves the establishment of a centralized knowledge base at the KACST to store the resulting information on population variations, including those causing disease, and to make this available to enable future diagnostic and screening efforts. The core technology in the genome centers is what is known as next-generation sequencing (NGS) technology, which is a recent development that enables efficient and cost-effective reading of the DNA sequence that composes the genome of an individual. Advanced computing infrastructure to process big genomic data has also been established to transform the output of the next-generation sequencers into useful knowledge.
“The SHGP is the largest disease gene discovery project ever undertaken and will therefore also establish the kingdom as a world leader in disease genetics research and personalized medicine,” said Dr. Sultan Al-Sedairy, the project principal investigator and the executive director of the Research Center of the King Faisal Specialist Hospital and Research Center (KFSHRC).

The Start and Current Status

The SHGP was officially launched in December 2013. In the city of Riyadh, where the KACST headquarters is located, a central genomics and bioinformatics facility is currently running (Figure 2). Another high-throughput lab is also running in the KFSHRC. This is in addition to three other labs in Jeddah, Medinah, and Riyadh that are ready to run at the time of writing. Furthermore, there are another five satellite genomics labs around the kingdom currently being established.

Figure 2: SHGP researchers working in the KACST Genome Sequencing Lab.

All the project labs are involved in performing NGS to achieve sequencing of the 100,000 genomes. Each is equipped with NGS machines and primary computation power. All the labs follow a standardized procedure for sample collection, banking, processing, and sequencing. The sequencing data are processed through an optimized bioinformatics workflow utilizing a central computer hosted in the KACST. Such a workflow guarantees the quality of data generated by satellite labs and provides a link between all research groups, hospitals, clinicians, and scientists involved.
The genomic variant data will be fully analyzed and used to create a Saudi-specific database that will provide the basis for future development of personalized medicine in the kingdom, representing the most comprehensive effort to identify disease-causing genes for the population of a country and within the Arab world.

Top Medical Genomics Projects

Looking at the landscape of large-scale genome projects in medicine, we can observe a shift from internationally oriented projects to national and regional ones. The 1000 Genomes Project and the Personal Genome Project are two examples of international projects targeting the sequencing of thousands of human genomes. These projects were launched shortly before the widespread availability of low-cost NGS technologies. Projects launched more recently are mostly of a national or regional nature, targeting more individuals within population-specific contexts. Examples of such recent projects include the Exome Sequencing Project (2,440 U.S. individuals), the Iceland Genome Project (2,636 Icelandic individuals), the Genomics England 100,000 Genomes Project (100,000 U.K. individuals), the Million Veteran Program (1 million U.S. veterans), and the SHGP (100,000 Saudi individuals).
Among these projects, we see that the SHGP has some interesting characteristics. The population is homogeneous as in the Iceland project, it has well-defined medical targets as in the Scottish and Exome Sequencing Projects, and it is of large scale as is the case for the Million Veteran Program and Genomics England program. The recruitment of samples is controlled to target relevant cohorts that help in identifying the genetic causes of the disease. “All these points position the SHGP as one of the top international biomedical projects,” said Dr. Brian Meyer, chair of the Genetics Department at KFSHRC.

The Genomics in SHGP – The User of Revolutionary Technology

The study of mutations reveals the causes of many diseases. Many mutation-detection methods rely on the properties of base-pair mismatches between a normal and a mutated DNA strand. Restriction enzyme polymorphisms were the first tools used for genetic diagnosis, in combination with Southern blotting of genomic DNA. This technique was first used to detect mutations related to sickle-cell anemia. It was further modified by the use of polymerase chain reaction. These techniques were able to detect the presence of a mutation but unable to read the DNA sequence. Sanger sequencing was then introduced as a new way to read the DNA sequence, and it became the standard way to detect the mutations underlying Mendelian disorders. Early successes from the application of this method included the identification of the mutations responsible for cystic fibrosis and Huntington’s disease, among others. The Human Genome Project, which announced its working draft in 2000 and was declared complete in 2003, relied on automated Sanger sequencing.
Since the announcement of the sequencing of the human genome, there has been a need to improve the specificity, sensitivity, scalability, speed, and cost-effectiveness of reading DNA. NGS was introduced in 2007 and has since revolutionized the genomic sciences. With the first version of NGS, a single sequencing run could produce a maximum of about 1 GB of data. Four years later, the data output was increased nearly 1,000-fold. NGS enables us to generate a large volume of sequencing data in a matter of days or hours. By comparison, the first human genome sequencing needed ten years to be completed at a cost of about US$3 billion. Today, we can sequence a whole human genome in a few days at a cost of a few thousand U.S. dollars. “The simplicity of NGS technology reduced the amount of overhead for running the lab facility and enabled enormous productivity with reasonable team size,” said Dr. Dorota Monies, head of the KFSHRC Genome Sequencing Unit.
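The pace of improvement described above can be put into perspective with a short calculation. The sketch below assumes that the roughly 1,000-fold increase in output over four years followed steady exponential growth, which is a simplification; the function name is ours and purely illustrative.

```python
import math


def doubling_time_months(fold_increase: float, years: float) -> float:
    """Months needed for output to double, assuming steady exponential growth."""
    if fold_increase <= 1 or years <= 0:
        raise ValueError("fold_increase must exceed 1 and years must be positive")
    doublings = math.log2(fold_increase)  # number of doublings in the period
    return years * 12 / doublings


# Figures quoted in the text: about 1,000-fold growth in four years.
print(f"{doubling_time_months(1000, 4):.1f} months per doubling")
```

Under that assumption, the output per sequencing run doubled roughly every five months.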

Sequencing Workflows in the SHGP

The SHGP offers different sequencing workflows, including whole-genome sequencing (WGS), whole-exome sequencing (WES), and targeted gene sequencing. While WGS covers the whole genome, WES targets only the coding regions (exons of the genes) of the genome. It is estimated that the exome covers 2–3% of the genome. In targeted sequencing, only selected genes (referred to as a gene panel) are sequenced. Targeted sequencing is faster and less costly, but there is a chance of missing disease-causing mutations because the approach does not cover the whole set of genes.
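To give a sense of scale for the difference between WGS and WES, the following sketch converts the 2–3% estimate into base pairs. It assumes a round figure of about 3.1 billion base pairs for the haploid human genome; that constant and the function name are our assumptions, not values taken from the SHGP.

```python
# Assumed round figure for the haploid human genome, in base pairs.
GENOME_BP = 3_100_000_000


def exome_size_bp(fraction: float, genome_bp: int = GENOME_BP) -> int:
    """Base pairs targeted by WES if the exome is the given fraction of the genome."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return round(genome_bp * fraction)


low, high = exome_size_bp(0.02), exome_size_bp(0.03)
print(f"WGS target: about {GENOME_BP / 1e6:.0f} Mb")
print(f"WES target: about {low / 1e6:.0f} to {high / 1e6:.0f} Mb")
```

On these assumptions, WES reads tens of megabases per sample rather than several thousand, which is where much of its speed and cost advantage comes from; a gene panel narrows the target further.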
One of the research objectives of the SHGP is to investigate the use of different gene panels to cover different disease categories, where the gene panel can be selected based on the corresponding phenotype. Therefore, the design and synthesis of 13 custom-made gene panels covering all Online Mendelian Inheritance in Man (OMIM)-documented annotated genes was the approach taken by the SHGP.
Recent publications of the project team covering over 5,000 samples show that the use of the gene panels has many advantages compared to the direct use of WES:

  • a low number of false positives compared to WES
  • a high diagnostic rate
  • a low cost per sample—up to 50 samples can be multiplexed and sequenced together in one run.

These advantages encourage the use of these panels in clinical laboratory settings. “These results are very promising, and we believe it will be soon part of routine clinical practice. It will speed up the diagnostic process and reduce the time taken from months to days,” said Dr. Nada Altassan, head of the KFSHRC Behavioral Genetics Unit.

Big Data and Bioinformatics in SHGP

The SHGP, by the scale and nature of its data, is a typical big data project, where the four “V”s (volume, velocity, variety, and veracity) characterizing big data are present. When running at full capacity, the project will produce 10–15 TB of raw sequence data per day. Therefore, establishing a high-performance and scalable information technology (IT) infrastructure and the use of advanced bioinformatics methods are major components of the SHGP. “The structure of the participating centers and the distribution of the genomic data production and analysis form an interesting IT challenge that is probably the first of its kind worldwide,” said Dr. Mohamed Abouelhoda, head of the SHGP bioinformatics team.
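The daily volume quoted above can be translated into an average network rate and a yearly total. The sketch below is a back-of-the-envelope estimate only: it assumes decimal terabytes (1 TB = 10^12 bytes), a full 365 days of production per year, and, as an upper bound, that every raw byte is moved once. The function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_gbps(terabytes_per_day: float) -> float:
    """Average rate in gigabits per second needed to move a daily volume."""
    bits_per_day = terabytes_per_day * 1e12 * 8  # decimal TB converted to bits
    return bits_per_day / SECONDS_PER_DAY / 1e9


def yearly_volume_pb(terabytes_per_day: float, days: int = 365) -> float:
    """Volume accumulated over a year, in petabytes (1 PB = 1,000 TB)."""
    return terabytes_per_day * days / 1000


for tb in (10, 15):
    print(f"{tb} TB/day: {sustained_rate_gbps(tb):.2f} Gb/s sustained, "
          f"{yearly_volume_pb(tb):.2f} PB per year")
```

Even as a rough estimate, this suggests a sustained rate on the order of 1 Gb/s and several petabytes of raw data per year, which helps explain why storage and data movement are treated as core components of the project.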

Figure 3: The high-performance computer SANAM, one of the top supercomputers worldwide in the green data center in the KACST.

All the labs produce significant amounts of data that should be analyzed and moved to the central storage for large-scale data analysis, with results to be shared among researchers inside and outside the kingdom. While each satellite lab has some computing power to participate in the data analysis, the main computing power for storage and analysis resides in the KACST. The SHGP also has access to the energy-efficient, high-performance computer SANAM, with a performance of 532 TFlops and a high-speed interconnect data rate of 56 Gb/s (Figure 3). “SANAM is one of the top supercomputers worldwide,” said Dr. Abdulqadir Alaqeeli from the KACST SANAM team.
To cope with this distributed IT infrastructure, the SHGP bioinformatics team has developed methods to manage the data and the analysis among the different sites using different computational resources. The transfer of data is prioritized and scheduled to reduce the required bandwidth. The use of commercial cloud computing solutions is also part of the design, to automatically scale the in-house IT resources in response to abrupt computation loads. Collectively, the central and satellite computer resources as well as the automatic extension with commercial cloud solutions work together like a hybrid multicloud system.
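The article does not describe how the SHGP implements transfer prioritization or cloud scaling, so the following is only a minimal sketch of the two ideas, not the team's method. It assumes that each pending transfer carries a numeric priority and a size, that a fixed data budget is available per transfer window, and that compute demand beyond in-house capacity overflows to a commercial cloud. All names, sizes, and capacities are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Transfer:
    priority: int                          # lower value means more urgent
    size_gb: float = field(compare=False)
    name: str = field(compare=False)


def schedule(transfers, budget_gb):
    """Send transfers in priority order until the window's budget is used up.

    Returns (sent_now, deferred) as two lists of transfer names.
    """
    heap = list(transfers)
    heapq.heapify(heap)
    sent, deferred = [], []
    remaining = budget_gb
    while heap:
        item = heapq.heappop(heap)
        if item.size_gb <= remaining:
            sent.append(item.name)
            remaining -= item.size_gb
        else:
            deferred.append(item.name)     # wait for the next window
    return sent, deferred


def split_load(demand_core_hours, local_capacity_core_hours):
    """Fill in-house capacity first; the overflow goes to the cloud."""
    local = min(demand_core_hours, local_capacity_core_hours)
    return local, max(0, demand_core_hours - local_capacity_core_hours)


# Hypothetical queue for one transfer window with a 500 GB budget.
queue = [
    Transfer(2, 400, "site-a/wgs-batch"),
    Transfer(1, 50, "site-b/panel-variants"),
    Transfer(3, 900, "site-c/raw-archive"),
]
print(schedule(queue, budget_gb=500))
print(split_load(120, 100))
```

In this toy run the small, urgent variant file and the WGS batch fit into the window while the large archive is deferred, and 20 of the 120 requested core-hours spill over to the cloud. A production system would also need retries, integrity checks, and access control, which are outside the scope of this sketch.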

Next Actions

Over its course, the project, with its new discoveries, will find the genetic basis of different genetic diseases. The best practices learned will also help in establishing population-scale diagnostic capabilities to bring research results into the clinic on a wide scale. The project will then pave the way for advanced treatment plans using promising technologies like stem cells and gene/genome editing, where the defective components (genes or cells) can be manipulated either by knocking them out or by introducing other nondefective variants that can function well in the living cell. The KACST already has plans to support initiatives in these areas to develop further technologies to leverage the information gained by the SHGP in the near future.
More information about the Saudi Human Genome Program is available online. The SHGP team can be contacted online.