Data Mining, Bioinformatics, Computational Biology
Biological Data Mining
(Chapman & Hall/CRC Data Mining and Knowledge Discovery Series)
September 2009, ISBN: 978-1-4200-8684-3
Synopsis
This 733-page book examines the concepts, problems, progress, and
trends in developing and applying data mining techniques in genome
biology, a rapidly growing field of study. By studying the concepts
and case studies presented in the book, readers can gain significant
insight and develop practical solutions in future biological data
mining projects.
Editors
Prof. Jake Y. Chen
Indiana University School of Informatics
Purdue University School of Science Department of Computer and Information Science
Indiana Center for Systems Biology and Personalized Medicine
Indianapolis, IN 46202
USA
Email: jakechen@iupui.edu
Web site: http://bio.informatics.iupui.edu/
Prof. Stefano Lonardi
Department of Computer Science and Engineering
Institute for Integrative Genome Biology
Center for Plant Cell Biology
University of California
Riverside, CA 92521
USA
Email: stelo@cs.ucr.edu
Web site: http://www.cs.ucr.edu/~stelo/
Chapters
1. "Consensus Structure Prediction for RNA Alignments" by Junilda
Spirollari and Jason T.L. Wang
2. "Invariant Geometric Properties of Secondary Structure Elements in
Proteins" by Matteo Comin, Concettina Guerra, and Giuseppe Zanotti
3. "Discovering 3D Motifs in RNA" by Alberto Apostolico, Giovanni
Ciriello, Christine E. Heitsch, and Concettina Guerra
4. "Protein Structure Classification Using Machine Learning Methods"
by Yazhene Krishnaraj and Chandan Reddy
5. "Protein Surface Representation and Comparison: New Approaches in
Structural Proteomics" by Lee Sael and Daisuke Kihara
6. "Advanced Graph Mining Methods for Protein Analysis" by Yi-Ping Phoebe
Chen, Jia Rong, and Gang Li
7. "Predicting Local Structure and Function of Proteins" by Huzefa
Rangwala and George Karypis
8. "Computational Approaches for Genome Assembly Validation" by
Jeong-Hyeon Choi, Haixu Tang, Sun Kim, and Mihai Pop
9. "Mining Patterns of Epistasis in Human Genetics" by Jason H. Moore
10. "Discovery of Regulatory Mechanisms from Gene Expression Variation by
eQTL Analysis" by Yang Huang, Jie Zheng, and Teresa M. Przytycka
11. "Statistical Approaches to Gene Expression Microarray Data
Preprocessing" by Megan Kong, Elizabeth McClellan, Richard
H. Scheuermann, and Monnie McGee
12. "Application of Feature Selection and Classification to Computational
Molecular Biology" by Paola Bertolazzi, Giovanni Felici, and Giuseppe
Lancia
13. "Statistical Indices for Computational and Data-Driven Class Discovery
in Microarray Data" by Raffaele Giancarlo, Davide Scaturro, and
Filippo Utro
14. "Computational Approaches to Peptide Retention Time Prediction for
Proteomics" by Xiang Zhang, Cheolhwan Oh, Catherine P. Riley, Hyeyoung
Cho, and Charles Buck
15. "Inferring Protein Functional Linkage Based on Sequence Information
and Beyond" by Li Liao
16. "Computational Methods for Unraveling Transcriptional Regulatory
Networks in Prokaryotes" by Dongsheng Che and Guojun Li
17. "Computational Methods for Analyzing and Modeling Biological Networks"
by Nataša Pržulj and Tijana Milenković
18. "Statistical Analysis of Biomolecular Networks"
by Jing-Dong J. Han and Chris J. Needham
19. "Beyond Information Retrieval: Literature Mining for Biomedical
Knowledge Discovery" by Javed Mostafa, Kazuhiro Seki, and Weimao Ke
20. "Mining Biological Interactions from Biomedical Texts for Efficient
Query Answering" by Muhammad Abulaish, Lipika Dey, and Jahiruddin
21. "Ontology-Based Knowledge Representation of Experiment Metadata in
Biological Data Mining" by Richard H. Scheuermann, Megan Kong, Carl
Dahlke, Jennifer Cai, Jamie Lee, Yu Qian, Burke Squires, Patrick Dunn,
Jeff Wiser, Herb Hagler, Barry Smith, and David Karp
22. "Redescription Mining and Applications in Bioinformatics" by Naren
Ramakrishnan and Mohammed J. Zaki
23. "Data Mining Tools and Techniques for Identification of Biomarkers for
Cancer" by Mick Correll, Simon Beaulah, Robin Munro, Jonathan Sheldon,
Yike Guo, and Hai Hu
24. "Cancer Biomarker Prioritization: Assessing the in vivo Impact of in
vitro Models by in silico Mining of Microarray Database, Literature,
and Gene Annotation" by Chia-Ju Lee, Zan Huang, Hongmei Jiang, John
Crispino, and Simon Lin
25. "Biomarker Discovery by Mining Glycomic and Lipidomic Data" by Haixu
Tang, Mehmet Dalkilic, and Yehia Mechref
26. "Data Mining Chemical Structures and Biological Data" by Glenn
J. Myatt and Paul E. Blower
Preface of the Book
Modern biology has become an information science. Since the invention
of the DNA sequencing method by Sanger in the late seventies, public
repositories of genomic sequences have been growing exponentially,
doubling in size every sixteen months - a rate often compared to the
growth of semiconductor transistor densities in CPUs known as Moore's
Law. In the nineties, the public-private race to sequence the human
genome further intensified the fervor to generate high-throughput
biomolecular data from highly parallel and miniaturized
instruments. Today, sequencing data from thousands of genomes,
including plant, mammalian, and microbial genomes, are accumulating at
an unprecedented rate. The advent of second-generation DNA sequencing
instruments, high-density cDNA microarrays, tandem mass spectrometers,
and high-power NMRs has fueled the growth of molecular biology into
a wide spectrum of disciplines such as personalized genomics,
functional genomics, proteomics, metabolomics, and structural
genomics. Few experiments in molecular biology and genetics performed
today can afford to ignore the vast amount of biological information
accessible publicly. Suddenly, molecular biology and genetics have
become data rich.
Biological data mining is a data-guzzling turbo engine for
post-genomic biology, driving the competitive race towards
unprecedented biological discovery opportunities in the 21st
century. Classical bioinformatics emerged from the study of
macromolecules in molecular biology, biochemistry and
biophysics. Analysis, comparison, and classification of DNA and
protein sequences were the dominant theme of bioinformatics in the
early nineties. Machine learning mainly focused on predicting gene
and protein functions from their sequences and structures. The
understanding of cellular functions and processes underlying complex
diseases was out of reach. Bioinformatics scientists were a rare
breed, and their contribution to molecular biology and genetics was
considered marginal, because the computational tools available then
for biomolecular data analysis were far more primitive than the array
of experimental techniques and assays that were available to life
scientists. Today, we are witnessing the reversal of these past
trends. Diverse sets of data types that cover a broad spectrum of
genotypes and phenotypes, particularly those related to human health
and diseases, have become available. Many interdisciplinary
researchers, including applied computer scientists, applied
mathematicians, biostatisticians, biomedical researchers, clinical
scientists, and biopharmaceutical professionals, have discovered in
biology a gold mine of knowledge leading to many exciting
possibilities: unraveling the tree of life, harnessing the power of
microbial organisms for renewable energy, finding new ways to diagnose
disease early, and developing new therapeutic compounds that save
lives. Much of the experimental high-throughput biology data are
generated and analyzed 'in haste', leaving plenty of
opportunities for knowledge discovery even after the original data are
released. Most of the bets on the race to separate the wheat from the
chaff have been placed on biological data mining techniques. After
all, when easy, straightforward, first-pass data analysis hasn't
yielded novel biological insights, data mining techniques must be able
to help - or so many presumed.
In reality, biological data mining is still very much an 'art',
successfully practiced by a few bioinformatics research groups that
occupy themselves with solving real-world biological problems. Unlike
data mining in business, where the major concerns are often related to
the bottom line - profit, the goals of biological data mining can be
as diverse as the spectrum of biological questions that exist. In the
business domain, association rules discovered between sales items are
immediately actionable; in biology, any unorthodox hypothesis produced
by computational models must first be red-flagged and is lucky to be
validated experimentally. In the internet business domain,
classification, clustering, and visualization of blogs, network
traffic patterns, and news feeds add significant value for regular
internet users who are unaware of high-level patterns that may exist
in the data set; in molecular biology and genetics, any clustering or
classification of the data presented to biologists may promptly elicit
questions like 'great, but how and why did it happen?' or 'how can you
explain these results in the context of the biology I know?' The
majority of general-purpose data mining techniques do not take into
consideration prior domain knowledge of the biological problem,
leading them to often underperform hypothesis-driven biological
investigative techniques. The high level of variability of
measurements inherent in many types of biological experiments or
samples, the general unavailability of experimental replicates, the
large number of hidden variables in the data, and the high correlation
of biomolecular expression measurements also constitute significant
challenges in the application of classical data mining methods in
biology. Many biological data mining projects are attempted and then
abandoned, even by experienced data mining scientists. In the extreme
cases, large-scale biological data mining efforts are jokingly labeled
as fishing expeditions and dismissed in national grant proposal
review panels.
This book represents a culmination of our past research efforts in
biological data mining. Through this book, we wanted to showcase a
small but noteworthy sample of successful projects involving data
mining and molecular biology. Each chapter of the book is authored by
a distinguished team of bioinformatics scientists whom we invited to
offer the readers the widest possible range of application domains. To
ensure high quality standards, each contributed chapter went through
standard peer review and a round of revisions. Contributed chapters
have been grouped into five major sections. The first section,
entitled Sequence, Structure, and Function, collects contributions on
data mining techniques designed to analyze biological sequences and
structures with the objective of discovering novel functional
knowledge. The second section, on Genomics, Transcriptomics, and
Proteomics, contains studies addressing emerging large-scale data
mining challenges in analyzing high-throughput 'omics' data. The
chapters in the third section, entitled Functional and Molecular
Interaction Networks, address emerging system-scale molecular
properties and their relevance to cellular functions. The fourth
section is about Literature, Ontology, and Knowledge Integration, and
it collects chapters related to knowledge representation, information
retrieval, and data integration for structured and unstructured
biological data. The contributed works in the fifth and last section,
entitled Genome Medicine Applications, address emerging biological
data mining applications in medicine.
We believe this book can serve as a valuable guide to the field for
graduate students, researchers, and practitioners. We hope that the
wide range of topics covered will allow readers to appreciate the
extent of the impact of data mining in molecular biology and
genetics. For us, research in data mining and its applications to
biology and genetics is fascinating and rewarding. It may even help
save human lives one day. This field offers great opportunities and
rewards if one is prepared to learn molecular biology and genetics,
design user-friendly software tools under the proper biological
assumptions, and validate all discovered hypotheses rigorously using
appropriate models.
Jake Y. Chen and Stefano Lonardi
At Indianapolis, IN and Riverside, CA, USA
February 2009
Labels: announcement, Biological Data Mining, book