What Specifically Do We Think Students In The Quantitative Aspects Of The
New Biology Should Know?
Isaac S. Kohane
Children's Hospital Informatics Program
Division of Health Sciences and Technology, Harvard University and MIT
Harvard Partners Center for Genetics and Genomics
Motivation
Introduction
Discussions of what constitutes the discipline of bioinformatics, clinical
informatics, computational biology or biomedical informatics are by their
unbounded nature doomed to an inconclusive and unsatisfying outcome. Furthermore,
these discussions only have the most abstract relationship to what research
or educational programs are implemented by investigators. Specifications
of curricula for trainees in these disciplines (which for pragmatic reasons
we will denote in their entirety as biomedical computing solely within the
confines of this paper and without making any claims for the appropriateness
of this term) are helpful, but they leave much open for interpretation and
therefore a large, and perhaps desirable, latitude for specification of minimal
competencies of biomedical computing students/trainees.
It was with this background that I had the pleasure of an informal discussion
with one of the luminaries in biomedical computing regarding our shared and
differing views of the necessary competencies that we expected of our graduate
students and post-doctoral fellows. As we trod along well-worn pathways of
countless similar discussions, the following question occurred:
"What question in a test would you expect/want 70% of the graduates of our
training programs to be able to answer?"
I subsequently posed the same question to the electronic mail list of the
American College of Medical Informatics,
a nominated and elected group of nationally and internationally recognized
investigators in biomedical computing. Within a week I received several replies
for a total of 37 questions, which form the core of this manuscript. The
purpose here is neither to claim that these are a definitive or representative
set nor to be prescriptive in specifying necessary competencies. Rather the
intent is to use these questions as a convenience sample on which to base
further discussions for the very concrete decisions on the content of class
work, seminars and other training mechanisms for the future leaders in biomedical
computing.
The organization of the remainder of the paper is:
1) Review and categorization of the questions
2) Verbatim copies of the questions
3) Discussion of some of the implications of this convenience sample.
As the purpose here is to promote further discussion, readers might ask themselves
the following questions:
Would I expect my student/trainee to know the answer to the question?
If not, would I expect her to understand the question and the explicit or
implicit challenges it refers to?
If I do not expect the student to answer the question, should she know which
expert to ask and how best to pose the question to that expert?
Does this question relate at all to what I understand as biomedical computing?
Of course, the reader can also ask the same questions of himself or herself
as they ask of their students.
1.1 Categorization
Post-hoc, I have created a small taxonomy for these questions. Again, this
taxonomy is not intended to be normative or prescriptive but rather only
a useful set of abstractions for categorizing these particular questions.
Category | Elaboration | Number of questions |
Clinical application | Used in clinical care or research | 22 |
Biology application | Used in basic biology research | 21 |
Knowledge and/or Data Representation | Tradeoffs and optimization involved in representing data and knowledge | 8 |
Probability | Applications of probability theory, hypothesis testing and statistics. | 5 |
Databases | Design, implementation and performance of instances of databases. | 7 |
Management | Working with institutions and people to engineer information and social systems. | 3 |
Standardization | Use of shared terminologies, data models and ontologies. | 3 |
Science Community | How to operate knowledgeably and effectively and pleasantly in the larger academic community. | 3 |
Decision Support | Improving decision-making, avoiding mistakes/errors. | 4 |
Translational research | Bringing knowledge gleaned in basic research to clinical relevance. | 1 |
User Interface/visualization | Interactions with software artifacts and large data sets from the user perspective. | 1 |
Algorithms/Machine learning | Development, optimization and testing of algorithms. Clustering, and classification and other data-driven characterizations of systems. | 7 |
Public Policy | The way in which national and international policies affect biology, medicine and computational methods. | 1 |
Models | Useful abbreviated descriptions of a system that can be used for diagnosis/fault analysis and/or prediction/prognosis. | 3 |
1.2 The questions
The questions returned to me by the members of the American College of Medical
Informatics are reproduced verbatim here with correction of only spelling.
The terminology used in these questions varies and is not identical across
questions but that is part of the inherent challenge of understanding these
questions developed by experts.
- At the request of your hospital's Board of Directors,
the CIO is
preparing a presentation on implementing an electronic medical record that
includes CPOE. Because of your biomedical informatics expertise, the CIO
asks you to prepare a slide listing the 6 most important factors that predict
success of large clinical computing initiatives, with 1 or 2 sentences
describing each factor. She needs the slide in 1 hour.
- After graduation from your informatics
training, you are hired by a forward-looking moderate-size academic health
care center to introduce clinical decision support and quality/safety
measures into their care delivery system. Your first attempt to do so,
a stand-alone guideline system that can be consulted about how to implement
patient-specific guidelines (eligibility screening and therapy planning)
that leverages calls to the clinical data repository and the ADT demographics
systems, is only
used by 5% of all clinicians, and then only sparingly. What actions would
you take at this point: (a) first list possible problems; (b) list steps
you would take to "diagnose" the problem; (c) list several possible
solutions to the problems listed in (a); and, (d) explain how you would
implement one
or more of the proposed solutions in a way to improve the chances of success.
What stakeholders in the institution should be consulted at what stages
of addressing the problem, and how? How important are guideline "standards" such
as GLIF and GEM in addressing such "local" issues? What aspects
of how you approach the problem would be of "general" interest
to the informatics community (i.e. lessons learned)?
- (take-home essay) Write a 4-5 page small grant
proposal to investigate an exciting single-topic single-facet informatics
issue of interest to you. Include Specific Aims, Background summary of
relevant previous work, Methods (including data collection, data quality,
data analysis, tests for "significance"), "Potential weaknesses",
and, if relevant, "Humans Subjects/ HIPAA Issues".
- Name five exemplary institutions with ongoing
informatics (clinical or bioinformatics) activities. Develop a rating
system that would allow your friend's younger sister to determine criteria
for deciding to which program to apply for her training, and use the
criteria to rate the five programs you listed.
- Describe 3 different approaches taken by past
system developers to computer-assisted clinical diagnosis. Discuss strengths
and weaknesses of each approach. Describe associated knowledge base construction
and maintenance issues for each system as part of your evaluation.
- Compare the strengths and weaknesses of hierarchical,
relational, and object-oriented database models. Describe which model
you would use for 1) a patient-oriented clinical information system optimized
for rapid retrieval and 2) a ten year observational NIH-funded study
to determine gene expression profiles that predict clinical outcomes
in prostate cancer. Justify your choice.
- Name five types of medical errors that
impact on patient safety that are amenable to reduction or elimination
by use of information technologies. Describe the sources of information
needed to reduce each type of error, and the issues related to design
and deployment of the error-reducing function.
- Describe the major modes of database integration,
including
federation, data warehouses, and what the challenges in biomedical environments
are to their use and implementation.
- Comment on whether current clinical data repositories
are
potential goldmines for genetic/molecular medical discoveries or
tangled messes unsuitable for discovery. Support your comment with facts.
- Describe the ways in which model organism genomes
will impact
human health. Why are they worth sequencing, what discoveries will they
enable using informatics, and how can informatics translate genomic comparisons
into comparisons in the phenotype domain?
- Describe the advantages and disadvantages of
an 'integrated' vs.
'interfaced' approach for a clinical systems architecture in a mid-sized
academic medical center with ambulatory care clinics. Draw a potential
architecture for each approach. For each data path, describe the relevant
standards to be used. Discuss issues surrounding costs, functionality,
performance, scale, flexibility, and ease of implementation and management.
- Discuss critical design issues in the user interface
for an
electronic health record. Consider at least three user models and describe
how the application functionality supports their workflow. Describe how
the user interface supports (or not as appropriate) the collection of standardized
(controlled) terminology and data.
- Discuss the trade-offs in at least three different
approaches to
generalized inference in the context of producing a differential diagnosis
based upon a set of input observations. Be sure to address issues surrounding
knowledge engineering and knowledge acquisition, uncertainty management,
potential biases, heuristics, and computational complexity.
- Describe a current public policy development
that is likely to affect the ability to build and deploy applications
that provide effective decision support to clinicians and patients at
the point of need. What are the major issues under discussion or debate?
What are the implications for informatics applications? How can you follow
and influence the discussion?
- Under what circumstances is it legitimate to
create a new biomedical vocabulary?
- What is a medical informaticist or biomedical
informaticist. Explain the qualifications, responsibilities and
expected capabilities.
- Consider the tasks of human-mouse genomic sequence
alignment to identify conserved non-coding regions and alignment of human
ESTs to the human genome to identify coding regions and candidate cSNPs,
in both cases using a BLAST like algorithm.
a) Discuss the tradeoffs in parameterizing each search, the effect of parameter
choices on search sensitivity and specificity, anticipated computing system
requirements and the relative compute time for the two tasks.
b) Describe three nonrandom features of mammalian genomic sequence, how
these will influence the interpretation of your search results, and modifications
to your search strategy that you will introduce to accommodate nonrandom
structure in genomic sequence.
c) Is EST to genomic sequence alignment a good way to identify cSNPs? Discuss
the expected sensitivity and specificity for identifying an ancient cSNPs
using this strategy. How will the presence of paralogous genes alter the
interpretation of your results?
- Design a relational database schema to store
the results of the cSNP search described in problem 1 that will allow
you to execute the following queries:
- find the candidate cSNPs in a given genomic interval
- find all of the CGAP libraries whose donor carries a particular cSNP.
What strategies could you use to make the processing of these queries efficient?
- You hypothesize that ontogeny determines gene
expression patterns and that tissues of similar embryonic origin will
have similar gene expression profiles. Assuming that this hypothesis
is correct, describe a computational strategy using the tissue specific
gene expression data of Su et al, to build a "molecular ontogeny".
Be precise in describing your calculation and justify the decisions you
make.
Su AI, Cooke MP, Ching KA, et al (2002) "Large-scale analysis of the
human and mouse transcriptomes." PNAS 99(7):4465-70. PMID:11904358
- You are asked to comment on the claims of value
realized by an ambitious and comprehensive clinical information system
encompassing both the hospital and a large group practice. The financial
returns are claimed to be in the 10s of millions of dollars. The primary
contributors are reduction in practice variation, reduction in ADEs,
reduction in decubiti, and a wide variety of smaller dollar amounts surrounding
pharmacy costs, labor costs in HIM, chart pulls formerly required for
prescription refills, nursing productivity, and reduction of duplicate
tests. Qualitative factors - including nursing and physician satisfaction
- are also included.
Armed with this information, the CEO has taken these savings into the 5-year
budget and has cut the relative increase in budget allocation to compensate
for the savings espoused by this report. You are asked: how do we measure
these savings? How do we measure the value? How much will it cost to perform
these measurements, which systems must be implemented to realize these
benefits, and over what time are we expected to see benefit.
Discuss, in the 10 minutes remaining in your conversation, an approach
to thinking about the benefits realized by clinical systems and describe
the recommendations you would make to the CEO concerning his budget.
- What are the state of the art approaches to
indexing and retrieval of knowledge-based information? What are the major
limitations of these approaches and research thrusts trying to overcome
those limitations?
- Describe the ways that informatics applications
can be used to facilitate the use of evidence-based medicine techniques
in real clinical practice.
- Describe the spectrum of practice for biomedical
informatics, from the academic to the applied? What are the job roles
informaticians are likely to take and how are they best trained?
- Tell me as much as you can about this amino
acid sequence. For
example, can you find its name, organism, structure, function,
relevance (if any) to human health, etc. Describe the web sites
and/or other resources you used to find the information.
MLMASTTSAVPGHPSLPSLPSNSSQERPLDTRDPLLARAELALLSIVFVAVALSNGLVLAALARRGRRGHWAPIHVFIGHLCLADLAVALFQVLPQLAWKATDRFRGPDALCRAVKYLQMVGMYASSYMILAMTLDRHRAICRPMLAYRHGSGAHWNRPVLVAWAFSLLLSLPQLFIFAQRNVEGGSGVTDCWACFAEPWGRRTYVTWIALMVFVAPTLGIAACQVLIFREIHASLVPGP
SERPGGRRRGRRTGSPGEGAHVSAAVAKTVRMTLVIVVVYVLCWAPFFLVQLWAAWDPEAPLEGAPFVLLMLLASLNSCTNPWIYASFSSSVSSELRSLLCCARGRTPPSLGPQDESCTTASSSLAKDTSS
- What is a statistical correction for multiple
testing? Under what circumstances is one required? What effect does a
multiple testing correction have on statistical power? Name the classical
multiple testing correction and at least two more recent alternative
approaches, and describe the strengths and weaknesses of each.
- Consider a hidden Markov model for a family
of protein sequences. What is hidden? What is the Markovian property?
What is the difference between first order Markovian and second order
Markovian processes? Sketch a state and transition diagram for an HMM
with two conserved residues, labeling each state, and identifying the
parameters of the model. What data and algorithms are required to fit
such a model?
- In ab initio protein structure prediction, an empirical
energy
function is often minimized.
A. Provide a citation for an energy function used in this
sort of work. What are the non-bond terms in that function?
What biophysical forces do they account for?
B. The global minimum energy conformation might not provide all of the
structural information about a protein that is relevant to its function. Why
not? Describe two computational approaches, which provide more information than
just the minimum energy. What additional information do they provide? What are
the advantages and disadvantages of each?
- Programming question (everyone must answer this
question): Below is a dataset describing protein-protein interactions.
Each line in the dataset begins with a symbol (no spaces) identifying
a protein-protein interaction assay, followed by a pair of IDs of proteins
that are asserted to interact by a particular experiment using that assay,
all separated by whitespace e.g.
yeast-two-hybrid p1 p4
yeast-two-hybrid p1 p5
yeast-two-hybrid p2 p3
yeast-two-hybrid p2 p3
yeast-two-hybrid p2 p4
immune-coprecipitation p1 p5
immune-coprecipitation p3 p4
x-ray-crystallography p3 p4
However, the assays are noisy, so we want to provide a way to calculate
the reliability of a relationship. You are provided with a set of rules to calculate
what is (or isn't) reliable. However, different people have differing beliefs
about which rules to use, so your code has to readily accept different rule sets.
Example rules might be:
* If two different assays agree, then the interaction is reliable.
* x-ray-crystallography is always reliable.
* One yeast-two-hybrid experiment is not reliable, but if two such
experiments have the same result, it is reliable.
Write a program that reads in two files: one of experimental results,
and one of rules, and outputs a list of reliable interactions. You may
define whatever declarative representation you like for the rules, but it must
be powerful enough to express the above examples. Here is a lisp-y approach to
representing the rules that you may use if you like:
; If two different assays agree, then the interaction is reliable.
(test (?interaction1 ?interaction2)
(and (same (proteins ?interaction1) (proteins ?interaction2))
(not (equal (assay ?interaction1) (assay ?interaction2))))
(reliable (proteins ?interaction1)))
; x-ray-crystallography is always reliable.
(test (?interaction)
(equal 'x-ray-crystallography (assay ?interaction))
(reliable (proteins ?interaction)))
; One yeast-two-hybrid experiment is not reliable, but if two such
; experiments have the same result, it is reliable.
(test (?interaction1 ?interaction2)
(and (equal 'yeast-two-hybrid (assay ?interaction1))
(equal 'yeast-two-hybrid (assay ?interaction2))
(same (proteins ?interaction1) (proteins ?interaction2)))
(reliable (proteins ?interaction1)))
Discuss the efficiency of your program with respect to both experiments
and rules. What would be necessary to scale your solution to large databases
and/or complex rules?
Think about the problem, and describe the issues involved and your algorithm
before you start coding. Use good software engineering practices, and comment
your code.
- What is the purpose of the Rocke-Durbin transformation
of gene
expression array data? Why is that important? What two parameters need
to be estimated in order to use the transformation? How are they estimated?
- Cann et al. (Nature 325: 31-36; 1987) took mitochondrial
DNA from a diverse set of humans. Based on the divergence among genomes
and an estimate of the sequence mutation rate, they inferred that the
mitochondria in their sample had their most recent common ancestor about
200,000 years ago. Assuming human generation time is 20 years, and that
there are many alleles of this mitochondrial DNA, use Coalescent analysis
to estimate the effective population size Ne. How would this
estimate be affected if there were only two alleles in the sample?
- What does the Hannenhalli-Pevzner algorithm
do? What is its computational complexity? What is the computational complexity
of the unsigned variant of the problem the algorithm solves? What biological
conclusions does one draw from its application?
- Compare and contrast database federations versus
data warehousing, and describe the significance of these two methods
in bioinformatics. Describe some general challenges to data integration,
and some challenges that are specific to each technique.
- Name two potential sources of errors
in the assembly of shotgun genomic sequence, and describe how they
can be addressed. Compare and contrast Kent's GigAssembler and Myers'
Celera assembler.
- Articulate the distinctions and common themes
across nomenclatures, data models, and ontologies. For extra credit describe
one of each from clinical informatics and bioinformatics.
- How many probabilities would you have to estimate
for a test for prostate cancer that involved 4 independent biochemical
tests each with three values (low, normal, high). Explain how many probabilities
you would have to estimate if the order of the test mattered and how
many if the order did not.
- Explain how you would decide which metric to
use (e.g. correlation coefficient, Euclidean distance, mutual information)
for a clustering algorithm applied to a) biosurveillance data obtained
from emergency room data b) RNA expression data from a DNA microarray?
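To make the scope of the programming question above more concrete, here is a minimal Python sketch of one possible approach. It is not part of any respondent's question or answer. The rule syntax (`distinct-assays N`, `assay-count ASSAY N`) is a hypothetical declarative format invented for this illustration, chosen only because it can express the three example rules. The sketch also assumes that each line of the dataset is a separate experiment, that protein pairs are unordered, and, for self-containment, that the two inputs are passed as text rather than read from files.

```python
from collections import defaultdict

# Example dataset copied from the question (one experiment per line).
EXPERIMENTS = """\
yeast-two-hybrid p1 p4
yeast-two-hybrid p1 p5
yeast-two-hybrid p2 p3
yeast-two-hybrid p2 p3
yeast-two-hybrid p2 p4
immune-coprecipitation p1 p5
immune-coprecipitation p3 p4
x-ray-crystallography p3 p4
"""

# Hypothetical rule syntax (an assumption of this sketch, not from the question).
RULES = """\
distinct-assays 2                     # two different assays agree
assay-count x-ray-crystallography 1   # always reliable
assay-count yeast-two-hybrid 2        # needs two concordant experiments
"""


def parse_experiments(text):
    """Map each unordered protein pair to {assay: number of experiments}."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        assay, a, b = fields
        counts[tuple(sorted((a, b)))][assay] += 1
    return counts


def parse_rules(text):
    """Turn rule lines into tuples; '#' starts a comment."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if fields[0] == "distinct-assays" and len(fields) == 2:
            rules.append(("distinct-assays", int(fields[1])))
        elif fields[0] == "assay-count" and len(fields) == 3:
            rules.append(("assay-count", fields[1], int(fields[2])))
        else:
            raise ValueError(f"unrecognized rule: {line!r}")
    return rules


def is_reliable(assay_counts, rules):
    """A pair is reliable if at least one rule is satisfied."""
    for rule in rules:
        if rule[0] == "distinct-assays" and len(assay_counts) >= rule[1]:
            return True
        if rule[0] == "assay-count" and assay_counts.get(rule[1], 0) >= rule[2]:
            return True
    return False


def reliable_interactions(experiment_text, rule_text):
    """Return the sorted list of protein pairs judged reliable."""
    counts = parse_experiments(experiment_text)
    rules = parse_rules(rule_text)
    return sorted(pair for pair, by_assay in counts.items()
                  if is_reliable(by_assay, rules))


if __name__ == "__main__":
    for pair in reliable_interactions(EXPERIMENTS, RULES):
        print(*pair)
```

Under these assumptions the experiments are aggregated in a single pass, and each protein pair is then checked against each rule, so the work grows roughly with the number of experiments plus the number of pairs times the number of rules. Rules that relate arbitrary combinations of interactions, as the lisp-y representation in the question permits, would need a more general matcher than this sketch provides.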
1.3 Discussion
This small sample of questions confirms, if there was any doubt, the breadth
of competencies expected by various scientific constituencies of students
in biomedical computing. It ranges from clinical applications to those
in basic biology, and from probability theory to management pragmatics. These
questions ask of the student the judgment to manage top-level decision-makers
to effect technology diffusion in hide-bound healthcare systems and the knowledge
and mathematical skills to understand the limitations of microarray normalization
techniques and those of evolutionary models. These questions also presuppose
a broad and deep system engineering and knowledge representation experience
with the ability to readily program in a variety of computer languages.
Is this a reasonable set of expectations? It could be argued that for society to successfully
advance our understanding of basic biology and how to translate it into
improved national and international health, this full range of expertise
is necessary (1) (2). For example, to translate clues from evolutionary
biology into a new genomic diagnostic (3) and then to incorporate this
diagnostic into routine clinical workflow (4) touches on a similarly broad
set of scientific and engineering challenges. The question then arises,
in this era of increasing multidisciplinary expertise: Is it necessary
for one individual to have substantive knowledge of all these areas? (5)
Or is it only necessary for the individual to work successfully as part
of a team of investigators each with their own expertise? (6) If the latter
path is chosen then it may only require incremental changes to existing
educational programs, with introductory or survey courses that provide
recognition of the nature of the complementary sets of expertise. Perhaps
because of the natural inertia of our educational systems, or insight into
the goals of our students, this is the path that has been taken by most
of the current programs and the new educational programs in bioinformatics,
systems biology or computational biology. However, if the former path is
taken, then this will require a radical rethinking of our entire undergraduate
and graduate curricula. It would require both the devising of several new
courses and the mandating of the adoption of in-depth existing specialized
classes in several disparate disciplines. This in turn would engender a
number of "zero-sum" considerations of tradeoffs in the investment of the
students or fellows' limited years of formal education, and investment
on the part of our colleges and universities in faculty development and
faculty time in these new courses.
Inspection of existing or recommended curricular content in biomedical informatics
suggests that some have made the decision that specialization within one
segment of the spectrum of competencies touched upon by the above questions
is preferable and/or practical or most relevant to a specific constituency
(7) (8). Yet at least some programs have attempted to integrate the
full breadth of biomedical computing expertise touched upon by the questions
listed above (9). Whether they have been successful or whether the goal is
appropriate in the first instance (10) (11) will certainly be the subject
of much further discussion.
References
1. Gurwitz D, Weizman A, Rehavi M. Education:
Teaching pharmacogenomics to prepare future physicians and researchers for
personalized medicine. Trends Pharmacol Sci 2003;24(3):122-5.
2. Ford JH, 2nd, Turner A, Yoshii A. Information
requirements of genomics researchers from the patient clinical record. J
Healthc Inf Manag 2002;16(4):56-61.
3. Eichenbaum-Voline S, Olivier M, Jones EL,
Naoumova RP, Jones B, Gau B, et al. Linkage and association between distinct
variants of the APOA1/C3/A4/A5 gene cluster and familial combined hyperlipidemia.
Arterioscler Thromb Vasc Biol 2004;24(1):167-74.
4. Kaushal R, Shojania KG, Bates DW. Effects
of computerized physician order entry and clinical decision support systems
on medication safety: a systematic review. Arch Intern Med 2003;163(12):1409-16.
5. Piko BF, Stempsey WE. Physicians of the
future: Renaissance of polymaths? J R Soc Health 2002;122(4):233-7.
6. Vena JE, Weiner JM. Innovative multidisciplinary
research in environmental epidemiology: the challenges and needs. Int J Occup
Med Environ Health 1999;12(4):353-70.
7. Dyer BD, LeBlanc MD. Meeting report: Incorporating
genomics research into undergraduate curricula. Cell Biol Educ 2002;1(4):101-4.
8. Moffett SE, Menon AS, Meites EM, Kush S,
Lin EY, Grappone T, et al. Preparing doctors for bedside computing. Lancet
2003;362(9377):86.
9. Altman RB. The interactions between clinical
informatics and bioinformatics: a case study. J Am Med Inform Assoc 2000;7(5):439-43.
10. Kulikowski CA. The micro-macro spectrum of medical
informatics challenges: from molecular medicine to transforming health care
in a globalizing society. Methods Inf Med 2002;41(1):20-4.
11. Kohane IS. Bioinformatics and clinical informatics:
the imperative to collaborate [comment] [editorial]. J Am Med Inform Assoc
2000;7(5):512-6.