Biomedical Data Science: Mining and Modeling

Course Description

Rapid developments in biotechnology and information technology are changing the way that biomedical scientists interact with data. Traditionally, data were the end result of laborious experimentation, and their interpretation mostly involved careful thought and background knowledge. Today, data are increasingly generated much earlier in the scientific workflow and are much larger in scale. Also, before the data can be interpreted, extensive computational processing is often necessary. Thus, the data deluge in biomedicine now requires mining and modeling on a large scale, i.e., biomedical data science.

This course aims to equip students with some of the concepts and skills relevant to biomedical data science, with an emphasis on bioinformatics, a sub-discipline of this broader field, through examples of mining and modeling of genomic and proteomic data. More specifically, bioinformatics encompasses the analysis of gene sequences, macromolecular structures, and functional genomics data on a large scale. It represents a major practical application for modern techniques in data mining and simulation. Specific topics to be covered include sequence alignment, large-scale processing, next-generation sequencing data, comparative genomics, phylogenetics, biological database design, geometric analysis of protein structure, molecular-dynamics simulation, biological networks, mining of functional genomics data sets, and machine learning approaches for data integration.

Course Survey

If you are taking the class, please fill out BOTH of these surveys by 2/1:

survey 1: https://forms.gle/G4FoHkRG34kBMist9

survey 2: https://forms.gle/v7xAdzR8L5LAtJue9

Overall Flow of the Class

(Module = Group of Lectures)

Introduction
Module on “the Data” (Genomic, Proteomic & Structural Data), introducing the main data sources (their properties, where to access them, etc.). This module also includes discussion of databases and knowledge representation issues.
Module on Mining (Alignment & variant calling necessary for personal genomics; Basic multi-omics calculations; Supervised & unsupervised mining approaches towards multi-omic data; Networks)
Module on Molecular Modeling

Lectures

MW 1:00 - 2:15 PM, BASS305. All lectures will be recorded. Recordings will be available in Canvas a few minutes after each lecture.
The first 3 lectures and the first discussion section will be held remotely on Zoom. The Zoom link can be found here

Discussion Section

F 10:00-11:00 AM or F 1:00-2:00 PM, BASS405

Different headings for this class (5 variants)

CBB 752 / CPSC 752 - Grad. with programming
- This graduate-level version of the course consists of lectures, in-class tests, discussion section, programming assignments, and a final programming project.
MB&B 752 / MCDB 752 - Grad. without programming
- This graduate-level version of the course consists of lectures, in-class tests, discussion section, written problem sets, and a final project (a semi-computational section and a literature survey). Unlike CBB 752, there is no programming required.
MB&B 753b3 / MB&B 754b4 - Modules
- For graduate students, the course can be broken up into two “modules” (each counting 0.5 credit towards the MB&B course requirement):
- 753 - Biomedical Data Science: Mining (1st half of term)
- 754 - Biomedical Data Science: Modeling (2nd half of term)
- Each module consists of lectures, in-class tests, written problem sets, and a final, graduate-level written project that is half the length of the full course’s final project.
MB&B 452 / MCDB 452 - Undergrad.
- This undergraduate version of the course consists of lectures, in-class tests, discussion section, written problem sets, and a final project (a semi-computational section and a literature survey). The programming assignments from CBB 752 can be substituted for the written work with the permission of the instructor.
S&DS 352 - Undergrad.
- This undergraduate version of the course consists of lectures, in-class tests, discussion section, programming assignments, and a final programming project.
Auditing
- Auditing is allowed, but we would strongly prefer that you register for the class.

Prerequisites

The course is keyed towards CBB graduate students as well as advanced undergraduates and graduate students wishing to learn about the types of large-scale quantitative analysis that whole-genome sequencing and other forms of large-scale biological data will make possible. It would also be suitable for students from other fields, such as computer science, statistics, or physics, wanting to learn about an important new biological application for computation.

Students should have:

A basic knowledge of biochemistry and molecular biology.
A knowledge of basic quantitative concepts, such as single-variable calculus and basic probability & statistics, as well as basic programming skills.

These prerequisites can be fulfilled by MBB 200 and Mathematics 115, or by permission of the instructor.

Class materials

There is no textbook for this class. PPT slides will be available after the lectures. We recommend Biochemistry by Lubert Stryer for the biochemistry prerequisite.

Class Requirements

Discussion Section / Readings

Papers will be assigned throughout the course. These papers will be presented and discussed in weekly 60-minute sections with the TFs. A brief summary (a half-page per article) should be submitted at the beginning of each discussion section.

In-class tests: Quiz

There will be a quiz covering the 1st half of the course.
There will be a quiz covering the 2nd half of the course.

Quizzes will comprise simple questions that you should be able to answer from the lectures plus the main readings.

For reference, please refer to the previous Quiz Archive

Programming Assignments (Req’d for CBB and CS grad. students)

There will be two homework assignments. We will try to promote reproducible research and the use of a version control system, specifically GitHub, to facilitate homework submission.

Non-programming Assignments

There will be two equivalent homework assignments, intended particularly for MB&B and MCDB students without a programming background. The programming part will be replaced with assignments involving the use of web-based tools or essay questions.

Pages from previous years

2022 Spring is the 25th time Bioinformatics has been taught at Yale. Pages for the 24 previous iterations of the class are available. Look at how things evolve! (Enrollment stats)
2021 Spring - (Enrollment stats)
2020 Spring - (Enrollment stats)
2019 Spring - (Enrollment stats)
2018 Spring - (Enrollment stats)
2017 Spring - (Enrollment stats)
2016 Spring - (Enrollment stats)
2015 Spring - (Enrollment stats)
2014 Spring - (Enrollment stats)
2012 Fall - (Enrollment stats)
2012 Spring - (Enrollment stats)
2011 Spring
2010 Spring
2009 and earlier (12 years of classes, starting in ‘98) (Note: the pre-2010 course was Genomics & Bioinformatics; after 2010, the course contains all of the “Bioinformatics” of previous years and then more (!) with less “Genomics”.)

Class data dump

Syllabus and class info dump in single PDF file: PDF
Class poster: pdf