diff --git a/.gitignore b/.gitignore index 7585238..2eb9938 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,3 @@ book +mermaid-init.js +mermaid.min.js diff --git a/book.toml b/book.toml index c908655..4ccc17d 100644 --- a/book.toml +++ b/book.toml @@ -8,6 +8,7 @@ title = "Notes @ ETH Zürich" [output.katex] [output.html] +additional-js = ["mermaid.min.js", "mermaid-init.js"] mathjax-support = true [output.html.fold] @@ -17,6 +18,9 @@ level = 0 [output.pytoc] command = "python3 ../../scripts/autogen-toc.py" +[preprocessor.mermaid] +command = "mdbook-mermaid" + [preprocessor.katex] renderers = ["html"] diff --git a/src/23fs/fg/01_modern_genomics_i.md b/src/23fs/fg/01_modern_genomics_i.md index 8b94fca..b6ba74d 100644 --- a/src/23fs/fg/01_modern_genomics_i.md +++ b/src/23fs/fg/01_modern_genomics_i.md @@ -1 +1,263 @@ # Modern Genomics I + + + +## Introduction + +Molecular biology is governed by the "central dogma". Genomics is basically studying the DNA part of it. + +**Genome sequence**: complete listing of all nucleotides of one organism, in correct order, and mapped to the chromosomes. + +```mermaid +graph LR + A[DNA] -->|Transcription| B[RNA] + B -->|Translation| C[Protein] +``` + +### Timeline + +Efficient sequencing technology arrived rather late. Initially the sequencing process was cumbersome and radioactive. + +- 1975: "dideoxy" DNA sequencing (Sanger) +- 1977: first genome (bacteriophage $\phi X 174$) +- 1995: first cell (Haemophilus influenzae) +- 1998: first animal (Caenorhabditis elegans) +- 2001: Homo sapiens +- Today (February 2023) + - genomes available for: 409,947 Bacteria, 4,988 Archaea, 47,200 Eukaryotes + - human genomes fairly routine + - below 1000$ raw costs + - "Personal Genome Projects" are enrolling 100’000s of volunteers, including their medical records + +### Why Genomics? 
+ +- Because we want an inventory of all genes and functions +- Because we can compare genomes to learn about evolution, to get hints on gene function, etc. + - Comparison can be either based on DNA or protein + - Alignments, dot plots, whole chromosome comparison + +### Comparative genomics use case examples + +- Gene prediction + - Gene prediction algorithms that use homology (=comparative genomics result) information: SLAM, SGP, Twinscan (= N-SCAN)... + - ![UCSC genome browser gene prediction](img/20240202163019.png) +- Gene family evolution + +## Classical sequencing method + +### Sanger (dideoxy) sequencing + +Natural DNA extension requires 3'-OH. The dideoxy method uses a 2',3'-dideoxy nucleotide, which lacks the 3'-OH group. This causes the DNA chain to terminate. By introducing different dideoxy nucleotides, the sequence can be read. + +### Automated Dye Sequencing + +Variants of Sanger sequencing. Still utilize the dideoxy method to terminate DNA elongation. The difference is that the dideoxy nucleotides are labeled with different fluorescent dyes. The sequence is read by a laser. + +![Two variants of Automated Dye Sequencing](img/20240202163700.png) + +Dye terminator sequencing is now widely used over the rather cumbersome (4 tubes per sample) dye primer chemistry. + +## New (Next-generation) sequencing technologies + +Generally involves first **amplifying** the DNA, then **sequencing** it. Sequencing is done by detecting the nucleotides as they are incorporated into the growing DNA strand (sequencing by synthesis). High-throughput is achieved by parallelizing the sequencing process. 
+ +### Amplification technologies + +First-generation amplification technology: needs DNA-library in bacterial vectors --> cumbersome and biased + +Improvement: get rid of bacteria + +#### Emulsion PCR + +Improvement: bacterium free, but still needs cloning + +```mermaid +%%{init: {"graph": {"htmlLabels": false}} }%% +graph TD + A["`**Fragment** the DNA, ligate **adapters** to ends, make **single-stranded**`"] --> B["Attach to microbeads"] + B --> C["`PCR-amplify, in a **water-oil emulsion**`"] + C --> D["`Enrich beads having successful amplifications, then place into regular lattice + (see figure below for details)`"] +``` + +![Emulsion PCR enrichment step](img/20240203111110.png) + +In short, the enrichment is done by capturing the second (5'-end) primer of the PCR product onto a large polystyrene bead. + +#### PCR on solid support + +![PCR on solid support](img/20240203111302.png) + +### Barcoding and "linked reads" + +![Barcoding](img/20240203114453.png) + +![10X technology](img/20240203114512.png) + +### Sequencing technologies + +First-generation sequencing needs DNA size-separation on a gel + +Improvement: get rid of gel (sequencing by synthesis) + +#### Pyrosequencing + +![Pyrosequencing](img/20240203113950.png) + +#### Reversible terminator sequencing + +![Reversible terminator sequencing](img/20240203113926.png) + +#### Sequencing by semi-conductor + +Directly detects the release of H+ ions when a nucleotide is incorporated into the growing DNA strand. + +![Sequencing by semi-conductor](img/20240203114721.png) + +### Current implementations of NGS + +- Illumina + - Illumina NovaSeq 6000 + - PCR on solid support + - reversible terminator sequencing + - read length ca. 250bp + - `1e14` bp per run +- Ion Torrent / Life Techn. Inc + - Ion Gene Studio S5 + - PCR on beads + - sequencing by semi-conductor + - read length ca. 600bp + - `1e10` bp per run + +## Third-generation sequencing technologies + +**Single molecule sequencing**. 
**No** need for **amplification**. + +Characterized by extremely long reads, but also high error rates. + +- Pacific Biosciences + - **SMRT** (single molecule real time) sequencing + - ![pacbio1](img/20240203114758.png) + - ![pacbio1](img/20240203114809.png) +- Oxford **Nanopore** + - MinION + - ![oxford nanopore](img/20240203114829.png) + +> **Self-note**: minimap2 is a popular aligner for long reads. + +## Environmental sequencing + +Traditional genome sequencing requires individual **cell isolation** and **cultivation**. This is not possible for the majority of microorganisms. But one advantage is that it's possible to re-assemble the genome from the reads. + +Environmental sequencing: directly sequence DNA extracted from the environment without purification and clonal cultivation. Genome assembly is generally not possible. + +> **Self-note**: data generated from environmental sequencing is typically large in size, but highly fragmented and contaminated. Lots of exciting research in this area. + +### How to deal with environmental sequencing data + +- Novel gene discovery + - Sequence identity comparison to known genes + - But > 50% of the environmental genomes are not similar to any known genome +- Novel gene families +- Gene family clustering (similar samples have similar gene family distribution) + +## Single-cell sequencing + +### Why? + +- **Heterogeneity** in cell populations + - Tumor cells + - Immune cells + - Microbial communities + - Developmental biology + +### How? + +In short, we first get **single** cells, then amplify the **whole genome** and sequence it. + +The challenges lie in the bolded parts. + +#### Single-cell isolation + +(In the very first "single cell" genomics paper, the "single" cells were literally picked manually...nowadays we don't do that) + +1. Sorting with optical tweezers + ![Optic tweezers](img/20240304231454.png) +2. Dilution series +3. 
Flow sorting + ![Flow sorting](img/20240304231416.png) + +#### Whole genome amplification + +Steps summarized: + +1. MDA +2. phi 29 debranching +3. S1 nuclease digestion +4. DNA pol I nick translation +5. Cloning + +- Isothermal **Multiple displacement amplification** (MDA) + - **Phi29** DNA polymerase + - **Random primers** + - **Isothermal** amplification + +![MDA](img/20240304231641.png) + +After MDA, we obtain a "**hyperbranched chromosome**". After *debranching* and cloning, we can sequence and re-assemble the genome. + +![Next steps](img/20240304231833.png) + +The debranching is done by incubating phi 29 DNA pol with hyperbranched DNA **without any primer**. The *strand-displacement* activity of phi 29 DNA pol will remove the hyperbranched structure. + +S1 nucleases are used to remove the remaining single-stranded DNA. + +Nicks are filled in by DNA pol I (has 5'->3' exonuclease activity). + +## Genomic databases + +This section likely won't be covered in the exam. + +> **General popular resources**: +> +> - Raw data: NCBI **sequence read archive (SRA)** (also its European counterpart, EBI **European Nucleotide Archive (ENA)**, but they are basically the same thing now) +> - seq quality score included +> - but incomplete: legacy & newer data not available +> - gigantic in size +> - Sequencing projects: [GOLD (Genomes OnLine Database)](genomesonline.org) +> - keep track of "who is sequencing what" and responsible researchers (contacts), funding sources, sequencing centers etc. +> - Genome browsers +> - Display features (genes, transcripts...) 
on the genomes, show annotations (conflicts, variants also included), homolog search +> - UCSC genome browser, Ensembl (popular in Europe) +> - Pros and cons of genome browsers +> - Pros +> - easy to use +> - regularly updated +> - automated annotation pipelines => fast to include new genomes +> - very powerful export utilities (`BioMart` in Ensembl) +> - API for local access +> - DAS (distributed annotation system) for data exchange +> - long term project, stable funding, likely not going away +> - Cons +> - focus on vertebrates, few other genomes +> - complex db schema +> - popular, so can be slow +> +> Special ones: +> +> - Comparative genomics databases +> - STRING (protein-protein interactions, focused on microbial genomes, maintained by von Mering group at UZH) +> - specialized on comparing genomes (at nucleotide-level, or gene-level) +> - to visualize evidence of selection (exons, regulatory sites, ...) +> - to infer past evolution of genomes (rearrangements, gains, losses, ...) +> - to establish gene histories (orthology, paralogy, synteny, ...) +> - often require extensive offline computation before they go online +> - some of their services also offered by generic genome browsers/sites. +> - Organism-specific databases +> - Flybase, Wormbase, TAIR, SGD... +> - community driven, extensive manual input +> - specific terms, abbreviations, gene names... +> - Specialized databases +> - IGSR: human population genetics +> - OMIM: known disease-causing mutations +> - KEGG: metabolic pathways and enzymes diff --git a/src/23fs/fg/04_transcriptomics_ii.md b/src/23fs/fg/04_transcriptomics_ii.md index 52fe2d8..d631a28 100644 --- a/src/23fs/fg/04_transcriptomics_ii.md +++ b/src/23fs/fg/04_transcriptomics_ii.md @@ -1 +1,160 @@ # Transcriptomics II + + + + + +## Methods of Exploratory Data Analysis + +Typical shape of data we have in hand: a data matrix. 
+ +![Data matrix](img/20240203134507.png) + +where the **columns** are the **samples** and the **rows** are the **features** (usually gene expressions). (It can be the other way around, but this is the most common case.) + +- **Clustering** + - Hierarchical clustering + - k-means clustering +- **Dimensionality reduction** + - Matrix Factorization + - **PCA** + - **MDS** (Multidimensional Scaling) + - Graph-based methods + - **t-SNE** + - **UMAP** + +## Clustering + +- Goals + - group **similar samples** + - evaluate similarity between expression profiles of the samples + - test if similarities match the experimental design and effect sizes + - test if variation within a condition is smaller than between conditions + - outlier detection + - group **similar genes** + - **guilt by association**: infer functions of unknown genes from known genes with the same expression pattern (co-expression) + +### Distance measures + +To cluster samples, we need to first define a "distance" measure between samples. + +Commonly used distance measures: + +- **Euclidean distance** (typically for clustering samples) + - Euclidean distance of two profiles $\mathbf{x}$ and $\mathbf{y}$ with $p$ genes (i.e. the distance between two $p$-dimensional vectors $\mathbf{x}$ and $\mathbf{y}$) + - $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} \left(x_i - y_i\right)^2}$ + - **Expression values MUST BE LOG SCALE** +- **$1 - \text{corr}(\mathbf{x}, \mathbf{y})$** (typically for clustering genes) + - Correlation coefficient of two profiles $\mathbf{x}$ and $\mathbf{y}$ with $p$ samples + - $\text{corr}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^{p}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{p}(y_i - \bar{y})^2}}$ + - $\bar{x} = \frac{1}{p}\sum_{i=1}^{p}x_i$ + - $\bar{y} = \frac{1}{p}\sum_{i=1}^{p}y_i$ + +### Hierarchical Clustering + +**Algorithm**: + +1. Compute the distance matrix ($n \times (n - 1) / 2 \rightarrow O(n^2)$) between all samples +2. 
Find the pair with minimal distance and merge them +3. Update the distance matrix +4. Repeat 2-3 until all samples are merged + +**Parameters**: + +- distance measure for samples + - usually $1 - \text{corr}(\mathbf{x}, \mathbf{y})$ for gene expression +- distance measure for clusters (**linkage rule**) + - **Single** linkage: **minimum** distance between any elements of the two clusters + - **Complete** linkage: **maximum** distance between any elements of the two clusters + - **Average** linkage: **average** distance between *all* elements of the two clusters + - **Ward's** linkage: **minimal** increase in **intra-cluster variance** + +**Input**: + +- **distance matrix** + - The linkage can be derived directly from the distance matrix. + - Hence the clustering algorithm only needs the distance matrix as input, not the individual measurements. + +### k-means Clustering + +**Algorithm**: + +1. Randomly assign each sample to one of the $k$ clusters +2. Compute the centroid (cluster center, average of the assigned samples) of each cluster +3. Assign each sample to the cluster with the closest centroid +4. Repeat 2-3 until convergence or a maximum number of iterations + +**Parameters**: + +- number of clusters $k$ +- distance measure for samples + +**Input**: + +- data matrix (cannot directly use distance matrix) + +This method **minimizes the intra-cluster variance**. + +Choice of $k$ affects the result. 
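
The four k-means steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `euclidean`, `centroid` and `kmeans`, the `seed` and `max_iter` arguments, and the choice to pin the first $k$ samples to distinct clusters (so that no cluster starts empty) are all made up for this example. Samples are passed as rows here, i.e. the transpose of the data matrix layout described earlier.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(points):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(samples, k, max_iter=100, seed=0):
    """Return (labels, centroids); requires len(samples) >= k."""
    rng = random.Random(seed)
    # Step 1: random assignment. Assumption for this sketch: the first k
    # samples are pinned to distinct clusters so no cluster starts empty.
    labels = [i if i < k else rng.randrange(k) for i in range(len(samples))]
    centroids = []
    for _ in range(max_iter):
        # Step 2: centroid = average of the samples assigned to each cluster.
        new_centroids = []
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            # If a cluster lost all members, keep its previous centroid.
            new_centroids.append(centroid(members) if members else centroids[c])
        centroids = new_centroids
        # Step 3: assign every sample to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(s, centroids[c]))
            for s in samples
        ]
        # Step 4: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    # Toy data: three "low" and two "high" expression profiles (log scale).
    data = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
    print(kmeans(data, k=2))
```

Which cluster ends up with label 0 or 1 depends on the random start, so only the grouping is meaningful, not the label values. Note also that the function takes the data matrix itself, consistent with the point above that k-means cannot work from a distance matrix alone.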
+ +### Comparison of Hierarchical Clustering and k-means Clustering + +| | Hierarchical Clustering | k-means Clustering | +| --- | --- | --- | +| Computing time | $O(n^2 \log (n))$ | $O(n \cdot k \cdot t)$ | +| Memory | $O(n^2)$ | $O(n \cdot k)$ | + +When clustering large numbers of genes (>`1e4`), hierarchical clustering is not practical. + +## Dimensionality Reduction + +### Principal Component Analysis (PCA) + +**Goal**: Find a new coordinate system such that the first axis (principal component) captures the most variance, the second axis captures the second most variance, and so on. + +The data is **linearly** transformed to a new coordinate system. + +### t-SNE + +**Algorithm**: + +1. In the high-dimensional space, create a *probability distribution* that dictates how likely two points are to be neighbors. +2. Recreate a low dimensional space that follows the same probability distribution as best as possible. + +How to find the best low-dimensional representation: + +- **preserve the pairwise distances** between neighboring points in the high-dimensional space +- non-linear, different transformations on different regions + +Characteristic: + +- Powerful, but need to fiddle with random seed and perplexity +- Non-deterministic + +### UMAP + +Uniform Manifold Approximation and Projection + +Approach: Find for each point the neighbors and build simplices (simplex: a generalization of the concept of a triangle or tetrahedron to arbitrary dimensions) and then optimize the low-dimensional representation to preserve the simplices. 
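
Coming back to the PCA goal stated earlier in this section: the first principal component is the direction of largest variance, which is the top eigenvector of the covariance matrix. The sketch below approximates it by power iteration in plain Python. This is only one possible route (libraries typically use an SVD instead); the name `first_principal_component` and the `n_iter` argument are hypothetical, samples are passed as rows, and the sketch assumes the start vector is not orthogonal to the top axis.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Approximate the first principal axis of `rows` (samples x features).

    Returns (unit direction vector, variance captured along it).
    """
    n, p = len(rows), len(rows[0])
    # Center each feature so variance is measured around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration: repeatedly applying the covariance matrix pulls the
    # vector toward the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance at all; keep the current direction
        v = [x / norm for x in w]
    # Variance along v is the quadratic form v^T cov v.
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    return v, variance


if __name__ == "__main__":
    # Points lying exactly on the line y = 2x.
    points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    print(first_principal_component(points))
```

For points on a straight line, all variance falls on the first axis, which illustrates why PCA is described above as a purely linear transformation: it rotates the coordinate system but cannot unfold curved structure the way t-SNE or UMAP attempt to.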
+ +## Differential Expression Analysis diff --git a/src/23fs/fg/11_quality_control_and_standards.md b/src/23fs/fg/11_quality_control_and_standards.md index 9015c5c..f7221a7 100644 --- a/src/23fs/fg/11_quality_control_and_standards.md +++ b/src/23fs/fg/11_quality_control_and_standards.md @@ -1 +1 @@ -# Qualitiy Control and Standards +# Quality Control and Standards diff --git a/src/23fs/fg/functional_genomics.md b/src/23fs/fg/functional_genomics.md index 7f4ed02..1da8a8c 100644 --- a/src/23fs/fg/functional_genomics.md +++ b/src/23fs/fg/functional_genomics.md @@ -1,5 +1,7 @@ # Functional Genomics +A course collectively taught by multiple professors. Multiple-choice exam, lots of memorization. + ## Table of Contents diff --git a/src/23fs/fg/img/04-20230313162850.png b/src/23fs/fg/img/04-20230313162850.png new file mode 100644 index 0000000..1f784b1 Binary files /dev/null and b/src/23fs/fg/img/04-20230313162850.png differ diff --git a/src/23fs/fg/img/20240202163019.png b/src/23fs/fg/img/20240202163019.png new file mode 100644 index 0000000..fc6af81 Binary files /dev/null and b/src/23fs/fg/img/20240202163019.png differ diff --git a/src/23fs/fg/img/20240202163700.png b/src/23fs/fg/img/20240202163700.png new file mode 100644 index 0000000..3fe0808 Binary files /dev/null and b/src/23fs/fg/img/20240202163700.png differ diff --git a/src/23fs/fg/img/20240203111110.png b/src/23fs/fg/img/20240203111110.png new file mode 100644 index 0000000..1b85e3e Binary files /dev/null and b/src/23fs/fg/img/20240203111110.png differ diff --git a/src/23fs/fg/img/20240203111302.png b/src/23fs/fg/img/20240203111302.png new file mode 100644 index 0000000..16852fd Binary files /dev/null and b/src/23fs/fg/img/20240203111302.png differ diff --git a/src/23fs/fg/img/20240203113926.png b/src/23fs/fg/img/20240203113926.png new file mode 100644 index 0000000..3d71f44 Binary files /dev/null and b/src/23fs/fg/img/20240203113926.png differ diff --git a/src/23fs/fg/img/20240203113950.png 
b/src/23fs/fg/img/20240203113950.png new file mode 100644 index 0000000..e106f0a Binary files /dev/null and b/src/23fs/fg/img/20240203113950.png differ diff --git a/src/23fs/fg/img/20240203114453.png b/src/23fs/fg/img/20240203114453.png new file mode 100644 index 0000000..bce788f Binary files /dev/null and b/src/23fs/fg/img/20240203114453.png differ diff --git a/src/23fs/fg/img/20240203114512.png b/src/23fs/fg/img/20240203114512.png new file mode 100644 index 0000000..0e9aec7 Binary files /dev/null and b/src/23fs/fg/img/20240203114512.png differ diff --git a/src/23fs/fg/img/20240203114721.png b/src/23fs/fg/img/20240203114721.png new file mode 100644 index 0000000..f8b0ccd Binary files /dev/null and b/src/23fs/fg/img/20240203114721.png differ diff --git a/src/23fs/fg/img/20240203114758.png b/src/23fs/fg/img/20240203114758.png new file mode 100644 index 0000000..7ad9c04 Binary files /dev/null and b/src/23fs/fg/img/20240203114758.png differ diff --git a/src/23fs/fg/img/20240203114809.png b/src/23fs/fg/img/20240203114809.png new file mode 100644 index 0000000..11c289f Binary files /dev/null and b/src/23fs/fg/img/20240203114809.png differ diff --git a/src/23fs/fg/img/20240203114829.png b/src/23fs/fg/img/20240203114829.png new file mode 100644 index 0000000..fd83ed8 Binary files /dev/null and b/src/23fs/fg/img/20240203114829.png differ diff --git a/src/23fs/fg/img/20240203134507.png b/src/23fs/fg/img/20240203134507.png new file mode 100644 index 0000000..4aedfea Binary files /dev/null and b/src/23fs/fg/img/20240203134507.png differ diff --git a/src/23fs/fg/img/20240304231416.png b/src/23fs/fg/img/20240304231416.png new file mode 100644 index 0000000..cef0409 Binary files /dev/null and b/src/23fs/fg/img/20240304231416.png differ diff --git a/src/23fs/fg/img/20240304231454.png b/src/23fs/fg/img/20240304231454.png new file mode 100644 index 0000000..4a871b9 Binary files /dev/null and b/src/23fs/fg/img/20240304231454.png differ diff --git a/src/23fs/fg/img/20240304231641.png 
b/src/23fs/fg/img/20240304231641.png new file mode 100644 index 0000000..a3ed6f6 Binary files /dev/null and b/src/23fs/fg/img/20240304231641.png differ diff --git a/src/23fs/fg/img/20240304231833.png b/src/23fs/fg/img/20240304231833.png new file mode 100644 index 0000000..92f131e Binary files /dev/null and b/src/23fs/fg/img/20240304231833.png differ diff --git a/src/23fs/fg/img/20240305004611.png b/src/23fs/fg/img/20240305004611.png new file mode 100644 index 0000000..56a9198 Binary files /dev/null and b/src/23fs/fg/img/20240305004611.png differ diff --git a/src/README.md b/src/README.md index e66b19c..6abbdea 100644 --- a/src/README.md +++ b/src/README.md @@ -1,8 +1,29 @@ # Notes taken @ ETH Zurich -## Table of Contents +## Introduction -### 23FS +Here I list all courses I took at ETH Zurich and provide a brief overview of the course content. The clickable links will lead you to the respective course notes. + +Courses are organized by semester and year. The course notes are organized by lecture. + +### 22HS + +- Computational Biology +- Data Mining I +- Information Systems for Engineers +- Computational Systems Biology + +### 23FS + HS -- [Functional Genomics](./23fs/fg/functional_genomics.md) - [Mobile Health and Activity Monitoring](./23fs/mham/mobile_health_and_activity_monitoring.md) +- [Functional Genomics](./23fs/fg/functional_genomics.md) +- Introduction to Machine Learning +- Big Data for Engineers +- Statistical Models in Computational Biology + +No courses (except for seminar) taken in 23HS since I was doing an internship at Roche. 
+ +### 24FS + +- Biofluiddynamics +- Synthetic Biology I diff --git a/src/SUMMARY.md b/src/SUMMARY.md index 476f761..70ae233 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -7,18 +7,19 @@ # 22HS - [Computational Biology]() +- [Data Mining I]() --- -# 23FS +# 23FS + HS - [Mobile Health and Activity Monitoring](./23fs/mham/mobile_health_and_activity_monitoring.md) - [Introduction]() - [Functional Genomics](./23fs/fg/functional_genomics.md) - - [Modern Genomics I]() + - [Modern Genomics I](./23fs/fg/01_modern_genomics_i.md) - [Modern Genomics II](./23fs/fg/02_modern_genomics_ii.md) - [Transcriptomics I](./23fs/fg/03_transcriptomics_i.md) - - [Transcriptomics II]() + - [Transcriptomics II](./23fs/fg/04_transcriptomics_ii.md) - [MicroRNAs and other small RNAs]() - [Proteomics]() - [Metabolomics]() @@ -26,3 +27,10 @@ - [Protein Networks]() - [Epigenomics and Gene Regulation]() - [Qualitiy Control and Standards]() + +--- + +# 24FS + +- [Biofluiddynamics]() +- [Synthetic Biology I]()