From e29b2ac6fea5b9d305afab3b20ae3a35e2a9f502 Mon Sep 17 00:00:00 2001 From: Akash Sonowal Date: Fri, 24 Nov 2023 08:00:01 +0530 Subject: [PATCH] update --- interview_prep/readme.md | 364 +++++++++++++++++++++++++++++++++++---- readme.md | 364 ++++----------------------------------- 2 files changed, 364 insertions(+), 364 deletions(-) diff --git a/interview_prep/readme.md b/interview_prep/readme.md index 461057c..99e9b3a 100644 --- a/interview_prep/readme.md +++ b/interview_prep/readme.md @@ -1,33 +1,333 @@ -# Roadmap to Cracking MAANG Interviews - -## Coding -1. DSA -2. Data Processing and Manipulation (w/Python and SQL) -3. Math and Statistics -4. ML Functions -5. ML Applications - -## Applied Statistics (w/ Probability) -1. Make the shortest and most concise notes -2. Top Questions +[![Language](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org) +![](https://img.shields.io/github/issues/akashsonowal/ml-engineering-foundations?style=plastic) +![](https://img.shields.io/github/forks/akashsonowal/ml-engineering-foundations) +![](https://img.shields.io/github/stars/akashsonowal/ml-engineering-foundations) +![](https://img.shields.io/github/license/akashsonowal/ml-engineering-foundations) + +## Software Engineer (ML) +- Coding + - DSA (LeetCode) + - ML Algos from scratch + 1. [Uber] Write an AUC from scratch using vanilla Python + 2. [Google] Write the K-Means algorithm using Numpy only +- ML Theory (Breadth) - General Understanding + 1. [Amazon] Explain how the cross-validation work + 2. [Etsy] How do you handle imbalanced labels in classification models + 3. [McKinsey] What is the variance-bias trade-off? +- ML Algorithms (Depth) - Depth in 1-2 algorithms + 1. [Amazon] What is the pseudocode of the Random Forest model? + 2. [Amazon] What is the variance and bias of the Random Forest model? + 3. [Amazon] How is the Random Forest different from Gradient Boosted Trees? +- Applied ML / Business Case + 1. [Google] Given a dataset that contains purchase history on PlayStore, how would you build a propensity score model? + 2. [PayPal] How would you build a fraud model without labels? + 3. [Apple] How would you identify meaningful segmentation? +- ML System Design + 1. [Google] Build a real-time translation system + 2. [Uber] Build a real-time ETA model + 3. [Amazon] Build a recommender system for product search + +## Data Scientist +- Coding + - DSA (if Full Stack i.e., deploy models in production) + - Data Manipulation (SQL and Pandas) + - Statistical Coding + - ML Functions (From Scratch) +- Applied Statistics +- ML & MLOps +- Product Sense & AB Testing +- Causal Inference + +====================================== + +To check gpu utilisation + +!watch -n1 nvidia-smi + +๐Ÿญ. ๐—ข๐—Ÿ๐—ฆ +โ†ณ Price optimization +โ†ณ Customer lifetime value prediction +โ†ณ Demand forecasting +โ†ณ Factor analysis + +๐Ÿฎ. ๐—Ÿ๐—ผ๐—ด๐—ถ๐˜€๐˜๐—ถ๐—ฐ ๐—ฅ๐—ฒ๐—ด๐—ฟ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป +โ†ณ Customer propensity model +โ†ณ Uplift modeling +โ†ณ Attribution modeling +โ†ณ Fraud modeling + +๐Ÿฏ. ๐—ž-๐— ๐—ฒ๐—ฎ๐—ป๐˜€ +โ†ณ Customer segmentation +โ†ณ Time-series clustering +โ†ณ Market basket analysis +โ†ณ Product grouping + +๐—•๐—ฎ๐˜€๐—ถ๐—ฐ๐˜€ +- Univariate statistics - mean, median, mode +- Standard deviation and variance +- Covariance and correlation +- Population and sample +- Nominal, ordinal and continuous, discrete data types +- Outlines +- The Simpsonโ€™s Paradox +- Selection Bias + +๐—›๐˜†๐—ฝ๐—ผ๐˜๐—ต๐—ฒ๐˜€๐—ถ๐˜€ ๐—ง๐—ฒ๐˜€๐˜๐—ถ๐—ป๐—ด +- Hypothesis Statements +- Z-Test +- T-Test +- T-Test for sample means +- Z-Test for proportions +- Paired and unpaired T-Tests +- Variance test +- ANOVA +- Chi-Squared test +- Goodness of Fit test for categorical data +- Nominal, ordinal and continuous, discrete data types +- Pairwise tests +- T-Test assumptions +- Non-parametric tests +- Type 1 & 2 Errors + +๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—ฎ๐—ฏ๐—ถ๐—น๐—ถ๐˜๐˜† & ๐——๐—ถ๐˜€๐˜๐—ฟ๐—ถ๐—ฏ๐˜‚๐˜๐—ถ๐—ผ๐—ป๐˜€ +- The Bayes Theorem +- Conditional probability +- Normal distribution +- Uniform distribution +- Bernoulli distribution +- Binomial distribution +- Geometric distribution +- Poisson distribution +- Exponential distribution +- Deriving the mean and variance of distributions +- Central Limit Theorem +- The Birthday problem +- Card probability problems +- Die roll problems + +๐—ฅ๐—ฒ๐—ด๐—ฟ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป ๐— ๐—ผ๐—ฑ๐—ฒ๐—น๐—ถ๐—ป๐—ด +- OLS regression +- Confidence vs prediction intervals +- Logistic regression +- Regression model assumptions +- Model diagnostic checks +- R-Square vs R-Square Adjusted +- AIC, BIC, CP Statistics +- Model Interpretation + +# Large Scale Machine Learning (with MLOps) + +For use cases in computer vision and NLP, features can be computed using a small number of data sources. However, for use cases with tabular data like fraud, recommender system, ETA, dynamic pricing, etc. features are computed from many sources, + +## Data Support +1. Data Ingestion +- Streaming Ingestion: Click Stream (millions/sec) + Clickstream is user interactions with some interface: Mobile App, Voice Assistant, Desktop + Clickstream Tools (Pub/Sub): Kafka (Apache open source), Kinesis (AWS Managed) + + - install kafka + - setup zookeeper setup (directory and port) and run the zookeeper server. + - setup broker and run the broker server + + Note: while setting up you would also set topics + + setup a producer: + - webapp with click stream + - connect the falsk app with kafka broker + - send the response from web app to kafka broker to specific topic in broker + + You can setup a consumer message that validates whether we are able to receive message from producer in terminal. + + Notes: + I Clickstream +An ordered series of interactions that users have with some interface. In the traditional sense, this can be literal clicks of a mouse on a +desktop browser. Interactions can also come from touchscreens and conversational user interfaces. +| Change Data Capture +The process of recording changes In the data within a database system. For Instance, If a user cancels their Netflix subscription, then +the row In some table will change to indicate that they're no longer a subscriber. +The change In this row can be recorded and referenced later for analysis or audit purposes. +| Apache Kafka % +An open-source software platform which provides a way to handle real-time data streaming. +| Amazon Kinesis % +An AWS product that provides a way to handle real-time data streaming. + +| Zookeeper % +Aservice designed to reliably coordinate distributed systems via naming service, configuration management, data synchronization, +leader election, message queuing, or notification systems. + +| Database +Atool used to collect and organize data. Typically, database management systems allow users to Interact with the database. + +I oLtp +Online transaction processing. A system that handles (near) real-time business processes. For example, a database that maintains a +table of the users subscribed to Netflix and which Is then used to enable successful log-ins would be considered OLTP. This is opposed +to OLAP. + +I oLap +online analytical processing. A system that handles the analytical processes of a business, including reporting, auditing, and business +Intelligence. For example, this may be a Hadoop cluster which maintains user subscription history for Netflix. This is opposed to OLTP. + +| Availability Zone (AZ) +Typically a single data center within a region which has two or more data centers. The term "multi-AZ" implies that an application or +software resource is present across more than one AZ within a reglon. This strategy allows the software resource to continue operating, +even if a data center becomes unavailable. -## ML and MLOps -1. ML Crash Course -2. Large Scale ML - -## Product Sense and AB Testing -1. Make the shortest and most concise notes -2. Top Questions - -## Causal Inference -1. Make the shortest and most concise notes -2. Top Questions - -## System Design -1. ML System Design -2. Large Scale System Design (System Expert Course) - -## Basic SWE -1. OOPs -2. LLD (Object Oriented Design) -3. Infrastructure (Infra Expert Course) + +- Change Data Capture +- Live Video + +2. Data Storage +configure hdfs sink through kafka +pass the address of name node or master node +then run the connector + +you can see the name node and verify what are getting stored + +Notes: +| Hard Disk Drive (HDD) +Astorage device which operates by setting bits on a spinning magnetic disk. The capacity and the read/write performance of the HDD +are the main characteristics to consider when using an HDD within a particular system. + +| Data Replication +Astrategy used to mitigate the potential of data loss In the event of a system or component failure. In the most basic form, it involves +writing Identical data to more than one device or location. More efficient techniques like erasure coding incorporate mathematics to +recover lost data without referring to an explicit copy of the data + +| Hadoop Distributed File System (HDFS) +An open-source Apache software product which provides a distributed storage framework. + +I Avro +Arow-oriented data serializer provided by Apache. + +| Parquet +A column-oriented data storage format provided by Apache. +| Exactly-once Semantics ยง +Guarantees that an object within a distributed system is processed exactly once. Other semantics include maybe, at-least-once, and at- +most-once ยง + +3. Data Processing +SSH into name node of hdfs cluster +use AWS EMR to setup the processing cluster +Jupyter notebook outisde of spark cluster also can interact with it and submit jobs in cluster. The tool to do that is livy. +I Recommendation Carousel +A component within a graphical or conversational user interface which presents recommendations to a user. This can include +products, ads, and media content. +| Central Processing Unit (CPU) +A general purpose compute resource responsible for executing computer programs. +| Seasonality +The predictable changes of data throughout the calendar year. +I Parallelization +When two or more computer programs are executed at the same Instant across more than one processor. +| Random Access Memory (RAM) +A device on a computer which stores the data and machine code of a running computer program. +| Apache Spark +Asoftware interface which provides implicit parallelism and fault tolerance across a cluster of machines. +| Apache YARN +Asoftware product responsible for managing compute resources across a cluster of machines. +| Elastic MapReduce (EMR) % +An Amazon Web Services product which provides users access to a Hadoop cluster. +| Jupyter Notebook +A Project jupyter product which provides an interactive workspace to execute code. +4. Process Orchestration +The airflow dag defines how worker first ssh into name node of hadoop cluster. The worker trigger spark jobs. +So we can say airflow automates the data processing pipelines. +| Apache Airflow +Aworkflow management system that provides users a way to author, schedule, and execute software. + +## Exploration +Our jupyter notebook should have access to hdfs and spark clusters for compute. So we have explored the data and its applications to models. +| Automated Machine Learning (Auto-ML) +A strategy which automates the process of applying features and labels to a machine learning model. | +| Data Governance +The method of managing, using, and protecting an organization's data. + +## Experimentation +Once we are satisfied on offline evaluations during exploration, we want online evaluations with real users. +1. Frequentist AB Testing +- A/B Testing +The process of providing two or more different experiences across two or more subgroups of a population. The goal is to measure the change in behavior of the subgroups upon receiving the respective experiences. + +- A/ATest +An A/B test in which the experiences being tested are identical to one another. This is done in an effort to determine statistical validity of the A/B tool, the metric being monitored, and the analysis process being used. + +- User Agent +An identifier used to describe the software application which a user is using to Interact with another software application. For instance, +an HTTP request to a website typically includes the user agent so that the website knows how to render the webpage. + +- Session ID +A unique identifier assigned to a user to keep track of a user's connected interactions. For instance, a session may include a user logging In, purchasing an item, and logging out. Here, the session ID would be used to reference the group of those three Interactions. +This session ID can be stored in the user's Internet browser as a cookie. + +- Cookie +A small piece of data stored by a browser which indicates stateful information for a particular website. For instance, a cookle can be +stored in your browser after you log in to a website to indicate that you are logged In. This will stop subsequent pages from asking you +to log In again. + +Frequentist approach uses p-value and confidence interval +3. Bayesian AB Testing +Here we use prior knowledge. +The posterior and prior of same distribution initially. +In bayesian we sample from distribution and infer results with probability. +Beta Distribution +This distribution is used to model percentages and proportions such as click-through probabilities. +5. Multi Armed Bandit +| Multi-Armed Bandit (MAB) +A process which provides a number of choices | + +7. Impact Estimation +THis is determine if experiment is worth is pre-experiment and whether to launch post-experimentation +3 Key Terms - +| Shadow Test 3 +Running two or more versions of software In parallel while only surfacing the result of one of the experiences to the end user. Thisfs | +done in an effort to gauge the differences between the software versions. 1 +| Sample Selection Bias +The bias that occurs when sampling a population into one or more subgroups at random results in a systematic inclusion or exclusion +of some data. ยง +| Experiment Collision | +The event where one experiment unintentionally influences the result of one or more separate experiments. ยง + +## Large Scale Training +1. Basic Models (Random Forest, Boosted Trees, Matrix Factorization, Logistic Regression etc). +Read parquet files from HDFS Node and then run experiments. +| MLlib +Alibrary provided by Apache Spark which provides Spark clusters access to machine learning algorithms and refated utiities. MLIIb | +provides a Dataframe-based APl which Is unofficially referred to as Spark ML, ยง +~ provides a Dataframe-based APl which is unofficially referred to as Spark ML +2. Deep Learning Models +For distributed DL we will use PyTorch or Tensorflow....or some more advanced tools like Horovod by Uber +| Model Parallelism | +Amachine learning model training strategy used to maximize the utilization of compute resources (CPUs/GPUS) In which the model Is 3 +distributed across two or more devices. 1 +| Data Parallelism +Amachine learning model training strategy used to maximize the utilization of compute resources (CPUs/GPUS) In which the datais | +distributed across two or more devices ยง +| Graphics Processing Unit ยง +A specialized device that has many cores, allowing it to perform many operations at a time. ยง +GPUs are often used within deep learning to accelerate training of neural networks by taking advantage of their ability to perform 1 +many parallel computations. +| Concurrency +When two or computer programs share a single processor. +3. Model Validation +Bayesian Optimization is good but it is not non-parallizable. In Bayesian Op, the prior keep changing with iterations. + +## Productionization +This is the MLOps (Writing test to reduce error, versioning, tracking, environment syncronization etc) + +## Hosting +1. Data Hosting +| In-memory Database 1 +A database which relies either solely or primarily on the RAM of a computer. +| Distributed Cache % +A cache which s distributed across two or more machines +2. Model Hosting +I Numba +Ajust-in-time Python compller which resolves a subset of the python programming language down to machine code + +Metrics +Lift, CTR + +Offline metrics: Precision, Recall, Logloss, Ranking Loss + +Online metrics: CTR, Watch Time, Conversion Rate + +![Alt text](image.png) diff --git a/readme.md b/readme.md index 99e9b3a..461057c 100644 --- a/readme.md +++ b/readme.md @@ -1,333 +1,33 @@ -[![Language](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org) -![](https://img.shields.io/github/issues/akashsonowal/ml-engineering-foundations?style=plastic) -![](https://img.shields.io/github/forks/akashsonowal/ml-engineering-foundations) -![](https://img.shields.io/github/stars/akashsonowal/ml-engineering-foundations) -![](https://img.shields.io/github/license/akashsonowal/ml-engineering-foundations) - -## Software Engineer (ML) -- Coding - - DSA (LeetCode) - - ML Algos from scratch - 1. [Uber] Write an AUC from scratch using vanilla Python - 2. [Google] Write the K-Means algorithm using Numpy only -- ML Theory (Breadth) - General Understanding - 1. [Amazon] Explain how the cross-validation work - 2. [Etsy] How do you handle imbalanced labels in classification models - 3. [McKinsey] What is the variance-bias trade-off? -- ML Algorithms (Depth) - Depth in 1-2 algorithms - 1. [Amazon] What is the pseudocode of the Random Forest model? - 2. [Amazon] What is the variance and bias of the Random Forest model? - 3. [Amazon] How is the Random Forest different from Gradient Boosted Trees? -- Applied ML / Business Case - 1. [Google] Given a dataset that contains purchase history on PlayStore, how would you build a propensity score model? - 2. [PayPal] How would you build a fraud model without labels? - 3. [Apple] How would you identify meaningful segmentation? -- ML System Design - 1. [Google] Build a real-time translation system - 2. [Uber] Build a real-time ETA model - 3. [Amazon] Build a recommender system for product search - -## Data Scientist -- Coding - - DSA (if Full Stack i.e., deploy models in production) - - Data Manipulation (SQL and Pandas) - - Statistical Coding - - ML Functions (From Scratch) -- Applied Statistics -- ML & MLOps -- Product Sense & AB Testing -- Causal Inference - -====================================== - -To check gpu utilisation - -!watch -n1 nvidia-smi - -๐Ÿญ. ๐—ข๐—Ÿ๐—ฆ -โ†ณ Price optimization -โ†ณ Customer lifetime value prediction -โ†ณ Demand forecasting -โ†ณ Factor analysis - -๐Ÿฎ. ๐—Ÿ๐—ผ๐—ด๐—ถ๐˜€๐˜๐—ถ๐—ฐ ๐—ฅ๐—ฒ๐—ด๐—ฟ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป -โ†ณ Customer propensity model -โ†ณ Uplift modeling -โ†ณ Attribution modeling -โ†ณ Fraud modeling - -๐Ÿฏ. ๐—ž-๐— ๐—ฒ๐—ฎ๐—ป๐˜€ -โ†ณ Customer segmentation -โ†ณ Time-series clustering -โ†ณ Market basket analysis -โ†ณ Product grouping - -๐—•๐—ฎ๐˜€๐—ถ๐—ฐ๐˜€ -- Univariate statistics - mean, median, mode -- Standard deviation and variance -- Covariance and correlation -- Population and sample -- Nominal, ordinal and continuous, discrete data types -- Outlines -- The Simpsonโ€™s Paradox -- Selection Bias - -๐—›๐˜†๐—ฝ๐—ผ๐˜๐—ต๐—ฒ๐˜€๐—ถ๐˜€ ๐—ง๐—ฒ๐˜€๐˜๐—ถ๐—ป๐—ด -- Hypothesis Statements -- Z-Test -- T-Test -- T-Test for sample means -- Z-Test for proportions -- Paired and unpaired T-Tests -- Variance test -- ANOVA -- Chi-Squared test -- Goodness of Fit test for categorical data -- Nominal, ordinal and continuous, discrete data types -- Pairwise tests -- T-Test assumptions -- Non-parametric tests -- Type 1 & 2 Errors - -๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—ฎ๐—ฏ๐—ถ๐—น๐—ถ๐˜๐˜† & ๐——๐—ถ๐˜€๐˜๐—ฟ๐—ถ๐—ฏ๐˜‚๐˜๐—ถ๐—ผ๐—ป๐˜€ -- The Bayes Theorem -- Conditional probability -- Normal distribution -- Uniform distribution -- Bernoulli distribution -- Binomial distribution -- Geometric distribution -- Poisson distribution -- Exponential distribution -- Deriving the mean and variance of distributions -- Central Limit Theorem -- The Birthday problem -- Card probability problems -- Die roll problems - -๐—ฅ๐—ฒ๐—ด๐—ฟ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป ๐— ๐—ผ๐—ฑ๐—ฒ๐—น๐—ถ๐—ป๐—ด -- OLS regression -- Confidence vs prediction intervals -- Logistic regression -- Regression model assumptions -- Model diagnostic checks -- R-Square vs R-Square Adjusted -- AIC, BIC, CP Statistics -- Model Interpretation - -# Large Scale Machine Learning (with MLOps) - -For use cases in computer vision and NLP, features can be computed using a small number of data sources. However, for use cases with tabular data like fraud, recommender system, ETA, dynamic pricing, etc. features are computed from many sources, - -## Data Support -1. Data Ingestion -- Streaming Ingestion: Click Stream (millions/sec) - Clickstream is user interactions with some interface: Mobile App, Voice Assistant, Desktop - Clickstream Tools (Pub/Sub): Kafka (Apache open source), Kinesis (AWS Managed) - - - install kafka - - setup zookeeper setup (directory and port) and run the zookeeper server. - - setup broker and run the broker server - - Note: while setting up you would also set topics - - setup a producer: - - webapp with click stream - - connect the falsk app with kafka broker - - send the response from web app to kafka broker to specific topic in broker - - You can setup a consumer message that validates whether we are able to receive message from producer in terminal. - - Notes: - I Clickstream -An ordered series of interactions that users have with some interface. In the traditional sense, this can be literal clicks of a mouse on a -desktop browser. Interactions can also come from touchscreens and conversational user interfaces. -| Change Data Capture -The process of recording changes In the data within a database system. For Instance, If a user cancels their Netflix subscription, then -the row In some table will change to indicate that they're no longer a subscriber. -The change In this row can be recorded and referenced later for analysis or audit purposes. -| Apache Kafka % -An open-source software platform which provides a way to handle real-time data streaming. -| Amazon Kinesis % -An AWS product that provides a way to handle real-time data streaming. - -| Zookeeper % -Aservice designed to reliably coordinate distributed systems via naming service, configuration management, data synchronization, -leader election, message queuing, or notification systems. - -| Database -Atool used to collect and organize data. Typically, database management systems allow users to Interact with the database. - -I oLtp -Online transaction processing. A system that handles (near) real-time business processes. For example, a database that maintains a -table of the users subscribed to Netflix and which Is then used to enable successful log-ins would be considered OLTP. This is opposed -to OLAP. - -I oLap -online analytical processing. A system that handles the analytical processes of a business, including reporting, auditing, and business -Intelligence. For example, this may be a Hadoop cluster which maintains user subscription history for Netflix. This is opposed to OLTP. - -| Availability Zone (AZ) -Typically a single data center within a region which has two or more data centers. The term "multi-AZ" implies that an application or -software resource is present across more than one AZ within a reglon. This strategy allows the software resource to continue operating, -even if a data center becomes unavailable. +# Roadmap to Cracking MAANG Interviews + +## Coding +1. DSA +2. Data Processing and Manipulation (w/Python and SQL) +3. Math and Statistics +4. ML Functions +5. ML Applications + +## Applied Statistics (w/ Probability) +1. Make the shortest and most concise notes +2. Top Questions - -- Change Data Capture -- Live Video - -2. Data Storage -configure hdfs sink through kafka -pass the address of name node or master node -then run the connector - -you can see the name node and verify what are getting stored - -Notes: -| Hard Disk Drive (HDD) -Astorage device which operates by setting bits on a spinning magnetic disk. The capacity and the read/write performance of the HDD -are the main characteristics to consider when using an HDD within a particular system. - -| Data Replication -Astrategy used to mitigate the potential of data loss In the event of a system or component failure. In the most basic form, it involves -writing Identical data to more than one device or location. More efficient techniques like erasure coding incorporate mathematics to -recover lost data without referring to an explicit copy of the data - -| Hadoop Distributed File System (HDFS) -An open-source Apache software product which provides a distributed storage framework. - -I Avro -Arow-oriented data serializer provided by Apache. - -| Parquet -A column-oriented data storage format provided by Apache. -| Exactly-once Semantics ยง -Guarantees that an object within a distributed system is processed exactly once. Other semantics include maybe, at-least-once, and at- -most-once ยง - -3. Data Processing -SSH into name node of hdfs cluster -use AWS EMR to setup the processing cluster -Jupyter notebook outisde of spark cluster also can interact with it and submit jobs in cluster. The tool to do that is livy. -I Recommendation Carousel -A component within a graphical or conversational user interface which presents recommendations to a user. This can include -products, ads, and media content. -| Central Processing Unit (CPU) -A general purpose compute resource responsible for executing computer programs. -| Seasonality -The predictable changes of data throughout the calendar year. -I Parallelization -When two or more computer programs are executed at the same Instant across more than one processor. -| Random Access Memory (RAM) -A device on a computer which stores the data and machine code of a running computer program. -| Apache Spark -Asoftware interface which provides implicit parallelism and fault tolerance across a cluster of machines. -| Apache YARN -Asoftware product responsible for managing compute resources across a cluster of machines. -| Elastic MapReduce (EMR) % -An Amazon Web Services product which provides users access to a Hadoop cluster. -| Jupyter Notebook -A Project jupyter product which provides an interactive workspace to execute code. -4. Process Orchestration -The airflow dag defines how worker first ssh into name node of hadoop cluster. The worker trigger spark jobs. -So we can say airflow automates the data processing pipelines. -| Apache Airflow -Aworkflow management system that provides users a way to author, schedule, and execute software. - -## Exploration -Our jupyter notebook should have access to hdfs and spark clusters for compute. So we have explored the data and its applications to models. -| Automated Machine Learning (Auto-ML) -A strategy which automates the process of applying features and labels to a machine learning model. | -| Data Governance -The method of managing, using, and protecting an organization's data. - -## Experimentation -Once we are satisfied on offline evaluations during exploration, we want online evaluations with real users. -1. Frequentist AB Testing -- A/B Testing -The process of providing two or more different experiences across two or more subgroups of a population. The goal is to measure the change in behavior of the subgroups upon receiving the respective experiences. - -- A/ATest -An A/B test in which the experiences being tested are identical to one another. This is done in an effort to determine statistical validity of the A/B tool, the metric being monitored, and the analysis process being used. - -- User Agent -An identifier used to describe the software application which a user is using to Interact with another software application. For instance, -an HTTP request to a website typically includes the user agent so that the website knows how to render the webpage. - -- Session ID -A unique identifier assigned to a user to keep track of a user's connected interactions. For instance, a session may include a user logging In, purchasing an item, and logging out. Here, the session ID would be used to reference the group of those three Interactions. -This session ID can be stored in the user's Internet browser as a cookie. - -- Cookie -A small piece of data stored by a browser which indicates stateful information for a particular website. For instance, a cookle can be -stored in your browser after you log in to a website to indicate that you are logged In. This will stop subsequent pages from asking you -to log In again. - -Frequentist approach uses p-value and confidence interval -3. Bayesian AB Testing -Here we use prior knowledge. -The posterior and prior of same distribution initially. -In bayesian we sample from distribution and infer results with probability. -Beta Distribution -This distribution is used to model percentages and proportions such as click-through probabilities. -5. Multi Armed Bandit -| Multi-Armed Bandit (MAB) -A process which provides a number of choices | - -7. Impact Estimation -THis is determine if experiment is worth is pre-experiment and whether to launch post-experimentation -3 Key Terms - -| Shadow Test 3 -Running two or more versions of software In parallel while only surfacing the result of one of the experiences to the end user. Thisfs | -done in an effort to gauge the differences between the software versions. 1 -| Sample Selection Bias -The bias that occurs when sampling a population into one or more subgroups at random results in a systematic inclusion or exclusion -of some data. ยง -| Experiment Collision | -The event where one experiment unintentionally influences the result of one or more separate experiments. ยง - -## Large Scale Training -1. Basic Models (Random Forest, Boosted Trees, Matrix Factorization, Logistic Regression etc). -Read parquet files from HDFS Node and then run experiments. -| MLlib -Alibrary provided by Apache Spark which provides Spark clusters access to machine learning algorithms and refated utiities. MLIIb | -provides a Dataframe-based APl which Is unofficially referred to as Spark ML, ยง -~ provides a Dataframe-based APl which is unofficially referred to as Spark ML -2. Deep Learning Models -For distributed DL we will use PyTorch or Tensorflow....or some more advanced tools like Horovod by Uber -| Model Parallelism | -Amachine learning model training strategy used to maximize the utilization of compute resources (CPUs/GPUS) In which the model Is 3 -distributed across two or more devices. 1 -| Data Parallelism -Amachine learning model training strategy used to maximize the utilization of compute resources (CPUs/GPUS) In which the datais | -distributed across two or more devices ยง -| Graphics Processing Unit ยง -A specialized device that has many cores, allowing it to perform many operations at a time. ยง -GPUs are often used within deep learning to accelerate training of neural networks by taking advantage of their ability to perform 1 -many parallel computations. -| Concurrency -When two or computer programs share a single processor. -3. Model Validation -Bayesian Optimization is good but it is not non-parallizable. In Bayesian Op, the prior keep changing with iterations. - -## Productionization -This is the MLOps (Writing test to reduce error, versioning, tracking, environment syncronization etc) - -## Hosting -1. Data Hosting -| In-memory Database 1 -A database which relies either solely or primarily on the RAM of a computer. -| Distributed Cache % -A cache which s distributed across two or more machines -2. Model Hosting -I Numba -Ajust-in-time Python compller which resolves a subset of the python programming language down to machine code - -Metrics -Lift, CTR - -Offline metrics: Precision, Recall, Logloss, Ranking Loss - -Online metrics: CTR, Watch Time, Conversion Rate - -![Alt text](image.png) +## ML and MLOps +1. ML Crash Course +2. Large Scale ML + +## Product Sense and AB Testing +1. Make the shortest and most concise notes +2. Top Questions + +## Causal Inference +1. Make the shortest and most concise notes +2. Top Questions + +## System Design +1. ML System Design +2. Large Scale System Design (System Expert Course) + +## Basic SWE +1. OOPs +2. LLD (Object Oriented Design) +3. Infrastructure (Infra Expert Course)