Artificial Intelligence

Created by
BBorhan

Resources

  1. Atkia Apu’s Note
  1. Mahir Labib Dihan Sir’s Note
    1. https://drive.google.com/file/d/1Uf8A_hCBnC2F1-QIez4U-ckrPEbmkYJe/view
    1. https://drive.google.com/file/d/1Ek05PYorrHT_xsqdB71c86dyKVjj58Ty/view

Introduction to AI, Turing test, Agents


Artificial Intelligence (AI) is a branch of computer science and engineering that deals with intelligent behavior, learning, and adaptation in machines.

Views of AI fall into four categories:

  1. Thinking humanly
  1. Thinking rationally
  1. Acting humanly
  1. Acting rationally

Turing Test: A computer passes the test if a human interrogator, after posing some written questions, cannot tell whether the written responses come from a person or from a computer.

Capabilities required to pass the Turing Test

  1. Natural language processing to enable it to communicate successfully in English.
  1. Knowledge representation to store what it knows or hears.
  1. Automated reasoning to use the stored information to answer questions and draw new conclusions.
  1. Machine learning to adapt to new circumstances and to detect and extrapolate patterns.

Agents: An agent is anything that perceives its environment through sensors and acts upon it through actuators.

Structure

  • Sensors: Devices that collect environmental data
  • Percepts: Data received from sensors
  • Actuators: Mechanisms that allow the agent to act on the environment
  • Actions: Tasks performed by actuators

Types of Agents

| Agent Type | Description | Key Features | Limitations | Example |
| --- | --- | --- | --- | --- |
| Simple Reflex Agent | Acts based on current percepts, using condition-action rules ("If X, then Y"). | No memory of past percepts; simple to implement. | Limited intelligence; may enter deadlocks or loops. | Thermostat turning on/off based on temperature. |
| Model-Based Reflex Agent | Maintains an internal state to track unobserved aspects of the environment. | Uses knowledge of world evolution and action effects. | Requires modeling of environment dynamics. | Robot navigating a partially visible room. |
| Goal-Based Agent | Acts to achieve specific goals by evaluating action sequences. | Flexible; explicit knowledge representation. | Requires search/planning for goal achievement. | Navigation system planning a route. |
| Utility-Based Agent | Chooses actions based on a utility function measuring success; useful when there are multiple alternatives and the agent must choose the best action. | Balances conflicting goals and success likelihood. | Complex to compute utility for all states. | Robot prioritizing speed vs. safety. |
| Learning Agent | Learns from experience to improve performance. | Components: Learning Element, Performance Element, Critic, Problem Generator. | Needs initial knowledge; learning takes time. | Self-driving car adapting to traffic patterns. |

  1. Reflexive Agent: An agent whose action depends only on the current percepts.
  1. Model-based: An agent whose action is derived from an internal model of the current world state
    1. Partial observability
    1. Updating the internal state information as time goes by requires two kinds of knowledge to be encoded:
      1. We need some information about how the world evolves independently of the agent
      1. We need some information about how the agent’s own actions affect the world
  1. Goal-based agent: An agent that selects actions that it believes will achieve explicitly represented goals.
    1. Expansion of Model based agent
    1. Desirable situation
    1. Searching and planning
  1. Utility-Based Agent: An agent that selects actions that it believes will maximize the expected utility of the outcome state
    1. Utility function
    1. Deals with degrees of preference among states (happy vs. unhappy states)
  1. Learning Agent: An agent whose behavior improves over time based on its experience
    1. Learning Element: Responsible for making improvements.
    1. Performance Element : is what we have previously considered to be the entire agent: it takes in percepts and decides on actions
    1. Critic: The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future.
    1. Problem Generator: suggesting actions that will lead to new and informative experiences.
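
As a minimal illustration of the simple reflex agent idea above, here is a Python sketch of a thermostat-style condition-action rule; the class name and target temperature are invented for this example.

class SimpleReflexThermostat:
    """Condition-action rule agent: acts on the current percept only."""

    def __init__(self, target_temp=22.0):
        self.target_temp = target_temp   # desired temperature (assumed setting)

    def act(self, percept_temp):
        # "If temperature below target, then heat" -- no memory of the past.
        if percept_temp < self.target_temp:
            return "HEATER_ON"
        return "HEATER_OFF"

agent = SimpleReflexThermostat()
print(agent.act(18.5))   # HEATER_ON
print(agent.act(23.0))   # HEATER_OFF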

Different informed and uninformed search techniques


State space search: The state space is the set of all possible states, including the start and goal states, within which the solution to a particular problem is searched.

Data structure: the state space is represented by a graph.

Generalized state space search algorithm

A runnable Python version of the original pseudocode (L holds generated-but-unexpanded nodes; Sc is the node being expanded, Ss a successor, Sg the goal; the successor function is supplied by the caller):

from collections import deque

def state_space_search(s_initial, s_goal, successors):
    L = deque([s_initial])          # nodes generated but not yet expanded
    generated = {s_initial}         # avoids adding duplicates to L
    goal_found = False
    while L and not goal_found:     # GOAL_EXISTS is false once L is empty
        s_c = L.popleft()           # pick an unexpanded node Sc from L
        for s_s in successors(s_c):
            if s_s == s_goal:
                goal_found = True   # GOAL_FOUND
            elif s_s not in generated:
                generated.add(s_s)
                L.append(s_s)
        # Sc is now expanded and never re-enters L
    return goal_found               # False => goal does not exist

Search Tree:

Properties we use to evaluate an algorithm

  1. Completeness : Guaranteed to find a solution if one exists
  1. Optimal: if it always finds the best solution
  1. Time complexity: The amount of time an algorithm takes
  1. Space Complexity : The amount of memory an algorithm requires

Informed vs Uninformed

| Aspect | Uninformed | Informed |
| --- | --- | --- |
| Definition | Search algorithms that explore the problem space without any domain-specific knowledge or heuristics, relying on problem structure and predefined rules. Also known as blind search algorithms. | Search algorithms that use domain-specific heuristics or estimates of the cost to reach the goal. |
| Knowledge Used | Only problem structure | Heuristics estimate cost to goal |
| Time | Time-consuming | Quicker solutions |
| Cost | Costly | Less costly |
| Time and Space Complexity | Higher | Lower |
| Example | BFS, DFS | A*, Best-first search |

Uninformed Algorithm

  1. Breadth-First-Search Algorithm
    1. Explores all nodes at the current depth before moving to the next level
    1. Uses a queue FIFO
    1. Shortest path but more memory
  1. Depth-First-Search
    1. Explores as far as possible along each path before backtracking
    1. Uses a Stack LIFO
    1. Infinite loop, memory-efficient
    1. How it detects cycles
      1. When DFS is applied to a graph, finding an edge that points to an already visited vertex indicates a cycle
    1. DFS is not optimal
      1. Doesn’t consider path cost
      1. Many states keep recurring → no guarantee to find a solution
      1. May go to infinite loop
      1. Doesn’t find the shortest path always
  1. Depth Limited Search
    1. A variation of DFS that limits the depth of exploration to prevent infinite loops in large or infinite state spaces
    1. Useful when the goal depth is known but cannot find solution beyond the depth limit.
    1. Limitation
      1. Not optimal
      1. Incompleteness
      1. Effectiveness depends on the depth limit
  1. Iterative Deepening Depth First Search
    1. combines BFS and DFS altogether
    1. Adv
      1. Ensures completeness and optimality → BFS
      1. Memory Efficient → DFS
    1. Dis
      1. repeats all the work of the previous phase
  1. Uniform Cost Search
    1. Extends BFS by considering path costs, always expanding the least-cost node first.
    1. Finds the least-cost path, but is slower than BFS
  1. Bidirectional Search
    1. Runs two simultaneous searches, one from the initial state (forward search) and another from the goal (backward search)
    1. It replaces one search graph with two smaller search subgraphs
    1. The search stops when the two graphs intersect each other
    1. When it doesn’t work
      1. Implementing the backward search tree is difficult
      1. The goal state is unknown/unclear in advance
      1. Finding an efficient way to check if a match exists is tricky, which can increase the time.
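
To make the BFS/DFS contrast concrete, here is a minimal Python sketch over an adjacency-list graph; the `graph` dictionary and node names are invented for illustration.

from collections import deque

def bfs(graph, start, goal):
    # FIFO queue: explores all nodes at one depth before the next level.
    frontier, visited = deque([start]), {start}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            return True
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return False

def dfs(graph, start, goal):
    # LIFO stack: follows one path as deep as possible before backtracking.
    frontier, visited = [start], set()
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        if node in visited:      # an edge to a visited node signals a cycle
            continue
        visited.add(node)
        frontier.extend(graph.get(node, []))
    return False

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A", "D"), dfs(graph, "A", "D"))   # True True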

| Algorithm | Time Complexity | Space Complexity | Optimal | Complete |
| --- | --- | --- | --- | --- |
| BFS | $b^d$ | $b^d$ | Yes | Yes |
| DFS | $b^d$ | $bd$ | No | No |
| DLS | $b^l$ | $bl$ | No | Yes, if $l \geq d$ |
| IDDFS | $b^d$ | $bd$ | Yes | Yes |
| Uniform Cost | $b^d$ | $b^d$ | Yes | Yes |
| Bidirectional | $b^{d/2}$ | $b^{d/2}$ | Yes | Yes |

Informed Search, Heuristics*, How heuristics help? A* search, Proof of optimality of A*


https://iipseries.org/assets/docupload/rsl2024AF233C7BF02A178.pdf

  1. Best-First Search
    1. Greedy search: it picks the path that appears best at the moment.
    1. A blend of both DFS and BFS
    1. Choose the most promising node at each step
    1. It is implemented with a priority queue
    1. Advantage :
      1. It switches between BFS and DFS behavior, gaining the advantages of both algorithms
      1. More effective and capable than BFS or DFS alone
    1. Disadvantage
      1. Worst case: operates as an unguided DFS
      1. Get stuck in a loop
      1. Not optimal
      1. Not complete
    1. TC/SC: $O(b^m)$
    1. Evaluation function: $f(n) = h(n)$, where

    $h(n) = \text{estimated cost from node } n \text{ to the goal}$

    $g(n) = \text{cost from the start node to node } n \text{ (used later by A*)}$

  1. A* Algorithm
    1. Heuristic: A heuristic is a technique designed to solve a problem faster than classic methods, or to find an approximate solution when the classic methods fail to find an exact solution.
      1. How it helps in finding solution
        1. It is useful for reducing the time and resources required to find solutions by focusing the search on the most promising paths.
        1. It prioritizes which node to explore based on their estimated cost to the goal.
        1. It helps in state space search by guiding the exploration, reducing the search space and improving efficiency.
        1. It allows the algorithm to focus on exploring paths that are more likely to lead to a solution.
      1. Admissible Heuristic: A heuristic $h(n)$ is admissible if for every node $n$, $h(n) \leq h^*(n)$, where $h^*(n)$ is the true cost to reach the goal from $n$. An admissible heuristic never overestimates the cost to reach the goal; thus it is optimistic.
        1. If $h_2(n) \geq h_1(n)$ for all nodes $n$ and both are admissible, then $h_2$ dominates $h_1$, and $h_2$ is better for search.
    1. Evaluation function: $f(n) = g(n) + h(n)$
    1. Optimal, Complete
    1. TC : Exponential.
    1. $A^*$ is always optimal

      Suppose some suboptimal goal $G_2$ has been generated and is in the fringe. Let $n$ be an unexpanded node in the fringe such that $n$ is on a shortest path to an optimal goal $G$.

      $f(G_2) = g(G_2)$ since $h(G_2) = 0$
      $f(G) = g(G)$ since $h(G) = 0$
      $g(G_2) > g(G)$, since $G_2$ is suboptimal
      $\Rightarrow f(G_2) > f(G)$

      By admissibility, $h(n) \leq h^*(n)$, so
      $g(n) + h(n) \leq g(n) + h^*(n)$, i.e.
      $f(n) \leq f(G) < f(G_2)$

      Hence A* will never select $G_2$ for expansion.

      A heuristic is consistent if for every node $n$ and every successor $n'$ of $n$ generated by any action $a$:

      $h(n) \leq c(n, a, n') + h(n')$

      If $h$ is consistent,

      $f(n') = g(n') + h(n') = g(n) + c(n, a, n') + h(n') \geq g(n) + h(n) = f(n)$

      so $f(n)$ is non-decreasing along any path. Therefore, A* using graph search is optimal.

    1. Prove that the uniform-cost search is a special case of A* search.

    A* search uses the evaluation function:

    $f(n) = g(n) + h(n)$

    where:

    • $g(n)$ = cost from the start node to node $n$
    • $h(n)$ = heuristic estimate of the cost from $n$ to the goal

    Uniform-Cost Search (UCS) does not use a heuristic, so:

    $h(n) = 0$

    Therefore, for UCS, the evaluation function becomes:

    $f(n) = g(n) + 0 = g(n)$

    This means UCS always expands the node with the lowest path cost $g(n)$, exactly like A* with a zero heuristic.

    Hence, Uniform-Cost Search is a special case of A* Search where the heuristic function $h(n)$ is zero for all nodes.

    Prove that, if the heuristic function $h$ never overestimates by more than a cost $c$, then A* using $h$ returns a solution whose cost exceeds that of the optimal solution by no more than $c$.

    Suppose $h(n) \leq h^*(n) + c$ as given, and let $G_2$ be a goal that is suboptimal by more than $c$, i.e. $f(G_2) = g(G_2) > C^* + c$. Now consider any node $n$ on a path to an optimal goal. We have

    $f(n) = g(n) + h(n) \leq g(n) + h^*(n) + c = C^* + c < f(G_2)$

    so $G_2$ will never be expanded before an optimal node is expanded, because $f(n) < f(G_2)$.
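
A compact Python sketch of A* under these definitions; the `graph` (node → list of (neighbor, step cost)) and the heuristic table `h` are toy assumptions, with `h` chosen to be admissible.

import heapq

def a_star(graph, h, start, goal):
    # Expands the node with the lowest f(n) = g(n) + h(n).
    frontier = [(h[start], 0, start)]          # (f, g, node)
    best_g = {start: 0}
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g                            # cost of the path found
        for nbr, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best_g.get(nbr, float("inf")):
                best_g[nbr] = g2
                heapq.heappush(frontier, (g2 + h[nbr], g2, nbr))
    return None

graph = {"S": [("A", 1), ("B", 4)], "A": [("G", 5)], "B": [("G", 1)], "G": []}
h = {"S": 4, "A": 5, "B": 1, "G": 0}            # admissible estimates (assumed)
print(a_star(graph, h, "S", "G"))               # 5, via S -> B -> G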

Local Search Algorithm: Local search algorithms are essential tools in artificial intelligence and optimization, employed to find high-quality solutions in large and complex problem spaces.

  1. Hill-Climbing Search:
    1. It is a straightforward local search algorithm that iteratively moves towards better solutions.
    1. Process : Start → Evaluate → Move → Repeat
    1. Pros: Easy to implement, works well in small search space
    1. Cons
      1. Local Maxima: Hill-climbing can get stuck at a local maximum.
      1. Plateaus: On a flat region where neighboring states have the same heuristic value, hill-climbing may wander aimlessly, slowing progress or failing to find the goal.
      1. Ridges: In state spaces with ridges (narrow paths of improvement), hill-climbing may oscillate between states without advancing toward the goal.
      1. No Backtracking: Hill-climbing only considers the current state and its neighbors, without backtracking to explore alternative paths, missing better solutions elsewhere.
    1. Solutions (see the sketch after this list)
      1. Introduce gradient descent search, a variation of hill climbing that moves downhill.
      1. Introduce a small random jump to escape a plateau.
      1. Use stochastic hill climbing, where steps are probabilistic, to help navigate ridges.
  1. Simulated Annealing
  1. Local Beam Search
  1. Genetic Algorithm
  1. Tabu Search
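
The hill-climbing sketch referenced above, in Python; the integer objective and neighbor function are toy assumptions chosen for illustration.

import random

def hill_climb(start, neighbors, score, max_steps=1000):
    # Greedy hill climbing: move to the best neighbor while it improves
    # the score; stops at a local maximum or on a plateau.
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current            # local maximum / plateau reached
        current = best
    return current

score = lambda x: -(x - 3) ** 2       # toy objective: maximum at x = 3
neighbors = lambda x: [x - 1, x + 1]
print(hill_climb(random.randint(-10, 10), neighbors, score))   # 3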

2 player zero sum games, mini-max algorithm, alpha-beta pruning


https://ocw.mit.edu/courses/15-053-optimization-methods-in-management-science-spring-2013/2e66a9d9a74dc5c11b620f70663400da_MIT15_053S13_tut08.pdf

https://www.arsdcollege.ac.in/wp-content/uploads/2020/03/Artificial_Intelligence-week3.pdf

The 2-person 0-sum game is a basic model in game theory. There are two players, each with an associated set of strategies. While one player aims to maximize his payoff, the other attempts to take an action that minimizes this payoff. The gain of one player is the loss of the other.

Mini-Max Algorithm

Mini-max algorithm is a recursive or backtracking algorithm which is used in decision making and game theory.

The game is modeled as a game tree

  1. Node → Game State
  1. Edges → Legal Moves
  1. Root → Current Position
  1. Leaves → Game outcomes with utility values (e.g., +1 for a MAX win, −1 for a MIN win)

Process

  1. Start at the root (MAX's turn)
  1. Recursively explore all possible moves down to terminal nodes or a fixed depth
  1. At leaf nodes, assign values using a utility function, or a heuristic evaluation function for non-terminal states
  1. Backpropagate values
    1. At MAX nodes → select the child with the highest value
    1. At MIN nodes → select the child with the lowest value
  1. At the root, choose the move that yields the highest value, ensuring the best outcome against MIN's optimal play.

Analysis: $\text{TC} = O(b^d)$, $\text{SC} = O(bd)$

Alpha-beta Pruning
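
Alpha-beta pruning improves minimax by skipping branches that cannot affect the final decision: alpha tracks the best value found so far for MAX, beta the best for MIN, and a subtree is pruned as soon as alpha ≥ beta. Below is a minimal Python sketch over a toy game tree; the nested-list encoding (inner lists are internal nodes, integers are leaf utilities) is chosen just for this illustration.

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, int):          # terminal: return its utility value
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:          # MIN will never allow this branch
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:              # MAX already has a better option
            break
    return value

tree = [[3, 5], [6, [9, 1]], [1, 2]]   # MAX -> MIN -> leaves (toy tree)
print(alphabeta(tree, True))            # 6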

Genetic algorithm, steps of genetic algorithm. (using MAXONE problem, see slides), Different Crossover and Selection techniques, GA for solving optimization problems


https://egyankosh.ac.in/bitstream/123456789/12697/1/Unit-11.pdf

http://www.cs.cmu.edu/~02317/slides/lec_8.pdf

Genetic Algorithm: A genetic algorithm is a search heuristic inspired by Charles Darwin's theory of natural evolution. It is used to solve optimization problems by mimicking the process of natural selection, where the fittest individuals are selected for reproduction to produce the offspring of the next generation.

Application of GA: Optimization, Automatic Programming, Machine and robot learning, Economic models, Ecological models, Population genetic models and Models of social systems.

Steps

  1. Initialization: Start with randomly generated population
  1. Evaluation: Evaluate each individual using a fitness function
  1. Selection: Select the individuals for reproduction using different techniques
    1. Types of selection techniques
      1. Roulette Wheel Selection: Conceptually, this can be represented as a game of roulette - each individual gets a slice of the wheel, but more fit ones get larger slices than less fit ones.

        $P_i = \frac{F_i}{\sum_j F_j}$

      1. Rank-based selection: the probability of an individual being selected for reproduction or survival is determined by its rank within the population, not its raw fitness score.
      1. Elitist Selection: Choose only the most fit members of each generation.
      1. Cutoff Selection: Select only those that are above a certain cutoff for the target function.
      1. Scaling Selection: Rescale raw fitness values as the population's average fitness rises, to maintain selection pressure.
  1. Crossover: Combine pair to produce offspring
    1. Types
      1. Single Point: Randomly select a single point for a crossover
      1. Two point crossover: Avoids cases where genes at the beginning and end of a chromosome are always split
      1. Uniform
        1. A random subset of bit positions is chosen
        1. Bits in the subset are taken from parent 1 and the remaining bits from parent 2

  1. Mutation: Random changes to some individuals to maintain diversity
    1. Mutation prevents the algorithm from being trapped in a local minimum
  1. Termination: Repeat the process until the termination criterion is met

The basic algorithm

  1. [Start] Generate a random population of n chromosomes
  1. [Fitness] Evaluate the fitness $f(x)$ of each chromosome x in the population
  1. [New Population] Create a new population by repeating following steps until the New Population is complete
    1. [Selection] Select two parent chromosomes from a population according to their fitness value
    1. [Crossover] With a crossover probability, cross over the parents to form a new offspring.
    1. [Mutation] With a mutation probability, mutate new offspring at each locus.
    1. [Accepting] Place new offspring in the new population
  1. [Replace] Use the newly generated population for a further run of the algorithm
  1. [Test] If the condition is satisfied, stop and return the best solution in current population.
  1. [Loop] Go to Step 2 for fitness evaluation.

MaxOne problem: The MaxOne problem is to find a binary string of length $l$ that contains the maximum number of ones. The optimal solution is a string of all 1s.
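
A minimal GA for MaxOne tying the steps together; the string length, population size, generation count, and mutation rate are assumed values for this sketch.

import random

L, POP, GENS, P_MUT = 10, 20, 50, 0.02   # assumed parameters

def fitness(bits):                        # MaxOne: count of 1s
    return sum(bits)

def roulette_pick(pop):
    # Roulette-wheel selection: probability proportional to fitness.
    total = sum(fitness(ind) for ind in pop) or 1
    r, acc = random.uniform(0, total), 0
    for ind in pop:
        acc += fitness(ind)
        if acc >= r:
            return ind
    return pop[-1]

def crossover(p1, p2):                    # single-point crossover
    cut = random.randrange(1, L)
    return p1[:cut] + p2[cut:]

def mutate(bits):                         # bit-flip mutation
    return [b ^ 1 if random.random() < P_MUT else b for b in bits]

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(roulette_pick(pop), roulette_pick(pop)))
           for _ in range(POP)]
best = max(pop, key=fitness)
print(best, fitness(best))                # tends toward all 1s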

Bayes' rule, Belief update*, Naive Bayes Classifier, Formulation, Dealing with sparse data, Usage in document classification*, Gaussian Naive Bayes


Bayes' Theorem: Bayes' theorem is a fundamental principle in probability theory that allows for the computation of the conditional probability of a hypothesis H given observed evidence E.

Derivation:

$P(A|B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B|A) = \frac{P(A \cap B)}{P(A)}$

$P(A|B)\,P(B) = P(B|A)\,P(A)$

$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

$P(H|E) = \frac{P(E|H)\,P(H)}{P(E)} = \frac{P(E|H)\,P(H)}{P(E|H)\,P(H) + P(E|\neg H)\,P(\neg H)}$
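
A quick worked example with hypothetical numbers: suppose a disease has prior $P(H) = 0.01$, a test detects it with $P(E|H) = 0.9$, and false-alarms with $P(E|\neg H) = 0.05$. Then

$P(H|E) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} = \frac{0.009}{0.0585} \approx 0.154$

so even a positive test leaves only about a 15% posterior probability of disease, because the prior is so low.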

Bayesian Network: A Bayesian Network (BN) is a powerful probabilistic graphical model used for decision making under uncertainty.

Principle:

  1. Structure: It consists of a DAG and CPTs (Conditional Probability Tables)
    1. Node : Random Variable
    1. Edge : Conditional Dependency
    1. No Edge : Conditional Independence.

    Each node is associated with a CPT that quantifies the probability of the node given its parents: $P(X_i \mid \mathrm{Pa}(X_i))$.

    The network encodes the joint probability distribution of all variables as $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}(X_i))$

  1. Probabilistic Inference: BNs update the probabilities of unobserved variables using Bayes' theorem, facilitating reasoning under uncertainty. For example, the network can be used to update knowledge of the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference.
  1. Belief Update: A Bayesian update or belief update is a change in probabilistic beliefs after gaining new knowledge. For example, after observing a patient's test result, we might revise our probability that the patient has a certain disease. If this belief revision obeys Bayes' rule, then it is called Bayesian. When evidence is observed, the prior probabilities are updated to posterior probabilities.

Application: Medical diagnosis, decision support, prediction and forecasting, anomaly detection.

Naive Bayes Classifier: Naive Bayes is a classification algorithm that uses probability to predict which category a data point belongs to, assuming that all features are independent of one another given the class.

Formulation

Bayes' theorem: $P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}$

C : Class, X : Feature

Naive assumption: All features are conditionally independent given the class.

$P(X|C) = \prod_{i=1}^{n} P(x_i|C)$

Hence, final classification becomes,

$\hat{C} = \arg\max_C P(C) \prod_{i=1}^{n} P(x_i|C)$

Dealing with Sparse Data (Zero-Probability Problem)
In real datasets, some feature-class combinations may not appear in the training data, resulting in a zero probability that can nullify the whole product.

To handle this, we use Laplace Smoothing (Additive Smoothing):

$P(x_i|C) = \frac{\mathrm{count}(x_i, C) + \alpha}{\sum_j \mathrm{count}(x_j, C) + \alpha\,|V|}$, where $|V|$ is the number of distinct feature values and $\alpha$ the smoothing constant.

Usage in document classification

Spam filtering, sentiment analysis, topic classification
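
A toy multinomial Naive Bayes with Laplace smoothing for document classification; the two training documents and the (words, label) format are invented for illustration.

from collections import Counter, defaultdict
import math

def train_nb(docs, alpha=1.0):
    # docs: list of (list_of_words, label) pairs (toy format)
    class_counts, word_counts = Counter(), defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, alpha

def predict(model, words):
    class_counts, word_counts, vocab, alpha = model
    n_docs, best = sum(class_counts.values()), None
    for c in class_counts:
        # log P(C) + sum_i log P(x_i | C); smoothing keeps unseen words
        # from zeroing out the whole product.
        total = sum(word_counts[c].values())
        logp = math.log(class_counts[c] / n_docs)
        for w in words:
            logp += math.log((word_counts[c][w] + alpha) /
                             (total + alpha * len(vocab)))
        best = max(best, (logp, c)) if best else (logp, c)
    return best[1]

docs = [(["win", "money", "now"], "spam"), (["meeting", "at", "noon"], "ham")]
model = train_nb(docs)
print(predict(model, ["win", "now"]))    # spam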

Definitions

  1. Mass function: The mass function, denoted $m$, assigns a value in $[0,1]$ to every subset of the frame of discernment.
  1. Belief (Bel): The belief function $Bel(A)$ for a subset A is the sum of the mass probabilities of all subsets of A.

    $Bel(A) = \sum_{B \subseteq A} m(B)$

  1. Plausibility (Pls): The plausibility function $Pls(A)$ represents the maximum possible belief in A.

    $Pls(A) = 1 - Bel(\neg A)$

  1. Belief Interval: The belief interval for a subset A is the range $[Bel(A), Pls(A)]$, which expresses the range of belief in A.

Gaussian Naive Bayes Algorithm

Gaussian Naive Bayes is a variant of Naive Bayes for continuous attributes; it assumes the feature values follow a Gaussian distribution within each class.

  1. Calculate the mean and variance of each feature per class:

    $\mu = \frac{\sum x_i}{n}$

    $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n-1} \quad \text{[sample variance]}$

    $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n} \quad \text{[population variance]}$

  1. Likelihood: $P(x_i \mid Y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$
  1. Posterior: $P(Y \mid X) = \frac{P(Y) \prod_i P(x_i \mid Y)}{\text{evidence}}$
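
A one-feature Gaussian Naive Bayes sketch following these formulas; the class data values are invented.

import math

def gaussian_pdf(x, mu, var):
    # P(x | Y): normal density with class-conditional mean and variance.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)  # sample variance
    return mu, var

data = {"A": [1.0, 1.2, 0.8], "B": [3.0, 3.4, 2.6]}   # toy 1-feature classes
total = sum(len(v) for v in data.values())
params = {c: fit(v) for c, v in data.items()}
priors = {c: len(v) / total for c, v in data.items()}

x = 1.1
scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in data}
print(max(scores, key=scores.get))    # A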

Decision Tree, Information Gain & Gini Index.


https://downloads.ctfassets.net/kdr3qnns3kvk/6nDiFgv0LRFz3ocCMvZGMR/c8b9acb313cae4f7ccc20a61058dbb80/Week5-DecisionTrees.pdf

A decision tree is a machine learning model that uses a tree-like structure to make decisions based on a sequence of questions and conditions.

Entropy is a measure of impurity or disorder in a dataset:

$Ent(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$

Information Gain measures the reduction in entropy after a dataset is split on an attribute:

$IG(S, A) = Ent(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Ent(S_v)$


Gini Index:

$Gini(S) = 1 - \sum_i p_i^2$
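
A short Python sketch computing entropy, Gini, and information gain directly from label lists; the 9-yes/5-no labels and the split below are toy values.

import math
from collections import Counter

def entropy(labels):
    # Ent(S) = -sum p_i log2 p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini = 1 - sum p_i^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, splits):
    # IG = Ent(S) - sum (|S_v| / |S|) * Ent(S_v)
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

labels = ["yes"] * 9 + ["no"] * 5
print(round(entropy(labels), 3))   # 0.940
print(round(gini(labels), 3))      # 0.459
print(round(information_gain(labels,
      [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]), 3))   # 0.048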

Machine Learning, Steps of ML, Hyperparameter tuning, why needed?


https://www.cs.cmu.edu/~hn1/documents/machine-learning/notes.pdf

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

| Aspect | Traditional Programming | Machine Learning |
| --- | --- | --- |
| Programming | Rules and logic are manually coded by the programmer. | Model learns patterns automatically from data. |
| Input | Data and explicit rules. | Data (often labeled for supervised learning). |
| Output | Deterministic output based on coded rules. | Predictions or decisions based on learned patterns. |
| Flexibility | Limited to predefined rules; changes require recoding. | Adapts to new data; retraining updates the model. |
| Use Cases | Well-defined, rule-based tasks (e.g., payroll, sorting). | Complex, pattern-based tasks (e.g., image recognition, predictions). |
| Examples (Lecture 1) | Calculating payroll | Speech recognition, personalized … |

Steps of ML

  1. Project Setup: This is the first step, to plan and set up the environment for a machine learning project.
    1. Understand Business Goal
      1. Having conversation with stakeholders
    1. Choose the solution to the problem
      1. Which category of the models derive the highest impact
  1. Data Preparation:
    1. Collection of data
      1. Be clear about the goal and objectives → identify which data are vital for model tuning
      1. Collecting related data from various sources according to project requirement
    1. Cleaning of Data
      1. Identify and handle missing values, inconsistencies, removing duplicates
    1. Data transformation
      1. Convert cleaned data into a format suitable for machine learning
      1. modifying or converting data, feature scaling, feature encoding
    1. Data Reduction
      1. Simplifying data without losing the essence
    1. Data Splitting
      1. Split data into different sets to ensure more reliable and actionable evaluation
      1. Splitting of data
        1. Training set:
          1. actual dataset from which a model trains
          1. model sees and learns from this data
          1. 60% of total data
        1. Validation set
          1. Used to tune hyperparameters
          1. Model sees this data for evaluation but doesn’t learn from this
          1. 15% of total data
        1. Testing set
          1. Evaluates the model after training is complete
          1. Provides unbiased evaluation of the models
          1. 20-25% of data
    1. Importance of data preparation
      1. Provide reliable prediction outcomes
      1. Identify data issues or error
      1. Increase decision making capability
      1. Reduce cost
      1. Increase model performance
  1. Deployment:
    1. Deploy the model:
      1. integrating it into a production environment where it can process real world data
      1. MLOPS
    1. Monitor Model performance
      1. Regularly test the performance of the model as new data arrives
    1. Improve Model
      1. Continuously iterate and improve model
      1. Replace model with an updated version

| Feature | Random Split | Stratified Split | K-Fold Cross-Validation | Time-Based Split |
| --- | --- | --- | --- | --- |
| How it works | Randomly divides data | Keeps class proportions the same | Splits data into k parts and trains k times; in each iteration, k−1 folds are used for training and the remaining fold for validation | Splits data by time order: earlier data points form the training set, later data points form the test or validation set |
| When to use | Data has no order, balanced classes | Classification with imbalanced classes | When data is small and reliable results are wanted | Time-series or sequential data |
| Pros | Simple and fast | Keeps classes balanced | Uses all data for training and testing | Keeps time order, avoids future data leaking |
| Cons | May not keep class balance | Only for classification problems | Takes more time to run | Only for ordered data, not random data |
| Example | Splitting customer data randomly | Heart disease data with rare cases | Small dataset with 5 folds | Stock prices split by date |

Hyperparameter

Hyperparameter tuning is a fundamental process in machine learning that involves selecting the best set of hyperparameters to maximize a model's performance and generalization ability. Hyperparameters are predefined configuration settings, such as learning rate, batch size, or the number of layers, which are not learned from the data but significantly influence the training process and final model quality.

Some techniques

  1. Grid Search : exhaustively tests all combinations of specified hyperparameter values
  1. Random Search: which samples parameter combinations randomly to reduce computation cost

Why it is needed

  1. Improve model performance: capturing patterns without overfitting or underfitting
  1. Balance bias and variance trade off: controls complexity that affects the bias-variance trade off.
  1. Enhancing learning efficiency
  1. Optimizes resource usage

Supervised, Unsupervised and Reinforcement learning


| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Definition | A model is trained on a labeled dataset with input-output pairs to predict outputs for new data. | A model identifies patterns or structures in unlabeled data without predefined outputs. | An agent learns by interacting with an environment, making decisions to maximize cumulative rewards. |
| Data Type | Labeled (input-output pairs) | Unlabeled (no predefined outputs) | No predefined input-output; uses rewards |
| Goal | Predict accurate outputs for new inputs | Find patterns or structures in data | Maximize cumulative reward |
| Feedback | Direct (correct labels) | None (inferred from data structure) | Delayed (reward signals from environment) |
| Examples | House price prediction, disease classification | Customer segmentation, data compression | Robot navigation, game playing |
| Algorithms | Linear regression, SVM, logistic regression | K-means, PCA | Q-learning, policy gradients |
| Human Involvement | Yes | No | Low |

Gradient Descent Algorithm using a linear regression example


Gradient Descent is an optimization algorithm used to find the optimal parameters of a machine learning model by iteratively adjusting the parameters to minimize a cost (loss) function.

In linear regression, the goal is to find the line that best fits a set of data points. The model can be written as

$y = wx + b$

where $y$ is the predicted output, $w$ the weight, $x$ the input, and $b$ the bias.

We measure how well the model fits the data using a cost function. A common cost function for linear regression is the mean squared error (MSE):

$J(w, b) = \frac{1}{2m} \sum (\hat y_i - y_i)^2$

where $\hat y_i = w x_i + b$ is the predicted value, $y_i$ the actual value, and $m$ the number of data points.

Algorithm

  1. Initialize $w = b = 0$
  1. Calculate the gradients
    1. $\frac{\partial J}{\partial w} = \frac{1}{2m} \sum 2(\hat y_i - y_i) \cdot x_i = \frac{1}{m} \sum (\hat y_i - y_i)\, x_i$
    1. $\frac{\partial J}{\partial b} = \frac{1}{m} \sum (\hat y_i - y_i)$
  1. Adjust $w$ and $b$ using the gradients:
    1. $w = w - \alpha \frac{\partial J}{\partial w}$
    1. $b = b - \alpha \frac{\partial J}{\partial b}$
  1. Iterate the process until it converges

Example: dataset $(x, y) \in \{(1,2), (2,3), (3,5)\}$

  1. Initialize $w = b = 0$
  1. Calculate gradients: for each iteration compute $\frac{\partial J}{\partial w}$ and $\frac{\partial J}{\partial b}$
  1. Learning rate $\alpha = 0.001$
  1. After multiple iterations, the algorithm finds values of $w$ and $b$ that minimize $J(w, b)$
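
The same example in Python; a slightly larger learning rate than the text's 0.001 is used here so the sketch converges in fewer iterations.

data = [(1, 2), (2, 3), (3, 5)]     # the toy points from the example above
w = b = 0.0
alpha = 0.01                        # learning rate (assumed for the sketch)
m = len(data)

for _ in range(5000):
    dw = sum((w * x + b - y) * x for x, y in data) / m   # dJ/dw
    db = sum((w * x + b - y) for x, y in data) / m       # dJ/db
    w -= alpha * dw
    b -= alpha * db

print(round(w, 3), round(b, 3))     # approaches the least-squares fit (1.5, 0.333)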

Regression and classification problems


| Aspect | Regression | Classification |
| --- | --- | --- |
| Definition | A type of supervised learning where the algorithm learns to predict continuous values based on input features. | A type of supervised learning where the algorithm learns to assign inputs to a specific category or class based on input features. |
| Used for | Predicting values | Predicting classes |
| Output Labels | Continuous values | Discrete classes |
| Algorithms | Linear, Polynomial, Decision tree | Naive Bayes, Decision tree, Logistic Regression |
| Evaluation Metrics | MSE, MAE, $R^2$ score | Accuracy, Precision, ROC-AUC |
| Example | Predicting house prices, forecasting sales, predicting temperature, stock prices | Email classification, disease diagnosis, image recognition, fraud detection |

Performance metrics*, AUC - ROC


https://www.tutorialspoint.com/machine_learning/machine_learning_performance_metrics.htm

Performance metrics in machine learning are used to evaluate the performance of a machine learning model.

Confusion Matrix: a table that summarizes classification results in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Performance Metrics for Regression Problems

  1. Mean Absolute Error (MAE):
    1. The average of the absolute differences between the predicted and actual values.
    1. $\text{MAE} = \frac{1}{n} \sum |Y - \hat Y|$
  1. Mean Square Error (MSE):
    1. The average of the squared differences between target and predicted values.
    1. $\text{MSE} = \frac{1}{n} \sum (Y - \hat Y)^2$
    1. Differentiable, hence easier to optimize
  1. Root Mean Square Error (RMSE):
    1. The square root of MSE
    1. $\text{RMSE} = \sqrt{\frac{1}{n} \sum (Y - \hat Y)^2}$
    1. Differentiable
    1. Reports the error on the same scale as the target values, undoing the inflation introduced by squaring in MSE
  1. $R^2$ Score:
    1. The R-squared metric is generally used for explanatory purposes and provides an indication of the goodness of fit of a set of predicted output values to the actual output values.
    1. $R^2 = 1 - \frac{\text{Sum of squared errors}}{\text{Total sum of squares}}$

Performance Metrics for Classification Problems

  1. Accuracy
    1. The ratio of correct predictions to total predictions
    1. $\text{Accuracy} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}} = \frac{\text{Correct predictions}}{\text{Total observations}}$
  1. Precision
    1. Proportion of true positive instances out of all predicted positive instances
    1. $\text{Precision} = \frac{TP}{TP + FP}$
    1. In document retrieval: the proportion of documents returned by the model that are actually relevant
    1. A precision of 1 → the model produced no false positives; everything it labeled positive was correct
    1. A low precision (< 0.5) → a high number of false positives
  1. Recall/Sensitivity
    1. Proportion of actual positives correctly identified
    1. $\text{Recall} = \frac{TP}{TP + FN}$
    1. A recall of 1 → the model didn't miss any true positives
    1. A low recall (< 0.5) → a high number of false negatives
  1. Specificity
    1. Proportion of actual negatives correctly identified
    1. $\text{Specificity} = \frac{TN}{TN + FP}$
  1. $F_1$ Score:
    1. Harmonic mean of precision and recall
    1. $F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

Precision-Recall Tradeoff

Precision and recall trade off against each other ($\text{Precision} \propto \frac{1}{\text{Recall}}$): improving one typically degrades the other, so a precision-recall curve is used to observe both together.
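
A small helper computing these metrics from confusion-matrix counts; the counts below are hypothetical.

def classification_metrics(tp, tn, fp, fn):
    # All five metrics defined above, from the four confusion-matrix cells.
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
print([round(x, 3) for x in classification_metrics(40, 45, 5, 10)])
# [0.85, 0.889, 0.8, 0.9, 0.842]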

AUC

ROC AUC : Receiver Operating Characteristic Area Under Curve

https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

AUC-ROC Curve

(Figures: ROC curves comparing a random classifier with classifiers of great AUC.)

Loss functions, L1, L2, Huber loss, Binary Cross Entropy loss


A loss function is used in ML to measure the difference between the predicted output of a model and the actual target. It is also known as a cost function or error function.

Loss function for regression

  1. $\text{MAE}/L_1$ Loss
    1. Average of the absolute differences between the actual and predicted values
    1. $\text{MAE} = \frac{1}{n} \sum |\hat y_i - y_i|$
  1. $\text{MSE}/L_2$ Loss
    1. Average of the squared differences between actual and predicted values
    1. $\text{MSE} = \frac{1}{n} \sum (\hat y_i - y_i)^2$
  1. Huber Loss
    1. Defined as a combination of the MSE and MAE loss functions
      1. MSE-like (quadratic) when the error is small
      1. MAE-like (linear) when the error is large
    1. The hyperparameter $\delta$ controls the transition point between the quadratic and linear regions
    1. $L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{for } |y - f(x)| \leq \delta \\ \delta\,|y - f(x)| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$

Loss functions for Classification

  1. Binary cross-entropy loss
    1. Binary cross-entropy (log loss) is a loss function used in binary classification problems
    1. It measures the performance of a classification model whose predicted output is a probability value between 0 and 1
    1. When the number of classes is 2, it is binary classification:
      1. $L = -\frac{1}{m} \sum \left[ y_i \log(\hat y_i) + (1 - y_i) \log(1 - \hat y_i) \right]$
    1. Cross-entropy for multiple classes (> 2):

      $L = -\frac{1}{m} \sum y_i \log \hat y_i$

  1. Hinge Loss
    1. Developed for support vector machine model evaluation

      $L = \max(0,\, 1 - y \cdot f(x))$
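
NumPy sketches of the losses above; the toy label and prediction arrays are invented.

import numpy as np

def l1_loss(y, yhat):                        # MAE
    return np.mean(np.abs(yhat - y))

def l2_loss(y, yhat):                        # MSE
    return np.mean((yhat - y) ** 2)

def huber_loss(y, yhat, delta=1.0):
    err = np.abs(yhat - y)
    quad = 0.5 * err ** 2                    # quadratic region (small errors)
    lin = delta * err - 0.5 * delta ** 2     # linear region (large errors)
    return np.mean(np.where(err <= delta, quad, lin))

def bce_loss(y, yhat, eps=1e-12):            # binary cross-entropy
    yhat = np.clip(yhat, eps, 1 - eps)       # avoid log(0)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

y, yhat = np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.6])
print(round(float(bce_loss(y, yhat)), 3))    # ~0.28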

Overfitting and underfitting in terms of bias and variance


Bias: the gap between the model's predicted values and the actual values.

  • High bias: predictions are far from the actual values (underfitting)
  • Low bias: predictions are close to the actual values

Variance: how much the predictions scatter relative to one another.

  • Low variance: predictions are tightly clustered
  • High variance: predictions are widely scattered (overfitting)

Overfitting: Overfitting is an undesirable condition in ML that occurs when the model gives accurate predictions for training data but not for new data.

Reasons

  1. High Variance and Low Bias
  1. Model → too complex
  1. Training data size → Less
  1. Training data → irrelevant information, noise
  1. Trains for too long on the same sample data

Reductions

  1. Removing Feature
  1. Reduce Complexity
  1. Increase the size of training data
  1. Improve quality of training data
  1. Early stopping during training phase

Underfitting: The inability of the model to learn the training data effectively, resulting in poor performance on both training and testing data.

Reasons

  1. Too simple model
  1. Model has no capability to represent complexity in data
  1. Size of training data → less
  1. High Bias
  1. Training dataset → noise

Reductions

  1. Increase Complexity of model
  1. Increase the size of training data
  1. Increase number of features
  1. Increase the duration of training

Activation functions, linearity and differentiability of activation functions, ReLU activation functions, use of ReLU in deep learning


Activation function is a mathematical function used in artificial neural networks to determine the output of a neuron.

Common Activation functions

  1. Linear: $f(x) = x$, range $(-\infty, +\infty)$
  1. Sigmoid: $g(x) = \frac{1}{1+e^{-x}}$, range $(0, 1)$
  1. Tanh: $g(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, range $(-1, 1)$
  1. ReLU (Rectified Linear Unit): $g(x) = \max(0, x)$, range $[0, \infty)$
  1. Leaky ReLU: $g(x) = \begin{cases} ax & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$, range $(-\infty, \infty)$
  1. Swish: $g(x) = \frac{x}{1+e^{-x}}$
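
The functions above as NumPy one-liners, evaluated on a toy input.

import numpy as np

def sigmoid(x):    return 1 / (1 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x, a=0.01): return np.where(x < 0, a * x, x)
def swish(x):      return x * sigmoid(x)

x = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, swish):
    print(f.__name__, f(x))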

Activation function should be non-linear

Activation function should be differentiable

  1. WHY? Differentiability is required by backpropagation: neural networks are trained using the backpropagation algorithm, which involves a forward and a backward pass
    1. The backward pass requires the chain rule of calculus to compute derivatives layer by layer
      1. So differentiability is a must
  1. Example: for a network with activation function $f$, the gradient of the loss with respect to a weight $w$ is
    $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w}$

    where $z = wx + b$ and $L$ is the loss function. If $f$ is not differentiable, the chain rule cannot propagate gradients through $f$.

ReLU activation function

ReLU is one of the most popular activation functions used in neural networks, especially in deep learning.

$f(x) = \max(0, x)$, range $[0, \infty)$

It has become the default choice.

Use of ReLU

  1. Non-linearity
  1. It has a constant gradient of 1 for $x > 0$, which helps mitigate the vanishing gradient problem
  1. Makes training deep networks computationally efficient and effective
  1. The gradient is constant and doesn't shrink, allowing effective backpropagation in deep networks
  1. Combines non-linearity with computational simplicity
  1. Its simple operation reduces computation

Basic structure of Artificial Neurons*
[Basic terminologies, Learning rate, momentum, threshold]


An artificial neuron is a mathematical model that simulates how biological neurons process information. It is the fundamental building block of artificial neural networks.

Artificial Neuron

Artificial Neural Network:

Basic components of Perceptron:

A perceptron is a type of ANN (Artificial Neural Network), and a fundamental concept in ML.

  1. Input Layer
    1. Consists of one or more input neurons
    1. Receives signals from the external world
  1. Weights:
    1. The strength of the connection between input and output neurons
  1. Bias
    1. Added to give the perceptron additional flexibility
  1. Activation function:
    1. A mathematical function used to determine the output of a neuron
  1. Output:
    1. A single binary value, either 0 or 1

Basic Terminologies

  1. Learning Rate:
    1. A critical hyperparameter in ML and NNs that determines the size of the steps taken during the optimization process to minimize the loss function.
    1. $W_{new} = W_{old} - \text{LR} \times G$, where $G$ is the gradient
    1. Controls how much the model updates its weights with respect to the gradient
    1. Small LR (e.g., 0.0001) → slow convergence
      Large LR (e.g., 1.0) → speeds up training, but the loss oscillates
      Moderate LR (e.g., 0.01) → weights are updated at the right pace
  1. Momentum
    1. A parameter optimization technique that accelerates gradient descent by adding a fraction of the previous update to the current update
    1. Reduces oscillations and improves convergence
    1. $V_t = \beta V_{t-1} - \eta \nabla L(w_t)$
  1. Threshold
    1. Refers to a boundary value used to determine how a neuron behaves or how outputs are processed, particularly in binary classification or activation functions
    1. Decision making
    1. $y = \begin{cases} 1 & \text{if } \sum w_i x_i + b > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$
    1. Example: a NN predicts the probability that an email is spam with threshold 0.5; a predicted probability of 0.7 exceeds the threshold, so the email is classified as spam
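
A minimal perceptron tying these pieces together (weighted sum, threshold, learning-rate update); the AND-gate data is a classic toy example.

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=20):
    # Weights nudged by the error on each example; lr sets the step size.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0      # threshold at 0
            w += lr * (yi - pred) * xi             # update only on mistakes
            b += lr * (yi - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])     # toy AND-gate inputs
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([(1 if x @ w + b > 0 else 0) for x in X])    # [0, 0, 0, 1]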

Structure of Multi-layer backpropagation network, derivation of backpropagation algorithm


Backpropagation is a fundamental algorithm for training artificial neural networks. The goal of backpropagation is to reduce the difference between the model's predicted output and the actual output by adjusting weights and biases.

Structure of Multi-layer backpropagation network

It typically consists of three main components.

  1. Input layer
  1. Hidden layer:
    1. Intermediate layers that learn patterns from the input data
    1. Each neuron in a hidden layer is connected to every neuron in the previous and subsequent layers
    1. More hidden layers → greater network capacity ($\text{hidden layers} \propto \text{network capacity}$)
  1. Output layer
    1. apply a suitable activation function

Derivation of backpropagation

Linearly separable and linearly non-separable problems


Convolution*, CNN*, usage of CNN*, different layers in CNN


Convolution

Convolution is a mathematical operation used in signal and image processing to extract features by applying a filter (or kernel) over an input. In the context of neural networks, this operation forms the foundation of Convolutional Neural Networks (CNNs), which are specialized for processing structured grid-like data such as images.

CNN

A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture commonly used in Computer Vision. CNNs are particularly effective for tasks like image classification, object detection, and facial recognition due to their ability to capture spatial hierarchies and patterns. Convolutional Neural Network consists of multiple layers.

Uses of Convolutional Neural Networks

Image classification, object detection, facial recognition, medical image analysis, and other computer-vision tasks that benefit from spatial feature extraction.

Different Layers in CNN

  1. Input : Receives the raw input data, such as pixel values of an image, and passes it to subsequent layers for processing.
  1. Convolution: Applies convolution operations using filters to extract features like edges, textures, or patterns from the input data.
  1. Activation Layer: Introduces non-linearity (e.g., ReLU) to enable the network to learn complex patterns by transforming the convolved output.
  1. Pooling Layer: Reduces the spatial dimensions (e.g., max pooling) of the feature maps, retaining important information while decreasing computational load.
  1. Fully Connected Layer: Connects all neurons from the previous layer to produce high-level reasoning, consolidating features for final decision-making.
  1. Output Layer: Generates the final output, such as class probabilities or a classification label, based on the processed features
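
A naive NumPy sketch of the convolution layer's core operation (technically cross-correlation, as used in CNNs); the 4×4 "image" and edge-detecting kernel are toy values.

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and sum elementwise products.
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
edge_kernel = np.array([[1.0, -1.0]])              # horizontal edge detector
print(conv2d(image, edge_kernel))                  # 4x3 feature map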

RNN, uses of RNN, vanishing and exploding gradients problems, how solved?* Architectural difference between LSTM and GRU*


Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data by maintaining a memory of previous inputs through recurrent connections.

Uses of Recurrent Neural Networks

Language modeling and text generation, machine translation, speech recognition, and time-series prediction — tasks where the order of the sequence matters.

Vanishing and Exploding Gradients Problems

During backpropagation through time (BPTT) in RNNs, the same recurrent weights are multiplied at every time step, so two issues arise: gradients can shrink exponentially toward zero (vanishing gradients, which prevents learning long-range dependencies) or grow exponentially (exploding gradients, which destabilizes training).

Solutions to Vanishing and Exploding Gradients

Common remedies include gradient clipping (capping the gradient norm, mainly against exploding gradients), careful weight initialization, and gated architectures such as LSTM and GRU, whose gates preserve gradient flow over long sequences.

Architectural Differences Between LSTM and GRU

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are advanced recurrent architectures designed to mitigate the vanishing gradient problem. The key architectural difference: an LSTM has three gates (input, forget, output) and a separate cell state alongside the hidden state, while a GRU has two gates (update, reset) and merges the cell and hidden states, giving it fewer parameters and faster training.

Transfer learning*, usage*


Transfer learning is a machine learning technique where knowledge gained from solving one task is leveraged to improve a model's performance on a different but related task. Instead of training a model from scratch for every new problem, transfer learning uses a pre-trained model, typically trained on a large dataset for a source task, and adapts it (often by fine-tuning) for a target task, which may have less data or slightly different requirements.

Usage

Common uses include fine-tuning pretrained image models for new vision tasks (e.g., medical imaging with limited labeled data), and adapting pretrained language models to downstream NLP tasks such as classification or question answering.

Transformer model*, Attention mechanism in transformer architecture*, Transformer vs RNN*.


A transformer is a type of neural network architecture that has revolutionized the field of artificial intelligence, especially in natural language processing (NLP) and other sequential-data tasks. Introduced in the 2017 paper "Attention Is All You Need" for sequence-to-sequence tasks, transformers rely entirely on attention mechanisms, eliminating recurrent structures, and have become the foundation for many modern AI systems, including large language models and generative AI.

Attention mechanism

The attention mechanism lets the model weigh every position of the input sequence when encoding each token. The transformer uses scaled dot-product attention, $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $Q$, $K$, $V$ are the query, key, and value matrices; multi-head attention runs several such attention functions in parallel and concatenates the results.

Transformer vs RNN

Unlike RNNs, which process tokens one at a time and pass information through a recurrent hidden state, transformers process all tokens in parallel using self-attention. This makes training far more parallelizable and lets the model relate distant tokens directly, avoiding the vanishing-gradient issues RNNs face over long sequences.