Artificial Intelligence

Created by
BBorhan

Resources

  1. Atkia Apu’s Note
  1. Mahir Labib Dihan Sir’s Note
    1. https://drive.google.com/file/d/1Uf8A_hCBnC2F1-QIez4U-ckrPEbmkYJe/view
    1. https://drive.google.com/file/d/1Ek05PYorrHT_xsqdB71c86dyKVjj58Ty/view

Introduction to AI, Turing test, Agents


Artificial Intelligence (AI) is a branch of computer science and engineering that deals with intelligent behavior, learning, and adaptation in machines.

Views of AI fall into four categories:

  1. Thinking humanly
  1. Thinking rationally
  1. Acting humanly
  1. Acting rationally

Turing Test: A computer passes the test if a human interrogator, after posing some written questions, cannot tell whether the written responses come from a person or from a computer.

Capabilities required to pass the Turing Test

  1. Natural language processing to enable it to communicate successfully in English.
  1. Knowledge representation to store what it knows or hears.
  1. Automated reasoning to use the stored information to answer questions and draw new conclusions.
  1. Machine learning to adapt to new circumstances and to detect and extrapolate patterns.

Agents: An agent is anything that perceives its environment through sensors and acts upon it through actuators.

Structure

  • Sensors: Devices that collect environmental data
  • Percepts: Data received from sensors
  • Actuators: Mechanisms that allow the agent to act on the environment
  • Actions: Tasks performed by actuators

Types of Agents

| Agent Type | Description | Key Features | Limitations | Example |
| --- | --- | --- | --- | --- |
| Simple Reflex Agent | Acts based on current percepts, using condition-action rules ("If X, then Y"). | No memory of past percepts; simple to implement. | Limited intelligence; may enter deadlocks or loops. | Thermostat turning on/off based on temperature. |
| Model-Based Reflex Agent | Maintains an internal state to track unobserved aspects of the environment. | Uses knowledge of world evolution and action effects. | Requires modeling of environment dynamics. | Robot navigating a partially visible room. |
| Goal-Based Agent | Acts to achieve specific goals by evaluating action sequences. | Flexible; explicit knowledge representation. | Requires search/planning for goal achievement. | Navigation system planning a route. |
| Utility-Based Agent | Chooses actions based on a utility function measuring success; useful when there are multiple alternatives and the agent must choose the best action. | Balances conflicting goals and success likelihood. | Complex to compute utility for all states. | Robot prioritizing speed vs. safety. |
| Learning Agent | Learns from experience to improve performance. | Components: Learning Element, Performance Element, Critic, Problem Generator. | Needs initial knowledge; learning takes time. | Self-driving car adapting to traffic patterns. |

  1. Reflexive Agent: An agent whose action depends only on the current percepts.
  1. Model-based: An agent whose action is derived from an internal model of the current world state
    1. Partial observability
    1. Updating the internal state information as time goes by requires two kinds of knowledge to be encoded:
      1. We need some information about how the world evolves independently of the agent
      1. We need some information about how the agent’s own actions affect the world
  1. Goal-based agent: An agent that selects actions that it believes will achieve explicitly represented goals.
    1. Expansion of Model based agent
    1. Desirable situation
    1. Searching and planning
  1. Utility-Based Agent: An agent that selects actions that it believes will maximize the expected utility of the outcome state
    1. Utility function
    1. Deals with degrees of preference among states (happy vs. unhappy states)
  1. Learning Agent: An agent whose behavior improves over time based on its experience
    1. Learning Element: Responsible for making improvements.
    1. Performance Element : is what we have previously considered to be the entire agent: it takes in percepts and decides on actions
    1. Critic: The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future.
    1. Problem Generator: suggesting actions that will lead to new and informative experiences.
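
As a minimal illustration of the simple reflex agent idea above, here is a Python sketch of a thermostat-style condition-action rule; the class name and target temperature are invented for this example.

class SimpleReflexThermostat:
    """Condition-action rule agent: acts on the current percept only."""

    def __init__(self, target_temp=22.0):
        self.target_temp = target_temp   # desired temperature (assumed setting)

    def act(self, percept_temp):
        # "If temperature below target, then heat" -- no memory of the past.
        if percept_temp < self.target_temp:
            return "HEATER_ON"
        return "HEATER_OFF"

agent = SimpleReflexThermostat()
print(agent.act(18.5))   # HEATER_ON
print(agent.act(23.0))   # HEATER_OFF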

Different informed and uninformed search techniques


State space search: The state space is the set of all possible states, including the start and goal states, within which the solution to a particular problem is searched.

Data structure: the state space is represented by a graph.

Generalized state space search algorithm

A runnable Python version of the original pseudocode (L holds generated-but-unexpanded nodes; Sc is the node being expanded, Ss a successor, Sg the goal; the successor function is supplied by the caller):

from collections import deque

def state_space_search(s_initial, s_goal, successors):
    L = deque([s_initial])          # nodes generated but not yet expanded
    generated = {s_initial}         # avoids adding duplicates to L
    goal_found = False
    while L and not goal_found:     # GOAL_EXISTS is false once L is empty
        s_c = L.popleft()           # pick an unexpanded node Sc from L
        for s_s in successors(s_c):
            if s_s == s_goal:
                goal_found = True   # GOAL_FOUND
            elif s_s not in generated:
                generated.add(s_s)
                L.append(s_s)
        # Sc is now expanded and never re-enters L
    return goal_found               # False => goal does not exist

Search Tree:

Properties we use to evaluate an algorithm

  1. Completeness : Guaranteed to find a solution if one exists
  1. Optimal: if it always finds the best solution
  1. Time complexity: The amount of time an algorithm takes
  1. Space Complexity : The amount of memory an algorithm requires

Informed vs Uninformed

| Aspect | Uninformed | Informed |
| --- | --- | --- |
| Definition | Search algorithms that explore the problem space without any domain-specific knowledge or heuristics, relying on problem structure and predefined rules. Also known as blind search algorithms. | Search algorithms that use domain-specific heuristics or estimates of the cost to reach the goal. |
| Knowledge Used | Only problem structure | Heuristics estimate cost to goal |
| Time | Time-consuming | Quicker solutions |
| Cost | Costly | Less costly |
| Time and Space Complexity | Higher | Lower |
| Example | BFS, DFS | A*, Best-first search |

Uninformed Algorithm

  1. Breadth-First-Search Algorithm
    1. Explores all nodes at the current depth before moving to the next level
    1. Uses a queue FIFO
    1. Shortest path but more memory
  1. Depth-First-Search
    1. Explores as far as possible along each path before backtracking
    1. Uses a Stack LIFO
    1. Infinite loop, memory-efficient
    1. How it detects cycles
      1. When DFS is applied to a graph, finding an edge that points to an already visited vertex indicates a cycle
    1. DFS is not optimal
      1. Doesn’t consider path cost
      1. Many states keep recurring → no guarantee to find a solution
      1. May go to infinite loop
      1. Doesn’t find the shortest path always
  1. Depth Limited Search
    1. A variation of DFS that limits the depth of exploration to prevent infinite loops in large or infinite state spaces
    1. Useful when the goal depth is known but cannot find solution beyond the depth limit.
    1. Limitation
      1. Not optimal
      1. Incompleteness
      1. Effectiveness depends on the depth limit
  1. Iterative Deepening Depth First Search
    1. combines BFS and DFS altogether
    1. Adv
      1. Ensures completeness and optimality → BFS
      1. Memory Efficient → DFS
    1. Dis
      1. repeats all the work of the previous phase
  1. Uniform Cost Search
    1. Extends BFS by considering path costs, always expanding the least-cost node first.
    1. Finds the least-cost path, but is slower than BFS
  1. Bidirectional Search
    1. Runs two simultaneous searches, one from the initial state (forward search) and another from the goal (backward search)
    1. It replaces one search graph with two smaller search subgraphs
    1. The search stops when the two graphs intersect each other
    1. When it doesn’t work
      1. Implementing the backward search tree is difficult
      1. The goal state is unknown/unclear in advance
      1. Finding an efficient way to check if a match exists is tricky, which can increase the time.
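
To make the BFS/DFS contrast concrete, here is a minimal Python sketch over an adjacency-list graph; the `graph` dictionary and node names are invented for illustration.

from collections import deque

def bfs(graph, start, goal):
    # FIFO queue: explores all nodes at one depth before the next level.
    frontier, visited = deque([start]), {start}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            return True
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return False

def dfs(graph, start, goal):
    # LIFO stack: follows one path as deep as possible before backtracking.
    frontier, visited = [start], set()
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        if node in visited:      # an edge to a visited node signals a cycle
            continue
        visited.add(node)
        frontier.extend(graph.get(node, []))
    return False

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A", "D"), dfs(graph, "A", "D"))   # True True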

| Algorithm | Time Complexity | Space Complexity | Optimal | Complete |
| --- | --- | --- | --- | --- |
| BFS | $b^d$ | $b^d$ | Yes | Yes |
| DFS | $b^d$ | $bd$ | No | No |
| DLS | $b^l$ | $bl$ | No | Yes, if $l \geq d$ |
| IDDFS | $b^d$ | $bd$ | Yes | Yes |
| Uniform Cost | $b^d$ | $b^d$ | Yes | Yes |
| Bidirectional | $b^{d/2}$ | $b^{d/2}$ | Yes | Yes |

Informed Search, Heuristics*, How heuristics help? A* search, Proof of optimality of A*


https://iipseries.org/assets/docupload/rsl2024AF233C7BF02A178.pdf

  1. Best-First Search
    1. Greedy search: it picks the path that appears best at the moment.
    1. A blend of both DFS and BFS
    1. Choose the most promising node at each step
    1. It is implemented with a priority queue
    1. Advantage :
      1. It switches between BFS and DFS behavior, gaining the advantages of both algorithms
      1. More effective and capable than BFS or DFS alone
    1. Disadvantage
      1. Worst case: operates as an unguided DFS
      1. Get stuck in a loop
      1. Not optimal
      1. Not complete
    1. TC/SC: $O(b^m)$
    1. Evaluation function: $f(n) = h(n)$, where

    $h(n) = \text{estimated cost from node } n \text{ to the goal}$

    $g(n) = \text{cost from the start node to node } n \text{ (used later by A*)}$

  1. A* Algorithm
    1. Heuristic: A heuristic is a technique designed to solve a problem faster than classic methods, or to find an approximate solution when the classic methods fail to find an exact solution.
      1. How it helps in finding solution
        1. It is useful for reducing the time and resources required to find solutions by focusing the search on the most promising paths.
        1. It prioritizes which node to explore based on their estimated cost to the goal.
        1. It helps in state space search by guiding the exploration, reducing the search space and improving efficiency.
        1. It allows the algorithm to focus on exploring paths that are more likely to lead to a solution.
      1. Admissible Heuristic: A heuristic $h(n)$ is admissible if for every node $n$, $h(n) \leq h^*(n)$, where $h^*(n)$ is the true cost to reach the goal from $n$. An admissible heuristic never overestimates the cost to reach the goal; thus it is optimistic.
        1. If $h_2(n) \geq h_1(n)$ for all nodes $n$ and both are admissible, then $h_2$ dominates $h_1$, and $h_2$ is better for search.
    1. Evaluation function: $f(n) = g(n) + h(n)$
    1. Optimal, Complete
    1. TC : Exponential.
    1. $A^*$ is always optimal

      Suppose some suboptimal goal $G_2$ has been generated and is in the fringe. Let $n$ be an unexpanded node in the fringe such that $n$ is on a shortest path to an optimal goal $G$.

      $f(G_2) = g(G_2)$ since $h(G_2) = 0$
      $f(G) = g(G)$ since $h(G) = 0$
      $g(G_2) > g(G)$, since $G_2$ is suboptimal
      $\Rightarrow f(G_2) > f(G)$

      By admissibility, $h(n) \leq h^*(n)$, so
      $g(n) + h(n) \leq g(n) + h^*(n)$, i.e.
      $f(n) \leq f(G) < f(G_2)$

      Hence A* will never select $G_2$ for expansion.

      A heuristic is consistent if for every node $n$ and every successor $n'$ of $n$ generated by any action $a$:

      $h(n) \leq c(n, a, n') + h(n')$

      If $h$ is consistent,

      $f(n') = g(n') + h(n') = g(n) + c(n, a, n') + h(n') \geq g(n) + h(n) = f(n)$

      so $f(n)$ is non-decreasing along any path. Therefore, A* using graph search is optimal.

    1. Prove that the uniform-cost search is a special case of A* search.

    A* search uses the evaluation function:

    $f(n) = g(n) + h(n)$

    where:

    • $g(n)$ = cost from the start node to node $n$
    • $h(n)$ = heuristic estimate of the cost from $n$ to the goal

    Uniform-Cost Search (UCS) does not use a heuristic, so:

    $h(n) = 0$

    Therefore, for UCS, the evaluation function becomes:

    $f(n) = g(n) + 0 = g(n)$

    This means UCS always expands the node with the lowest path cost $g(n)$, exactly like A* with a zero heuristic.

    Hence, Uniform-Cost Search is a special case of A* Search where the heuristic function $h(n)$ is zero for all nodes.

    Prove that, if the heuristic function $h$ never overestimates by more than a cost $c$, then A* using $h$ returns a solution whose cost exceeds that of the optimal solution by no more than $c$.

    Suppose $h(n) \leq h^*(n) + c$ as given, and let $G_2$ be a goal that is suboptimal by more than $c$, i.e. $f(G_2) = g(G_2) > C^* + c$. Now consider any node $n$ on a path to an optimal goal. We have

    $f(n) = g(n) + h(n) \leq g(n) + h^*(n) + c = C^* + c < f(G_2)$

    so $G_2$ will never be expanded before an optimal node is expanded, because $f(n) < f(G_2)$.
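
A compact Python sketch of A* under these definitions; the `graph` (node → list of (neighbor, step cost)) and the heuristic table `h` are toy assumptions, with `h` chosen to be admissible.

import heapq

def a_star(graph, h, start, goal):
    # Expands the node with the lowest f(n) = g(n) + h(n).
    frontier = [(h[start], 0, start)]          # (f, g, node)
    best_g = {start: 0}
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g                            # cost of the path found
        for nbr, cost in graph.get(node, []):
            g2 = g + cost
            if g2 < best_g.get(nbr, float("inf")):
                best_g[nbr] = g2
                heapq.heappush(frontier, (g2 + h[nbr], g2, nbr))
    return None

graph = {"S": [("A", 1), ("B", 4)], "A": [("G", 5)], "B": [("G", 1)], "G": []}
h = {"S": 4, "A": 5, "B": 1, "G": 0}            # admissible estimates (assumed)
print(a_star(graph, h, "S", "G"))               # 5, via S -> B -> G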

Local Search Algorithm: Local search algorithms are essential tools in artificial intelligence and optimization, employed to find high-quality solutions in large and complex problem spaces.

  1. Hill-Climbing Search:
    1. It is a straightforward local search algorithm that iteratively moves towards better solutions.
    1. Process : Start → Evaluate → Move → Repeat
    1. Pros: Easy to implement, works well in small search space
    1. Cons
      1. Local Maxima: Hill-climbing can get stuck at a local maximum.
      1. Plateaus: On a flat region where neighboring states have the same heuristic value, hill-climbing may wander aimlessly, slowing progress or failing to find the goal.
      1. Ridges: In state spaces with ridges (narrow paths of improvement), hill-climbing may oscillate between states without advancing toward the goal.
      1. No Backtracking: Hill-climbing only considers the current state and its neighbors, without backtracking to explore alternative paths, missing better solutions elsewhere.
    1. Solutions (see the sketch after this list)
      1. Introduce gradient descent search, a variation of hill climbing that moves downhill.
      1. Introduce a small random jump to escape a plateau.
      1. Use stochastic hill climbing, where steps are probabilistic, to help navigate ridges.
  1. Simulated Annealing
  1. Local Beam Search
  1. Genetic Algorithm
  1. Tabu Search
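
The hill-climbing sketch referenced above, in Python; the integer objective and neighbor function are toy assumptions chosen for illustration.

import random

def hill_climb(start, neighbors, score, max_steps=1000):
    # Greedy hill climbing: move to the best neighbor while it improves
    # the score; stops at a local maximum or on a plateau.
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current            # local maximum / plateau reached
        current = best
    return current

score = lambda x: -(x - 3) ** 2       # toy objective: maximum at x = 3
neighbors = lambda x: [x - 1, x + 1]
print(hill_climb(random.randint(-10, 10), neighbors, score))   # 3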

2 player zero sum games, mini-max algorithm, alpha-beta pruning


https://ocw.mit.edu/courses/15-053-optimization-methods-in-management-science-spring-2013/2e66a9d9a74dc5c11b620f70663400da_MIT15_053S13_tut08.pdf

https://www.arsdcollege.ac.in/wp-content/uploads/2020/03/Artificial_Intelligence-week3.pdf

The 2-person 0-sum game is a basic model in game theory. There are two players, each with an associated set of strategies. While one player aims to maximize his payoff, the other attempts to take an action that minimizes this payoff. The gain of one player is the loss of the other.

Mini-Max Algorithm

Mini-max algorithm is a recursive or backtracking algorithm which is used in decision making and game theory.

The game is modeled as a game tree

  1. Node → Game State
  1. Edges → Legal Moves
  1. Root → Current Position
  1. Leaves → Game outcomes with utility values (e.g., +1 for a MAX win, −1 for a MIN win)

Process

  1. Start at the root (MAX's turn)
  1. Recursively explore all possible moves down to terminal nodes or a fixed depth
  1. At leaf nodes, assign values using a utility function, or a heuristic evaluation function for non-terminal states
  1. Backpropagate values
    1. At MAX nodes → select the child with the highest value
    1. At MIN nodes → select the child with the lowest value
  1. At the root, choose the move that yields the highest value, ensuring the best outcome against MIN's optimal play.

Analysis: $\text{TC} = O(b^d)$, $\text{SC} = O(bd)$

Alpha-beta Pruning
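
Alpha-beta pruning improves minimax by skipping branches that cannot affect the final decision: alpha tracks the best value found so far for MAX, beta the best for MIN, and a subtree is pruned as soon as alpha ≥ beta. Below is a minimal Python sketch over a toy game tree; the nested-list encoding (inner lists are internal nodes, integers are leaf utilities) is chosen just for this illustration.

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, int):          # terminal: return its utility value
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:          # MIN will never allow this branch
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:              # MAX already has a better option
            break
    return value

tree = [[3, 5], [6, [9, 1]], [1, 2]]   # MAX -> MIN -> leaves (toy tree)
print(alphabeta(tree, True))            # 6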

Genetic algorithm, steps of genetic algorithm. (using MAXONE problem, see slides), Different Crossover and Selection techniques, GA for solving optimization problems


https://egyankosh.ac.in/bitstream/123456789/12697/1/Unit-11.pdf

http://www.cs.cmu.edu/~02317/slides/lec_8.pdf

Genetic Algorithm: A genetic algorithm is a search heuristic inspired by Charles Darwin's theory of natural evolution. It is used to solve optimization problems by mimicking the process of natural selection, where the fittest individuals are selected for reproduction to produce the offspring of the next generation.

Application of GA: Optimization, Automatic Programming, Machine and robot learning, Economic models, Ecological models, Population genetic models and Models of social systems.

Steps

  1. Initialization: Start with randomly generated population
  1. Evaluation: Evaluate each individual using a fitness function
  1. Selection: Select the individuals for reproduction using different techniques
    1. Types of selection techniques
      1. Roulette Wheel Selection: Conceptually, this can be represented as a game of roulette - each individual gets a slice of the wheel, but more fit ones get larger slices than less fit ones.

        $P_i = \frac{F_i}{\sum_j F_j}$

      1. Rank-based selection: the probability of an individual being selected for reproduction or survival is determined by its rank within the population, not its raw fitness score.
      1. Elitist Selection: Choose only the most fit members of each generation.
      1. Cutoff Selection: Select only those that are above a certain cutoff for the target function.
      1. Scaling Selection: Rescale raw fitness values as the population's average fitness rises, to maintain selection pressure.
  1. Crossover: Combine pair to produce offspring
    1. Types
      1. Single Point: Randomly select a single point for a crossover
      1. Two point crossover: Avoids cases where genes at the beginning and end of a chromosome are always split
      1. Uniform
        1. A random subset of bit positions is chosen
        1. Bits in the subset are taken from parent 1 and the remaining bits from parent 2

  1. Mutation: Random changes to some individuals to maintain diversity
    1. Mutation prevents the algorithm from being trapped in a local minimum
  1. Termination: Repeat the process until the termination criterion is met

The basic algorithm

  1. [Start] Generate a random population of n chromosomes
  1. [Fitness] Evaluate the fitness $f(x)$ of each chromosome x in the population
  1. [New Population] Create a new population by repeating following steps until the New Population is complete
    1. [Selection] Select two parent chromosomes from a population according to their fitness value
    1. [Crossover] With a crossover probability, cross over the parents to form a new offspring.
    1. [Mutation] With a mutation probability, mutate new offspring at each locus.
    1. [Accepting] Place new offspring in the new population
  1. [Replace] Use the newly generated population for a further run of the algorithm
  1. [Test] If the condition is satisfied, stop and return the best solution in current population.
  1. [Loop] Go to Step 2 for fitness evaluation.

MaxOne problem: The MaxOne problem is to find a binary string of length $l$ that contains the maximum number of ones. The optimal solution is a string of all 1s.
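
A minimal GA for MaxOne tying the steps together; the string length, population size, generation count, and mutation rate are assumed values for this sketch.

import random

L, POP, GENS, P_MUT = 10, 20, 50, 0.02   # assumed parameters

def fitness(bits):                        # MaxOne: count of 1s
    return sum(bits)

def roulette_pick(pop):
    # Roulette-wheel selection: probability proportional to fitness.
    total = sum(fitness(ind) for ind in pop) or 1
    r, acc = random.uniform(0, total), 0
    for ind in pop:
        acc += fitness(ind)
        if acc >= r:
            return ind
    return pop[-1]

def crossover(p1, p2):                    # single-point crossover
    cut = random.randrange(1, L)
    return p1[:cut] + p2[cut:]

def mutate(bits):                         # bit-flip mutation
    return [b ^ 1 if random.random() < P_MUT else b for b in bits]

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(roulette_pick(pop), roulette_pick(pop)))
           for _ in range(POP)]
best = max(pop, key=fitness)
print(best, fitness(best))                # tends toward all 1s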

Bayes' rule, Belief update*, Naive Bayes Classifier, Formulation, Dealing with sparse data, Usage in document classification*, Gaussian Naive Bayes


Bayes' Theorem: Bayes' theorem is a fundamental principle in probability theory that allows for the computation of the conditional probability of a hypothesis H given observed evidence E.

Derivation:

$P(A|B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B|A) = \frac{P(A \cap B)}{P(A)}$

$P(A|B)\,P(B) = P(B|A)\,P(A)$

$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

$P(H|E) = \frac{P(E|H)\,P(H)}{P(E)} = \frac{P(E|H)\,P(H)}{P(E|H)\,P(H) + P(E|\neg H)\,P(\neg H)}$
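
A quick worked example with hypothetical numbers: suppose a disease has prior $P(H) = 0.01$, a test detects it with $P(E|H) = 0.9$, and false-alarms with $P(E|\neg H) = 0.05$. Then

$P(H|E) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} = \frac{0.009}{0.0585} \approx 0.154$

so even a positive test leaves only about a 15% posterior probability of disease, because the prior is so low.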

Bayesian Network: A Bayesian Network (BN) is a powerful probabilistic graphical model used for decision making under uncertainty.

Principle:

  1. Structure: It consists of a DAG and CPTs (Conditional Probability Tables)
    1. Node : Random Variable
    1. Edge : Conditional Dependency
    1. No Edge : Conditional Independence.

    Each node is associated with a CPT that quantifies the probability of the node given its parents: $P(X_i \mid \mathrm{Pa}(X_i))$.

    The network encodes the joint probability distribution of all variables as $P(X_1, \ldots, X_n) = \prod_i P(X_i \mid \mathrm{Pa}(X_i))$

  1. Probabilistic Inference: BNs update the probabilities of unobserved variables using Bayes' theorem, facilitating reasoning under uncertainty. For example, the network can be used to update knowledge of the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference.
  1. Belief Update: A Bayesian update or belief update is a change in probabilistic beliefs after gaining new knowledge. For example, after observing a patient's test result, we might revise our probability that the patient has a certain disease. If this belief revision obeys Bayes' rule, then it is called Bayesian. When evidence is observed, the prior probabilities are updated to posterior probabilities.

Application: Medical diagnosis, decision support, prediction and forecasting, anomaly detection.

Naive Bayes Classifier: Naive Bayes is a classification algorithm that uses probability to predict which category a data point belongs to, assuming that all features are independent of one another given the class.

Formulation

Bayes' theorem: $P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}$

C : Class, X : Feature

Naive assumption: All features are conditionally independent given the class.

$P(X|C) = \prod_{i=1}^{n} P(x_i|C)$

Hence, final classification becomes,

$\hat{C} = \arg\max_C P(C) \prod_{i=1}^{n} P(x_i|C)$

Dealing with Sparse Data (Zero-Probability Problem)
In real datasets, some feature-class combinations may not appear in the training data, resulting in a zero probability that can nullify the whole product.

To handle this, we use Laplace Smoothing (Additive Smoothing):

$P(x_i|C) = \frac{\mathrm{count}(x_i, C) + \alpha}{\sum_j \mathrm{count}(x_j, C) + \alpha\,|V|}$, where $|V|$ is the number of distinct feature values and $\alpha$ the smoothing constant.

Usage in document classification

Spam filtering, sentiment analysis, topic classification
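
A toy multinomial Naive Bayes with Laplace smoothing for document classification; the two training documents and the (words, label) format are invented for illustration.

from collections import Counter, defaultdict
import math

def train_nb(docs, alpha=1.0):
    # docs: list of (list_of_words, label) pairs (toy format)
    class_counts, word_counts = Counter(), defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, alpha

def predict(model, words):
    class_counts, word_counts, vocab, alpha = model
    n_docs, best = sum(class_counts.values()), None
    for c in class_counts:
        # log P(C) + sum_i log P(x_i | C); smoothing keeps unseen words
        # from zeroing out the whole product.
        total = sum(word_counts[c].values())
        logp = math.log(class_counts[c] / n_docs)
        for w in words:
            logp += math.log((word_counts[c][w] + alpha) /
                             (total + alpha * len(vocab)))
        best = max(best, (logp, c)) if best else (logp, c)
    return best[1]

docs = [(["win", "money", "now"], "spam"), (["meeting", "at", "noon"], "ham")]
model = train_nb(docs)
print(predict(model, ["win", "now"]))    # spam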

Definitions

  1. Mass function: The mass function, denoted $m$, assigns a value in $[0,1]$ to every subset of the frame of discernment.
  1. Belief (Bel): The belief function $Bel(A)$ for a subset A is the sum of the mass probabilities of all subsets of A.

    $Bel(A) = \sum_{B \subseteq A} m(B)$

  1. Plausibility (Pls): The plausibility function $Pls(A)$ represents the maximum possible belief in A.

    $Pls(A) = 1 - Bel(\neg A)$

  1. Belief Interval: The belief interval for a subset A is the range $[Bel(A), Pls(A)]$, which expresses the range of belief in A.

Gaussian Naive Bayes Algorithm

Gaussian Naive Bayes is a variant of Naive Bayes for continuous attributes; it assumes the feature values follow a Gaussian distribution within each class.

  1. Calculate the mean and variance of each feature per class:

    $\mu = \frac{\sum x_i}{n}$

    $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n-1} \quad \text{[sample variance]}$

    $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n} \quad \text{[population variance]}$

  1. Likelihood: $P(x_i \mid Y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$
  1. Posterior: $P(Y \mid X) = \frac{P(Y) \prod_i P(x_i \mid Y)}{\text{evidence}}$
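
A one-feature Gaussian Naive Bayes sketch following these formulas; the class data values are invented.

import math

def gaussian_pdf(x, mu, var):
    # P(x | Y): normal density with class-conditional mean and variance.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)  # sample variance
    return mu, var

data = {"A": [1.0, 1.2, 0.8], "B": [3.0, 3.4, 2.6]}   # toy 1-feature classes
total = sum(len(v) for v in data.values())
params = {c: fit(v) for c, v in data.items()}
priors = {c: len(v) / total for c, v in data.items()}

x = 1.1
scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in data}
print(max(scores, key=scores.get))    # A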

Decision Tree, Information Gain & Gini Index.


https://downloads.ctfassets.net/kdr3qnns3kvk/6nDiFgv0LRFz3ocCMvZGMR/c8b9acb313cae4f7ccc20a61058dbb80/Week5-DecisionTrees.pdf

A decision tree is a machine learning model that uses a tree-like structure to make decisions based on a sequence of questions and conditions.

Entropy is a measure of impurity or disorder in a dataset:

$Ent(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$

Information Gain measures the reduction in entropy after a dataset is split on an attribute:

$IG(S, A) = Ent(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Ent(S_v)$


Gini Index:

$Gini(S) = 1 - \sum_i p_i^2$
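
A short Python sketch computing entropy, Gini, and information gain directly from label lists; the 9-yes/5-no labels and the split below are toy values.

import math
from collections import Counter

def entropy(labels):
    # Ent(S) = -sum p_i log2 p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini = 1 - sum p_i^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, splits):
    # IG = Ent(S) - sum (|S_v| / |S|) * Ent(S_v)
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

labels = ["yes"] * 9 + ["no"] * 5
print(round(entropy(labels), 3))   # 0.940
print(round(gini(labels), 3))      # 0.459
print(round(information_gain(labels,
      [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]), 3))   # 0.048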

Machine Learning, Steps of ML, Hyperparameter tuning, why needed?


https://www.cs.cmu.edu/~hn1/documents/machine-learning/notes.pdf

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

| Aspect | Traditional Programming | Machine Learning |
| --- | --- | --- |
| Programming | Rules and logic are manually coded by the programmer. | Model learns patterns automatically from data. |
| Input | Data and explicit rules. | Data (often labeled for supervised learning). |
| Output | Deterministic output based on coded rules. | Predictions or decisions based on learned patterns. |
| Flexibility | Limited to predefined rules; changes require recoding. | Adapts to new data; retraining updates the model. |
| Use Cases | Well-defined, rule-based tasks (e.g., payroll, sorting). | Complex, pattern-based tasks (e.g., image recognition, predictions). |
| Examples (Lecture 1) | Calculating payroll | Speech recognition, personalized … |

Steps of ML

  1. Project Setup: This is the first step, to plan and set up the environment for a machine learning project.
    1. Understand Business Goal
      1. Having conversation with stakeholders
    1. Choose the solution to the problem
      1. Which category of the models derive the highest impact
  1. Data Preparation:
    1. Collection of data
      1. Be clear about the goal and objectives → identify which data are vital for model tuning
      1. Collecting related data from various sources according to project requirement
    1. Cleaning of Data
      1. Identify and handle missing values, inconsistencies, removing duplicates
    1. Data transformation
      1. Convert cleaned data into a format suitable for machine learning
      1. modifying or converting data, feature scaling, feature encoding
    1. Data Reduction
      1. Simplifying data without losing the essence
    1. Data Splitting
      1. Split data into different sets to ensure more reliable and actionable evaluation
      1. Splitting of data
        1. Training set:
          1. actual dataset from which a model trains
          1. model sees and learns from this data
          1. 60% of total data
        1. Validation set
          1. Used to tune hyperparameters
          1. Model sees this data for evaluation but doesn’t learn from this
          1. 15% of total data
        1. Testing set
          1. Evaluates the model after training is complete
          1. Provides unbiased evaluation of the models
          1. 20-25% of data
    1. Importance of data preparation
      1. Provide reliable prediction outcomes
      1. Identify data issues or error
      1. Increase decision making capability
      1. Reduce cost
      1. Increase model performance
  1. Deployment:
    1. Deploy the model:
      1. integrating it into a production environment where it can process real world data
      1. MLOPS
    1. Monitor Model performance
      1. Regularly test the performance of the model as new data arrives
    1. Improve Model
      1. Continuously iterate and improve model
      1. Replace model with an updated version

| Feature | Random Split | Stratified Split | K-Fold Cross-Validation | Time-Based Split |
| --- | --- | --- | --- | --- |
| How it works | Randomly divides data | Keeps class proportions the same | Splits data into k parts and trains k times; in each iteration, k−1 folds are used for training and the remaining fold for validation | Splits data by time order: earlier data points form the training set, later data points form the test or validation set |
| When to use | Data has no order, balanced classes | Classification with imbalanced classes | When data is small and reliable results are wanted | Time-series or sequential data |
| Pros | Simple and fast | Keeps classes balanced | Uses all data for training and testing | Keeps time order, avoids future data leaking |
| Cons | May not keep class balance | Only for classification problems | Takes more time to run | Only for ordered data, not random data |
| Example | Splitting customer data randomly | Heart disease data with rare cases | Small dataset with 5 folds | Stock prices split by date |

Hyperparameter

Hyperparameter tuning is a fundamental process in machine learning that involves selecting the best set of hyperparameters to maximize a model's performance and generalization ability. Hyperparameters are predefined configuration settings, such as learning rate, batch size, or the number of layers, which are not learned from the data but significantly influence the training process and final model quality.

Some techniques

  1. Grid Search : exhaustively tests all combinations of specified hyperparameter values
  1. Random Search: which samples parameter combinations randomly to reduce computation cost

Why it is needed

  1. Improve model performance: capturing patterns without overfitting or underfitting
  1. Balance bias and variance trade off: controls complexity that affects the bias-variance trade off.
  1. Enhancing learning efficiency
  1. Optimizes resource usage

Supervised, Unsupervised and Reinforcement learning


| Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Definition | A model is trained on a labeled dataset with input-output pairs to predict outputs for new data. | A model identifies patterns or structures in unlabeled data without predefined outputs. | An agent learns by interacting with an environment, making decisions to maximize cumulative rewards. |
| Data Type | Labeled (input-output pairs) | Unlabeled (no predefined outputs) | No predefined input-output; uses rewards |
| Goal | Predict accurate outputs for new inputs | Find patterns or structures in data | Maximize cumulative reward |
| Feedback | Direct (correct labels) | None (inferred from data structure) | Delayed (reward signals from environment) |
| Examples | House price prediction, disease classification | Customer segmentation, data compression | Robot navigation, game playing |
| Algorithms | Linear regression, SVM, logistic regression | K-means, PCA | Q-learning, policy gradients |
| Human Involvement | Yes | No | Low |

Gradient Descent Algorithm using a linear regression example


Gradient Descent is an optimization algorithm used to find the optimal parameters of a machine learning model by iteratively adjusting the parameters to minimize a cost (loss) function.

In linear regression, the goal is to find the line that best fits a set of data points. The model can be written as

$y = wx + b$

where $y$ is the predicted output, $w$ the weight, $x$ the input, and $b$ the bias.

We measure how well the model fits the data using a cost function. A common cost function for linear regression is the mean squared error (MSE):

$J(w, b) = \frac{1}{2m} \sum (\hat y_i - y_i)^2$

where $\hat y_i = w x_i + b$ is the predicted value, $y_i$ the actual value, and $m$ the number of data points.

Algorithm

  1. Initialize $w = b = 0$
  1. Calculate the gradients
    1. $\frac{\partial J}{\partial w} = \frac{1}{2m} \sum 2(\hat y_i - y_i) \cdot x_i = \frac{1}{m} \sum (\hat y_i - y_i)\, x_i$
    1. $\frac{\partial J}{\partial b} = \frac{1}{m} \sum (\hat y_i - y_i)$
  1. Adjust $w$ and $b$ using the gradients:
    1. $w = w - \alpha \frac{\partial J}{\partial w}$
    1. $b = b - \alpha \frac{\partial J}{\partial b}$
  1. Iterate the process until it converges

Example: dataset $(x, y) \in \{(1,2), (2,3), (3,5)\}$

  1. Initialize $w = b = 0$
  1. Calculate gradients: for each iteration compute $\frac{\partial J}{\partial w}$ and $\frac{\partial J}{\partial b}$
  1. Learning rate $\alpha = 0.001$
  1. After multiple iterations, the algorithm finds values of $w$ and $b$ that minimize $J(w, b)$
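
The same example in Python; a slightly larger learning rate than the text's 0.001 is used here so the sketch converges in fewer iterations.

data = [(1, 2), (2, 3), (3, 5)]     # the toy points from the example above
w = b = 0.0
alpha = 0.01                        # learning rate (assumed for the sketch)
m = len(data)

for _ in range(5000):
    dw = sum((w * x + b - y) * x for x, y in data) / m   # dJ/dw
    db = sum((w * x + b - y) for x, y in data) / m       # dJ/db
    w -= alpha * dw
    b -= alpha * db

print(round(w, 3), round(b, 3))     # approaches the least-squares fit (1.5, 0.333)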

Regression and classification problems


| Aspect | Regression | Classification |
| --- | --- | --- |
| Definition | A type of supervised learning where the algorithm learns to predict continuous values based on input features. | A type of supervised learning where the algorithm learns to assign inputs to a specific category or class based on input features. |
| Used for | Predicting values | Predicting classes |
| Output Labels | Continuous values | Discrete classes |
| Algorithms | Linear, Polynomial, Decision tree | Naive Bayes, Decision tree, Logistic Regression |
| Evaluation Metrics | MSE, MAE, $R^2$ score | Accuracy, Precision, ROC-AUC |
| Example | Predicting house prices, forecasting sales, predicting temperature, stock prices | Email classification, disease diagnosis, image recognition, fraud detection |

Performance metrics*, AUC - ROC


https://www.tutorialspoint.com/machine_learning/machine_learning_performance_metrics.htm

Performance metrics in machine learning are used to evaluate the performance of a machine learning model.

Confusion Matrix: a table that summarizes classification results in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Performance Metrics for Regression Problems

  1. Mean Absolute Error (MAE):
    1. The average of the absolute differences between the predicted and actual values.
    1. $\text{MAE} = \frac{1}{n} \sum |Y - \hat Y|$
  1. Mean Square Error (MSE):
    1. The average of the squared differences between target and predicted values.
    1. $\text{MSE} = \frac{1}{n} \sum (Y - \hat Y)^2$
    1. Differentiable, hence easier to optimize
  1. Root Mean Square Error (RMSE):
    1. The square root of MSE
    1. $\text{RMSE} = \sqrt{\frac{1}{n} \sum (Y - \hat Y)^2}$
    1. Differentiable
    1. Reports the error on the same scale as the target values, undoing the inflation introduced by squaring in MSE
  1. $R^2$ Score:
    1. The R-squared metric is generally used for explanatory purposes and provides an indication of the goodness of fit of a set of predicted output values to the actual output values.
    1. $R^2 = 1 - \frac{\text{Sum of squared errors}}{\text{Total sum of squares}}$

Performance Metrics for Classification Problems

  1. Accuracy
    1. The ratio of correct predictions to total predictions
    1. $\text{Accuracy} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}} = \frac{\text{Correct predictions}}{\text{Total observations}}$
  1. Precision
    1. Proportion of true positive instances out of all predicted positive instances
    1. $\text{Precision} = \frac{TP}{TP + FP}$
    1. In document retrieval: the proportion of documents returned by the model that are actually relevant
    1. A precision of 1 → the model produced no false positives; everything it labeled positive was correct
    1. A low precision (< 0.5) → a high number of false positives
  1. Recall/Sensitivity
    1. Proportion of actual positives correctly identified
    1. $\text{Recall} = \frac{TP}{TP + FN}$
    1. A recall of 1 → the model didn't miss any true positives
    1. A low recall (< 0.5) → a high number of false negatives
  1. Specificity
    1. Proportion of actual negatives correctly identified
    1. $\text{Specificity} = \frac{TN}{TN + FP}$
  1. $F_1$ Score:
    1. Harmonic mean of precision and recall
    1. $F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

Precision-Recall Tradeoff

Precision and recall trade off against each other ($\text{Precision} \propto \frac{1}{\text{Recall}}$): improving one typically degrades the other, so a precision-recall curve is used to observe both together.
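
A small helper computing these metrics from confusion-matrix counts; the counts below are hypothetical.

def classification_metrics(tp, tn, fp, fn):
    # All five metrics defined above, from the four confusion-matrix cells.
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
print([round(x, 3) for x in classification_metrics(40, 45, 5, 10)])
# [0.85, 0.889, 0.8, 0.9, 0.842]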

AUC

ROC AUC : Receiver Operating Characteristic Area Under Curve

https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

AUC-ROC Curve

(Figures: ROC curves comparing a random classifier with classifiers of great AUC.)

Loss functions, L1, L2, Huber loss, Binary Cross Entropy loss


A loss function is used in ML to measure the difference between the predicted output of a model and the actual target. It is also known as a cost function or error function.

Loss function for regression

  1. $\text{MAE}/L_1$ Loss
    1. Average of the absolute differences between the actual and predicted values
    1. $\text{MAE} = \frac{1}{n} \sum |\hat y_i - y_i|$
  1. $\text{MSE}/L_2$ Loss
    1. Average of the squared differences between actual and predicted values
    1. $\text{MSE} = \frac{1}{n} \sum (\hat y_i - y_i)^2$
  1. Huber Loss
    1. Defined as a combination of the MSE and MAE loss functions
      1. MSE-like (quadratic) when the error is small
      1. MAE-like (linear) when the error is large
    1. The hyperparameter $\delta$ controls the transition point between the quadratic and linear regions
    1. $L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{for } |y - f(x)| \leq \delta \\ \delta\,|y - f(x)| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$

Loss functions for Classification

  1. Binary cross-entropy loss
    1. Binary cross-entropy (log loss) is a loss function used in binary classification problems
    1. It measures the performance of a classification model whose predicted output is a probability value between 0 and 1
    1. When the number of classes is 2, it is binary classification:
      1. $L = -\frac{1}{m} \sum \left[ y_i \log(\hat y_i) + (1 - y_i) \log(1 - \hat y_i) \right]$
    1. Cross-entropy for multiple classes (> 2):

      $L = -\frac{1}{m} \sum y_i \log \hat y_i$

  1. Hinge Loss
    1. Developed for support vector machine model evaluation

      $L = \max(0,\, 1 - y \cdot f(x))$
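
NumPy sketches of the losses above; the toy label and prediction arrays are invented.

import numpy as np

def l1_loss(y, yhat):                        # MAE
    return np.mean(np.abs(yhat - y))

def l2_loss(y, yhat):                        # MSE
    return np.mean((yhat - y) ** 2)

def huber_loss(y, yhat, delta=1.0):
    err = np.abs(yhat - y)
    quad = 0.5 * err ** 2                    # quadratic region (small errors)
    lin = delta * err - 0.5 * delta ** 2     # linear region (large errors)
    return np.mean(np.where(err <= delta, quad, lin))

def bce_loss(y, yhat, eps=1e-12):            # binary cross-entropy
    yhat = np.clip(yhat, eps, 1 - eps)       # avoid log(0)
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

y, yhat = np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.6])
print(round(float(bce_loss(y, yhat)), 3))    # ~0.28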

Overfitting and underfitting in terms of bias and variance


Bias: the gap between the model's predicted values and the actual values.

  • High bias: predictions are far from the actual values (underfitting)
  • Low bias: predictions are close to the actual values

Variance: how much the predictions scatter relative to one another.

  • Low variance: predictions are tightly clustered
  • High variance: predictions are widely scattered (overfitting)

Overfitting: Overfitting is an undesirable condition in ML that occurs when the model gives accurate predictions for training data but not for new data.

Reasons

  1. High Variance and Low Bias
  1. Model → too complex
  1. Training data size → Less
  1. Training data → irrelevant information, noise
  1. Trains for too long on the same sample data

Reductions

  1. Removing Feature
  1. Reduce Complexity
  1. Increase the size of training data
  1. Improve quality of training data
  1. Early stopping during training phase

Underfitting: The inability of the model to learn the training data effectively, resulting in poor performance on both training and testing data.

Reasons

  1. Too simple model
  1. Model has no capability to represent complexity in data
  1. Size of training data → less
  1. High Bias
  1. Training dataset → noise

Reductions

  1. Increase Complexity of model
  1. Increase the size of training data
  1. Increase number of features
  1. Increase the duration of training

Activation functions, linearity and differentiability of activation functions, ReLU activation functions, use of ReLU in deep learning


Activation function is a mathematical function used in artificial neural networks to determine the output of a neuron.

Common Activation functions

  1. Linear: $f(x) = x$, range $(-\infty, +\infty)$
  1. Sigmoid: $g(x) = \frac{1}{1+e^{-x}}$, range $(0, 1)$
  1. Tanh: $g(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, range $(-1, 1)$
  1. ReLU (Rectified Linear Unit): $g(x) = \max(0, x)$, range $[0, \infty)$
  1. Leaky ReLU: $g(x) = \begin{cases} ax & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$, range $(-\infty, \infty)$
  1. Swish: $g(x) = \frac{x}{1+e^{-x}}$
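
The functions above as NumPy one-liners, evaluated on a toy input.

import numpy as np

def sigmoid(x):    return 1 / (1 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0, x)
def leaky_relu(x, a=0.01): return np.where(x < 0, a * x, x)
def swish(x):      return x * sigmoid(x)

x = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, swish):
    print(f.__name__, f(x))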

Activation function should be non-linear

Activation function should be differentiable

  1. WHY? Differentiability is required by backpropagation: neural networks are trained using the backpropagation algorithm, which involves a forward and a backward pass
    1. The backward pass requires the chain rule of calculus to compute derivatives layer by layer
      1. So differentiability is a must
  1. Example: for a network with activation function $f$, the gradient of the loss with respect to a weight $w$ is
    $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w}$

    where $z = wx + b$ and $L$ is the loss function. If $f$ is not differentiable, the chain rule cannot propagate gradients through $f$.

ReLU activation function

ReLU is one of the most popular activation functions used in neural networks, especially in deep learning.

$f(x) = \max(0, x)$, range $[0, \infty)$

It has become the default choice.

Use of ReLU

  1. Non-linearity
  1. It has a constant gradient of 1 for $x > 0$, which helps mitigate the vanishing gradient problem
  1. Makes training deep networks computationally efficient and effective
  1. The gradient is constant and doesn't shrink, allowing effective backpropagation in deep networks
  1. Combines non-linearity with computational simplicity
  1. Its simple operation reduces computation

Basic structure of Artificial Neurons*
[Basic terminologies, Learning rate, momentum, threshold]


An artificial neuron is a mathematical model that simulates how biological neurons process information. It is the fundamental building block of artificial neural networks.

Artificial Neuron

Artificial Neural Network:

Basic components of Perceptron:

A perceptron is a type of ANN (Artificial Neural Network), and a fundamental concept in ML.

  1. Input Layer
    1. Consists of one or more input neurons
    1. Receives signals from the external world
  1. Weights:
    1. The strength of the connection between input and output neurons
  1. Bias
    1. Added to give the perceptron additional flexibility
  1. Activation function:
    1. A mathematical function used to determine the output of a neuron
  1. Output:
    1. A single binary value, either 0 or 1

Basic Terminologies

  1. Learning Rate:
    1. A critical hyperparameter in ML and NNs that determines the size of the steps taken during the optimization process to minimize the loss function.
    1. $W_{new} = W_{old} - \text{LR} \times G$, where $G$ is the gradient
    1. Controls how much the model updates its weights with respect to the gradient
    1. Small LR (e.g., 0.0001) → slow convergence
      Large LR (e.g., 1.0) → speeds up training, but the loss oscillates
      Moderate LR (e.g., 0.01) → weights are updated at the right pace
  1. Momentum
    1. A parameter optimization technique that accelerates gradient descent by adding a fraction of the previous update to the current update
    1. Reduces oscillations and improves convergence
    1. $V_t = \beta V_{t-1} - \eta \nabla L(w_t)$
  1. Threshold
    1. Refers to a boundary value used to determine how a neuron behaves or how outputs are processed, particularly in binary classification or activation functions
    1. Decision making
    1. $y = \begin{cases} 1 & \text{if } \sum w_i x_i + b > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$
    1. Example: a NN predicts the probability that an email is spam with threshold 0.5; a predicted probability of 0.7 exceeds the threshold, so the email is classified as spam
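
A minimal perceptron tying these pieces together (weighted sum, threshold, learning-rate update); the AND-gate data is a classic toy example.

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=20):
    # Weights nudged by the error on each example; lr sets the step size.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0      # threshold at 0
            w += lr * (yi - pred) * xi             # update only on mistakes
            b += lr * (yi - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])     # toy AND-gate inputs
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([(1 if x @ w + b > 0 else 0) for x in X])    # [0, 0, 0, 1]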

Structure of Multi-layer backpropagation network, derivation of backpropagation algorithm


Backpropagation is a fundamental algorithm for training artificial neural networks. The goal of backpropagation is to reduce the difference between the model's predicted output and the actual output by adjusting weights and biases.

Structure of Multi-layer backpropagation network

It typically consists of three main components.

  1. Input layer
  1. Hidden layer:
    1. Intermediate layers that learn patterns from the input data
    1. Each neuron in a hidden layer is connected to every neuron in the previous and subsequent layers
    1. More hidden layers → greater network capacity ($\text{hidden layers} \propto \text{network capacity}$)
  1. Output layer
    1. apply a suitable activation function

Derivation of backpropagation

Linearly separable and linearly non-separable problems


Convolution*, CNN*, usage of CNN*, different layers in CNN


Convolution

Convolution is a mathematical operation used in signal and image processing to extract features by applying a filter (or kernel) over an input. In the context of neural networks, this operation forms the foundation of Convolutional Neural Networks (CNNs), which are specialized for processing structured grid-like data such as images.

CNN

A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture commonly used in Computer Vision. CNNs are particularly effective for tasks like image classification, object detection, and facial recognition due to their ability to capture spatial hierarchies and patterns. Convolutional Neural Network consists of multiple layers.

Uses of Convolutional Neural Networks

Image classification, object detection, facial recognition, medical image analysis, and other computer-vision tasks that benefit from spatial feature extraction.

Different Layers in CNN

  1. Input : Receives the raw input data, such as pixel values of an image, and passes it to subsequent layers for processing.
  1. Convolution: Applies convolution operations using filters to extract features like edges, textures, or patterns from the input data.
  1. Activation Layer: Introduces non-linearity (e.g., ReLU) to enable the network to learn complex patterns by transforming the convolved output.
  1. Pooling Layer: Reduces the spatial dimensions (e.g., max pooling) of the feature maps, retaining important information while decreasing computational load.
  1. Fully Connected Layer: Connects all neurons from the previous layer to produce high-level reasoning, consolidating features for final decision-making.
  1. Output Layer: Generates the final output, such as class probabilities or a classification label, based on the processed features
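
A naive NumPy sketch of the convolution layer's core operation (technically cross-correlation, as used in CNNs); the 4×4 "image" and edge-detecting kernel are toy values.

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and sum elementwise products.
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
edge_kernel = np.array([[1.0, -1.0]])              # horizontal edge detector
print(conv2d(image, edge_kernel))                  # 4x3 feature map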

RNN, uses of RNN, vanishing and exploding gradients problems, how solved?* Architectural difference between LSTM and GRU*


Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data by maintaining a memory of previous inputs through recurrent connections.

Uses of Recurrent Neural Networks

Language modeling and text generation, machine translation, speech recognition, and time-series prediction — tasks where the order of the sequence matters.

Vanishing and Exploding Gradients Problems

During backpropagation through time (BPTT) in RNNs, the same recurrent weights are multiplied at every time step, so two issues arise: gradients can shrink exponentially toward zero (vanishing gradients, which prevents learning long-range dependencies) or grow exponentially (exploding gradients, which destabilizes training).

Solutions to Vanishing and Exploding Gradients

Common remedies include gradient clipping (capping the gradient norm, mainly against exploding gradients), careful weight initialization, and gated architectures such as LSTM and GRU, whose gates preserve gradient flow over long sequences.

Architectural Differences Between LSTM and GRU

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are advanced recurrent architectures designed to mitigate the vanishing gradient problem. The key architectural difference: an LSTM has three gates (input, forget, output) and a separate cell state alongside the hidden state, while a GRU has two gates (update, reset) and merges the cell and hidden states, giving it fewer parameters and faster training.

Transfer learning*, usage*


Transfer learning is a machine learning technique where knowledge gained from solving one task is leveraged to improve a model's performance on a different but related task. Instead of training a model from scratch for every new problem, transfer learning uses a pre-trained model, typically trained on a large dataset for a source task, and adapts it (often by fine-tuning) for a target task, which may have less data or slightly different requirements.

Usage

Common uses include fine-tuning pretrained image models for new vision tasks (e.g., medical imaging with limited labeled data), and adapting pretrained language models to downstream NLP tasks such as classification or question answering.

Transformer model*, Attention mechanism in transformer architecture*, Transformer vs RNN*.


A transformer is a type of neural network architecture that has revolutionized the field of artificial intelligence, especially in natural language processing (NLP) and other sequential-data tasks. Introduced in the 2017 paper "Attention Is All You Need" for sequence-to-sequence tasks, transformers rely entirely on attention mechanisms, eliminating recurrent structures, and have become the foundation for many modern AI systems, including large language models and generative AI.

Attention mechanism

The attention mechanism lets the model weigh every position of the input sequence when encoding each token. The transformer uses scaled dot-product attention, $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $Q$, $K$, $V$ are the query, key, and value matrices; multi-head attention runs several such attention functions in parallel and concatenates the results.

Transformer vs RNN

Unlike RNNs, which process tokens one at a time and pass information through a recurrent hidden state, transformers process all tokens in parallel using self-attention. This makes training far more parallelizable and lets the model relate distant tokens directly, avoiding the vanishing-gradient issues RNNs face over long sequences.