CE 7393: AI in CE

Lecture 1a

Welcome!

Welcome to CE 7393

Syllabus

Course Website

Course Site Link: https://subasish.quarto.pub/ce7393-fall24/

Course Canvas

Class Etiquette

  1. Please be respectful, especially to other students.

  2. Please be present. Attendance will not be taken, but you are encouraged to come and learn together.

  3. Please restrict the use of electronic devices to course-related material; other content could be distracting.

  4. Please be forgiving; instructors are people too, we will make mistakes.

  5. Please be considerate; if you feel extremely drowsy or unwell, it’s better to step out for a moment to refresh yourself than to risk distracting your peers.

Intelligence et al. 

Source: https://www.simplilearn.com/ice9/free_resources_article_thumb/AIvsML.png

History of AI

Another Look

GPT and after

Statistics and AI (Two Cultures)

Complex Problems

  • It is very hard to write programs that solve problems like recognizing a three-dimensional object in complex situations.

    • We can’t write the code because we don’t know how it’s done in our brain.

    • Even if we figured it out, the code might be very complex.

  • It is hard to write a program to compute the probability that a credit card transaction is fraudulent.

    • There is no simple and reliable rule. We need to combine a very large number of weak rules.

    • Fraud is a moving target. The program needs to keep changing.

Machine Learning Approach

  • Instead of writing a program for each specific task, we collect lots of examples that specify the correct output for a given input.

  • A machine learning algorithm then takes these examples and produces a general program.

    • If we do it right, the program works for new cases as well as the ones we trained it on.

    • If the data changes, the program can change too by training on the new data.

  • Massive amounts of computation are now cheaper than paying someone to write a task-specific program.

Some examples of tasks best solved by learning

  • Recognizing patterns

    • Identify vulnerable roadway users (VRUs)
    • Facial identities or facial expressions
    • Pedestrian crash typing
  • Recognizing anomalies

    • U-turn movements at certain intersections
    • Unusual patterns of sensor readings in a nuclear power plant
  • Prediction

    • Crash severity types
    • How many crashes will occur on that road in year 2026?

A standard example of machine learning

  • The Modified National Institute of Standards and Technology (MNIST) database of hand-written digits is the machine learning equivalent of fruit flies.

    • They are publicly available and we can learn them quite fast in a moderate-sized neural net.
    • We know a huge amount about how well various machine learning methods do on MNIST.

MNIST Data

A typical neuron

  • Gross physical structure
    • There is one axon that branches
    • There is a dendritic tree that collects input from other neurons.
  • Axons typically contact dendritic trees at synapses
    • A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.
  • Spike generation
    • There is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at synapses to depolarize the cell membrane.

Linear neurons

  • These are simple but computationally limited
    • If we can make them learn we may get insight into more complicated neurons.
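
A minimal sketch of a linear neuron, assuming the standard form y = b + Σ x_i w_i (bias plus weighted sum of the inputs). The function name and the numbers are illustrative assumptions, not part of the slides.

```python
import numpy as np

def linear_neuron(x, w, b):
    """Return the bias plus the weighted sum of the inputs."""
    return b + np.dot(w, x)

x = np.array([1.0, 2.0, 3.0])    # illustrative inputs
w = np.array([0.5, -1.0, 2.0])   # illustrative weights
b = 0.5                          # illustrative bias

y = linear_neuron(x, w, b)
print("Linear neuron output:", y)
```

The output is unbounded and changes proportionally with each input, which is what makes this unit simple but computationally limited.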

Binary threshold neurons

  • McCulloch-Pitts (1943)
    • First compute a weighted sum of the inputs.
    • Then send out a fixed size spike of activity if the weighted sum exceeds a threshold.
    • McCulloch and Pitts thought that each spike is like the truth value of a proposition and each neuron combines truth values to compute the truth value of another proposition!

Binary threshold neurons

  • There are two equivalent ways to write the equations for a binary threshold neuron.
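
A hedged sketch of the two equivalent forms, assuming the usual formulation: either compare the weighted sum z = Σ x_i w_i against a threshold θ, or fold the threshold into a bias b = -θ and compare z = b + Σ x_i w_i against zero. The AND-gate weights below are an illustrative assumption.

```python
import numpy as np

def threshold_form(x, w, theta):
    """Fire (1) if the weighted sum reaches the threshold theta."""
    z = np.dot(w, x)
    return 1 if z >= theta else 0

def bias_form(x, w, b):
    """Fire (1) if the bias plus the weighted sum is non-negative."""
    z = b + np.dot(w, x)
    return 1 if z >= 0 else 0

# Illustrative weights that make the neuron behave like a logical AND.
w = np.array([1.0, 1.0])
theta = 1.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

outs_theta = [threshold_form(np.array(p), w, theta) for p in inputs]
outs_bias = [bias_form(np.array(p), w, -theta) for p in inputs]  # b = -theta

print(outs_theta)
print(outs_bias)
```

Both forms give the same outputs for every input, which is why the threshold can be treated as just another learnable parameter (the bias).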

Rectified Linear Neurons

You have heard of ReLU

  • They compute a linear weighted sum of their inputs.
  • The output is a non-linear function of the total input.
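
A minimal sketch of a rectified linear neuron, assuming the standard definition: z = b + Σ x_i w_i, and the output is z when z is positive and 0 otherwise. The names and numbers are illustrative.

```python
import numpy as np

def relu_neuron(x, w, b):
    """Linear weighted sum followed by rectification (negative totals become 0)."""
    z = b + np.dot(w, x)
    return max(z, 0.0)

x = np.array([1.0, 2.0])
w = np.array([1.0, -1.0])

y_off = relu_neuron(x, w, b=0.0)  # total input is -1.0, so the unit is off
y_on = relu_neuron(x, w, b=3.0)   # total input is 2.0, so the unit passes it through

print(y_off, y_on)
```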

Sigmoid neurons


  • These give a real-valued output that is a smooth and bounded function of their total input.
    • Typically they use the logistic function
    • They have nice derivatives which make learning easy.
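
A short sketch of the logistic function y = 1 / (1 + e^(-z)) and its derivative dy/dz = y(1 - y), with a numerical check of that derivative. The test point z = 0.3 is an arbitrary illustrative choice.

```python
import numpy as np

def logistic(z):
    """Smooth, bounded output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic, expressed through its own output."""
    y = logistic(z)
    return y * (1.0 - y)

z = 0.3
eps = 1e-6
# Central finite difference as an independent check of the analytic derivative.
numeric_grad = (logistic(z + eps) - logistic(z - eps)) / (2 * eps)

print("analytic:", logistic_grad(z))
print("numeric :", numeric_grad)
```

Because the derivative is a simple function of the output itself, it is cheap to compute during learning.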


Stochastic binary neurons


  • These use the same equations as logistic units.
    • But they treat the output of the logistic as the probability of producing a spike in a short time window.
  • We can do a similar trick for rectified linear units:
    • The output is treated as the Poisson rate for spikes.
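
A hedged sketch of both ideas: the logistic output is used as a spike probability, and the rectified linear output is used as a Poisson rate. The weights, inputs, seed, and sample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable

x = np.array([1.0, 0.0])
w = np.array([1.0, 2.0])
b = 0.0
z = b + np.dot(w, x)            # total input, here 1.0

# Stochastic binary neuron: the logistic output is the probability of a spike.
p = 1.0 / (1.0 + np.exp(-z))
spikes = rng.random(10_000) < p
spike_fraction = spikes.mean()

# Rectified linear analogue: the output is the Poisson rate for spike counts.
rate = max(z, 0.0)
counts = rng.poisson(rate, size=10_000)
mean_count = counts.mean()

print("spike probability:", p, "observed fraction:", spike_fraction)
print("Poisson rate     :", rate, "observed mean   :", mean_count)
```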


A very simple way to recognize handwritten shapes

  • Consider a neural network with two layers of neurons.
    • neurons in the top layer represent known shapes.
    • neurons in the bottom layer represent pixel intensities.
  • A pixel gets to vote if it has ink on it.
    • Each inked pixel can vote for several different shapes.
  • The shape that gets the most votes wins.
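
A toy sketch of the voting scheme, assuming 3x3 binary images and hand-set 0/1 weights (templates) for two hypothetical shapes, "O" and "I". Each inked pixel adds its weight to every shape it supports, and the shape with the most votes wins.

```python
import numpy as np

shape_names = ["O", "I"]

# One row of weights per shape; each row is a flattened 3x3 template.
weights = np.array([
    [1, 1, 1,
     1, 0, 1,
     1, 1, 1],   # "O": ink expected on the border
    [0, 1, 0,
     0, 1, 0,
     0, 1, 0],   # "I": ink expected in the middle column
])

# Input image: 1 means the pixel has ink on it.
image = np.array([0, 1, 0,
                  0, 1, 0,
                  0, 1, 0])

votes = weights @ image            # inked pixels vote for each shape
winner = shape_names[int(np.argmax(votes))]

print("votes:", dict(zip(shape_names, votes.tolist())))
print("winner:", winner)
```

Note that two of the inked pixels also vote for "O", which illustrates that a single pixel can support several shapes at once.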

Display the weights

  • Give each output unit its own “map” of the input image and display the weight coming from each pixel in the location of that pixel in the map.

  • Use a black or white blob with the area representing the magnitude of the weight and the color representing the sign.

Types of learning task

  • Supervised learning
    • Learn to predict an output when given an input vector.
  • Reinforcement learning
    • Learn to select an action to maximize payoff.
  • Unsupervised learning
    • Discover a good internal representation of the input.

Two types of supervised learning


  • Each training case consists of an input vector x and a target output t.

  • Regression: The target output is a real number or a whole vector of real numbers.

    • Crash counts on Hunter Road in 2025.
    • The temperature at noon tomorrow.
  • Classification: The target output is a class label.

    • Pedestrian crash typing
    • Crash severity type

How supervised learning typically works


  • We start by choosing a model-class: \(y=f(\mathbf{x}; \mathbf{W})\)

    • A model-class, \(f\), is a way of using some numerical parameters, \(\mathbf{W}\), to map each input vector, \(\mathbf{x}\), into a predicted output \(y\).
  • Learning usually means adjusting the parameters to reduce the discrepancy between the target output, t, on each training case and the actual output, y, produced by the model.

    • For regression, \(\dfrac{1}{2}(y-t)^2\) is often a sensible measure of the discrepancy.
    • For classification there are other measures that are generally more sensible.
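
A minimal sketch of this recipe, assuming a linear model class f(x; W) = W · x and a single training case. One gradient step on the squared-error measure reduces the discrepancy between y and t. The learning rate and data are illustrative assumptions.

```python
import numpy as np

def f(x, W):
    """Model class: a linear map from the input vector to a scalar output."""
    return np.dot(W, x)

def squared_error(y, t):
    """Discrepancy measure for regression: 0.5 * (y - t)^2."""
    return 0.5 * (y - t) ** 2

x = np.array([1.0, 2.0])   # one training input
t = 3.0                    # its target output
W = np.zeros(2)            # initial parameters

loss_before = squared_error(f(x, W), t)

# Gradient of 0.5 * (y - t)^2 with respect to W is (y - t) * x.
grad = (f(x, W) - t) * x
W = W - 0.1 * grad         # illustrative learning rate of 0.1

loss_after = squared_error(f(x, W), t)
print("loss before:", loss_before, "loss after:", loss_after)
```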

Reinforcement learning

  • In reinforcement learning, the output is an action or sequence of actions and the only supervisory signal is an occasional scalar reward.
    • The goal in selecting each action is to maximize the expected sum of the future rewards.
    • We usually use a discount factor for delayed rewards.
    • The rewards are typically delayed, so it’s hard to know where we went wrong.
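
A small sketch of the discounted sum of future rewards, assuming the common form G = Σ γ^k r_k with a discount factor γ between 0 and 1. The reward sequence and γ = 0.9 are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, with the reward k steps ahead weighted by gamma**k."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# A reward of 1 arrives only at the third step (a delayed reward).
rewards = [0.0, 0.0, 1.0]
g = discounted_return(rewards, gamma=0.9)
print("discounted return:", g)
```

The further away a reward is, the less it contributes, so the same reward received immediately would count for more than one received two steps later.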

Unsupervised learning


  • For about 40 years, unsupervised learning was largely ignored by the machine learning community
    • Some widely used definitions of machine learning actually excluded it.
    • Many researchers thought that clustering was the only form of unsupervised learning.
  • It is hard to say what the aim of unsupervised learning is.
    • One major aim is to create an internal representation of the input that is useful for subsequent supervised or reinforcement learning.
    • You can compute the distance to a surface by using the disparity between two images. But you don’t want to learn to compute disparities by stubbing your toe thousands of times.

Other goals for unsupervised learning

  • It provides a compact, low-dimensional representation of the input.
    • High-dimensional inputs typically live on or near a low-dimensional manifold.
    • Some methods: Principal Component Analysis (non-categorical), Multiple Correspondence Analysis (categorical), association rules.
  • It provides an economical high-dimensional representation of the input in terms of learned features.
    • Binary features are economical.
    • So are real-valued features that are nearly all zero.
  • It finds sensible clusters in the input.
    • This is an example of a very sparse code in which only one of the features is non-zero.
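
A hedged sketch of the low-dimensional representation idea, using Principal Component Analysis computed with a plain SVD. The synthetic data (3-D points lying near a 1-D line) and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-D data that lives near a 1-D line (a very simple "manifold").
t = rng.normal(size=200)
direction = np.array([1.0, 2.0, -1.0])
X = np.outer(t, direction) + 0.05 * rng.normal(size=(200, 3))

# PCA: center the data, then take the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S ** 2 / np.sum(S ** 2)   # fraction of variance per component
codes = Xc @ Vt[0]                    # 1-D code for each 3-D point

print("explained variance ratio:", explained)
```

Here a single number per case captures almost all of the variance, which is the sense in which the representation is compact.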

Why the learning procedure works (first attempt)

  • Consider the squared distance between any feasible weight vector and the current weight vector.
    • Example: Every time the perceptron makes a mistake, the learning algorithm moves the current weight vector closer to all feasible weight vectors.
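
For reference, a minimal sketch of the perceptron learning procedure itself, assuming the standard rule: leave the weights alone when the output is correct, add the input vector when the unit wrongly outputs 0, and subtract it when the unit wrongly outputs 1. The AND-gate data (with a constant 1 appended as the bias input) is an illustrative assumption.

```python
import numpy as np

# Inputs with a trailing 1 so that the last weight acts as the bias.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])   # targets for a logical AND
w = np.zeros(3)

def predict(x, w):
    """Binary threshold unit: output 1 if the weighted sum is non-negative."""
    return 1 if np.dot(w, x) >= 0 else 0

for epoch in range(100):
    mistakes = 0
    for x_i, t_i in zip(X, t):
        y_i = predict(x_i, w)
        if y_i != t_i:
            # (t - y) is +1 when we should have fired, -1 when we should not.
            w += (t_i - y_i) * x_i
            mistakes += 1
    if mistakes == 0:        # a full sweep with no errors: a feasible vector
        break

predictions = [predict(x_i, w) for x_i in X]
print("weights:", w, "predictions:", predictions)
```

The loop stops once a full sweep produces no mistakes, which means the current weight vector is one of the feasible vectors the argument refers to.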

Understanding Residuals

Understanding Loss

Residuals, Loss (Code)

################## MSE
import numpy as np

actual = np.random.randint(0, 10, 10)
predicted = np.random.randint(0, 10, 10)

print('Actual :', actual)
print('Predicted :', predicted)

ans = []

# Apply the MSE definition: average the squared residuals
for i in range(len(actual)):
    ans.append((actual[i]-predicted[i])**2)

MSE = 1/len(ans) * sum(ans)
print("Mean Squared error is :", MSE)

################### MAE

import numpy as np

actual = np.random.randint(0, 10, 10)
predicted = np.random.randint(0, 10, 10)

print('Actual :', actual)
print('Predicted :', predicted)

ans = []

# Apply the MAE definition: average the absolute residuals
for i in range(len(actual)):
    ans.append(abs(actual[i]-predicted[i]))

MAE = 1/len(ans) * sum(ans)
print("Mean Absolute error is :", MAE)


###################### Huber Loss

import numpy as np

def huber_loss(y_pred, y, delta=1):
    huber_mse = 0.5*np.square(np.subtract(y,y_pred))
    huber_mae = delta * (np.abs(np.subtract(y,y_pred)) - 0.5 * delta)
    return np.where(np.abs(np.subtract(y,y_pred)) <= delta, huber_mse, huber_mae).mean()

actual = np.random.randint(0, 10, (2,10))
predicted = np.random.randint(0, 10, (2,10))

print('actual :', actual)
print('predicted :', predicted)
print("Huber loss is :", huber_loss(predicted, actual))

Regression Loss Functions

Classification Loss Functions

Gradient Descent

Source: https://datascience.stackexchange.com/questions/44703/how-does-gradient-descent-and-backpropagation-work-together

Optimization issues in using the weight derivatives

  • How often to update the weights
    • Online: after each training case.
    • Full batch: after a full sweep through the training data.
    • Mini-batch: after a small sample of training cases.
  • How much to update
    • Use a fixed learning rate?
    • Adapt the global learning rate?
    • Adapt the learning rate on each connection separately?
    • Don’t use steepest descent?
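
A hedged sketch of the three update schedules on a toy linear regression, using a fixed learning rate. The only difference between online, mini-batch, and full-batch training here is the batch size passed to the same loop. The data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise-free toy regression problem with known weights.
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
t = X @ true_w

def train(batch_size, lr=0.1, epochs=100):
    """Gradient descent on squared error, updating once per batch."""
    w = np.zeros(2)
    for _ in range(epochs):
        order = rng.permutation(len(X))          # reshuffle every sweep
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            y = X[idx] @ w
            grad = X[idx].T @ (y - t[idx]) / len(idx)
            w -= lr * grad
    return w

w_online = train(batch_size=1)      # update after each training case
w_mini = train(batch_size=20)       # update after a small sample of cases
w_full = train(batch_size=200)      # update after a full sweep

print("online    :", w_online)
print("mini-batch:", w_mini)
print("full batch:", w_full)
```

With the same number of sweeps, smaller batches perform many more weight updates, which is one reason mini-batches are a common compromise.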

A simple example of overfitting



  • Which model do you trust?
    • The complicated model fits the data better.
    • But it is not economical.
  • A model is convincing when it fits a lot of data surprisingly well.
    • It is not surprising that a complicated model can fit a small amount of data well.
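
A small sketch of this effect, assuming six noisy training points drawn from a straight line and two polynomial models: a line (degree 1) and a degree-5 polynomial that can pass through every training point. The noise values are fixed, illustrative numbers.

```python
import numpy as np

# Six training points from the line t = 2x + 1 plus small fixed "noise".
x_train = np.linspace(0.0, 1.0, 6)
noise = np.array([0.10, -0.10, 0.15, -0.12, 0.08, -0.05])
t_train = 2.0 * x_train + 1.0 + noise

# Held-out points between the training inputs, taken from the true line.
x_test = np.linspace(0.1, 0.9, 5)
t_test = 2.0 * x_test + 1.0

def mse(coeffs, x, t):
    """Mean squared error of a fitted polynomial on the given points."""
    return np.mean((np.polyval(coeffs, x) - t) ** 2)

simple_model = np.polyfit(x_train, t_train, deg=1)
complex_model = np.polyfit(x_train, t_train, deg=5)

train_simple = mse(simple_model, x_train, t_train)
train_complex = mse(complex_model, x_train, t_train)

print("train MSE  simple:", train_simple, " complex:", train_complex)
print("test  MSE  simple:", mse(simple_model, x_test, t_test),
      " complex:", mse(complex_model, x_test, t_test))
```

The degree-5 model fits the six training points essentially exactly, which is unsurprising given that it has six free coefficients; comparing the held-out errors is what tells you which model to trust.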

Ways to reduce overfitting

  • A large number of different methods have been developed.
    • Weight-decay
    • Weight-sharing
    • Early stopping
    • Model averaging
    • Bayesian fitting of neural nets
    • Dropout
    • Generative pre-training
  • Many of these methods will be described later.

A few Algorithms

Feed-forward neural networks

  • These are the common type of neural network in practical applications.
    • The first layer is the input and the last layer is the output.
    • If there is more than one hidden layer, we call them “deep” neural networks.
  • They compute a series of transformations that change the similarities between cases.
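
A minimal sketch of a forward pass through a feed-forward net with two hidden layers, assuming ReLU hidden units, a linear output layer, and random untrained weights. The layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input of size 4, two hidden layers of size 8, output of size 3.
sizes = [4, 8, 8, 3]
weights = [rng.normal(scale=0.5, size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Apply each layer in turn; hidden layers use ReLU, the output is linear."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = h @ W + b
        is_last = (i == len(weights) - 1)
        h = z if is_last else np.maximum(z, 0.0)
    return h

batch = rng.normal(size=(5, 4))   # five input cases
out = forward(batch)
print("output shape:", out.shape)
```

Each layer re-represents the cases, so two inputs that look different at the pixel level can end up with similar activities in a later layer.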


Recurrent networks


  • These have directed cycles in their connection graph.
    • That means you can sometimes get back to where you started by following the arrows.
  • They can have complicated dynamics and this can make them very difficult to train.
    • There is a lot of interest at present in finding efficient ways of training recurrent nets.

Recurrent nets with multiple hidden layers are just a special case that has some of the hidden\(\rightarrow\)hidden connections missing.
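
A hedged sketch of a simple recurrent net, assuming the common update h_t = tanh(x_t W_xh + h_{t-1} W_hh + b). The hidden-to-hidden matrix W_hh is the directed cycle in the connection graph. Sizes and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 3, 4
W_xh = rng.normal(scale=0.5, size=(n_in, n_hidden))      # input -> hidden
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # hidden -> hidden (the cycle)
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state depends on the previous one."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

inputs = rng.normal(size=(6, n_in))   # a sequence of six input vectors
h = np.zeros(n_hidden)
states = []
for x_t in inputs:
    h = rnn_step(x_t, h)
    states.append(h)
states = np.array(states)

print("hidden states shape:", states.shape)
```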

The standard paradigm for statistical pattern recognition


  1. Convert the raw input vector into a vector of feature activations. Use hand-written programs based on common-sense to define the features.
  2. Learn how to weight each of the feature activations to get a single scalar quantity.
  3. If this quantity is above some threshold, decide that the input vector is a positive example of the target class.

The standard Perceptron architecture


Softmax





Softmax (Code)

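softmax <- function(par){
  n.par <- length(par)
  # Sort so the running log-sum-exp starts from the largest value
  par1 <- sort(par, decreasing = TRUE)
  Lk <- par1[1]
  # Accumulate log(sum(exp(par))) one term at a time, in a numerically stable way.
  # seq_len() skips the loop when par has a single element (1:0 would not).
  for (k in seq_len(n.par - 1)) {
    Lk <- max(par1[k+1], Lk) + log1p(exp(-abs(par1[k+1] - Lk)))
  }
  # Softmax: exp(par) / sum(exp(par)), computed on the log scale
  val <- exp(par - Lk)
  return(val)
}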

# Example 1
vec <- c(-1,2,1,-3)
sm <- softmax(vec)
print(sm)


set.seed(123)
vec <- rnorm(30)
sm <- softmax(vec)
print(sm)

ANI, AGI, ASI

Towards ASI

Towards ASI from CE Lens

System Design for Road Users

AI World Models

Deep Learning

  • Transportation
    • Driving assistance / autonomous driving
  • On-line Safety / Security
    • Filtering harmful/hateful content
    • Filtering dangerous misinformation
  • Environmental monitoring
  • Medicine
    • Medical imaging
    • Diagnostic aid
    • Patient care
    • Drug discovery

Deep Learning Connects People and Knowledge

  • Meta (Facebook, Instagram), Google, YouTube, and Amazon are built around deep learning
    • Take Deep Learning out of them, and they crumble.
  • DL helps us deal with the information deluge
    • Search, retrieval, ranking, question-answering
    • Requires machines to understand content
  • Translation / transcription / accessibility
    • language ↔︎ language; text ↔︎ speech; image → text
    • People speak thousands of different languages
    • 3 billion people can’t use technology today.
    • 800 million are illiterate, 300 million are visually impaired

Source: The majority of the following slides are taken from Dr. Yann LeCun’s lectures and talks.

Deep Learning for On-Line Content Moderation

  • Filtering out objectionable content
    • What constitutes acceptable or objectionable content?
    • Meta doesn’t see itself as having the legitimacy to decide
    • But in the absence of regulations, it has to do it.
  • Types of objectionable content on Facebook
    • (with % taken down preemptively & prevalence, Q1 2022)
    • Hate Speech (95.6%, 0.02%), up from 30-40% in 2018
    • Violence incitement (98.1%, 0.03%), Violence (99.5%, 0.04%), Bullying/Harassment (67%, 0.09%), Child endangerment (96.4%), Suicide/Self-Injury (98.8%), Nudity (96.7%, 0.04%),
    • Taken down (Q1’22): Terrorism (16M), Fake accounts (1.5B), Spam (1.8B)
    • https://transparency.fb.com/data/community-standards-enforcement

Future of AI

  • Understand the world, understand humans, have common sense
  • Level-5 autonomous cars
    • That learn to drive like humans, in about 20h of practice
  • Virtual assistants that can help us in our daily lives
    • Manage the information deluge (content filtering/selection)
    • Understands our intents, takes care of simple things
    • Real-time speech understanding & translation
    • Overlays information in our AR glasses.
  • Domestic Robots
    • Takes care of all the chores
  • For this, we need machines with near-human-level AI
    • Machines that understand how the world works


Machine Learning (compared to humans and animals)

  • Supervised learning (SL) requires large numbers of labeled samples.
  • Reinforcement learning (RL) requires insane amounts of trials.
  • SL/RL-trained ML systems
    • are specialized and brittle
    • make “stupid” mistakes
  • Machines don’t have common sense
  • Animals and humans
    • Can learn new tasks very quickly.
    • Understand how the world works
  • Humans and animals have common sense

Machine Learning (plain ML/DL, at least)

  • Machine Learning systems (most of them anyway)
    • Have a constant number of computational steps between input and output.
    • Do not reason.
    • Cannot plan.


  • Humans and some animals
    • Understand how the world works.
    • Can predict the consequences of their actions.
    • Can perform chains of reasoning with an unlimited number of steps.
    • Can plan complex tasks by decomposing them into sequences of subtasks

Three challenges for AI & Machine Learning

  1. Learning representations and predictive models of the world
    • Supervised and reinforcement learning require too many samples/trials
    • Self-supervised learning / learning dependencies / to fill in the blanks
      • learning to represent the world in a non task-specific way
      • Learning predictive models for planning and control
  2. Learning to reason, like Daniel Kahneman’s “System 2”
    • Beyond feed-forward, System 1 subconscious computation.
    • Making reasoning compatible with learning.
      • Reasoning and planning as energy minimization.
  3. Learning to plan complex action sequences
    • Learning hierarchical representations of action plans

How could machines learn like animals and humans?

  • How do babies learn how the world works?
  • How can teenagers learn to drive with 20h of practice?

How do Human and Animal Babies Learn?

  • How do they learn how the world works?
  • Largely by observation, with remarkably little interaction (initially).
  • They accumulate enormous amounts of background knowledge
    • About the structure of the world, like intuitive physics.
  • Perhaps common sense emerges from this knowledge?

Modular Architecture for Autonomous AI

  • Configurator
    • Configures other modules for task
  • Perception
    • Estimates state of the world
  • World Model
    • Predicts future world states
  • Cost
    • Compute “discomfort”
  • Actor
    • Find optimal action sequences
  • Short-Term Memory
    • Stores state-cost episodes


Mode-2 Perception-Planning-Action Cycle

  • Akin to Model-Predictive Control (MPC) in optimal control.
  • Actor proposes an action sequence
  • World Model imagines predicted outcomes
  • Actor optimizes action sequence to minimize cost
    • e.g. using gradient descent, dynamic programming, MC tree search…
  • Actor sends first action(s) to effectors
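
A toy sketch of this cycle under strong simplifying assumptions: a hypothetical 1-D world model (next state = state + action), a cost equal to the squared distance to a goal, and an actor that refines its action sequence by gradient descent with finite-difference gradients. None of these choices come from the slides; they only illustrate the propose, imagine, optimize, act loop.

```python
import numpy as np

GOAL = 1.0

def world_model(state, action):
    """Hypothetical dynamics: the action simply shifts the state."""
    return state + action

def cost(state):
    """Hypothetical cost: squared distance from the goal."""
    return (state - GOAL) ** 2

def imagined_cost(state, actions):
    """Roll the world model forward and add up the predicted costs."""
    total = 0.0
    for a in actions:
        state = world_model(state, a)
        total += cost(state)
    return total

def plan(state, horizon=5, steps=200, lr=0.05, eps=1e-4):
    """Actor: start from a zero action sequence and descend the imagined cost."""
    actions = np.zeros(horizon)
    for _ in range(steps):
        grad = np.zeros(horizon)
        for i in range(horizon):
            bump = np.zeros(horizon)
            bump[i] = eps
            grad[i] = (imagined_cost(state, actions + bump)
                       - imagined_cost(state, actions - bump)) / (2 * eps)
        actions -= lr * grad
    return actions

state = 0.0
for step in range(10):
    actions = plan(state)                     # propose and optimize a sequence
    state = world_model(state, actions[0])    # execute only the first action

print("final state:", state)
```

Re-planning after every executed action is what makes the scheme resemble Model-Predictive Control: the rest of the optimized sequence is discarded and recomputed from the new state.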

Self-Supervised Learning = Learning to Fill in the Blanks

  • Reconstruct the input or Predict missing parts of the input.

This is a […] of text extracted […] a large set of […] articles

Self-Supervised Learning = Learning to Fill in the Blanks

  • Reconstruct the input or Predict missing parts of the input.

This is a piece of text extracted from a large set of news articles
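
A minimal sketch of how fill-in-the-blank training pairs can be built from raw text with no human labels, using the sentence above. The `[MASK]` token and the masked positions are illustrative assumptions.

```python
sentence = "This is a piece of text extracted from a large set of news articles"
words = sentence.split()

# Positions of the words to hide (hypothetical choice matching the blanks above).
masked_positions = [3, 7, 12]

# The model sees the sentence with blanks...
masked_words = ["[MASK]" if i in masked_positions else w
                for i, w in enumerate(words)]
model_input = " ".join(masked_words)

# ...and is trained to predict the hidden words, which come from the text itself.
targets = [words[i] for i in masked_positions]

print("input  :", model_input)
print("targets:", targets)
```

The supervision signal is part of the input, which is why this setup needs no labeling effort.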

Two Uses for Self-Supervised Learning


  1. Learning hierarchical representations of the world
    • SSL pre-training precedes a supervised or RL phase
  2. Learning predictive (forward) models of the world
    • Learning models for Model-Predictive Control, policy learning for control, or model-based RL.


  • Question: how to represent uncertainty & multimodality in the prediction?

Learning Paradigms: information content per sample


  • “Pure” Reinforcement Learning (cherry)
    • The machine predicts a scalar reward given once in a while.
    • A few bits for some samples
  • Supervised Learning (icing)
    • The machine predicts a category or a few numbers for each input
    • Predicting human-supplied data
    • 10→10,000 bits per sample
  • Self-Supervised Learning (cake génoise)
    • The machine predicts any part of its input for any observed part.
    • Predicts future frames in videos
    • Millions of bits per sample



The world is stochastic

  • Training a system to make a single prediction makes it predict the average of all plausible predictions
  • Blurry predictions!
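
A small numerical sketch of why this happens, assuming squared error as the training loss and a toy set of outcomes that are either -1 or +1 with equal frequency. The single prediction that minimizes the loss is the average, 0, which is not one of the plausible outcomes.

```python
import numpy as np

# Two equally likely outcomes for the same input (e.g., turn left or turn right).
targets = np.array([-1.0] * 50 + [1.0] * 50)

# Try many candidate single predictions and measure the mean squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((targets - c) ** 2) for c in candidates])

best = candidates[np.argmin(mse)]
print("best single prediction:", best)
print("mean of the targets   :", targets.mean())
```

For images or video frames, averaging many sharp but different futures in this way is what produces a blurry output.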


The world is unpredictable. Output must be multimodal.

  • Training a system to make a single prediction makes it predict the average of all plausible predictions
  • Blurry predictions!


How do we represent uncertainty in the predictions?


  • The world is only partially predictable
  • How can a predictive model represent multiple predictions?
  • Probabilistic models are intractable in continuous domains.
  • Generative Models must predict every detail of the world
  • LeCun’s proposed solution: Joint-Embedding Predictive Architecture (JEPA)


Energy-Based Models: Implicit function

  • Gives low energy for compatible pairs of x and y
  • Gives higher energy for incompatible pairs

Energy-Based Models

  • Feed-forward nets use a finite number of steps to produce a single output.
  • What if…
    • The problem requires a complex computation to produce its output? (complex inference)
    • There are multiple possible outputs for a single input? (e.g. predicting future video frames)
  • Inference through constraint satisfaction
    • Finding an output that satisfies constraints, e.g., a linguistically correct translation or speech transcription.
    • Maximum likelihood inference in graphical models

Energy-Based Model: implicit function

  • Energy function that captures the x,y dependencies:
    • Low energy near the data points. Higher energy everywhere else.
    • If y is continuous, F should be smooth and differentiable, so we can use gradient-based inference algorithms.
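
A hedged sketch of gradient-based inference with a toy energy function F(x, y) = (y² - x)², chosen only for illustration. For x = 1 it has two equally good answers, y = 1 and y = -1, and gradient descent on y finds one or the other depending on the starting point.

```python
def energy(x, y):
    """Toy energy: low when y squared matches x."""
    return (y ** 2 - x) ** 2

def energy_grad_y(x, y):
    """Derivative of the toy energy with respect to y."""
    return 4.0 * y * (y ** 2 - x)

def infer(x, y_init, lr=0.05, steps=200):
    """Inference: descend the energy in y while x stays fixed."""
    y = y_init
    for _ in range(steps):
        y -= lr * energy_grad_y(x, y)
    return y

x = 1.0
y_pos = infer(x, y_init=0.5)    # converges to the minimum near +1
y_neg = infer(x, y_init=-0.5)   # converges to the minimum near -1

print("from +0.5:", y_pos, "energy:", energy(x, y_pos))
print("from -0.5:", y_neg, "energy:", energy(x, y_neg))
```

A single function from x to y could not represent both answers, whereas the energy surface represents them as two separate minima.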

Energy-Based Model: unconditional version

  • Conditional EBM: F(x,y)
  • Unconditional EBM: F(y)
    • measures the compatibility between the components of y
    • If we don’t know in advance which part of y is known and which part is unknown

Energy-Based Models vs Probabilistic Models

  • Probabilistic models are a special case of EBM
    • Energies are like un-normalized negative log probabilities
  • Why use EBM instead of probabilistic models?
    • EBM gives more flexibility in the choice of the scoring function.
    • More flexibility in the choice of objective function for learning
  • From energy to probability: Gibbs-Boltzmann distribution
    • Beta is a positive constant
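
A short sketch of the Gibbs-Boltzmann conversion for a discrete set of candidate outputs, assuming the standard form P(y | x) = exp(-β F(x, y)) / Σ_y' exp(-β F(x, y')). The energy values and the two β settings are illustrative.

```python
import numpy as np

def gibbs(energies, beta=1.0):
    """Turn energies into probabilities; lower energy means higher probability."""
    energies = np.asarray(energies, dtype=float)
    logits = -beta * energies
    logits = logits - logits.max()   # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

energies = [0.0, 1.0, 2.0]           # energies of three candidate outputs y

p = gibbs(energies, beta=1.0)
p_sharp = gibbs(energies, beta=5.0)  # larger beta concentrates mass on the minimum

print("beta = 1:", p)
print("beta = 5:", p_sharp)
```

The normalizing sum is easy here because there are only three candidates; for continuous, high-dimensional y it becomes an integral that is often intractable, which is one motivation for working with energies directly.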



Latent-Variable Generative EBM Architecture


  • Latent variables:
    • parameterize the set of predictions
  • Ideally, the latent variable represents independent explanatory factors of variation of the prediction.
  • The information capacity of the latent variable must be minimized.
    • Otherwise all the information for the prediction will go into it.


Tools and Websites

Programming Language

  • Any necessary coding will be done in R
  • You can use Python too, but this class will be mostly R-based.
  • It’s a Ph.D./graduate-level course. The lecture focus is on concepts and applications, not code debugging.
  • Relevant code will be posted to Canvas and embedded in the slides when necessary

Pros

  • Exposure to R
  • Rich Ecosystem
  • Reproducibility
  • Textbook

Cons

  • Steep learning curve
  • Performance
  • Package Quality
  • Limited Industry Adoption

Quarto

Quarto is an open-source scientific and technical publishing system that allows you to combine text, images, code, plots, and tables in a fully-reproducible document. Quarto has support for multiple languages including R, Python, Julia, and Observable. It works for a range of output formats such as PDFs, HTML documents, websites, presentations,…

quarto hex sticker logo

GitHub

  • GitHub is a good source of open-source code.
  • Familiarity with GitHub is a must.

GitHub: Make a fork

Screenshot of github repository with fork button highlighted

GitHub: Clone the repository

Screenshot of github repository with clone button highlighted

Papers with Code

The End