name: inverse
layout: true
class: center, middle, inverse

---
class: neuralnet3, full-width, full-height

# Neural Nets: from Perceptron to Deep Learning

[or mimicking the brain]

.footnote[or skip it to [Tensorflow Playground](http://playground.tensorflow.org/)]

---

## What are deep artificial neural nets?

---
layout: false

.left-column[
## What is it?
]
.right-column[
Another bio-inspired, crazy-AI-maths unidentified object, a.k.a.:

- the *most popular ML algorithm* after linear models
- the reason for bi-weekly breaking news from Google
- the stuff that *already* tags maps and photos, unlocks your iPhone, translates your texts, recognizes your speech...
- the stuff that *has been* reading bank checks, detecting fraud, identifying threats... for more than 20 years
- the stuff that *will* help drive your car, diagnose cancers, transform a panda, generate people...
]

---
.left-column[
## For real?
]
.right-column[
Artificial Neural Networks are as old as Artificial Intelligence: the initial goal was to create a **machine** that would reason the same way as the human brain, except neurons would be central computing units connected by wires

- ANNs were historically a **hardware** implementation, before they became code written in high-level languages running on transistors
- with DL, they have essentially become specific pieces of code running on GPUs or TPUs
- but the AI-chip dream is still around
]

---
.left-column[
## Historically
]
.right-column[
ANNs started in 1943, with a complex timeline

- a Cybernetics era, followed by a 'crazy-expectations' era
- a massively funded era (defence, IT, finance), followed by an SVM era
- the **Deep Learning era** (since 2006) is actually the third wave, and the most successful period
]

---
.left-column[
## Second wave
]
.right-column[
Starting in the 80s, essentially at AT&T, MIT, Microsoft and IBM

- backpropagation, Convolutional Neural Networks, Long Short-Term Memory
- Hinton (Univ. Toronto), LeCun (Bell Labs), Bengio (Univ. Montreal)
- training was hard, but already in production for inference
> " *At some point in the late 1990s, one of these systems was reading 10 to 20% of all the checks in the US.* " Y. LeCun (2014) ] --- .left-column[ ## Second wave ] .right-column[ (Shallow) Neural networks already in common use with crafted features engineering - *much workload* was concerned by this *feature engineering* and *data preprocessing* - once good features have been found, Neural Nets was just another ML technique (vs. SVM) - more dedicated solutions (CNN, LSTM) were *hard to train*
]

---
.left-column[
## Third wave
]
.right-column[
Deep Neural networks learn the features and make inference easier

- no more *complex preprocessing*: the network automatically learns the right features
- much of the workload consists in training, made possible by GPUs and very large training sets
- the inference pipeline is simpler but requires *computing power*
]

---

# How do you train deep neural nets?
---
.left-column[
## Jargon
]
.right-column[
As in any domain, Neural Nets have their own Domain Specific Language, on top of traditional ML issues

A comic-strip view .red[*]
.footnote[.red[*] source: https://codelabs.developers.google.com/]
]

---
.left-column[
## Maths
]
.right-column[
An ANN output is a linear combination of nonlinear outputs taken over linear combinations of the inputs

- let's call `\(x=(x_1,\ldots,x_n)\)` the inputs, `\(w\)` the weights and `\(b\)` the bias of one neuron
- let's call `\(f\)` the **activation function** of one given neuron
- the output of one neuron is

$$ f(w_1 x_1 + \ldots + w_n x_n + b) $$
- now, with `\(m\)` neurons, the final output is

$$ g(x) = \sum_{i=1}^m \alpha_i f^i(w_1^i x_1 + \ldots + w_n^i x_n + b^i) + b_0 $$
]

---
.left-column[
## MLP
]
.right-column[
What we have just described is the celebrated Multi-Layer Perceptron architecture, sketched in code below

- ancestor of all modern architectures (the fully connected part)
- descended from Rosenblatt's perceptron (1958)
- part of the feedforward class of neural nets (no feedback loops)
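A minimal numpy sketch of the one-hidden-layer forward pass above (names and sizes are illustrative):

```python
import numpy as np

def mlp_forward(x, W, b, alpha, b0, f=np.tanh):
    """One-hidden-layer MLP: g(x) = sum_i alpha_i * f(w^i . x + b^i) + b0."""
    hidden = f(W @ x + b)        # m nonlinear neuron outputs
    return alpha @ hidden + b0   # linear combination of those outputs

rng = np.random.default_rng(0)
n, m = 4, 8                      # n inputs, m hidden neurons
x = rng.normal(size=n)
g = mlp_forward(x, rng.normal(size=(m, n)), rng.normal(size=m),
                rng.normal(size=m), 0.0)
```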
]

---
.left-column[
## Activation Functions
]
.right-column[
The activation function is the unit process function that every input transformation goes through. Initially, its purpose was to mimic the ON/OFF firing of axons.

Before deep learning, ANN practitioners favored differentiable activation functions. Since the rise of deep learning, this has changed and it can now be

- a smooth saturating function, e.g. sigmoid, hyperbolic tangent
- piecewise linear (Rectified Linear Unit)
- anything (cosine, identity)
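The usual suspects, as a numpy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # smooth, saturates in (0, 1)

def tanh(z):
    return np.tanh(z)                 # smooth, saturates in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # piecewise linear, the DL workhorse
```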
]

---
.left-column[
## Does it work?
]
.right-column[
Hornik's universal approximation theorem (a.k.a. Cybenko's theorem)
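One common statement of the result (Hornik, 1991): for a non-constant, bounded, continuous activation `\(f\)`, any continuous function `\(h\)` on a compact set `\(K\)` can be approximated by the one-hidden-layer network `\(g\)` defined earlier:

$$ \forall \varepsilon > 0,\ \exists\, m,\ \alpha_i,\ w^i,\ b^i \quad \text{such that} \quad \sup_{x \in K} \left| h(x) - g(x) \right| < \varepsilon $$

The theorem says nothing about how large `\(m\)` must be, nor about how to find the weights: that is the training problem of the next slides.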
]

---
.left-column[
## Training
]
.right-column[
Once the architecture (number of layers, number of neurons, activation functions) is set, we need to get the weights:

- find the optimal values of the weights such that the loss function `\(L(w)\)` is minimum over the training set of `\(N\)` examples:

$$ \min_w L(w), \quad L(w)=\frac{1}{N}\sum_{i=1}^N \big(g(x_i)-y_i\big)^2 $$

- an unconstrained, non-convex minimization problem over the `\(w\)` variables
- with many optimization variables
]

---
.left-column[
## Training
]
.right-column[
The only viable option for such a training problem is gradient descent, sketched below

- an iterative algorithm where you start from an initial guess `\(w^0\)` of the weights and update them:

$$ w^{i+1}=w^{i}-\alpha \nabla_w L(w^{i}) $$

- `\(\alpha\)` is known as the gradient step (in optimization) or the learning rate (in NN)
- quite simple... except you need `\(\nabla_w L(w^{i})\)`: enter the **backpropagation algorithm**
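A minimal gradient-descent sketch on the quadratic loss above, with a hand-coded gradient (the toy data and step size are illustrative):

```python
import numpy as np

def gd(grad_L, w0, alpha=0.01, iters=500):
    """Plain gradient descent: w <- w - alpha * grad L(w)."""
    w = w0
    for _ in range(iters):
        w = w - alpha * grad_L(w)
    return w

# Toy example: linear model g(x) = w . x on a tiny training set
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
grad = lambda w: 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of the MSE
w_star = gd(grad, np.zeros(2))
```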
]

---
.left-column[
## Training
]
.right-column[
Training an MLP (and it is even more true for Deep Neural Nets) then requires

- scientific computing skills, computing power and **craftsmanship**
- dedicated libraries, essentially for distributed backpropagation and optimization (all written in C++, except sklearn and DL4J)
- solving an **optimization problem**
]

---
.left-column[
## Training
]
.right-column[
In addition to traditional ML parameters (regularization), training a DNN requires setting the right hyperparameters for the optimization

- optimization solvers: Stochastic Gradient Descent, Momentum, Adam, RMSProp, Nesterov...
- SGD is the most used solver; it is a simple Gradient Descent where you compute `\(\nabla_w L(w^{i})\)` over a very small subset of the training set (called a *minibatch*), hence the *stochastic* gradient (see the sketch below)
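A minimal minibatch-SGD sketch: the gradient over one random minibatch is a noisy but cheap estimate of the full gradient (data and step size are illustrative):

```python
import numpy as np

def sgd(grad_batch, w0, X, y, alpha=0.005, batch=2, epochs=500, seed=0):
    """At each step, estimate the gradient on a small random minibatch."""
    rng = np.random.default_rng(seed)
    w = w0
    for _ in range(epochs):
        idx = rng.permutation(len(y))            # reshuffle each epoch
        for start in range(0, len(y), batch):
            sl = idx[start:start + batch]        # one minibatch of indices
            w = w - alpha * grad_batch(w, X[sl], y[sl])
    return w

grad_mse = lambda w, Xb, yb: 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
w = sgd(grad_mse, np.zeros(2), X, y)
```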
]

---
.left-column[
## Training
]
.right-column[
Regularization is ensured through a ridge penalty term, a.k.a. *weight decay*, but other knobs matter too (both are sketched below):

- *learning rate*: the most important term; it increases model capacity indirectly through its impact on optimization. More advanced strategies update the learning rate during training (schedules, typically monitored with TensorBoard)
- *number of neurons*: too many neurons leads to overfitting
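A hedged Keras sketch of both knobs, an L2 *weight decay* penalty and a decaying learning-rate schedule (layer sizes and decay constants are illustrative):

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=1000, decay_rate=0.96)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=schedule),
              loss="mse")
```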
]

---

## What are they good for?
---
.left-column[
## Automatic Processing Tasks
]
.right-column[
The rise of Deep Learning was motivated by reaching human-like error rates on traditional challenges in Robotics, AI... by taking advantage of the large amounts of data collected almost for free by American and Chinese tech giants. This includes

- **Computer Vision** (pictures from mobile phones)
- **Natural Language Processing** (text and comments from the World Wide Web)
- **Speech** (excerpts of talks from mobile phones and collaborative platforms)
]

---
class: neuralnet4, full-width, full-height

# Computer Vision

---
.left-column[
## Computer Vision - Tasks
]
.right-column[
In Image Processing (or Computer Vision, typically for Robotics, Detection...), traditional challenges and tasks are

- image classification (threat or not?)
- object detection (where is the threat in the picture?)
- segmentation: background removal, semantic segmentation (tumor detection) or instance segmentation (counting people)
- scene understanding, tracking, location, mapping (SLAM), 3D reconstruction... (defence, autonomous transportation)
]

---
.left-column[
## Computer Vision - Traditional Methods
]
.right-column[
In Image Processing, decades of research created many different methods, like

- template matching (similarity with a fixed pattern)
- low-level dedicated methods (Canny edge detection, watershed or CRF for image segmentation...)
- descriptor extraction plus Machine Learning (face detection on mobile phones, Viola-Jones)
]

---
.left-column[
## Computer Vision - New architectures for DNN
]
.right-column[
Convolutional Neural Networks (created by LeCun in 1989) first apply convolution layers on top of the inputs (pixel values) to reduce the number of weights

- state-of-the-art performance on most CV challenges since 2012
- add to the MLP architecture layers of convolution adapted from traditional CV
- add newcomers such as *pooling* (a.k.a. subsampling) and *dropout* to regularize and lower the number of parameters
]

---
.left-column[
## Computer Vision - Convolution
]
.right-column[
Convolutional layers are specific to pictures

- fully connected layers would involve too many parameters for one picture (think of a `\(1000 \mbox{px} \times 1000 \mbox{px}\)` picture: the first layer would involve `\(1,000,000\)` weights for a single neuron)
- convolutions help reduce the number of parameters (see the count below)
- the weights of the convolution kernel become trainable parameters
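A back-of-the-envelope comparison for the `\(1000 \times 1000\)` example (a sketch; the `\(5 \times 5\)` kernel size is an illustrative assumption):

```python
# Fully connected: every output neuron sees every input pixel
pixels = 1000 * 1000
fc_weights_per_neuron = pixels          # 1,000,000 weights for ONE neuron

# Convolutional: one small kernel, shared across all positions
kernel = 5 * 5
conv_weights_per_filter = kernel + 1    # 26 parameters (weights + bias),
                                        # reused at every pixel location
print(fc_weights_per_neuron, conv_weights_per_filter)
```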
]

---
.left-column[
## Computer Vision - Dropout and Pooling
]
.right-column[
Dropout

- some connections between neurons and layers are switched off during training; a typical amount is 20% of neurons, randomly selected at each training step. Dropout helps regularization
Pooling

- subsampling helps reduce the number of parameters (both are sketched below)
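Both as Keras layers, in a minimal sketch (sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=2),   # subsample feature maps by 2
    tf.keras.layers.Dropout(0.2),                # switch off 20% of activations
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```

Note that Keras disables the Dropout layer automatically at inference (see next slide).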
]

---
.left-column[
## Computer Vision Architectures
]
.right-column[
Many sophisticated architectures

- combine many layers ('deep'), in general with several fully connected layers at the end
- usually ReLU-like activation functions, with tanh at the end; but the devil is in the sequence of pooling, batch normalization, weight sharing...
- dropout is for training only, not inference!
]

---
.left-column[
## Computer Vision Architectures
]
.right-column[
Main challenges and architectures focus on image classification, but other CV tasks require more complex architectures

- typically, object detection outputs one or several bounding boxes containing the most probable objects
- traditional CV uses image pyramids, runs a classifier many times over many different regions, and outputs the regions with the highest scores
- recent DL object-detection architectures rely on efficient subdivision and probability maps to speed up detection
]

---
.left-column[
## Computer Vision Architectures
]
.right-column[
A difficult CV task is semantic segmentation (determining which class each pixel belongs to)

- outputs a map the same size as the original image
- applies convolutions and then deconvolutions
- labelling is tedious
]

---
.left-column[
## Computer Vision: best practices
]
.right-column[
Use a classical architecture (unless you're Google or Facebook), or even better, transfer learning (sketched below)

- Dataset: same aspect ratio, same resolution, same colormap
- Size: as many pictures as possible; don't overlook labelling
- Scaling: usually mean subtraction (per pixel, per channel)
- Augment: train and test (crops, shifts, rotations...)
- [DL book](https://www.deeplearningbook.org/) rule of thumb: you need at least 5,000 examples per category, and 10,000,000 examples to reach human error rate
- much practical advice [here](https://jeffmacaluso.github.io/post/DeepLearningRulesOfThumb/)
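A minimal transfer-learning sketch in Keras, assuming ImageNet weights and an illustrative 5-class problem:

```python
import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False                   # freeze the pretrained features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),  # new task-specific head
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

Only the small new head is trained, so a few thousand labelled pictures can go a long way.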
]

---
class: neuralnet5, h1, full-width, full-height

# NLP

---
.left-column[
## Natural Language Processing Tasks
]
.right-column[
AI's initial goal was to make machines able to understand and generate text and speech, to interact with humans (Turing test). Such a goal relies on tasks ranging from easy to very hard

- **Easy** spelling correction, Part-of-Speech tagging...
- **Medium** sentiment analysis, text classification...
- **Hard** summarization, translation, question answering...
]

---
.left-column[
## Natural Language Processing Issues
]
.right-column[
NLP is hard because

- hard pre-processing (segmentation), and sequential by nature
- sparsity of examples (many words used very rarely, different sizes...)
- idioms, language diversity (colloquial, formal...), Domain Specific Languages
- human ambiguity of words and senses; grammar parsing is not enough... **the astronomer saw the star**
- linguistics issues (rationalist vs. empiricist)
]

---
.left-column[
## Natural Language Processing Traditional Methods
]
.right-column[
NLP has a longer history than AI, with many different methods

- rule-based (think of `regexp`s)
- Language Models based on n-grams (Markov models) and dictionaries, as sketched below
- graph-based methods (Viterbi algorithm)
- manual feature extraction (dictionary, bag of words, tf-idf) plus an ML classifier
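A toy bigram language model, sketching the n-gram idea: counts of word pairs give conditional probabilities (the corpus is illustrative):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of consecutive pairs
unigrams = Counter(corpus)

def p_next(word, nxt):
    """P(next word | current word), estimated from bigram counts."""
    return bigrams[(word, nxt)] / unigrams[word]

print(p_next("the", "cat"))   # 2/3: 'the' is followed by 'cat' 2 times out of 3
```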
]

---
.left-column[
## Natural Language Processing Neural Networks Approach
]
.right-column[
Similarly to CV, NLP has experienced some real advances with the help of Artificial Neural Networks

- Long Short-Term Memory units in Recurrent Neural Networks for Language Models allow information to be retained
- word embeddings: dense representations of words, sentences... the **word2vec** approach (sketched below)
- Sequence-to-Sequence learning and attention mechanisms (Neural Machine Translation, Question Answering, chatbots...)
- Transformers (BERT, T-NLG...) and ever deeper architectures (17 billion parameters)
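A hedged word2vec sketch with gensim (assuming gensim ≥ 4 for the `vector_size` argument; real embeddings need far more text than this toy corpus):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

vec = model.wv["cat"]                   # a dense 50-d vector for 'cat'
print(model.wv.most_similar("cat"))     # nearest words in embedding space
```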
]

---
.left-column[
## Natural Language Processing LSTM
]
.right-column[
Long Short-Term Memory networks are part of the Recurrent Neural Networks family (created in 1982 by Hopfield)

- unlike standard feedforward networks, they process inputs sequentially
- they retain (*contextual*) information in a memory
- they work well for Language Modelling, Sequence-to-Sequence models, translation...
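A minimal Keras LSTM for sequence classification (vocabulary and layer sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # word -> vector
    tf.keras.layers.LSTM(128),           # processes the sequence step by step,
                                         # carrying a memory (cell state)
    tf.keras.layers.Dense(1, activation="sigmoid"),   # e.g. sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```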
]

---
class: neuralnet6, h1, full-width, full-height

---
.left-column[
## Generative Adversarial Networks (GANs)
]
.right-column[
The best idea in Machine Learning in the last 10 years (Yann LeCun)

- generative vs. discriminative (classification, regression)
- two neural networks, the Generator and the Discriminator, that fight in a minimax game (sketched below)
- quite hard to train... recent methods take inspiration from [Optimal Transport](https://weave.eu/le-transport-optimal-un-couteau-suisse-pour-la-data-science/) theory (Wasserstein distance)
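The minimax game in a hedged PyTorch sketch, on toy 1-d data (architectures and the target distribution are illustrative):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # noise -> fake
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # sample -> logit
bce = nn.BCEWithLogitsLoss()
opt_g, opt_d = torch.optim.Adam(G.parameters()), torch.optim.Adam(D.parameters())

for _ in range(1000):
    real = torch.randn(32, 1) * 0.5 + 2.0   # the 'true' distribution
    fake = G(torch.randn(32, 8))
    # Discriminator: tell real (label 1) from fake (label 0)
    loss_d = (bce(D(real), torch.ones(32, 1))
              + bce(D(fake.detach()), torch.zeros(32, 1)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: fool the discriminator into predicting 'real'
    loss_g = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```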
]

---
.left-column[
## Generative Adversarial Networks (GANs)
]
.right-column[
Applications

- generate realistic images
- increase resolution (super-resolution)...
- text-to-image, image-to-image...
]

---
.left-column[
## Autoencoders
]
.right-column[
Autoencoders are a deep extension of the MLP that automatically finds the best reduction and the coding/decoding at once (a sketch follows)

- part of unsupervised learning
- conceptually, a nonlinear dimension-reduction technique that is decodable (unlike t-SNE)
- often used to denoise signals, also for data synthesis
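A minimal Keras autoencoder sketch (the 784-d input and 32-d bottleneck are illustrative, e.g. flattened MNIST digits):

```python
import tensorflow as tf

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),  # encode
    tf.keras.layers.Dense(32, activation="relu"),    # bottleneck: the 'code'
    tf.keras.layers.Dense(128, activation="relu"),   # decode
    tf.keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")
# Unsupervised: the network learns to reproduce its own input
# autoencoder.fit(x_train, x_train, epochs=10)
```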
]

---
.left-column[
## Auto-encoders
]
.right-column[
Applications:

- finding the concept of 'cat' by browsing Web images and videos
- helping unlock your iPhone X
- used as a compression scheme for speech, images...
]

---
.left-column[
## Adversarial Attacks
]
.right-column[
Basic idea:

- find, by optimization, an amount to add to the pixels of an image to change the prediction of a CNN (see the FGSM sketch below)
- DNNs are very sensitive to small changes in pixel values
- funny, but very worrying for security and industrial applications of CNNs
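A hedged PyTorch sketch of one classic attack of this family, the *fast gradient sign method* (FGSM): nudge each pixel by `\(\varepsilon\)` in the direction that increases the loss (the model, inputs and labels are assumed given):

```python
import torch

def fgsm(model, x, y, eps=0.01):
    """Return an adversarial copy of x for a given classifier."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    # One tiny, human-invisible step per pixel is often enough to flip the label
    return (x_adv + eps * x_adv.grad.sign()).detach()
```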
]

---

## Small focus on hardware
---
.left-column[
## Training
]
.right-column[
In addition to learning-rate and algorithmic improvements, modern Neural Nets work because of

- more data (millions of times more) and better backprop algorithms (automatic differentiation)
- progress in the secret sauce (initialization of weights, dropout, activation functions...)
- **clever use of GPUs** (in lieu of Moore's law): distributing many simple computations (a kind of vectorization), with NVIDIA setting the standards (with the CUDA language)
]

---
.left-column[
## Training
]
.right-column[
In addition, a chip war has already started, with old and new players:

- NVIDIA (with CUDA) vs. AMD (with OpenCL)
- most DL libraries support only CUDA, except Caffe (Apple), but would like to get rid of the NVIDIA dependency
- Microsoft/Intel (with FPGA) vs. Amazon/Xilinx/Baidu (with XPU)
- Google (Tensor Processing Unit)
- cloud-based services
]

---
.left-column[
## Training
]
.right-column[
Big AI players currently deploy strategic plans where training NNs differs from inference:

- Microsoft massively deploys Field Programmable Gate Arrays with Intel Nervana for cloud Deep Learning
- TPU/XPU (mixes of CPU/GPU/FPGA) would go for inference!
]

---
.left-column[
## On the future of deep learning
]
.right-column[
Future, trends and challenges:

- much ado about **Generative Adversarial Networks** and **Autoencoders**
- for several years now, DNNs have been successfully integrated into Reinforcement Learning (Deep Q-learning)
- safety-critical ML issues (adversarial attacks, uncertainty...), power-efficient versions...
- proving things about DNNs (data coverage with TDA, formal methods for robustness evaluation...)
- reconciling physics-based, expert-knowledge reasoning with the predictive power of Deep Learning: **Bayesian Deep Learning**
]

---
.left-column[
## Some applications for aerospace engineering
]
.right-column[
Every field is experiencing the Deep Learning hype:

- Aerodynamics: automatic tuning of turbulence parameters, calibration, predicting the flow for new geometries...
- Material design: constitution-to-properties prediction, material discovery
- Inverse design: calibration, uncertainty propagation (probabilistic programming)
]

---

## Frameworks, Tools and resources
---
.left-column[
## Deep Learning at home
]
.right-column[
### Technical solutions:

- **Python**: Theano (Montreal), Tensorflow (Google), Keras (Google), Caffe and PyTorch (Facebook), CNTK (Microsoft), MXNET (Amazon), Paddle (Baidu)
- **R**: Keras (Google)
- **Javascript**: deeplearning.js, e.g. [Teachable machines](https://teachablemachine.withgoogle.com/)
]

---
.left-column[
## DL Frameworks
]
.right-column[
### Deep Learning frameworks:

- allow building neural networks with complicated architectures
- all optimized for performance (low-level libraries written in C/C++, wrapped to be called from a high-level API)
- community support (continuous development)
- GPU acceleration and massive parallelization
- **automatic differentiation**
]

---
.left-column[
## DL Frameworks
]
.right-column[
### Computational graphs and AD:

- the user defines a computational graph as a composition of transformations (the flow) of the data
- this graph is optimized for execution and for gradient computation (automatic differentiation); there are usually many more parameters than outputs, hence the reverse (backward) mode is the one implemented
- define-and-run (static graph) vs. the more general define-by-run (dynamic graph)
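Reverse-mode AD in a few lines, sketched here with PyTorch: the graph recorded during the forward pass is traversed backwards to get all gradients at once:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()   # forward pass records the computational graph
loss.backward()         # reverse mode: one backward pass, all gradients
print(w.grad)           # tensor([2., 4.]) = dloss/dw
```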
]

---
.left-column[
## DL Frameworks
]
.right-column[
### Tensorflow:

- static computational graph: "build the graph once and run it many times over distributed machines"
- developed by Google (for dev and production)
- massive community and side projects (mobile, tensorflow-probability)
]

---
.left-column[
## DL Frameworks
]
.right-column[
### PyTorch:

- dynamic computational graph: "build the graph anew at each forward pass" (see the sketch below)
- developed by Facebook (PyTorch + Caffe2 for production)
- massive community and side projects (pyro)
- more flexible, research-oriented
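Define-by-run in practice, as a sketch: ordinary Python control flow can change the graph at every forward pass (the random depth is purely illustrative):

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

    def forward(self, x):
        # The graph is rebuilt on every call: a random number of layers
        for _ in range(torch.randint(1, 4, (1,)).item()):
            x = torch.relu(self.layer(x))
        return x

out = DynamicNet()(torch.randn(2, 4))
```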
]

---
.left-column[
## Deep Learning at home
]
.right-column[
### Resources:

- introduction videos: [Part 1](https://www.youtube.com/watch?v=aircAruvnKk) and [Part 2](https://www.youtube.com/watch?v=IHZwWFHWa-w)
- benchmark of different libraries on MNIST: [github link](https://github.com/TheoLvs/machine-learning/blob/master/1.%20Computer%20Vision/2.%20MNIST%20Recognition.ipynb)
- Stanford course CS231n: [Convolutional Neural Networks for Visual Recognition](http://cs231n.github.io/)
- [Fast.ai course](http://course.fast.ai/part2.html)
- [Deeplearning.ai course on Coursera](https://www.coursera.org/specializations/deep-learning)
- [Google course](https://classroom.udacity.com/courses/ud730)
]

---

## That's all folks!
---