Supervised learning: tagging. http://stanford.io/2nRlxxp
- Training with labelled (tagged) data so the model can predict future events. Example: train a Raspberry Pi so it can recognise bird images captured with the camera.
Reinforcement learning (not to be confused with semi-supervised learning, which mixes labelled and unlabelled data).
- It does not require labelled training data, but a lot of trial and error instead.
Unsupervised learning: discovering patterns in unlabelled data
- It is all about clustering data and inferring relationships.
- k-Means clustering
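
A minimal k-means sketch in plain Python, to make the "clustering" idea concrete. The points and the initial centroids are made up for illustration; in practice a library implementation (for example the one in scikit-learn) would normally be used instead.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; `centroids` holds the initial guesses."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep as is
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

No labels are given anywhere: the grouping comes only from the distances between points.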

Deep Learning (i.e. neural networks) http://stanford.io/2BsQ91Q
- Layers: input, hidden, output. There is also a bias input (a constant offset fed into the hidden layers)

Reinforcement Learning: BEYOND SELF SUPERVISION TODO

Besides training a model from scratch there is also transfer learning: reusing existing models.

Tips and tricks: http://stanford.io/2MEHwFM

For model complexity
- low: high bias (underfits, e.g. a flat line)
- high: a lot of variance (fits the training data too closely, which is not good either)
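
A small sketch of the two extremes, with a straight-line fit as the middle ground. Everything here is an illustrative assumption: the data (y = 2x with a fixed +1/-1 disturbance on the training points), the three toy models, and their names.

```python
def mse(model, data):
    """Mean squared error of a model (a function x -> y) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Hypothetical data: y = 2x, with a fixed +1/-1 "noise" on the training set only.
train = [(float(x), 2.0 * x + (1.0 if x % 2 == 0 else -1.0)) for x in range(10)]
test = [(x + 0.25, 2.0 * (x + 0.25)) for x in range(9)]

mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)

def flat(x):
    """Low complexity: a flat line at the mean of y (high bias)."""
    return mean_y

def memorise(x):
    """High complexity: return the y of the nearest training point (high variance)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Middle ground: a least-squares straight line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

for name, model in [("flat", flat), ("memorise", memorise), ("line", line)]:
    print(f"{name:9s} train={mse(model, train):6.2f} test={mse(model, test):6.2f}")
```

The flat line is bad on both sets; the memorising model is perfect on the training set but worse on unseen points, because it has also learned the noise.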

BOOK O'Reilly: 'Applied Machine Learning and AI for Engineers', Jeff Prosise, github « book
- /Documents/PLURALSIGHT/datascience/Applied-Machine-Learning-main
- source /Users/santosj/Documents/PLURALSIGHT/datascience/bin/activate
- jupyter notebook /Users/santosj/Documents/PLURALSIGHT/datascience/Applied-Machine-Learning-main
https://github.com/javier-antich/ml4nce/blob/main/UC2/UC2-multivariate-outlier-detection.ipynb

Future reading: Machine Learning for Network and Cloud Engineers
O'Reilly: Machine Learning with scikit-learn, David Mertz, github
- http://localhost:8888/notebooks/WhatIsML.ipynb
O'Reilly: Wide overview by Rob Barton, Jerome Henry : ml_fundamentals.pfd.pdf « DONE
O'Reilly: models and misfits by Dr Mark Fenner : ml_models_misfits_drmarkfenner.pdf ; https://github.com/mfenner1/mlwpy_live « DONE

Managing datasets with pandas and scikit-learn
- Link

Convolution studies how one shape is modified by another.
CNN layer pattern: conv, ReLU, conv, ReLU, conv …
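
A 1-D sketch of one "conv then ReLU" stage. The signal and the kernel are made-up values; real CNNs work on 2-D images with learned kernels, and what deep learning libraries call convolution is technically cross-correlation (the kernel is not flipped), which is what is shown here.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D convolution as used in CNNs: slide the kernel, sum the products."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """ReLU keeps positive responses and zeroes out the negative ones."""
    return [max(0.0, v) for v in values]

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
edge_kernel = [-1.0, 1.0]  # hypothetical filter: responds to rising edges

feature_map = relu(conv1d(signal, edge_kernel))
print(feature_map)  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

The kernel fires where the signal steps up; the falling edge produces a negative response that ReLU removes. Stacking more stages means feeding `feature_map` into the next `conv1d`.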

Classical Training Steps for Neural Networks

Training a neural network is a structured, iterative process that allows the model to learn patterns and relationships in data. The goal is to optimise the model’s internal parameters (weights and biases) so it can make accurate predictions. This process consists of several key steps that repeat over many cycles (epochs), each refining the model’s understanding. The major steps are forward propagation, loss calculation, backpropagation with gradient calculation, and weight updates via gradient descent. Below is a detailed breakdown of each step:

1. Forward Propagation

In forward propagation, input data is passed through the layers of the neural network to produce an output or prediction.

The process begins at the input layer, where raw data (such as numerical features, pixel values, or word embeddings) enters the network. This data is then processed through one or more hidden layers, each consisting of multiple neurons.

Each neuron in a layer applies a weighted sum to the inputs it receives from the previous layer, adds a bias term, and then applies an activation function (such as ReLU, sigmoid, or tanh) to introduce non-linearity. This non-linear transformation allows the network to model complex patterns and relationships in the data, which would not be possible with simple linear transformations.

The transformed data continues to propagate through the layers until it reaches the output layer, which produces the final prediction. In a classification task, the output might be a set of probabilities indicating the likelihood of each class, while in a regression task, it could be a continuous value.
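
A minimal sketch of forward propagation through a tiny 2-3-1 network. All weights, biases and the input are arbitrary numbers chosen for illustration; the helper name `dense` is also an assumption, not a library function.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: each neuron computes activation(weighted sum of inputs + bias)."""
    return [activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

# Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 1 output.
x = [1.0, 2.0]
hidden_w = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
hidden_b = [0.0, -1.0, 0.5]
out_w = [[1.0, -1.0, 0.5]]
out_b = [0.1]

hidden = dense(x, hidden_w, hidden_b, relu)        # hidden layer with ReLU
prediction = dense(hidden, out_w, out_b, sigmoid)  # output as a probability
print(hidden, prediction)
```

The sigmoid on the output layer squeezes the result into (0, 1), which suits a binary classification task; a regression network would typically leave the output without that squashing.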

2. Loss Calculation

After forward propagation, the network has produced an output, but it still needs to know how accurate that output is. This is done through loss calculation, where the prediction is compared to the actual target value (also known as the ground truth or label).

A loss function is used to quantify the difference between the predicted output and the true value. The choice of loss function depends on the type of problem:

- For regression tasks, common loss functions include Mean Squared Error (MSE), which penalises larger errors more heavily.
- For classification tasks, Cross-Entropy Loss is commonly used, which measures the difference between two probability distributions (the predicted probabilities and the true labels).

The loss function outputs a scalar value that represents how far off the model’s prediction was. A lower loss indicates better performance for that sample, while a higher loss signals a poor prediction.
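
A sketch of the two loss functions mentioned above, on made-up predictions. The cross-entropy version assumes a single sample with a one-hot label, in which case the loss reduces to minus the log of the probability given to the true class.

```python
import math

def mse(predictions, targets):
    """Mean Squared Error: squaring makes large errors count much more."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_class):
    """Cross-entropy for one sample with a one-hot label: -log(p[true_class])."""
    return -math.log(probabilities[true_class])

# Regression: an error of 0.5 contributes 0.25, an error of 2.0 contributes 4.0.
regression_loss = mse([2.5, 0.0], [3.0, 2.0])

# Classification: the true class is index 0 in both cases.
confident_right = cross_entropy([0.90, 0.05, 0.05], true_class=0)
confident_wrong = cross_entropy([0.05, 0.90, 0.05], true_class=0)

print(regression_loss, confident_right, confident_wrong)
```

Both functions return a single scalar, and the confidently wrong prediction gets a far higher loss than the confidently right one.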

3. Backpropagation and Gradient Calculation

Once the loss is calculated, the network must determine how to adjust its parameters (weights and biases) to reduce this error in future predictions. This adjustment process requires knowing how sensitive the loss is to each parameter. This is achieved through backpropagation.

Backpropagation is an algorithm that computes the gradients of the loss function with respect to each parameter in the network. It applies the chain rule of calculus to systematically calculate these gradients by moving backwards through the network, from the output layer to the input layer.

For each neuron and its parameters:

- It calculates how much a small change in that parameter would affect the loss.
- This gradient tells the network whether increasing or decreasing that parameter would reduce the loss.

The result of backpropagation is a collection of gradients for all weights and biases, which provide the direction and magnitude of change needed to improve the model’s performance.
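
A sketch of the chain rule on the smallest possible "network": one sigmoid neuron with a squared-error loss. The numbers are arbitrary, and the finite-difference check at the end is only there to confirm that the analytic gradient is right.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw (and likewise for b)."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)     # derivative of the squared error
    da_dz = a * (1.0 - a)     # derivative of the sigmoid
    return dL_da * da_dz * x, dL_da * da_dz * 1.0  # dz/dw = x, dz/db = 1

w, b, x, y = 0.5, -0.2, 1.5, 1.0   # hypothetical parameter, input and target
grad_w, grad_b = gradients(w, b, x, y)

# Numerical check: nudge w slightly and watch how the loss changes.
eps = 1e-6
numeric_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(grad_w, numeric_w)
```

Here `grad_w` is negative: the prediction is below the target, so increasing `w` would reduce the loss. In a multi-layer network the same product of local derivatives is extended layer by layer, moving backwards.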

4. Weight Update (Gradient Descent)

With the gradients calculated, the network updates its parameters to reduce the loss. This is done using an optimisation algorithm, most commonly gradient descent.

In gradient descent, each parameter is updated by moving it slightly in the opposite direction of its gradient (because gradients point towards the direction of increasing loss). The learning rate is a hyperparameter that determines how large these updates are; if it's too large, the model might overshoot the optimal values, while if it's too small, training might be very slow.
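
A sketch of the effect of the learning rate, using the simplest possible loss, loss(w) = w², whose gradient is 2w and whose minimum is at w = 0. The starting point and the three learning rates are arbitrary choices for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=5.0):
    """Minimise loss(w) = w**2 by repeatedly stepping against the gradient 2*w."""
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient   # move opposite to the gradient
    return w

good = gradient_descent(0.1)     # each step multiplies w by 0.8
tiny = gradient_descent(0.001)   # each step multiplies w by 0.998: very slow
huge = gradient_descent(1.1)     # each step multiplies w by -1.2: overshoots

print(good, tiny, huge)
```

After the same 20 steps, the moderate rate is close to the minimum, the tiny rate has barely moved from 5.0, and the overly large rate has jumped back and forth across the minimum and ended up further away than where it started.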

There are several variations of gradient descent:

- Stochastic Gradient Descent (SGD): updates parameters using one data point at a time.
- Mini-Batch Gradient Descent: updates parameters using small batches of data.
- Adam Optimiser: an adaptive learning rate method that adjusts the learning rate for each parameter individually based on past gradients.

This weight update step allows the network to improve its predictions over time. After the update, the training loop returns to forward propagation with the next data sample or batch, repeating the process over many iterations.

Summary

These four steps (forward propagation, loss calculation, backpropagation with gradient calculation, and weight updates) are repeated across thousands or millions of data samples over multiple epochs. This iterative process gradually fine-tunes the network’s parameters, allowing it to learn complex patterns and make accurate predictions.

As training progresses, the loss typically decreases, indicating that the model’s predictions are improving. Once the model reaches an acceptable level of performance, training can be stopped, and the network is ready for inference on unseen data.
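
The whole loop can be sketched with the simplest possible model, a single linear neuron trained with mini-batch gradient descent. The dataset (y = 3x + 1 without noise), the learning rate, the batch size and the number of epochs are all assumptions picked so that the example converges quickly.

```python
import random

random.seed(0)
data = [(i / 10, 3.0 * (i / 10) + 1.0) for i in range(20)]  # hypothetical: y = 3x + 1
w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 4
epoch_losses = []

for epoch in range(200):
    random.shuffle(data)
    total = 0.0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # 1. Forward propagation: compute predictions for the batch.
        preds = [w * x + b for x, _ in batch]
        # 2. Loss calculation: squared errors (averaged per epoch below).
        errors = [p - y for p, (_, y) in zip(preds, batch)]
        total += sum(e * e for e in errors)
        # 3. Gradients of the batch MSE with respect to w and b.
        grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = 2 * sum(errors) / len(batch)
        # 4. Weight update: step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    epoch_losses.append(total / len(data))

print(w, b, epoch_losses[0], epoch_losses[-1])
```

The loss recorded per epoch falls as training progresses, and `w` and `b` approach the values used to generate the data.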

AI HARDWARE - GPUs

AMD Instinct MI series
Amazon's Inferentia (for machine learning inference on AWS)
Google's TPUs (Tensor Processing Units, custom hardware for Google’s machine learning tasks)
Intel Gaudi (designed for deep learning training)
NVIDIA GPUs (e.g., A100, H100, used for training and inference in deep learning applications)
NVIDIA Tensor Cores (hardware feature within NVIDIA GPUs, optimized for mixed-precision AI workloads)

Current practical GPU models (it is important to check that Ollama supports them)
https://github.com/ollama/ollama/blob/main/docs/gpu.md

Nvidia H100
48 GB Nvidia RTX 6000 Ada graphics card

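Attention mechanism (essentially a formula that lets the model weigh how relevant each input token is to every other one, which makes models easier to train)
Transformer architecture (introduced by Google researchers in the 2017 paper 'Attention Is All You Need'; Hugging Face created the transformers library, not the architecture)
- transformers are built on the attention mechanism.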
  - precursor was tensorflow-hug
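
A sketch of the formula behind attention, in its common scaled dot-product form: softmax(Q·Kᵀ / sqrt(d))·V. The three 2-d token vectors are invented, and queries, keys and values are simply set to the same vectors here; a real transformer first passes the tokens through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention, computed one query at a time."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How well does this query match every key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output = weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical tokens, each represented by a 2-d vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
print(weights[0])
```

For the first token, the weights are highest for the tokens that point in a similar direction (itself and the third one) and lowest for the second token, which is orthogonal to it.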

PRACTICAL NOTES ON MODELS:

Models multiply matrices.
Those matrices are multi-dimensional: tensors
- They are made of weights and biases « When defining a model, weights and biases are generically called parameters.
- Eg: 100B (all the tensors' weights and biases, added together)
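
A sketch of how a parameter count is obtained, for a plain fully connected network. The layer sizes (784, 128, 10) are just an example; large language models use other layer types as well, but the principle of summing all weights and biases is the same.

```python
def count_parameters(layer_sizes):
    """Weights (inputs x outputs) plus biases (one per output) for each dense layer."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# Hypothetical network: 784 inputs -> 128 hidden neurons -> 10 outputs.
print(count_parameters([784, 128, 10]))  # 101770
```

A "100B" model is the same sum carried out over far more and far larger tensors.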
The HF transformers library is different from the transformer architecture. HF's is a framework for loading, training, fine-tuning, and deploying transformer models across NLP and vision tasks. It provides access to thousands of pretrained models, simplifies workflows with task-specific pipelines, and supports custom training on new datasets. Beyond downloading models, Transformers enables production-ready deployment with optimisations for diverse hardware.

HUGGINGFACE

models, datasets and prototypes
open-source and open-weight
We can download a pre-trained Llama (e.g. via ollama) and then fine-tune it.
- One of the reasons is so that it identifies patterns better (text, images…). A related concept is the embedding: embeddings capture the inherent properties and relationships of the original data in a condensed format and are often used in Machine Learning use cases. See Link « Better classification
  - embedding: phrases in » vectors out
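
A sketch of why "phrases in, vectors out" helps classification: phrases with similar meaning end up with vectors pointing in similar directions, which can be measured with cosine similarity. The 3-d vectors below are hand-made stand-ins; a real embedding model would produce them, usually with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
embeddings = {
    "the cat sat on the mat": [0.9, 0.1, 0.0],
    "a kitten rests on a rug": [0.8, 0.2, 0.1],
    "the BGP session flapped": [0.0, 0.1, 0.9],
}

cat = embeddings["the cat sat on the mat"]
similar = cosine_similarity(cat, embeddings["a kitten rests on a rug"])
unrelated = cosine_similarity(cat, embeddings["the BGP session flapped"])
print(similar, unrelated)
```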

Mixture of Experts (MoE)

A neural network design that activates only a few specialised sub-models (experts) per input, based on a gating mechanism. This allows models to scale to massive sizes efficiently, improving performance while reducing compute costs by avoiding the need to use the entire model every time.
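
A toy sketch of top-k gating. The four "experts" are trivial functions and the gate scores are passed in by hand; in a real MoE layer the experts are neural sub-networks and the scores come from a learned gating network, so every name here is hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical experts: stand-ins for specialised sub-models.
experts = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x ** 2,
    lambda x: -x,
]

def moe_forward(x, gate_scores, top_k=2):
    """Run only the top_k experts and mix their outputs by renormalised gate weight."""
    chosen = sorted(range(len(experts)),
                    key=lambda i: gate_scores[i], reverse=True)[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen

output, chosen = moe_forward(3.0, gate_scores=[2.0, 2.0, -1.0, 0.5])
print(output, chosen)  # 5.0 [0, 1]
```

Only two of the four experts are evaluated for this input, which is where the compute saving comes from: the total parameter count grows with the number of experts, but the work per input does not.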
