[[https://camarreal.duckdns.org/doku.php?id=network_stuff:machine_learning|ML]]  ;  [[https://camarreal.duckdns.org/doku.php?id=network_stuff:machine_learning:networking|network-for-ML-workload]]

__NOTES ABOUT MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE (AI)__
\\
\\
  * **Supervised learning**: tagging. [[http://stanford.io/2nRlxxp]]
    * Training with all the data, tagging it so the model can predict future events. Example: train a Raspberry Pi so it can recognise bird images captured with the camera.
  * **Semi-supervised learning**: reinforcement learning.
    * It does not require labelled training data, but a lot of trial and error instead.
    * ''source /Users/santosj/Documents/PLURALSIGHT/datascience/bin/activate''
    * ''jupyter notebook /Users/santosj/Documents/PLURALSIGHT/datascience/Applied-Machine-Learning-main''
  * [[https://github.com/javier-antich/ml4nce/blob/main/UC2/UC2-multivariate-outlier-detection.ipynb]]
  
  
  * Convolution studies how one shape is modified by another.
  * A typical CNN stacks convolution and ReLU layers: conv, ReLU, conv, ReLU, ... (see the sketch below).
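A minimal sketch of that conv/ReLU stacking, using PyTorch; the framework choice, layer sizes and class count are assumptions, not taken from these notes.
<code python>
# Minimal conv -> ReLU -> conv -> ReLU stack (sizes are illustrative)
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # collapse spatial dimensions
    nn.Flatten(),
    nn.Linear(32, 10),         # e.g. 10 output classes (hypothetical)
)

x = torch.randn(1, 3, 64, 64)  # one fake RGB image, 64x64 pixels
print(cnn(x).shape)            # torch.Size([1, 10])
</code>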


----
====== Classical Training Steps for Neural Networks ======

Training a neural network is a structured, iterative process that allows the model to learn patterns and relationships in data. The goal is to optimise the model’s internal parameters (weights and biases) so it can make accurate predictions. This process consists of several key steps that repeat over many cycles (epochs), each refining the model’s understanding. The major steps are **forward propagation**, **loss calculation**, **backpropagation with gradient calculation**, and **weight updates via gradient descent**. Below is a detailed breakdown of each step:

===== 1. Forward Propagation =====
In forward propagation, input data is passed through the layers of the neural network to produce an output or prediction.

The process begins at the **input layer**, where raw data (such as numerical features, pixel values, or word embeddings) enters the network. This data is then processed through one or more **hidden layers**, each consisting of multiple **neurons**.

Each neuron in a layer applies a **weighted sum** to the inputs it receives from the previous layer, adds a **bias term**, and then applies an **activation function** (such as ReLU, sigmoid, or tanh) to introduce non-linearity. This non-linear transformation allows the network to model complex patterns and relationships in the data, which would not be possible with simple linear transformations.

The transformed data continues to propagate through the layers until it reaches the **output layer**, which produces the final prediction. In a classification task, the output might be a set of probabilities indicating the likelihood of each class, while in a regression task, it could be a continuous value.
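A minimal NumPy sketch of one forward pass; the layer sizes, the ReLU hidden layer and the softmax output are illustrative assumptions, not taken from these notes.
<code python>
# Forward propagation through one hidden layer: weighted sum + bias + activation
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                        # input features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # hidden layer parameters
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)    # output layer parameters (3 classes)

h = np.maximum(0, W1 @ x + b1)                   # hidden layer: weighted sum + bias, then ReLU
logits = W2 @ h + b2                             # output layer: raw scores
probs = np.exp(logits) / np.exp(logits).sum()    # softmax -> class probabilities
print(probs)
</code>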

===== 2. Loss Calculation =====
After forward propagation, the network has produced an output, but it still needs to know how accurate that output is. This is done through **loss calculation**, where the prediction is compared to the **actual target value** (also known as the ground truth or label).

A **loss function** is used to quantify the difference between the predicted output and the true value. The choice of loss function depends on the type of problem:
  * For **regression tasks**, common loss functions include **Mean Squared Error (MSE)**, which penalises larger errors more heavily.
  * For **classification tasks**, **Cross-Entropy Loss** is commonly used, which measures the difference between two probability distributions (the predicted probabilities and the true labels).

The loss function outputs a **scalar value** that represents how far off the model’s prediction was. A lower loss indicates better performance for that sample, while a higher loss signals a poor prediction.
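A NumPy sketch of the two loss functions mentioned above, on made-up predictions and targets.
<code python>
# Loss functions: MSE for regression, cross-entropy for classification
import numpy as np

# Regression: Mean Squared Error
y_true = np.array([2.0, 3.5, 5.0])
y_pred = np.array([2.5, 3.0, 4.0])
mse = np.mean((y_true - y_pred) ** 2)    # larger errors are penalised quadratically

# Classification: cross-entropy between true labels and predicted probabilities
p_true = np.array([0.0, 1.0, 0.0])       # one-hot ground truth (class 1)
p_pred = np.array([0.1, 0.7, 0.2])       # predicted probability distribution
cross_entropy = -np.sum(p_true * np.log(p_pred))

print(mse, cross_entropy)                # both are single scalar values
</code>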

===== 3. Backpropagation and Gradient Calculation =====
Once the loss is calculated, the network must determine how to adjust its **parameters** (weights and biases) to reduce this error in future predictions. This adjustment process requires knowing how sensitive the loss is to each parameter. This is achieved through **backpropagation**.

**Backpropagation** is an algorithm that computes the **gradients** of the loss function with respect to each parameter in the network. It applies the **chain rule of calculus** to systematically calculate these gradients by moving backwards through the network, from the **output layer** to the **input layer**.

For each neuron and its parameters:
  * It calculates how much a small change in that parameter would affect the loss.
  * This gradient tells the network whether increasing or decreasing that parameter would reduce the loss.

The result of backpropagation is a collection of gradients for all weights and biases, which provide the direction and magnitude of change needed to improve the model’s performance.
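A hand-rolled chain-rule sketch for a single sigmoid neuron with squared-error loss; values are made up, and real frameworks compute these gradients automatically.
<code python>
# Backpropagation on a single neuron: chain rule by hand
import numpy as np

x, y_true = np.array([1.0, 2.0]), 1.0    # one sample and its target
w, b = np.array([0.5, -0.3]), 0.1        # parameters to be learned

z = w @ x + b                            # weighted sum + bias
y_pred = 1 / (1 + np.exp(-z))            # sigmoid activation
loss = (y_pred - y_true) ** 2            # squared error for this sample

# Chain rule: dL/dw = dL/dy_pred * dy_pred/dz * dz/dw
dL_dy = 2 * (y_pred - y_true)
dy_dz = y_pred * (1 - y_pred)            # derivative of the sigmoid
grad_w = dL_dy * dy_dz * x               # gradient w.r.t. each weight
grad_b = dL_dy * dy_dz                   # gradient w.r.t. the bias
print(grad_w, grad_b)                    # sign and size say how to change each parameter
</code>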

===== 4. Weight Update (Gradient Descent) =====
With the gradients calculated, the network updates its parameters to reduce the loss. This is done using an **optimisation algorithm**, most commonly **gradient descent**.

In **gradient descent**, each parameter is updated by moving it slightly in the **opposite direction of its gradient** (because gradients point towards the direction of increasing loss). The **learning rate** is a hyperparameter that determines how large these updates are; if it is too large, the model might overshoot the optimal values, while if it is too small, training might be very slow.

There are several variations of gradient descent:
  * **Stochastic Gradient Descent (SGD)**: updates parameters using one data point at a time.
  * **Mini-Batch Gradient Descent**: updates parameters using small batches of data.
  * **Adam Optimiser**: an adaptive learning rate method that adjusts the learning rate for each parameter individually based on past gradients.

This weight update step allows the network to improve its predictions over time. After the update, the training loop returns to forward propagation with the next data sample or batch, repeating the process over many iterations.
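A sketch of one plain gradient-descent update; the gradient values and learning rate are illustrative assumptions.
<code python>
# One gradient-descent update step
import numpy as np

w = np.array([0.5, -0.3])
b = 0.1
grad_w = np.array([0.12, 0.24])   # gradients from backpropagation (made-up values)
grad_b = 0.12
learning_rate = 0.01              # hyperparameter controlling the step size

# Move each parameter against its gradient to reduce the loss
w = w - learning_rate * grad_w
b = b - learning_rate * grad_b
print(w, b)
</code>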

===== Summary =====
These four steps (**forward propagation**, **loss calculation**, **backpropagation with gradient calculation**, and **weight updates**) are repeated across thousands or millions of data samples over multiple **epochs**. This iterative process gradually fine-tunes the network’s parameters, allowing it to learn complex patterns and make accurate predictions.

As training progresses, the loss typically decreases, indicating that the model’s predictions are improving. Once the model reaches an acceptable level of performance, training can be stopped, and the network is ready for inference on unseen data.
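Putting the four steps together: a minimal training loop on synthetic data, using a single linear neuron with MSE loss. The data, learning rate and epoch count are illustrative assumptions, not something from these notes.
<code python>
# The four steps combined into a training loop
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # 100 samples, 2 features
y = X @ np.array([2.0, -1.0]) + 0.5           # synthetic targets the model should learn

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(50):                       # repeat over many epochs
    y_pred = X @ w + b                        # 1. forward propagation
    loss = np.mean((y_pred - y) ** 2)         # 2. loss calculation (MSE)
    grad_w = 2 * X.T @ (y_pred - y) / len(y)  # 3. gradients via the chain rule
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w                          # 4. weight update (gradient descent)
    b -= lr * grad_b
print(loss, w, b)                             # loss shrinks; w, b approach (2, -1) and 0.5
</code>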
  
  
  * NVIDIA GPUs (e.g., A100, H100, used for training and inference in deep learning applications)
  * NVIDIA Tensor Cores (hardware feature within NVIDIA GPUs, optimized for mixed-precision AI workloads)

Current practical GPU models (it is important to check that they are supported by Ollama)\\
[[https://github.com/ollama/ollama/blob/main/docs/gpu.md]]
  * Nvidia H100
  * 48 GB Nvidia RTX 6000 Ada graphics card


----

  * Attention mechanism (a formula that lets a model weigh how relevant each input token is to the others, which makes it easier to train effective models; see the sketch below)
  * Transformer architecture (introduced in Google's "Attention Is All You Need" paper; Hugging Face later built its Transformers library around it)
    * Transformers are built on the attention mechanism.
      * A precursor for sharing pretrained models was TensorFlow Hub.
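A minimal sketch of scaled dot-product attention, the formula at the core of transformers (PyTorch; the sequence length and dimensions are illustrative assumptions).
<code python>
# Scaled dot-product attention
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 16              # illustrative sizes
q = torch.randn(seq_len, d_model)     # queries
k = torch.randn(seq_len, d_model)     # keys
v = torch.randn(seq_len, d_model)     # values

scores = q @ k.T / d_model ** 0.5     # similarity of every token with every other token
weights = F.softmax(scores, dim=-1)   # each row sums to 1: "how much to attend"
output = weights @ v                  # weighted mix of the values
print(output.shape)                   # torch.Size([5, 16])
</code>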


----
PRACTICAL NOTES ON MODELS:
  * Models multiply matrices.
  * Those matrices are multi-dimensional: ''tensors''.
    * They are made of ''weight'' and ''bias'' values << when defining a model, weights and biases are collectively called ''parameters''.
    * E.g.: 100B parameters (all the tensors' biases and weights added together). See the counting sketch below.
  * The HF Transformers library is different from the transformer architecture. HF's is a framework for loading, training, fine-tuning, and deploying transformer models across NLP and vision tasks. It provides access to thousands of pretrained models, simplifies workflows with task-specific pipelines, and supports custom training on new datasets. Beyond downloading models, Transformers enables production-ready deployment with optimizations for diverse hardware.
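A sketch of counting a model's parameters (all weight and bias tensors added together) with the HF Transformers library; ''distilbert-base-uncased'' is just an example model, not one referenced in these notes, and it is downloaded on first use.
<code python>
# Counting a model's parameters with Hugging Face Transformers
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
n_params = sum(p.numel() for p in model.parameters())   # add up every weight and bias tensor
print(f"{n_params:,} parameters")                        # ~66 million here; LLMs reach 100B+
</code>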
=== HUGGINGFACE ===

  * Models, datasets and prototypes.
  * Open-source and open-weight.
  * We can download a pre-trained Llama via Ollama and then fine-tune it.
    * One of the reasons is so it identifies patterns better (text, images...). The condensed vector representation used for this is called an ''embedding'' (embeddings capture the inherent properties and relationships of the original data in a condensed format and are often used in Machine Learning use cases; see [[https://medium.com/kx-systems/vector-embedding-101-the-new-building-blocks-for-generative-ai-a5f598a806ba|Link]]) << better **classification**.
      * Embedding: phrases in >> vectors out (see the sketch below).
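A minimal phrases-in, vectors-out sketch; the ''sentence-transformers'' library and the ''all-MiniLM-L6-v2'' model are example choices not mentioned in these notes, and the model is downloaded on first use.
<code python>
# Embeddings: phrases in >> vectors out
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small pre-trained embedding model
vectors = model.encode(["a bird on a branch", "router interface counters"])
print(vectors.shape)                              # (2, 384): one 384-dimensional vector per phrase
</code>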

----
=== Mixture of Experts (MoE) ===

A neural network design that activates only a few specialised sub-models (experts) per input, based on a gating mechanism. This allows models to scale to massive sizes efficiently, improving performance while reducing compute costs by avoiding the need to run the entire model every time.
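A toy sketch of the MoE idea (a gate routes each input to its top-k experts), using PyTorch; all sizes are illustrative assumptions, and real MoE layers add load-balancing details not shown here.
<code python>
# Toy Mixture of Experts layer: a gate picks the top-k experts per input
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_in=8, d_out=8, n_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)   # gating network: scores each expert
        self.top_k = top_k

    def forward(self, x):
        scores = self.gate(x)                                # (batch, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # keep only the best k experts
        weights = F.softmax(top_vals, dim=-1)
        out = torch.zeros(x.size(0), self.experts[0].out_features)
        for i, expert in enumerate(self.experts):
            mask = (top_idx == i)                            # which samples routed to expert i
            if mask.any():
                sample_idx, slot = mask.nonzero(as_tuple=True)
                out[sample_idx] += weights[sample_idx, slot].unsqueeze(-1) * expert(x[sample_idx])
        return out

moe = TinyMoE()
print(moe(torch.randn(3, 8)).shape)   # torch.Size([3, 8]); only 2 of 4 experts run per sample
</code>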
  
  
  