Data analysis

 Data analysis service

Change the way you relate to your data for good: obtain accurate predictions and uncover behavior patterns by applying machine learning algorithms to it, always with the technological solution best suited to your needs.

  • Data capture
  • Data cleansing and transformation
  • Tabular, textual, and computer vision scenarios
  • Supervised and unsupervised algorithms

Technology

Software

The analysis will be performed in Python, working in Jupyter (via Anaconda) and Visual Studio Code. The libraries used will be those required by the analysis and by the final model: some are common to every analysis (pandas, NumPy, Matplotlib's pyplot, seaborn, and so on), while others are specific to the problem (Keras on TensorFlow, LightGBM, XGBoost, etc.).

Hardware

Depending on the client's needs, the work will be carried out in the cloud (Azure) or on local computers. In the latter case, the project will include the exclusive use of a workstation whose specifications will depend on the project's complexity:

  • Intel Core i7-8700 CPU with 6 cores (12 virtual) at 3.70 GHz
  • 32 or 64 GB of RAM
  • Windows 10 64-bit operating system
  • NVIDIA GeForce GTX 1060 6 GB GPU (optional)
  • SSD disk

If specified, the trained model can be published as a web service for real-time prediction.

Phases of the analysis

The analysis to be carried out can be divided into the following phases:

  • Consulting, carried out together with the client's team and aimed at bringing domain knowledge (the "knowledge of the environment") into the analysis. This phase includes defining the objective of the analysis and identifying the variables involved and their interdependencies.
  • Exploratory Data Analysis (EDA), the statistical treatment of the data on which the analysis is based. This phase examines the "shape" of the data: the profile of each feature, the covariance between features, the outliers (values outside the usual range) and their impact on the data set, "invalid" data, and so on.
  • Data preprocessing, part of the set of processes known as ETL (data extraction, transformation, and loading). It covers the treatment of null values, repeated records, mutually inconsistent records, etc. All of these decisions will affect the behavior of the predictive model to be developed.
  • Feature engineering, the generation of the predictive features that will feed the model. Some of them will be extracted directly from the client's historical data, and others will be custom-built from the results of the EDA and the consulting phase.
  • Generation of a baseline predictive model. Once the consulting phase, the exploratory data analysis, the data preprocessing, and the feature engineering are complete, a baseline predictive model can be generated. In tabular (numerical) scenarios, the algorithms most likely to offer the best performance are usually gradient boosting algorithms (LightGBM or XGBoost). In text mining scenarios, both gradient boosting algorithms and neural networks can be good candidates. Finally, in computer vision or other complex scenarios, convolutional neural networks or more complex architectures are usually the only feasible option.
    This first baseline model will offer reasonable performance, but typically well short of what can be achieved after tuning.
  • Algorithm optimization and fine-tuning. Once the baseline model has been developed, an iterative cycle begins that includes not only improving the baseline predictive model but also testing new algorithms.
  • Model ensembling. During the previous phases, models based on different algorithms will have been generated. One way to improve the result is to apply some form of ensembling, which combines two or more models and generates a new prediction from them. The improvement in model performance obtained with this technique can be between 0.5% and 1%.
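
As a rough illustration of three of the phases above (data preprocessing, outlier inspection during EDA, and combining the predictions of several models), the following Python sketch uses pandas and NumPy. The function names, the column names, the toy data, the median imputation, the 1.5 × IQR rule, and the equal blending weights are all assumptions made for the example, not choices prescribed by the service; in a real project these decisions would come out of the consulting and EDA phases.

```python
import numpy as np
import pandas as pd


def pretreat(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate records and fill numeric nulls with the column median."""
    out = df.drop_duplicates().reset_index(drop=True)
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


def flag_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def blend(predictions, weights):
    """Weighted average of the predictions produced by several models."""
    return np.average(np.vstack(predictions), axis=0, weights=weights)


# Hypothetical toy data, for illustration only.
raw = pd.DataFrame({
    "units": [10.0, 12.0, 12.0, np.nan, 11.0, 95.0],
    "region": ["N", "S", "S", "N", "S", "N"],
})

clean = pretreat(raw)                                    # preprocessing
clean["is_outlier"] = flag_outliers_iqr(clean["units"])  # EDA: outlier check

# Hypothetical outputs of two already-trained models on the same two records.
model_a = np.array([0.2, 0.8])
model_b = np.array([0.4, 0.6])
blended = blend([model_a, model_b], weights=[0.5, 0.5])  # simple ensemble
```

Averaging is only one of several ways to combine models; the weights would normally be chosen on validation data rather than fixed by hand.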

Documentation

The documentation to be delivered includes:

  • EDA. The exploratory data analysis is delivered to the customer in Jupyter notebook format and in PDF format once this phase is completed.
  • Baseline predictive model and first prediction. After the development of the baseline predictive model, the corresponding code (including data treatment, feature engineering, model creation and training) and a first prediction are sent to the client. This prediction should serve only as a proof of concept, as its accuracy will still be too low.
  • Predictive model. This document, delivered at the end of the project in Jupyter notebook or Python (.py) format and also in PDF, includes the data cleaning and feature engineering code, as well as the code that generates the model, trains it, and makes a prediction. This code will allow the client to retrain the model in the future if they wish.
  • Predictive code. This document is generated from the previous predictive model, once again in Jupyter notebook or Python format and in PDF, and allows the customer to make new predictions in the future from the available data. It differs from the previous document in that it does not include the retraining code (because of the time it may take to execute), only the code for data processing, model generation, and prediction.
  • Analysis conclusions. This last document, completed after the end of the project, details the conclusions of the analysis and, where applicable, recommendations for the maintenance and improvement of the model.
