Granville V. Synthetic Data & Generative AI 2024 | 43.9 MB
Synthetic Data and Generative AI covers the foundations of Machine Learning, with modern approaches to solving complex problems and the systematic generation and use of synthetic data. Emphasis is on scalability, automation, testing, optimizing, and interpretability (explainable AI). For instance, regression techniques – including logistic and Lasso – are presented as a single method, without using advanced linear algebra. Confidence regions and prediction intervals are built using parametric bootstrap, without statistical models or probability distributions. Models (including generative models and mixtures) are mostly used to create rich synthetic data to test and benchmark various methods.
Emphasizes numerical stability and performance of algorithms (computational complexity)
Focuses on explainable AI/interpretable Machine Learning, with heavy use of synthetic data and generative models, a new trend in the field
Includes a new, easier construction of confidence regions, without statistics, and a simple alternative to the powerful, well-known XGBoost technique
Covers automation of data cleaning, favoring easier solutions when possible
Includes chapters dedicated fully to synthetic data applications: fractal-like terrain generation with the diamond-square algorithm, and synthetic star clusters evolving over time and bound by gravity
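For readers unfamiliar with the diamond-square algorithm mentioned above, the following is a minimal sketch of the textbook version, not the book's own code. The function name `diamond_square` and the parameters `n`, `roughness`, and `seed` are illustrative assumptions. The grid has side 2^n + 1; each pass fills square centers (diamond step), then edge midpoints (square step), and shrinks the random perturbation.

```python
import numpy as np

def diamond_square(n, roughness=0.5, seed=0):
    """Return a (2**n + 1) x (2**n + 1) height map (hypothetical helper)."""
    size = 2**n + 1
    rng = np.random.default_rng(seed)
    h = np.zeros((size, size))
    # Seed the four corners with random heights.
    h[0, 0], h[0, -1], h[-1, 0], h[-1, -1] = rng.uniform(-1, 1, 4)
    step, scale = size - 1, 1.0
    while step > 1:
        half = step // 2
        # Diamond step: center of each square = mean of its 4 corners + noise.
        for i in range(half, size, step):
            for j in range(half, size, step):
                avg = (h[i - half, j - half] + h[i - half, j + half]
                       + h[i + half, j - half] + h[i + half, j + half]) / 4
                h[i, j] = avg + rng.uniform(-scale, scale)
        # Square step: each edge midpoint = mean of its in-grid neighbors + noise.
        for i in range(0, size, half):
            for j in range((i + half) % step, size, step):
                total, count = 0.0, 0
                for di, dj in ((-half, 0), (half, 0), (0, -half), (0, half)):
                    a, b = i + di, j + dj
                    if 0 <= a < size and 0 <= b < size:
                        total += h[a, b]
                        count += 1
                h[i, j] = total / count + rng.uniform(-scale, scale)
        step = half
        scale *= roughness  # smaller perturbations at finer resolutions
    return h

terrain = diamond_square(5, roughness=0.5, seed=42)
```

A lower `roughness` value gives smoother terrain, since the noise amplitude decays faster as the grid is refined.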
Chapter 1: Machine learning cloud regression and optimization: This chapter is not about regression performed in the cloud. It is about considering your data set as a cloud of points or observations, where the concepts of dependent and independent variables (the response and the features) are blurred. It is a very general type of regression, offering backward compatibility with existing methods. Treating a variable as the response amounts to setting a constraint on the multivariate parameter, and results in an optimization algorithm with Lagrange multipliers. The originality comes from unifying and bringing under the same umbrella a number of disparate methods, each solving a part of the general problem and originating from various fields. I also propose a novel approach to logistic regression, and a generalized R-squared adapted to shape fitting, model fitting, feature selection, and dimensionality reduction. In one example, I show how the technique can perform unsupervised clustering, with confidence regions for the cluster centers obtained via parametric bootstrap. Besides ellipse fitting and its importance in computer vision, an interesting application is a nonperiodic sum of periodic time series. While rarely discussed in machine learning circles, such models explain many phenomena, for instance, ocean tides. They are particularly useful in time-continuous situations where the error is not white noise, but instead smooth and continuous everywhere, for instance, in granular temperature forecasts. Another curious application is modeling meteorite shapes. Finally, my methodology is model-free and data-driven, with a focus on numerical stability. Prediction intervals and confidence regions are obtained via bootstrapping. I provide Python code and synthetic data generators for replication purposes.
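The idea that choosing a response amounts to a constraint on the parameter can be illustrated with a small sketch. This is one possible reading, not the author's implementation: the function names `fit_unit_norm` and `fit_response` and the specific constraints are assumptions. Both functions minimize ||Xθ||² over a point cloud X. With the constraint ||θ|| = 1, the Lagrangian condition XᵀXθ = λθ makes the solution the eigenvector with the smallest eigenvalue (no variable plays the role of response). With the constraint θ_k = -1, the problem reduces to ordinary least squares with column k as the response.

```python
import numpy as np

def fit_unit_norm(X):
    """Minimize ||X @ theta||^2 subject to ||theta|| = 1 (hypothetical helper)."""
    # Lagrange condition: X'X theta = lambda * theta; eigh sorts eigenvalues
    # in ascending order, so column 0 is the minimizer.
    _, eigvecs = np.linalg.eigh(X.T @ X)
    return eigvecs[:, 0]

def fit_response(X, k):
    """Minimize ||X @ theta||^2 subject to theta[k] = -1 (column k is the response)."""
    others = [j for j in range(X.shape[1]) if j != k]
    # With theta[k] = -1, X @ theta = X[:, others] @ coef - X[:, k]:
    # an ordinary least-squares problem.
    coef, *_ = np.linalg.lstsq(X[:, others], X[:, k], rcond=None)
    theta = np.empty(X.shape[1])
    theta[others] = coef
    theta[k] = -1.0
    return theta

# Synthetic cloud of points scattered around the line y = 2x (no intercept).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.1, size=500)
X = np.column_stack([x, y])

theta_u = fit_unit_norm(X)      # symmetric in x and y, defined up to sign
theta_r = fit_response(X, k=1)  # y treated as the response

slope_u = -theta_u[0] / theta_u[1]
slope_r = theta_r[0]
```

Both slopes recover a value close to 2 on this low-noise data; they differ more as the noise grows, because the unit-norm fit measures errors orthogonally to the line rather than along the response axis.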