Take this short self-assessment quiz!
Finding structure in data and making predictions are the most important steps in Data Science. Statistical methods, in particular, are essential because they can handle many different analytical tasks. Statistics and algebra initially help you identify trends in data and do basic hypothesis testing. As you continue your journey, you realise that they comprise the ‘science’ in Data Science: they sit at the core of all machine learning algorithms and help you compare, assess and score different models. Here we give you a basic overview of the fundamental statistical concepts.
Probability:
Probability tells you how likely an event is to occur. Digging into the terminology of probability:
Conditional Probability [P(A|B)] is the likelihood of an event occurring, given that another event has already occurred.
Independent events are events whose outcome does not influence the probability of the outcome of another event; P(A|B) = P(A).
Mutually Exclusive events are events that cannot occur simultaneously; P(A|B) = 0.
Probability Distribution Functions
Continuous Data Distributions
Discrete Data Distributions
Data Types:
Measures of Variability
Bayes’ Theorem: a mathematical formula for determining conditional probability. “The probability of A given B is equal to the probability of B given A times the probability of A over the probability of B”, i.e. P(A|B) = P(B|A) × P(A) / P(B).
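As a quick illustration, here is a minimal Python sketch applying Bayes’ theorem to a hypothetical disease-testing example; all the probabilities are made-up numbers chosen only for illustration.

```python
# Hypothetical example: how likely is disease given a positive test?
p_d = 0.01               # P(disease)
p_pos_given_d = 0.95     # P(positive | disease)
p_pos_given_not_d = 0.05 # P(positive | no disease)

# Total probability of a positive test (law of total probability).
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # roughly 0.161, despite the "95% accurate" test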
Algebra:
Refresh your knowledge on algebra here
To show the relevance of linear algebra in the field of data science, we briefly go through two relevant applications.
Singular Value Decomposition (SVD)
The singular value decomposition (SVD) is a very important concept within the field of data science. Some important applications of the SVD are image compression and dimensionality reduction. Let us focus on the latter application here. Dimensionality reduction is the transformation of data from a high-dimensional space into a lower-dimensional space, in such a way that the most important information of the original data is still retained. This is desirable, since analysing the data can become computationally intractable once its dimension is too high.
The singular values can be used to understand the amount of variance explained by each of the singular vectors: the more variance a vector captures, the more information it accounts for. We can therefore limit the number of vectors we keep according to the amount of variance we wish to capture.
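To make this concrete, here is a minimal numpy sketch on hypothetical, randomly generated data with a low-rank structure; it keeps just enough singular vectors to capture roughly 90% of the variance. The data, the 90% threshold and the rank are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples, 50 features, with an underlying rank-5 structure plus noise.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(500, 50))
X = X - X.mean(axis=0)                      # centre the data

# Thin SVD: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Fraction of variance explained by each singular vector.
explained = S**2 / np.sum(S**2)

# Keep just enough singular vectors to capture ~90% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1

# Project the data onto the top-k right singular vectors (rank-k reduction).
X_reduced = X @ Vt[:k].T
print(k, X_reduced.shape)                   # k stays small thanks to the low-rank structure
```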
Principal Component Analysis (PCA)
Like the singular value decomposition, principal component analysis (PCA) is a technique for reducing dimensionality. The objective of PCA is to create new, uncorrelated variables, called the principal components, that maximize the captured variance. So, the idea is to reduce the dimensionality of the data set while preserving as much ‘variability’ (that is, information) as possible. This problem reduces to solving an eigenvalue-eigenvector problem.
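A minimal sketch with scikit-learn’s PCA on the same kind of hypothetical data, assuming scikit-learn is installed; note that n_components can be given as the fraction of variance you want to preserve.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data with a low-rank structure, as in the SVD sketch above.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(500, 50))

# Keep as many principal components as needed to explain 90% of the variance.
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                        # reduced representation
print(pca.explained_variance_ratio_)      # variance captured by each component
```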
Probability-based approximation question
Hold on there! You may think you don’t need this section, but we ask you to think twice.
Did you know that a data scientist spends 60-80% of their productive development time in cleaning and manipulating data?!
In this section, you'll meet your dear friends pandas and numpy, who will help you with basic data manipulation. Instead of going into theory, we'll take a practical approach and focus on the hacks that are crucial for different kinds of problems - regression, NLP and/or Deep Learning basics.
First, we'll understand the syntax and commonly used functions of the respective libraries. Later, we'll work on a real-life data set.
Click here to zoom through the data handling lens and learn data cleaning & manipulation thoroughly.
Note: This tutorial is best suited for people who know the basics of Python. No further knowledge is expected. Make sure you have Python installed on your laptop.
Hacks for you:
Categorical vs. Continuous variable:
RegEx: cheat sheet
String column slicer:
Deal with your Emotions:
Read multiple files:
Deal with overfitting and underfitting by bagging and boosting
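To make a few of these hacks concrete, here is a minimal pandas sketch; the file pattern and the column names (order_id, price, city) are hypothetical placeholders, not part of any particular data set.

```python
import glob
import pandas as pd

# Read multiple files: load every CSV matching a (hypothetical) pattern into one DataFrame.
frames = [pd.read_csv(path) for path in glob.glob("data/part_*.csv")]
df = pd.concat(frames, ignore_index=True)

# String column slicer: take the first 4 characters of a (hypothetical) 'order_id' column.
df["order_prefix"] = df["order_id"].str[:4]

# RegEx: strip everything but digits and dots from a (hypothetical) 'price' column like "$1,299".
df["price_num"] = df["price"].str.replace(r"[^\d.]", "", regex=True).astype(float)

# Categorical vs. continuous: convert a low-cardinality column to the 'category' dtype.
df["city"] = df["city"].astype("category")

print(df.dtypes)
```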
Level Up:
Challenge
Regression analysis, a favorite amongst statisticians, is used to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.
Regression analysis is a load-bearer for a lot of things in the Data science world. For example, it can be used to
These capabilities are all cool, but there is another almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:
Regression analysis is a form of inferential statistics. It can show you how to control for the independent variables by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows you to learn the role of each independent variable without worrying about the other variables in the model. Your goal is to minimize the effect of confounding variables.
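As a small illustration, here is a hedged sketch of a linear regression with scikit-learn on synthetic data; the variables (area, rooms, price) and the coefficients used to generate them are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical example: predict house price from area and number of rooms.
rng = np.random.default_rng(0)
area = rng.uniform(50, 200, size=200)                # m^2
rooms = rng.integers(1, 6, size=200)
price = 1500 * area + 10000 * rooms + rng.normal(0, 20000, size=200)

X = np.column_stack([area, rooms])
model = LinearRegression().fit(X, price)

# Each coefficient estimates the change in price for a one-unit change
# in that variable while the other variable is held constant.
print(model.coef_, model.intercept_)

# The fitted equation can also be used to make predictions.
print(model.predict([[120, 3]]))                     # a 120 m^2, 3-room house
```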
Decode how to deal with regression here
Level Up:
Classification is the process of categorizing a given set of data into classes. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as targets, labels or categories.
Depending on the data, classification can be as basic as feeding the data into a decision tree, or the data may need a lot of pampering (cleaning, bagging, boosting, sampling, etc.) before you feed it into a classifier. The following are the most widely used classifiers for the majority of data science problems:
Working with a classification problem:
Classifier Evaluation
The most important step after building any classifier is evaluating its accuracy and efficiency. There are many ways to evaluate a classifier; let us take a look at the methods listed below.
Holdout Method
This is the most common method for evaluating a classifier. The given data set is divided into two parts, a train set and a test set, typically 80% and 20% respectively.
The train set is used to train the model, and the unseen test set is used to test its predictive power.
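A minimal sketch of the holdout method with scikit-learn, using its built-in breast cancer dataset as a stand-in for your own data and a decision tree as an example classifier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 80/20 train/test split; the test set stays unseen until evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```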
Cross-Validation
Overfitting is one of the most common problems in machine learning models. K-fold cross-validation can be used to check whether the model is overfitted.
In this method, the data set is randomly partitioned into k mutually exclusive subsets of roughly equal size. In each round, one subset is held out for testing and the others are used to train the model; the process is repeated so that each of the k folds serves as the test set once.
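A minimal sketch of 5-fold cross-validation with scikit-learn, again using the built-in breast cancer dataset as a stand-in and a scaled logistic regression as an example model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: each fold serves as the held-out test set exactly once.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```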
Classification Report
A classification report summarizes precision, recall, F1-score and support for each class; a typical example is the report of an SVM classifier trained on a cancer_data dataset.
ROC Curve
The receiver operating characteristic (ROC) curve is used for visual comparison of classification models; it shows the relationship between the true positive rate and the false positive rate. The area under the ROC curve is a measure of the accuracy of the model.
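The sketch below uses scikit-learn’s built-in breast cancer dataset as a stand-in for a cancer_data dataset: it prints a classification report for an SVM classifier and computes the area under the ROC curve.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVM with feature scaling; probability=True enables predict_proba for the ROC AUC.
svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)).fit(X_train, y_train)
y_pred = svm.predict(X_test)

# Per-class precision, recall, F1-score and support.
print(classification_report(y_test, y_pred))

# Area under the ROC curve (true positive rate vs. false positive rate).
print("ROC AUC:", roc_auc_score(y_test, svm.predict_proba(X_test)[:, 1]))
```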
Algorithm Selection
Level Up:
Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example. Deep learning is a key technology behind driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a lamppost. It is the key to voice control in consumer devices like phones, tablets, TVs, and hands-free speakers. Deep learning is getting lots of attention lately and for good reason. It’s achieving results that were not possible before.
In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labeled data and neural network architectures that contain many layers.
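As a minimal, hedged illustration (assuming TensorFlow is installed), the sketch below trains a small fully connected network on the MNIST digit images with tf.keras; the layer sizes and epoch count are arbitrary choices for demonstration.

```python
import tensorflow as tf

# Load the MNIST digits (28x28 grayscale images) and scale pixels to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected network: flatten -> dense -> dropout -> 10-way softmax.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```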
Understand the concepts of Deep Learning in detail here.
Deep Learning has spread its neurons into almost every use-case today - from time series forecasting to text analysis to image recognition. But is that really required? We have seen plenty of cases where a Naive Bayes classifier has performed better and yielded better results than Keras-based deep learning models. In the real world, how do you choose between Deep Learning and Machine Learning? Here’s a short guide to support your decision-making process:
Machine learning offers a variety of techniques and models you can choose based on your application, the size of data you're processing, and the type of problem you want to solve. A successful deep learning application requires a very large amount of data (thousands of images) to train the model, as well as GPUs, or graphics processing units, to rapidly process your data.
When choosing between machine learning and deep learning, consider whether you have a high-performance GPU and lots of labeled data. If you don’t have either of those things, it may make more sense to use machine learning instead of deep learning. Deep learning is generally more complex, so you’ll need at least a few thousand records of data/images to get reliable results. Having a high-performance GPU means the model will take less time to analyze all those data.
Most common use cases of Deep Learning are centered around:
Level Up:
P.S.: I am purposely skipping the most fascinating use case, the GPT model, here. But feel free to research it.
ML in FinTech:
Financial institutions have a wide range of applications for Machine Learning. The use-cases, however, rely on a marriage of data analytics and data science. What’s more interesting is that fintech use cases help you learn to analyse whether a given problem statement can be solved using general heuristics or needs a machine learning approach. Click here to learn the most common use case of ML in FinTech
ML in Networking:
ML, a subset of AI, is a prerequisite for any successful deployment of AI technologies. ML uses algorithms to parse data, learn from it, and make determinations or predictions without requiring explicit instructions. With that said, AI/ML can be leveraged for tasks in the networking domain such as self-correcting for maximum uptime, predicting user experience to dynamically adjust bandwidth demand, detecting malware attacks, and leveraging data mining techniques to identify and troubleshoot root causes.
ML for all:
We started with “predict the stock close price”, and today we analyse news articles to extract and predict the impact each event across the globe can have on the stock you own. We have evolved from “Ok, Google” to voice-controlled home assistants. From analysis of purchase trends to analysis of consumer sentiment and behavior during purchase activities, ML applications are constantly evolving and developing. Where there is data, there is scope for data science. A few basic applications of machine learning in today’s world include:
Level Up:
End-to-end deployment of an ML model for streaming data
Interview tips:
Want to know more about the recruitment of data scientists? Click here to learn.