"Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed."
Arthur Samuel's characterization of machine learning is often quoted, and its genius lies in its broadness, but it leaves you with the question of how the computer learns. To achieve machine learning, experts develop general-purpose algorithms that can be applied to large classes of learning problems. When you want to solve a specific task, you only need to feed the algorithm more specific data. In a way, you're programming by example. In most cases a computer uses data as its source of information, compares its output to a desired output, and then corrects for the difference. The more data, or "experience," the computer gets, the better it becomes at its designated job, much like a human does.
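To make that learn-by-correction idea concrete, here is a toy sketch (an assumed illustration, not taken from any particular library or from the text) that fits a single parameter by repeatedly comparing its output to the desired output and nudging the parameter toward it:

```python
# Toy illustration of "compare output to desired output, then correct":
# learning the hidden rule y = 2 * x from example pairs.
examples = [(1, 2), (2, 4), (3, 6), (4, 8)]  # (input, desired output)

w = 0.0             # the parameter the machine "learns"
learning_rate = 0.01

for _ in range(200):                     # more passes over the data = more "experience"
    for x, desired in examples:
        predicted = w * x                # the computer's current output
        error = desired - predicted      # how far off it is
        w += learning_rate * error * x   # correct the parameter toward the desired output

print(round(w, 2))  # close to 2.0, the rule hidden in the example data
```

After a couple of hundred passes over the four example pairs, the learned weight settles near 2; feeding the loop more pairs, or noisier ones, is exactly the "experience" that shapes the result.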
"Machine learning is the process by which a computer can work more accurately as it collects and learns given."
For example, as a user writes more text messages on a phone, the phone learns more about the messages' common vocabulary and can predict (autocomplete) words faster and more accurately.
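As a rough illustration of that idea (an assumed toy example, not how any particular phone implements it), a frequency-based autocompleter in Python might look like this:

```python
from collections import Counter

# The more messages the user writes, the better the frequency counts
# reflect their personal vocabulary.
messages = [
    "see you tonight",
    "see you tomorrow at the station",
    "running late, see you soon",
]

word_counts = Counter(word for msg in messages for word in msg.lower().split())

def autocomplete(prefix, n=3):
    """Suggest the user's most frequent words starting with the typed prefix."""
    candidates = [(count, word) for word, count in word_counts.items()
                  if word.startswith(prefix)]
    return [word for count, word in sorted(candidates, reverse=True)[:n]]

print(autocomplete("to"))   # -> ['tonight', 'tomorrow']
```

Every new message updates the counts, so the suggestions keep adapting to the words this particular user actually types.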
Although machine learning is mainly linked to the data-modelling step of the data science process, it can be used at almost every step.
The data modelling phase can't start until you have good-quality raw data you can understand. But prior to that, the data preparation phase can benefit from the use of machine learning. An example would be cleansing a list of text strings; machine learning can group similar strings together so it becomes easier to correct spelling errors. Machine learning is also useful when exploring data. Algorithms can root out underlying patterns in the data that would be difficult to find with charts alone. Given that machine learning is useful throughout the data science process, it shouldn't come as a surprise that a considerable number of Python libraries were developed to make your life a bit easier.
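Returning to the cleansing example above, a minimal sketch of grouping similar strings could use Python's standard difflib module; the names, similarity threshold, and grouping strategy here are illustrative assumptions, not code from the text:

```python
import difflib

# Group text strings that look alike so spelling variants can be corrected in one go.
names = ["New York", "new york", "New Yrok", "Boston", "Bostn", "Chicago"]

groups = []
for name in names:
    # Look for an existing group whose representative closely matches this string.
    match = difflib.get_close_matches(name.lower(),
                                      [g[0].lower() for g in groups],
                                      n=1, cutoff=0.8)
    if match:
        for group in groups:
            if group[0].lower() == match[0]:
                group.append(name)   # add the variant to the matching group
                break
    else:
        groups.append([name])        # start a new group for an unseen value

print(groups)
# [['New York', 'new york', 'New Yrok'], ['Boston', 'Bostn'], ['Chicago']]
```

Once the variants sit in the same group, a single correction (for example, mapping every member to the group's most frequent spelling) cleans the whole list.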
Broadly speaking, we can divide the different approaches to machine learning by the amount of human effort that's required to coordinate them and by how they use labelled data: data with a category or a real-valued number assigned to it that represents the outcome of previous observations.
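As a small assumed illustration of that distinction, labelled observations carry the outcome you want to predict, while unlabelled ones do not:

```python
# Labelled for classification: each observation carries a category label.
emails = [
    {"contains_offer": True,  "num_links": 12, "label": "spam"},
    {"contains_offer": False, "num_links": 1,  "label": "not spam"},
]

# Labelled for regression: each observation carries a real-valued outcome.
houses = [
    {"rooms": 3, "area_m2": 70,  "price": 210_000},
    {"rooms": 5, "area_m2": 120, "price": 395_000},
]

# Unlabelled: the same kind of observations, but with no outcome attached;
# an algorithm can only look for structure, such as groups of similar records.
unlabelled_houses = [
    {"rooms": 2, "area_m2": 55},
    {"rooms": 4, "area_m2": 95},
]
```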