What is data in Artificial Intelligence?

'AI for Everyone' by DeepLearning.ai

Index

  • Introduction

  • How do we acquire data?

    • By manual labeling

    • By observing user behavior

    • By observing machine behavior

    • Download from websites or partnerships

  • Use and Misuse of data

  • Data is messy

    • Data problems

    • Multiple types of data

"Data is really important for building AI systems. But what data is?"

Introduction

A table of data is often called a dataset.

If we aim to determine the pricing of houses for buying or selling, we can gather a dataset such as a table with one column for the size of the house in square feet or square meters and another column for the corresponding house price. This dataset may be organized in an MS Excel spreadsheet. In developing an AI or Machine Learning system to assist in setting house prices or assessing whether a price is appropriate, we can designate the house size as A and the house price as B. The objective is to train the AI system to learn the mapping from input (A) to output (B).

Now, rather than just pricing a house based on its size, we might also collect data on the number of bedrooms.

In that case, A can be the first two columns together (size and number of bedrooms), and B can be just the price of the house. So, given a dataset, it is the business use case that decides what A is and what B is. If we have a certain budget and want to decide what size of house we can afford in square feet, that would be a different choice of A and B:

A: How much does someone spend?

B: Size of the house in square feet
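
To make the A-to-B idea concrete, here is a minimal sketch using the first choice of A and B (house size and number of bedrooms as input, price as output). The numbers are made up for illustration, and scikit-learn's LinearRegression is just one possible tool; the course does not prescribe any particular library or model.

```python
# A minimal sketch of learning the A -> B mapping for house prices.
# The numbers are invented for illustration; LinearRegression is just
# one possible choice of model.
from sklearn.linear_model import LinearRegression

# A: house size (sq. ft.) and number of bedrooms; B: price in $1000s
A = [[523, 1], [645, 1], [708, 2], [1034, 3], [2290, 4], [2545, 4]]
B = [115, 150, 210, 280, 500, 520]

model = LinearRegression()
model.fit(A, B)  # learn the mapping from input A to output B

# Estimate the price of a 1500 sq. ft., 3-bedroom house
print(model.predict([[1500, 3]]))
```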

Let's see another dataset:

The above image is taken from the slides created by Deeplearning.ai

If we want to build an AI system that detects cats, then we collect a dataset like the one above, where the input A is a set of images and the output B is the corresponding labels. The first and third rows are cats, whereas the second and fourth rows are not cats.
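
As a rough sketch of how such a dataset might be represented in code (the file names are hypothetical placeholders; in practice A would be the image pixels and B the labels assigned by a person):

```python
# A minimal sketch of a cat-detector dataset: input A is a set of images,
# output B is the corresponding labels (1 = cat, 0 = not a cat).
# The file names are hypothetical placeholders for illustration.
dataset = [
    ("image_1.jpg", 1),  # cat
    ("image_2.jpg", 0),  # not a cat
    ("image_3.jpg", 1),  # cat
    ("image_4.jpg", 0),  # not a cat
]

A = [path for path, label in dataset]   # inputs: the images
B = [label for path, label in dataset]  # outputs: the labels
```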

How do we acquire data?

  • By manual labeling:

    The above image is taken from the slides created by Deeplearning.ai

    We could gather a series of images, like the ones shown here, and proceed to label each of them. For instance, we determine that the first and third images depict cats, while the second and fourth do not. This manual labeling process establishes a dataset for training a cat detector. Creating such datasets typically requires more than just four pictures; it often involves collecting hundreds or even thousands. While manual labeling can be a meticulous task, it remains a reliable and established method for generating datasets.

  • By observing user behavior:

    The above image is taken from the slides created by Deeplearning.ai

    If we operate an online retail website, we can simply observe user interactions to determine whether they make a purchase or not. By capturing this buying behavior, we can generate a dataset that includes details such as user ID, the time of the user's website visit, the offered product price, and whether the user completed a purchase.

  • By observing machine behavior:

    The above image is taken from the slides created by Deeplearning.ai

    If we run a large machine in a factory and want to predict whether the machine is about to fail or develop a fault, then just by observing the machine's behavior we can record a dataset like this: a machine ID, the temperature of the machine, the pressure within the machine, and whether the machine failed or not. In this case, we can take (machine ID, temperature, pressure) as A and the machine fault as B and learn the mapping A -> B. This lets us do preventive maintenance on the machine (a minimal sketch of this mapping appears after this list).

  • Download from Websites or Partnerships

    The open internet lets us download datasets ranging from computer vision and image datasets to self-driving car datasets, speech recognition datasets, medical imaging datasets, and many more, keeping licensing and copyright in mind. We can also obtain data from a partner, such as a factory we work with.
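
Returning to the machine-behavior example above, here is a minimal sketch of the A-to-B mapping for preventive maintenance. The readings are invented, and a logistic-regression classifier from scikit-learn is just one possible choice; the machine ID identifies each row but is not used as a model input.

```python
# A minimal sketch of preventive maintenance as an A -> B mapping:
# A = (temperature, pressure) readings, B = did the machine fail (1) or not (0).
# The readings are invented for illustration.
from sklearn.linear_model import LogisticRegression

A = [[88, 31], [92, 35], [101, 42], [79, 28], [110, 47], [84, 30]]
B = [0, 0, 1, 0, 1, 0]

clf = LogisticRegression()
clf.fit(A, B)  # learn which operating conditions precede a fault

# Estimated probability that a machine at 105 degrees / 44 psi is about to fail
print(clf.predict_proba([[105, 44]])[0][1])
```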

Use and Misuse of data:

The mishandling of data doesn't necessarily imply malicious intent; it can also result from inefficient utilization of data.

  • Some companies delay leveraging their data for AI until they collect a substantial amount, deeming it 'perfect.' However, a more effective approach is to provide the AI team with data as soon as a reasonable quantity is accumulated. This enables the AI team to offer insights to the IT team regarding the types of data needed and the required IT infrastructure.

    Illustrating this concept: the IT team initially supplies the AI team with data from a factory machine at 10-minute intervals. The AI team then gives feedback to the IT team, recommending more frequent data feeds, such as every minute, for a better preventive maintenance system.

    Hence, seek feedback from the AI team at an early stage, as it can help steer the development of the IT infrastructure.

  • Some company CEOs understand the significance of staying abreast of current trends. They believe that accumulating substantial data makes it valuable for the AI team. However, this assumption doesn't consistently hold true. While having more data is often advantageous, merely collecting terabytes or gigabytes isn't sufficient to render it valuable for an AI team. The strategy of overinvesting in data acquisition without active involvement from the AI team is not advisable. Instead, it is recommended to consult the AI team to determine which data holds value for them.

Data is Messy:

You might be familiar with the saying 'garbage in, garbage out.' In the context of AI, having poor-quality data can lead to the system learning inaccuracies.

  • Data Problems:

The above image is taken from the slides created by Deeplearning.ai

Consider the above dataset of house sizes, numbers of bedrooms, and prices.

  1. We can have incorrect labels or incorrect data. In the second row, the price is listed as 0.001, which, with prices given in thousands of dollars, works out to $1. No house is sold for just $1, so this entry is inaccurate.

  2. In the 3rd, 4th, 5th, and 6th rows, there are values marked as unknown, commonly referred to as missing values. Various techniques exist to address and manage such instances of missing values.

So, the AI team will need to figure out how to clean up the data, that is, how to deal with these incorrect labels and the missing values.
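
As a minimal sketch of what such clean-up might look like, assuming the price column is in thousands of dollars: the rows below are invented to mirror the example, and pandas is just one common tool; the course does not prescribe any.

```python
# A minimal data-cleaning sketch: drop an implausible price and fill missing values.
# The rows are invented to mirror the example above.
import pandas as pd

df = pd.DataFrame({
    "size_sqft":   [523, 645, 708, None, 1034, 2290],
    "bedrooms":    [1, 1, None, 3, 3, 4],
    "price_1000s": [115, 0.001, 210, 280, None, 500],
})

# 1. Remove rows with clearly incorrect data (no house sells for about $1)
df = df[df["price_1000s"].isna() | (df["price_1000s"] > 1)]

# 2. Handle missing values, e.g. fill each numeric column with its median
#    (dropping those rows entirely is another common option)
df = df.fillna(df.median(numeric_only=True))

print(df)
```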

  • Multiple types of data:

Different types of data exist, such as images, audio, and text, which humans can interpret effortlessly. This falls under the category known as unstructured data. Specialized AI techniques can process unstructured data, enabling tasks like image recognition, speech understanding, or spam detection in emails. On the other hand, there is structured data, exemplified by datasets formatted like a spreadsheet. Dealing with unstructured and structured data involves distinct AI techniques, but both types can be effectively addressed by AI methodologies.
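
A rough illustration of the two kinds of data (the records below are invented):

```python
# Structured data: fits naturally into rows and columns, like a spreadsheet
structured_row = {"size_sqft": 1034, "bedrooms": 3, "price_1000s": 280}

# Unstructured data: images, audio, and free-form text that humans interpret easily
unstructured_examples = [
    "Congratulations! You have won a prize, click here ...",  # email text (spam?)
    "cat_photo.jpg",                                          # an image
    "voice_note.wav",                                         # an audio clip
]
```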

In this blog, we discovered what data is and learned to avoid its misuse, such as over-investing in IT infrastructure without ensuring its relevance to future AI applications. We emphasized that data can be messy, but a proficient AI team can assist in navigating these challenges.

Credit for the above blog: I have gained the knowledge for this blog from the course 'AI for Everyone' by Andrew Ng. Thank you, Deeplearning.ai, for such a wonderful course.