Performing Data Analysis & Visualization (Part 3 of 3)

 




To read part 1, please click here
To read part 2, please click here








Exploring & Analyzing File Datasets

We need to look out for the following aspects while exploring and analyzing datasets:
  • Uniformity- If the images in a dataset are not all the same size, they should be rescaled to a common size. This may also require centering the pixel values per channel, possibly followed by some form of normalization.

  • Augmentation- Here, we diversify the dataset without collecting new data (new images), which is very useful when dealing with a small dataset. It typically involves horizontal and vertical flipping, cropping, and rotating, among other transformations.
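
As a rough illustration of the two points above, here is a minimal sketch using only NumPy. The function names (`resize_nearest`, `center_per_channel`, `augment`), the random stand-in batch, and the nearest-neighbour resizing are assumptions made for this example; in practice an image library would usually handle the resizing.

```python
import numpy as np

def resize_nearest(image, height, width):
    """Rescale one H x W x C image to height x width by nearest-neighbour sampling."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows][:, cols]

def center_per_channel(images):
    """Subtract each channel's mean and divide by its standard deviation.

    `images` is a batch shaped (N, H, W, C); statistics are taken over N, H and W.
    """
    images = images.astype(np.float64)
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    std = images.std(axis=(0, 1, 2), keepdims=True)
    return (images - mean) / (std + 1e-8)  # small epsilon avoids division by zero

def augment(image, rng):
    """Return a randomly flipped and rotated copy of one H x W x C image."""
    if rng.random() < 0.5:
        image = np.fliplr(image)   # horizontal flip
    if rng.random() < 0.5:
        image = np.flipud(image)   # vertical flip
    k = int(rng.integers(0, 4))    # number of 90-degree rotations
    return np.rot90(image, k=k, axes=(0, 1))

rng = np.random.default_rng(0)

# Stand-in data: one image of a different size plus a batch of 32 x 32 RGB images.
odd_sized = rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8)
batch = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)

resized = resize_nearest(odd_sized, 32, 32)   # now matches the batch size
normalized = center_per_channel(batch)        # zero mean, unit variance per channel
augmented = augment(batch[0], rng)            # same pixels, new orientation
```

The augmented copy contains exactly the same pixel values as the original image, only rearranged, which is why augmentation adds variety without adding new data. Note that 90-degree rotations swap height and width for non-square images.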

However, if you want to take pictures as uniformly as possible while covering a wide range of scenarios, you have to consider the following aspects:

  • Camera Type- We might need the same type of camera so that pictures are taken in the same format wherever they are captured.

  • Environmental Conditions- Lighting, temperature, humidity, etc. can also influence the camera's electronics and, with them, the resulting pictures.

  • Positioning- The position and angle of the camera while taking the picture can also influence the image quality.

These are only some of the points to consider while taking a picture.

There are also other kinds of files, such as sound files, which can be used to build a speech-to-text model that converts whatever we say into text. We can use Fourier transforms to decompose our sound files into their frequency components, while considering the following aspects:

  • Recording Hardware- A voice assistant at home probably contains the same microphone for everyone, while mobile phones might have different microphones.

  • Environment- We might need to record voices in different environments, which may have different sound spectra (for example, the spectrum in a tram differs from the one in a recording booth).

  • Pronunciation- The ML algorithm in your brain may have a hard time deciphering different pronunciations (especially dialects), and so does the actual ML model.
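
To make the Fourier decomposition mentioned above more concrete, the following is a minimal sketch using NumPy's FFT routines. The signal is synthetic: the sample rate, the two tone frequencies, and their amplitudes are assumptions chosen for the example, standing in for a real recording.

```python
import numpy as np

sample_rate = 8000                      # samples per second (assumed)
duration = 1.0                          # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate

# Synthetic "recording": a 440 Hz tone plus a quieter 1000 Hz tone.
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Real-input FFT: decompose the signal into its frequency components.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1 / sample_rate)

# Scale so that each peak height matches the amplitude of its sine wave.
magnitude = np.abs(spectrum) * 2 / signal.size

# The two strongest components should sit at the tone frequencies.
strongest = np.sort(freqs[np.argsort(magnitude)[-2:]])
print(strongest)
```

Comparing such spectra across microphones or environments is one way to see how much the recording conditions differ between samples.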

To summarize, for file datasets we don't have as many options to statistically eliminate such problems, so we must concentrate on taking good, clean samples that simulate the realistic environment the model will encounter when running in production.















