Getting started with AI? First things first

I recently met with a group of folks who were ready to get into Artificial Intelligence (AI). They had a lot of data, both historical and recent, and wanted to predict the outcome of situations based on categorical and continuous (numeric) data. They were well educated in their business and had strong mathematical and analytical skills, which in my opinion are crucial for this type of endeavor.

I was intrigued, asked a barrage of questions, and learned that their biggest pain point was the data. Two main challenges made their situation complicated. First, their data was stored as tabular data in spreadsheets with variable formats and no standardization. Second, the data was coming from multiple sources, such as web scraping, manual data entry, and third-party vendors who supplied it from relational databases. This raised questions about the quality of the data, which was a big red flag.

In areas of AI such as machine learning, where algorithms are trained on data, it is imperative to have clean and accurate data. Otherwise, your predictions and outcomes will carry a great deal of variance and be only as accurate as the data itself.
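To make the standardization problem concrete, here is a minimal sketch of the kind of cleaning step a data platform would centralize. The column names, date formats, and sample rows are hypothetical stand-ins for the variable spreadsheet formats described above:

```python
# Standardizing rows that arrive with inconsistent column names and formats.
# All field names and formats here are illustrative assumptions.
from datetime import datetime

# Rows as they might arrive from different spreadsheets or vendors.
raw_rows = [
    {"Date": "01/15/2024", "Amount": "1,250.00", "Region": "north"},
    {"date": "2024-01-16", "amount": "980",      "region": "NORTH"},
]

DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")

def parse_date(value):
    """Try each known date format until one matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def standardize(row):
    """Lower-case the keys, then coerce each field to one canonical type."""
    row = {k.lower(): v for k, v in row.items()}
    return {
        "date": parse_date(row["date"]),
        "amount": float(row["amount"].replace(",", "")),
        "region": row["region"].strip().lower(),
    }

clean_rows = [standardize(r) for r in raw_rows]
```

Once every source passes through a step like this, downstream models and reports see one consistent shape regardless of where the data originated.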

There are several ways to address these challenges. One option is to use a framework designed for unstructured data, such as Hadoop. The solution I will discuss in this article involves building a standardized data platform and structuring the data to fit it. In my experience, a data platform provides the standardization that makes your AI, machine learning, and data visualization efforts much easier.

Furthermore, with a data platform you will have the ability to take advantage of the following and more:

  1. Known schemas that make the data easier to query
  2. Facilitation of data governance to ensure data consistency
  3. More visibility into data quality and potential issues
  4. Automation of data ingestion
  5. Easier monitoring of data-related processes
  6. Centralized data storage
  7. Centralized repository of result snapshots
  8. Complete segregation of data ingestion, data analysis, and analytical learning
  9. Options for utilizing data virtualization solutions such as Delphix
  10. Creation of a standardized data visualization platform
  11. Creation of a standardized operational data reporting platform
  12. Reduced support and maintenance effort
  13. Reduced learning curve
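
The first benefit, known schemas, is worth a quick illustration. Once data lands in a declared schema, it becomes queryable with plain SQL; the table, columns, and sample values below are hypothetical:

```python
# A toy example of a known schema: declare a table once, and every
# consumer can query it with standard SQL. Names here are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_date TEXT NOT NULL,
        amount    REAL NOT NULL,
        region    TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2024-01-15", 1250.0, "north"), ("2024-01-16", 980.0, "south")],
)

# With a fixed schema, an aggregate query needs no per-source special cases.
total_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
```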

Once you have a data platform in place, machine learning models can take advantage of clean and normalized datasets. A data lake can run deep analytics concurrently without disrupting the ingestion process, and can feed real-time streaming analytics workflows.

How it all fits together

Looking at the diagram below, you can see a suggested data platform whose end goal is to provide analytics in real time and on demand. We will go through each of the steps next.


As your data is ingested and stored, you can run deeper analytics on the data lake. At the same time, you can have a machine learning model run predictions and classifications on the data. In parallel, you can have operational reporting and analytic dashboards produce visual outputs. This data platform can also leverage IoT data; I will cover IoT, Edge Analytics, and Cloud Analytics in a future article.

In conclusion, a data platform helps you create a repository of all your data. This repository can easily evolve to support your advanced analytics needs. The more structured your approach to managing your data, the better the results you will get from your advanced analytics efforts. Clean, accurate, and reliable data is key to getting the best possible results from machine learning and deep analytics.

Keep in mind that a data platform does not need to include all the components mentioned above. You can start with just the data sources and ingestion. Once you have stabilized these two components, you can add either the machine learning models or deep analytics. Afterward, you can add a historical repository of results. Once you are confident your data layer is complete and accurate, you can move on to the visualization layer, which would include an analytics dashboard and, if needed, operational reporting.

More to Consider 

I highly recommend implementing your data platform in the cloud through a service like Microsoft Azure. One of the many benefits of doing so is the scalability and elasticity of the cloud, which let you grow your capacity as your needs grow. Also, many cloud providers, including Azure, have frameworks that facilitate building a data platform. More on this topic will follow in a later article.

As with every platform, I recommend capturing and storing metrics to help you identify performance bottlenecks. These metrics will also help you identify opportunities for process improvement. For a data platform, example metrics include your machine learning models' prediction accuracy or their confusion matrix output, which helps you gauge the effectiveness of your predictive models.
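As a small sketch of the metrics just mentioned, here is how accuracy and a binary confusion matrix can be derived from a model's predictions. The label vectors are made-up sample data:

```python
# Accuracy and confusion-matrix counts for a binary classifier.
# The actual/predicted labels below are illustrative sample data.
from collections import Counter

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion matrix as (actual, predicted) -> count.
matrix = Counter(zip(actual, predicted))
tp, tn = matrix[(1, 1)], matrix[(0, 0)]  # true positives / true negatives
fp, fn = matrix[(0, 1)], matrix[(1, 0)]  # false positives / false negatives

accuracy = (tp + tn) / len(actual)  # here: 6 correct out of 8 -> 0.75
```

Tracking numbers like these over time, alongside ingestion and query metrics, is what turns the platform's stored results into actionable feedback on model quality.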

In closing, when spearheading an AI or an advanced analytics effort, it is worthwhile to take the time to plan and build a robust data platform that will help you achieve your analytics goals. Consult with an expert if you don’t know where to start.

Emilio Chemali, Director of Business Intelligence & Analytics, MRE Consulting, Ltd.

Emilio is a technology subject matter expert, respected thought leader and CIO100 Award Winner.  With over 18 years of experience, Emilio has helped clients in multiple industries create business value through Business Intelligence, Data Analytics, DataOps, DevOps, IoT, Application Integration, Enterprise Mobility, Enterprise Architecture, Software Development, Infrastructure Management, Cloud Strategies, Server Virtualization, and Application Performance Tuning initiatives.


