This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3169-data-science-and-data-engineering-in-2022.
1.1 Big Trends in 2021
- 2021 was a year of incremental progress rather than breakthroughs, with a sharp rise in the demand for data professionals, the rise of data engineering, and further developments in:
- MLOps
- DataOps
- Computer Vision
- Natural Language Processing
- According to Research and Markets, the data engineering market is expected to grow to USD 77.37 billion by 2023
- With the pandemic, people have adopted a “living on the internet” lifestyle
- This has produced more streaming and live data to analyze, which in turn drove the demand for more data engineering jobs in 2021
Source: https://www.researchandmarkets.com/reports/4618070/big-data-and-data-engineering-services-market-by
1.2 MLOps
- The management of the lifecycle of machine learning (ML) projects is in its infancy with most organizations working on proofs-of-concept (PoCs)
- MLOps means developing a systematic approach to the monitoring, scalability, and evaluation of data pipelines and ML models
- Model training should be reproducible and deployment should be as automated as possible
- The ability to rapidly build end-to-end solutions allows for an improved focus on providing genuine business value
- Taking already successful models and repositioning or re-purposing them adds value and saves time
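The reproducibility point above can be illustrated with a minimal sketch. It assumes a toy linear model trained on synthetic data; the function names (`train_linear_model`, `run_fingerprint`) and the config keys are hypothetical and not part of any MLOps product. The idea is that all randomness flows from a seed stored in the config, and that a hash of config plus result gives a run ID that can be compared across machines.

```python
import hashlib
import json
import random


def train_linear_model(config):
    """Fit y = w * x + b by gradient descent on synthetic data.

    All randomness comes from one generator seeded by the config,
    so the same config always yields the same model.
    """
    rng = random.Random(config["seed"])
    # Synthetic data: y = 3x + 1 plus a little noise (illustrative only).
    xs = [rng.uniform(-1.0, 1.0) for _ in range(config["n_samples"])]
    ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

    w, b = 0.0, 0.0
    for _ in range(config["epochs"]):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        w -= config["learning_rate"] * grad_w
        b -= config["learning_rate"] * grad_b
    return {"w": w, "b": b}


def run_fingerprint(config, model):
    """Hash the config and the trained parameters into a short run ID."""
    payload = json.dumps({"config": config, "model": model}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


config = {"seed": 42, "n_samples": 200, "epochs": 500, "learning_rate": 0.1}
model_a = train_linear_model(config)
model_b = train_linear_model(config)

# Two runs with the same config should produce identical parameters.
print(model_a == model_b)
print(run_fingerprint(config, model_a))
```

In a real project the same pattern extends to pinned library versions and versioned training data, which this sketch leaves out.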
1.3 DataOps
- The data engineering technology market is dynamic, driven by the rapid shift from on-premise databases and BI tools to modern, cloud-based data platforms built on lakehouse architectures
- Snowflake, Databricks, Amazon S3, Google Bigtable, etc.
- To remain competitive and maximize the value of their data including sensitive data, organizations are developing DataOps functions and frameworks to varying degrees
- DataOps tools and processes enable continuous and automated delivery of data to power BI, analytics, data science, and data-powered products
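As a rough illustration of "continuous and automated delivery of data", the sketch below shows a quality gate that releases a batch downstream only if every check passes. The record layout, the check functions, and `deliver_if_valid` are all hypothetical names invented for this example; real DataOps tooling offers far richer versions of the same idea.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_null(rows, column):
    """Fail if any row has a missing or empty value in the column."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return CheckResult(f"not_null:{column}", missing == 0, f"{missing} missing")


def check_range(rows, column, low, high):
    """Fail if any non-missing value falls outside [low, high]."""
    bad = sum(
        1 for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    )
    return CheckResult(f"range:{column}", bad == 0, f"{bad} out of range")


def deliver_if_valid(rows, checks):
    """Run every check; release the batch only if all of them pass."""
    results = [check(rows) for check in checks]
    released = all(result.passed for result in results)
    return released, results


# Hypothetical batch of order records arriving from an upstream system.
batch = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -4.0},     # negative amount violates the range check
    {"order_id": None, "amount": 12.5},  # missing key violates the null check
]

checks = [
    lambda rows: check_not_null(rows, "order_id"),
    lambda rows: check_range(rows, "amount", 0.0, 10_000.0),
]

released, results = deliver_if_valid(batch, checks)
print("released:", released)
for result in results:
    print(result.name, "PASS" if result.passed else "FAIL", result.detail)
```

Running the gate on every batch, rather than on demand, is what turns a set of checks into an automated delivery step.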
1.4 Computer Vision (CV)
- This year saw the adoption of computer vision applications across various industries, which use the technology to better understand images and videos
- With workforces shrinking in traditional sectors, many companies started to apply already developed CV applications to continue their work in image classification, object detection, semantic segmentation, and human pose estimation
- OAK is a modular, open-source ecosystem composed of MIT-licensed hardware, software, and AI training
- It allows embedding spatial AI and accelerated computer vision functions into your product
- IRISXtract, which uses the IRIS Hybrid Forms Processing Solution, extracts indexes from structured, unstructured, and hybrid forms, combining all IRIS-made data capture techniques in a single application
- The solution captures, analyses and routes the targeted data to the system of your choice (ECM, DMS, ERP, Cloud …) for intelligent business process automation
Source: https://neptune.ai/blog/top-tools-to-run-a-computer-vision-project
Source: https://irisdatacapture.com/software/irisxtract/
1.5 Natural Language Processing (NLP)
- NLP is one of the hottest fields in artificial intelligence (AI) and machine learning (ML) right now
- In 2021, NLP budgets remained robust
- 60% of tech leaders indicated that their NLP budgets grew by at least 10% compared to 2020
- A third (33%) of tech leaders indicated that their NLP budgets grew by at least 30%
- Accuracy, production readiness, and scalability are key features of NLP solutions
- The most popular applications of NLP are Named Entity Recognition (NER) and Document Classification
Source: https://gradientflow.com/2021nlpsurvey/
- Hugging Face is a leader in NLP tools like:
- Optimum: Transformers have been a game-changer when it comes to improving the accuracy of machine learning and NLP models
- The Transformers library made using state-of-the-art models easy, alleviating the complexity of frameworks, architectures, and pipelines
- With the Optimum library, engineers can use all the available hardware features at their disposal, reducing the complexity of model acceleration on hardware platforms
- Infinity: The on-prem containerized solution delivers Transformers accuracy at 1ms latency
- It helps speed up inference in your own infrastructure. It can be deployed in any production environment and can easily be scaled to thousands of requests per second
Source: https://huggingface.co
1.6 Trends in 2022
- The number of jobs in the Data Science Domain will continue to rise in 2022 with Data Engineering and MLOps taking precedence
Source: https://www.analyticsvidhya.com/blog/2021/12/a-review-of-2021-and-trends-in-2022-a-technical-overview-of-the-data-industry/
- While analyzing the employment trends for data engineers, it’s evident that there is a high push in the healthcare industry
- The number of analytics professionals employed in healthcare has nearly tripled, with an uptake of 18% in a survey done on a small sample
- The popularity of notebooks continues to rise
- Notebooks will continue to gain traction among data engineers in 2022
- Notebooks, whether Jupyter or other online derivatives, allow data engineers to mix and match languages according to the task at hand
Source: https://www.silect.is/blog/notebook-interface-data-engineers-data-science/
- DataOps moves further into the mainstream of Data Engineering in 2022
1.7 DataOps 2.0
Source: https://jdp491bprdv1ar3uk2puw37i-wpengine.netdna-ssl.com/wp-content/uploads/2019/11/102519_Ultimate_Guide_To_Data_Ops_Tamr.pdf
1.8 Data Fabric
- Data fabric is being hailed by analyst firm Gartner as a next-generation data management architecture for the enterprise
- It is designed as a coherent environment for reconciling all types of data, from all types of sources, and can reduce data management efforts by nearly 70%
- Promising flexibility and agility, a Data Fabric avoids data siloing and simplifies its integration into the decision-making and strategic processes of companies
- Data Fabric facilitates the use of data, even for non-technical users, and contributes to the development of your organization’s data culture.
1.9 Cloud-Native platforms
- Cloud-native platforms in the data ecosystem
- These platforms, which promise scalability and adaptability, are a response to both performance and cost control
- The transition to the cloud has multiple advantages, including cost and time savings, reliability, and mobility
- Gartner points out that Cloud-native platforms, which exploit the basic capabilities of cloud computing to provide scalable and elastic IT capabilities “as a service”, are expected to form the basis of 95% of companies’ digital transformation projects by 2025, compared with 40% in 2021
- Work-from-home and hybrid working models have encouraged many companies to move their data to the cloud
- Millennials and other employees are drawn to organizations that use the latest tools and technologies, which in turn helps to advance the business
Source: https://www.gartner.com/en/information-technology/insights/top-technology-trends
1.10 Hybrid Forms of Automation
- Companies face an accelerating time-to-market and the need to return to strong, rapid economic growth, so they turn to automation to win the race against time and refocus human intelligence on value-added tasks
- According to Gartner, hyper-automation will be one of the key trends in 2022
- This hyper-automation translates into a massive use of advanced technologies, including artificial intelligence and machine learning to automate processes and augment human capabilities
- In its report, Gartner states that “the most successful hyper-automation teams focus on three key priorities: improving the quality of work, accelerating business processes and increasing decision-making agility.”
- Sales agents can work with Robotic Process Automation (RPA) solutions that perform routine tasks, reducing the workload on salespeople and allowing them to be more effective
- Solutions can complete the order process after the sale has been made and analyze customer information and sales interactions to identify upsell and cross-sell opportunities
Source: https://www.gartner.com/en/information-technology/insights/top-technology-trends
Source: https://irpaai.com/wp-content/uploads/2018/08/IRPAAI-Kryon-Hybrid-72dpi-NEW-LOGO.pdf
1.11 AI-as-a-Service Platforms
- Artificial Intelligence as a Service (AIaaS) refers to off-the-shelf AI tools that enable companies to implement and scale AI techniques at a fraction of the cost of a full, in-house AI
- Top 10 AIaaS platforms:
- Google. Platform: Google Cloud AI
- Amazon. Platform: Amazon AI services
- Microsoft. Platform: Microsoft Azure AI
- H2O.ai. Platform: H2O.ai
- IBM. Platform: IBM Watson Studio
- Google Brain team. Platform: TensorFlow
- DataRobot. Platform: DataRobot
- Wipro Holmes. Platform: Wipro Holmes AI and automation platform
- Salesforce. Platform: Salesforce Einstein
- Infosys. Platform: Infosys Nia
Source: https://aimagazine.com/ai-strategy/top-10-ai-platforms
1.12 Augmented Data Management
- Augmented data preparation employs machine learning algorithms that automatically detect and analyze data usage to blend data, find data relationships, and recommend the best actions to take for cleaning, enriching, and manipulating data
- This allows business users to spend more time analyzing data and less time getting it ready
- Augmented data discovery is the process of automating your search for external data using a platform like Explorium, which connects you to thousands of pre-vetted sources
- Augmented data management supports and accelerates the following capabilities and tasks:
- Data Quality: Identifying and resolving data quality issues
- Metadata Management: Labelling, classifying and searching data
- Master Data Management: Identifying and evaluating potential master data
1.13 Business Intelligence for Performance
- According to Gartner’s Data 2022 trends report, “over the next two years, one-third of large organizations will use BI for structured decision making to improve their competitive advantage”
- Business Intelligence is based on a set of technologies that enable real-time and granular analysis of data to clarify and facilitate decision-making
- It relies on a wide range of applications, solutions and methodologies combined to collect data from internal systems and external sources, in order to integrate it into decision-making processes
Source: https://www.gartner.com/en/information-technology/insights/top-technology-trends
1.14 Data Mesh
- Data mesh is an emerging paradigm shift in how data architectures and platforms are viewed in the context of business
- Forerunners are already experimenting with and implementing data mesh, which is first and foremost a cultural shift in how data will be governed in the future
- Snowflake became a popular technology and grew strongly in 2021; this will continue in 2022
- This tool is flexible and user-friendly, and it supports cloud platforms
- Kubernetes adoption has been increasing as the demand for data engineering roles has increased, requiring more DevOps responsibilities
1.15 AI Engineering
- AI Engineering addresses the critical mission of automating data updates, models and applications to streamline the use of AI in data analysis
- AI Engineering services create the data platforms to deliver operational AI solutions
- AI Engineering is the key lever for value generation by 2025, according to Gartner
- The firm predicts that “the 10% of companies that will establish AI engineering best practices will generate at least three times more value than the 90% of companies that will not.”
- At a global level in 2021, we saw unique diagnostics projects, such as DarwinAI (Canada)
- With this computer vision tool in place, it is possible to diagnose COVID-19 from chest radiography scans alone
- Before, the only medical-imaging COVID-19 diagnostic method was computed tomography (CT)
Source: https://www.gartner.com/en/information-technology/insights/top-technology-trends
1.16 Machine Learning Services
- Image and speech recognition and other solutions to generic problems are already available as easily integrated APIs, and their significance will keep growing
- The number of ready machine learning (ML) services will also continue to grow
- As a result, developers will be able to implement advanced features without a deep knowledge of data science
- Even in projects that require custom ML solutions, AutoML will partly automate the model optimization and therefore enhance the impact of the data scientists’ work
- Google’s latest research breakthrough, Language Model for Dialogue Applications (LaMDA), can engage in free-flowing conversation on any topic, currently in text only
- This is a breakthrough in unlocking more natural ways of interacting with technology and new categories of supportive applications
- Google has been fine-tuning the model to improve the sensibleness and specificity of its responses
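The AutoML point made earlier in this section, that model optimization can be partly automated, can be sketched with a plain random search over hyperparameters. This is only one simple technique that AutoML systems build on, not how any particular AutoML product works. The k-nearest-neighbours regressor, the synthetic data, and the search ranges below are all assumptions chosen to keep the example self-contained.

```python
import random


def make_data(seed, n):
    """Synthetic regression data: y = x^2 plus a little noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, query, k, weighted):
    """Predict with the k nearest training points, optionally distance-weighted."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query))[:k]
    if weighted:
        weights = [1.0 / (abs(x - query) + 1e-6) for x, _ in nearest]
    else:
        weights = [1.0] * len(nearest)
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def validation_mse(params, train, valid):
    """Mean squared error of one hyperparameter setting on held-out data."""
    train_x, train_y = train
    valid_x, valid_y = valid
    errors = [
        (knn_predict(train_x, train_y, x, params["k"], params["weighted"]) - y) ** 2
        for x, y in zip(valid_x, valid_y)
    ]
    return sum(errors) / len(errors)


def random_search(train, valid, n_trials, seed):
    """Sample hyperparameters at random and keep the best-scoring setting."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {"k": rng.randint(1, 30), "weighted": rng.choice([True, False])}
        score = validation_mse(params, train, valid)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


train = make_data(seed=0, n=150)
valid = make_data(seed=1, n=50)
best_params, best_score = random_search(train, valid, n_trials=20, seed=7)
print(best_params, round(best_score, 4))
```

The data scientist still chooses the model family, the search space, and the validation data; the loop only automates the trial-and-error part.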
1.17 NLP for Smaller Languages
- Companies want to capitalize on their textual and speech data, and this applies in languages other than English as well
- Cloud platforms already provide NLP tools for a number of languages and plan to release more in the future to cater to additional markets
- The need for both simpler rule-based models and more advanced pre-trained deep learning models remains to be met by smaller boutique players
- Areas which are of particular interest include:
- Transfer Learning
- Fake News and Cyberbullying Detection
- Monitoring Social Media Using NLP
- Reinforcement Learning Training Models
- The Use of Multilingual NLP
- Using a Mix of Supervised and Unsupervised Machine Learning Techniques
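To make the "simpler rule-based models" mentioned above concrete, here is a minimal sketch of a rule-based entity extractor built from a name list (gazetteer) and one regular expression. The organization name "Norðurbanki", the gazetteer contents, and the date format are hypothetical examples. A real system for a smaller language would also need to handle inflected word forms, which this sketch ignores.

```python
import re

# Hypothetical gazetteer: known names grouped by entity label.
GAZETTEER = {
    "ORGANIZATION": ["Norðurbanki"],
    "LOCATION": ["Reykjavík", "Akureyri"],
}

# Day.month.year dates such as 3.5.2021 (an assumed local convention).
DATE_PATTERN = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")


def compile_gazetteer(gazetteer):
    """Turn each name list into one case-insensitive whole-word pattern."""
    compiled = {}
    for label, names in gazetteer.items():
        # Longest names first so a longer name wins over a shorter prefix.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        compiled[label] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return compiled


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (text, label) pairs in the order they appear in the input."""
    patterns = compile_gazetteer(gazetteer)
    patterns["DATE"] = DATE_PATTERN
    found = []
    for label, pattern in patterns.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(value, label) for _, value, label in sorted(found)]


sample = "Norðurbanki opened a branch in Reykjavík on 3.5.2021."
for value, label in extract_entities(sample):
    print(label, value)
```

Python's `re` module treats accented letters as word characters by default for `str` patterns, so the word-boundary rules work for non-ASCII names as well.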
1.18 Fairness and Privacy as a Mega-trend
- Data scientists need to be acutely aware of potential sources of bias and have the necessary tools to evaluate the impact of the systems they are building
- Data privacy is important to ensure the end-users’ trust towards the application
- Technical solutions, such as federated learning, can ensure that private data is not shared more widely than necessary
- Data scientists will also need to be aware of privacy best practices
- India’s second wave of COVID-19 provided an environment in which data science was used to create machines that integrated FTIR microscopy and artificial intelligence to analyze infected patients’ samples
- Fourier-transform infrared (FTIR) microscopy is an efficient, rapid, and repeatable process for obtaining spectral fingerprints of biomolecules
- The researchers used a computer-based model trained to recognize different signals from black fungus molecules, which matches each patient’s data against a determined spectrum
- The South Korean government took major preventative measures using real-time analytics for strategy design and patient surveillance
- It uses the data from IoT and AI solutions underlying the live smart cities networks and personal information provided by confirmed patients
- Researchers tracked the patients’ movements, identified their contacts, and predicted the potential outbreak scale in a given region with the help of big data analytics
- The data is also used for drafting preventive measures and instructions
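The federated learning idea mentioned earlier in this section can be sketched as federated averaging: each client trains on its own private data, and only model parameters are sent to the server, which averages them. The sketch below assumes a toy linear model and three hypothetical clients whose data lie exactly on y = 2x + 1. The function names are invented for the example, and the sketch omits what real deployments add on top, such as secure aggregation and differential privacy.

```python
def local_update(weights, data, learning_rate, epochs):
    """Run a few gradient steps of linear regression on one client's private data."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average client parameters, weighted by how much data each client holds."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


# Hypothetical private datasets; every point lies on y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(-1.0, -1.0), (0.5, 2.0)],
]

global_weights = (0.0, 0.0)
for _ in range(300):
    # Raw data stays inside local_update; only the parameters are shared.
    updates = [local_update(global_weights, data, 0.02, 5) for data in clients]
    global_weights = federated_average(updates, [len(data) for data in clients])

print(global_weights)
```

After enough rounds the shared parameters approach w = 2 and b = 1, even though no client ever reveals its individual records to the server.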
1.19 Computer Vision – 3D
- 3D Object/Vehicle Detection
- Humans know that there are different objects everywhere in our physical space; we can pick up a glass of water or catch a thrown ball
- While driving, a system needs to know speed and trajectory and to detect bicycles on the road, people crossing the road, etc.
- 3D Semantic Segmentation
- 3D depth information is semantically labeled per pixel
- It allows a robot to stay on the sidewalk or know when there’s some object in its path
- Human Pose Estimation
- The product can track your hand’s position and even full-body pose in full 3D coordinates
- For motion capture on the move, it can also try to predict the 3D hand pose
- 3D Manipulation and Control
- The robot can sense how someone is looking at you or how you are looking at someone or something
- The best part is that your robot won’t get as tempted as you while you gaze at your favorite ice cream
1.20 MLOps in 2022
- In 2021, most MLOps start-ups focused on tabular data first and then expanded into other data types
- MLOps start-ups show a standard progression in which they master tabular data with their own data governance, data monitoring, ML monitoring, ML platforms, and serving platforms
- In 2022, MLOps will help manage and accelerate the lifecycle of analytics and ML models from development into production
- Start-ups need insights, innovation, and urgency to solve these problems
- The solutions can deliver energy to their enterprise customers who need to get more value from their ML models
- MLOps is a market ripe for private equity investors looking for M&A opportunities and investors looking to get into AI
- It is believed that mid-sized companies will now start investing in and buying this technology, moving up a level in innovation
1.21 Summary
- In this tutorial, we discussed the following topics:
- Current – MLOps, DataOps, CV, NLP
- Future – Data Mesh, Data Fabric, AIaaS