
Blog articles from Mikhail Vladimirov

 

SQL Notebooks in Databricks

August 11, 2022

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3208-programming-on-azure-databricks-with-pyspark-sql-and-scala. In this tutorial, you will learn how to create and use SQL Notebooks in Databricks that enable developers and business users to query data cataloged as tables using standard SQL commands. This tutorial depends on the resour

Implement an AWS Lambda using .NET

March 29, 2022

In this tutorial, you will create an AWS Lambda that exposes an ASP.NET service. The tutorial will follow these steps: Install and configure required tools; Create a New AWS Lambda .NET project; Code the Lambda Function in C#; Upload the Lambda Function to AWS; Invoke the Function using AWS CLI; Test the Function in the AWS Console. Part 1 – Install and configure req

Robust Python Programming Techniques

March 29, 2022

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3174-pragmatic-python-programming. 1.1 Defining Robust Programming: We will define Robust Programming as a collection of assorted programming techniques, methods, practices, and libraries that can help yo

Learning the CoLab Jupyter Notebook Environment

March 29, 2022

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3174-pragmatic-python-programming. Google Colaboratory (CoLab) is a free Jupyter notebook interactive development environment (REPL) hosted in Google’s cloud that we are going to use in this course. In this tutorial, you will learn about the main features of the Goo

Future Trends in Data Science and Data Engineering

February 7, 2022

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3169-data-science-and-data-engineering-in-2022. 1.1 Big Trends in 2021: 2021 was a very incremental year in terms of breakthroughs, with an exponential rise in the demand for data professionals, the r

Defining Data Science for Architects

December 30, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 What is Data Science? Data science focuses on the extraction of knowledge and business insights from data. It does so by leveraging techniques and theorie

Data Visualization in Python for Architects

December 29, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 Why Do I Need Data Visualization? The common wisdom states that: Seeing is believing and a picture is worth a thousand words. Data visual

Introduction to Pandas for Architects

December 29, 2021

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects. 1.1 What is pandas? pandas (https://pandas.pydata.org/) is an open-source library that provides high-performance, memory-efficient, easy-to-use data

An AWS CLI / Node.js Script for Terminating EC2 Instances

June 6, 2017

The AWS Command Line Interface (CLI) is a powerful scripting platform written in Python that uses the AWS Cloud’s RESTful management API for performing various operational tasks, like creating S3 buckets, deleting EBS volumes, etc. In this blog, I will show you how you can terminate EC2 instances from your local computer using AWS CLI wrapped up as a Node.js app. What you need is these four things:

Using k-means Machine Learning Algorithm with Apache Spark and R

January 31, 2017

In this post, I will demonstrate the usage of the k-means clustering algorithm in R and in Apache Spark. Apache Spark (hereinafter Spark) offers two implementations of the k-means algorithm: one is packaged with its MLlib library; the other one exists in Spark’s spark.ml package. While both implementations are currently more or less functionally equivalent, the Spark ML team recommends using the

Spark RDD Performance Improvement Techniques (Post 2 of 2)

October 4, 2016

In this post we will review the more important aspects related to RDD checkpointing. We will continue working on the over500 RDD we created in the previous post on caching. You will remember that checkpointing is a process of truncating an RDD’s lineage graph and saving its materi

Spark RDD Performance Improvement Techniques (Post 1 of 2)

September 13, 2016

Spark offers developers two simple and quite efficient techniques to improve the performance of RDDs and of operations against them: caching and checkpointing. Caching allows you to save a materialized RDD in memory, which greatly improves iterative or multi-pass operations that need to traverse the same data set over and over again (e.g., in machine learning algorithms).

SparkR on CDH and HDP

September 13, 2016

Spark added support for R back in version 1.4.1, and you can use it in Spark Standalone mode. Big Hadoop distros, like Cloudera’s CDH and Hortonworks’ HDP that bundle Spark, have varying degrees of support for R. For the time being, CDH decided to opt out of supporting R (their latest CDH 5.8.x version does not even have sparkR binaries), while HDP (versions 2.3.2, 2.4, … ) includes SparkR as a technical preview technology and bundles some R-related components, like the sparkR script. Making it all

Simple Algorithms for Effective Data Processing in Java

January 30, 2016

The needs of Big Data processing require specific tools which nowadays are, in many cases, represented by the Hadoop product ecosystem. When I speak to people who work with Hadoop, they say that their deployments are usually pretty modest: about 20 machines, give or take. It may account for the fact that most companies are still in the technology adoption phase evaluating this Big Data platform and with time the number of machines in their Hadoop clusters would probably grow into 3- or even 4-di

Spark SQL

December 24, 2015











