1.1 Understanding PolyBase
PolyBase enables a SQL Server instance to query data with T-SQL directly from SQL Server, Oracle, Teradata, MongoDB, Hadoop clusters, and Cosmos DB. Separate installation of client connection software is not needed. You can also use the generic ODBC connector to connect to additional providers using third-party ODBC drivers. PolyBase allows T-SQL queries to join the data from external sources to relational tables in an instance of SQL Server.
1.2 PolyBase Example
1.3 Data Virtualization with PolyBase
With PolyBase, data stays in its original location and format. You can virtualize the external data through the SQL Server instance so that it can be queried in a place like any other table in SQL Server. This process minimizes the need for ETL processes for data movement. The query author does not need any knowledge about the external source.
1.4 PolyBase Use-cases
Query data stored in Hadoop from a SQL Server instance. Query data stored in Azure blob storage. Import data from Hadoop, Azure blob storage, and Azure Data Lake Store. There is no need for a separate ETL or import tool. Export data to Hadoop and Azure blob storage, or Azure Data Lake Store. Integrate with BI tools. Use PolyBase with Microsoft’s business intelligence and analysis stack, or use any third-party tools that are compatible with SQL Server.
1.5 PolyBase and Massive Parallel Processing
You may configure a PolyBase cluster that lets you set up scaled-out groups. Scaled-out groups allow PolyBase to provide enormous parallel processing to handle huge quantities of data at a quick speed.
1.6 Supported SQL Products and Services
- SQL Server 2016 (13.x) and later versions (Windows only)
- SQL Server 2019 (15.x) and later versions (Linux)
- Azure Synapse Analytics
1.7 PolyBase Connectors
SQL Server 2016 (13.x) introduced PolyBase with support for connections to Hadoop and Azure blob storage. SQL Server 2019 (15.x) introduced additional connectors, including SQLServer, Oracle, Teradata, and MongoDB.
1.8 PolyBase vs. Linked Server
1.9 PolyBase Limitations
If the sum of the column schema is greater than 32 KB, PolyBase can’t query the data.
The following data types cannot be used in PolyBase external tables:
geography, geometry, hierarchy id, image, text, nText, xml, Any user-defined type. Oracle synonyms are not supported for usage with PolyBase.1
1.10 Using PolyBase in SQL Server Instance
- Install PolyBase on Windows or Install PolyBase on Linux.
- Starting with SQL Server 2019 (15.x), enable PolyBase in sp_configure, if necessary.
- Create an external data source.
- Create an external table.
1.11 PolyBase Installation
In the installation wizard, select PolyBase Query Service for External Data.
1.12 Enable PolyBase
After the installation, use the following T-SQL command.
exec sp_configure @configname = ‘polybase enabled’, @configvalue = 1;
RECONFIGURE;
Run the following command and verify the return value is 1 to ensure PolyBase is installed.
SELECT SERVERPROPERTY (‘IsPolyBaseInstalled’) AS IsPolyBaseInstalled;
1.13 Create an external data source to reference a SQL Server Instance
Create a database scoped credential.
— Create a Master Key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = ‘S0me!nfo’;
GO
CREATE DATABASE SCOPED CREDENTIAL SqlServerCredentials
WITH IDENTITY = ‘username’, Secret = ‘password’;
GO
Use a named instance to connect to a SQL Server instance
CREATE EXTERNAL DATA SOURCE SQLServerInstance2
WITH (
LOCATION = ‘sqlserver://YOUR_HOST_NAME’ ,
CONNECTION_OPTIONS = ‘Server=%s\YOUR_INSTANCE_NAME’ ,
CREDENTIAL = SQLServerCredentials
) ;
Use a port to connect to a SQL Server instance.
CREATE EXTERNAL DATA SOURCE SQLServerInstance2
WITH (
LOCATION = ‘sqlserver://YOUR_HOST_NAME:YOUR_PORT_NUMBER’ ,
CREDENTIAL = SQLServerCredentials
) ;
1.14 Create an external data source in SQL Server 2019 to reference Oracle
To create an external data source that references Oracle, ensure you have a database scoped credential.
— Create a database master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = ” ;
— Create a database scoped credential with Azure storage account key as the
secret.
CREATE DATABASE SCOPED CREDENTIAL OracleProxyAccount
WITH
IDENTITY = ‘oracle_username’,
SECRET = ‘oracle_password’ ;
Create external data source to reference Oracle.
CREATE EXTERNAL DATA SOURCE MyOracleServer
WITH
( LOCATION = ‘oracle://145.145.145.145:1521’,
CREDENTIAL = OracleProxyAccount
) ;
1.15 Create an external data source to access data in Azure Storage
— Create a database master key if one does not already exist, using your own
password. This key is used to encrypt the credential secret in the next step.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = ” ;
— Create a database scoped credential with Azure storage account key as the
secret.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
IDENTITY = ” ,
SECRET = ” ;
— Create an external data source with CREDENTIAL option.
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH
( LOCATION = ‘wasbs://daily@logs.blob.core.windows.net/’ ,
CREDENTIAL = AzureStorageCredential ,
TYPE = HADOOP
) ;
Windows Azure Storage Blob (WASB) is a file system implemented as an extension built on top of the HDFS APIs
1.16 Create an external data source to reference Hadoop
CREATE EXTERNAL DATA SOURCE MyHadoopCluster
WITH
( LOCATION = ‘hdfs://10.10.10.10:8020’ ,
TYPE = HADOOP ,
RESOURCE_MANAGER_LOCATION = ‘10.10.10.10:8050’
) ;
1.17 Create an external table for SQL Server
CREATE EXTERNAL TABLE dbo.customer(
C_CUSTKEY INT NOT NULL,
C_NAME VARCHAR(25) NOT NULL,
C_ADDRESS VARCHAR(40) NOT NULL,
C_PHONE CHAR(15) NOT NULL,
)
WITH (
LOCATION=’your_remote_database.dbo.customer’,
DATA_SOURCE=SqlServerInstance
);
1.18 Create an External Table for Azure Blob Storage
Create an external file format.
— In this example, the files are comma (,) delimited
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR =’,’,
USE_TYPE_DEFAULT = TRUE)
Create an external table.
— LOCATION: path to file or directory that contains the data
CREATE EXTERNAL TABLE [dbo].[customer] (
[customerId] int NOT NULL,
[customerName] varchar(20) NOT NULL
)
WITH (LOCATION=’/data/’,
DATA_SOURCE = AzureStorage,
FILE_FORMAT = TextFileFormat
);
1.19 Importing Data with CTAS
CTAS is the CREATE TABLE AS SELECT syntax.
For example:
CREATE TABLE Fast_Customers
WITH
(CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH (CustomerKey))
AS
SELECT DISTINCT
Insured_Customers.CustomerKey, Insured_Customers.FirstName,
Insured_Customers.LastName,
Insured_Customers.YearlyIncome, Insured_Customers.MaritalStatus
from Insured_Customers INNER JOIN
(
SELECT * FROM CarSensor_Data where Speed > 35
) AS SensorD
ON Insured_Customers.CustomerKey = SensorD.CustomerKey
1.20 Exporting Data with CETAS
CETAS is the CREATE EXTERNAL TABLE AS SELECT syntax.
For example, archiving/exporting relational data to Azure Blob Storage.
CREATE EXTERNAL TABLE [dbo].[FastCustomers2009]
WITH (
LOCATION=’/archive/customer/2009′,
DATA_SOURCE = AzureStorage,
FILE_FORMAT = TextFileFormat
)
AS
SELECT T.* FROM Insured_Customers T1 JOIN CarSensor_Data T2
ON (T1.CustomerKey = T2.CustomerKey)
WHERE T2.YearMeasured = 2009 and T2.Speed > 40;
1.21 View PolyBase Objects in SSMS
External tables are displayed in External Tables.
External data sources and external file formats are in the External Resources.
1.22 Summary
In this tutorial, you learned the following:
◊ PolyBase
◊ PolyBase vs. Linked Server