I previously talked about PolyBase and its enhancements (see PASS Summit Announcements: PolyBase enhancements).  There is some confusion on PolyBase use cases as they are different depending on whether you are using PolyBase with Azure SQL Data Warehouse (SQL DW) or SQL Server 2016, as well as the sources you are using it against.  The three main use cases for using PolyBase are: Loading data, federating querying, and aging out data.  Here is the support for those three uses cases in SQL DW and SQL Server 2016:

PolyBase in: Parallelize Data Load (Blob and ADLS) Federated Query (push down) HDInsights Federated Query (push down) HDP/Cloudera (local or blob) Federated Query (push down) five new sources* Age Out Data
SQL DW Yes N/A N/A No support for on-prem sources Maybe
SQL Server 2016 Yes via scale-out groups.  Blob, not ADLS N Y (Creates MapReduce job) Y Maybe

* = Teradata, Oracle, SQL Server, MongoDB, generic ODBC (Spark, Hive, Impala, DB2)

For federated queries: “N” requires all data from the source to be copied into SQL Server 2016 and then filtered.  For “Y”, the query is pushed down into the data source and only the results are returned back, which can be much faster for large amounts of data.

I mention “Maybe” for age out data in SQL DW as you can use PolyBase to access the aged-out data in blob or Azure Data Lake Storage (ADLS), but it will have to import all the data so may have slower performance (which is usually ok for accessing data that is aged-out).  For SQL Server 2016, it will have to import the data unless you use HDP/Cloudera, in which case the creation of the MapReduce job will add overhead.

Here are details on what PolyBase supports for each product:

PolyBase (works with)
Azure Blob Store (WASB)
Push Down
Azure Data Lake Store (ADLS)
Push Down
HDI
Push Down
Cloudera (CDH)
Push Down
Horton Works (HDP)
Push Down
SQL 2016
Yes
N/A
No
N/A
No
No
Yes
Yes
Yes
Yes
Azure SQL DW
Yes
N/A
Yes
N/A
No
No
No
No
No
No
APS
Yes
N/A
No
N/A
Yes
Yes (internal region)

No (external)

Yes
Yes
Yes
Yes

Here are some important notes:

  • The file types that PolyBase supports: UTF-8 and UTF-16 encoded delimited text, RC File, ORC, Parquet, gzip, zlib, Snappy.  Not supported: extended ASCII, fixed-file format, WinZip, JSON, and XML
  • Azure SQL Database does not support PolyBase
  • SQL DW recently added PolyBase support for ADLS but does not support compute pushdown
  • ADLS in only in two regions (East US 2, Central US)
  • PolyBase supports row sizes up to 1MB
  • PolyBase can do writes to blob/ADLS and HDFS (using CETAS)
  • PolyBase requires the CREATE EXTERNAL TABLE command
  • PolyBase offers ability to create statistics on tables (but they are not auto-created or auto-updated)

PolyBase parallelized reads for data loading:

  • Supported: in SQL using CTAS or INSERT INTO
  • Not supported: BCP, Bulk Insert, SQLBulkCopy
  • Not supported: SSIS (unless used to call stored procedure containing CTAS)
  • Supported: ADF
    • If source compatible with PolyBase, will directly copy
    • If source not compatible with PolyBase, will stage to Blob
    • If source is ADLS, will still stage to Blob (having to stage to blob will be removed week of 8/20, so PolyBase will copy from ADLS directly to target)

The bottom line is, for SQL DW, think of PolyBase as a mechanism for data loading.  For SQL Server 2016, think of PolyBase for federated querying.

More info:

PolyBase scale-out groups

PolyBase Queries

 

About the Author

James Serra

James is currently a Senior Business Intelligence Architect/Developer and has over 20 years of IT experience. James started his career as a software developer, then became a DBA 12 years ago, and for the last five years he has been working extensively with Business Intelligence using the SQL Server BI stack (SSAS, SSRS, and SSIS). James has been at times a permanent employee, consultant, contractor, and owner of his own business. All these experiences along with continuous learning has helped James to develop many successful data warehouse and BI projects. James has earned the MCITP Business Developer 2008, MCITP Database Administrator 2008, and MCITP Database Developer 2008, and has a Bachelor of Science degree in Computer Engineering.

Start the discussion at forums.toadworld.com