This article first appeared in the SQLServerGeeks Magazine.
Author: Niels Berglund
Subscribe to get your copy.
Data
We are getting more and more data; everything we do generates data. Looking at the volume of data, International Data Corporation (IDC) research estimated that in 2016 we had 16 zettabytes (ZB) of data, where one zettabyte is approximately a billion terabytes, so 16 ZB is a lot! The same research also says that by 2025 we will have 163 ZB worth of data!
Obviously, you or I or our businesses will most likely not have that amount of data, but we will definitely be looking at petabyte (PB) volumes (1 PB = 1024 TB), and we may even have data in the exabyte (EB) range (1 EB = 1024 PB). We are not collecting this data just for fun. The reason for having all this data is that we want to gain a competitive edge by integrating the data into our systems, managing the data and, most importantly, analyzing the data.
How do we do this (integrate, manage, analyze) given that amount of data and the fact that the data comes in all shapes and types: relational data, NoSQL, files, etc.?
SQL Server
I will come back to handling all this data shortly, but before that, let us think about how SQL Server has evolved throughout the years.
From its “humble beginnings” as another flavor of the Ashton-Tate/Sybase product, SQL Server has been evolved by Microsoft over the years:
- SQL Server 7.0 – a re-write of the codebase in C++, introducing User Mode Scheduling (UMS) and a multi-dimensional database called SQL OLAP Services.
- SQL Server 2000 – OLAP Services became Analysis Services.
- SQL Server 2005 – this was a massive release! Microsoft introduced, among other things, SQL Server Integration Services, the XML data type, SQL Server Service Broker, and my favorite: SQLCLR.
- SQL Server 2016 – in this release, Microsoft introduced SQL Server Machine Learning Services (the ability to call R code from T-SQL). Microsoft also added PolyBase, which in this release allowed the creation of external tables against Hadoop data.
- SQL Server 2017 – Microsoft added Python in addition to R in SQL Server Machine Learning Services. Also – another biggie – for the first time: SQL Server on Linux, as well as containers!
- SQL Server 2019 – Java is added in addition to R and Python, and the feature is introduced as the SQL Server Extensibility Framework. PolyBase is greatly enhanced, and we can now create external tables against relational databases, as well as MongoDB and CosmosDB, plus generic ODBC data sources.
From the above, which is by no means complete, we see how Microsoft has evolved SQL Server to be an exceptional relational database and an engine that can do many things related to data.
Coming back to what I asked above about integrating, managing, and analyzing all types of data, Microsoft sees SQL Server as the hub in all this, and the solution is SQL Server 2019 Big Data Cluster.
The idea is that with SQL Server 2019 Big Data Cluster (BDC), we should be able to handle (manage, integrate, analyze) not only relational data but also other types of data (Big Data) and extend SQL Server to handle data in the petabyte range.
SQL Server 2019 Big Data Cluster
With the above in mind, we suspect that a BDC is not your father’s SQL Server; it is an entirely different “beast”.
As the name implies, a BDC is a cluster of “things”. When you deploy the BDC, you deploy:
- Multiple instances of SQL Server.
- Hadoop Distributed File System (HDFS). For storage of “big data”.
- Apache Spark. For analysis of your data.
- Other open-source components, such as Grafana, Elasticsearch, and InfluxDB.
By having SQL Server, HDFS, and Apache Spark in the cluster, you can process and analyze massive amounts of data, not only relational data but also big data.
A difference from a “normal” SQL Server installation is that the SQL Server instances in a BDC are SQL Server on Linux running in containers. All components in a BDC run in containers, and since we now have multiple containers, we need something that manages and orchestrates the containers and their workload. For this, we use Kubernetes. So, when we deploy a BDC, we deploy it into a Kubernetes cluster. The type of Kubernetes cluster does not matter that much. It can be in the cloud or on-prem. I always deploy to Azure and the Azure Kubernetes Service, but as I said – it can essentially be any cluster.
Installing / Deployment
Above I mentioned how the BDC is not your father’s SQL Server, and what you do to install/deploy a BDC drives that home even more.
If you, like me, have worked with SQL Server for a while, you are probably quite familiar with installing SQL Server instances by mounting an ISO file and running setup; well, you can forget all that when you deploy a BDC. Instead, the setup is done via Python scripts or a wizard in Azure Data Studio (ADS)!
I will not go into detail about how to deploy the BDC but instead point you to some resources about BDC deployment:
- How to deploy SQL Server Big Data Clusters on Kubernetes: This is the official Microsoft documentation for deploying a BDC.
- How to Deploy a Big Data Cluster to a Single Node Kubeadm Cluster: A post by my friend and BDC guru Mohammad Darab about deploying to an on-prem single node cluster.
- How to Deploy a Big Data Cluster to a Multi Node Kubeadm Cluster: A second post by Mohammad, this time he deploys to an on-prem multi node cluster.
- How to Deploy SQL Server 2019 Big Data Cluster Using Azure Data Studio: The link here is to a blog post by “yours truly” about how to deploy to Azure using ADS.
Let us now look at the architecture of a BDC.
Architecture
I mentioned above how the various BDC components are deployed as containers into a Kubernetes cluster. When reading documentation around the BDC, you may see a diagram depicting the architecture looking something like so:
Figure 1: BDC Architecture
When you look at Figure 1, what you see to the left of the dotted line are Kubernetes-specific components. In addition to the Kubernetes-specific parts, this side also contains shared services for the cluster, such as Kibana, Grafana, and Elasticsearch.
To the right of Figure 1’s dotted line are the various components of the BDC, and when discussing these components, we often group them into pools based on the function they perform.
Master Pool
The Master pool contains the SQL Server master instance, the instance we communicate with via the TDS protocol. It handles OLTP workloads as well as persistence of relational data, and it is also the instance we go through when we query HDFS and other external data sources.
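To make this a bit more concrete, here is a minimal, hypothetical T-SQL sketch of a query against the master instance that combines relational data with big data. The table names are assumptions, and dbo.web_clickstreams_hdfs is assumed to be a PolyBase external table over HDFS data (see the Storage pool section below for how such a table could be defined):

```sql
-- Hypothetical query against the SQL Server master instance:
-- join an ordinary relational table with an external table over HDFS data.
SELECT TOP (10)
       c.CustomerID,
       c.CustomerName,
       COUNT(*) AS ClickCount
FROM   dbo.Customers AS c                -- relational table in the master instance
JOIN   dbo.web_clickstreams_hdfs AS cs   -- PolyBase external table over HDFS
       ON cs.wcs_user_sk = c.CustomerID
GROUP BY c.CustomerID, c.CustomerName
ORDER BY ClickCount DESC;
```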
Compute Pool
The Compute pool pod looks very similar to the Master pool pod; it contains one or more SQL Server instances. The Compute pool provides scale-out computational resources for a SQL Server BDC, and its SQL Server instance(s) are used to offload computational work, or intermediate result sets, from the SQL Server master instance. For those of you who have worked with PolyBase before: the Compute pool is essentially a fully configured PolyBase scale-out group.
Data Pool
The Data pool has, by default, two SQL Server instances installed. The reason is that the Data pool acts as a persistence and caching layer for external data. The Data pool thus allows for performant querying of cached data from external data sources, as well as offloading of work from the master instance.
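As a hedged sketch of what persisting data into the Data pool can look like, here is T-SQL loosely based on the pattern in the official BDC documentation. The table and column names are hypothetical, and the sqldatapool:// location string is the one the documentation uses for the controller service:

```sql
-- Data source pointing at the BDC Data pool (location follows the pattern
-- used in the official BDC documentation).
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SqlDataPool')
    CREATE EXTERNAL DATA SOURCE SqlDataPool
    WITH (LOCATION = 'sqldatapool://controller-svc/default');
GO

-- Hypothetical external table, distributed across the Data pool's SQL Server instances.
CREATE EXTERNAL TABLE dbo.web_clickstreams_data_pool
(
    wcs_user_sk    BIGINT,
    wcs_item_sk    BIGINT,
    wcs_click_date DATE
)
WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);
GO

-- Cache (persist) data into the Data pool, e.g. from a staging table
-- in the master instance; queries can now hit the cached copy.
INSERT INTO dbo.web_clickstreams_data_pool
SELECT wcs_user_sk, wcs_item_sk, wcs_click_date
FROM   dbo.web_clickstreams_staging;
```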
Storage Pool
In the Storage pool, we have HDFS and Apache Spark. The Storage pool is where we store “Big Data”. In Figure 1, you see a green “blob” underneath the Storage pool block, looking like files. That represents the ability to mount external HDFS data sources into the Storage pool cluster. You access the data stored in the Storage pool via the SQL Server master instance (in the Master pool) and PolyBase external tables.
Above, I mentioned how we have HDFS and Apache Spark in the Storage pool, but in Figure 1, there is also a SQL Server instance. The reason for having a SQL Server instance in the Storage pool is that the BDC uses the SQL Server instance(s) to optimize the access of the data stored in the HDFS data nodes.
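To illustrate, here is a hedged T-SQL sketch of defining an external table over files in the Storage pool’s HDFS. The /clickstream_data path and the column names are hypothetical, and SqlStoragePool is the data source name the BDC documentation uses for the cluster’s own HDFS:

```sql
-- Data source pointing at the BDC's own HDFS (Storage pool); the sqlhdfs://
-- location string follows the pattern in the official BDC documentation.
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default');
GO

-- File format for the (hypothetical) CSV files stored in HDFS.
CREATE EXTERNAL FILE FORMAT csv_file
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2)
);
GO

-- External table over a hypothetical HDFS directory in the Storage pool.
CREATE EXTERNAL TABLE dbo.web_clickstreams_hdfs
(
    wcs_user_sk    BIGINT,
    wcs_item_sk    BIGINT,
    wcs_click_date DATE
)
WITH (
    DATA_SOURCE = SqlStoragePool,
    LOCATION    = '/clickstream_data',
    FILE_FORMAT = csv_file
);
GO

-- The HDFS data can now be queried with plain T-SQL via the master instance.
SELECT TOP (10) * FROM dbo.web_clickstreams_hdfs;
```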
App Pool
The BDC allows for deployment of applications into the BDC, and any applications deployed end up in the App pool. The advantage of deploying into a BDC is that the application(s) benefit from the computational power of the cluster and can access the data that is available on the cluster.
Integrating & Managing Data
With the capabilities the BDC provides, you have a great degree of flexibility around how you deal with your data:
- PolyBase gives you the ability to query external data sources without moving or copying the data (a hedged T-SQL sketch of this follows the list).
- The Storage pool and HDFS allow you to store big data. Once the data is stored, you can query it via PolyBase.
- You can create data marts across cached data in the Data pool.
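As a hedged illustration of the first point, here is what querying an external SQL Server instance via PolyBase, without moving the data, could look like; the server name, credential, database, and table are all assumptions:

```sql
-- A database master key is required before creating a scoped credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<a strong password>';
GO

-- Hypothetical credential for the remote (external) SQL Server instance.
CREATE DATABASE SCOPED CREDENTIAL sales_cred
WITH IDENTITY = 'sales_reader', SECRET = '<password>';
GO

-- Hypothetical external data source pointing at another SQL Server instance.
CREATE EXTERNAL DATA SOURCE SalesSrv
WITH (
    LOCATION   = 'sqlserver://sales-srv.contoso.com:1433',
    CREDENTIAL = sales_cred
);
GO

-- External table mapped to a table in the remote database; no data is copied.
CREATE EXTERNAL TABLE dbo.Orders_external
(
    OrderID    INT,
    CustomerID INT,
    OrderDate  DATE
)
WITH (DATA_SOURCE = SalesSrv, LOCATION = 'SalesDB.dbo.Orders');
GO

-- Queries run against the remote data, combined with local tables as needed.
SELECT COUNT(*) FROM dbo.Orders_external;
```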
Analyzing Data
The BDC allows you to perform AI and machine learning tasks on the data stored in both SQL Server and HDFS. For analysis and ML, you can use Apache Spark as well as the AI and ML tools in SQL Server Machine Learning Services:
Figure 2: AI/ML Flow in SQL Server 2019 Big Data Cluster
In Figure 2, which is a picture from the official Microsoft BDC documentation, we see:
- Data ingested into the Storage pool (HDFS) and the Data pool via Spark streaming and/or Apache Kafka.
- Creating machine learning/AI models using Spark, Spark ML, and SQL Server Machine Learning Services (a hedged sketch of the Machine Learning Services route follows the list).
- Persisting those models to the SQL Server instance in the Master pool.
- Serving the models from the Master pool.
- Using applications and BI tools to analyze the data.
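As a hedged, minimal sketch of the Machine Learning Services route, here is T-SQL that trains a toy scikit-learn model via sp_execute_external_script and returns it serialized, so it could then be persisted to a table in the Master pool. The input table, column names, and model choice are all assumptions:

```sql
-- Train a hypothetical model via SQL Server Machine Learning Services (Python)
-- and return it serialized, ready to be stored in a table in the Master pool.
EXEC sp_execute_external_script
     @language = N'Python',
     @script   = N'
import pickle
import pandas
from sklearn.linear_model import LogisticRegression

# InputDataSet is the pandas DataFrame produced by @input_data_1
model = LogisticRegression().fit(
    InputDataSet[["wcs_item_sk"]], InputDataSet["clicked"])

# Return the pickled model as a single-row, single-column result set
OutputDataSet = pandas.DataFrame({"model": [pickle.dumps(model)]})
',
     @input_data_1 = N'SELECT wcs_item_sk, clicked FROM dbo.click_training_data'
WITH RESULT SETS ((model VARBINARY(MAX)));
```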
And all of the above comes out of the box with SQL Server 2019 Big Data Cluster!
Summary
SQL Server 2019 Big Data Cluster allows us to integrate, manage, and analyze data in the petabyte range. It enables us to integrate relational, semi-structured, and unstructured data. When deploying a BDC, you get out of the box:
- SQL Server.
- Hadoop Distributed File System (HDFS).
- Apache Spark.
The BDC allows you to read, write, and process big data from Transact-SQL or Spark, allowing you to easily combine and analyze your high-value relational data with high-volume big data.