Latest Internet Order News
Mar 22, 2021
| Amazon Web Services 14 minutes read AWS Glue is fully managed service that makes it easier for you to extract, transform, and load (ETL) data for analytics. You can easily create ETL jobs to connect to backend data sources. There are several natively supported data sources , but what if you need to extract data from an unsupported data source? What if the data resides in an online analytical processing (OLAP) cube database such as Microsoft SQL Server with SQL Server Analysis Services (SSAS)? When you have data stored in an SSAS OLAP cube database backed by SQL Server and you want to analyze this data on the AWS Cloud, you often have to directly interface with the SQL data sources. Invoking analytical queries directly against SQL Server can get expensive and can often result in performance degradation. AWS Cloud provides tools like Amazon SageMaker notebooks or Amazon EMR notebooks for data analysts and data scientists to analyze data at scale. These tools can directly interface with SQL Server, but performance and cost can become a bottleneck when analyzing and pulling large amounts of data across wire. Additionally, it’s challenging to work with SSAS OLAP cube databases directly because you have to use Multidimensional Expressions (MDX) to query the necessary data. Third-party connectors are available for interfacing with an SSAS OLAP cube, but they add to the overall cost. In this post, we demonstrate how to overcome these challenges. We go over a simple solution that allows us to connect to the SSAS OLAP cube database and use standard SQL statements to query the multidimensional data. Let’s first review the architecture. Architecture The following diagram shows the overall architecture to address our two challenges. We use SQL Server views to access data from the SSAS OLAP cube. Views allow us to flatten the three-dimensional OLAP cube data into a two-dimensional table structure form, which can then be easily accessed by traditional JDBC clients such as AWS Glue. You can connect to SSAS over HTTP and run MDX queries directly. However, it’s challenging to set up SSAS with HTTP connectivity. Performance of MDX queries over HTTP can often become a challenge. Securely exposing OLAP cubes over HTTP is another pain point. Moreover, the client side must either use SSAS libraries or speak the SSAS wire protocol, SOAP or XML/A. Open-source libraries can address these shortcomings, but some of these libraries aren’t well maintained. The latest SSAS native Windows client libraries are still in beta preview. Instead of connecting directly to the SSAS cube, it’s much simpler to interface with SQL Server and use it as a proxy to run MDX queries on the SSAS cube database, as shown in our architecture. SQL Server allows us to do just that using the OPENQUERY feature. With OPENQUERY, we can run MDX statements on a linked server; in our case we link SSAS service to SQL Server. To make our lives easier and to simplify our SQL statements, we can create SQL views. These SQL views are specifically constructed to run MDX queries against the OLAP cube database. The OPENQUERY construct in SQL Server takes in the name of a linked SQL Server and runs an arbitrary query. In our case, the linked server is SSAS and the arbitrary query is MDX query. We review how to create linked servers in the following section. But first, let’s look at some limitations of the architecture. We can build as many views as needed by using SQL views as an intermediate step to extract OLAP cube data. Each SQL view can be tied to a specific MDX query. Keep in mind the following limitations: OPENQUERY limits the size of the query to 8,000 characters. Therefore, when building large complex MDX extraction queries, be mindful of the query size limitation. You can’t pass parameters to OPENQUERY when building the view. An alternative is to use multiple views that run the necessary MDX query with different parameters. Setting up SQL Server To demonstrate the multidimensional data extraction process from SQL Server Analysis Services, the OLAP cube, we use the AdventureWorks sample database and accompanying AdventureWorks Analysis Services models . If you don’t have a SQL Server installation, use the SQL Server on AWS Quick Start to stand up SQL Server in your AWS account on Amazon Elastic Compute Cloud (Amazon EC2) instances. As part of this installation, SQL Server Management Studio (SSMS), a graphical database administration tool, is also installed. Be mindful of the running cost that the instances incur. And remember to delete the Quick Start AWS CloudFormation stacks. We also have to perform additional networking and security configuration steps to let AWS Glue connect to SQL Server running on an EC2 instance; we cover the necessary steps in a later section. Set up the AdventureWorks database by downloading and restoring a full backup copy of the database. For instructions, see AdventureWorks Readme . When the AdventureWorks database is restored, set up the AdventureWorks for Analysis Services multidimensional model. For instructions, see AdventureWorks for Analysis Services . Finally, create a linked server in SQL Server that points to the SSAS instance you just set up. Use the SSMS option to create the linked server as documented. As shown in the following screenshot, name your linked server SSAS1. Provide a data source name that points to the server instance where SQL Server Analysis Services is running. We use this linked server in the subsequent sections when creating SQL views that run MDX queries against the SQL Server Analysis Services. Preparing the source data Before we can create a SQL view using OPENQUERY, we need to determine what to query from the OLAP cube. For demonstration purposes, we query the AdventureWorks multidimensional model to give us list of orders placed over fiscal years in Canada. We break down the orders by two types: Internet orders and all other type of orders. To build the MDX query, let’s start by connecting to the AdventureWorks multidimensional cube. Choose the cube and choose Browse. This opens the MDX query builder. On the drop-down menu, choose Select dimension. Add the following properties: Filter Express – Canada Next, let’s add the measures that help us select the correct data from the cube while applying our dimension filter. Under Measures, choose Internet Orders. Choose Internet Order Count and choose Add to Query. Under Measures, choose Sales Orders. Choose Order Count and choose Add to Query. Under Date, choose Fiscal. Choose Fiscal Year and choose Add to Query. Great! We have our sample MDX working, which generates a breakdown of AdventureWorks orders placed over the course of fiscal years and broken down by type of orders. But what does the MDX query look like? We need the actual MDX query to create a SQL view using OPENQUERY. You can easily get the MDX query source code by choosing Design Mode in the query builder window. The following code is the MDX query: SELECT NON EMPTY [Measures]. [Internet Order Count], [Measures]. [Order Count] ON COLUMNS, NON EMPTY ([Date]. [Fiscal Year]. [Fiscal Year].ALLMEMBERS ) DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWSFROM ( SELECT ( [Sales Territory]. [Sales Territory Region].& ) ON COLUMNS FROM [Adventure Works])WHERE ( [Sales Territory]. [Sales Territory Region].& )CELL PROPERTIES VALUE, BACK_COLOR, FORE_COLOR, FORMATTED_VALUE, FORMAT_STRING, FONT_NAME, FONT_SIZE, FONT_FLAGS Let’s now construct a SQL view using OPENQUERY and the linked server reference we created earlier and call it SSAS1. While still in SSMS, ensure that you’re connected to SQL Server. Open a new query window and create a view v_sales_dim using the MDX query generated in previous step: CREATE view v_sales_dim as SELECT * FROM OPENQUERY( SSAS1, ' SELECT NON EMPTY [Measures]. [Internet Order Count], [Measures]. [Order Count] ON COLUMNS, NON EMPTY ([Date]. [Fiscal Year]. [Fiscal Year].ALLMEMBERS ) DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS FROM ( SELECT ( [Sales Territory]. [Sales Territory Region].& ) ON COLUMNS FROM [Adventure Works]) WHERE ( [Sales Territory]. [Sales Territory Region].& ) CELL PROPERTIES VALUE, BACK_COLOR, FORE_COLOR, FORMATTED_VALUE, FORMAT_STRING, FONT_NAME, FONT_SIZE, FONT_FLAGS ') We’re ready to query the two-dimensional data from the view we just created using the MDX query from a multidimensional cube. Run the select * from v_sales_dim SQL statement in another query window. The following output shows that the column names still contain the measures and dimensions we set in the MDX query. We rename the columns in the next step. Although there’s no harm in leaving the column names as is, it can be slightly confusing to extract data using an AWS Glue job or AWS Glue crawler via JDBC connection. Let’s create another view that is based on v_sales_dim but renames the columns to something more meaningful. Open another query window in SSMS, and run the following SQL statement: CREATE view v_sales as SELECT "[Date]. [Fiscal Year]. [Fiscal Year]. [MEMBER_CAPTION]" as fiscal_year, "[Measures]. [Internet Order Count]" as internet_orders, "[Measures]. [Order Count]" as orders FROM v_sales_dim Finally, run the select * from v_sales SQL statement in the same query window. The following screenshot shows that it produces a much better output. Now our SQL view is ready for an AWS Glue job or crawler. This view runs MDX queries to fetch data from the OLAP cube, as we discussed earlier. Setting up AWS Glue In this section, we cover the steps necessary to set up AWS Glue to extract data from the v_sales view we created in the previous section. We complete the following steps: Create an Amazon Simple Storage Service (Amazon S3) bucket to store our extracted data. Configure the networking and security required to run AWS Glue jobs. Create an AWS Glue database. Create and test the AWS Glue JDBC connection to SQL Server. Create an AWS Glue crawler. Run the crawler to hydrate the AWS Glue Data Catalog table, which is used in the AWS Glue job as the source table for extracting data from SQL Server. Use AWS Glue Studio to create and run the AWS Glue job. The job uses the Data Catalog source table. Run the job using AWS Glue Studio. Finally, use Amazon Athena to verify the contents of the Parquet files created as a result of running the AWS Glue job. We go over each step in detail in the following sections. The goal is to extract data from the SQL Server view v_sales and store it in the S3 bucket. We demonstrate how to do this in AWS Glue. Creating your S3 bucket First and foremost, we must create the S3 bucket where we store extracted data from the SQL Server view. S3 bucket names are unique across AWS Regions. Therefore, use a unique name for your S3 bucket where you intend to ingest data. We refer to this bucket as <YOUR_S3_OUTPUT_BUCKET> for the remainder of this post. To create your S3 bucket, complete the following steps: On the Amazon S3 console, choose Buckets. Choose Create bucket. Enter your unique bucket name. Pick a Region for your bucket (the same Region where you intend to perform the remainder of these steps). Select Block all public access. Disable bucket versioning Configuring networking and security For AWS Glue to successfully communicate with our SQL Server instance, running inside a VPC, we must configure network plumbing. Our assumption is that SQL Server is running in the same AWS account where we run the AWS Glue jobs and crawlers. Network and security configuration steps necessary to access SQL Server that is running on premises, for example, are beyond the scope of this post. SQL Server running on EC2 instances has security groups associated with it. If you used SQL Server on AWS Quick Start to stand up SQL Server on EC2 instances, complete the next step. On the Amazon EC2 console, search for the group SQL-WSFC-ADStack-XXX-WSFCSecurityGroup-YYYY and add a rule allowing inbound connection from the security group glue-mdx-blog-sg to port 1433. This rule allows AWS Glue to access SQL Server. Next, we must create an AWS Identity and Access Management (IAM) role that the AWS Glue jobs and crawlers assume when running. These roles allow jobs and crawlers to read and write to the S3 bucket. Create an IAM service policy called glue-mdx-blog-policy. You can use the policy given in the AWS Glue documentation as a template to create glue-mdx-blog-policy; make sure to copy the policy permissions as shown in the documentation. Create an IAM role named glue-mdx-blog-role. You can use sample role in the AWS Glue documentation as a template to create glue-mdx-blog-role. Assign the policy document glue-mdx-blog-policy to this new role, glue-mdx-blog-role. Alternatively, rather than assigning the custom policy document glue-mdx-blog-policy to glue-mdx-blog-role role, you can simply assign the AWSGlueServiceRole policy to glue-mdx-blog-role. Grant permissions to glue-mdx-blog-role to write to the S3 bucket. The following includes the S3 policy permissions. You can create another policy document for this policy or use an inline policy approach to achieve this. "Version": "2012-10-17", "Statement": [ "Action": [ "s3:GetObject", "s3:DeleteObject", "s3:PutObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::<YOUR_S3_OUTPUT_BUCKET>", "arn:aws:s3:::<YOUR_S3_OUTPUT_BUCKET>/*" ], "Effect": "Allow" ] Creating an AWS Glue database Create your AWS Glue database as follows: On the AWS Glue console, choose Databases. Choose Add database. Choose Create. Creating and testing the JDBC connection We use this JDBC connection in both the AWS Glue crawler and AWS Glue job to extract data from the SQL view. To create your AWS Glue connection, complete the following steps: On the AWS Glue console, under Databases, choose Connections. Choose Add connection. For Connection type¸ choose JDBC. Choose Next. For JDBC URL, enter a connection string in the format jdbc:sqlserver://<sql_server>:<sql_server_port>;databaseName=<db_name>. Use the following values: <sql_server> – The IP address or DNS name of SQL Server <sql_server_port> – The port number where SQL Service is running (default port is 1433) <db_name> – The name of database (AdventureWorksDW) For example, the URL might look like jdbc:sqlserver://10.0.0.100:1433;databaseName=AdventureWorksDW. For more information about defining connections, see AWS Glue Connection Properties . For Username and Password, enter your database login credentials. For VPC, choose the VPC to put the AWS Glue ENI. For Subnet, choose the private subnet where SQL Server is deployed. For Security groups, choose glue-mdx-blog-sg. Choose Next. Check your configuration and choose Finish. Now that our JDBC connection is created, let’s test to make sure we can successfully connect to SQL Server. On the AWS Glue console, under Databases, choose Connections. Choose mssql-glue-connection. The provided JDBC URL is correct The username and password for connection to SQL Server are correct The database server is available from the subnet where the AWS Glue ENI is created The security group of the database server allows connection from the security group associated with the AWS Glue ENI Creating an AWS Glue crawler An AWS Glue crawler is an easy and quick way to populate the AWS Glue Data Catalog. You can then use the Data Catalog in AWS Glue jobs to perform ETL operations, and with Athena and Amazon Redshift to query the extracted data. Let’s create a crawler that populates the Data Catalog by creating a new table with the schema extracted from our SQL view. Our SQL view v_sales is relatively simple, so running a crawler to build this table is overkill. However, in real-world scenarios, database tables can have hundreds of fields; building a representation of those tables manually in the Data Catalog is cumbersome and error-prone. This is where an AWS Glue crawler is helpful, because it builds the table structure automatically by introspecting the backend data source, which saves time and prevents errors. Create an AWS Glue crawler as follows: On the AWS Glue console, chose Crawlers. Choose Add crawler. Running the AWS Glue crawler Now that our crawler is created, let’s run it to create the Data Catalog table. The crawler uses the JDBC connection we created and extracts the table structure from the v_sales SQL view. On the AWS Glue console, choose Crawlers. Select sales-table-crawler. Choose Run crawler. The crawler takes some time to run. When it’s complete, the value in Tables created changes from 0 to 1. Under Databases, choose Tables to see the new table, adventureworksdw_dbo_v_sales. Creating and running the AWS Glue job Now we create an AWS Glue job to test our configuration. The job uses the JDBC data source we created to run SQL statements, extract the data, and save it to an S3 bucket in Parquet format. We use AWS Glue Studio to create our AWS Glue job. AWS Glue Studio is a new graphical interface for creating and managing AWS Glue jobs. On the AWS Glue console, choose AWS Glue Studio. Choose Create and manage jobs. Select Source and target added to the graph. For Source, choose RDS. Choose Create. Your new Glue job graph should look like the following in the AWS Glue Studio UI. We need to configure our AWS Glue job next. Let’s start by configuring the data source. Choose the first component of your graph, JDBC connection. For Database, choose glue-mdx-blog-database. Choose the third component of your graph, S3 bucket. For Format, choose Glue Parquet. For Compression Type, choose None. If you want to compress data before storing it to Amazon S3, you can choose a compression type later. For S3 Target Location, choose Browse S3 and choose the target S3 bucket, <YOUR_S3_OUTPUT_BUCKET>, you created. In addition to storing the extracted data to S3 bucket in Parquet format, we will create Glue Data Catalog table, test-s3-parquet-table, that will point to the Parquet file. This will help us validate the output data in next section. For Data Catalog Update Option choose Create a table in the Data Catalog and on subsequent runs, keep existing schema and add new partitions option: Select glue-mdx-blog-database as a Database Enter test-s3-parquet-table as a Table Name Press Save Select mdx-parquet-job. Choose Run job. AWS Glue Studio can monitor running jobs. View the job status on the Monitoring page, in the Job runs section. Continue to monitor the job status until the status changes to Succeeded. The job creates Parquet files in the S3 bucket, <YOUR_S3_OUTPUT_BUCKET>. You can locate the files in your output bucket on the Amazon S3 console. Congratulations, you have successfully extracted data from SQL Server Analysis Services using a SQL view to an S3 bucket in Parquet format. Verifying the Parquet files Finally, let’s verify the contents of the Parquet files we created in the last step. We can do this in two ways: Athena or Amazon S3 Select. Although Amazon S3 Select provides a quick way to view the contents of a Parquet file, it’s possible that the Parquet file we created is split across multiple parts. Each part may or may not contain data. Athena gives us a better view of data stored across multiple Parquet files with same schema. Athena operates on a table inside the AWS Glue Data Catalog. Note that we created a new AWS Glue Data Catalog table, test-s3-parquet-table, to store output from AWS Glue Job. Let us query this table using Athena to view generated Parquet data: On the Athena console, for Database, choose glue-mdx-blog-database. For Table, choose test-s3-parquet-table.