Category: Polybase
External tables vs T-SQL views on files in a data lake
Feed: James Serra's Blog. Author: James Serra. A question that I have been hearing recently from customers using Azure Synapse Analytics (the public preview version) is: what is the difference between using an external table and a T-SQL view on a file in a data lake? Note that a T-SQL view and an external table pointing to a file in a data lake can be created in both a SQL Provisioned pool and a SQL On-demand pool. Here are the differences that I have found: Overall summary: views are generally faster and have more features such as OPENROWSET Virtual ... Read More
Query options in Azure Synapse Analytics
Feed: James Serra's Blog. Author: James Serra. The public preview version of Azure Synapse Analytics has three compute options and four types of storage that it can access (mentioned in my blog at SQL on-demand in Azure Synapse Analytics). This gives twelve possible combinations of querying data. Not all of these combinations are currently supported, and some have a few quirks, which I list below. (NOTE: I’ll demo these features at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/10 (session info), and Big Data Conference Europe on ... Read More
Ways to access data in ADLS Gen2
Feed: James Serra's Blog. Author: James Serra. With data lakes becoming popular, and Azure Data Lake Store (ADLS) Gen2 being used for many of them, a common question I am asked is “How can I access data in ADLS Gen2 instead of a copy of the data in another product (e.g. Azure SQL Data Warehouse)?”. The benefits of accessing ADLS Gen2 directly are less ETL and less cost, and it is useful to see if the data in the data lake has value before making it part of ETL, for a one-time report, for a data scientist who wants to use the data to ... Read More
Big Data Workshop
Feed: James Serra's Blog. Author: James Serra. A challenge I have with customers who want to get hands-on experience with the Azure products that are found in a modern data warehouse architecture is finding a workshop that covers many of those products. To the rescue is a workshop created by my Microsoft colleagues Fabio Braga and Rod Colledge, explained in their blog post Azure Data Platform End2End, with the GitHub repository located here. This is an on-demand workshop with labs that you can run at any time. The idea of this workshop is to give experienced BI professionals (but new to ... Read More
Where should I clean my data?
Feed: James Serra's Blog. Author: James Serra. As a follow-up to my blogs What product to use to transform my data? and Should I load structured data into my data lake?, I wanted to talk about where you should clean your data when building a modern data warehouse in Azure. As an example, let’s say I have an on-prem SQL Server database and I want to copy one million rows from a few tables to a data lake (ADLS Gen2) and then to Azure SQL DW, where the data will be used to generate Power BI reports (for background on a ... Read More
What product to use to transform my data?
Feed: James Serra's Blog. Author: James Serra. If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which ... Read More
Should I load structured data into my data lake?
Feed: James Serra's Blog. Author: James Serra. With data lakes becoming very popular, a question I have been hearing often from customers is, “Should I load structured/relational data into my data lake?”. I talked about this a while back in my blog post What is a data lake? and will expand on it in this blog. Melissa Coates also talked about this recently, and I used her graphic below to illustrate: I would not say it’s commonplace to load structured data into the data lake, but I do see it frequently. In most cases it is not necessary to first ... Read More
SQL Server 2019 Big Data Clusters
Feed: James Serra's Blog. Author: James Serra. At the Microsoft Ignite conference, Microsoft announced that SQL Server 2019 is now in preview and that SQL Server 2019 will include Apache Spark and Hadoop Distributed File System (HDFS) for scalable compute and storage. This new architecture, which combines the SQL Server database engine, Spark, and HDFS into a unified data platform, is called a “big data cluster” and is deployed as containers on Kubernetes. Big data clusters can be deployed in any cloud where there is a managed Kubernetes service, such as Azure Kubernetes Service (AKS), or in on-premises Kubernetes clusters, such as AKS on ... Read More
Data Virtualization vs. Data Movement
Feed: James Serra's Blog. Author: James Serra. I have blogged about Data Virtualization vs Data Warehouse and wanted to blog on a similar topic: Data Virtualization vs. Data Movement. Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users. Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading. If you are building a data warehouse, should you ... Read More
Is the traditional data warehouse dead?
Feed: James Serra's Blog. Author: James Serra. There have been a number of enhancements to Hadoop recently when it comes to fast interactive querying, with products such as Hive LLAP and Spark SQL being used over slower interactive querying options such as Tez/YARN and batch processing options such as MapReduce (see Azure HDInsight Performance Benchmarking: Interactive Query, Spark and Presto). This has led to a question I have started to see from customers: Do I still need a data warehouse or can I just put everything in a data lake and report off of that using Hive LLAP or Spark ... Read More