In this long read we explore how SAS can be used for various functions in the development and maintenance of data warehouses, thanks to its reputation of being able to handle ‘big data’. This long read is sampled from Monika Wahi’s book Mastering SAS Programming for Data Warehousing.
SAS in a Data Warehouse Environment
SAS – the acronym stands for ‘Statistical Analysis System’ – has a long and illustrious history. Development of the first version began in 1966, in the days of punched card storage and mainframe computers. As computing hardware became more powerful, so datasets grew. SAS in turn evolved to harness greater computational power and meet the data challenge, but the original concepts of data steps and proc steps endured. Then, in the 1990s, SQL (Structured Query Language) databases emerged and SAS found itself with some serious competition.
Here we look at the SAS response to the SQL challenge, and examine how SAS lends itself to use today in data warehouse settings.
SQL was aimed at data maintenance and storage using relational tables rather than flat files. SQL approaches only became possible in the 1990s due to technology developments that allowed faster data processing.
SQL languages accomplish the same data editing tasks that data steps do in SAS, but they use a different approach. Unlike SAS, which is a procedural language, SQL is a declarative language:
- In SAS, the programmer must program a data step to do the procedures in the most efficient way to optimize data handling.
- In SQL, the programmer declares what query output they desire using easy-to-understand, simple quasi-English statements, and an optimization program (or optimizer) running in the background figures out the most efficient way to execute the query.
While using efficient code in SQL can still improve performance, the efficiency of SQL is less dependent on the programmer’s code and more dependent on the database’s query optimizer. Hence, maintaining data in a database became easier using SQL than it was using SAS’s data steps. In SQL, programmers had to learn a few basic commands that could perform a variety of tasks when used together. But with SAS data steps, programmers needed to study a broad set of commands, and they had to learn the most efficient way to assemble those commands to achieve optimal SAS performance.
What SQL cannot do, however, is analyze data the way SAS can. In the later 1990s and early 2000s, with many databases being stored and maintained in SQL, the SAS/Access feature made it possible to use SAS on them for analysis.
Using PROC SQL
As SQL became more established, so more programmers were trained in it. These SQL programmers had some trouble transferring their skills to use in SAS data steps, so SAS responded by developing a SQL language within SAS: PROC SQL.
PROC SQL has the following features:
- It is a language within SAS, in that PROC SQL code starts with a PROC SQL statement and ends with a quit statement.
- It uses SQL commands, such as CREATE TABLE and SELECT with GROUP BY and WHERE.
- It allows the user to control the use of processors during execution through the THREADED option.
- It includes a WHERE clause and other clauses that uses indexes if they are available.
- Unlike other SQLs, it does not have an independent optimizer program, so creating optimized code is important.
Like SQL, PROC SQL is much easier to use than data step language for certain common tasks. One particularly useful task that is much easier in PROC SQL is creating a VIEW of the data, which allows the user to look at a particular section of the dataset.
Imagine we wanted to view the data in our example dataset, Chap_1_1_Infile, but we only wanted to look at the data for women (SEX1 = 2) who live in Massachusetts (_STATE = 25). We could use this PROC SQL code:
PROC SQL; Select * from Chap_1_1_Infile where SEX1 = 2 and _STATE = 25; quit;
This code produces output in the following structure (with just the first three rows provided):
Output from PROC SQL
To get similar output using SAS commands, the following PROC PRINT code could be used. Note that all variables are displayed, in the order stored in the dataset, since the VAR statement is excluded:
PROC PRINT DATA=Chap_1_1_Infile; where SEX1 = 2 and _STATE = 25; RUN;
But imagine we did not want to return all the variables – assume we only wanted to return age (_AGE80) and BMI (_BMI5). We could easily replace the asterisk in our PROC SQL code to specify only those two columns:
PROC SQL; Select _AGE80, _BMI5 from Chap_1_1_Infile where SEX1 = 2 and _STATE = 25; quit;
In PROC PRINT, to achieve the same output, we would add a VAR statement to our previous code:
PROC PRINT DATA=Chap_1_1_Infile; where SEX1 = 2 and _STATE = 25; var _AGE80 _BMI5; RUN;
Even in this short example, it is easy to see how SAS PROCs and data steps are more complicated than SQL commands because SQL has fewer, more modular commands. By contrast, SAS has an extensive toolset of commands and options that, when understood and used wisely, can achieve just about any result with big data.
SAS in today’s warehouse environment
While PROC SQL appears to be a workaround from learning a complicated data step language, this is not the case in data warehousing. Because of the lack of optimization of PROC SQL, in many environments it is very slow and can only feasibly be used with smaller datasets. Even today, when transforming big data in SAS, in most environments it is necessary to use data step language, and this affords the programmer an opportunity to develop optimized code, as efficiency is always important when dealing with data in SAS.
However, when interfacing with another database management system (DBMS) where native data are stored in SQL, SAS PROC SQL might be more useful. In his recent white paper on working with big data in SAS, Mark Jordan describes various modern approaches to improving the processing efficiency of both PROC SQL and SAS data steps in both server SAS environments and environments where SAS is used as the analysis engine, connecting to a non-SAS DBMS through SAS/Access.
Jordan describes two scenarios for big data storage and SAS:
- Using a modern server SAS set up: Server SAS comes with its own OS, and Base SAS version 9.4 includes its own DS2 programming language. These can be used together to create threaded processing that can optimize data retrieval.
- Using SAS for analysis connected to non-SAS data storage: In this setup, SAS/Access is used to connect to a non-SAS DBMS and pull data for analysis into the SAS application. This can create a lag, but if SAS and the DBMS are co-located together and the DBMS can use parallel processing, speed can be achieved.
Ultimately, the main bottleneck in SAS processing has to do with I/O, so the easier it is for the SAS analytic engine to interact with stored data, the faster processing will be. But even in this modern era, limitations surrounding data I/O continue to compel SAS users to develop efficient code.
Jordan provides the following tips for thinking about coding for a SAS data warehouse:
- Use WHERE instead of IF wherever possible (due to its increased processing efficiency).
- Reduce columns retained to just the native and transformed variables needed in the warehouse.
- Using the options SASTRACE and SASTRACELOC will echo all the SQL generated to the SAS log file, which can be useful for performance tuning.
- Use PROC SQL and data steps to do the same tasks, and then compare their processing time using information from the SAS log to choose the most efficient code.
- It is especially helpful to compare PROC SQL code performance on summary tasks, such as developing a report of order summaries, because PROC SQL may perform better than PROCs or data steps.
- If using a server SAS setup with DS2 and data steps, and if the log from your data steps shows a CPU time close to the program runtime, then your data steps are CPU-bound. In those cases, rewriting the data step process in DS2 could be helpful because it could take advantage of threaded processing.
- DS2 has the other advantage that it is able to develop results at a higher precision level than data steps.
- DS2 code uses different commands than data step code but can achieve the same results.
- On massively parallel processing (MPP) DBMS platforms such as Teradata and Hadoop, DS2 can run as an in-database process using the SAS In-Database Code Accelerator. Using this code accelerator can significantly improve the efficiency of data throughput in these environments.
Note: In his white paper, Jordan compares PROC SQL processing using the SCAN command with the LIKE command for retrieving a record with criteria set on a high-cardinality variable; the LIKE command is found to be more efficient.
SAS in the Cloud
Jordan also describes how SAS now has a new Viya architecture that offers cloud analytic services (CAS). A CAS library allows the following capabilities:
- Fast-loading data into memory
- Conducting distributed processing across multiple nodes
- Retaining data in memory for use by other processes until deliberately saved to disk
A CAS library has application programming interfaces (APIs) that allow actions to be executed from a variety of languages, including Java, Python, and R, and of course, the SAS Version 9.4 client application.
Today not all warehouse data is stored in the Cloud, many datasets still being stored on traditional servers. Jordan recommends that if the user has an installation of the SAS 9.4M5 application and has access to SAS Viya CAS, and they want to decide whether or not to move to CAS from a traditional server, they should compare the processing time on a subset of data in both environments. Jordan was able to cut the processing time from over 1 minute to 2.35 seconds by moving his data from a traditional server to SAS Viya CAS.
Applications of SAS in modern warehousing
Today, SAS data warehousing is more complicated than it was in the past because there are so many options. Learning about these options can help the user envision the possibilities and design a SAS data warehousing system appropriate to their organization’s needs. In this section we’ll look at:
- A modern case study that used SAS components for analyzing unstructured text in helpdesk tickets
- A case study of a data SAS warehouse that upgraded an old system to include a new API allowing users more visualization functionality through SAS Visual Analytics
- A case study of a legacy SAS shop that began to incorporate R into their system.
Warehousing unstructured text
In his white paper on warehousing unstructured text in SAS, Nick Evangelopoulos describes how the IT Shared Services (ITSS) division at the University of North Texas (UNT) used SAS to study their service tickets to try to improve services. They took the following steps:
- They chose to study a set of 9,691 tickets (representing approximately 18 months’ worth of tickets) comprising mainly unstructured text from the native application platform ServiceNow.
- Using the open-source statistical application R, they conducted text cleaning. Mostly this consisted of removing back-and-forth conversations by email that were recorded in the unstructured ticket text.
- Using the text-mining component of SAS, SAS Text Miner (used within the SAS platform SAS Enterprise Miner (SAS EM)), they were able to use text extraction to help classify the tickets by topic.
- Next, the team used Base SAS and the analytics component SAS STAT to add indicator variables and other quantitative variables to the topics, thus creating a quantitative dataset that could be analyzed and visualized.
After doing this, the team wondered if SAS EM would classify the tickets under the same topic as the user entering the ticket would. To answer this question, the team analyzed 1,481 new tickets that were classified using SAS EM as well as being classified by the user. They found dramatic differences between how users and SAS EM classified the tickets, suggesting that this classification may need additional development in order to be useful.
Using SAS components for warehousing
A white paper by Li-Hui Chen and Manuel Figallo describes a modern SAS data warehouse using SAS applications. The US Department of Health and Human Services (DHHS) has a data warehouse of health indicators called the Health Indicators Warehouse (HIW). They described how they upgraded their SAS data warehouse system to improve performance and customer service using SAS Visual Analytics (VA) accessed through an API.
The HIW serves many users over the internet. Prior to the upgrade, SAS datasets were accessed from storage using SAS, and extract-transform-load (ETL) processes needed to take place manually on the data before it could be visualized. This made the data in the warehouse difficult to visualize.
With the upgrade, this is the new process:
- Users obtain permission to access the API, which controls access to the underlying data as well as the VA capabilities.
- Using the API, which contains a GUI, users indicate which health indicator they want to extract from the HIW, and how they want to visualize it.
- The API extracts the necessary data from the HIW data store using automated processing.
- The API performs necessary ETL processes to support visualization.
- The API then visualizes the results using VA.
Here’s a schematic diagram comparing the old and new systems:
SAS warehousing system before and after adding an API layer
Focusing on I/O, the authors pointed out that ETL in the API is achieved by running SAS macros, or code routines developed in the SAS macro language that can take user or system inputs and can be run automatically. They can run these macros either through a stored process (where the macro can be run on one dataset at a time) or a batched process (where the macro is run on several datasets at once). The authors found that they needed to use a batch process when transferring large amounts of HIW data through an API call.
Using SAS alongside R
SAS has been around for a long time and has typically been the first choice for warehousing big data. However, since the rise of SQL there has been competition between SAS and SQL for data storage functions. With the rise of R, open-source statistical software known for visualization and an easy web interface, SAS has additionally seen competition with respect to statistical analysis functions.
Over time, SAS has responded to competition by building in extra functionality. SAS/Access, SAS VA, and SAS Viya are all examples of this. However, the reality is that SAS is best at analytics, so other applications tend to be superior at these other functions. This has created challenges for legacy SAS warehouses that are now rethinking how they use SAS in their system. Teams are approaching this challenge with a variety of responses.
Dr. Elizabeth Atkinson shared her team’s story of moving from a 100% SAS shop to incorporating R for some functions. She leads a biostatistics service at the Mayo Clinic, a famous specialty clinic in the US, which has been a SAS shop since 1974, when punch cards were still being used, and now has a staff of 300 at three locations. The service supports data storage and analysis for studies both large and small.
In 2014, Mayo went to negotiate their SAS license and found that the price had increased significantly. SAS has always been a distinctive product with a high price. According to a Computerworld article, in 1995 a full SAS application development package, when bundled for 10 users, cost $1,575 per seat; this is expensive even by today’s standards. However, in 2014, the increase in cost was felt to be unsustainable, and the Mayo team started looking for other options.
They wanted to decrease their dependence on SAS by moving some of their functions to R, while also improving customer service and satisfaction. They faced the following challenges:
- SAS infrastructure was entrenched: All training was based on SAS, SAS was integrated into every workflow, and automation used SAS macros. Many users only trusted SAS and did not trust numbers coming out of R. SAS users relied on their personal code repositories.
- R infrastructure was dynamic: Unlike SAS, R releases new versions often. R innovates quickly, so it is hard to keep up a stable R environment. R packages, which are external components of Base R that can be added, were also upgraded regularly, leading to code that would break without warning, and cause user confusion.
- Time constraints: Reworking some SAS functions to be done by R required a lot of effort in deconstructing SAS and constructing R. Both leaders and users had time constraints.
- Different learning styles and levels of knowledge: SAS users had spent years learning data steps. R data management is completely different. It was hard for SAS users to learn R, and R users to learn SAS.
- R support needed: SAS provides customer support, but that is not available with open-source software like R. The organization needed to build its own R support desk. Compared to SAS, R’s documentation is less standardized and comprehensive.
To integrate R into their shop, they took the following steps:
- Committed funding: Divisional funding was committed to the project.
- Identified R champions: This was a group of R users with expertise in R and SAS.
- Set up an R server: Having an R server available increased enthusiasm and interest in R.
- Rebuilt popular local SAS macros in R: These are the ones that were deconstructed and rebuilt in R. Many of these were for reporting. They took the opportunity to improve reporting when rebuilding these macros.
- Developed integrated SAS and R training: Because they are now a combined shop, their training shows how to do the same tasks in SAS and R. They also hold events demonstrating R and providing online examples.
- Set up an R helpdesk: This provides on-call, in-house R support. They maintain a distribution list and send out R tips.
Even after offering R as an alternative, many users chose to stay with SAS. The reasons the shop could not completely convert from R to SAS include the following:
- Time and cost constraints: It was not possible to move all the small projects already in SAS over to R.
- Data retrieval and ETL: R cannot handle big data like SAS. The SAS data steps provide the ability to control procedural data processing in SAS, and this is not possible in R.
- Analysis limitations: Certain tasks are much clumsier in R than in SAS. At the Mayo Clinic, they found that mixed-effect models were much more challenging in R than in SAS.
One of the overall benefits of this effort was that it opened the larger conversation behind what skills would be needed among analysts in the division in the future. These considerations run parallel to the consideration as to what SAS and non-SAS components will be used in the data system in the near future, what roles they will play, how they will be supported, and how they will work together to improve the user experience.
Connecting to Snowflake
As data gets bigger and bigger, new solutions have been developed to store data in the Cloud. Microsoft Azure and Amazon Web Services (AWS) are cloud services that help move business operations to the Cloud. Snowflake is a relatively new cloud data platform that runs on Microsoft Azure and AWS, and may run on other cloud services in the future.
Snowflake enables a programmer to build a virtual data warehouse with little cost, thus solving a data storage problem. However, data still needs to be accessed to be analyzed. Therefore, SAS upgraded its SAS/Access component to now be able to connect directly to Snowflake.
SAS documentation about connecting to Snowflake indicates that Snowflake uses SQL as its query language. Both PROC SQL and regular SAS functions can be passed to Snowflake, but there are cases where SAS and Snowflake function names conflict. Furthermore, careful setting of options and code tuning are needed to improve I/O from SAS to Snowflake.
Although products like Snowflake can solve the big data storage problem, the issue with SAS will always be I/O. Using the newest and most appropriate technology along with the most efficient coding approaches will always be the best strategy for dealing with the data warehousing of big data in SAS.
As SAS environments evolved from running on mainframes to being accessible by PCs, SAS developed new products and services to complement its core analytics and data management functions.
SAS data steps are procedural, and provide the programmer with opportunities to greatly improve I/O through the use of certain commands, features, and approaches to programming. When SQL became popular, PROC SQL was invented. This allowed SAS users to choose between using data steps or SQL commands when managing data in SAS.
Today, SAS is still used in data warehousing, but there are new challenges with accessing data in the Cloud. SAS data warehouses today can include predominantly SAS components, such as SAS VA and CAS. Or, SAS can be part of a warehouse system that includes other components and applications, such as cloud storage in Snowflake, and supplemental analytic functions provided by R.
Modern SAS data warehousing still seeks to improve I/O and to better serve warehouse users through the development of an efficient system that meets customer needs. Creativity is required in the design of modern SAS data warehouses so that the system can leverage the best SAS has to offer while avoiding its pitfalls.
Build a strong foundation in SAS data warehousing by understanding data transformation code and policy, data stewardship and management, interconnectivity between SAS and other warehousing products, and print and web reporting
○ Understand how to use SAS macros for standardizing extract, transform, and load (ETL) protocols
○ Develop and use data curation files for effective warehouse management
○ Learn how to develop and manage ETL, policies, and print and web reports that meet user needs
Find out how you can leverage virtual machines and load balancers to facilitate secure and efficient networking
○ Discover the latest networking features and additions in Microsoft Azure with this updated guide
○ Upgrade your cloud networking skills by learning how to plan, implement, configure, and secure your infrastructure network
○ Provide a fault-tolerant environment for your apps using Azure networking services
A beginner’s guide to simplifying Extract, Transform, Load (ETL) processes with the help of hands-on tips, tricks, and best practices, in a fun and interactive way
○ Explore data wrangling with the help of real-world examples and business use cases
○ Study various ways to extract the most value from your data in minimal time
○ Boost your knowledge with bonus topics, such as random data generation and data integrity checks