1. The Basics of Efficient SAS Coding
Electronic supplementary material
The online version of this chapter (doi: 10.1007/978-1-4842-0568-6_1 ) contains supplementary material, which is available to authorized users.
Coding efficiency is generally measured in CPU time, disk space, or memory usage. This is perfectly reasonable for SAS code that will be submitted many more times than it will be updated. However, there are coding environments where SAS programs are written for single production runs and then adapted and updated for different production runsfor example, in clinical development. In these cases, a measurement of maintenance time may be more important.
SAS programs like those used in clinical trials are unlikely to be used to process large amounts of data, but they are very likely to be updated and adapted for use in a series of trials requiring similar processing. Saving 50% of the CPU time when the program runs for only 5 minutes will not have a significant impact on coding efficiency. But if a program is difficult to maintain, days or even weeks could be added to the time needed to prepare the program for a new trial.
This chapter discusses the choices you need to make when coding efficiently in different types of SAS programs. The various situations are illustrated with coding examples.
Is the SAS Programming World Back to Front?
I first used PROC SQL in financial projects; many of the SAS programmers had backgrounds in database management, and SQL was routinely used. In retrospect, this was not efficient programming, because joining large SAS data sets using PROC SQL , particularly on mainframes, does not usually improve processing performance.
When I started my first clinical-trials contract, no one used PROC SQL , even though data volumes were small and SQL is much easier to read and maintain than DATA steps and PROC SORT. DATA steps and PROC SORT together are much better for working with large data volumes, whereas PROC SQL is usually better for small data volumes.
The following examples reflect my personal views on coding efficiency. In some cases, the choice of an appropriate coding approach depends on the programming experience within the SAS programming teamparticularly their knowledge of SQL programming.
Speed and Low Maintenance
IF...THEN...ELSE and SELECT...WHEN constructs are examples of code that can be written to improve both speed and maintenance time. In a simple case of an input data set containing three possible values AC for a variable, the assignment of a new variable based on the value can be written a number of ways. All three examples generate exactly the same output data set:
The following code is not efficient, because every IF condition is applied to every record. However, for small input data sets, you may not notice the inherent inefficiency:
DATA new;
SET old;
IF oldvar = 'A' THEN newvar = 1;
IF oldvar = 'B' THEN newvar = 2;
IF oldvar = 'C' THEN newvar = 3;
RUN;
The following code is more efficient, because IF conditions are applied only up to the condition that matches. However, for small input data sets, you may not notice the increased speed. You can achieve further improvements in speed by ordering the IF conditions so that the most commonly used is placed at the top, but this may not be worthwhile unless the data is already well-known and the most common value is very common:
DATA new;
SET old;
IF oldvar = 'A' THEN newvar = 1;
ELSE IF oldvar = 'B' THEN newvar = 2;
ELSE IF oldvar = 'C' THEN newvar = 3;
RUN;
The following code is comparable in efficiency to the code in example 2 in that WHEN conditions are applied only up to the condition that matches. For small input data sets, you may not notice the increased speed. Again, you can achieve further improvements in speed by ordering the WHEN conditions so that the most commonly used is placed at the top. In my opinion, this construct is easier to maintain, because all the lines have the same layout; thus you can insert or delete lines with a reduced risk of introducing syntax errors. The mandatory OTHERWISE clause also provides an obvious place to include a default value if none of the previous conditions have been fulfilled:
DATA new;
SET old;
SELECT (oldvar);
WHEN ('A') newvar = 1;
WHEN ('B') newvar = 2;
WHEN ('C') newvar = 3;
OTHERWISE;
END;
RUN;
Extending conditional clauses to 10 or more conditions requires great care to avoid inefficient processing, especially if the input data set is large. You can also avoid inefficient maintenance, particularly if the conditional code is enclosed in a DO...END construct, if you lay out the code with indents indicating the relative positions of each section of conditional code.
Speed or Low Maintenance: Part 1
Rewriting a data-step merge with a PROC SQL join can help reduce maintenance time but may reduce processing speed. The following sample code merges three data sets using two variables and then reorders the resulting data set by another variable:
This is a combination of PROC SORT and DATA steps. The code is efficient as far as processing is concerned, but it is quite long and involved, because you have to sort the individual data sets prior to merging them:
PROC SORT DATA = a OUT = a1;
BY cat_b;
RUN;
PROC SORT DATA = b OUT = b1;
BY cat_b;
RUN;
DATA a1_b1;
MERGE a1 (IN = a) b1 (IN = b);
BY cat_b;
IF a AND b;
RUN;
PROC SORT DATA = a1_b1 OUT = a1_b11;
BY cat_c;
RUN;
PROC SORT DATA = c OUT = c1;
BY cat_c;
RUN;
DATA a1_b1_c1;
MERGE a1_b11 (IN = ab) c1 (IN = c);
BY cat_c;
IF ab AND c;
RUN;
PROC SORT DATA = a1_b1_c1;
BY cat_a cat_b cat_c;
RUN;
This is a single PROC SQL step that does everything in the one step, including the final sort. When input data sets are small to moderate in size, there is little difference in the CPU time used by this and the previous code, but very large input data sets can result in slower processing when using PROC SQL . Another obvious disadvantage is that, when combining two or more data sets with overlapping variables, you must list all the variables to be included in the output data sets. However, assuming the SAS programming team has some experience with SQL programming, this program should be easier to maintain: