About HPC Systems Software User Guides Education Partners

  / gears / hpc / education / tutorials / sas / sasanalyze

 

Analyzing and Working with Data


Topics


Very basic summary statistics

When we're interested in getting simple summary statistics from a SAS data set, proc means; is the simplest, quickest, and most efficent method. The cost of this simplicity is that it is a fairly limited procedure, but it can be made to be rather effective. We can easily get the number of "valid" (i.e. non-missing) responses to a specified variable using proc means;. This is labelled as "N". In addition, proc means; delivers the minimum and maximum values coded for the variable, along with the mean (naturally) and the standard deviation by default. This procedure presents its results in a compact table; there is just one line generated for each variable in most simple cases;

Here is an example in the simplest form, using the class.dat data:

proc means data=class;
   run;

Variable   N          Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------------
AGE       76    22.5526316     4.4911609    19.0000000    48.0000000
HEIGHT    73    68.0854795     4.1300932    57.0000000    76.0000000
ACADSTAT  76     4.0394737     0.4454448     3.0000000     5.0000000
PA        75     0.5466667     0.5011698             0     1.0000000
MACIBM    75     1.9733333     0.1621922     1.0000000     2.0000000
DRINK     75    27.6266667    27.1252218             0    99.0000000
--------------------------------------------------------------------

Had we defined variable labels (see below), those would have been included on each line, between the "Variable" and "N" (try it). See Cody & Smith (Chapter 2) for details on overriding the default output by specifying summary statistics on the PROC MEANS line. Note: If no statistics are specified, n mean std min and max are printed for each variable. "Min" and "max" are the sample minimum value and maximum value.

The var option can be used with PROC PRINT to limit the output to a specific list of variables in the data set. If it is not used, the requested statistics will be computed for all variables in the data set.

An optional by or class statement generates separate analyses for each level of the by or class variable(s):

proc means data=class;
   class acadstat;
   var age height;
   run;

The output looks like this:

                       N
         acadstat    Obs    Variable     N            Mean         Std Dev
    Minimum         Maximum
     ---------------------------------------------------------------------------
--------------------------
                3      6    age          6      20.5000000       1.3784049
19.0000000      23.0000000
                            height       6      67.6666667       5.1639778
62.0000000      73.0000000

                4     61    age         61      22.0491803       3.5889935
20.0000000      48.0000000
                            height      59      68.3644068       4.1051172
57.0000000      76.0000000

                5      9    age          9      27.3333333       7.7781746
20.0000000      43.0000000
                            height       8      66.3425000       3.5346681
62.4000000      74.0000000
     ---------------------------------------------------------------------------
--------------------------

Instead of using the class statement, we could use a by statement for output that is arranged in a slightly different manner (i.e. more compact), but if we use by the data must be sorted first:

proc sort data=class;
   by acadstat;
   run;

proc means data=class;
   by acadstat;
   var age height;
   run;

This output would look like:

---------------------------- ACADSTAT=3 -----------------------------


Variable   N          Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------------
AGE        6    20.5000000     1.3784049    19.0000000    23.0000000
HEIGHT     6    67.6666667     5.1639778    62.0000000    73.0000000
--------------------------------------------------------------------


---------------------------- ACADSTAT=4 -----------------------------


Variable   N          Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------------
AGE       61    22.0491803     3.5889935    20.0000000    48.0000000
HEIGHT    59    68.3644068     4.1051172    57.0000000    76.0000000
--------------------------------------------------------------------


---------------------------- ACADSTAT=5 -----------------------------


Variable   N          Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------------
AGE        9    27.3333333     7.7781746    20.0000000    43.0000000
HEIGHT     8    66.3425000     3.5346681    62.4000000    74.0000000
--------------------------------------------------------------------

...virtually a separate proc means; run for each level of the variable acadstat in the by statement.

Univariate statistics with PROC UNIVARIATE;

The univariate procedure performs parametric and nonparametric analysis of a sample from a single population.

Printed output for each variable includes the sample mean, the sample standard deviation, the t statistic and the two-sided p-value for a test of zero population mean, the sign statistic and two-sided p-value for a test of zero population median, the Wilcoxon signed rank test statistic and two-sided p-value for a test of zero population median, the quartiles and interquartile range, and some percentiles (1, 5, 10, 90, 95, 99). Use other options listed below to produce additional output. Here's an example, using some common options:

proc univariate data=class freq normal plot;
   var height;
   run;

The data= option specifies the SAS data set to be analyzed.

The freq option generates a frequency table showing the values, frequencies, percentages, and cumulative frequencies for each variable.

The normal option instructs SAS to compute the Shapiro-Wilk or Kolmogorov test of the hypothesis that the data comes from a normal population.

The plot option produces a stem and leaf plot, box plot, and normal probability plot for each variable.

The optional var statement restricts the analysis to specific variables in the data set. If the var statement is not used, all numeric variables in the data set are analyzed (for most data sets, you definitely do not want to forget the var statement. This procedure can easily produce reams and reams of output.

Other options include those which instruct SAS to output weighted averages with arbitrary weights, using value-frequency pairs to specify the raw data, computing additional percentiles, and specifying the method used to compute percentiles. Using the output out= statement (which we will learn about later), the procedure can create a new data set containing any of the computed quantities.

Here's what the output from the above example looks like, once again, using the class.dat data:

                               Univariate Procedure

Variable=HEIGHT

                                     Moments

                     N                73  Sum Wgts         73
                     Mean       68.08548  Sum         4970.24
                     Std Dev    4.130093  Variance   17.05767
                     Skewness   -0.13506  Kurtosis   -0.57074
                     USS        339629.3  CSS        1228.152
                     CV         6.066041  Std Mean   0.483391
                     T:Mean=0   140.8498  Pr>|T|       0.0001
                     Num ^= 0         73  Num > 0          73
                     M(Sign)        36.5  Pr>=|M|      0.0001
                     Sgn Rank     1350.5  Pr>=|S|      0.0001
                     W:Normal   0.965113  Pr<W         0.1274


                                 Quantiles(Def=5)

                      100% Max        76       99%        76
                       75% Q3         72       95%        75
                       50% Med        68       90%        73
                       25% Q1         65       10%        63
                        0% Min        57        5%        62
                                                1%        57
                      Range           19
                      Q3-Q1            7
                      Mode            72


                                     Extremes

                        Lowest    Obs     Highest    Obs
                            57(      18)       74(      74)
                            60(      55)       75(      36)
                            62(      63)       75(      41)
                            62(      20)       76(      25)
                            62(       5)       76(      56)


                             Missing Value         .
                             Count                 3
                             % Count/Nobs       3.95



                                  The SAS System
5

                               Univariate Procedure

Variable=HEIGHT

                 Stem Leaf                     #             Boxplot
                   76 00                       2                |
                   75 00                       2                |
                   74 00                       2                |
                   73 000                      3                |
                   72 000000000000            12             +-----+
                   71 0000                     4             |     |
                   70 000                      3             |     |
                   69 0000000                  7             |     |
                   68 0000                     4             *--+--*
                   67 000000                   6             |     |
                   66 0000                     4             |     |
                   65 00000003                 8             +-----+
                   64 000005                   6                |
                   63 0000                     4                |
                   62 0004                     4                |
                   61                                           |
                   60 0                        1                |
                   59                                           |
                   58                                           |
                   57 0                        1                |
                      ----+----+----+----+


                                  Normal Probability Plot
               76.5+                                             *+  *
                   |                                         *+*+
                   |                                       **+
                   |                                     **+
                   |                               ******
                   |                             **  ++
                   |                            **+++
                   |                          ***+
                   |                        **++
                   |                      ***
                   |                    +**
                   |                  ***
                   |              ****
                   |            ***+
                   |       * ***+
                   |        +++
                   |     *++
                   |    ++
                   | +++
               57.5++*
                    +----+----+----+----+----+----+----+----+----+----+
                        -2        -1         0        +1        +2



                               Univariate Procedure

Variable=HEIGHT

                                 Frequency Table

                                Percents                      Percents
                  Value Count  Cell   Cum       Value Count  Cell   Cum
                     57     1   1.4   1.4          67     6   8.2  46.6
                     60     1   1.4   2.7          68     4   5.5  52.1
                     62     3   4.1   6.8          69     7   9.6  61.6
                   62.4     1   1.4   8.2          70     3   4.1  65.8
                     63     4   5.5  13.7          71     4   5.5  71.2
                     64     5   6.8  20.5          72    12  16.4  87.7
                   64.5     1   1.4  21.9          73     3   4.1  91.8
                     65     7   9.6  31.5          74     2   2.7  94.5
                  65.34     1   1.4  32.9          75     2   2.7  97.3
                     66     4   5.5  38.4          76     2   2.7 100.0
SAS system files
  • Creating SAS System Files
  • Reading existing SAS system files

Creating SAS System Files

Sometimes it's useful to use SAS system files to organize and store your data compactly. Creating and accessing SAS system files is a relatively easy task. We use a libname statement to specify a libref. A libref specifies a SAS library. There is a lot of history behind SAS libraries, but for most purposes today, on Unix, Windows, and Macintosh systems, a SAS library refers to a directory.

We use the libref to specify the directory to which the system file is to be saved and then merely give the DATA statement a "two-level" name in the format, libname.filename (without the file extension). Take a look at this example:

Download this SAS code (Right-mouse button, select "Save Link As...")

libname save 'C:\Temp';

data save.class; input name $ sex $ age height weight; label name =
"Subject's  name"; label sex = "Subject's sex"; label age = "Age in
years";  label  height  =  "Height  in centimeters"; label weight =
"Weight  in  kilograms"; cards; JOHN M 12 59.0 99.5 JAMES M 12 57.3
83.0  ALFRED  M  14 69.0 112.5 WILLIAM M 15 66.5 112.0 JEFFREY M 13
62.5  84.0 RONALD M 15 67.0 133.0 THOMAS M 11 57.5 85.0 PHILIP M 16
72.0  150.0  ALICE F 13 56.5 84.0 ROBERT M 12 64.8 128.0 HENRY M 14
63.5 102.5 ; proc print; run;

proc means ; var height weight; run;

When this job is submitted to SAS, a SAS System File, class.sas7bdat is created in the C:\Temp directory (specified by the libname statement). This SAS system file contains the data created by the data step data save.class, including all variable names and any variable labels that may have been included.

NOTE: the libname statement associates a physical directory with the first part of the dataset name (SAVE).

The file extension (sas7bdat in this example) is created by SAS, and can be different, based on operating system and SAS version. Below is a table indicating what file extension is used for SAS versions 6 and 7 for Unix and Windows:

SAS Engine Unix (AIX, Solaris) Windows
                Member of Type DATA
V6         .ssd01              .sd2
V7         .sas7bdat           . sas7bdat

   Table modified from SAS OnlineDoc V7-1.

Reading existing SAS system files

You can use SAS system files to create new data files or you can invoke them directly in a PROC statement without need of employing a DATA step. In the first example below, a SAS system file is employed to create a new data file containing data for males only.

Download this SAS code (Right-mouse button, select "Save Link As...")

The following output appears in the Output Window:

If there is no need to alter the data in the physical file, statistical analyses can be performed by calling the data directly from the PROC statement. Below is an example:

Download this SAS code (Right-mouse button, select "Save Link As...")

This code would read the SAS system file specified by save.class. In this example, that would be a file in the C:\Temp directory, C:\Temp\class.sas7bdat. This file would have been created by SAS in an earlier job.

Multiple datasets, multiple lines per observation, and column-formatted data

We can work with multiple data sets in SAS simply by using multiple data statements to set-up the data. SAS can work with multiple data sets simultaneously.

A data file may contain multiple lines of data for each observation. Data may be coded this way for a variety of reasons. Sometimes the data are originally from punched cards; such cards can contain only up to 80 characters per "line". Depending on the data, there may be an advantage to coding it in multiple lines. This can include organizational advantage, storage advantages (occupies less space on disk or tape???), or even packaging and transport advantages.

The usnews.dat file contains data on United States universities in comma-delimited form, one observation per line.

Taking a look at the aaup2.dat data, we can see that it is coded such that there are two lines of data for each observation (college). When inputting data that contains more than one line per observation, it's necessary to indicate which line to be reading variables from. To do this, we use the symbol # followed by the line number within the group of lines for the observation. Here's an example, using an input statement for column-formatted data (for data aaup2):

filename data1 'c:\temp\usnews.dat';
filename data2 'c:\temp\aaup2.dat';

data usnews;
   infile data1 dlm=',';
   input fice college $ mathsat verbsat combsat act q1msat q3msat q1vsat
         q3vsat q1act q3act appsrec appsacc newenrol pct10 pct25
         flugrad ptugrad instate outstate randb room board additfee
         books personal pctphd pctterm st2fac pctalum expstud gradrate
         ;
   run;

data aaup2;
    infile data2;
    input

    #1

      fice        1 -  5
      college  $  7 - 37
      state    $ 38 - 39
      type     $ 40 - 43
      salfull    44 - 48
      salassoc   49 - 52
      salassis   53 - 56
      salall     57 - 60
      cmpfull    61 - 65
      cmpassoc   66 - 69
      cmpassis   70 - 73
      cmp_all    74 - 78

   #2

      nbrfull     1 -  4
      nbrassoc    5 -  8
      nbrassis    9 - 12
      nbrinst    13 - 16
      nbrall     17 - 21
   ;
   run;

It's not necessary to use a #1 before the first group of variables, because SAS begins on the first line by default, but it's good practice to do so whenever inputting data with multiple lines per observation.

Direct Merge

Let's start-out with two SAS datasets sample1 and sample2, which contain the following data:

SAMPLE1                SAMPLE2

IDNum   v1   v2        a1    a2    a3

1001    12   16        8.1   6.3   7.5
1002    10   15        8.6   6.2   7.4
1003    11   14        8.8   6.1   7.3
1004    20   18        8.4   6.6   7.7

These datasets could have been read into SAS using two DATA/INFILE steps, for example, or with two DATA/CARDS steps, or by any other valid means. Assuming this has been done, the following SAS program creates a new SAS data set called sample3, by directly merging the above two SAS data sets together (pasting them side-by-side):

DATA sample3;
  MERGE sample1 sample2;
  RUN;

The new SAS data set sample3 would look like this:

SAMPLE3

IDNum    v1   v2    a1    a2    a3

1001     12   16   8.1   6.3   7.5
1002     10   15   8.6   6.2   7.4
1003     11   14   8.8   6.1   7.3
1004     20   18   8.4   6.6   7.7

Normally, when using this type of merging, the number of observations in the two datasets to be merged will be equal. If the number of observations in the data sets are not equal, then missing values are generated for the "blank" observations and variables. When using direct merging, you are assuming that the order of the observations in both data sets are same. If not, the above merge operation will create a SAS data set sample3 with wrong combinations of the observations.

Match Merging

Let's take a look at the sample1 dataset again and a new dataset, sample4:

SAMPLE1                SAMPLE4

IDNum   v1   v2        IDNum b1     b2

1001    12   16        1004  82.1   61.6
1002    10   15        1005  77.3   92.5
1003    11   14        1001  54.2   66.3
1004    20   18        1003  92.6   81.2

The data sets sample1 and sample4 are to be combined by the variable IDNum. Note that in sample4 the order and number of observations are different from sample1. Here we need to match the observations by IDNum and then merge the correct observations. For this we sort sample4 by the variable IDNum in ascending order and then merge both the data sets.

The data set sample1, need not be sorted because the data is already sorted by the variable IDNum. If not, sample1 requires sorting by the variable IDNum.

The following SAS program merges sample1 and sample4, and creates a dataset called sample5.

/* The Sort Procedure */
PROC SORT DATA = sample4;
  BY IDNum
RUN;

/* The data step to merge two data sets */
DATA sample5;
  MERGE sample1 sample4;
  BY IDNum
RUN;

The SORT procedure sorts the data set sample4 by the variable IDNum. By default it will be sorted in ascending order. It is also possible to sort values in descending order (PROC SORT; BY DESCENDING IDNum). The variable IDNum is specified in the BY statement, and is referred to as the by group variable.

The BY statement accepts variables which are sorted.

The new SAS data set sample5 will look like like this:

SAMPLE5

IDNum  v1   v2     b1     b2

1001   12   16   54.2   66.3
1002   10   15     .      .
1003   11   14   92.6   81.2
1004   20   18   82.1   61.6
1005    .    .   77.3   92.5

The two data sets are merged by matching the observations by the variable IDNum. Note the values for the variables v1 and v2 for the IDNum 1005 are missing. Because in SAS data set sample1 there is no observation for IDNum1005.

Using the SET statement

The SET statement in SAS, can be used to concatenate two or more data sets together (i.e. append in series, one attached to the end of another). SET is used after creating a new SAS dataset with the DATA statement (ending with a semicolon). For example:

DATA sample6;
  SET sample1 sample4;
RUN;

The above SET statement combines the data sets sample1 and sample4 discussed in the previous section. The data in sample4 is added below the data in sample1. More than one data sets can be included in the SET statement. The new combined data set sample6 is given below.

SAMPLE6

IDNum  v1   v2     b1    b2

1001   12   16     .     .
1002   10   15     .     .
1003   11   14     .     .
1004   20   18     .     .
1004    .    .   82.1  61.6
1005    .    .   77.3  92.5
1001    .    .   54.2  66.3
1002    .    .   84.8  42.8
1003    .    .   92.6  81.2

Since both datasets sample1 and sample4 have different variables, missing values are generated for variables which do not belong to the respective datasets. Thus the new data set sample6 now has five variables with nine observations.

For example, consider the following two data sets class1 and class2 which have same three variables name, age and height.

CLASS1                        CLASS2

name    age    height          name    age    height

Andrew   15       67           Andrea   16       60
Philip   14       70           Linda    13       55
Robert   15       78           Sandra   17       65
Stephen  17       72

The following program combines the above two data sets and creates a new SAS data set called allclass.

DATA allclass;
  SET class1 class2;
RUN;

The data in allclass would look like this:

ALLCLASS

name    age    height

Andrew   15       67
Philip   14       70
Robert   15       78
Stephen  17       72
Andrea   16       60
Linda    13       55
Sandra   17       65

The data set allclass has three variables as in class1 and class2.

The procedure PROC APPEND also can be used to add observations from one SAS data set to the end of another SAS data set. This procedure is suitable to combine large data sets. The SET command reads all the SAS data sets into memory and then combines them. The APPEND procedure adds the observations from one SAS data set to the other without reading all the observations into the memory.

The APPEND Procedure

APPEND adds observations from a SAS data set to the end of another SAS data set. This means that one SAS dataset is changed by having observations appended to it. Contrast this to using SET, where you first create a new SAS dataset and then add other SAS datasets to the new one. It is more efficient than using a SET statement to concatenate two SAS data sets, especially when the BASE data set is large.

Let's take a look at the class1 and class2 datasets again:

CLASS1                         CLASS2

name    age    height          name    age    height

Andrew   15       67           Andrea   16       60
Philip   14       70           Linda    13       55
Robert   15       78           Sandra   17       65
Stephen  17       72

Note that these data sets have the same variable names. The following SAS program uses the PROC APPEND procedure to add observations from one data set, class2 to the end of another data set (called the BASE dataset), class1:

PROC APPEND BASE=class1 DATA=class2;
RUN;

The PROC APPEND procedure above adds the observations from SAS data set class2 to the end of SAS data set class1. The SAS data set class1 IS CHANGED, and now will have the following observations.

CLASS1

name    age    height

Andrew   15       67
Philip   14       70
Robert   15       78
Stephen  17       72
Andrea   16       60
Linda    13       55
Sandra   17       65

It is also possible to choose which observations are to be added to the other SAS data set. Here's an example:

PROC APPEND BASE=class2 DATA=class1(WHERE=(age=15));
RUN;

The PROC APPEND procedure above adds observations from the data set class1 to the end of the data set class2. The data set option WHERE= is used to choose the observations with age equals 15. Thus, it will select all those observations which have age equals 15 from the data set class1, and adds them to the end of the data set class2. The data set class2 now will have the following observations.

CLASS2

name    age    height

Andrea   16       60
Linda    13       55
Sandra   17       65
Andrew   15       67
Robert   15       78

Please send questions or suggestions about this web page to beatnic@aset.psu.edu

ASET | ITS | Penn State