Analyzing and Working with Data
Topics
Very basic summary statistics
When we're interested in getting simple summary statistics from a SAS
data set, proc means; is the simplest, quickest, and most efficent
method. The cost of this simplicity is that it is a fairly limited
procedure, but it can be made to be rather effective. We can easily
get the number of "valid" (i.e. non-missing) responses to a specified
variable using proc means;. This is labelled as "N". In addition, proc
means; delivers the minimum and maximum values coded for the variable,
along with the mean (naturally) and the standard deviation by default.
This procedure presents its results in a compact table; there is just
one line generated for each variable in most simple cases;
Here is an example in the simplest form, using the class.dat data:
proc means data=class;
run;
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------
AGE 76 22.5526316 4.4911609 19.0000000 48.0000000
HEIGHT 73 68.0854795 4.1300932 57.0000000 76.0000000
ACADSTAT 76 4.0394737 0.4454448 3.0000000 5.0000000
PA 75 0.5466667 0.5011698 0 1.0000000
MACIBM 75 1.9733333 0.1621922 1.0000000 2.0000000
DRINK 75 27.6266667 27.1252218 0 99.0000000
--------------------------------------------------------------------
Had we defined variable labels (see below), those would have been
included on each line, between the "Variable" and "N" (try it). See
Cody & Smith (Chapter 2) for details on overriding the default output
by specifying summary statistics on the PROC MEANS line. Note: If no
statistics are specified, n mean std min and max are printed for each
variable. "Min" and "max" are the sample minimum value and maximum
value.
The var option can be used with PROC PRINT to limit the output to a
specific list of variables in the data set. If it is not used, the
requested statistics will be computed for all variables in the data
set.
An optional by or class statement generates separate analyses for each
level of the by or class variable(s):
proc means data=class;
class acadstat;
var age height;
run;
The output looks like this:
N
acadstat Obs Variable N Mean Std Dev
Minimum Maximum
---------------------------------------------------------------------------
--------------------------
3 6 age 6 20.5000000 1.3784049
19.0000000 23.0000000
height 6 67.6666667 5.1639778
62.0000000 73.0000000
4 61 age 61 22.0491803 3.5889935
20.0000000 48.0000000
height 59 68.3644068 4.1051172
57.0000000 76.0000000
5 9 age 9 27.3333333 7.7781746
20.0000000 43.0000000
height 8 66.3425000 3.5346681
62.4000000 74.0000000
---------------------------------------------------------------------------
--------------------------
Instead of using the class statement, we could use a by statement for
output that is arranged in a slightly different manner (i.e. more
compact), but if we use by the data must be sorted first:
proc sort data=class;
by acadstat;
run;
proc means data=class;
by acadstat;
var age height;
run;
This output would look like:
---------------------------- ACADSTAT=3 -----------------------------
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------
AGE 6 20.5000000 1.3784049 19.0000000 23.0000000
HEIGHT 6 67.6666667 5.1639778 62.0000000 73.0000000
--------------------------------------------------------------------
---------------------------- ACADSTAT=4 -----------------------------
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------
AGE 61 22.0491803 3.5889935 20.0000000 48.0000000
HEIGHT 59 68.3644068 4.1051172 57.0000000 76.0000000
--------------------------------------------------------------------
---------------------------- ACADSTAT=5 -----------------------------
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------
AGE 9 27.3333333 7.7781746 20.0000000 43.0000000
HEIGHT 8 66.3425000 3.5346681 62.4000000 74.0000000
--------------------------------------------------------------------
...virtually a separate proc means; run for each level of the variable
acadstat in the by statement.
Univariate statistics with PROC UNIVARIATE;
The univariate procedure performs parametric and nonparametric
analysis of a sample from a single population.
Printed output for each variable includes the sample mean, the sample
standard deviation, the t statistic and the two-sided p-value for a
test of zero population mean, the sign statistic and two-sided p-value
for a test of zero population median, the Wilcoxon signed rank test
statistic and two-sided p-value for a test of zero population median,
the quartiles and interquartile range, and some percentiles (1, 5, 10,
90, 95, 99). Use other options listed below to produce additional
output. Here's an example, using some common options:
proc univariate data=class freq normal plot;
var height;
run;
The data= option specifies the SAS data set to be analyzed.
The freq option generates a frequency table showing the values,
frequencies, percentages, and cumulative frequencies for each
variable.
The normal option instructs SAS to compute the Shapiro-Wilk or
Kolmogorov test of the hypothesis that the data comes from a normal
population.
The plot option produces a stem and leaf plot, box plot, and normal
probability plot for each variable.
The optional var statement restricts the analysis to specific
variables in the data set. If the var statement is not used, all
numeric variables in the data set are analyzed (for most data sets,
you definitely do not want to forget the var statement. This procedure
can easily produce reams and reams of output.
Other options include those which instruct SAS to output weighted
averages with arbitrary weights, using value-frequency pairs to
specify the raw data, computing additional percentiles, and specifying
the method used to compute percentiles. Using the output out=
statement (which we will learn about later), the procedure can create
a new data set containing any of the computed quantities.
Here's what the output from the above example looks like, once again,
using the class.dat data:
Univariate Procedure
Variable=HEIGHT
Moments
N 73 Sum Wgts 73
Mean 68.08548 Sum 4970.24
Std Dev 4.130093 Variance 17.05767
Skewness -0.13506 Kurtosis -0.57074
USS 339629.3 CSS 1228.152
CV 6.066041 Std Mean 0.483391
T:Mean=0 140.8498 Pr>|T| 0.0001
Num ^= 0 73 Num > 0 73
M(Sign) 36.5 Pr>=|M| 0.0001
Sgn Rank 1350.5 Pr>=|S| 0.0001
W:Normal 0.965113 Pr<W 0.1274
Quantiles(Def=5)
100% Max 76 99% 76
75% Q3 72 95% 75
50% Med 68 90% 73
25% Q1 65 10% 63
0% Min 57 5% 62
1% 57
Range 19
Q3-Q1 7
Mode 72
Extremes
Lowest Obs Highest Obs
57( 18) 74( 74)
60( 55) 75( 36)
62( 63) 75( 41)
62( 20) 76( 25)
62( 5) 76( 56)
Missing Value .
Count 3
% Count/Nobs 3.95
The SAS System
5
Univariate Procedure
Variable=HEIGHT
Stem Leaf # Boxplot
76 00 2 |
75 00 2 |
74 00 2 |
73 000 3 |
72 000000000000 12 +-----+
71 0000 4 | |
70 000 3 | |
69 0000000 7 | |
68 0000 4 *--+--*
67 000000 6 | |
66 0000 4 | |
65 00000003 8 +-----+
64 000005 6 |
63 0000 4 |
62 0004 4 |
61 |
60 0 1 |
59 |
58 |
57 0 1 |
----+----+----+----+
Normal Probability Plot
76.5+ *+ *
| *+*+
| **+
| **+
| ******
| ** ++
| **+++
| ***+
| **++
| ***
| +**
| ***
| ****
| ***+
| * ***+
| +++
| *++
| ++
| +++
57.5++*
+----+----+----+----+----+----+----+----+----+----+
-2 -1 0 +1 +2
Univariate Procedure
Variable=HEIGHT
Frequency Table
Percents Percents
Value Count Cell Cum Value Count Cell Cum
57 1 1.4 1.4 67 6 8.2 46.6
60 1 1.4 2.7 68 4 5.5 52.1
62 3 4.1 6.8 69 7 9.6 61.6
62.4 1 1.4 8.2 70 3 4.1 65.8
63 4 5.5 13.7 71 4 5.5 71.2
64 5 6.8 20.5 72 12 16.4 87.7
64.5 1 1.4 21.9 73 3 4.1 91.8
65 7 9.6 31.5 74 2 2.7 94.5
65.34 1 1.4 32.9 75 2 2.7 97.3
66 4 5.5 38.4 76 2 2.7 100.0
SAS system files
- Creating SAS System Files
- Reading existing SAS system files
Creating SAS System Files
Sometimes it's useful to use SAS system files to
organize and store your data compactly. Creating and accessing SAS system
files is a relatively easy task. We use a libname statement to specify a
libref. A libref specifies a SAS library. There is a lot of history behind
SAS libraries, but for most purposes today, on Unix, Windows, and Macintosh
systems, a SAS library refers to a directory.
We use the libref to specify the directory to which the system file is
to be saved and then merely give the DATA statement a "two-level" name
in the format, libname.filename (without the file extension). Take a
look at this example:
Download this SAS code (Right-mouse button, select
"Save Link As...")
libname save 'C:\Temp';
data save.class; input name $ sex $ age height weight; label name =
"Subject's name"; label sex = "Subject's sex"; label age = "Age in
years"; label height = "Height in centimeters"; label weight =
"Weight in kilograms"; cards; JOHN M 12 59.0 99.5 JAMES M 12 57.3
83.0 ALFRED M 14 69.0 112.5 WILLIAM M 15 66.5 112.0 JEFFREY M 13
62.5 84.0 RONALD M 15 67.0 133.0 THOMAS M 11 57.5 85.0 PHILIP M 16
72.0 150.0 ALICE F 13 56.5 84.0 ROBERT M 12 64.8 128.0 HENRY M 14
63.5 102.5 ; proc print; run;
proc means ; var height weight; run;
When this job is submitted to SAS, a SAS System File, class.sas7bdat
is created in the C:\Temp directory (specified by the libname
statement). This SAS system file contains the data created by the data
step data save.class, including all variable names and any variable
labels that may have been included.
NOTE: the libname statement associates a physical directory with the
first part of the dataset name (SAVE).
The file extension (sas7bdat in this example) is created by SAS, and
can be different, based on operating system and SAS version. Below is
a table indicating what file extension is used for SAS versions 6 and
7 for Unix and Windows:
SAS Engine Unix (AIX, Solaris) Windows
Member of Type DATA
V6 .ssd01 .sd2
V7 .sas7bdat . sas7bdat
Table modified from SAS OnlineDoc V7-1.
Reading existing SAS system files
You can use SAS system files to create new data files or you can
invoke them directly in a PROC statement without need of employing a
DATA step. In the first example below, a SAS system file is employed
to create a new data file containing data for males only.
Download this SAS code (Right-mouse button, select
"Save Link As...")
The following output appears in the Output Window:
If there is no need to alter the data in the physical file,
statistical analyses can be performed by calling the data directly
from the PROC statement. Below is an example:
Download this SAS code (Right-mouse button, select
"Save Link As...")
This code would read the SAS system file specified by save.class. In
this example, that would be a file in the C:\Temp directory,
C:\Temp\class.sas7bdat. This file would have been created by SAS in an
earlier job.
Multiple datasets, multiple lines per observation, and column-formatted data
We can work with multiple data sets in SAS simply by using multiple data
statements to set-up the data. SAS can work with multiple data sets
simultaneously.
A data file may contain multiple lines of data for each observation.
Data may be coded this way for a variety of reasons. Sometimes the
data are originally from punched cards; such cards can contain only up
to 80 characters per "line". Depending on the data, there may be an
advantage to coding it in multiple lines. This can include
organizational advantage, storage advantages (occupies less space on
disk or tape???), or even packaging and transport advantages.
The usnews.dat file contains data on United States universities in
comma-delimited form, one observation per line.
Taking a look at the aaup2.dat data, we can see that it is coded
such that there are two lines of data for each observation (college).
When inputting data that contains more than one line per observation,
it's necessary to indicate which line to be reading variables from. To
do this, we use the symbol # followed by the line number within the
group of lines for the observation. Here's an example, using an input
statement for column-formatted data (for data aaup2):
filename data1 'c:\temp\usnews.dat';
filename data2 'c:\temp\aaup2.dat';
data usnews;
infile data1 dlm=',';
input fice college $ mathsat verbsat combsat act q1msat q3msat q1vsat
q3vsat q1act q3act appsrec appsacc newenrol pct10 pct25
flugrad ptugrad instate outstate randb room board additfee
books personal pctphd pctterm st2fac pctalum expstud gradrate
;
run;
data aaup2;
infile data2;
input
#1
fice 1 - 5
college $ 7 - 37
state $ 38 - 39
type $ 40 - 43
salfull 44 - 48
salassoc 49 - 52
salassis 53 - 56
salall 57 - 60
cmpfull 61 - 65
cmpassoc 66 - 69
cmpassis 70 - 73
cmp_all 74 - 78
#2
nbrfull 1 - 4
nbrassoc 5 - 8
nbrassis 9 - 12
nbrinst 13 - 16
nbrall 17 - 21
;
run;
It's not necessary to use a #1 before the first group of variables,
because SAS begins on the first line by default, but it's good
practice to do so whenever inputting data with multiple lines per
observation.
Direct Merge
Let's start-out with two SAS datasets sample1 and sample2,
which contain the following data:
SAMPLE1 SAMPLE2
IDNum v1 v2 a1 a2 a3
1001 12 16 8.1 6.3 7.5
1002 10 15 8.6 6.2 7.4
1003 11 14 8.8 6.1 7.3
1004 20 18 8.4 6.6 7.7
These datasets could have been read into SAS using two DATA/INFILE
steps, for example, or with two DATA/CARDS steps, or by any other
valid means. Assuming this has been done, the following SAS program
creates a new SAS data set called sample3, by directly merging the
above two SAS data sets together (pasting them side-by-side):
DATA sample3;
MERGE sample1 sample2;
RUN;
The new SAS data set sample3 would look like this:
SAMPLE3
IDNum v1 v2 a1 a2 a3
1001 12 16 8.1 6.3 7.5
1002 10 15 8.6 6.2 7.4
1003 11 14 8.8 6.1 7.3
1004 20 18 8.4 6.6 7.7
Normally, when using this type of merging, the number of observations
in the two datasets to be merged will be equal. If the number of
observations in the data sets are not equal, then missing values are
generated for the "blank" observations and variables. When using
direct merging, you are assuming that the order of the observations in
both data sets are same. If not, the above merge operation will create
a SAS data set sample3 with wrong combinations of the observations.
Match Merging
Let's take a look at the sample1 dataset again and a new dataset,
sample4:
SAMPLE1 SAMPLE4
IDNum v1 v2 IDNum b1 b2
1001 12 16 1004 82.1 61.6
1002 10 15 1005 77.3 92.5
1003 11 14 1001 54.2 66.3
1004 20 18 1003 92.6 81.2
The data sets sample1 and sample4 are to be combined by the variable
IDNum. Note that in sample4 the order and number of observations are
different from sample1. Here we need to match the observations by
IDNum and then merge the correct observations. For this we sort
sample4 by the variable IDNum in ascending order and then merge both
the data sets.
The data set sample1, need not be sorted because the data is
already sorted by the variable IDNum. If not, sample1 requires
sorting by the variable IDNum.
The following SAS program merges sample1 and sample4, and creates a
dataset called sample5.
/* The Sort Procedure */
PROC SORT DATA = sample4;
BY IDNum
RUN;
/* The data step to merge two data sets */
DATA sample5;
MERGE sample1 sample4;
BY IDNum
RUN;
The SORT procedure sorts the data set sample4 by the variable IDNum.
By default it will be sorted in ascending order. It is also possible
to sort values in descending order (PROC SORT; BY DESCENDING IDNum).
The variable IDNum is specified in the BY statement, and is referred
to as the by group variable.
The BY statement accepts variables which are sorted.
The new SAS data set sample5 will look like like this:
SAMPLE5
IDNum v1 v2 b1 b2
1001 12 16 54.2 66.3
1002 10 15 . .
1003 11 14 92.6 81.2
1004 20 18 82.1 61.6
1005 . . 77.3 92.5
The two data sets are merged by matching the observations by the
variable IDNum. Note the values for the variables v1 and v2 for the
IDNum 1005 are missing. Because in SAS data set sample1 there is no
observation for IDNum1005.
Using the SET statement
The SET statement in SAS, can be used to concatenate two or more data
sets together (i.e. append in series, one attached to the end of
another). SET is used after creating a new SAS dataset with the DATA
statement (ending with a semicolon). For example:
DATA sample6;
SET sample1 sample4;
RUN;
The above SET statement combines the data sets sample1 and sample4
discussed in the previous section. The data in sample4 is added below
the data in sample1. More than one data sets can be included in the
SET statement. The new combined data set sample6 is given below.
SAMPLE6
IDNum v1 v2 b1 b2
1001 12 16 . .
1002 10 15 . .
1003 11 14 . .
1004 20 18 . .
1004 . . 82.1 61.6
1005 . . 77.3 92.5
1001 . . 54.2 66.3
1002 . . 84.8 42.8
1003 . . 92.6 81.2
Since both datasets sample1 and sample4 have different variables,
missing values are generated for variables which do not belong to the
respective datasets. Thus the new data set sample6 now has five
variables with nine observations.
For example, consider the following two data sets class1 and class2
which have same three variables name, age and height.
CLASS1 CLASS2
name age height name age height
Andrew 15 67 Andrea 16 60
Philip 14 70 Linda 13 55
Robert 15 78 Sandra 17 65
Stephen 17 72
The following program combines the above two data sets and creates a
new SAS data set called allclass.
DATA allclass;
SET class1 class2;
RUN;
The data in allclass would look like this:
ALLCLASS
name age height
Andrew 15 67
Philip 14 70
Robert 15 78
Stephen 17 72
Andrea 16 60
Linda 13 55
Sandra 17 65
The data set allclass has three variables as in class1 and class2.
The procedure PROC APPEND also can be used to add observations from
one SAS data set to the end of another SAS data set. This procedure is
suitable to combine large data sets. The SET command reads all the SAS
data sets into memory and then combines them. The APPEND procedure
adds the observations from one SAS data set to the other without
reading all the observations into the memory.
The APPEND Procedure
APPEND adds observations from a SAS data set to the end of another SAS
data set. This means that one SAS dataset is changed by having
observations appended to it. Contrast this to using SET, where you
first create a new SAS dataset and then add other SAS datasets to the
new one. It is more efficient than using a SET statement to
concatenate two SAS data sets, especially when the BASE data set is
large.
Let's take a look at the class1 and class2 datasets again:
CLASS1 CLASS2
name age height name age height
Andrew 15 67 Andrea 16 60
Philip 14 70 Linda 13 55
Robert 15 78 Sandra 17 65
Stephen 17 72
Note that these data sets have the same variable names. The following
SAS program uses the PROC APPEND procedure to add observations from
one data set, class2 to the end of another data set (called the BASE
dataset), class1:
PROC APPEND BASE=class1 DATA=class2;
RUN;
The PROC APPEND procedure above adds the observations from SAS data
set class2 to the end of SAS data set class1. The SAS data set class1
IS CHANGED, and now will have the following observations.
CLASS1
name age height
Andrew 15 67
Philip 14 70
Robert 15 78
Stephen 17 72
Andrea 16 60
Linda 13 55
Sandra 17 65
It is also possible to choose which observations are to be added to
the other SAS data set. Here's an example:
PROC APPEND BASE=class2 DATA=class1(WHERE=(age=15));
RUN;
The PROC APPEND procedure above adds observations from the data set
class1 to the end of the data set class2. The data set option WHERE=
is used to choose the observations with age equals 15. Thus, it will
select all those observations which have age equals 15 from the data
set class1, and adds them to the end of the data set class2. The data
set class2 now will have the following observations.
CLASS2
name age height
Andrea 16 60
Linda 13 55
Sandra 17 65
Andrew 15 67
Robert 15 78
Please send questions or suggestions about this web page to beatnic@aset.psu.edu
ASET | ITS | Penn State
|