Starting Out with SAS
Topics
Common types of ASCII data file formats
Data files can come in a variety of formats. Some of these are:
space-delimited (each data element is separated by a space),
comma-delimited (each data element is separated by a comma), column
(data are aligned in columns, with nothing but column position to
delineate between one data element and another), and other formats.
The tutorial for the lab session last week contained an example of
reading-in "space-delimited" data that looked like this:
placebo better 8
placebo same 22
drug1 better 21
drug1 same 9
drug2 better 15
drug2 same 15
These data are also column- (or fixed-formatted. This means that each
variable could be read from a specified set of columns. For example,
the data for the variable SITE could be read from columns 1-9; the
data for the variable WEIGHT could be read from colums 12-15. We will
discuss column-formatted data in more detail later. The same data
could be formatted as space delimited in the following manner:
placebo better 8
placebo same 22
drug1 better 21
drug1 same 9
drug2 better 15
drug2 same 15
...with just one space between each data element. The same program
listed in the tutorial read these data just as easily with no
modifications to the code. Space-delimited data are separated by one
or more spaces.
Comma-delimited data are delimited by one comma between data elements:
placebo,better,8
placebo,same,22
drug1,better,21
drug1,same,9
drug2,better,15
drug2,same,15
Data that are entered into a spreadsheet, such as Excel or Lotus, can
be saved as comma-delimited data for SAS analysis. This is usually
accomplished by selecting "Save As..." and selecting CSV as the output
format.
Purely column-delimited data might look like this:
placebobetter 8
placebosame 22
drug1 better21
drug1 same 9
drug2 better15
drug2 same 15
We would need to use an input statement that specifies exactly which
columns to read each variable from.
Using DATALINES; (or CARDS;) to read data that is included as part of your SAS code
Normally data are read from an external data file that is separate
from your SAS program. However, data can be included as part of your
SAS code.
The CARDS; statement does exactly the same thing that a DATALINES;
statement does, but CARDS is a lot older than DATALINES. It was
originally used in SAS programs to signal the computer to start
reading-in data from punch cards. Since punch cards and card readers
are now extinct, the "CARDS;" statement started to seem strange to
newer SAS students... so SAS Institute created the synonym,
"DATALINES;".
This is usually done for one of only three limited reasons:
- Sometimes if you have a very, very small data set (perhaps a dozen
lines or so)
- Sometimes if you are testing your SAS code with a very small data
set
- Sometimes it can be useful, for instructional purposes, to have
the SAS code and the data all together (for example, this is what
was done with the SAS program example in lecture #1)
If you wish to enter your data as part of your SAS program, it must be
included within a DATA step only, and it is identified using a
DATALINES; statement. The DATALINES statement must be the LAST
statement of the DATA step (except for the run; statement!), and the
data immediately follows (NO blank line after the DATALINES
statement!!! SAS will read that as the first line of data and assign
missing values). Notice also that the data start in the first column
(DO NOT indent), and the end of the data are indicated with a
semicolon (;) on a line by itself immediately following the data
(again, NO blank lines between the end of the data and the semicolon:
Download this SAS code (Right-mouse button, select "Save Link As...")
...always remember to include the semicolon to mark the end of the
data, and a run; statement to mark the end of the data step!
The data listed here was submitted to STATLIB by Therese A. Stukel in
July, 1998. It consists of 20 observations from 10 pairs of human
twins, on 9 variables. Source: Tramo MJ, Loftus WC, Green RL, Stukel
TA, Weaver JB, Gazzaniga MS. Brain Size, Head Size, and IQ in
Monozygotic Twins. Neurology, 1998 (in press).
Inputting data from an external data file (space-delimited)
To get data from an external data file into a SAS data set (i.e. to
read it into your program), the first thing to do is to identify where
the data are. The way this is done will depend on the type of computer
your running SAS on. This is because different operating systems (e.g.
DOS, unix, CMS, Macintosh) use different formats for identifying file
names and directories (folders). Since we will be using the Unix
version of SAS, we will begin with specifying a Unix-style file
identification. The filename statement is used to identify where a
data file is located. The syntax of the filename statement is as
follows:
The FILENAME statement
The filename statement is used to identify where a data file is
located. The syntax of the filename statement is as follows:
filename fileref external-file
The fileref is simply any name you want to use to refer to the file.
It can be thought of as a "nickname" that you will use in your SAS
code to refer to the data file. The external-file is the directory and
filename of the file that contains the data. By using the filename
statement, the effect is to give a "nickname" to the data file. This
nickname is called the fileref. The fileref refers to the physical
data file on your disk. It is not a SAS data set--a SAS data set is
data that is input into your program. You may recode, or "massage" a
SAS data set within a SAS program without changing the physical data
file on your disk.
Here's an example:
filename data01 'c:\temp\iq-brainsize.dat';
In our SAS code, we now refer to the data in the file
iq-brainsize.dat as "data01". There is nothing magical about the
name "data01"; we could have just as easily used:
filename xanadu 'c:\temp\iq-brainsize.dat';
...or any other name we may want to choose. SAS uses the fileref
instead of the actual external filename as a means of
standardization--because there are many different ways that different
types of computers name files. The filename statement is used to
specify a fileref that SAS will understand regardless of the type of
computer you're using. It makes it easy to port your SAS code from one
computer platform to another (e.g. IBM PC running Windows-NT to a
Macintosh running MacOS); you would only need to change the
external-file specification once in the filename statement--and
wherever you use the fileref (the "nickname") later in your code, the
change will be "updated".
The DATA step
After the fileref has been specified in the filename statement, we
tell SAS that we want to create a SAS data set. A SAS dataset is the
way SAS works with data. When we read data from an external data file
into SAS, SAS creates a copy of the data file that it works with. This
is the SAS dataset, and it is named with a data statement. We use the
data statement to tell SAS that we want to create an empty SAS data
set, prepared to receive data that we will input.
A data statement, such as:
data brainsize;
merely tells SAS to set-up an empty SAS data set called "brainsize".
There is nothing magical about the name I chose, "brainsize"--this is
merely a name for the SAS dataset that we create; it can be anything
we wish, just so long as it follows SAS variable name conventions (see
Lecture #01).
The INFILE and INPUT statements
So far, no data has been read-in. When reading data from disk (or
tape), the next statement tells SAS what fileref to read from. This
statement is called an infile statement. The infile statement
identifies the fileref, and optionally specifies other characteristics
of the data file. Today, we're reading from a space-delimited data
file. SAS assumes that the data file is space- or column-formated,
unless there is an option on the infile statement saying otherwise. As
an aside, to tell SAS to expect the data to be comma-delimited, we
would use the dlm (delimitor) option. For example:
infile commas dlm=',';
...but our data are space-delimited, so we do not need to use the dlm option:
infile data01;
However, though it would be unnecessary, just to be explicit, we could
use the dlm option to specify that the data are space-delimited:
infile data01 dlm=' ';
(There is a space between the two ' ' marks in the example above).
After the INFILE statement comes an INPUT statement specifying the
variable names, separated by spaces, in the order in which they are to
be read from the data file:
input CCMIDSA FIQ HeadCirc BirthOrd PairID Sex $ TotalSA TotalVol
Weight;
Notice that the variable sex is coded as character data in the data
file (i.e. non-numeric) data. The variable sex represents the sex of
the observation, coded as either an "M" for males or an "F" for
females. In the input statement, we tell SAS to expect alphabetic
characters instead of numbers for this variable by adding a '$' after
the variable in the input list. There is a space between the variable
and the $ character.
Tying it together
Thus, the sequence of statements:
filename data01 'c:\temp\iq-brainsize.dat';
data brainsize;
infile data01;
input CCMIDSA FIQ HeadCirc BirthOrd PairID Sex $ TotalSA
TotalVol Weight;
run;
specifies a fileref called "data01" to refer to the disk file,
iq-brainsize.dat. Then a SAS data set called "brainsize" is
created. The infile statement tells SAS to find the data in the file
identified by "data01" and SAS assumes that the data are
space-delimited because there is no dlm statement. Next, the data in
the file are read-in according to the variable specifications in the
input statement.
The data statement comes before the infile statement, which in turn,
comes before the input statement.
Review: The anatomy of SAS program steps
All SAS programs consist of a sequence
of "steps". There are two kinds of steps: DATA steps, and PROCedures.
DATA step
A DATA step creates a SAS dataset (a collection of data
together with a "data dictionary", which defines the variables
and their properties). Data must be in the form of a SAS
dataset before it can be analyzed by SAS procedures.
In this example SAS program, these lines create the dataset
FITNESS from raw data input. Notice the use of the $ character
to alert SAS to expect character data:
data fitness;
input name $ sex $ age height weight;
cards; john M 12 59.0 99.5 james M 12 57.3 83.0 ... (more data
lines) ; run;
PROC step
A PROCedure step calls a SAS procedure to analyse or process a
SAS dataset.
In this example SAS program, these lines call two SAS
procedures to analyze the fitness dataset:
proc print data=fitness;
run;
proc means data=fitness;
var height weight;
run;
A SAS program can contain any number of DATA and PROC
steps. The SAS statements in each step are executed all together. Once
a dataset has been created, it can be processed by any subsequent DATA
or PROC step.
SAS Statements
- All SAS statements start with a keyword (DATA, INPUT, PROC, etc.)
- All SAS statements end with a semicolon (;) . (The most common
problem students encounter is omitting a semicolon -- SAS thinks
that two statements are just one.)
- SAS statements can be entered in free-format : You can begin in
any column, type several statements on one line or split a single
statement over several lines (as long as no word is split.).
- Uppercase and lowercase are equivalent, except inside quote marks
(For example, sex = 'm'; is not the same as sex = 'M';).
SAS Procedures
SAS Procedures exist to carry out all the forms of statistical
analysis. As the above examples indicate, a procedure is invoked in a
"PROC step" which starts with the keyword PROC, such as:
proc means data=fitness;
variables height weight;
The var, or variables, statement can be used with all procedures to
indicate which variables are to be analyzed. If this statement is
omitted, the default is to include all variables of the appropriate
type (character or numeric) for the given analysis.
Some other statements that can be used with most SAS procedure steps are:
by variable(s);
Causes the procedure to be repeated automatically for each
different value of the named variable(s). The data set must
first be sorted by those variables.
id variable(s);
Give the name of a variable to be used as an observation
IDentifier.
label var='label';
- Assign a descriptive label to a variable.
where (expression);
Select only those observations for which the expression is
true.
For example, the following lines produce separate means for males and
females, with the variable SEX labelled 'Gender'. (An ID statement is
not appropriate, because PROC MEANS produces only summary output.)
proc sort data=fitness;
by sex;
proc means data=fitness;
var height weight;
by sex;
label sex='Gender';
If the DATA=option is not used, SAS procedures process the most
recently created dataset.
Some INFILE options: MISSOVER, OBS=, and FIRSTOBS=
Here's a useful option on the infile statement:
infile data01 missover;
This statement instructs SAS to use file referenced as data01 for the
data. The missover option directs SAS to skip over missing data rather
than going to a new line.
infile data01 obs=n;
Instructs SAS to read-in only the first n observations in the data
file. This is useful to "test" your code for errors before running it
on a large data file--it reduces the time and resources needed to test
the program before running it on the full data.
infile data01 firstobs=n;
Skip first n-1 observations (lines). This can be handy when column
names in the data file are documented at the top of the file, for
example:
infile data01 firstobs=2;
will skip the first observation (line) in the data file when it is
read-in.
More about INFILE and data delimiters
We've already seen how to specify that the data we are INPUTting is
comma-delimited, by using dlm= on the INFILE statement:
infile commas dlm=',';
...and we know that we can either use a space character between the
quotes or eliminate the dlm= statement altogether (space-delimited is
the default) if the data are space-delimited...
Tab-delimited data
What if the data are tab-delimited? If we try using a tab character
between the quotes, it doesn't work (try it)!!! This is because the
tab character is a special character for some operating systems.
Instead of using the tab character between the quotes, we can specify
the hex (hexidecimal) code for the tab chacter:
infile tabs dlm='09'x;
The x character at the end of the dlm='09'x statement indicates that
the characters inside the quotes represent a hex code. The hex code
for a tab character is 09.
Alternatively, we could use:
infile tabs expandtabs;
...which actually converts the tabs to spaces before reading-in the
data.
NOTE: In the last two examples above, "tabs" is just the fileref--it's
whatever name you used in the filename statement to reference from
where you're reading data. There's nothing special about this name; it
could be anything you choose--I'm just using "tabs" as the name here
because I thought it a nice name to use for the example.
Double commas for missing values
Another common data format is to separate variables by commas and to
place character data between double quotes. This is often the
preferred text-format generated by spreadsheet programs. Furthermore,
two commas together indicate that there is a missing value (this is
what happens when you're entering data into a spreadsheet and leave
blank cells for missing values.
To work with this sort of data format in SAS, we can use the DSD
(data-sensitive delimiter) INFILE statement option. Besides allowing
commas as the data delimiter, this option reads character strings
enclosed in double quotes and strips off the quotes before assigning
the value to the character variable. It also allows you to include
commas within a character string. Finally, two commas together are
interpreted as a missing value. Here's an example, modified from
Ronald Cody, Ed.D. (SUGI23), The INPUT Statement: Where It's @:
Download this SAS code (Right-mouse button, select "Save Link As...")
To see that this program is working as expected, here is the output
from PROC PRINT:
When you use the DSD INFILE option, the default delimiter is a comma.
You may use the DSD and DLM= options together to allow all the
features just discussed but with a delimiter other than a comma.
Frequency tables: PROC FREQ;
PROC FREQ; can be used to produce one-way, two-way, or multi-way
frequency tables or cross tabulations (crosstabs). For now, we will
focus on using PROC FREQ for generating simple (one-way) frequency
tables. NOTE: PROC FREQ is used with a TABLES statement to specify the
variables to be put into a frequency table or cross tabulation. This
is unlike most of the other procedures we will learn about in this
class, which use var, (or variables) to list analysis variables. The
TABLES statement is use with PROC FREQ because unlike the VAR
statement, it allows for specifying combinations of variables for
two-way and multi-way tables using an asterisk (*). Generating
mulit-way tables using PROC FREQ will be discussed in another lecture.
A variable listed alone will result in a simple frequencies table. Two
or more variables listed on the TABLES statement, separated by spaces,
will produce an individual frequencies table for each variable. Here
are two examples illustrating this, with their respective output:
proc freq;
tables age;
run;
The SAS System
Cumulative Cumulative
AGE Frequency Percent Frequency Percent
-------------------------------------------------
20 11 18.6 11 18.6
21 19 32.2 30 50.8
22 15 25.4 45 76.3
23 2 3.4 47 79.7
24 8 13.6 55 93.2
26 2 3.4 57 96.6
28 1 1.7 58 98.3
38 1 1.7 59 100.0
proc freq;
tables age sex tv;
run;
The SAS System
Cumulative Cumulative
AGE Frequency Percent Frequency Percent
-------------------------------------------------
20 11 18.6 11 18.6
21 19 32.2 30 50.8
22 15 25.4 45 76.3
23 2 3.4 47 79.7
24 8 13.6 55 93.2
26 2 3.4 57 96.6
28 1 1.7 58 98.3
38 1 1.7 59 100.0
Cumulative Cumulative
SEX Frequency Percent Frequency Percent
-------------------------------------------------
F 19 32.2 19 32.2
M 40 67.8 59 100.0
Cumulative Cumulative
TV Frequency Percent Frequency Percent
------------------------------------------------
0 8 13.6 8 13.6
1 34 57.6 42 71.2
2 13 22.0 55 93.2
9 4 6.8 59 100.0
Sorting and subsetting
Sorting and running a procedure by subgroups
Sometimes it is very helpful to run each of several subgroups through
some summary or analysis procedure. This can be done with the sort
procedure and use of the by phrase. Using class.dat again, here's
an example of using proc sort; on the first eight observations of the
data:
Download this SAS code (Right-mouse button, select "Save Link As...")
The data are first sorted by pa and then we run the freq procedure
separately for each subgroup. Finally, selected variables from the
data are printed, using proc print. In contrast to the where
statement, which selects one subgroup, using by performs the procedure
for each subgroup. This is much cleaner than running SAS 3 times, each
time retaining only the treatment group under study. However, it does
produce a lot more output! The by phrase is available in all
procedures, but you must sort before you use it.
The "by" statement in your proc sort statement, in
most cases, should match the "by" statement in the analysis procedure.
Note: We do not use a $ to denote character data in SAS procedures;
it's needed only in the input statement.
Here's what the output looks like. Notice that the data are sorted
into subgroups according to the variable pa in the proc print; output
listing. You may wish to compare this to the first eight observations
in the raw data, class.dat:
The SAS System 1
------------ Pennsylvania origin=0 ---------------
Age
Cumulative Cumulative
AGE Frequency Percent Frequency Percent
-------------------------------------------------
22 2 66.7 2 66.7
26 1 33.3 3 100.0
The SAS System 2
------------ Pennsylvania origin=1 ---------------
Age
Cumulative Cumulative
AGE Frequency Percent Frequency Percent
-------------------------------------------------
21 2 40.0 2 40.0
22 1 20.0 3 60.0
23 1 20.0 4 80.0
38 1 20.0 5 100.0
The SAS System 3
OBS PA AGE ACADSTAT PRES
1 0 26 4 Reagan
2 0 22 4 Reagan
3 0 22 4 Reagan
4 1 23 4
5 1 21 4 Reagan
6 1 38 5 Carter
7 1 22 4 Kennedy
8 1 21 4 Bush
One can sort by several variables at once. The following code is
identical to the above code, except for the "by" statement in the sort
procedure. This time, we sort according to two variables, pa and age:
filename datain 'c:\temp\class.dat';
data class;
infile datain obs=8;
input age height acadstat pa macibm pres $ drink;
label pa = "Pennsylvania origin";
run;
proc sort data=class;
by pa age;
run;
proc freq data=class;
by pa;
tables age;
run;
proc print data=class;
var age height acadstat pres;
run;
This code will sort the data first in order of pa and then within each
"pa" group, the data are sorted by age. Notice how age is sorted
within each pa group in the proc print; output that is generated by
this program:
The SAS System 1
------------ Pennsylvania origin=0 ---------------
Age
Cumulative Cumulative
AGE Frequency Percent Frequency Percent
-------------------------------------------------
22 2 66.7 2 66.7
26 1 33.3 3 100.0
The SAS System 2
------------ Pennsylvania origin=1 ---------------
Age
Cumulative Cumulative
AGE Frequency Percent Frequency Percent
-------------------------------------------------
21 2 40.0 2 40.0
22 1 20.0 3 60.0
23 1 20.0 4 80.0
38 1 20.0 5 100.0
The SAS System 3
OBS PA AGE ACADSTAT PRES
1 0 22 4 Reagan
2 0 22 4 Reagan
3 0 26 4 Reagan
4 1 21 4 Reagan
5 1 21 4 Bush
6 1 22 4 Kennedy
7 1 23 4
8 1 38 5 Carter
Using where to select a single subgroup
The where statement can be used with
any procedure to restrict the action of that procedure to observations
satisfying the conditions of the where statement. Suppose we wanted to
analyze some data generated from a class survey, including only the Seniors,
and no one else? Here's an example using the class.dat data:
proc freq data=class;
where acadstat = 4;
tables pres;
run;
Here proc freq analyzes only the observations in the data set for
which the value of the variable acadstat is 4 (i.e. "Seniors"). The
output would be a single frequency table listing the reported ages for
only the Seniors (acadstat = 4) subgroup:
Cumulative Cumulative
PRES Frequency Percent Frequency Percent
-----------------------------------------------------
Bush 6 10.0 6 10.0
Carter 1 1.7 7 11.7
Clinton 6 10.0 13 21.7
DK 24 40.0 37 61.7
Kennedy 8 13.3 45 75.0
Nixon 1 1.7 46 76.7
Reagan 14 23.3 60 100.0
Frequency Missing = 1
The where statement can be used in conjunction with the set statement
to select observations from an existing data set. We will learn about
the set statement later in the semester, but here's a quick example
that you may want to take note of for future use:
Using the data generated in a previous Stat 480 class:
filename datain 'c:\temp\class.dat';
data class;
infile datain;
input age height acadstat pa macibm pres $ drink;
where acadstat=3 or acadstat=4;
run;
creates the SAS dataset, "class" which contains data only for Juniors
and Seniors (acadstat = 3 or 4). The logical operators "and" and "or"
can be used to create more complicated selection schemes.
The where statement provides more efficient processing than the
subsetting if statement (see below).
SAS functions can not be used within the where
statement, but are allowed in a subsetting if statement.
The subsetting "if"
You can add or drop variables and/or observations from a dataset. For
instance, if we were using the data we generated in class, and wanted
to read-in only data from those people whose favorite recent U.S.
president was Regan, we might write:
filename datain 'c:\temp\class.dat';
data class;
infile datain obs=8;
input age height acadstat pa macibm pres $ drink;
if pres = "Reagan";
run;
We used quotes around "Reagan" because the value
"Reagan" is not a number.
A more computationally efficient way of doing this is to use a where
statement:
filename datain 'c:\temp\class.dat';
data class;
infile datain obs=8;
input age height acadstat pa macibm pres $ drink;
where pres = "Reagan";
run;
On the other hand, suppose we wanted to read-in all the data except
for these "Reaganites"...
filename datain 'c:\temp\class.dat';
data class;
infile datain obs=8;
input age height acadstat pa macibm pres $ drink;
if pres = "Reagan" then delete;
run;
Here some other examples of using the if statement to subset data:
if x <= 10; /* keep only when x is at most 10 */
if x >= 10; /* keep only when x 10 or greater */
if x > 10; /* keep only when x more than 10 */
if not (x = 10); /* keep only when x is not 10 */
The LABEL statement
A label statement can be used within a DATA step to tag a description
to a variable. For example:
label agecat = 'Age category';
attaches the description, "Age category" to the variable, agecat. This
does not alter the data in any way, but when SAS generates output, it
will include this "label" along with any output that contains the
agecat variable. It is a convenience that can make your output more
"readable". The syntax for the label statement is:
label variable = 'label-tag';
...which can be used repeatedly, or:
label variable = 'label-tag'
variable = 'label-tag'
variable = 'label-tag'
.
.
.
variable = 'label-tag'; <-- Note the ending semicolon.
Listing data: PROC PRINT;
PROC PRINT lists the values of the variables in a SAS
data set to the output window. The most basic form looks like this:
proc print;
run;
In this form, a complete listing of the values of all variables in the
data set first will be printed in the output window because no var
statement was specified. A more complete example is
proc print label noobs heading=vertical;
by sex;
var age height tv pres;
run;
The label option uses variable labels, if they have been specified as
part of the DATa step, as column headings rather than variable names
(Using variable names is the default).
The noobs option omits the OBS column of output.
The heading=vertical option prints the column headings vertically.
This is useful when the names are long but the values of the variable
are short.
The var option specifies the variables to be listed and the order in
which they will appear. If it is not used, all the variables in the
data set will be printed.
Optionally, sum may be used instead of or along with var to list
variables. The sum option works similarly to var, except that a double
line is drawn at the bottom of the column listing for each variable in
the sum statement, and the sum of all observations for those variables
is printed (demonstration in class).
An optional by or class statement generates separate analyses for each
level of the by or class variable(s).
Here's a complete program and what the output looks like, using the
first twleve observations (OBS=12) from class.dat as an example;
first the output from the simple PROC PRINT; example, then the output
from the second, more involved example:
Download this SAS code (Right-mouse button, select "Save Link As...")
Since we defined a "label" for the pres variable, when we print using
the LABEL option with PROC PRINT, the label is output as the column
heading instead of the variable name. We did not define a "label" for
age and pa, so these variable names are used as the column headers...
Please send questions or suggestions about this web page to beatnic@aset.psu.edu
ASET | ITS | Penn State
|