About HPC Systems Software User Guides Education Partners

  / gears / hpc / education / tutorials / sas / sasstart

 

Starting Out with SAS


Topics


Common types of ASCII data file formats

Data files can come in a variety of formats. Some of these are: space-delimited (each data element is separated by a space), comma-delimited (each data element is separated by a comma), column (data are aligned in columns, with nothing but column position to delineate between one data element and another), and other formats.

The tutorial for the lab session last week contained an example of reading-in "space-delimited" data that looked like this:

placebo  better  8
placebo  same   22
drug1    better 21
drug1    same    9
drug2    better 15
drug2    same   15

These data are also column- (or fixed-formatted. This means that each variable could be read from a specified set of columns. For example, the data for the variable SITE could be read from columns 1-9; the data for the variable WEIGHT could be read from colums 12-15. We will discuss column-formatted data in more detail later. The same data could be formatted as space delimited in the following manner:

placebo better 8
placebo same 22
drug1 better 21
drug1 same 9
drug2 better 15
drug2 same 15

...with just one space between each data element. The same program listed in the tutorial read these data just as easily with no modifications to the code. Space-delimited data are separated by one or more spaces.

Comma-delimited data are delimited by one comma between data elements:

placebo,better,8
placebo,same,22
drug1,better,21
drug1,same,9
drug2,better,15
drug2,same,15

Data that are entered into a spreadsheet, such as Excel or Lotus, can be saved as comma-delimited data for SAS analysis. This is usually accomplished by selecting "Save As..." and selecting CSV as the output format.

Purely column-delimited data might look like this:

placebobetter 8
placebosame  22
drug1  better21
drug1  same   9
drug2  better15
drug2  same  15

We would need to use an input statement that specifies exactly which columns to read each variable from.

Using DATALINES; (or CARDS;) to read data that is included as part of your SAS code

Normally data are read from an external data file that is separate from your SAS program. However, data can be included as part of your SAS code.

The CARDS; statement does exactly the same thing that a DATALINES; statement does, but CARDS is a lot older than DATALINES. It was originally used in SAS programs to signal the computer to start reading-in data from punch cards. Since punch cards and card readers are now extinct, the "CARDS;" statement started to seem strange to newer SAS students... so SAS Institute created the synonym, "DATALINES;".

This is usually done for one of only three limited reasons:

  1. Sometimes if you have a very, very small data set (perhaps a dozen lines or so)
  2. Sometimes if you are testing your SAS code with a very small data set
  3. Sometimes it can be useful, for instructional purposes, to have the SAS code and the data all together (for example, this is what was done with the SAS program example in lecture #1)

If you wish to enter your data as part of your SAS program, it must be included within a DATA step only, and it is identified using a DATALINES; statement. The DATALINES statement must be the LAST statement of the DATA step (except for the run; statement!), and the data immediately follows (NO blank line after the DATALINES statement!!! SAS will read that as the first line of data and assign missing values). Notice also that the data start in the first column (DO NOT indent), and the end of the data are indicated with a semicolon (;) on a line by itself immediately following the data (again, NO blank lines between the end of the data and the semicolon:

Download this SAS code (Right-mouse button, select "Save Link As...")

...always remember to include the semicolon to mark the end of the data, and a run; statement to mark the end of the data step!

The data listed here was submitted to STATLIB by Therese A. Stukel in July, 1998. It consists of 20 observations from 10 pairs of human twins, on 9 variables. Source: Tramo MJ, Loftus WC, Green RL, Stukel TA, Weaver JB, Gazzaniga MS. Brain Size, Head Size, and IQ in Monozygotic Twins. Neurology, 1998 (in press).

Inputting data from an external data file (space-delimited)

To get data from an external data file into a SAS data set (i.e. to read it into your program), the first thing to do is to identify where the data are. The way this is done will depend on the type of computer your running SAS on. This is because different operating systems (e.g. DOS, unix, CMS, Macintosh) use different formats for identifying file names and directories (folders). Since we will be using the Unix version of SAS, we will begin with specifying a Unix-style file identification. The filename statement is used to identify where a data file is located. The syntax of the filename statement is as follows:

The FILENAME statement

The filename statement is used to identify where a data file is located. The syntax of the filename statement is as follows:

filename fileref external-file

The fileref is simply any name you want to use to refer to the file. It can be thought of as a "nickname" that you will use in your SAS code to refer to the data file. The external-file is the directory and filename of the file that contains the data. By using the filename statement, the effect is to give a "nickname" to the data file. This nickname is called the fileref. The fileref refers to the physical data file on your disk. It is not a SAS data set--a SAS data set is data that is input into your program. You may recode, or "massage" a SAS data set within a SAS program without changing the physical data file on your disk.

Here's an example:

filename data01 'c:\temp\iq-brainsize.dat';

In our SAS code, we now refer to the data in the file iq-brainsize.dat as "data01". There is nothing magical about the name "data01"; we could have just as easily used:

filename xanadu 'c:\temp\iq-brainsize.dat';

...or any other name we may want to choose. SAS uses the fileref instead of the actual external filename as a means of standardization--because there are many different ways that different types of computers name files. The filename statement is used to specify a fileref that SAS will understand regardless of the type of computer you're using. It makes it easy to port your SAS code from one computer platform to another (e.g. IBM PC running Windows-NT to a Macintosh running MacOS); you would only need to change the external-file specification once in the filename statement--and wherever you use the fileref (the "nickname") later in your code, the change will be "updated".

The DATA step

After the fileref has been specified in the filename statement, we tell SAS that we want to create a SAS data set. A SAS dataset is the way SAS works with data. When we read data from an external data file into SAS, SAS creates a copy of the data file that it works with. This is the SAS dataset, and it is named with a data statement. We use the data statement to tell SAS that we want to create an empty SAS data set, prepared to receive data that we will input.

A data statement, such as:

data brainsize;

merely tells SAS to set-up an empty SAS data set called "brainsize". There is nothing magical about the name I chose, "brainsize"--this is merely a name for the SAS dataset that we create; it can be anything we wish, just so long as it follows SAS variable name conventions (see Lecture #01).

The INFILE and INPUT statements

So far, no data has been read-in. When reading data from disk (or tape), the next statement tells SAS what fileref to read from. This statement is called an infile statement. The infile statement identifies the fileref, and optionally specifies other characteristics of the data file. Today, we're reading from a space-delimited data file. SAS assumes that the data file is space- or column-formated, unless there is an option on the infile statement saying otherwise. As an aside, to tell SAS to expect the data to be comma-delimited, we would use the dlm (delimitor) option. For example:

infile commas dlm=',';

...but our data are space-delimited, so we do not need to use the dlm option:

infile data01;

However, though it would be unnecessary, just to be explicit, we could use the dlm option to specify that the data are space-delimited:

infile data01 dlm=' ';

(There is a space between the two ' ' marks in the example above). After the INFILE statement comes an INPUT statement specifying the variable names, separated by spaces, in the order in which they are to be read from the data file:

input  CCMIDSA  FIQ HeadCirc BirthOrd PairID Sex $ TotalSA TotalVol
Weight;

Notice that the variable sex is coded as character data in the data file (i.e. non-numeric) data. The variable sex represents the sex of the observation, coded as either an "M" for males or an "F" for females. In the input statement, we tell SAS to expect alphabetic characters instead of numbers for this variable by adding a '$' after the variable in the input list. There is a space between the variable and the $ character.

Tying it together

Thus, the sequence of statements:

filename data01 'c:\temp\iq-brainsize.dat';
data brainsize;
    infile data01;
    input  CCMIDSA  FIQ  HeadCirc  BirthOrd  PairID  Sex  $ TotalSA
    TotalVol Weight;
    run;

specifies a fileref called "data01" to refer to the disk file, iq-brainsize.dat. Then a SAS data set called "brainsize" is created. The infile statement tells SAS to find the data in the file identified by "data01" and SAS assumes that the data are space-delimited because there is no dlm statement. Next, the data in the file are read-in according to the variable specifications in the input statement.

The data statement comes before the infile statement, which in turn, comes before the input statement.

Review: The anatomy of SAS program steps

All SAS programs consist of a sequence of "steps". There are two kinds of steps: DATA steps, and PROCedures.

DATA step

A DATA step creates a SAS dataset (a collection of data together with a "data dictionary", which defines the variables and their properties). Data must be in the form of a SAS dataset before it can be analyzed by SAS procedures.

In this example SAS program, these lines create the dataset FITNESS from raw data input. Notice the use of the $ character to alert SAS to expect character data:

data fitness;
   input name $ sex $ age height weight;

      cards;  john  M  12  59.0  99.5 james M 12 57.3 83.0 ... (more data
      lines) ; run;

PROC step

A PROCedure step calls a SAS procedure to analyse or process a SAS dataset.

In this example SAS program, these lines call two SAS procedures to analyze the fitness dataset:

proc print data=fitness;
   run;

proc means data=fitness;
   var height weight;
   run;

A SAS program can contain any number of DATA and PROC steps. The SAS statements in each step are executed all together. Once a dataset has been created, it can be processed by any subsequent DATA or PROC step.

SAS Statements

  • All SAS statements start with a keyword (DATA, INPUT, PROC, etc.)
  • All SAS statements end with a semicolon (;) . (The most common problem students encounter is omitting a semicolon -- SAS thinks that two statements are just one.)
  • SAS statements can be entered in free-format : You can begin in any column, type several statements on one line or split a single statement over several lines (as long as no word is split.).
  • Uppercase and lowercase are equivalent, except inside quote marks (For example, sex = 'm'; is not the same as sex = 'M';).

SAS Procedures

SAS Procedures exist to carry out all the forms of statistical analysis. As the above examples indicate, a procedure is invoked in a "PROC step" which starts with the keyword PROC, such as:

proc means data=fitness;
   variables height weight;

The var, or variables, statement can be used with all procedures to indicate which variables are to be analyzed. If this statement is omitted, the default is to include all variables of the appropriate type (character or numeric) for the given analysis.

Some other statements that can be used with most SAS procedure steps are:

by variable(s);

Causes the procedure to be repeated automatically for each different value of the named variable(s). The data set must first be sorted by those variables.

id variable(s);

Give the name of a variable to be used as an observation IDentifier.

label var='label';

Assign a descriptive label to a variable.

where (expression);

Select only those observations for which the expression is true.

For example, the following lines produce separate means for males and females, with the variable SEX labelled 'Gender'. (An ID statement is not appropriate, because PROC MEANS produces only summary output.)

proc sort data=fitness;
   by sex;
proc means data=fitness;
   var height weight;
   by sex;
   label sex='Gender';

If the DATA=option is not used, SAS procedures process the most recently created dataset.

Some INFILE options: MISSOVER, OBS=, and FIRSTOBS=

Here's a useful option on the infile statement:

infile data01 missover;

This statement instructs SAS to use file referenced as data01 for the data. The missover option directs SAS to skip over missing data rather than going to a new line.

infile data01 obs=n;

Instructs SAS to read-in only the first n observations in the data file. This is useful to "test" your code for errors before running it on a large data file--it reduces the time and resources needed to test the program before running it on the full data.

infile data01 firstobs=n;

Skip first n-1 observations (lines). This can be handy when column names in the data file are documented at the top of the file, for example:

infile data01 firstobs=2;

will skip the first observation (line) in the data file when it is read-in.

More about INFILE and data delimiters

We've already seen how to specify that the data we are INPUTting is comma-delimited, by using dlm= on the INFILE statement:

infile commas dlm=',';

...and we know that we can either use a space character between the quotes or eliminate the dlm= statement altogether (space-delimited is the default) if the data are space-delimited...

Tab-delimited data

What if the data are tab-delimited? If we try using a tab character between the quotes, it doesn't work (try it)!!! This is because the tab character is a special character for some operating systems. Instead of using the tab character between the quotes, we can specify the hex (hexidecimal) code for the tab chacter:

infile tabs dlm='09'x;

The x character at the end of the dlm='09'x statement indicates that the characters inside the quotes represent a hex code. The hex code for a tab character is 09.

Alternatively, we could use:

infile tabs expandtabs;

...which actually converts the tabs to spaces before reading-in the data.

NOTE: In the last two examples above, "tabs" is just the fileref--it's whatever name you used in the filename statement to reference from where you're reading data. There's nothing special about this name; it could be anything you choose--I'm just using "tabs" as the name here because I thought it a nice name to use for the example.

Double commas for missing values

Another common data format is to separate variables by commas and to place character data between double quotes. This is often the preferred text-format generated by spreadsheet programs. Furthermore, two commas together indicate that there is a missing value (this is what happens when you're entering data into a spreadsheet and leave blank cells for missing values.

To work with this sort of data format in SAS, we can use the DSD (data-sensitive delimiter) INFILE statement option. Besides allowing commas as the data delimiter, this option reads character strings enclosed in double quotes and strips off the quotes before assigning the value to the character variable. It also allows you to include commas within a character string. Finally, two commas together are interpreted as a missing value. Here's an example, modified from Ronald Cody, Ed.D. (SUGI23), The INPUT Statement: Where It's @:

Download this SAS code (Right-mouse button, select "Save Link As...")

To see that this program is working as expected, here is the output from PROC PRINT:

When you use the DSD INFILE option, the default delimiter is a comma. You may use the DSD and DLM= options together to allow all the features just discussed but with a delimiter other than a comma.

Frequency tables: PROC FREQ;

PROC FREQ; can be used to produce one-way, two-way, or multi-way frequency tables or cross tabulations (crosstabs). For now, we will focus on using PROC FREQ for generating simple (one-way) frequency tables. NOTE: PROC FREQ is used with a TABLES statement to specify the variables to be put into a frequency table or cross tabulation. This is unlike most of the other procedures we will learn about in this class, which use var, (or variables) to list analysis variables. The TABLES statement is use with PROC FREQ because unlike the VAR statement, it allows for specifying combinations of variables for two-way and multi-way tables using an asterisk (*). Generating mulit-way tables using PROC FREQ will be discussed in another lecture.

A variable listed alone will result in a simple frequencies table. Two or more variables listed on the TABLES statement, separated by spaces, will produce an individual frequencies table for each variable. Here are two examples illustrating this, with their respective output:

proc freq;
    tables age;
    run;

                  The SAS System

                        Cumulative  Cumulative
AGE   Frequency   Percent   Frequency    Percent
-------------------------------------------------
 20         11      18.6          11       18.6
 21         19      32.2          30       50.8
 22         15      25.4          45       76.3
 23          2       3.4          47       79.7
 24          8      13.6          55       93.2
 26          2       3.4          57       96.6
 28          1       1.7          58       98.3
 38          1       1.7          59      100.0

proc freq;
    tables age sex tv;
    run;

                  The SAS System

                        Cumulative  Cumulative
AGE   Frequency   Percent   Frequency    Percent
-------------------------------------------------
 20         11      18.6          11       18.6
 21         19      32.2          30       50.8
 22         15      25.4          45       76.3
 23          2       3.4          47       79.7
 24          8      13.6          55       93.2
 26          2       3.4          57       96.6
 28          1       1.7          58       98.3
 38          1       1.7          59      100.0


                        Cumulative  Cumulative
SEX   Frequency   Percent   Frequency    Percent
-------------------------------------------------
 F          19      32.2          19       32.2
 M          40      67.8          59      100.0


                       Cumulative  Cumulative
TV   Frequency   Percent   Frequency    Percent
------------------------------------------------
 0          8      13.6           8       13.6
 1         34      57.6          42       71.2
 2         13      22.0          55       93.2
 9          4       6.8          59      100.0
Sorting and subsetting

Sorting and running a procedure by subgroups

Sometimes it is very helpful to run each of several subgroups through some summary or analysis procedure. This can be done with the sort procedure and use of the by phrase. Using class.dat again, here's an example of using proc sort; on the first eight observations of the data:

Download this SAS code (Right-mouse button, select "Save Link As...")

The data are first sorted by pa and then we run the freq procedure separately for each subgroup. Finally, selected variables from the data are printed, using proc print. In contrast to the where statement, which selects one subgroup, using by performs the procedure for each subgroup. This is much cleaner than running SAS 3 times, each time retaining only the treatment group under study. However, it does produce a lot more output! The by phrase is available in all procedures, but you must sort before you use it.

The "by" statement in your proc sort statement, in most cases, should match the "by" statement in the analysis procedure.

Note: We do not use a $ to denote character data in SAS procedures; it's needed only in the input statement.

Here's what the output looks like. Notice that the data are sorted into subgroups according to the variable pa in the proc print; output listing. You may wish to compare this to the first eight observations in the raw data, class.dat:

                  The SAS System                 1

------------ Pennsylvania origin=0 ---------------

                       Age

                        Cumulative  Cumulative
AGE   Frequency   Percent   Frequency    Percent
-------------------------------------------------
 22          2      66.7           2       66.7
 26          1      33.3           3      100.0


                  The SAS System                 2

------------ Pennsylvania origin=1 ---------------

                       Age

                        Cumulative  Cumulative
AGE   Frequency   Percent   Frequency    Percent
-------------------------------------------------
 21          2      40.0           2       40.0
 22          1      20.0           3       60.0
 23          1      20.0           4       80.0
 38          1      20.0           5      100.0


                  The SAS System                 3

     OBS    PA    AGE    ACADSTAT    PRES

      1      0     26        4       Reagan
      2      0     22        4       Reagan
      3      0     22        4       Reagan
      4      1     23        4
      5      1     21        4       Reagan
      6      1     38        5       Carter
      7      1     22        4       Kennedy
      8      1     21        4       Bush

One can sort by several variables at once. The following code is identical to the above code, except for the "by" statement in the sort procedure. This time, we sort according to two variables, pa and age:

filename datain 'c:\temp\class.dat';

data class;
   infile datain obs=8;
   input age height acadstat pa macibm pres $ drink;
   label pa  = "Pennsylvania origin";
   run;

proc sort data=class;
   by pa age;
   run;

proc freq data=class;
   by pa;
   tables age;
   run;

proc print data=class;
   var age height acadstat pres;
   run;

This code will sort the data first in order of pa and then within each "pa" group, the data are sorted by age. Notice how age is sorted within each pa group in the proc print; output that is generated by this program:

                  The SAS System                 1

------------ Pennsylvania origin=0 ---------------

                       Age

                        Cumulative  Cumulative
AGE   Frequency   Percent   Frequency    Percent
-------------------------------------------------
 22          2      66.7           2       66.7
 26          1      33.3           3      100.0


                  The SAS System                 2

------------ Pennsylvania origin=1 ---------------

                       Age

                        Cumulative  Cumulative
AGE   Frequency   Percent   Frequency    Percent
-------------------------------------------------
 21          2      40.0           2       40.0
 22          1      20.0           3       60.0
 23          1      20.0           4       80.0
 38          1      20.0           5      100.0


                  The SAS System                 3

     OBS    PA    AGE    ACADSTAT    PRES

      1      0     22        4       Reagan
      2      0     22        4       Reagan
      3      0     26        4       Reagan
      4      1     21        4       Reagan
      5      1     21        4       Bush
      6      1     22        4       Kennedy
      7      1     23        4
      8      1     38        5       Carter

Using where to select a single subgroup

The where statement can be used with any procedure to restrict the action of that procedure to observations satisfying the conditions of the where statement. Suppose we wanted to analyze some data generated from a class survey, including only the Seniors, and no one else? Here's an example using the class.dat data:

proc freq data=class;
   where acadstat = 4;
   tables pres;
   run;

Here proc freq analyzes only the observations in the data set for which the value of the variable acadstat is 4 (i.e. "Seniors"). The output would be a single frequency table listing the reported ages for only the Seniors (acadstat = 4) subgroup:

                            Cumulative  Cumulative
PRES      Frequency   Percent   Frequency    Percent
-----------------------------------------------------
Bush             6      10.0           6       10.0
Carter           1       1.7           7       11.7
Clinton          6      10.0          13       21.7
DK              24      40.0          37       61.7
Kennedy          8      13.3          45       75.0
Nixon            1       1.7          46       76.7
Reagan          14      23.3          60      100.0

                Frequency Missing = 1

The where statement can be used in conjunction with the set statement to select observations from an existing data set. We will learn about the set statement later in the semester, but here's a quick example that you may want to take note of for future use:

Using the data generated in a previous Stat 480 class:

filename datain 'c:\temp\class.dat';

data class;
   infile datain;
   input age height acadstat pa macibm pres $ drink;
   where acadstat=3 or acadstat=4;
   run;

creates the SAS dataset, "class" which contains data only for Juniors and Seniors (acadstat = 3 or 4). The logical operators "and" and "or" can be used to create more complicated selection schemes.

The where statement provides more efficient processing than the subsetting if statement (see below).

SAS functions can not be used within the where statement, but are allowed in a subsetting if statement.

The subsetting "if"

You can add or drop variables and/or observations from a dataset. For instance, if we were using the data we generated in class, and wanted to read-in only data from those people whose favorite recent U.S. president was Regan, we might write:

filename datain 'c:\temp\class.dat';

data class;
   infile datain obs=8;
   input age height acadstat pa macibm pres $ drink;
   if pres = "Reagan";
   run;

We used quotes around "Reagan" because the value "Reagan" is not a number.

A more computationally efficient way of doing this is to use a where statement:

filename datain 'c:\temp\class.dat';

data class;
   infile datain obs=8;
   input age height acadstat pa macibm pres $ drink;
   where pres = "Reagan";
   run;

On the other hand, suppose we wanted to read-in all the data except for these "Reaganites"...

filename datain 'c:\temp\class.dat';

data class;
   infile datain obs=8;
   input age height acadstat pa macibm pres $ drink;
   if pres = "Reagan" then delete;
   run;

Here some other examples of using the if statement to subset data:

if x <= 10;                /* keep only when x is at most 10 */
if x >= 10;                /* keep only when x 10 or greater */
if x >  10;                /* keep only when x more than 10  */
if not (x = 10);           /* keep only when x is not 10     */
The LABEL statement

A label statement can be used within a DATA step to tag a description to a variable. For example:

label agecat = 'Age category';

attaches the description, "Age category" to the variable, agecat. This does not alter the data in any way, but when SAS generates output, it will include this "label" along with any output that contains the agecat variable. It is a convenience that can make your output more "readable". The syntax for the label statement is:

label variable = 'label-tag';

...which can be used repeatedly, or:

label variable = 'label-tag'
    variable = 'label-tag'
    variable = 'label-tag'
         .
         .
         .
    variable = 'label-tag';  <-- Note the ending semicolon.
Listing data: PROC PRINT;

PROC PRINT lists the values of the variables in a SAS data set to the output window. The most basic form looks like this:

proc print;
   run;

In this form, a complete listing of the values of all variables in the data set first will be printed in the output window because no var statement was specified. A more complete example is

proc print label noobs heading=vertical;
   by sex;
   var age height tv pres;
   run;

The label option uses variable labels, if they have been specified as part of the DATa step, as column headings rather than variable names (Using variable names is the default).

The noobs option omits the OBS column of output.

The heading=vertical option prints the column headings vertically. This is useful when the names are long but the values of the variable are short.

The var option specifies the variables to be listed and the order in which they will appear. If it is not used, all the variables in the data set will be printed.

Optionally, sum may be used instead of or along with var to list variables. The sum option works similarly to var, except that a double line is drawn at the bottom of the column listing for each variable in the sum statement, and the sum of all observations for those variables is printed (demonstration in class).

An optional by or class statement generates separate analyses for each level of the by or class variable(s).

Here's a complete program and what the output looks like, using the first twleve observations (OBS=12) from class.dat as an example; first the output from the simple PROC PRINT; example, then the output from the second, more involved example:

Download this SAS code (Right-mouse button, select "Save Link As...")

Since we defined a "label" for the pres variable, when we print using the LABEL option with PROC PRINT, the label is output as the column heading instead of the variable name. We did not define a "label" for age and pa, so these variable names are used as the column headers...


Please send questions or suggestions about this web page to beatnic@aset.psu.edu

ASET | ITS | Penn State