Friday, December 19, 2014

SAS on Mainframes (z/OS): Tutorial with Examples

This post is about how SAS reads a record and processes it internally.
To start with, let's review some basic SAS concepts that come into play whenever you run a simple SAS program.
A SAS program has two primary kinds of steps:
1. SAS DATA step
2. SAS PROC step

DATA steps typically create or modify SAS data sets. They can also be used to produce custom-designed reports. For example, you can use DATA steps to
  • put your data into a SAS data set
  • compute values
  • check for and correct errors in your data
  • produce new SAS data sets by subsetting, merging, and updating existing data sets.

PROC (procedure) steps are pre-written routines that enable you to analyze and process the data in a SAS data set and to present the data in the form of a report.

For example, you can use PROC steps to
  • create a report that lists the data
  • produce descriptive statistics
  • create a summary report. 
Diagrammatically, the program flow looks like this:

[Figure: SAS program flow]

Features of a SAS statement:
  • It usually begins with a SAS keyword.
  • It always ends with a semicolon.

In the DATA step, we introduce the input file, i.e. the external file (supplied through a DD name in the JCL in the case of a mainframe), to SAS. A DATA step begins with the DATA keyword. We also declare the layout of the fields. As an example,
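DATA CUST;                        /* name of the SAS data set to create      */
  INFILE  CUSTPOL;                /* DD name of the external input file      */
  INPUT @23  ACCTNO   $CHAR01.    /* character field at column 23, length 1  */
        @60  POLNO    $CHAR10.    /* character field at column 60, length 10 */
        @76  STAT     $CHAR01.
        @224 POLEFFDT 8.          /* numeric field at column 224, width 8    */
        @240 APPRCDT  8.
          ;
RUN;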
Here, DATA, INFILE and INPUT are the SAS keywords. CUST will be the name of the SAS data set which SAS will create once this step is executed.
INFILE CUSTPOL: Here CUSTPOL is the name (the DD name in the JCL) of the physical (external) file from which the data is to be read. INPUT takes only the fields at the specified positions, and only those fields will be present in the SAS data set CUST.

The above SAS DATA step is processed in 2 phases.
A) Compilation phase: Each statement is checked for syntax errors. Once compilation completes, execution begins.
B) Execution phase: Each record is read and the statements are executed once per record, unless coded otherwise.
Some of the terms that come up in SAS data processing (a bit of background knowledge is good) are:
Input Buffer:
During the compilation phase, an input buffer (a memory area) is created to hold a record from the file. It is created only when raw data is read, not when an existing SAS data set is read. It is just a logical concept.
Program Data Vector:
The program data vector (PDV) is the memory area, internal to SAS, in which SAS builds the data set one observation at a time as the data is read.
The program data vector also contains automatic variables that can be used to track the number of observations, and they come in handy in many ways.

1. _N_ counts the number of times that the DATA step begins to execute.
2. _ERROR_ signals the occurrence of an error that is caused by the data during execution.
The default value is 0, which means there is no error. When one or more errors occur, the value is set to 1.


At the beginning of the execution phase, the value of _N_ is 1. Because there are no data errors, the value of _ERROR_ is 0.

When we define the DATA step, we should declare only the variables we need. Unnecessary variables make the SAS data set bigger, which can lead to longer execution time.

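During execution, each record in the input raw data file is read, stored in the program data vector, and then written to the new data set as an observation.
At the end of each iteration of the DATA step, several actions occur. First, the values in the program data vector are written to the output data set as an observation (the first observation on the first iteration). Control then returns to the top of the DATA step, _N_ is incremented by 1, and the variables read from the raw record are reset to missing before the next record is read.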
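
The execution loop described above can be sketched as a small Python model. This is not SAS code, only an illustration of how the input buffer, the program data vector, _N_ and _ERROR_ interact. The function name run_data_step, the short field layout and the two sample records are assumptions made up for the example, and details such as the final end-of-file iteration are left out.

```python
# Illustrative Python model of the SAS DATA step execution loop (not SAS itself).
# Each layout entry mimics "INPUT @start NAME informat." with 1-based columns.

def run_data_step(raw_records, layout):
    """layout: list of (name, start_column, width, is_numeric) tuples."""
    output_data_set = []   # observations written by the implicit OUTPUT
    trace = []             # (_N_, _ERROR_) seen in each iteration
    n = 0
    for record in raw_records:
        n += 1                                   # _N_ counts iterations
        input_buffer = record                    # holds one raw record at a time
        pdv = {name: None for name, _, _, _ in layout}   # reset to missing
        pdv["_N_"] = n
        pdv["_ERROR_"] = 0
        for name, start, width, is_numeric in layout:
            text = input_buffer[start - 1:start - 1 + width]
            if is_numeric:
                try:
                    pdv[name] = int(text)
                except ValueError:               # invalid numeric data
                    pdv[name] = None
                    pdv["_ERROR_"] = 1
            else:
                pdv[name] = text
        trace.append((pdv["_N_"], pdv["_ERROR_"]))
        # Automatic variables live in the PDV but are not written out.
        observation = {k: v for k, v in pdv.items()
                       if k not in ("_N_", "_ERROR_")}
        output_data_set.append(observation)
    return output_data_set, trace


# Hypothetical 20-byte records: 1-byte account, 10-byte policy, 1-byte status,
# 8-digit date. The second record has a bad digit in the date field.
sample_layout = [
    ("ACCTNO", 1, 1, False),
    ("POLNO", 2, 10, False),
    ("STAT", 12, 1, False),
    ("POLEFFDT", 13, 8, True),
]
sample_records = ["A0000012345X20141219", "B0000067890Y2014121Z"]
observations, trace = run_data_step(sample_records, sample_layout)
```

In this model the second record sets _ERROR_ to 1 and leaves POLEFFDT missing, while the observation is still written, which is roughly what the description above implies for invalid data.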



Log Messages
Each time SAS executes a step, it writes messages to the log. In a z/OS environment, the log is written to SASLOG and looks like the sample below. It shows the number of records read and the number of observations that met the selection criteria and were written to the SAS data set.

NOTE: 17430 records were read from the infile CUSTPOL.                
      The minimum record length was 600.                              
      The maximum record length was 636.                              
NOTE: The data set WORK.CUSTPOL has 5818 observations and 12 variables.
NOTE: The DATA statement used the following resources:                
      CPU     time -         00:00:00.07                              
      Elapsed time -         00:00:02.99                              
      EXCP count   - 5998                                             
      Task  memory - 4904K (148K data, 4756K program)                 
      Total memory - 17710K (3488K data, 14222K program)              
      Timestamp    - 12/19/2014 2:43:29 AM                            
NOTE: The address space has used a maximum of 876K below the line and 1

Friday, December 12, 2014

What are KSDS, ESDS, RRDS and LDS in VSAM? Concept and Structure of VSAM Data Sets

Types of VSAM and their concepts:
VSAM supports the following types of file organization:
1. ESDS  2. RRDS  3. KSDS  4. LDS
ESDS (Entry Sequenced Data Set)
An ESDS can be accessed sequentially or directly. It cannot be accessed skip-sequentially. ESDSs are mostly used for data such as logs, which are written once and then read sequentially.
An ESDS record can be located sequentially by checking each record, starting from the beginning of the data set. An ESDS record can also be located directly if the relative byte address (RBA) of the record is known. This address can be obtained from a previous search or by remembering the address at which the record was inserted.
ESDS Data Insertion
New records must be inserted at the end of the file. Each record is inserted at the end of the latest CI. If there is not enough space in that CI, a new CI is started.
Physical deletion of records is not possible. If a record is no longer needed, we can mark it as deleted, but it will remain in the data set until the ESDS is rebuilt.
Records can be updated in place, provided the length remains the same. If the length changes, the record must be marked as deleted and added as a new record at the end.
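
The ESDS rules above (append only, direct access by RBA, same-length updates) can be illustrated with a toy Python model. The class name EsdsModel and its methods are assumptions for the example and are not a VSAM API. The model also ignores CI boundaries and control fields, so its RBAs are simple byte offsets.

```python
# Toy model of an ESDS: records are only appended, and each record is
# addressed by the byte offset (RBA) at which it was written.

class EsdsModel:
    def __init__(self):
        self._data = bytearray()
        self._lengths = {}                 # RBA -> record length

    def append(self, record: bytes) -> int:
        """Add a record at the end and return its RBA."""
        rba = len(self._data)
        self._data += record
        self._lengths[rba] = len(record)
        return rba

    def read(self, rba: int) -> bytes:
        """Direct access: the caller must already know the RBA."""
        length = self._lengths[rba]
        return bytes(self._data[rba:rba + length])

    def update(self, rba: int, record: bytes) -> None:
        """Update in place; the length is not allowed to change."""
        if len(record) != self._lengths[rba]:
            raise ValueError("an ESDS record cannot change length in place")
        self._data[rba:rba + len(record)] = record

    def read_all(self):
        """Sequential access from the beginning of the data set."""
        for rba in sorted(self._lengths):
            yield rba, self.read(rba)


log = EsdsModel()
first_rba = log.append(b"LOG001")
second_rba = log.append(b"LOG0002")
log.update(first_rba, b"LOGAAA")           # same length, so allowed
```

There is deliberately no delete method: as described above, an application can only flag a record as deleted by a convention of its own.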

RRDS (Relative Record Data Set)
RRDS is another type of VSAM data set. It consists of fixed-length areas called slots. These are pre-formatted when the data set is created, and whenever a new CA is created. Records are inserted into the slots.

[Figure: RRDS and its slots]
Advantages of RRDS over ESDS:
In an RRDS, records can be added and deleted within the slots.
Records can be directly accessed by specifying the slot number, known as the relative record number (RRN). The first slot has RRN 1. We can also use skip-sequential processing for an RRDS.
The application program can insert a record into any free slot, which is known as direct insertion. It can also request that the record be inserted into the next free slot, which is known as sequential insertion.
When a record is deleted, its slot can be reused.
RRDS Structure:
There is one RDF for every slot in an RRDS. Each RDF indicates whether its associated slot is empty or not.
[Figure: RRDS internal structure]
In an RRDS, all records must be of the same length. Unlike an ESDS, an RRDS cannot have spanned records, and it cannot have free space.
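
As a rough illustration of slots and RRNs, here is a toy Python model. The class name RrdsModel and its method names are assumptions for the example, not a VSAM API. Sequential insertion is simplified to "first empty slot", and the per-slot RDF is represented simply by whether the slot holds None.

```python
# Toy model of an RRDS: a fixed number of fixed-length slots, numbered from 1.

class RrdsModel:
    def __init__(self, slot_count: int, record_length: int):
        self.record_length = record_length
        self._slots = [None] * slot_count      # None marks an empty slot

    def _index(self, rrn: int) -> int:
        if not 1 <= rrn <= len(self._slots):
            raise IndexError("RRN out of range")
        return rrn - 1                          # the first slot has RRN 1

    def insert_direct(self, rrn: int, record: bytes) -> None:
        """Direct insertion into a specific free slot."""
        index = self._index(rrn)
        if len(record) != self.record_length:
            raise ValueError("all RRDS records must have the same length")
        if self._slots[index] is not None:
            raise ValueError("slot is already occupied")
        self._slots[index] = record

    def insert_sequential(self, record: bytes) -> int:
        """Insert into the first empty slot and return its RRN."""
        for index, slot in enumerate(self._slots):
            if slot is None:
                self.insert_direct(index + 1, record)
                return index + 1
        raise RuntimeError("no free slot left")

    def read(self, rrn: int):
        return self._slots[self._index(rrn)]

    def delete(self, rrn: int) -> None:
        """Empty the slot so that it can be reused."""
        self._slots[self._index(rrn)] = None


rrds = RrdsModel(slot_count=4, record_length=5)
rrds.insert_direct(3, b"CCCCC")
rrn_a = rrds.insert_sequential(b"AAAAA")       # goes to slot 1
rrn_b = rrds.insert_sequential(b"BBBBB")       # goes to slot 2
rrn_d = rrds.insert_sequential(b"DDDDD")       # slot 3 is taken, so slot 4
rrds.delete(3)
rrn_e = rrds.insert_sequential(b"EEEEE")       # reuses the freed slot 3

```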
LDS (Linear Data Set)
A linear data set holds data as a continuous string of bytes. There are no CIDFs or RDFs. The most common user of LDSs is DB2, which uses them to store its objects.
z/OS provides Data-In-Virtual (DIV) services and window services to access and update the data in an LDS.

KSDS (Key Sequenced Data Set)
In a KSDS, we have a cluster which consists of two parts: the data part and the index part.
Records in a KSDS are kept in key sequence, so when the data set is loaded sequentially the records must already be sorted by key. The key can be 1 to 255 contiguous bytes and must be unique. Each key in the index points to the data part. As we insert, update, or delete records, the index component is updated automatically.
[Figure: KSDS VSAM cluster]
In the figure, the rounded yellow part signifies the VSAM name and is known as the VSAM cluster. It relates the data and the index parts. It is not a file, just a catalog entry.
Records in a KSDS can be updated and their length can be changed. If a record is reduced in length, the space given up becomes free space. If the record length increases, the records that follow it in the CI are moved into the free space to make room for the updated record.
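
The relationship between the index and the data part can be sketched with a toy Python model. The class name KsdsModel and its methods are assumptions for the example. A real KSDS index is a multi-level structure that points to CIs, not the flat sorted list used here, so this only illustrates unique keys, key-ordered access and automatic index maintenance.

```python
import bisect

# Toy model of a KSDS: a sorted list of keys stands in for the index
# component and a dictionary stands in for the data component.

class KsdsModel:
    def __init__(self):
        self._keys = []        # "index": keys in ascending order
        self._data = {}        # "data": key -> record

    def insert(self, key: bytes, record: bytes) -> None:
        if not 1 <= len(key) <= 255:
            raise ValueError("key must be 1 to 255 bytes long")
        if key in self._data:
            raise ValueError("duplicate key")
        bisect.insort(self._keys, key)     # index is maintained automatically
        self._data[key] = record

    def read(self, key: bytes) -> bytes:
        """Direct access by key."""
        return self._data[key]

    def update(self, key: bytes, record: bytes) -> None:
        """The record length is allowed to change."""
        if key not in self._data:
            raise KeyError(key)
        self._data[key] = record

    def delete(self, key: bytes) -> None:
        del self._data[key]
        self._keys.remove(key)

    def browse(self, start_key: bytes = b""):
        """Sequential access in key order, starting at start_key."""
        position = bisect.bisect_left(self._keys, start_key)
        for key in self._keys[position:]:
            yield key, self._data[key]


ksds = KsdsModel()
for key in (b"C003", b"A001", b"B002"):        # inserted out of order
    ksds.insert(key, b"record for " + key)
ksds.update(b"B002", b"a much longer record for B002")
keys_in_order = [key for key, _ in ksds.browse()]
keys_from_b = [key for key, _ in ksds.browse(b"B")]
```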

Check here on How to Define KSDS,ESDS,RRDS,LDS with IDCAMS

Friday, December 5, 2014

VSAM Basics and VSAM Tutorial (Control Interval, Control Area, CIDF, RDF)

This post is just a short write-up on VSAM.
What is a VSAM file?
Virtual Storage Access Method: VSAM is both a type of data set and the set of APIs used to access it. It is well suited to records that need to be read sequentially, directly, or skip-sequentially.
A VSAM data set must always be kept on disk (DASD), not on tape. IBM also introduced an extension of VSAM known as extended format VSAM. To an application, a VSAM data set and an extended format VSAM data set look the same. Extended format provides some additional features such as data compression, sharing, and improved performance, and allows a maximum data set size of 128 terabytes.
What are Control Intervals?
A control interval (CI) is the basic building block of a VSAM data set. It holds one or more records. The concept is similar to a block in sequential and partitioned data sets. When we read or write a record in a VSAM file, the entire CI is moved between DASD and memory, not just the single record.
[Figure: A control interval, which consists of a bunch of records]
What are CIDF and RDF in VSAM?
Apart from keeping records in a CI, VSAM places two additional kinds of fields in it which are needed to manage the CI. They are the CIDF and the RDF.
[Figure: Unused space, RDF and CIDF in a VSAM CI]
The dotted area in the figure shows the unused space in a CI.
RDF (record descriptor field) stores the length of a record. A CI with fixed-length records needs two RDFs: one to hold the number of records and the other to hold the record length. For variable-length records, there is one RDF per record.
CIDF (control interval definition field) is a 4-byte field which holds the location and size of the unused space in the CI (the unused area shown in the figure).
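
To make the bookkeeping concrete, the following Python sketch estimates what the CIDF would record for a CI. The function name ci_free_space is an assumption for the example. It takes the 4-byte CIDF from the text, assumes 3-byte RDFs, and simplifies the RDF rule to "two RDFs if all records in the CI have the same length, otherwise one per record".

```python
# Sketch of CI space accounting: records grow from the left of the CI,
# control fields (RDFs and the CIDF) sit at the right, and the unused
# space lies in between.

CIDF_LENGTH = 4    # bytes, as stated in the text
RDF_LENGTH = 3     # bytes per RDF (assumption of this sketch)


def ci_free_space(ci_size, record_lengths):
    """Return (offset, length) of the unused space in the CI."""
    data_bytes = sum(record_lengths)
    same_length = len(set(record_lengths)) == 1
    if len(record_lengths) > 1 and same_length:
        rdf_count = 2                      # one for the count, one for the length
    else:
        rdf_count = len(record_lengths)    # one RDF per record
    control_bytes = rdf_count * RDF_LENGTH + CIDF_LENGTH
    free_length = ci_size - data_bytes - control_bytes
    if free_length < 0:
        raise ValueError("records do not fit in one CI")
    return data_bytes, free_length         # free space starts after the data


fixed = ci_free_space(4096, [100] * 10)          # ten 100-byte records
variable = ci_free_space(4096, [100, 200, 50])   # three different lengths
```

With a 4096-byte CI, the ten fixed-length records use 1000 data bytes plus 10 control bytes, while the three variable-length records use 350 data bytes plus 13 control bytes.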

What is a Control Area?
A group of CIs makes up one control area (CA).
[Figure: A group of control intervals is called a control area]

Read here  to check the  Parameters to define a Vsam with IDCAMS

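Spanned Records in VSAM and CI Split:
While inserting a record, if the record size is larger than the CI size, the record cannot be stored in a single CI. In such scenarios, if the VSAM data set is defined with the SPANNED option, the record is split into segments that are stored across more than one CI; such a record is known as a spanned record. A spanned record occupies each of its CIs entirely, so any unused space left in the last CI cannot be used by another record. Note that this is not the same as a CI split, which happens in a KSDS when a record is inserted into a CI that has no room for it.
[Figure: Spanned record]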
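
The space cost of a spanned record can be estimated with a short Python calculation. The function name spanned_layout is an assumption for the example, and so is the 10 bytes of control information per CI (a 4-byte CIDF plus two 3-byte RDFs for each segment). The sketch only makes sense for a record that is larger than one CI.

```python
import math

# Rough estimate of how many CIs a spanned record occupies and how much
# space in its last CI is left unusable.

SPANNED_OVERHEAD = 10   # assumed control bytes per CI holding a segment


def spanned_layout(record_length, ci_size):
    """Return (number of CIs used, unusable bytes in the last CI)."""
    usable_per_ci = ci_size - SPANNED_OVERHEAD
    if usable_per_ci <= 0:
        raise ValueError("CI size is too small")
    ci_count = math.ceil(record_length / usable_per_ci)
    wasted = ci_count * usable_per_ci - record_length
    return ci_count, wasted


# A 10000-byte record in 4096-byte CIs: 4086 usable bytes per CI.
cis_used, wasted_bytes = spanned_layout(10000, 4096)
```

Under these assumptions the record needs three CIs, and the 2258 bytes left over in the third CI stay unused.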