Friday, December 19, 2014

SAS in Mainframes (z/OS) Tutorial with examples

This post is basically about how SAS reads a record and internally processes it.
To start with, let's check some basic SAS concepts which come into play whenever you run a simple SAS program.
The two primary steps in a SAS program are:
1. SAS DATA step
2. SAS PROC step

DATA steps typically create or modify SAS data sets. They can also be used to produce custom-designed reports. For example, you can use DATA steps to
  • put your data into a SAS data set
  • compute values
  • check for and correct errors in your data
  • produce new SAS data sets by subsetting, merging, and updating existing data sets. 

PROC (procedure) steps are pre-written routines that enable you to analyze and process the data in a SAS data set and to present the data in the form of a report.

For example, you can use PROC steps to
  • create a report that lists the data
  • produce descriptive statistics
  • create a summary report. 
Diagrammatically the program flow is like below:

SAS program flow

Features of a SAS statement:
  • It usually begins with a SAS keyword.
  • It always ends with a semicolon.

In the DATA step, we introduce the input file, i.e. the external file (supplied in the DD name in JCL in case of mainframe), to SAS. The DATA step begins with the DATA keyword. We also declare the layout of the fields here. As an example,
DATA CUST;                    
  INFILE  CUSTPOL;            
  INPUT @23  ACCTNO   $CHAR01.
        @60  POLNO    $CHAR10.
        @76  STAT     $CHAR01.
        @224 POLEFFDT 8.      
        @240 APPRCDT  8.;
Here, CUST will be the name of the SAS data set which SAS will prepare internally once this step is executed.
INFILE CUSTPOL: here CUSTPOL is the DD name of the physical (external) file from which the data is to be read. INPUT will take only the fields from the specified positions, and only those fields will be present in the SAS data set CUST.
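Conceptually, those @-column pointers just pick fields out of fixed positions in each record. A rough Python sketch of the idea (illustrative only, not how SAS is implemented; the positions and field names come from the example above, and the sample record in the usage below is made up):

```python
# Rough sketch of what the INPUT statement's column pointers do:
# pick each field from a fixed (1-based) column position in the record.
LAYOUT = [
    ("ACCTNO",   23, 1),    # @23  ACCTNO   $CHAR01.
    ("POLNO",    60, 10),   # @60  POLNO    $CHAR10.
    ("STAT",     76, 1),    # @76  STAT     $CHAR01.
    ("POLEFFDT", 224, 8),   # @224 POLEFFDT 8.
    ("APPRCDT",  240, 8),   # @240 APPRCDT  8.
]

def read_record(record):
    """Extract the declared fields from one fixed-position record."""
    return {name: record[col - 1:col - 1 + length]
            for name, col, length in LAYOUT}
```

Only these fields end up in the CUST data set; everything else in each input record is ignored.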

The above SAS DATA step is processed in 2 phases:
A) Compilation phase: each statement is checked for syntax errors. Once this completes, execution begins.
B) Execution phase: the data is read and processed record by record, unless otherwise coded.
Some of the terms which come up with SAS data processing (just a bit of knowledge is good) are:
Input Buffer:
During the compilation phase, an input buffer (a memory area) is created to hold the records from the file. It is filled when the raw data is read. It is just a logical concept.
Program Data Vector (PDV):
When the data is read, SAS builds each observation in a memory area (which is very much internal to SAS) known as the program data vector.
The program data vector also contains automatic variables that can be used to track the number of observations, and comes in handy in many ways.

1. _N_ counts the number of times that the DATA step begins to execute.
2. _ERROR_ signals the occurrence of an error that is caused by the data during execution.
The default value is 0, which means there is no error. When one or more errors occur, the value is set to 1.

At the beginning of the execution phase, the value of _N_ is 1. Because there are no data errors, the value of _ERROR_ is 0.

When we define the DATA step, we should try to use the minimum number of variables. Unnecessary declaration of variables makes the internal SAS data set bigger, which can lead to more execution time.

During execution, each record in the input raw data file is read, stored in the program data vector, and then written to the new data set as an observation.
At the end of each iteration of the DATA step, the values in the program data vector are written to the output data set as the next observation, control returns to the top of the step, and _N_ is incremented.

Log Messages
Each time SAS executes a step, it writes a log. In a z/OS environment, it is written to SASLOG and looks like below. It shows the number of records read and the number of records which get selected by the criteria and finally go into the SAS data set.

NOTE: 17430 records were read from the infile CUSTPOL.                
      The minimum record length was 600.                              
      The maximum record length was 636.                              
NOTE: The data set WORK.CUSTPOL has 5818 observations and 12 variables.
NOTE: The DATA statement used the following resources:                
      CPU     time -         00:00:00.07                              
      Elapsed time -         00:00:02.99                              
      EXCP count   - 5998                                             
      Task  memory - 4904K (148K data, 4756K program)                 
      Total memory - 17710K (3488K data, 14222K program)              
      Timestamp    - 12/19/2014 2:43:29 AM                            
NOTE: The address space has used a maximum of 876K below the line and 1

Friday, December 12, 2014

What are KSDS, ESDS, RRDS and LDS VSAMs? Concept and Structure of VSAM

Types of VSAM and their concepts:
VSAM supports the following types of file organizations:
1. ESDS  2. RRDS  3. KSDS  4. LDS
ESDS (Entry Sequenced Data Set)
An ESDS can be accessed sequentially or directly. It cannot be accessed skip-sequentially. ESDS are mostly used where we write logs, which are then read back sequentially.
An ESDS record can be located sequentially by checking each record, starting from the beginning of the data set. An ESDS record can also be located directly if the relative byte address (RBA) of the record is known. This address can come from some previous search, or by remembering the address where the record was inserted.
ESDS Data Insertion
New records must be inserted at the end of the file. Each record is inserted at the end of the latest CI. If there is not enough space in the CI, then a new CI is started.
Physically deleting records is not possible. If we do not need a record, we can mark it as deleted, but it will remain in the dataset until the ESDS is rebuilt.
Records can be updated in place provided the length remains the same. If the length changes, the record must be marked as deleted and added as a new record at the end.
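As a toy model of this behaviour (purely illustrative, not a real VSAM API): records are appended at the end, located by their RBA, and deletes only mark the record.

```python
class ToyESDS:
    """Toy ESDS: append-only storage addressed by relative byte address."""
    def __init__(self):
        self.data = b""
        self.deleted = set()

    def insert(self, record: bytes) -> int:
        """Append at the end of the data set; return the record's RBA."""
        rba = len(self.data)
        self.data += record
        return rba

    def read_direct(self, rba: int, length: int) -> bytes:
        """Direct read using a remembered RBA."""
        return self.data[rba:rba + length]

    def mark_deleted(self, rba: int):
        """Records cannot be physically removed until the ESDS is rebuilt."""
        self.deleted.add(rba)
```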

RRDS (Relative Record Data Set)
RRDS is another type of VSAM. It consists of fixed-length areas called slots. These are pre-formatted when the data set is created, or whenever a new CA is created. Records are inserted into the slots.

RRDS and its slots
Advantages of RRDS over ESDS:
In an RRDS, records can be added and deleted within these slots.
Records can be directly accessed by specifying the slot number, known as the RRN (relative record number). The first slot has RRN 1. We can also use skip-sequential processing for an RRDS.
The application program can insert into any free slot, which is known as direct insertion, or it can request the record to be inserted into the next free slot, which is known as sequential insertion.
On deletion, a record is removed and its slot can be reused.
RRDS Structure:
There is one RDF for every slot in an RRDS. Each RDF holds the information whether its associated slot is empty or not.
RRDS  Internal Structure
In an RRDS, all the records must be of the same length. Unlike ESDS, we cannot have spanned records in an RRDS, and we cannot have free space in an RRDS.
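A toy sketch of the slot idea (illustrative only, not a real VSAM API): fixed-length slots addressed by RRN starting at 1, with deleted slots reusable.

```python
class ToyRRDS:
    """Toy RRDS: pre-formatted fixed-length slots addressed by RRN."""
    def __init__(self, slot_count, record_len):
        self.record_len = record_len
        self.slots = [None] * slot_count   # None = empty slot

    def write(self, rrn, record):
        """Direct insertion into a chosen slot (RRN is 1-based)."""
        assert len(record) == self.record_len, "all RRDS records are fixed length"
        self.slots[rrn - 1] = record

    def insert_sequential(self, record):
        """Insert into the next free slot; return its RRN."""
        rrn = self.slots.index(None) + 1
        self.write(rrn, record)
        return rrn

    def delete(self, rrn):
        """Empty the slot so it can be reused."""
        self.slots[rrn - 1] = None

    def read(self, rrn):
        return self.slots[rrn - 1]
```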
LDS (Linear Data Set)
Linear data sets hold the records in the form of a string of data. There are no CIDFs or RDFs. The most common usage of LDS is in DB2, which uses LDS to store its objects.
Internally, z/OS uses Data-In-Virtual (DIV) services or Window services to insert/update/delete the data.

KSDS (Key Sequenced Data Set)
In a KSDS, we have a cluster which consists of two parts: the data part and the index part.
The data must be in key sequence when it is inserted into a KSDS. The key can be 1-255 contiguous bytes and must be unique. Each key points to the data part. As we insert, update or delete a record, the index component gets updated automatically.
KSDS Vsam Cluster
The rounded yellow part signifies the VSAM name and is known as the VSAM cluster. It relates the data and the index part. It is not a file, just a catalog entry.
Records in a KSDS can be updated and their length can be changed. If the record is reduced in length, the remaining space becomes free space; if the record length increases, the record is moved to free space to make room for the updated record.
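A toy sketch of the cluster idea (illustrative only): a sorted, unique-key index component pointing at records in the data component, maintained automatically on insert.

```python
import bisect

class ToyKSDS:
    """Toy KSDS: a sorted unique-key index over a data component."""
    def __init__(self):
        self.keys = []      # index component, kept in key order
        self.records = {}   # data component, reached via the key

    def insert(self, key, record):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            raise KeyError("KSDS keys must be unique")
        self.keys.insert(i, key)       # index updated automatically
        self.records[key] = record

    def read_direct(self, key):
        """Direct (keyed) access."""
        return self.records[key]

    def read_sequential(self):
        """Sequential access returns records in key order."""
        return [self.records[k] for k in self.keys]
```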

Check here on how to define KSDS, ESDS, RRDS and LDS with IDCAMS

Friday, December 5, 2014

VSAM Basics and VSAM tutorial (Control Interval, Control Area, CIDF, RDF)

This post is just a short writeup for VSAMs.
What is a VSAM file?
Virtual Storage Access Method: this is both a type of dataset and the set of APIs to access that dataset. It is an excellent way to keep records, as they can be read sequentially, directly, or skip-sequentially.
We must know that a VSAM dataset must always be kept on disk and not on tape. IBM later introduced an extension of VSAM known as Extended Format VSAM. To an application, a VSAM and an extended-format VSAM are all the same. Extended-format VSAM provides some additional features like compression of the data, sharing, improved performance, and allows a maximum data set size of 128 terabytes.
What are Control Intervals?
The Control Interval (CI) is the basic building block of a VSAM dataset. It holds one or more records. The concept is similar to a block for sequential and partitioned data sets. When we read/write a record from a VSAM file, we read an entire chunk of data, i.e. the entire CI is moved to memory from DASD, not just the single record.
A Control Interval consists of a bunch of records.
What are CIDF and RDF in VSAM?
Apart from keeping records in a CI, VSAM places two additional fields which are needed to manage the CI. They are the CIDF and the RDFs.
Unused space, RDF, CIDF in a VSAM
The dotted space shows the unused space in a CI.
RDF (Record Descriptor Field) stores the length of each record. A CI with fixed-length records needs only two RDFs: one to keep track of the number of records and the other to hold the record length. For variable-length records, there is one RDF per record.
CIDF (Control Interval Definition Field) is a 4-byte field which holds the location and size of the unused space in the CI (like the one shown in the shaded space).
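As a back-of-envelope sketch of this accounting (assuming the standard sizes: 3 bytes per RDF and 4 bytes for the CIDF):

```python
# CI space accounting sketch: records + RDFs + CIDF = used space.
RDF_LEN, CIDF_LEN = 3, 4

def ci_free_space(ci_size, record_lengths, fixed_length=True):
    """Free space left in one CI after storing the given records."""
    if fixed_length:
        # fixed-length records: one RDF for the length, one for the count
        n_rdf = 2 if record_lengths else 0
    else:
        # variable-length records: one RDF per record
        n_rdf = len(record_lengths)
    used = sum(record_lengths) + n_rdf * RDF_LEN + CIDF_LEN
    return ci_size - used
```

For example, ten 100-byte fixed-length records in a 4096-byte CI leave 4096 - 1000 - 6 - 4 bytes free.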

What is Control Area  ?
A group of CIs makes one Control Area (CA).
A group of Control Intervals is called a Control Area.

Read here to check the parameters to define a VSAM with IDCAMS

Spanned Records in VSAM:
While inserting a record, if the record size is larger than the CI size, it cannot be stored in a single CI. If the VSAM is defined with the SPANNED option, the record is allowed to span across more than one CI and is known as a spanned record. A spanned record occupies its CIs entirely; any unused space in those CIs cannot be used by other records.
Spanned record

Saturday, October 18, 2014

Command shell (option 6) in mainframe - Executing commands through the Terminal Monitor Program IKJEFT01

Most of us have definitely seen or used, in some way or other, the command shell (option 6) from the ISPF panels. This is a very useful command window. TSO/E allows us to talk to MVS either through TSO/E commands, or through the ISPF panels (like option 6) as I was mentioning above. In some work, I came across a scenario where I needed to execute certain commands from the command shell.
The simplest way is to go to ISPF option 6 and then execute each command manually.

But the tweak in my case was to execute these commands in batch and collect the command logs, i.e., whether these commands were executed or not, and all the details.

So, here is the utility IKJEFT01 which we can use. It is also known as the Terminal Monitor Program (TMP).
Whatever we want to execute in the panel screen can be executed from batch as well using IKJEFT01.
So, let's have the JCL for using this utility.
Way 1
//STEP001  EXEC PGM=IKJEFT01                
//SYSTSPRT DD SYSOUT=*                      
//SYSTSIN  DD DSN=your-dataset,DISP=SHR     
Way 2
//STEP002    EXEC PGM=IKJEFT01             
//SYSTSPRT DD SYSOUT=*                     
//SYSTSIN  DD  *                           
    tso commands here

your-dataset would be the PS file having all the commands which need to be executed;
or, give all the commands inline under SYSTSIN DD *.

Now if we explore further, we can use this utility to perform many functions which in some cases may turn out to be very useful.
A few useful things which we can do in batch:
1. Rename a dataset
2. Copy a file to some new name (we can use other IBM utilities as well)
3. Copy a PDS member to a new PDS with a new name or the same name
4. Check job status, and many other useful functions.
There are a lot many others as well...
Have a look into jobs for some of the functions:
//STEP001    EXEC PGM=IKJEFT01                  
//SYSTSPRT DD SYSOUT=*                          
//SYSTSIN  DD  *                                
   SMCOPY FDS('USER12.JCL(TEST1)')  -           
          TDS('USER22.SAS(TEST2)') NOTRANS      

//SYSPRINT DD  SYSOUT=*                         
//SYSIN    DD  DUMMY                            
//* *****copy files                                            
//STEP002    EXEC PGM=IKJEFT01                  
//SYSTSPRT DD SYSOUT=*                          
//SYSTSIN  DD  *                                
   SMCOPY FDS('USER-ID.TST.FILE1')  -           

//SYSPRINT DD  SYSOUT=*                         
//SYSIN    DD  DUMMY                            
The first step, STEP001, copies a PDS member to another PDS.
The next step, STEP002, copies files.
SMCOPY is the command to copy. FDS and TDS mean From-dataset and To-dataset respectively.
Will explore some more and update as time permits!

Saturday, October 11, 2014

File Handling in cobol - COBOL TUTORIAL

A data file is nothing but a collection of relevant records. COBOL reads the records one by one: read the first record and process the entire logic, then read the second record, and so on... Generally we use two types of files in our programs.
1. Sequential files
2. Vsam Files
What is a sequential file?
A sequential file is one whose records are in sequential order, from first to last. When we want to process this file, we need to start from the first record and go on sequentially. We cannot delete or insert a record in the middle. We can only add records at the end.
Coming to VSAM files, we have mainly 3 types which we use frequently in programs:
a) KSDS  b) ESDS  c) RRDS. We will see what they are and how to handle them in subsequent posts.

Coming back to COBOL file handling, let us see the steps and ways to read/write files.
Before going into details, we need to know the bare essentials needed for file handling:
1. Assign the file to the COBOL program
2. Open the file (either in input or output mode; there are other modes as well, which we will see later)
3. Read the file into working-storage variables
4. Write the logic in the procedure division as per the requirement
5. Write the output and close the file.

The very first step in writing a COBOL program is to assign the files. We all know that to run the COBOL program we need JCL. So, JCL is the place where we give the actual physical file needed to run our program.
But our COBOL program needs to understand this too.
So this association between the JCL file and the COBOL program is established in the ENVIRONMENT DIVISION using the 'ASSIGN TO' clause. The mapping with the JCL DD name is done here. The file is allocated to our program by this ASSIGN statement.
Next, the layout of the file and its attributes are defined in the FILE SECTION of the DATA DIVISION. We should have a copybook to identify what value corresponds to what field. That needs to be predefined in a copybook and included in the program using the COPY statement.
Cobol Assign Clause
But before COBOL uses these files for reading or writing, we need to OPEN them with keywords which make COBOL understand whether the file is an input file or will be used as an output file.
We need to use these statements in the procedure division to instruct COBOL what kind of operation we will be performing on the files.
Have a look at the JCL for a better understanding of the file association between COBOL and JCL.
Look for the DD names CMFIN, FILE1 and FILE2.
CMFIN is used with a disposition of SHR, which we are using as input in COBOL.
FILE1 and FILE2 have a disposition of (NEW,CATLG,DELETE), which implies these files will be created new and hence opened in output mode in the COBOL code (above code).
//STEP0010 EXEC PGM=TSTCLAIM                         
//SYSOUT   DD SYSOUT=*                               
//SYSDBOUT DD SYSOUT=*                               
//********  OUTPUT FILES  ******                     
//FILE1    DD DSN=TEST.OP1.CLAIM,                    
//         DISP=(,CATLG,DELETE),                     
//         DCB=(LRECL=205,RECFM=FB,BLKSIZE=0),       
//         UNIT=TEST,SPACE=(TRK,(15,15),RLSE)        
//FILE2    DD DSN=TEST.OP2.CLAIM,                    
//         DISP=(,CATLG),UNIT=&UNIT,SPACE=(TRK,(15,15)),
//         DCB=(LRECL=110,RECFM=FB,BLKSIZE=0)        
//**** END  ********                                  
Back to COBOL: once the above declaration is made, we can use the input file to read the input records one by one and process them, finally ending the program with STOP RUN.
So, what is this W-CMF-CLAIMS-DETAIL? This is the copybook which is defined for the input file. The READ statement dumps every record of the input file into this structure, so that we can use these variables and process the data.

Let's see how the READ statement works.
Suppose your input file looks like:
20140924001102000560
Simply looking at these values won't make you understand what these digits are. But if we have a copybook, say W-CMF-CLAIMS-DETAIL, defined in the WORKING-STORAGE of your COBOL like below:
01 W-CMF-CLAIMS-DETAIL.
     05 WS-CLAIM-DT      PIC X(08).
     05 WS-CLAIM-STATUS  PIC X(06).
     05 WS-CLAIM-COMPANY PIC X(06).
Once we use the READ statement for the very first time after the OPEN statement, like above,
READ CMF-FILE INTO W-CMF-CLAIMS-DETAIL, the first 8 bytes of data will be placed in
WS-CLAIM-DT, the next 6 bytes in WS-CLAIM-STATUS, and the next 6 bytes in WS-CLAIM-COMPANY, resulting in the below:
WS-CLAIM-DT will have the value 20140924
WS-CLAIM-STATUS will contain 001102
WS-CLAIM-COMPANY will contain  000560
Every time the READ statement is performed, these values get replaced by the next record, and so on, until EOF (end of file) is reached.
Do all the processing as per the requirement in the procedure division; the very last steps in the process are writing the output file and closing all the files.
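This read-until-EOF flow can be sketched in Python (purely illustrative, not COBOL semantics; the field layout and sample values are the ones from the copybook example above):

```python
# Sketch of the COBOL read loop: each READ drops the next record into
# the copybook layout until end-of-file is reached.

def parse_claim(record):
    """Slice one record the way W-CMF-CLAIMS-DETAIL lays it out."""
    return {
        "WS-CLAIM-DT":      record[0:8],    # PIC X(08)
        "WS-CLAIM-STATUS":  record[8:14],   # PIC X(06)
        "WS-CLAIM-COMPANY": record[14:20],  # PIC X(06)
    }

def process_file(lines):
    """Read record by record, as COBOL does, until EOF."""
    out = []
    for record in lines:          # each iteration = one READ ... INTO
        out.append(parse_claim(record))
    return out
```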

Friday, September 26, 2014

DFSORT/SYNCSORT to include spaces, insert fixed strings and reformat the records using OUTREC

Continuing with the previous SORT examples, this section will have some SORT features to understand the INREC/OUTREC features and how they work.
In the following SORT example, I am trying to insert spaces, insert a fixed string into the input file, and format the output record.
Since we are trying to build the record, i.e., manipulate the entire record structure here and there, we will go with the OUTREC BUILD option. This gives us complete control over the record structure. We can pick up any field from any position and place it anywhere as per the requirement.
Here goes my input file:
********************************* Top of Data ***********
A001MUKESHN
A002GRECHEN
A003STEVEEN
A003STEVEEN
A004STEVEEN
A004STEVEEN
******************************** Bottom of Data *********
//STEP0010 EXEC PGM=SORT                         
//SYSOUT    DD SYSOUT=*                          
//SORTIN    DD DSN=your.input.file,DISP=SHR      
//SORTWK01  DD UNIT=DISK,SPACE=(CYL,(100,100))   
//SORTOUT   DD DSN=BHI5122.TEST.SORT.OP3,        
//          DISP=(,CATLG),UNIT=TEST,             
//          SPACE=(CYL,(50,50),RLSE)             
//SYSIN     DD *                                 
  SORT FIELDS=COPY                               
  OUTREC BUILD=(1:1,4,5:2X,8:C'TST',13:5,7)     

********************************* Top of Data ***********
A001   TST  MUKESHN                                     
A002   TST  GRECHEN                                     
A003   TST  STEVEEN                                     
A003   TST  STEVEEN                                     
A004   TST  STEVEEN                                     
A004   TST  STEVEEN                                     
******************************** Bottom of Data *********

As we see here, in the OUTREC parameter, '1:1,4' tells SORT to take the field of length 4 bytes starting from column 1 of the input and place it in column 1 of the output record.
5:2X will put 2 bytes of spaces at column 5. X indicates spaces to be included; 3X would mean 3 spaces.
8:C'TST' will tell SORT to put the string 'TST' from the 8th byte of the output record.
13:5,7 will instruct SORT to take the field of length 7 bytes from column 5 of the input file and put it from column 13 of the output file.
Now match the output, and we can see the result!
A point to remember: for INREC and OUTREC we can use FIELDS or BUILD. For OUTFIL, we can use OUTREC or BUILD.
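If it helps, here is a small Python sketch of what this BUILD specification does to each record (columns are 1-based as in DFSORT; this is an illustration, not how SORT is implemented):

```python
# Emulate OUTREC BUILD=(1:1,4,5:2X,8:C'TST',13:5,7) for one record.
def build(record):
    out = [' '] * 19                     # output record, blank-filled
    out[0:4]   = record[0:4]             # 1:1,4    input cols 1-4 -> col 1
    #                                      5:2X     two blanks at col 5
    out[7:10]  = 'TST'                   # 8:C'TST' literal at col 8
    out[12:19] = record[4:11].ljust(7)   # 13:5,7   input cols 5-11 -> col 13
    return ''.join(out)
```

Running it over the input records above reproduces the SORTOUT lines shown, blanks and all.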

2. Get the HEX values using SORT
Using the same input file, we will use the OUTFIL OUTREC command to print the hex values.
   ............... same as above JCL......
  SORT FIELDS=COPY                  
Output will look like:
Will keep updating ........

Wednesday, August 6, 2014

Can we create a VSAM file with IEFBR14 ?

The answer is 'Yes', we can, but with some limitations. We all know or are used to IDCAMS when it comes to VSAM creation. But even IEFBR14 can create a VSAM, with some advantages and disadvantages!

Let's have the JCL for our first run (the DD name, dataset name and key length below are sample values):
//STEP1    EXEC PGM=IEFBR14                    
//SYSPRINT DD SYSOUT=*                         
//VSAMDD   DD DSN=TEST.MY.KSDS,DISP=(NEW,CATLG),
//         LRECL=80,RECORG=KS,KEYLEN=8,KEYOFF=0,
//         SPACE=(CYL,(5,5))

And here you go with the Output for KSDS VSAM:

Let's look at the parameters used in the JCL:
LRECL=80 defines the maximum record length of the VSAM in bytes.
RECORG is the important parameter to note. Its value determines the type of VSAM dataset the job will create.
Here in the example we have
RECORG=KS, which made it a KSDS
RECORG=ES will create an ESDS
RECORG=LS will create a Linear data set
RECORG=RR will create an RRDS
'KEYLEN=' will define the length of the KSDS key.
'KEYOFF=' will define the key offset. Here the value of '0' indicates the key starts at the first byte.
'SPACE=' will allocate the space. Here it is in cylinders, the primary and secondary being 5 each.
However we can also use SPACE=(TRK,(50,20)).
We can use the AVGREC parameter along with the SPACE parameter like below. The value of AVGREC has some significance:
AVGREC=U implies the primary and secondary space numbers are numbers of records (multiplier 1)
AVGREC=K implies the primary and secondary space numbers are numbers of records in thousands (multiplied by 1024)
AVGREC=M implies the primary and secondary space numbers are numbers of records in millions (multiplied by 1024*1024)
Rewriting the above JCL once more:
//STEP1    EXEC PGM=IEFBR14                    
//SYSPRINT DD SYSOUT=*                         
//VSAMDD   DD DSN=TEST.MY.KSDS,DISP=(NEW,CATLG),
//         RECORG=KS,KEYLEN=8,KEYOFF=0,
//         SPACE=(60,(300,200)),AVGREC=U

So here, with an average record length of 60 bytes, the primary space is large enough for 300*60=18000 bytes and the secondary for 200*60=12000 bytes.
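The AVGREC arithmetic can be sketched as a quick check (illustrative calculation only):

```python
# Bytes requested = average record length * quantity * AVGREC multiplier.
def avgrec_bytes(avg_rec_len, quantity, avgrec):
    multiplier = {"U": 1, "K": 1024, "M": 1024 * 1024}[avgrec]
    return avg_rec_len * quantity * multiplier
```

With SPACE=(60,(300,200)) and AVGREC=U this gives 18000 and 12000 bytes for primary and secondary; switching to AVGREC=K would multiply both by 1024.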
Disadvantages of using IEFBR14:
1. We can't specify the CI size
2. No SHAREOPTIONS
3. No separate data and index component parameters

 Hope this helps a bit! :)

Tuesday, August 5, 2014

DB2 Plan, Package and Collection ID? Frequently asked questions in interviews

Having covered the basic steps in a COBOL-DB2 compilation process, let's have a bit more clarification & brainstorming to check our understanding :)

Q1: Is syntax checking of the SQL statements done at precompilation time or bind time?

Both. Let me explain in a bit of detail. We are all aware of DCLGENs, and we use INCLUDE statements in our COBOL module to include the DCLGENs. The precompilation process only uses the DCLGEN as the reference for the table description (provided we use a DCLGEN). It does not go and check the DB2 catalog for the table structure. If no DCLGEN is used, no table-related syntax checking is done at precompilation time.
But during BIND time, the check is more rigorous and thorough, because the DB2 catalog is accessed to check the table structure along with other information like access permissions.

Q2: Can 1 package have more than 1 DBRM?
No. From 1 program we get one DBRM, and then we bind the DBRM into a package; in that sense, 1 package is the DB2-optimized version of that single DBRM. Each package is nothing but a single bound DBRM. There is a direct one-to-one relationship between application program and package.

Q3: Are packages executable?
No. In DB2, only the plan is an executable module.

Will keep on adding more as and when I come across them!.. Suggestions most welcome!

Sunday, August 3, 2014


Few Errors encountered while playing with VSAMs:


Explanation: you have defined your VSAM with a record size of, say, (15 15),
and you are trying to do a REPRO from a dataset with a different LRECL.

Solution: create a flat file with the same LRECL as the VSAM file and use it in the REPRO command.

Reading/Interpret the LISTCAT command output in VSAM

When we create a VSAM, all its details are recorded in the catalog in MVS. LISTCAT is a versatile command meaning 'List Catalog'. The LISTCAT command can be fired from the command shell (option 6 from the ISPF menu), or it can run in batch as a SYSIN parameter to IDCAMS.

To run it from the command shell, go to option 6 and give:
LISTCAT ENT('TST.MYVSAM.DEPT') ALL
We will get the information displayed on the screen page by page as we hit Enter.

To use it in batch we need the IDCAMS utility. Below is a sample JCL for the LISTCAT command:
//STEP01   EXEC PGM=IDCAMS                      
//SYSPRINT DD SYSOUT=*                          
//SYSIN    DD *                                 
  LISTCAT ENTRIES('TST.MYVSAM.DEPT') ALL        
/*

Interpreting the LISTCAT command
Interpreting the listcat command

As we can see here, the name of the cluster is TST.MYVSAM.DEPT. As we go through the spool we will see the information on STORAGECLASS and MANAGEMENTCLASS, but those things are not of much importance to us now. Go on and have a look down for:

It implies that this VSAM is not password protected. Also, the ASSOCIATIONS parameter will list all the datasets with the same name pattern, and we can see that it shows the DATA component and the INDEX component. So our VSAM is probably a KSDS.
Look in the ATTRIBUTES Section.
The ATTRIBUTES section will give a clearer picture of the VSAM definition, i.e. how it was defined.
So, from these statistics we can somewhat guess the definition. For the above VSAM, the key is of length 16 and the offset of the key position is 20.
(The RKP parameter defines the relative position of the key field in the base cluster.)
AVGLRECL and MAXLRECL give us an idea about the RECORDSIZE parameter used in the definition, which in this case is 191 and 1500 respectively.
SPEED and NOERASE are also keywords in VSAM world.
One more, 'NONSPANNED', specifies that a record in the data component cannot span across CIs.
So from this screenshot let's note down what information we have collected:
SHR(2 3)
RECORDSIZE(191 01500)
KEYS(00016 00020)

If we look a little to the right on the mainframe screen, we will get the information about the CI size.

So the CISZ parameter will be 18432: CISZ(18432).
Also, watch the parameters NOREUSE and UNORDERED. So three more features added to our cart for the VSAM statistics: CISZ(18432), NOREUSE and UNORDERED.
The CI/CA field indicates the number of CIs in each CA. Thus the above information tells us that there are 45 CIs in each CA.
Go on for a few more details in the STATISTICS section.
This section really gives us all the details regarding the statistics of the VSAM at that moment. Will see more details later. As of now, check out the FREESPACE parameters %CI and %CA. Those are the parameters with which the VSAM has been defined.
Ohh..!! One more I got, in the ALLOCATION section:
SPACE-PRI    -------60
SPACE-SEC   -------10
This is more or less self-explanatory: the allocation type is CYLINDER and the primary and secondary allocations are 60 and 10 respectively.
So few more added in our cart.
FREESPACE(025  008)
CYL(00000060  00000010) 
Some other useful information from the STATISTICS section:
1. FREESPC --16386048: this indicates the space in bytes in completely free CIs. Do not mix this up with the unused space within partially filled CIs.
2. REC-TOTAL: total number of logical records in the data component at the instant LISTCAT was run.
3. REC-INSERTED: total number of logical records inserted into the data component since the VSAM was last created or reorganized.
4. REC-RETRIEVED: total logical records read from the data part since its creation or reorganization. It takes into account reads for update during REWRITE processing.
5. REC-UPDATED: total number of logical records updated or rewritten back into the data component since it was last created or reorganized. It does not include deleted records; however, a record that was updated and maybe got deleted later is counted here.
6. SPLITS-CI / SPLITS-CA: total number of CI and CA splits that happened since it was last created or reorganized.
Having covered all these, if we look at the VSAM parameters which we collected so far, we can somewhat figure out the structure of the VSAM, which will be like:
DEFINE CLUSTER (             -
   NAME(TST.MYVSAM.DEPT)     -
   SHR(2 3)                  -
   RECORDSIZE(191 01500)     -
   NOREUSE                   -
   INDEXED                   -
   NOERASE                   -
   UNORDERED                 -
   NONSPANNED                -
   SPEED)                    -
   DATA (                    -
   VOLUMES(XXXX)             -
   CISZ(18432)               -
   KEYS(00016 00020)         -
   FREESPACE(025 008)        -
   CYL(00000060 00000010))   -
   INDEX (                   -
   VOLUMES(XXXXX))

Wednesday, July 2, 2014

Basics of Indexing-Use of index with Example

In the performance tuning posts (db2 performance issues and tuning, part 1 and part 2), we discussed how indexing can help tune database performance to a great level; today we are going to discuss the proper design of indexes.

As indexes provide easy access to the data rows in a table, just like a book's index, they speed up query execution and thus help to tune database performance.
But what do they actually do, and how?
An example will help us to understand it in a better way.
Say we have a tblOrder table with 100 rows and three simple columns: OrderId, ItemName and ItemPrice.
We will try to find out the orders within a specific price range:
The query would look like:
SELECT OrderId, ItemName, ItemPrice
FROM tblOrder
WHERE ItemPrice>10 AND ItemPrice<20

For now, we assume that this table does not have any index. So, to execute the above query, the processor starts searching each and every row one by one to get the required data.
The below diagram depicts the actual processing done to do so. It definitely becomes an overhead to search all 100 rows only to get a small set of records.

Now, let us add an index on the ItemPrice column of this table.
The statement will look like:
CREATE INDEX IX_tblOrder_ItemPrice
ON tblOrder (ItemPrice ASC)
It acts just like the index in a book. Adding an index to this column means it stores a copy of the column's data and a reference to the row where the actual data is stored. These index entries are also kept sorted in ascending order by the database.
The pictorial representation will be somewhat similar to below:
Now, when we execute the above query again with the index added to the table, it will first find the values in the index and then follow the reference stored in each index entry to reach the actual data row.
Thus an index makes query processing fast and wipes out the overhead of scanning each and every row in a table.
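As a toy illustration of this lookup (the table contents, row ids and function name here are made up for the example): the "index" is just the ItemPrice values kept sorted, each carrying a reference back to its row.

```python
import bisect

# A tiny tblOrder: row_id -> (ItemName, ItemPrice). Values are invented.
table = {1: ("Pen", 5), 2: ("Book", 15), 3: ("Bag", 25), 4: ("Ink", 12)}

# Index on ItemPrice: sorted (price, row_id) pairs, like IX_tblOrder_ItemPrice.
ix_itemprice = sorted((price, rid) for rid, (_, price) in table.items())
prices = [p for p, _ in ix_itemprice]

def price_between(lo, hi):
    """WHERE ItemPrice > lo AND ItemPrice < hi, answered via the index."""
    start = bisect.bisect_right(prices, lo)   # first entry > lo
    end = bisect.bisect_left(prices, hi)      # first entry >= hi
    return [table[rid] for _, rid in ix_itemprice[start:end]]
```

Instead of scanning all the rows, the query touches only the narrow slice of the sorted index and then follows the stored row references.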
Index Structure
An index is a set of pages, or index nodes, organized in a structure called a B-tree. It is a hierarchical structure with the root node placed at the top and the leaf nodes at the bottom of the hierarchy, as demonstrated in the below figure.

The indexing works in the following way:
When a query is fired against an indexed column, the query processor starts scanning from the top, which is the root node. It then navigates down through the intermediate levels; the further down we go, the more granular the division becomes. The processor continues scanning index nodes until a leaf node is reached.
Let us take an example of searching for the value 87 in an indexed column. As said earlier, the processor starts scanning from the root node and then gradually goes down through the intermediate levels till the leaf nodes are reached. Here, the search for 87 leads to the 1st page of the 1st intermediate level, i.e. the page with the values 1-100. It then goes down further and sees that the value must be in the 2nd page of the 2nd intermediate level, which covers 51-100. As 51-100 is still not a leaf node, it searches down further and finally reaches the leaf level, landing on the page 76-100, which is the 4th page of the leaf node level.
The leaf node may contain the full data row or a reference to that row. It solely depends on the type of index that we declare, such as a clustered index or a non-clustered index.
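The descent described above can be sketched like this (the page ranges match the walkthrough; the exact tree shape is invented for illustration):

```python
# A tiny hand-built "B-tree" of page ranges, mirroring the 87 example.
tree = {
    "range": (1, 200), "leaf": False, "children": [
        {"range": (1, 100), "leaf": False, "children": [
            {"range": (1, 50), "leaf": False,
             "children": [{"range": (1, 25), "leaf": True, "children": []},
                          {"range": (26, 50), "leaf": True, "children": []}]},
            {"range": (51, 100), "leaf": False,
             "children": [{"range": (51, 75), "leaf": True, "children": []},
                          {"range": (76, 100), "leaf": True, "children": []}]},
        ]},
        {"range": (101, 200), "leaf": True, "children": []},  # collapsed for brevity
    ],
}

def find_leaf(node, value):
    """Walk from the root down to the leaf page whose range covers value."""
    path = [node["range"]]
    while not node["leaf"]:
        node = next(c for c in node["children"]
                    if c["range"][0] <= value <= c["range"][1])
        path.append(node["range"])
    return path
```

For 87 the path is root, 1-100, 51-100, and finally the leaf page 76-100, exactly as in the walkthrough.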

We will discuss the types of indexes in our next post…Stay tuned…