Datastage Notes

Wednesday, 6 April 2016

Datastage Administration

Administration Overview

With IBM InfoSphere Information Server, you can administer security, entitlements, clusters and high availability configurations, logs, schedules, and services, and back up data. Both the IBM InfoSphere Information Server console and the IBM InfoSphere Information Server Web console provide administration capabilities.

Security administration

As part of InfoSphere Information Server administration, you set up and manage suite security. Security administration includes the following tasks:

Configuring and administering the user registry

The user registry holds user account information, such as user names and passwords, that can be accessed during authentication. You choose a user registry for the suite to use. You can choose the internal InfoSphere Information Server user registry, or an external local operating system or lightweight directory access protocol (LDAP) user registry. Depending on the registry you choose and the topology of your installation, you might also have to map credentials from one user registry to another.

Controlling access

You create user accounts and groups. You assign roles to users and groups to specify which features users can use and which projects a user can access. User roles can be defined at several levels that build on one another.

Auditing security-related events

Security-related events include all activities that set or modify security-related settings and all user authentications and application access attempts. You configure which events to log and how much information to include. You monitor and analyze the log information to help prevent unauthorized access to sensitive data.

Administering account passwords

You periodically change administrator account passwords to comply with your security policies.

Managing active user sessions

You view current active sessions, and manage session limits. If necessary, you can force one user or all users to disconnect.

Tuesday, 5 April 2016

DataStage Projects

DataStage is organized into a number of work areas called "projects". Each project has its own individual local Repository in which its own designs and technical and process metadata are stored.

Each project has its own directory on the server, and its local Repository is a separate instance of the database associated with the DataStage server engine. The name of the project and the schema name of the database instance are the same. System tables in the DataStage engine record the existence and location of each project. Location of any particular project may be determined through the Administrator client, by selecting that project from the list of available projects. The pathname of the project directory is displayed in the status bar. When there are no connected DataStage clients, dsrpcd may be the only DataStage process running on the DataStage server. In practice, however, there are one or two more. The DataStage deadlock daemon (dsdlockd) wakes periodically to check for deadlocks in the DataStage database and, secondarily, to clean up locks held by defunct processes – usually improperly disconnected DataStage clients. Job monitor is a Java application that captures "performance" data (row counts and times) from running DataStage jobs. This runs as a process

called JobMonApp.

Server Job Execution

Server jobs execute on the DataStage server (only) and execute in a shell called uvsh (or dssh, a synonym). The main process that runs the job executes a DataStage BASIC routine called DSD.RUN – the name of this program shows in a ps –ef listing (UNIX). This program interrogates the local Repository to determine the runtime configuration of the job, what stages are to be run and their interdependencies. When a server job includes a Transformer stage, a child process is forked from uvsh also running uvsh but this time executing a DataStage BASIC routine called DSD.StageRun. Server jobs only ever have uvsh processes at run time, except where the job design specifies opening a new shell (for example sh in UNIX or DOS in Windows) to perform some specific task; these will be child processes of uvsh.

Parallel Job Execution

Parallel job execution is rather more complex. When the job is initiated the primary process (called the “conductor”) reads the job design, which is a generated Orchestrate shell (osh) script. The conductor also reads the parallel execution configuration file specified by the current setting of the APT_CONFIG_FILE environment variable. Based on these two inputs, the conductor process composes the “score”, another osh script that specifies what will actually be executed. (Note that parallelism is not determined until run time – the same job might run in 12 nodes in one run and 16 nodes in another run. This automatic scalability is one of the features of the parallel execution technology underpinning Information Server (and therefore DataStage).

Once the execution nodes are known (from the configuration file) the conductor causes a coordinating process called a “section leader” to be started on each; by forking a child process if the node is on the same machine as the conductor or by remote shell execution if the node is on a different machine from the conductor (things are a little more dynamic in a grid configuration, but essentially this is what happens). Each section leader process is passed the score and executes it on its own node, and is visible as a process running osh. Section leaders’ stdout and stderr are redirected to the conductor, which is solely responsible for logging entries from the job.

The score contains a number of Orchestrate operators. Each of these runs in a separate process, called a “player” (the metaphor clearly is one of an orchestra). Player processes’ stdout and stderr are redirected to their parent section leader. Player processes also run the osh executable.

Communication between the conductor, section leaders and player processes in a parallel job is effected via TCP. The port numbers are configurable using environment variables. By default, communication between conductor and section leader processes uses port number 10000 (APT_PM_STARTUP_PORT) and communication between player processes and player processes on other nodes uses port number 11000 (APT_PLAYER_CONNECTION_PORT).

To find all the processes involved in executing a parallel job (they all run osh) you need to know the configuration file that was used. This can be found from the job's log, which is viewable using the Director client or the dsjob command line interface.

Datastage Architecture

Datastage is an ETL tool and it is client-server technology and integrated tool-set used for Designing, Running, Monitoring and Administrating the data acquisition is known as a job. A Job is graphical representation of data-flow from source to target and it is designed with source definition, target definition and transformation rules.

The Datastage Architecture consists of client-server components.

When we install a Datastage Software, it automatically show's Datastage Designer, Datastage Administrator, Datastage Director and Manager. These are the client tier and front-end components. Sever, Engine and Repository are the three back-end components.

Datastage Administrator:- Is used to perform create, delete the projects, also to clean the metadata stored in repository and install NLS.

Datastage Designer:- It is used to create design the Datastage Jobs with different transformation logics to Extract, Transform and Load data into target. The Designer can perform below mentioned activities.

Create the source definition.
Create the target definition.
Develop transformation rules.
Design the Jobs.

Datastage Director :- It is used to validate, schedule, run and monitor the Datastage Jobs that which are Design and Deploy in the Designer.

Datastage Manager :- It will be used to perform the following tasks.

Create the table definitions.
Metadata back-up and recovery can be performed.
Create the customized components.

Datastage Repository :- It is one of the server side component which is defined to store the information about to build out a Datawarehouse.

Datastage Server :- This is defined to execute the job at back-end and helps to process faster.

DataStage Engines :- The server engine is the original DataStage engine and, as its name suggests, is restricted to running jobs on the server. The parallel engine results from acquisition of Orchestrate, a parallel execution technology developed by Torrent Systems, in 2003. This technology enables work (and data) to be distributed over multiple logical "processing nodes" whether these are in a single machine or multiple machines in a cluster or grid configuration. It also allows the degree of parallelism to be changed without change to the design of the job.

DataStage Client/Server Connectivity

Connection from a DataStage client to a DataStage server is managed through a mechanism based upon the UNIX remote procedure call mechanism. DataStage uses a proprietary protocol called DataStage RPC which consists of an RPC daemon (dsrpcd) listening on TCP port number 31538 for connection requests from DataStage clients. Before dsrpcd gets involved, the connection request goes through an authentication process. Prior to version 8.0, this was the standard operating system authentication based on a supplied user ID and password (an option existed on Windows-based DataStage servers to authenticate using Windows LAN Manager, supplying the same credentials as being used on the DataStage client machine – this option was removed for version 8.0). With effect from version 8.0 authentication is handled by the Information Server through its login and security service. Each connection request from a DataStage client asks for connection to the dscs (DataStage Common Server) service. The dsrpcd (the DataStage RPC daemon) checks its dsrpcservices file to

determine whether there is an entry for that service and, if there is, to establish whether the requesting machine's IP address is authorized to request the service. If all is well, then the executable associated with the dscs service (dsapi_server) is invoked.

DataStage Processes and Shared Memory

Each dsapi_server process acts as the "agent" on the DataStage server for its own particular client connection, among other things managing traffic and the inactivity timeout. If the client requests access to the Repository, then the dsapi_server process will fork a child process called dsapi_slave

to perform that work. Typically, therefore, one would expect to see one dsapi_server and one

dsapi_slave process for each connected DataStage client. Processes may be viewed with the ps -ef command (UNIX) or with Windows Task Manager. Every DataStage process attaches to a shared memory segment that contains lock tables and various other inter-process communication structures. Further each DataStage process is allocated its own private shared memory segment. At the discretion of the DataStage administrator there may also be shared memory segments for routines written in the

DataStage BASIC language and for character maps used for National Language Support (NLS). Shared memory allocation may be viewed using the ipcs command (UNIX) or the shrdump command (Windows). The shrdump command ships with DataStage; it is not a native Windows command.

DataStage Overview

IBM Infosphere Datastage is a popular ETL(Extraction Transformation Load) tool in market. IBM brought this product from Ascential at 7.5 version and upgraded to different versions like 8, 8.1, 8.5, 8.7, 9.1, 11.3 and 11.5 though the most commonly used are 8, 8.1, 8.5.

Datastage jobs are highly scalable due to the implementation of parallel processing. The EE architecture is process-based (rather than thread processing), platform independent and uses the processing node concept. Datastage EE is able to execute jobs on multiple CPUs (nodes) in parallel and is fully scalable, which means that a properly designed job can run across resources within a single machine or take advantage of parallel platforms like a cluster, GRID, or MPP architecture.