Tuesday, 5 April 2016

DataStage Architecture

                                                      



DataStage is an ETL tool built on client-server technology. It is an integrated tool set for designing, running, monitoring and administering data acquisition processes, each of which is known as a job. A job is a graphical representation of the data flow from source to target, and it is designed from a source definition, a target definition and transformation rules.

The DataStage architecture consists of client and server components.

                          

When DataStage is installed, it provides four client tools: DataStage Designer, DataStage Administrator, DataStage Director and DataStage Manager. These make up the client tier and are the front-end components. The Server, the Engine and the Repository are the three back-end components.

DataStage Administrator :- It is used to create and delete projects, to clean up the metadata stored in the repository, and to install NLS (National Language Support).

DataStage Designer :- It is used to design DataStage jobs with different transformation logic to extract, transform and load data into the target. The Designer can perform the activities listed below.
  • Create the source definition.
  • Create the target definition.
  • Develop transformation rules.
  • Design the Jobs.
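
The idea of a job as a source definition, a target definition and transformation rules can be sketched in a few lines of Python. This is only a toy illustration and not how DataStage stores or runs jobs; the `Job` class, the column names and the sample row are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Job:
    """Toy stand-in for a job (hypothetical, not a DataStage API)."""
    source_columns: List[str]                   # source definition
    target_columns: List[str]                   # target definition
    rules: Dict[str, Callable[[Row], object]]   # one transformation rule per target column

    def run(self, rows: List[Row]) -> List[Row]:
        output = []
        for row in rows:
            # Extract: read only the columns named in the source definition.
            src = {col: row[col] for col in self.source_columns}
            # Transform and load: apply the rule for every target column.
            output.append({col: self.rules[col](src) for col in self.target_columns})
        return output


job = Job(
    source_columns=["first", "last", "salary"],
    target_columns=["full_name", "annual_salary"],
    rules={
        "full_name": lambda r: f"{r['first']} {r['last']}",
        "annual_salary": lambda r: r["salary"] * 12,
    },
)
result = job.run([{"first": "Ada", "last": "Lovelace", "salary": 1000}])
print(result)
```

In the real Designer the same three pieces are drawn graphically as stages and links instead of being written as code.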
DataStage Director :- It is used to validate, schedule, run and monitor the DataStage jobs that are designed and deployed in the Designer.

DataStage Manager :- It is used to perform the following tasks.
  • Create the table definitions.
  • Back up and recover metadata.
  • Create the customized components.
DataStage Repository :- It is one of the server-side components. It stores the information needed to build a data warehouse.

DataStage Server :- It executes the jobs at the back end and helps them process faster.


DataStage Engines :- The server engine is the original DataStage engine and, as its name suggests, is restricted to running jobs on the server. The parallel engine results from the acquisition of Orchestrate, a parallel execution technology developed by Torrent Systems, in 2003. This technology enables work (and data) to be distributed over multiple logical "processing nodes", whether these are in a single machine or in multiple machines in a cluster or grid configuration. It also allows the degree of parallelism to be changed without any change to the design of the job.
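
To make the last point concrete, here is a minimal Python sketch of partitioning rows over a configurable number of logical nodes. It is an illustration of the concept only, not the parallel engine's implementation: the function names are hypothetical, round-robin is just one possible partitioning method, and the partitions are processed one after another here instead of in parallel.

```python
from typing import Callable, List, Sequence


def round_robin_partition(rows: Sequence[int], nodes: int) -> List[List[int]]:
    """Deal rows out to `nodes` partitions in turn, like dealing cards."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return [list(rows[i::nodes]) for i in range(nodes)]


def run_job(rows: Sequence[int], nodes: int, transform: Callable[[int], int]) -> List[int]:
    """Apply the same job logic to every partition and collect the results."""
    partitions = round_robin_partition(rows, nodes)
    # A real engine would hand each partition to its own processing node.
    processed = [[transform(r) for r in part] for part in partitions]
    # Sort so results can be compared regardless of how rows were partitioned.
    return sorted(value for part in processed for value in part)


def double(x: int) -> int:
    return x * 2


rows = list(range(10))
one_node = run_job(rows, 1, double)    # degree of parallelism = 1
four_nodes = run_job(rows, 4, double)  # degree of parallelism = 4, same job logic
print(one_node == four_nodes)
```

Only the `nodes` argument changes between the two runs; the transformation itself is untouched, which is the property described above.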

                   DataStage Client/Server Connectivity

Connection from a DataStage client to a DataStage server is managed through a mechanism based upon the UNIX remote procedure call mechanism. DataStage uses a proprietary protocol called DataStage RPC, which consists of an RPC daemon (dsrpcd) listening on TCP port number 31538 for connection requests from DataStage clients.

Before dsrpcd gets involved, the connection request goes through an authentication process. Prior to version 8.0, this was the standard operating system authentication based on a supplied user ID and password. (An option existed on Windows-based DataStage servers to authenticate using Windows LAN Manager, supplying the same credentials as being used on the DataStage client machine; this option was removed for version 8.0.) With effect from version 8.0, authentication is handled by the Information Server through its login and security service.

Each connection request from a DataStage client asks for connection to the dscs (DataStage Common Server) service. The dsrpcd (the DataStage RPC daemon) checks its dsrpcservices file to determine whether there is an entry for that service and, if there is, to establish whether the requesting machine's IP address is authorized to request the service. If all is well, then the executable associated with the dscs service (dsapi_server) is invoked.
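
When a client cannot connect, a quick first check is whether anything is listening on the dsrpcd port at all. The Python sketch below only tests that a TCP connection can be opened; it does not speak the DataStage RPC protocol or perform any authentication. The function name and the host name in the usage line are hypothetical, and the port is the default quoted above (an installation may be configured differently).

```python
import socket

DSRPC_PORT = 31538  # default dsrpcd port mentioned in the text


def can_reach_dsrpcd(host: str, port: int = DSRPC_PORT, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # "dsserver.example.com" is a placeholder host name.
    print(can_reach_dsrpcd("dsserver.example.com"))
```

A False result points at the network, a firewall or a stopped daemon; a True result says nothing yet about credentials or the dsrpcservices entry.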

                  DataStage Processes and Shared Memory

Each dsapi_server process acts as the "agent" on the DataStage server for its own particular client connection, among other things managing traffic and the inactivity timeout. If the client requests access to the Repository, then the dsapi_server process will fork a child process called dsapi_slave to perform that work. Typically, therefore, one would expect to see one dsapi_server and one dsapi_slave process for each connected DataStage client. Processes may be viewed with the ps -ef command (UNIX) or with Windows Task Manager.

Every DataStage process attaches to a shared memory segment that contains lock tables and various other inter-process communication structures. Further, each DataStage process is allocated its own private shared memory segment. At the discretion of the DataStage administrator, there may also be shared memory segments for routines written in the DataStage BASIC language and for character maps used for National Language Support (NLS). Shared memory allocation may be viewed using the ipcs command (UNIX) or the shrdump command (Windows). The shrdump command ships with DataStage; it is not a native Windows command.
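
The "one dsapi_server and one dsapi_slave per client" rule of thumb can be checked by counting those names in ps -ef output. The Python sketch below assumes the common eight-column ps -ef layout with the command in the last column (a start time containing a space would break this simple split). The sample listing is invented for illustration and is not real server output.

```python
from collections import Counter

AGENT_NAMES = ("dsapi_server", "dsapi_slave")


def count_agent_processes(ps_output: str) -> Counter:
    """Count dsapi_server and dsapi_slave processes in `ps -ef` style text."""
    counts = Counter()
    for line in ps_output.splitlines():
        # UID PID PPID C STIME TTY TIME CMD: the command is the eighth field.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0]
        name = executable.rsplit("/", 1)[-1]  # drop any directory prefix
        if name in AGENT_NAMES:
            counts[name] += 1
    return counts


# Invented sample listing, for illustration only.
sample = """\
UID        PID  PPID  C STIME TTY          TIME CMD
root      1201     1  0 09:00 ?        00:00:00 dsrpcd
dsadm     2310  1201  0 09:15 ?        00:00:01 dsapi_server
dsadm     2311  2310  0 09:15 ?        00:00:04 dsapi_slave
dsadm     2400  2000  0 09:20 pts/0    00:00:00 grep dsapi_server
"""
counts = count_agent_processes(sample)
print(counts["dsapi_server"], counts["dsapi_slave"])
```

Because only the executable name at the start of the command column is compared, the grep line in the sample is not counted as a DataStage process.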
