Tuesday, 5 April 2016

DataStage Projects

DataStage is organized into a number of work areas called "projects". Each project has its own individual local Repository in which its own designs and technical and process metadata are stored.



Each project has its own directory on the server, and its local Repository is a separate instance of the database associated with the DataStage server engine. The name of the project and the schema name of the database instance are the same. System tables in the DataStage engine record the existence and location of each project. The location of any particular project can be determined through the Administrator client by selecting that project from the list of available projects; the pathname of the project directory is then displayed in the status bar.

When there are no connected DataStage clients, dsrpcd may be the only DataStage process running on the DataStage server. In practice, however, there are usually one or two more. The DataStage deadlock daemon (dsdlockd) wakes periodically to check for deadlocks in the DataStage database and, secondarily, to clean up locks held by defunct processes, usually improperly disconnected DataStage clients. The job monitor is a Java application that captures "performance" data (row counts and times) from running DataStage jobs; it runs as a process called JobMonApp.

Server Job Execution

Server jobs execute on the DataStage server (only), in a shell called uvsh (or dssh, a synonym). The main process that runs the job executes a DataStage BASIC routine called DSD.RUN; the name of this program shows in a ps -ef listing (UNIX). This program interrogates the local Repository to determine the runtime configuration of the job: which stages are to be run and their interdependencies. When a server job includes a Transformer stage, a child process is forked from uvsh. The child also runs uvsh, but this time executes a DataStage BASIC routine called DSD.StageRun. Server jobs only ever have uvsh processes at run time, except where the job design specifies opening a new shell (for example sh in UNIX or DOS in Windows) to perform some specific task; these shells are child processes of uvsh.
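As a rough illustration of spotting these processes, the sketch below scans captured ps -ef output for the two routine names mentioned above. It is only a sketch: the sample listing, the job name LoadCustomers and the exact command-line arguments are made up for the example, and the column layout of ps -ef varies between UNIX flavours.

```python
# Routine names come from the text above; everything in SAMPLE_PS is invented.
SERVER_ROUTINES = ("DSD.RUN", "DSD.StageRun")

SAMPLE_PS = """\
UID        PID  PPID  C STIME TTY          TIME CMD
dsadm     4101     1  0 09:00 ?        00:00:00 phantom DSD.RUN LoadCustomers
dsadm     4117  4101  0 09:00 ?        00:00:01 phantom DSD.StageRun LoadCustomers
dsadm     4200     1  0 09:01 ?        00:00:00 /bin/sh -c sleep 5
"""

def find_server_job_processes(ps_output):
    """Return (pid, ppid, routine) for each ps line that names a server job routine."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed layout: UID PID PPID C STIME TTY TIME CMD...
        # The header line is skipped because its PID column is not numeric.
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        command_tokens = fields[7:]
        for routine in SERVER_ROUTINES:
            if routine in command_tokens:
                found.append((int(fields[1]), int(fields[2]), routine))
                break
    return found

for pid, ppid, routine in find_server_job_processes(SAMPLE_PS):
    print(pid, ppid, routine)
```

In the sample, the DSD.StageRun process has the DSD.RUN process as its parent, which mirrors the fork described above.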


Parallel Job Execution

Parallel job execution is rather more complex. When the job is initiated, the primary process (called the “conductor”) reads the job design, which is a generated Orchestrate shell (osh) script. The conductor also reads the parallel execution configuration file specified by the current setting of the APT_CONFIG_FILE environment variable. Based on these two inputs, the conductor process composes the “score”, another osh script that specifies what will actually be executed. Note that the degree of parallelism is not determined until run time: the same job might run on 12 nodes in one run and on 16 nodes in another. This automatic scalability is one of the features of the parallel execution technology underpinning Information Server (and therefore DataStage).
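To make the role of the configuration file concrete, here is a small sketch that lists the node names declared in such a file. The sample file contents (host names and paths) are invented, and the regular expression assumes the usual node "name" { ... } layout rather than implementing a full parser.

```python
import os
import re

# Assumed layout: each node is declared as: node "name" { ... }
NODE_PATTERN = re.compile(r'\bnode\s+"([^"]+)"')

# Invented example; host names and paths are placeholders.
SAMPLE_CONFIG = """
{
  node "node1"
  {
    fastname "etl-host-a"
    pools ""
    resource disk "/example/data" {pools ""}
    resource scratchdisk "/example/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl-host-b"
    pools ""
    resource disk "/example/data" {pools ""}
    resource scratchdisk "/example/scratch" {pools ""}
  }
}
"""

def node_names(config_text):
    """Return the node names declared in configuration file text, in order."""
    return NODE_PATTERN.findall(config_text)

def configured_nodes(environ=os.environ):
    """Read the file named by APT_CONFIG_FILE, if set, and list its nodes."""
    path = environ.get("APT_CONFIG_FILE")
    if not path:
        return []
    with open(path, encoding="utf-8") as handle:
        return node_names(handle.read())

print(node_names(SAMPLE_CONFIG))
```

Pointing APT_CONFIG_FILE at a file with more node entries changes the result without touching the job design, which is the run-time scalability described above.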

Once the execution nodes are known (from the configuration file), the conductor causes a coordinating process called a “section leader” to be started on each node: by forking a child process if the node is on the same machine as the conductor, or by remote shell execution if the node is on a different machine. (Things are a little more dynamic in a grid configuration, but essentially this is what happens.) Each section leader process is passed the score and executes it on its own node, and is visible as a process running osh. Section leaders’ stdout and stderr are redirected to the conductor, which is solely responsible for logging entries from the job.

The score contains a number of Orchestrate operators. Each of these runs in a separate process, called a “player” (the metaphor clearly is one of an orchestra). Player processes’ stdout and stderr are redirected to their parent section leader. Player processes also run the osh executable.
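The conductor, section leader and player hierarchy suggests a simple way to estimate how many osh processes a job might start. The sketch below assumes, purely for illustration, that every operator in the score runs as one player on every node; in a real job some operators may run on fewer nodes or share a process, so treat the result as a rough figure only.

```python
def estimate_osh_processes(node_count, operator_count):
    """Rough count of osh processes for one parallel job run.

    Assumption (not from the source): each operator runs as exactly
    one player process on each node.
    """
    if node_count < 1 or operator_count < 0:
        raise ValueError("need at least one node and a non-negative operator count")
    conductor = 1
    section_leaders = node_count            # one per node
    players = node_count * operator_count   # one per operator per node (assumed)
    return conductor + section_leaders + players

print(estimate_osh_processes(node_count=4, operator_count=3))
```

With 4 nodes and 3 operators this gives 1 conductor, 4 section leaders and 12 players.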

Communication between the conductor, section leaders and player processes in a parallel job is effected via TCP. The port numbers are configurable using environment variables. By default, communication between the conductor and section leader processes uses port number 10000 (APT_PM_STARTUP_PORT), and communication between player processes on different nodes uses port number 11000 (APT_PLAYER_CONNECTION_PORT).
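A minimal sketch of how the effective port settings could be worked out from an environment, using the two variable names and defaults quoted above. The fallback behaviour for unset or non-numeric values is an assumption of this example, not documented DataStage behaviour.

```python
import os

# Variable names and default values as quoted in the text.
DEFAULT_PORTS = {
    "APT_PM_STARTUP_PORT": 10000,
    "APT_PLAYER_CONNECTION_PORT": 11000,
}

def effective_ports(environ=None):
    """Return the port for each variable, falling back to the quoted default.

    Falling back on unset or non-numeric values is this sketch's choice.
    """
    env = os.environ if environ is None else environ
    ports = {}
    for name, default in DEFAULT_PORTS.items():
        raw = env.get(name, "").strip()
        ports[name] = int(raw) if raw.isdigit() else default
    return ports

print(effective_ports({"APT_PM_STARTUP_PORT": "12000"}))
```

A check like this can be handy when a firewall sits between the conductor machine and the other nodes and you want to confirm which ports need to be open.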

To find all the processes involved in executing a parallel job (they all run osh) you need to know the configuration file that was used. This can be found from the job's log, which is viewable using the Director client or the dsjob command line interface.
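On a single machine, the parent/child relationships between osh processes can be read straight out of a ps -ef listing. The sketch below groups osh processes by parent PID; the sample listing and the /path/to/osh location are invented, and processes on other machines named in the configuration file would have to be collected from those machines separately.

```python
import os.path
from collections import defaultdict

# Invented listing: 5000 plays the conductor, 5001 a section leader,
# 5002 and 5003 its players.
SAMPLE_PS = """\
UID        PID  PPID  C STIME TTY          TIME CMD
dsadm     5000  4900  0 09:00 ?        00:00:00 /path/to/osh
dsadm     5001  5000  0 09:00 ?        00:00:00 /path/to/osh
dsadm     5002  5001  0 09:00 ?        00:00:02 /path/to/osh
dsadm     5003  5001  0 09:00 ?        00:00:02 /path/to/osh
dsadm     5100     1  0 09:01 ?        00:00:00 sleep 60
"""

def osh_children(ps_output):
    """Map parent PID -> sorted child PIDs for processes running osh."""
    children = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed layout: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        if any(os.path.basename(token) == "osh" for token in fields[7:]):
            children[int(fields[2])].append(int(fields[1]))
    return {parent: sorted(pids) for parent, pids in children.items()}

for parent, pids in sorted(osh_children(SAMPLE_PS).items()):
    print(parent, "->", pids)
```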



