PluS User's Manual

17-Oct-2006    Initial release
19-Dec-2006    Updated the resource specification of plus_reserve; added plus_account, ClassAd configuration, etc.

Table of Contents

1. Overview and Installation
   1.1 Overview
   1.2 Directory configuration and main contents
   1.3 Operating Environment
   1.4 Setup
2. How to use PluS
   2.1 Startup and termination
   2.2 Reservation management command note
   2.3 Reservation management command
   2.4 Restrictions on reservation transactions
   2.5 Submitting jobs to reservations
   2.6 Reservation and job execution
3. Internal description
   3.1 Operation descriptions of SGE queue base version
   3.2 Job scheduling algorithm
   3.3 Torque scheduler settings and corresponding PluS settings
   3.4 Node reservability configuration by ClassAd
   3.5 Files related to execution
   3.6 Improvements of Torque authentication program

Copyright 2003, 2004, 2005, 2006 Grid Technology Research Center,
National Institute of Advanced Industrial Science and Technology

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

This work is partly funded by the Science and Technology
Promotion Program's "Optical Paths Network Provisioning
based on Grid Technologies" of MEXT, Japan.

This product includes software from the Condor (r) Project (http://www.condorproject.org/)

1. Overview and Installation

1.1 Overview

PluS adds an advance-reservation function to TORQUE (PBS) and Grid Engine.
Depending on the startup options, PluS operates in one of the following modes: the SGE queue base version, the SGE self scheduling version, or the TORQUE self scheduling version.

1.2 Directory configuration and main contents

build.xml    Build file
plus.jar     Scheduler JAR file (created by the build)
bin/
  plus_scheduler    Startup script for this scheduler
  plus_schedkill    Termination script for this scheduler
  plus_reserve      Script for making reservations
  plus_modify       Script for changing reservations
  plus_cancel       Script for canceling reservations
  plus_destroy      Script for discarding reservations
  plus_availnodes   Script for confirming the number of nodes available for reservations
  plus_commit       Script for committing reservation transactions
  plus_abort        Script for aborting reservation transactions
  plus_status       Script for showing the current reservation status
  plus_account      Script for reporting and accounting PluS reservation usage
  sge_plus.in       Template of sge_plus
  sge_plus          Script for starting/stopping PluS, for use in /etc/init.d on Linux systems (created by the build)
build/       Working directory for package build (created by the build)
src/jp/aist/gtrc/plus/
  command/    Reservation management command sources (Java)
  scheduler/  Scheduler sources (Java, partially C and shell scripts)
  reserve/    Reservation management mechanism sources (Java)
doc/
  html/manual.html    User manual (in English); this file.
  db4o.license.txt    GPLv2 license file (for db4objects)
  Apache.license.txt  Apache license file (for log4j)
  condor.license.txt  Condor license file (for classad.jar)
  pbs/TorqueProtocol.txt  TORQUE (PBS) protocol description (in Japanese, UTF-8)
  sge/gdi-manual.txt      GDI (Grid Engine Database Interface) API description (in Japanese, UTF-8)
  sge/XMLsample.txt       Example XML output of SGE data by sge_operatord
lib/
  db4o-5.?-java5.jar  db4objects ver. 5.? library for Java 5.0. See http://www.db4o.com/
  log4j-1.2.13.jar    Apache Log4j library. See http://logging.apache.org/
  classad.jar         ClassAd library from Condor. See http://www.cs.wisc.edu/condor/classad/
conf/
  log4j.properties        Template configuration for log4j
  reservable.ad.sample*   Sample files for reservable.ad; see Section 3.4.
  sched_conf.sample       Sample file for sched_conf; see Section 3.2.

1.3 Operating Environment

1.4 Setup

Conduct step [1] first, then perform steps [2] to [4] based on the PluS startup mode.

[1] <Essential> Compile and install the Java scheduler and reservation commands

In the directory where build.xml is stored, execute the following.
    % ant
    % sudo ant install
 
This copies the various scripts to /usr/local/bin, and plus.jar, db4o-xxx.jar, etc. to /usr/local/PluS.
 
To change the installation directory, change the command as follows:
    % ant -Dinstall.bin.dir=/myhome/bin -Dinstall.jar.dir=/myhome/jars install
You can also edit the first part of build.xml as necessary.
Directory names in the script files are rewritten during the build.
When build.xml is changed, execute "ant clean" first.

Add java and the install.bin.dir directory to your PATH if needed.

[2] <Only when using the SGE queue base version> Settings for sudoers and SGE

Configure sudo so that the PluS scheduler can execute the qresub command without a password.
For example, execute sudo visudo, and add a line such as
    plus ALL=(ALL) NOPASSWD: /opt/gridengine/bin/lx26-x86/qresub
This example assumes that the PluS scheduler is started by user "plus".

Add the PluS scheduler starter (e.g. plus) to the SGE managers, since queue operations require manager rights.
Execute sudo qconf -am plus, and confirm the registered user names with qconf -sm.

Add the install.bin.dir directory specified in step [1] to the PATH.

[3] <Only when using the SGE self scheduling version> Build of sge_operatord

NOTE: This has been tested only on Linux 2.6.x / Intel x86 so far.

Obtain the source code of SGE 6.0u8.
Decompress http://gridengine.sunsource.net/files/documents/7/78/sge-V60u8_TAG-src.tar.gz, and apply
    src/jp/aist/gtrc/plus/scheduler/specific/sgesched/operatord/sge-V60u8-plus.patch.
Build SGE normally.
Copy gridengine/source/LINUX86_26/sge_operatord to $SGE_ROOT/bin/lx26-x86.
Add the directory containing sge_operatord to the PATH.
Commands such as sge_qmaster and qsub do not need to be replaced. (Replacing them does not cause a problem, however.)
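
A minimal shell sketch of this procedure follows; the extraction directory name and the patch strip level (-p1) are assumptions, so adjust them to your source tree.
    % tar xzf sge-V60u8_TAG-src.tar.gz
    % cd gridengine
    % patch -p1 < /path/to/PluS/src/jp/aist/gtrc/plus/scheduler/specific/sgesched/operatord/sge-V60u8-plus.patch
    % # build SGE normally, then copy the resulting binary:
    % cp source/LINUX86_26/sge_operatord $SGE_ROOT/bin/lx26-x86/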

A precompiled sge_operatord for Linux 2.6.x, Intel x86, glibc 2.3.3, and SGE 6.0u8 is provided as
    src/jp/aist/gtrc/plus/scheduler/specific/sgesched/operatord/sge_operatord-V60u8-lx26-x86.
You can use it as $SGE_ROOT/bin/lx26-x86/sge_operatord.
It may not work in all environments.

For details of sge_operatord, see
    src/jp/aist/gtrc/plus/scheduler/specific/sgesched/operatord/README.txt (only in Japanese for now, sorry!).

[4] <Only when using the TORQUE self scheduling version> Build of pbs_iff

Obtain TORQUE source (1.2.0 or later) from the following site.
    http://www.clusterresources.com/downloads/torque/
After decompressing, replace torque-X.Y.ZpN/src/iff/iff2.c with
    src/jp/aist/gtrc/plus/scheduler/specific/pbs/pbs_iff/iff2.c_torque-X.Y.ZpN.
Build and set up TORQUE normally.
Add the directory containing pbs_iff (/usr/local/sbin by default) to the PATH.
   
Patched iff2.c files are provided for TORQUE 1.2.0p4, 2.0.0p5, 2.1.0p0, and 2.1.6.
Our modification is only about 10 lines, so it should be easy to apply the same change to other versions.
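
A minimal shell sketch, assuming TORQUE 2.1.6 (the archive and file names below follow the X.Y.ZpN pattern above and are illustrative):
    % tar xzf torque-2.1.6.tar.gz
    % cp /path/to/PluS/src/jp/aist/gtrc/plus/scheduler/specific/pbs/pbs_iff/iff2.c_torque-2.1.6 torque-2.1.6/src/iff/iff2.c
    % cd torque-2.1.6
    % ./configure && make && sudo make install   # build and set up TORQUE normally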


2. How to use PluS

2.1 Startup and termination

2.1.1 Startup conditions

2.1.2 How to start scheduler

Execute "plus_scheduler [-t target] [-a algo] [-c conffile] [-r rsvFilePath] [-ne maxExpired] [-sgeq qNames]".
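
For example, to start the scheduler with the Sort algorithm and a configuration file (this matches the usage shown in Section 3.2; the configuration file path is illustrative):
    % plus_scheduler -a Sort -c /usr/local/PluS/sched_conf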

2.1.3 How to terminate scheduler

Execute "plus_schedkill". (Tentative: anyone can execute this command and terminate the scheduler, which may cause a problem.)
Alternatively, kill the Java process of PluS.

2.2 Reservation management command note

2.2.1 Usage conditions

2.2.2 Exit code and display

2.3 Reservation management command

2.3.1 Registering reservations

plus_reserve [-T] -s startTime [-e endTime | -D duration] -n numNodes [-U users] [-mem memSize] [-ncpu cpuNum] [-arch arch] [-os os] [-l attr=value[,...]] [-p] [-sn slotNum] [-sgeq qNames] [-x]
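
For example, the following requests 4 nodes for users userA and userB (the time format shown here is an assumption; check the plus_reserve help for the exact format accepted by -s and -e):
    % plus_reserve -s "2006/12/25 10:00" -e "2006/12/25 12:00" -n 4 -U userA,userB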

2.3.2 Viewing reservation information

plus_status [-r rsvID] [-o owner] [-n] [-s startTime] [-e endTime] [-ne numShowExpired] [-x]

2.3.3 Canceling reservations

plus_cancel [-T] [-r rsvID] [-o owner]

2.3.4 Changing reservations

plus_modify [-T] -r rsvID [-s startTime] [-e endTime | -D duration] [-n numNodes] [-U users]

2.3.5 Confirming the number of nodes available for reservations

plus_availnodes -s startTime -e endTime [-x]

2.3.6 Discarding reservations

plus_destroy [-T] [-r rsvID] [-o owner]

2.3.7 Confirming reservation operations

plus_commit -r reservationID

2.3.8 Discarding reservation operations

plus_abort -r reservationID

2.3.9 Viewing account of reservation usage

plus_account [-s startTime] [-e endTime] [-d days] [-f logfile] [-o [owner]] [-n [nodes]] [-l] [-S]
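
For example, a usage report for the last 7 days broken down per owner might be requested as follows (an illustrative invocation; the option semantics are inferred from the synopsis above):
    % plus_account -d 7 -o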

2.4 Restrictions on reservation transactions

plus_reserve/cancel/modify/destroy have the -T option.
When executing these plus_xxx commands with the -T option:
When executing plus_xxx without the -T option:

While an operation is waiting for commit/abort, it remains valid across a restart of plus_scheduler.
When the -o owner option is specified for plus_cancel/destroy, the -T option cannot be used.
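
For example, a two-phase reservation might look like the following (reservation ID R123 is illustrative):
    % plus_reserve -T -s ... -e ... -n 4     (prints a reservation ID, e.g. R123, and waits)
    % plus_commit -r R123                    (confirm; or plus_abort -r R123 to discard)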

The operation is automatically aborted under the following conditions:

2.5 Submitting jobs to reservations

Assume that a reservation with ID = R123 has been registered by the reservation procedure described above.
If the reservation ID is incorrect, or if you want to cancel executing a job in the reservation, delete the job by executing qdel jobID.
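
In the SGE queue base version, a reservation corresponds to an SGE queue named after the reservation ID (see Section 3.1), so a job can be submitted to the reservation, for example, as follows (job.sh is an illustrative job script; the submission method for the self scheduling versions may differ):
    % qsub -q R123 job.sh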

2.6 Reservation and job execution

A job submitted without a reservation ID will be executed immediately if there is an execution node that is neither reserved nor running other jobs.
TORQUE schedules immediately after a job is submitted; SGE schedules at a fixed interval.

A job specified with a reservation ID will be executed at its reservation starting time.
However, there may be a delay of up to 30 seconds between the reservation starting time and the actual execution starting time, since job execution is scheduled every 30 seconds (at maximum).

A reservation may not be usable even when it has been successfully registered, for example if the execution node is down at the execution time or too many jobs were submitted to the reservation.
In this case, the reservation node is not switched to another node (tentative specification).

If a job submitted without a reservation is running on a node when that node's reservation period begins, the job is forcibly stopped and returned to the queue (back to the state just after submission).
The job will be executed from the beginning again when subsequent scheduling finds a node that is neither reserved nor running any job.



3. Internal description

3.1 Operation descriptions of SGE queue base version

The SGE queue base version of PluS creates a corresponding SGE queue when a reservation is created.
Reservation users use a reservation by submitting jobs to its reservation queue.
For exclusive use of a reservation, the users who may use it and the execution hosts available to it are set on the reservation queue.

PluS suspends the reservation queue before the reservation starting time, resumes it at the starting time, and deletes the whole queue at the ending time. A job submitted to a suspended reservation queue is not executed; it will be executed by the first scheduling after the queue is resumed.
Job scheduling in the SGE queue base version is performed by the original sge_schedd; PluS itself does not schedule jobs.
PluS controls the queues by executing its own shell scripts.

3.1.1 Operations when creating a reservation

  1. Decide the reservation nodes based on the node information PluS collects. Issue a reservation ID (R123 in this example), and record the reservation information in a DB file.
    These operations are the same as in the self scheduling versions.
  2. Create an SGE host group, @R123_hosts, corresponding to the reservation node group.
  3. Create an SGE user group, R123_users, corresponding to the reservation users.
  4. Create a reservation queue R123. At this point, set the available host to @R123_hosts, and the available user to R123_users.
  5. Suspend R123.
Use sge_rsvq_reg_nodes.sh for step 2, and use sge_rsvq_new.sh for steps 3 to 5.
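
Roughly, these scripts issue SGE commands like the following (an illustrative sketch, not the actual script contents; the configuration files given to qconf are assumed to be prepared beforehand):
    qconf -Ahgrp hgrp_R123     # create host group @R123_hosts from a host group file
    qconf -Au users_R123       # create user set R123_users from a user set file
    qconf -Aq queue_R123       # create queue R123 with hostlist @R123_hosts and user_lists R123_users
    qmod -sq R123              # suspend the new reservation queue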

3.1.2 Normal operations

PluS checks the reservation table regularly (every 30 seconds by default) to detect reservations whose starting or ending time has passed.
It performs the "Operations when starting a reservation" for a suspended queue that is within its reservation period, and the "Operations when ending a reservation" for a reservation queue that is past its reservation ending time.

3.1.3 Operations when starting a reservation

  1. Suspend the normal queues (original queues other than the reservation queues created by PluS) on the reservation nodes.
  2. Forcibly terminate jobs without reservations on the reservation nodes, and return them to a queue.
  3. Resume the reservation queue.
Use sge_rsvq_start_rsv.sh for steps 1 to 3.
Use the following script for step 2.
    qmod -sj jobID
    sudo -u jobOwnerName qresub jobID
    qdel jobID
To change how jobs without reservations are handled at the reservation starting time, modify this script.

Note that when a job is resubmitted with qresub, the job owner becomes the user who executed qresub, that is, the user running the PluS scheduler. To avoid this problem, sudo is used; this is why the sudoers settings are required at setup.

3.1.4 Operations when ending a reservation

  1. Delete all jobs (regardless of whether they are running or waiting) in the reservation queue.
  2. Delete the reservation queue, the user group and host group created for the queue.
Use sge_rsvq_del.sh for steps 1 and 2.
Since running jobs must also be deleted, step 1 uses qdel with the -f option:
    qdel -f jobID
To change how jobs are handled at the reservation ending time, modify this script.

3.1.5 Operations when canceling a reservation

  1. Change the hosts available to the reservation queue from @R123_hosts to NONE.
    By PluS's design, canceling a reservation only releases the reservation nodes; the reservation information remains, and therefore the reservation queue also remains.
Use sge_rsvq_unreg_nodes.sh for step 1.

3.2 Job scheduling algorithm

The SGE self scheduling and TORQUE self scheduling versions do not use the original scheduler (sge_schedd/pbs_sched); PluS itself controls the scheduling.
By default, PluS executes jobs in order of priority, and in FIFO order (order of job submission time) for jobs with the same priority.

To specify the job execution order or node allocation order, create a configuration file like the following (saved as sched_conf), and execute "plus_scheduler -a Sort -c sched_conf" at scheduler startup.

Example of sched_conf description
# Scheduler Config file SAMPLE
# NodeSortKey can specify
#    NodeName, LowestLoadAverage, LongestIdleTime, LargestPhysicalMemory
# JobSortKey can specify
#    JobPriority,
#    LeastCPURequested, MostCPURequested,
#    LeastNodeRequested, MostNodeRequested,
#    LeastTimeRequested, MostTimeRequested,
#    QueuePriority, QueueRoundRobin, ByQueue,
#    OwnersName, OwnersGroup, OwnersHost
# OwnersXxx needs to specify XxxsOrder.
#
# SortKey: 1st sort key, 2nd sort key, ...
NodeSortKey:    LowestLoadAverage, LongestIdleTime
JobSortKey:     QueuePriority, JobPriority
#
# OrderType: primary job owner(group/host/domain), 2nd, ...
OwnersOrder:    studentA,studentB
GroupsOrder:    professors, doctors, masters
HostsOrder:     apgrid.org, hpcc.org
In this example, nodes are sorted by lowest load average and then by longest idle time; jobs are sorted by queue priority and then by job priority; and owners studentA and studentB are given precedence in that order.
Job priority is the value X (max. 1023, default 0) specified by qsub -p X.

3.3 Torque scheduler settings and corresponding PluS settings

The following job sort settings are available in the Torque FIFO scheduler setting file (/var/spool/torque/sched_priv/sched_conf).

round_robin:
    When multiple queues exist, a job in the first queue is executed first, and a job in the second queue is executed in the second scheduling cycle even if a job waiting to be executed remains in the first queue.
    Default is false.

by_queue:
    When multiple queues exist, a job in the first queue is executed first. If a job waiting to be executed still exists in the first queue in the second scheduling cycle, that job is executed; only when the first queue has no waiting job is a job in the second queue executed.
    Default is true.

sort_queues:
    Queues are selected (with or without round_robin and by_queue) in queue priority order.
    If queue priorities are not set, the order in which queues were registered with pbs_server is used.
    Default is true.

strict_fifo:
    When multiple jobs waiting to be executed exist in a queue, the first submitted job is executed first.
    If this is not specified, jobs are executed in order of job priority.
    Default is false.
    With the Torque implementation, however, jobs are not executed in submission-time order across all queues when jobs exist in multiple queues: jobs are executed in the order in which the queues were registered with pbs_server, and in submission-time order within the same queue. This means that when strict_fifo is true, the scheduler behaves as if round_robin=false, by_queue=true, and sort_queues=false, regardless of the other settings.

In the following table, rr represents the setting value of round_robin, bq of by_queue, sq of sort_queues, and sf of strict_fifo. F represents false, and T represents true.

    rr  bq  sq  sf  JobSortKey specification with PluS
    --------------------------------------------------------------
    F   F   F   F   JobPriority (queue is not specified)
    F   F   F   T   SubmitTime (queue is not specified)
    F   F   T   F   QueuePriority, JobPriority
    F   F   T   T   QueuePriority, SubmitTime
    F   T   F   F   ByQueue, JobPriority
    F   T   F   T   ByQueue, SubmitTime
    F   T   T   F   QueuePriority, ByQueue, JobPriority   (default)
    F   T   T   T   QueuePriority, ByQueue, SubmitTime
    T   F   F   F   QueueRoundRobin, JobPriority
    T   F   F   T   QueueRoundRobin, SubmitTime
    T   F   T   F   QueuePriority, QueueRoundRobin, JobPriority
    T   F   T   T   QueuePriority, QueueRoundRobin, SubmitTime
    T   T   F   F   rr and bq must not be specified at the same time
    T   T   F   T   (same as above)
    T   T   T   F   (same as above)
    T   T   T   T   (same as above)

For example, in the default sched_conf, sort_queues and by_queue are set to true and the others to false. To obtain the same scheduling from PluS, add the following line to the PluS setting file:
    JobSortKey: QueuePriority, ByQueue, JobPriority
When no setting file is specified, PluS behaves this way by default.

3.4 Node reservability configuration by ClassAd

3.4.1 Abstract

Normally, PluS selects reservation nodes from all execution nodes that satisfy the normal reservation conditions.
You can add custom conditions that PluS also checks at node allocation.
A custom condition is written as a ClassAd program; ClassAd is a part of Condor.
For details of ClassAd evaluation, see the Condor ClassAd documentation (http://www.cs.wisc.edu/condor/classad/).

Create a "reservable.ad" file in the PluS directory (/usr/local/PluS by default) that contains a definition of PLUS_NODE_RESERVABLE.

You can also restrict reservation nodes with the -sgeq option of plus_scheduler or plus_reserve.

3.4.2 Special variables for PluS

3.4.3 Node information struct

Node information is expressed as a ClassAd record expression.
It has the following members.
For example, "PLUS_CANDIDATE_NODE.name" means the candidate node's name.

NOTE: isAlive/loadavg/nRunJobs are derived from information provided by sge_qmaster/pbs_server.
They do not reflect the instantaneous status; they are polled periodically (every 30-60 seconds) by sge_qmaster/pbs_server.
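
As an illustration, a candidate node record might look like the following; only the members named in this section are shown, so this is not the complete member list:
    [
        name = "node1";
        isAlive = true;
        loadavg = 0.25;
        nRunJobs = 0;
    ]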

3.4.4 Reservability check flow

PluS evaluates PLUS_NODE_RESERVABLE for each node as follows.
The "normal reservation conditions" check whether a node satisfies PluS's built-in requirements.
These normal conditions cannot be changed by ClassAd.

3.4.5 Sample ClassAd definitions

You can use the PluS/conf/reservable.ad.sample* files as templates; they are copied to /usr/local/PluS when PluS is set up.

3.4.5.1 Check node status

A candidate node is reservable if it is running no jobs, is alive, has a load average of at most 0.3, and is not listed in UnusableNodes:
[
    UnusableNodes = { "unusableA", "unusableB" };

    PLUS_NODE_RESERVABLE =
        (PLUS_CANDIDATE_NODE.nRunJobs == 0 &&
         PLUS_CANDIDATE_NODE.isAlive &&
         PLUS_CANDIDATE_NODE.loadavg <= 0.3 &&
         !member(PLUS_CANDIDATE_NODE.name, UnusableNodes));
]

3.4.5.2 Change reservability by how long until the reservation starts

Restrict the number of reservable nodes by owner and by how long until the reservation starts.

The reservability percentage is defined as a function of the time remaining until the reservation starts.
No user can reserve if startTime is before now, if startTime is after "now + LimitPeriod", or if the reservation duration is longer than MaxReserveDuration.
If the reservability is 80%, then 80% of all execution nodes are reservable.

Members of VIPs can always reserve 100% of the nodes. For members of Users (and likewise for other users), the reservability is MaxRatio when the start time is within MaxPeriod from now, MinRatio when it is MinPeriod or more away, and linearly interpolated in between.

These conditions are defined in ClassAd as follows.

[
    MaxPeriod = relTime("00:30:00");
    MinPeriod = relTime("02:00:00");
    LimitPeriod = relTime("7d");
    MaxReserveDuration = relTime("2d");
    VIPs = { "userA", "userB", "userC" };
    Users = { "userX" };
    VIPRatio = 100.0;
    UsersMaxRatio = 90.0;
    UsersMinRatio = 50.0;
    OthersMaxRatio = 50.0;
    OthersMinRatio = 30.0;
   
    MaxRatio = member(PLUS_RSV_OWNER, Users) ? UsersMaxRatio : OthersMaxRatio;
    MinRatio = member(PLUS_RSV_OWNER, Users) ? UsersMinRatio : OthersMinRatio;
    now = absTime(time());
    prev = PLUS_RSV_START - now;
    duration = PLUS_RSV_END - PLUS_RSV_START;
    ratioFunc = linear(prev, MaxPeriod, MaxRatio, MinPeriod, MinRatio);
    rsvRatio =
        (prev <= relTime("0") || LimitPeriod <= prev || duration >= MaxReserveDuration) ? 0 :
        member(PLUS_RSV_OWNER, VIPs) ? VIPRatio :
        (prev <= MaxPeriod) ? MaxRatio :
        (prev >= MinPeriod) ? MinRatio : ratioFunc;

    nAllocate = size(PLUS_ALLOCATED_NODES) + 1;
    nAllocatable = size(PLUS_ALL_NODES) * 0.01 * rsvRatio;
    PLUS_NODE_RESERVABLE = (nAllocate <= nAllocatable);
]

In this ClassAd, we define:
  MaxPeriod = 30 minutes, MinPeriod = 2 hours, LimitPeriod = 7 days, MaxReserveDuration = 2 days.
  The VIPs members are userA, userB, and userC.
  The sole Users member is userX.

If you are "userA", you are a member of VIPs, so you can always reserve 100% of the nodes.
linear() and quadratic() are built-in functions defined by PluS.
linear(x, x1, y1, x2, y2) returns the value of the linear function f(x) that satisfies y1 = f(x1) and y2 = f(x2).
To use a quadratic function, use quadratic() in the same way.
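
For example, for a Users member (MaxRatio = 90, MinRatio = 50) whose reservation starts one hour from now, prev = 1 hour lies between MaxPeriod (30 minutes) and MinPeriod (2 hours), so:
    linear(60min, 30min, 90, 120min, 50) = 90 + (60 - 30) / (120 - 30) * (50 - 90) = 76.7 (approximately)
With 20 execution nodes, nAllocatable = 20 * 0.01 * 76.7 = 15.3, so at most 15 nodes can be allocated to the reservation.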

3.4.5.3 Restrict maximum reservation usage

You can make a new reservation only if your reservation usage within the check period is below all of the limits defined below:
[
    UtilCheckPeriod = relTime("7d");
    MaxReserveCount = 100;
    MaxReserveDuration = relTime("2d");
    MaxReserveHourNode = 1000.0;
   
    now = absTime(time());
    util = plus_rsvutil(PLUS_RSV_OWNER,
        now - UtilCheckPeriod, now + UtilCheckPeriod);
       
    PLUS_NODE_RESERVABLE =
        ((util[0] <= MaxReserveCount) &&
         (util[1] <= MaxReserveDuration) &&
         (util[2] <= MaxReserveHourNode));
]


plus_rsvutil(owner, start, end) returns a list of values describing the owner's reservation usage between start and end: util[0] is the number of reservations, util[1] is the total reserved duration, and util[2] is the total reserved node-hours, as used in the checks above.

3.4.5.4 Restrict nodes for each user group

[
    Proj1Users = { "userA", "userB" };
    Proj1Nodes = { "node1", "node2" };
    Proj2Users = { "userA", "userC" };
    Proj2Nodes = { "node3", "node4" };

    avail1 = member(PLUS_RSV_OWNER, Proj1Users)
            ? member(PLUS_CANDIDATE_NODE.name, Proj1Nodes) : false;
    avail2 = member(PLUS_RSV_OWNER, Proj2Users)
            ? member(PLUS_CANDIDATE_NODE.name, Proj2Nodes) : false;

    PLUS_NODE_RESERVABLE = (avail1 || avail2);
]


"userA" is included in both Proj1Users and Proj2Users, so "userA" can reserve nodes both in both Proj1Nodes and Proj2Nodes.

3.5 Files related to execution

3.6 Improvements of Torque authentication program

The pbs_iff command generated from the original iff2.c has the following options.

Since a file descriptor number cannot be obtained from Java, the specification was changed as follows:
the Java scheduler executes pbs_iff with a port number specified.
Original commands (qsub, etc.) that previously used pbs_iff are authenticated by this modified pbs_iff in the same way as before.

Changes in the sources are described as follows:
  #ifdef ORIGINAL
    Original source
  #else
    Changed source
  #endif