Lesson 3: Introduction to the MPI-DPP Interface

In this lesson we will explore the MPI-DPP interface based on two examples (excluding resource manager interaction).

Note

Key takeaways:

  • Extended MPI Session interface provides DPP functionality

  • Very generic and flexible: Supports different application patterns

  • Complex to use: Should be the basis for more specialized interfaces

Background

DPP Design principles

As a reminder, the DPP design principles are:

1. PSets and PSetOps are used for dynamic process management

../_images/psetops.png

2. Data Exchange via global publish/lookup store

../_images/dict.png

3. Association PSets/PSetOps <-> Optimization information expressed in COL

../_images/setop.png

DPP Extensions for MPI Sessions

Based on the DPP design principles we can define extensions to the MPI Session interface. The DynRes software stack supports a prototype implementation of this interface in Open-MPI, PRRTE and OpenPMIx.

  • PSetOp:

int MPI_Session_dyn_v2a_psetop(_nb)(
      MPI_Session session,    // session handle
      int *op,                // INOUT: The set operation to be executed (See list of operations below)
                              // (Will be set to MPI_PSETOP_NULL if requested resources not available)
      char **input_psets,     // Names of the input PSets
      int ninput_psets,       // Number of input PSets
      char ***output_psets,   // OUT: Names of the resulting new PSets
      int *n_output_psets,    // OUT: Number of resulting new PSets
      MPI_Info info           // Info object to further specify the operation
      (MPI_Request *request)  // OUT: An MPI Request for the non-blocking version
)
RETURN:
   - MPI_SUCCESS if operation was successful

Description: Indicates support for the specified PSet Operation to be applied on the input PSets. The info object can be used to specify optimization information (COL). If successful, the function allocates an array of n_output_psets PSet names. It is the caller's responsibility to free the PSet names and the output_psets array.
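The examples in this lesson release these arrays with a helper called free_string_array, whose definition is not shown. A minimal sketch of what such a helper could look like, together with a hypothetical counterpart dup_string_array for building input arrays. It assumes that the names and the array itself were allocated with malloc:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the interface): copy n strings into a
 * freshly allocated array. Returns NULL if any allocation fails. */
char **dup_string_array(const char *const *src, int n)
{
    assert(n >= 0);
    /* +1 avoids a zero-size allocation request for n == 0 */
    char **arr = (char **) calloc((size_t) n + 1, sizeof(char *));
    if (arr == NULL) return NULL;
    for (int i = 0; i < n; i++) {
        size_t len = strlen(src[i]) + 1;
        arr[i] = (char *) malloc(len);
        if (arr[i] == NULL) {
            /* undo the partial copy */
            for (int j = 0; j < i; j++) free(arr[j]);
            free(arr);
            return NULL;
        }
        memcpy(arr[i], src[i], len);
    }
    return arr;
}

/* Assumed shape of free_string_array(): free each name, then the array. */
void free_string_array(char **arr, int n)
{
    assert(n >= 0);
    if (arr == NULL) return;
    for (int i = 0; i < n; i++) free(arr[i]);
    free(arr);
}
```

With such a pair, the input array of a PSetOp can be built with dup_string_array and both the input and the output arrays can be released with free_string_array once the names are no longer needed.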

  • Publish:

int MPI_Session_set_pset_data(
      MPI_Session session,    // session handle
      const char* dict_name,  // Name of PSet in whose Dictionary the data should be published
      MPI_Info info           // Info object containing key-value pairs to be published
)
RETURN:
   - MPI_SUCCESS if operation was successful

Description: Publishes the key-value pairs included in info in a dictionary associated with the given dict_name.

  • Lookup:

int MPI_Session_get_pset_data(_nb)(
      MPI_Session session,    // session handle
      char *coll_pset_name,   /* Name of a PSet specifying procs of the collective
                             * → for consensus finding without primary process
                             * → all processes receive the same value(s)
                             */
      char *dict_name,        // Name of PSet in whose Dictionary the data should be looked up
      char **keys,            // The keys for which the values should be looked up
      int nkeys,              // The number of keys
      int wait,               // Flag indicating whether to wait for the values to be published
      MPI_Info *info_used     // OUT: MPI_Info object containing the requested key-values
      (MPI_Request *request)  // OUT: An MPI Request for the non-blocking version
)
RETURN:
   - MPI_SUCCESS if operation was successful

Description: Looks up a key-value pair in the dictionary associated with the given dict_name. The call is collective over the processes in coll_pset_name, i.e. all processes in coll_pset_name have to call this function. All processes in coll_pset_name are guaranteed to receive the same values. mpi://SELF may be used for individual lookups.

  • PSet info keys:

The following additional keys are provided via the existing MPI_Session_get_pset_info function.

"mpi_dyn"      ->    bool       // This PSet is the mpi://WORLD PSet of dynamically added processes,
                                // i.e. it was created by MPI_PSETOP_ADD or MPI_PSETOP_GROW
                                // Use case: E.g. allows processes to determine if they were added dynamically
"mpi_primary"  ->    bool       // True, if the calling process is the primary process of the PSet
                                // Only one process is the primary process of a PSet
                                // Use case: E.g. Have a "root" process without a communicator
"mpi_included" ->    bool       // True, if the calling process is included in the PSet

Hands-On 1: Grow

In this example we will use the MPI-DPP Interface to grow the number of MPI processes of an application.

PSet Graph

We can express this with DPP using the following PSet Graph.

../_images/grow.png

Program Flow

To implement the above Pset Graph we use the following program flow:

../_images/grow_flow.png

Note: ADD and SUB operations are implicitly covered by application start/termination.

Key elements:

  • Check pset info of mpi://WORLD to determine if process is dynamic

  • Original processes:
    • need to send a PSet operation of type Grow

    • name of output PSet needs to be published for dynamic procs

    • disconnect from old communicator

  • Dynamic processes:
    • need to lookup the PSet name to use for creating a communicator

  • Create common communicator from “grow” PSet

  • Disconnect from communicator
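The first key element, checking the PSet info of mpi://WORLD, boils down to interpreting the flag and the value returned by MPI_Info_get. A minimal sketch, assuming that boolean info values are encoded as the string "True" as in the examples of this lesson. The helper name pset_info_is_true is hypothetical and not part of the interface:

```c
#include <assert.h>
#include <string.h>

/* Interpret the result of an MPI_Info_get() call for a boolean PSet info key
 * such as "mpi_dyn", "mpi_primary" or "mpi_included".
 * `found` is the flag returned by MPI_Info_get, `value` the returned string.
 * Assumption: booleans are encoded as "True"; a missing key counts as false. */
int pset_info_is_true(int found, const char *value)
{
    assert(value != NULL);
    return found && strcmp(value, "True") == 0;
}
```

Checking the flag first matters because the value buffer is left unmodified when the key is not present in the info object.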

main()
/* Example of adding MPI processes */
int main(int argc, char* argv[]){

    MPI_Group group = MPI_GROUP_NULL;
    MPI_Session session = MPI_SESSION_NULL;
    MPI_Comm comm = MPI_COMM_NULL;
    MPI_Info info = MPI_INFO_NULL;

    char main_pset[MPI_MAX_PSET_NAME_LEN];
    char boolean_string[16], nprocs[16], **input_psets, **output_psets, host[64];
    int rank, flag = 0, dynamic_process = 0, noutput, op;

    if(argc < 2){
        exit(1);
    }
    strcpy(nprocs, argv[1]);

    gethostname(host, 64);

    char *dict_key = strdup("main_pset");

    /* We start with the mpi://WORLD PSet */
    strcpy(main_pset, "mpi://WORLD");

     /* Initialize the session */
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &session);

    /* Get the info from our mpi://WORLD pset */
    MPI_Session_get_pset_info (session, main_pset, &info);

    /* get value for the 'mpi_dyn' key -> if true, this process was added dynamically */
    MPI_Info_get(info, "mpi_dyn", 6, boolean_string, &flag);
    MPI_Info_free(&info);

    /* if mpi://WORLD is a dynamic PSet retrieve the name of the main PSet stored on mpi://WORLD */
    dynamic_process = (flag && 0 == strcmp(boolean_string, "True"));
    if(dynamic_process){

        /* Lookup the value for the "main_pset" key in the PSet Dictionary and use it as our main PSet */
        MPI_Session_get_pset_data (session, main_pset, main_pset, (char **) &dict_key, 1, true, &info);
        MPI_Info_get(info, "main_pset", MPI_MAX_PSET_NAME_LEN, main_pset, &flag);
        MPI_Info_free(&info);
    }

    /* create a communicator from our main PSet */
    MPI_Group_from_session_pset (session, main_pset, &group);
    MPI_Comm_create_from_group(group, "mpi.forum.example", MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);
    MPI_Comm_rank(comm, &rank);
    MPI_Group_free(&group);

    printf("Rank %d: Host = '%s', PSet = '%s'. I am '%s'!\n", rank, host, main_pset, dynamic_process ? "dynamic" : "original");

    /* Original processes will switch to a grown communicator */
    if(!dynamic_process){
        sleep(5);
        /* One process needs to request the set operation and publish the kickoff information */
        if(rank == 0){

            /* Request the GROW operation */
            op = MPI_PSETOP_GROW;

            /* We add nprocs processes */
            MPI_Info_create(&info);
            MPI_Info_set(info, "mpi_num_procs_add", nprocs);

            /* The main PSet is the input PSet of the operation */
            input_psets = (char **) malloc(1 * sizeof(char*));
            input_psets[0] = strdup(main_pset);

            noutput = 0;
            /* Send the Set Operation request */
            printf("\n\nPSETOP for op %d and input pset %s\n", op, input_psets[0]);
            MPI_Session_dyn_v2a_psetop(session, &op, input_psets, 1, &output_psets, &noutput, info);
            printf("PSETOP returned op %d and noutput = %d\n\n\n", op, noutput);
            MPI_Info_free(&info);

            strcpy(main_pset, output_psets[1]);

            /* Publish new main PSet name on the delta Pset for lookup by dynamic processes */
            MPI_Info_create(&info);
            MPI_Info_set(info, "main_pset", main_pset);
            MPI_Session_set_pset_data(session, output_psets[0], info);
            MPI_Info_free(&info);
            free_string_array(input_psets, 1);
            free_string_array(output_psets, noutput);
        }

        /* Share new main PSet name with other original processes via MPI communication */
        MPI_Bcast(main_pset, MPI_MAX_PSET_NAME_LEN, MPI_CHAR, 0, comm);

        /* Disconnect from the old communicator */
        MPI_Comm_disconnect(&comm);

        /* create a new communicator from the new main PSet */
        MPI_Group_from_session_pset (session, main_pset, &group);
        MPI_Comm_create_from_group(group, "mpi.forum.example", MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);
        MPI_Comm_rank(comm, &rank);
        MPI_Group_free(&group);

        /* Indicate completion of the Pset operation */
        /* NOTE: MPI_Session_dyn_finalize_psetop WILL BE DEPRECATED IN FUTURE VERSIONS */
        if(rank == 0){
            MPI_Session_dyn_finalize_psetop(session, main_pset);
        }
    }

    printf("Rank %d: Host = '%s', PSet = '%s'. I am '%s'!\n", rank, host, main_pset, dynamic_process ? "dynamic" : "original");

    /* Disconnect from the current communicator */
    MPI_Comm_disconnect(&comm);

    /* Finalize the MPI Session */
    MPI_Session_finalize(&session);

    return 0;

}

Running the examples

  • Make sure you are in the tutorial_dynreshpc26 environment deployment (cf. Lesson 2).

  • Change to the mpi_tests directory:

(tutorial_dynreshpc26) [mpiuser@n1 hpc]$ cd /opt/hpc/build/mpi_tests

Example 1: Add 1 node with 2 processes

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ mpirun -np 2 --host n1:2,n2:2 ./build/DynMPISessions_minimal_add_release 2

Example 2: Add 2 nodes with 1 and 2 processes respectively

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ mpirun -np 2 --host n1:2,n2:1,n3:2 ./build/DynMPISessions_minimal_add_release 3

Note

The Open MPI implementation currently does not support intra-node changes of the number of processes.

Non-working Example: Add 2 processes on n1

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ mpirun -np 2 --host n1:4 ./build/DynMPISessions_minimal_add_release 2

========================== 💥 SEGV 💥 =================================

Hands-On 2: Fork-Join

In this example we will use the MPI-DPP Interface to create a fork-join pattern with independently-dynamic tasks. The following image shows the general structure of the example:

../_images/fork-join.png

Our example consists of 4 tasks:

  • Task 1: Executed by all initial processes (simple sleep).

  • Task 2A/B: Started with disjoint subsets of the initial processes; each task may change its number of processes independently (iterative sleep with linear scalability).

  • Task 3: Executed by the union of the processes at the end of Task 2A/B (simple sleep).

We express this pattern with 3 different PSetOps:

  • MPI_PSETOP_SPLIT: Takes one input PSet and creates \(m > 0\) output PSets which are disjoint subsets of the input PSet.

  • MPI_PSETOP_REPLACE: Takes one input PSet \(S_{in}\) and creates 3 output PSets: \(S_{new}\) containing new processes, \(S_{del}\) containing processes to be terminated and \(S_{repl} = (S_{in} \cup S_{new}) \setminus S_{del}\).

  • MPI_PSETOP_UNION: Takes n input PSets and creates one output PSet which contains the union of the processes in the provided input PSets.
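To make the set semantics concrete, the following toy model represents a PSet as a bitmask of process ids. It is only an illustration: the type pset_t and the functions are hypothetical, real PSets are identified by name, and which processes end up in which part of a split is decided by the runtime rather than by lowest id as assumed here. REPLACE is modeled as the input and new processes minus the processes to be terminated.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a PSet is a bitmask, bit i set <=> process i is a member. */
typedef uint32_t pset_t;

/* UNION: all processes contained in either input PSet. */
pset_t pset_union(pset_t a, pset_t b)
{
    return a | b;
}

/* REPLACE: input plus new processes, minus the processes to be terminated. */
pset_t pset_replace(pset_t in, pset_t added, pset_t removed)
{
    assert((in & added) == 0);   /* new processes are not yet members */
    return (in | added) & ~removed;
}

/* Two-way SPLIT: the `first_size` members with the lowest ids go to *part_a,
 * the remaining members to *part_b (the assignment rule is an assumption). */
void pset_split2(pset_t in, int first_size, pset_t *part_a, pset_t *part_b)
{
    assert(first_size >= 0);
    pset_t a = 0;
    for (int i = 0; i < 32 && first_size > 0; i++) {
        pset_t bit = (pset_t) 1 << i;
        if (in & bit) {
            a |= bit;
            first_size--;
        }
    }
    *part_a = a;
    *part_b = in & ~a;
}
```

In this model a fork-join run is a split of the initial PSet, a sequence of replace operations on each part, and a final union of the two resulting PSets.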

Note

For demonstration we use blocking PSetOps, i.e. the application waits for the resource manager / runtime decision. In a realistic setting we would rather use non-blocking REPLACE PSetOps.

Let’s have a look at the main function:

main()
/* Example of a fork-join pattern with dynamic tasks */
int main(int argc, char* argv[]){

    char task_pset[MPI_MAX_PSET_NAME_LEN];
    char task_name[MPI_MAX_PSET_NAME_LEN];
    char boolean_string[16];
    int flag = 0, dynamic_process = 0, rc = 0, task_iteration = 0;

    gethostname(host, 64);

    /* We start with the mpi://WORLD PSet */
    strcpy(task_pset, "mpi://WORLD");

    /* Initialize global MPI Session */
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &global_session);

    /* Check if process was created dynamically */
    get_pset_info(task_pset, "mpi_dyn", boolean_string, 16);
    dynamic_process = (0 == strcmp(boolean_string, "True"));

    /* Original processes execute task 1 */
    if(!dynamic_process){
        task1(task_pset);
    }

    /* Get task assignment */
    fork(dynamic_process, task_pset, task_name, &task_iteration);

    /* Execute assigned task */
    if(0 == strcmp(task_name, "Task2A")){
        rc = task2A(task_pset, task_iteration);
    }else if (0 == strcmp(task_name, "Task2B")){
        rc = task2B(task_pset, task_iteration);
    }

    /* If process was removed by shrink operation terminate now */
    if(rc == PROC_TERMINATE){
        MPI_Session_finalize(&global_session);
        return 0;
    }

    /* Get a common PSet for task 3 */
    join(task_pset, task_name);

    /* All processes execute task 3 */
    task3(task_pset);

    /* Finalize the global MPI Session */
    MPI_Session_finalize(&global_session);

    return 0;
}

Note

Key takeaways:

  • Most of the complexity of fork-join is hidden inside an interface specialized for this application pattern.

  • The fork join functions are independent of MPI communication

Let’s have a look at how the plain MPI-DPP interface is used to implement the fork and join functions:

fork()
int fork(int dynamic_process, char task_pset[], char task_name[], int *task_iteration){
    char booleanString[16], intString[16];
    int flag, op, size, noutput = 0;
    char ** dict_keys, **input_psets, **output_psets;

    MPI_Info info, pset_info;

    if( !dynamic_process){
        /* Find the primary process of the task_pset */
        MPI_Session_get_pset_info (global_session, task_pset, &info);
        MPI_Info_get(info, "mpi_primary", 6, booleanString, &flag);
        MPI_Info_get(info, "mpi_size", 6, intString, &flag);
        size = atoi(intString);
        MPI_Info_free(&info);

        /* Primary process requests the split operation and publishes the kickoff information */
        if(0 == strcmp(booleanString, "True") ){

            /* Request the SPLIT operation */
            op = MPI_PSETOP_SPLIT;

            MPI_Info_create(&info);
            char split_string[16];
            snprintf(split_string, sizeof(split_string), "%d,%d", size/2, size - size/2);
            MPI_Info_set(info, "mpi_part_sizes", split_string);

            /* The main PSet is the input PSet of the operation */
            input_psets = (char **) malloc(1 * sizeof(char*));
            input_psets[0] = strdup(task_pset);

            /* Send the Set Operation request */
            MPI_Session_dyn_v2a_psetop(global_session, &op, input_psets, 1, &output_psets, &noutput, info);
            MPI_Info_free(&info);

            /* Publish the names of the successor tasks on current task_pset */
            MPI_Info_create(&info);
            MPI_Info_set(info, "pset_task_2a", output_psets[0]);
            MPI_Info_set(info, "pset_task_2b", output_psets[1]);
            MPI_Session_set_pset_data(global_session, task_pset, info);
            MPI_Info_free(&info);
            free_string_array(input_psets, 1);
            free_string_array(output_psets, noutput);
        }

        /* Lookup the names of the successor tasks on current task_pset and check where I am included */
        dict_keys = (char **) malloc(2 * sizeof(char *));
        dict_keys[0] = strdup("pset_task_2a");
        dict_keys[1] = strdup("pset_task_2b");
        MPI_Session_get_pset_data (global_session, task_pset, task_pset, (char **) dict_keys, 2, true, &info);
        MPI_Info_get(info, "pset_task_2a", MPI_MAX_PSET_NAME_LEN, task_pset, &flag);

        get_pset_info(task_pset, "mpi_included", booleanString, 16);
        if(0 == strcmp(booleanString, "True")){
            strcpy(task_name, "Task2A");
        }else {
            MPI_Info_get(info, "pset_task_2b", MPI_MAX_PSET_NAME_LEN, task_pset, &flag);
            strcpy(task_name, "Task2B");
        }

        MPI_Info_free(&info);
        free_string_array(dict_keys, 2);

        *task_iteration = 0;

    }else{
        /* Lookup task, pset and iteration on our world pset -> info was published during the resize (REPLACE) operation */
        dict_keys = (char **) malloc(3 * sizeof(char *));
        dict_keys[0] = strdup("task_pset");
        dict_keys[1] = strdup("task_name");
        dict_keys[2] = strdup("task_iteration");
        MPI_Session_get_pset_data (global_session, task_pset, task_pset, (char **) dict_keys, 3, true, &info);
        MPI_Info_get(info, "task_pset", MPI_MAX_PSET_NAME_LEN, task_pset, &flag);
        MPI_Info_get(info, "task_name", MPI_MAX_PSET_NAME_LEN, task_name, &flag);
        /* intString holds only 16 bytes, so limit the value length accordingly */
        MPI_Info_get(info, "task_iteration", 15, intString, &flag);
        MPI_Info_free(&info);

        *task_iteration = atoi(intString);

        free_string_array(dict_keys, 3);
    }

    return 0;
}

Note

Key takeaways:

  • Primary process responsible for handling psetop and publishing relevant information

  • Other processes lookup relevant information in global dict and pset info

  • Dynamic processes need to be handled differently

join()
int join(char *task_pset, char *task_name){
    char boolean_string[16], tmp[MPI_MAX_PSET_NAME_LEN];
    char coll_pset[MPI_MAX_PSET_NAME_LEN] = "mpi://SELF", dict_name[16] = "Task3";
    char **input, **output;
    const char *key1 = "Task2B", *key2 = "task_pset";
    int noutput = 0, op, flag;
    MPI_Info info;

    get_pset_info(task_pset, "mpi_primary", boolean_string, 16);

    if(0 == strcmp(boolean_string, "True")){
        MPI_Info_create(&info);
        MPI_Info_set(info, task_name, task_pset);
        MPI_Session_set_pset_data(global_session, "Task3", info);
        MPI_Info_free(&info);

        if(0 == strcmp(task_name, "Task2A")){
            MPI_Session_get_pset_data (global_session, coll_pset, dict_name, (char **) &key1, 1, true, &info);
            MPI_Info_get(info, "Task2B", MPI_MAX_PSET_NAME_LEN, tmp, &flag);
            MPI_Info_free(&info);

            op = MPI_PSETOP_UNION;
            input = (char **) malloc(2 * sizeof(char *));
            input[0] = strdup(task_pset);
            input[1] = strdup(tmp);

            /* Send the Set Operation request */
            MPI_Session_dyn_v2a_psetop(global_session, &op, input, 2, &output, &noutput, MPI_INFO_NULL);

            MPI_Info_create(&info);
            MPI_Info_set(info, "task_pset", output[0]);
            MPI_Session_set_pset_data(global_session, "Task3", info);
            MPI_Info_free(&info);

            free_string_array(input, 2);
            free_string_array(output, noutput);

        }
    }

    MPI_Session_get_pset_data (global_session, task_pset, dict_name, (char **) &key2, 1, true, &info);
    MPI_Info_get(info, "task_pset", MPI_MAX_PSET_NAME_LEN, task_pset, &flag);
    MPI_Info_free(&info);

    strcpy(task_name, "Task3");

    return SUCCESS;

}

Note

Key takeaways:

  • Primary processes of both task PSets exchange PSet names via global dictionary

  • One process creates common PSet and publishes its name

  • All processes lookup new task PSet in dictionary

Let’s have a look at the task functions:

Task1/3
int task1(char *pset_name){
    int rank, size;
    MPI_Comm comm;
    MPI_Group group;

    comm_from_pset(pset_name, &comm, &size, &rank);

    printf("TASK 1/3: ==> Rank %d/%d Running on host '%s'\n", rank, size-1, host);

    sleep(5);
    if(rank == 0)printf("\n");
    sleep(5);

    MPI_Barrier(comm);
    MPI_Comm_disconnect(&comm);
    return SUCCESS;
}
Task2A/B
/* Shown for Task 2A; task2B() is analogous, using "Task2B" and TASK2B_SCALABILITY */
int task2A(char *pset_name, int iteration){
    int rank, size, rc;
    MPI_Comm comm;

    /* create a communicator from our task PSet */
    comm_from_pset(pset_name, &comm, &size, &rank);

    for(int i = iteration; i < 3; i++){
        printf("TASK 2A: ==> Rank %d/%d Running iteration %d on host '%s'\n", rank, size-1, i, host);

        /* TASK WORK */
        task_sleep(size, TASK2A_SCALABILITY);
        if(rank == 0)printf("\n");
        task_sleep(size, TASK2A_SCALABILITY);
        MPI_Barrier(comm);

        /* TASK RESIZE */
        rc = resize("Task2A", i, pset_name);
        if(rc == PROC_TERMINATE){
            MPI_Comm_disconnect(&comm);
            return rc;
        }else if (rc == PROC_RESIZE){
            MPI_Comm_disconnect(&comm);
            comm_from_pset(pset_name, &comm, &size, &rank);
        }

    }

    /* Release the task communicator before leaving the task */
    MPI_Comm_disconnect(&comm);
    return SUCCESS;
}

Finally, let’s have a look at the resize function:

resize()
int resize(const char *task_name, int iteration, char *pset_name){

    char ** output_psets;
    char tmp_buf[16], prev_pset[MPI_MAX_PSET_NAME_LEN];
    int noutput = 0, op, flag;
    MPI_Info info;

    /* We want to keep at least 1 proc */
    get_pset_info(pset_name, "mpi_size", tmp_buf, 16);
    if (0 >= atoi(tmp_buf) + atoi(get_param(task_name, PROCS_ADD)) - atoi(get_param(task_name, PROCS_SUB))){
        return PROC_NO_ACTION;
    }

    char *key = (char *) malloc(16);
    snprintf(key, 16, "task_pset_%d", iteration);
    strcpy(prev_pset, pset_name);

    /* Primary process requests the PSetOp and publishes task info in dictionary */
    get_pset_info(pset_name, "mpi_primary", tmp_buf, 16);
    if(0 == strcmp(tmp_buf, "True")){
        op = MPI_PSETOP_REPLACE;

        MPI_Info_create(&info);
        if(0 == (strcmp(task_name, "Task2A") )){
            MPI_Info_set(info, "mpi_num_procs_add", TASK2A_PROCS_ADD);
            MPI_Info_set(info, "mpi_num_procs_sub", TASK2A_PROCS_SUB);
        }else{
            MPI_Info_set(info, "mpi_num_procs_add", TASK2B_PROCS_ADD);
            MPI_Info_set(info, "mpi_num_procs_sub", TASK2B_PROCS_SUB);
        }

        MPI_Session_dyn_v2a_psetop(global_session, &op, &pset_name, 1, &output_psets, &noutput, info);
        MPI_Info_free(&info);

        if(MPI_PSETOP_NULL == op) {
            MPI_Info_create(&info);
            MPI_Info_set(info, key, pset_name);
            MPI_Session_set_pset_data(global_session, pset_name, info);
        }else{
            /* Publish task info on output_psets[1] (output_psets[1] == mpi://WORLD of new processes) */
            snprintf(tmp_buf, sizeof(tmp_buf), "%d", iteration + 1);
            MPI_Info_create(&info);
            MPI_Info_set(info, "task_pset", output_psets[2]);
            MPI_Info_set(info, key, output_psets[2]);
            MPI_Info_set(info, "task_name", task_name);
            MPI_Info_set(info, "task_iteration", tmp_buf);
            MPI_Session_set_pset_data(global_session, output_psets[1], info);
            MPI_Session_set_pset_data(global_session, pset_name, info);
            MPI_Session_dyn_finalize_psetop(global_session, pset_name);
            free_string_array(output_psets, noutput);

        }
        MPI_Info_free(&info);
    }

    /* Lookup name of new PSet */
    MPI_Session_get_pset_data (global_session, pset_name, pset_name, (char **) &key, 1, true, &info);
    MPI_Info_get(info, key, MPI_MAX_PSET_NAME_LEN, pset_name, &flag);
    MPI_Info_free(&info);
    free(key);
    if(0 == strcmp(prev_pset, pset_name)){
        return PROC_NO_ACTION;
    }

    /* Processes NOT included in new PSet need to terminate */
    MPI_Session_get_pset_info (global_session, pset_name, &info);
    MPI_Info_get(info, "mpi_included", 6, tmp_buf, &flag);
    MPI_Info_free(&info);
    if(0 == strcmp(tmp_buf, "False")){
        return PROC_TERMINATE;
    }

    return PROC_RESIZE;

}

Note

Key takeaways:

  • Primary process sends PSetOp and publishes data for new processes (lookup in fork) and existing processes

  • Existing processes lookup new task information

  • Existing processes check if they are included in new PSet, else terminate
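The decision logic of resize() can be isolated from the MPI calls. The following sketch restates the guard at the top of the function and the classification at its end as plain functions. The helper names and the numeric values of the return codes are assumptions made for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return codes; the numeric values are placeholders for this sketch. */
enum { PROC_NO_ACTION = 0, PROC_RESIZE = 1, PROC_TERMINATE = 2 };

/* Guard from the start of resize(): only request a change that leaves at
 * least one process in the task PSet. Counts are passed as strings because
 * info values and the *_PROCS_ADD/SUB parameters are strings in the lesson. */
int resize_keeps_a_process(const char *size, const char *add, const char *sub)
{
    assert(size != NULL && add != NULL && sub != NULL);
    return atoi(size) + atoi(add) - atoi(sub) > 0;
}

/* Classification performed after the new PSet name has been looked up. */
int classify_resize(const char *prev_pset, const char *new_pset,
                    const char *included)
{
    assert(prev_pset != NULL && new_pset != NULL && included != NULL);
    if (strcmp(prev_pset, new_pset) == 0)
        return PROC_NO_ACTION;   /* PSet name unchanged: nothing to do */
    if (strcmp(included, "False") == 0)
        return PROC_TERMINATE;   /* not a member of the new PSet */
    return PROC_RESIZE;          /* member: rebuild the communicator */
}
```

Keeping this logic free of MPI calls makes it possible to check the three outcomes without starting a dynamic job.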


Running the examples

Let’s take the role of the resource manager and play around with the reconfiguration parameters.

FORK parameters:

  • SPLIT_RATIO 0.5

Task 2A parameters:

  • TASK2A_SCALABILITY 1

  • TASK2A_PROCS_ADD “0”

  • TASK2A_PROCS_SUB “0”

Task 2B parameters:

  • TASK2B_SCALABILITY 1

  • TASK2B_PROCS_ADD “0”

  • TASK2B_PROCS_SUB “0”

Example 1:

Baseline Example => Balanced tasks

As a baseline we assume tasks 2A and 2B to have equal scalability and we perform an even split.

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ time mpirun -np 6 --host n1:1,n2:1,n3:1,n4:1,n5:1,n6:1 ./build/DynMPISessions_minimal_fork-join_release

== 29.416s ==

Example 2:

Unbalanced Tasks => No countermeasures

Now we assume task 2A scales perfectly, while task 2B scales with factor 0.3, and we again perform an even split.

To get a feeling for a common development workflow with this setup, we change the parameters directly in the source code.

On your LOCAL host:

  • Open the file $DOCKER_CLUSTER_DIR/build/mpi_tests/examples/DynMPISessions_minimal_fork-join.cpp in your preferred IDE.

  • Comment out the current parameter settings and uncomment the parameter settings for Example 2.

Inside the docker cluster:

Reinstall the mpi_tests package:

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ dynpkgs pkg_reinstall mpi_tests

Then rerun the example:

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ time mpirun -np 6 --host n1:1,n2:1,n3:1,n4:1,n5:1,n6:1 ./build/DynMPISessions_minimal_fork-join_release

== 52.812s ==

Task 2B takes much longer than task 2A. Since task 3 cannot start before task 2B has finished, the makespan is large.
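The effect can be stated as a simple formula: the makespan is the duration of task 1, plus the longer of the two parallel tasks, plus the duration of task 3. The following sketch assumes that reconfiguration and startup overheads are negligible. The function names are hypothetical and any durations used with it are illustrative inputs, not measurements from the runs above:

```c
#include <assert.h>

/* Makespan of the fork-join pattern: task 1, then tasks 2A and 2B in
 * parallel, then task 3. Task 3 can only start once the slower of the two
 * parallel tasks has finished. All durations are in seconds. */
double fork_join_makespan(double t1, double t2a, double t2b, double t3)
{
    assert(t1 >= 0 && t2a >= 0 && t2b >= 0 && t3 >= 0);
    double t2 = (t2a > t2b) ? t2a : t2b;
    return t1 + t2 + t3;
}

/* Time the processes of the faster task sit idle while waiting for the join. */
double fork_join_idle_time(double t2a, double t2b)
{
    assert(t2a >= 0 && t2b >= 0);
    return (t2a > t2b) ? t2a - t2b : t2b - t2a;
}
```

Reducing the makespan therefore means shortening the slower of the two parallel tasks, which is what the following examples try with additional or shifted resources.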

Example 3

Unbalanced Tasks => Grow Task 2B

Let’s try to improve this by adding additional resources to task 2B.

Similar to the previous example, adapt the source file, reinstall the package and run the example (providing 2 additional hosts).

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ time mpirun -np 6 --host n1:1,n2:1,n3:1,n4:1,n5:1,n6:1,n7:1,n8:1 ./build/DynMPISessions_minimal_fork-join_release

== 44.961s ==

Example 4

Unbalanced Tasks => Move resources from Task 2A to Task 2B

In this example, instead of adding additional resources, let’s remove resources from task 2A and add them to task 2B:

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ time mpirun -np 6 --host n1:1,n2:1,n3:1,n4:1,n5:1,n6:1 ./build/DynMPISessions_minimal_fork-join_release

== 45.161s ==

Almost the same performance as in the previous example, but no additional resources are required.

Example 5

Unbalanced Tasks => Adjust split ratio

The imbalance is caused by a split operation that ignores the scalability of the tasks. If we have a rough idea of the scalability of tasks 2A and 2B, we can split the resources in a smarter way.

Adjust, reinstall, rerun:

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ time mpirun -np 6 --host n1:1,n2:1,n3:1,n4:1,n5:1,n6:1 ./build/DynMPISessions_minimal_fork-join_release

== 41.403s ==
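A split that takes a ratio into account only changes the value passed for the mpi_part_sizes key. The sketch below derives the two part sizes from a ratio and formats them in the same "a,b" form that the fork() function above uses for an even split. How the tutorial code actually consumes SPLIT_RATIO is not shown in this lesson, so the function name format_part_sizes and the rounding and clamping rules are assumptions:

```c
#include <assert.h>
#include <stdio.h>

/* Format the value for the "mpi_part_sizes" info key of a two-way SPLIT.
 * `ratio` is the fraction of processes assigned to the first part.
 * Each part keeps at least one process (assumed policy).
 * Returns 0 on success, -1 if the PSet is too small to split or the
 * buffer is too short for the formatted string. */
int format_part_sizes(int size, double ratio, char *buf, size_t buflen)
{
    assert(buf != NULL && ratio >= 0.0 && ratio <= 1.0);
    if (size < 2) return -1;
    int first = (int) (size * ratio);        /* truncates towards zero */
    if (first < 1) first = 1;
    if (first > size - 1) first = size - 1;
    int n = snprintf(buf, buflen, "%d,%d", first, size - first);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

With a ratio of 0.5 this yields the same sizes as the even split in fork(), while a smaller ratio assigns more processes to the second part.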

Example 6

Unbalanced Tasks => Adjust split ratio & grow Task 2B

However, we detect that the split decision is still not optimal and we have idle resources. Let’s add them to task 2B.

Adjust, reinstall, rerun:

(tutorial_dynreshpc26) [mpiuser@n1 mpi_tests]$ time mpirun -np 6 --host n1:1,n2:1,n3:1,n4:1,n5:1,n6:1,n7:1,n8:1 ./build/DynMPISessions_minimal_fork-join_release

== 37.367s ==

Ongoing & Future Work:

  • Discussions in MPI Sessions WG

  • Intra-node process changes