[PIPE2D-1038] Investigate parsl plugin for processing on cluster Created: 03/May/22  Updated: 09/Jul/22  Resolved: 09/Jul/22

Status: Done
Project: DRP 2-D Pipeline
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Story Priority: Normal
Reporter: price Assignee: price
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: 2DDRP-2022 E

 Description   

The Gen3 middleware will allow us to run our pipeline over multiple nodes. We need a middleware plugin that will work with our Slurm (or PBS) clusters. The leading contender is the parsl plugin developed by DESC.



 Comments   
Comment by price [ 09/Jul/22 ]

Running BPS on Tiger:

  • Stack requirements:
    + Recent LSST stack
    + tickets/DM-35494 of ctrl_mpexec and pipe_base
    + ctrl_bps_parsl
    + Parsl (see ctrl_bps_parsl README)
  • Docs:
    + BPS guide: https://pipelines.lsst.io/modules/lsst.ctrl.bps/index.html
    + Parsl plugin guide: https://github.com/lsst/ctrl_bps_parsl/blob/main/README.md
  • The choice of filesystem is very important for efficient cluster use (see https://researchcomputing.princeton.edu/support/knowledge-base/data-storage ):
    + /projects and /tigress are slow to access from cluster nodes, so work on /scratch/gpfs and include execution_butler_copy_files.yaml in your BPS config.
    + That include copies the necessary data from the primary butler repo on /projects to an “execution butler” in your working directory (on /scratch/gpfs), and copies the results back afterwards.
    + This adds some overhead at the beginning and end of the processing, but it keeps the processing itself efficient (and avoids degrading the filesystem for other users).
  • When running flat construction, I got about 100 ISR jobs per hour per core.
  • Consider strategies to obtain cluster resources:
    + Parsl launches Slurm jobs to obtain cluster resources; these jobs connect to the process running on the head node and execute the various tasks.
    + You don’t want to hold more resources than the workflow can use at once, but you also don’t want to wait a long time in the queue for resources.
    + Smaller and shorter jobs are easier to schedule, but they run fewer tasks at a time and add per-job overhead.
    + One strategy might be a single 6 hour “singleton” job, so only one job is allowed to run at a time while the others hold their place in the queue (sketched in the next item).
    + Another strategy might be multiple 2 hour jobs, which the workflow uses as they become available.
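    + A minimal sketch of the singleton strategy as a site entry, assuming the singleton option documented in the ctrl_bps_parsl README (option names and values here are illustrative, not verified):
      site:
        tiger:
          class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
          nodes: 1
          cores_per_node: 40
          walltime: "6:00:00"  # one long job rather than many short ones
          singleton: true      # assumption per the README: only one such job runs at a time
      # For the multiple-short-jobs strategy, drop singleton and use a shorter
      # walltime (e.g. "2:00:00"); how many jobs run concurrently depends on
      # the site class and the scheduler.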
  • Clustering quanta into larger per-job units (“clusters”) makes the pipeline more efficient: when running DRP, I recommend including ${DRP_PIPE_DIR}/ingredients/clusters.yaml (requires branch u/price/20220628 of drp_pipe).
  • Example BPS config file for flat construction:
    pipelineYaml: "${CP_PIPE_DIR}/pipelines/DarkEnergyCamera/cpFlat.yaml"
    wmsServiceClass: lsst.ctrl.bps.parsl.ParslService
    #computeSite: local
    computeSite: tiger
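    # Copy inputs from the primary butler repo to an execution butler in the
    # working directory, and copy the results back afterwards (see the
    # filesystem note above):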
    includeConfigs:
      - ${CTRL_BPS_PARSL_DIR}/etc/execution_butler_copy_files.yaml
    payload:
      output: u/price/20220629/flat
      butlerConfig: /projects/MERIAN/repo
      inCollection: DECam/defaults/merian,DECam/calib/curated/19700101T000000Z,DECam/calib/unbounded,u/price/20220629/calib
      dataQuery: "instrument='DECam' AND exposure IN (970228, 970229, 970230, 970231, 970232, 970233, 970234, 970235, 970236, 970237, 970238, 970501, 970502, 970503, 970504, 970505, 970506, 970507, 970508, 970509, 970510, 970511, 970836, 970837, 970838, 970839, 970840, 970841, 970842, 970843, 970844, 970845, 970846, 971174, 971175, 971176, 971177, 971178, 971179, 971180, 971181, 971182, 971183, 971184, 971554, 971555, 971556, 971557, 971558, 971559, 1052706, 1052707, 1052708, 1052709, 1052710, 1052711, 1052712, 1052713, 1052714, 1052715, 1052716, 1053093, 1053094, 1053095, 1053096, 1053097, 1053098, 1053099, 1053100, 1053101, 1053102, 1053103, 1053485, 1053486, 1053487, 1053488, 1053489, 1053490, 1053491, 1053492, 1053493, 1053494, 1053495, 1053858, 1053859, 1053860, 1053861, 1053862, 1053863, 1053864, 1053865, 1053866, 1053867, 1053868, 1054287, 1054288, 1054289, 1054290, 1054291, 1054292)"
      payloadName: flat
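    # Cluster quanta by dimensions: isr and cpFlatMeasure quanta that share
    # the same {exposure, detector} are grouped into a single job, per the
    # cluster section below.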
    clusterAlgorithm: lsst.ctrl.bps.quantum_clustering_funcs.dimension_clustering
    saveClusteredQgraph: true
    cluster:
        exposure_detector:
            pipetasks: isr,cpFlatMeasure
            dimensions: exposure,detector
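    # computeSite (above) selects which of these site definitions is used: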
    site:
      local:
        class: lsst.ctrl.bps.parsl.sites.Local
        cores: 12
      tiger:
        class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
        nodes: 1
        cores_per_node: 40
        walltime: "0:59:59"  # Get into tiger-test queue
    
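    Submit in the usual BPS way, e.g. bps submit flat.yaml (the file name is illustrative), per the BPS guide linked above.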
  • Example BPS config file for DRP on ci_hsc:
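    # clusters.yaml provides the quantum-clustering definitions recommended above: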
    includeConfigs:
      - ${DRP_PIPE_DIR}/ingredients/clusters.yaml
    
    pipelineYaml: "${DRP_PIPE_DIR}/pipelines/HSC/DRP-ci_hsc.yaml"
    wmsServiceClass: lsst.ctrl.bps.parsl.ParslService
    computeSite: local
    #computeSite: tiger
    site:
      local:
        class: lsst.ctrl.bps.parsl.sites.Local
        cores: 20
      tiger:
        class: lsst.ctrl.bps.parsl.sites.princeton.Tiger
        nodes: 1
        walltime: "0:59:59"
    
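    The same config runs on the submit machine (computeSite: local, up to 20 cores) or on the cluster (computeSite: tiger) with no other changes, which makes it easy to test locally before scaling up.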
Comment by price [ 09/Jul/22 ]

The parsl plugin works nicely. I extracted the useful parts into a plugin package, ctrl_bps_parsl, and was able to run workflows on the Princeton Tiger cluster.
