Distributed Computing User Tutorial


Preface

This is a tutorial for the CEPC-DIRAC grid computing system. Preliminary knowledge of Linux and CEPC v1 is needed. If you need help, please contact ZHAO Xianghu.

Step 1. Request a certificate and join the CEPC VO

Go to https://cagrid.ihep.ac.cn to request a certificate from the IHEP CA. A detailed guide with screenshots is attached: How to Request Certificate.pdf.

Approval takes 2 or 3 days.

When your application is approved, you will receive an email from ihepca@ihep.ac.cn with the serial number and DN. You can then import the certificate into your browser and save it to local disk as a .p12 file (e.g. yourCertificate.p12). A detailed guide with screenshots is attached: How to Get Certificate after Approval.pdf.

Use the following commands to generate userkey.pem and usercert.pem from the .p12 file and place them in $HOME/.globus,

$ openssl pkcs12 -in yourCertificate.p12 -out userkey.pem -nocerts
$ openssl pkcs12 -in yourCertificate.p12 -out usercert.pem -nokeys -clcerts

and then change their permissions to 400 and 600.
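
The placement and permission steps could be scripted roughly as follows. This is only a sketch: install_grid_cert is a hypothetical helper, not part of any grid tool, and it assumes that mode 400 is meant for userkey.pem and mode 600 for usercert.pem (the order in which the two files are named above).

```shell
# Sketch: copy the generated PEM files into a .globus directory and set permissions.
# Assumption: 400 applies to userkey.pem and 600 to usercert.pem.
# install_grid_cert is a hypothetical helper name.
install_grid_cert() (
    src_dir=$1                      # directory holding the generated .pem files
    dest_dir=${2:-$HOME/.globus}    # target directory, $HOME/.globus by default
    mkdir -p "$dest_dir" || exit 1
    cp "$src_dir/userkey.pem" "$src_dir/usercert.pem" "$dest_dir/" || exit 1
    chmod 400 "$dest_dir/userkey.pem" && chmod 600 "$dest_dir/usercert.pem"
)
```

Running install_grid_cert . from the directory where the openssl commands were executed would copy both files to $HOME/.globus. The copy fails rather than overwrites if a read-only userkey.pem is already present there.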

When you save the .p12 file and generate the .pem files, you will be prompted to set a password. Please write down your password somewhere for future use.

The PEM password you set when generating userkey.pem and usercert.pem will be used later, when you set up the grid environment for submitting jobs.

With the certificate in your web browser, go to https://voms.ihep.ac.cn:8443/voms/cepc/ and follow the guidelines to join the CEPC VO.

Step 2. Set up the environment

Run the following command to set up the environment:

source /cvmfs/dcomputing.ihep.ac.cn/frontend/dsub/setup/env_dsub.sh

If you use tcsh, the command should be

source /cvmfs/dcomputing.ihep.ac.cn/frontend/dsub/setup/env_dsub.csh

You will be prompted to enter the PEM password.
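
Before sourcing the setup script, it may help to confirm that the PEM files from Step 1 are in place. The sketch below assumes the setup script looks for them in $HOME/.globus, which this tutorial does not state explicitly. check_grid_cert is a hypothetical helper name.

```shell
# Sketch: verify that userkey.pem and usercert.pem exist and are non-empty.
# Assumption: the environment setup expects them in $HOME/.globus.
check_grid_cert() (
    dir=${1:-$HOME/.globus}         # directory to inspect
    status=0
    for f in userkey.pem usercert.pem; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    exit "$status"                  # 0 if both files are present, 1 otherwise
)
```

For bash users, check_grid_cert && source /cvmfs/dcomputing.ihep.ac.cn/frontend/dsub/setup/env_dsub.sh would then run the setup only when both files are found.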

Step 3. Prepare a job configuration file

A job configuration (cfg) file is a plain text file containing the parameter definitions for your job. You can look at the example job cfg file with:

vim $DSUBDOC/dsub-example/job.cfg

The content of this file, without comments, is:

job_type = cepc_sr
repo_dir = ./repo
work_dir = ./work
input_filelist = ./stdhep.list 
output_dir = test_001
evtmax = 10
job_group = 150116_CEPC_test_001

The comments in this file will explain the meanings of these parameters and how to modify their values to fit your situation.
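
To double-check a value before submitting, a small helper can read a single parameter back from the cfg file. This is a minimal sketch, assuming the file consists of key = value lines as in the example and that comment lines start with #. The comment syntax is an assumption, and cfg_get is a hypothetical helper, not part of dsub.

```shell
# Sketch: print the value of one parameter from a "key = value" cfg file.
# Assumptions: comment lines start with '#'; cfg_get is a hypothetical helper.
cfg_get() {
    # $1 = parameter name, $2 = path to the cfg file
    awk -v key="$1" '
        /^[ \t]*#/          { next }    # skip comment lines
        index($0, "=") == 0 { next }    # skip lines without "="
        {
            k = substr($0, 1, index($0, "=") - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)          # trim the key
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)      # trim the value
                print v
                exit
            }
        }
    ' "$2"
}
```

For example, cfg_get evtmax job.cfg prints the value assigned to evtmax, and prints nothing if the parameter is absent.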

Step 4. Submit jobs

Once you have prepared a job cfg file, run

dsub job.cfg

to submit jobs.

As a simple example, you can run a first test with the following steps:

cp -r $DSUBDOC/dsub-example .
cd dsub-example
dsub job.cfg

This submits 5 jobs with 10 events per job. The stdhep input is listed in the file stdhep.list.

Step 5. Monitor job status

Using a web browser with your certificate installed, go to https://dirac.ihep.ac.cn/ and choose "Applications" -> "Task Manager" from the menu bar to show your tasks. For individual jobs, choose the "Job Monitor" menu. You can use the job group to select the jobs you are interested in.

Step 6. Get output data

Currently, the output data will be written to

/cefs/tmp_storage/yant/gridfs/cepc/user/<initial>/<username>/<output_dir>/sim
/cefs/tmp_storage/yant/gridfs/cepc/user/<initial>/<username>/<output_dir>/rec

These directories are readable by all AFS users in the physics or higgs group, so you can read the files directly in your analysis jobs. However, old files in these directories are removed regularly to save disk space, so please copy the data to your backup directory in time.
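
The path construction and the backup copy can be sketched as below. Both function names are hypothetical, and the sketch assumes that <initial> is the first character of <username>, which the path template suggests but this tutorial does not spell out.

```shell
# Sketch: build the output path for a user, an output_dir, and a stage (sim or rec).
# Assumption: <initial> is the first character of <username>.
cepc_output_path() (
    # $1 = username, $2 = output_dir from the job cfg, $3 = sim or rec
    initial=$(printf '%s' "$1" | cut -c1)
    printf '%s\n' "/cefs/tmp_storage/yant/gridfs/cepc/user/$initial/$1/$2/$3"
)

# Sketch: copy one output directory into a backup directory, keeping timestamps.
backup_output() {
    # $1 = source directory, $2 = backup directory (created if missing)
    mkdir -p "$2" && cp -Rp "$1" "$2/"
}
```

A call such as backup_output "$(cepc_output_path "$USER" test_001 rec)" "$HOME/backup/test_001" would copy the rec directory into the chosen backup location. The backup path here is only an illustration.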

Step 7. Debug and get job logs

When a job has failed, you can select it and click the "Reschedule" button in the upper-right corner. Occasional errors can be resolved by rescheduling.

If you see "job.py Exited with status" followed by a number in the "Application Status" column, the error code has one of the following meanings:

  • 10 preparation error
  • 11 cvmfs not found
  • 20 simulation error
  • 21 DB connection failed
  • 30 reconstruction error
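
For scripted checks, the table above can be turned into a small lookup. This is only a convenience sketch: describe_job_status is a hypothetical helper, and it covers only the codes listed in this tutorial.

```shell
# Sketch: translate a job.py exit status into the meaning listed above.
# describe_job_status is a hypothetical helper name.
describe_job_status() {
    case "$1" in
        10) echo "preparation error" ;;
        11) echo "cvmfs not found" ;;
        20) echo "simulation error" ;;
        21) echo "DB connection failed" ;;
        30) echo "reconstruction error" ;;
        *)  echo "unknown status $1" ;;    # any code not listed in this tutorial
    esac
}
```

Calling describe_job_status 21 prints the meaning of status 21 from the list above.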

You can use the following commands to get the job logs:

getlog <jobID>
getlog <jobID1> <jobID2> <jobID3> ...
getlog -g <job_group>

The logs are useful for debugging. For example, if a simulation error occurs, you can check simu.log.