Friday, August 5, 2022

Transfer directory from EFS to S3 Glacier

1. Create an S3 bucket
> aws s3 mb s3://rv398-20220712

2. Copy EFS files to the S3 bucket
> aws s3 cp /mnt/efs/Joana3/Data s3://rv398-20220712/Data --recursive
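
Optional check: confirm the copy by listing the bucket contents and the total size.
> aws s3 ls s3://rv398-20220712/Data --recursive --summarize --human-readable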

3. Change bucket lifecycle
> cat lifecycle.json
{
  "Rules": [ 
    { 
      "ID": "Move to Glacier (all objects in bucket)", 
      "Prefix": "", 
      "Status": "Enabled", 
      "Transition": { 
        "Days": 0, 
        "StorageClass": "GLACIER" 
      } 
    } 
  ] 
}
> aws s3api put-bucket-lifecycle --bucket rv398-20220712 \
--lifecycle-configuration file://lifecycle.json

Note: It can take up to 24 hours for the objects to transition to Glacier
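
To check the transition later, the storage class of each object can be listed (it should show GLACIER once the lifecycle rule has run):
> aws s3api list-objects-v2 --bucket rv398-20220712 \
  --query 'Contents[].[Key,StorageClass]' --output text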


Thursday, March 31, 2022

BD Rhapsody analysis pipeline on AWS

Objective: Implement the BD Rhapsody analysis pipeline (WTA) on an AWS EC2 instance (without using Seven Bridges)

1. Create an EC2 instance (r5.12xlarge)

2. Mount EFS storage
> sudo yum install -y amazon-efs-utils
> sudo yum install -y nfs-utils
> sudo mkdir /mnt/efs

> sudo mount -t efs -o tls File_system_ID:/ /mnt/efs
where File_system_ID is the file system id (format "fs-XXXXXXXX")
> sudo chgrp -R ec2-user /mnt/efs
> sudo chmod -R g+w /mnt/efs
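
Optional: a quick check that the mount succeeded before writing data to it.
> df -hT /mnt/efs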

3. Install docker
> sudo yum install -y docker
> sudo service docker start
> sudo usermod -a -G docker ec2-user
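
Note: the new group membership only takes effect in a fresh login shell; either log out and back in, or open a subshell with the docker group, then test that docker runs without sudo.
> newgrp docker
> docker run hello-world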

4. install pip for python2
> sudo yum install -y python2-pip.noarch

5. install CWL-runner
> pip2 install cwlref-runner

6. download CWL and YML files
> curl -O https://bitbucket.org/CRSwDev/cwl/raw/2a9b10d03b02fd4b65b92f78cfe81b80253eff47/v1.10/rhapsody_wta_1.10.cwl
> curl -O https://bitbucket.org/CRSwDev/cwl/raw/2a9b10d03b02fd4b65b92f78cfe81b80253eff47/v1.10/template_wta_1.10.yml

7. Download reference file
> curl -O https://bd-rhapsody-public.s3.amazonaws.com/Rhapsody-WTA/GRCh38-PhiX-gencodev29/GRCh38-PhiX-gencodev29-20181205.tar.gz
> curl -O https://bd-rhapsody-public.s3.amazonaws.com/Rhapsody-WTA/GRCh38-PhiX-gencodev29/gencodev29-20181205.gtf

7b. Create a custom reference file (here for rhesus macaque, Mmul_10)
> docker run -v /mnt/efs:/mnt -t -i bdgenomics/rhapsody bash
> mkdir ggOverhang100
> STAR --runMode genomeGenerate \
       --runThreadN 8 \
       --genomeDir ggOverhang100 \
       --genomeFastaFiles /mnt/genome/Mmul_10/Sequence/genome.fa \
       --sjdbGTFfile /mnt/genome/Mmul_10/Annotation/genes.gtf \
       --sjdbOverhang 100
> tar -czvf ggOverhang100.tgz ggOverhang100/

8. Edit YML file
...

9. Launch CWL-runner
> cwl-runner --outdir /mnt/efs/rhapsody_test/output_test rhapsody_wta_1.10.cwl template_wta_1.10.yml
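
The pipeline runs for several hours; one way to keep it alive if the SSH session drops is to launch it with nohup and follow the log (cwl-runner.log is just a chosen log-file name).
> nohup cwl-runner --outdir /mnt/efs/rhapsody_test/output_test \
  rhapsody_wta_1.10.cwl template_wta_1.10.yml > cwl-runner.log 2>&1 &
> tail -f cwl-runner.log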

---
Supplementary step: upload file to Seven Bridges
> sb projects list
> sb upload start ggOverhang100.tgz --destination projectName --name rhesus-ref







Friday, November 26, 2021

Email notification for AWS batch

Goal: Get an email notification when an AWS Batch job (single or array) is completed. The challenges are (1) to send the email notification only to the user who launched the job and (2), for array jobs, to send a single notification once all the child jobs are completed.

Solution: For the array-job problem, create an event rule specific to the parent job. To notify only the submitting user, create a topic per launched job.

1. Create the AWS Batch array-job submission input (called here sleep.json)
> cat sleep.json
{
    "jobName": "sleep-job-1",
    "jobQueue": "job-queue-1",
    "jobDefinition": "job-def-1",
    "arrayProperties": {
        "size": 3
    },
    "containerOverrides": {
        "command": [
            "sleep",
            "30"
        ]
    },
    "timeout": {
        "attemptDurationSeconds": 7200
    }
}

2. Create a bash script that submits the AWS Batch job and captures the jobId (called here sleep.sh)
> cat sleep.sh
#!/bin/bash
cmd="aws batch submit-job"
cmd="$cmd --cli-input-json file://sleep.json"


jobid=$(eval $cmd | \
        grep jobId | \
        sed -r 's|.+jobId\": \"(.+)\"$|\1|g')
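
A simpler alternative (assuming an AWS CLI version with the --query option) is to let the CLI extract the jobId directly instead of using grep/sed:
jobid=$(aws batch submit-job \
        --cli-input-json file://sleep.json \
        --query 'jobId' --output text)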

3. Create an event rule (i.e. pattern to be matched) that is specific to the AWS array job
> cat rule.json
{
    "Name": "rule-1",
    "EventPattern": "{\"source\":[\"aws.batch\"],\"detail-type\":[\"Batch Job State Change\"],\"detail\":{\"jobId\":[\"JOBID\"],\"status\":[\"FAILED\",\"SUCCEEDED\"]}}",
    "State": "ENABLED",
    "Description": "rule for specific jobId",
    "EventBusName": "default"
}
> tail sleep.sh
sed -ri "s|\"jobId\\\\\":[^,]+|\"jobId\\\\\":[\\\\\"${jobid}\\\\\"]|g" \
rule.json
# register rule
aws events put-rule \
    --cli-input-json file://rule.json
4. Create a topic (i.e. communication channel) for that rule
> tail sleep.sh
# register topic
cmd="aws sns create-topic"
cmd="$cmd --name job-${jobid}"

topicArn=$(eval $cmd | \
grep TopicArn | \
sed -r 's|.+TopicArn\": \"(.+)\"$|\1|g')

5. Add an access policy to the topic to allow notifications to be published
> cat sleep.sh
attributeValue="{\"Version\":\"2012-10-17\",
                 \"Id\":\"__default_policy_ID\",
                 \"Statement\":[{
                    \"Sid\":\"__default_statement_ID\",
                    \"Effect\":\"Allow\",
                    \"Principal\":{\"AWS\":\"*\"},
                    \"Action\":[
                      \"SNS:GetTopicAttributes\",
                      \"SNS:SetTopicAttributes\",
                      \"SNS:AddPermission\",
                      \"SNS:RemovePermission\",
                      \"SNS:DeleteTopic\",
                      \"SNS:Subscribe\",
                      \"SNS:ListSubscriptionsByTopic\",
                      \"SNS:Publish\",
                      \"SNS:Receive\"],
               \"Resource\":\"${topicArn}\"},
              {\"Sid\":\"AWSEvents_rule-1_1\",
               \"Effect\":\"Allow\",
               \"Principal\":{\"Service\":\"events.amazonaws.com\"},
               \"Action\":\"sns:Publish\",
               \"Resource\":\"${topicArn}\"}]}"

aws sns set-topic-attributes \
   --topic-arn "$topicArn" \
   --attribute-name "Policy" \
   --attribute-value "$attributeValue"

6. Create a target (i.e. the resource to be invoked) for the rule
> cat sleep.sh
# add target to rule (the rule name must match the "Name" field in rule.json)
ruleName="rule-1"
aws events put-targets \
    --rule "$ruleName" \
    --targets "Id"=1,"Arn"="$topicArn"

PS: multiple rules can be linked to the same topic using different targets

7. Create subscription (mode of notification)

> cat sleep.sh
# create subscription
aws sns subscribe \
 --topic-arn "$topicArn" \
 --protocol "email" \
 --notification-endpoint "YourEmailAddress"
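
Once the notification has arrived, the per-job rule and topic can be deleted so they do not pile up (replace rule-1 and the topic ARN with the ones created for that job):
# clean up the per-job target, rule and topic
aws events remove-targets --rule "rule-1" --ids 1
aws events delete-rule --name "rule-1"
aws sns delete-topic --topic-arn "$topicArn"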

Friday, November 19, 2021

Globus transfer to AWS EFS

Topic: I needed to download files shared via Globus (globus.org) to an AWS EFS drive. Globus provides an option to share a directory to an AWS S3 bucket but not directly to EFS, and that option requires a paid Globus subscription (it does not work with a personal endpoint). The solution below (installing the Globus CLI and downloading the files from an EC2 instance with the EFS mounted) works even with a Globus personal endpoint. If the Globus CLI is already installed on your EC2 frontend, you can skip to step 6.

1. Create EC2 instance
I suggest an EC2 instance with high network bandwidth for a faster download (e.g., c5n instances offer up to 25 Gbps)
> aws ec2 run-instances \
--image-id ami-0beaa649c482330f7 \
--count 1 \
--instance-type c5n.2xlarge \
--key-name sfourat \
--security-group-ids sg-0d0e3364014a7fc7e sg-0173b74c97e48493e \
--subnet-id subnet-04f3da868a634843d \
--profile "tki-aws-account-310-rhedcloud/RHEDcloudAdministratorRole"
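
Optional: wait for the instance to come up and fetch its private IP for SSH (i-xxxxxxxxxxxxxxxxx is a placeholder for the InstanceId returned by run-instances; add the same --profile option if needed).
> aws ec2 wait instance-running --instance-ids i-xxxxxxxxxxxxxxxxx
> aws ec2 describe-instances --instance-ids i-xxxxxxxxxxxxxxxxx \
  --query 'Reservations[].Instances[].PrivateIpAddress' --output text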

2. Mount EFS
> sudo yum -y update
> sudo yum -y install amazon-efs-utils
> sudo yum -y install nfs-utils
> sudo mkdir -p /mnt/efs
> sudo mount -t efs -o tls fs-57e8702f:/ /mnt/efs

3. Install Globus on your AWS EC2 frontend
> pip3 install globus-cli

4. Connect to Globus
> globus login --no-local-server

5. Authenticate in a local browser and get the Native App Authorization Code

> Please authenticate with Globus here:

> ------------------------------------

> https://auth.globus.org/v2/oauth2/authorize?client_id=...


6. Install Globus Connect Personal

> wget https://downloads.globus.org/globus-connect-personal/v3/linux/stable/globusconnectpersonal-latest.tgz

> tar -xzvf globusconnectpersonal-latest.tgz 


7. create an endpoint

> ./globusconnectpersonal -setup

Globus Connect Personal needs you to log in to continue the setup process.


We will display a login URL. Copy it into any browser and log in to get a

single-use code. Return to this command with the code to continue setup.


Login here:

-----

https://auth.globus.org/v2/oauth2/authorize?...


Input a value for the Endpoint Name: aws

registered new endpoint, id: ...


8. start endpoint

> ./globusconnectpersonal -start -restrict-paths rw/mnt/efs &


9. print all endpoints by current user

> globus endpoint search --filter-scope my-endpoints


10. Directory Listing

> globus ls 'endpointUUID:/'


11. Start transfer

> globus transfer shared-endpoint:/ myendpoint:/
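
globus transfer prints a task id, and the transfer can then be monitored from the CLI (TASK_ID below is that id; depending on the globus-cli version, a --recursive flag may be needed when the source path is a directory).
> globus task list
> globus task show TASK_ID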

Wednesday, October 13, 2021

Upload large files to GitHub using Piggyback

Goal: Upload large files to GitHub without having to use (and pay for) Git LFS.

Solution: Use the R package `piggyback` (CRAN)

1. First identify files larger than 100 MB in your GitHub repo
> find . -type f -exec du -a {} + | grep -v .git | awk '$1 > 1e5'
295840  ./input/genes.gtf
410016  ./utils/msigdb_v7.4.xml

2. install and load package piggyback
R> install.packages("piggyback")
R> library(package = "piggyback")

3. Generate GitHub personal token
a. GitHub > Settings

b. Settings > Developer settings

c. Developer settings > Personal access tokens

4. Set your GitHub personal token for piggyback
R> Sys.setenv(GITHUB_TOKEN="...")

5. Create a new release of your package (works also for private repositories)
R> pb_new_release(repo = "sekalylab/fluomics.hypertension", 
                  tag  = "v0.0.1")

6. upload the large files to GitHub
R> pb_upload(file = "input/genes.gtf", 
             repo = "sekalylab/fluomics.hypertension",
             tag  = "v0.0.1")
R> uploading genes.gtf ...
R> pb_upload(file = "utils/msigdb_v7.4.xml", 
             repo = "sekalylab/fluomics.hypertension",
             tag  = "v0.0.1")
R> uploading msigdb_v7.4.xml ...

7. add large files to gitignore
> echo "input/genes.gtf" > .gitignore
> echo "utils/msigdb_v7.4.xml" >> .gitignore

8. check on GitHub that the upload was done and that the files are available
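
The check can also be done from the command line with the GitHub releases API (same repo and tag as above; this assumes GITHUB_TOKEN is also exported in the shell):
> curl -s -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/repos/sekalylab/fluomics.hypertension/releases/tags/v0.0.1 | \
  grep browser_download_url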




Tuesday, September 28, 2021

mRNASeq on Emory AWS tutorial

1. connect to Emory VPN (BIG-IP)

2. SFTP the FASTQ files to EFS storage
> sftp -i userName.pem username@10.65.122.170
> cd /mnt/efs
> mkdir myDirectoryName
> cd myDirectoryName
> put *.fq.gz

3. connect to frontend machine
> ssh -i userName.pem username@10.65.122.170

4. create temporary AWS CLI credentials
> tki
Username: emoryUserName
Password: emoryPassword
Available Duo Authentication Methods: 1. auto
Available Regions: 2. us-east-2


Accept the push/sms/phone call.
This will create AWS CLI credentials valid for 12h (in $HOME/.aws/credentials).
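
A quick way to confirm the temporary credentials are active:
> aws sts get-caller-identity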

5. change permission of the log file to allow subsequent tki calls
> chmod g+w /mnt/efs/bin/emory-tki/logs/tkiclient.log

6. clone the mRNAseq pipeline
> git clone --branch aws https://github.com/sekalylab/mRNASeq.git

7. launch the mRNASeq pipeline
> bash mRNA.preprocess_master.sh -d /mnt/efs/myDirectoryName

8. download the output files to your local machine
> sftp -i userName.pem username@10.65.122.170
> cd /mnt/efs/myDirectoryName
> get *_genecounts


9. empty the EFS storage
> rm -r /mnt/efs/myDirectoryName
