Month: January 2018

  • Use s3cmd to Download Requester Pays Buckets on S3

    List the pdf prefix itself:

    $ s3cmd ls --requester-pays s3://arxiv/pdf
                           DIR   s3://arxiv/pdf/
    

    List files under pdf:

    $ s3cmd ls --requester-pays s3://arxiv/pdf/\*
    2010-07-29 19:56 526202880   s3://arxiv/pdf/arXiv_pdf_0001_001.tar
    2010-07-29 20:08 138854400   s3://arxiv/pdf/arXiv_pdf_0001_002.tar
    2010-07-29 20:14 525742080   s3://arxiv/pdf/arXiv_pdf_0002_001.tar
    2010-07-29 20:33 156743680   s3://arxiv/pdf/arXiv_pdf_0002_002.tar
    2010-07-29 20:38 525731840   s3://arxiv/pdf/arXiv_pdf_0003_001.tar
    2010-07-29 20:52 187607040   s3://arxiv/pdf/arXiv_pdf_0003_002.tar
    2010-07-29 20:58 525731840   s3://arxiv/pdf/arXiv_pdf_0004_001.tar
    2010-07-29 21:11  44851200   s3://arxiv/pdf/arXiv_pdf_0004_002.tar
    2010-07-29 21:14 526305280   s3://arxiv/pdf/arXiv_pdf_0005_001.tar
    2010-07-29 21:27 234711040   s3://arxiv/pdf/arXiv_pdf_0005_002.tar
    ...
    

    Get all files under pdf:

    $ s3cmd get --requester-pays s3://arxiv/pdf/\*
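Fetching the whole prefix in one command can take a long time, so it may help to know which archives are still missing after an interrupted run. The following is a minimal sketch, not part of the original notes: `pending_downloads` is a hypothetical helper that reads a saved `s3cmd ls` listing and prints the URIs whose file names are not yet in a local directory. It only checks that a file exists, so a partially downloaded file would be treated as complete.

```shell
# pending_downloads LISTING DIR
# Print the S3 URIs from a saved `s3cmd ls` listing whose file names are not
# yet present in DIR. Hypothetical helper; assumes object names without spaces.
pending_downloads() {
    listing=$1
    dir=$2
    # Column 4 of an object line is the s3:// URI; DIR lines have no column 4.
    awk '$4 ~ /^s3:\/\// { print $4 }' "$listing" | while IFS= read -r uri; do
        # ${uri##*/} strips everything up to the last slash, leaving the file name.
        [ -e "$dir/${uri##*/}" ] || printf '%s\n' "$uri"
    done
}

# Example use (pdf_files.txt is an assumed listing saved beforehand):
#   s3cmd ls --requester-pays s3://arxiv/pdf/\* > pdf_files.txt
#   pending_downloads pdf_files.txt . | xargs -n 1 s3cmd get --requester-pays
```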
    

    Save the listing of everything under src to a text file:

    $ s3cmd ls --requester-pays s3://arxiv/src/\* > all_files.txt
    

    Calculate the total size (in GB) and the average file size (in bytes):

    $ awk '{s += $3} END { print "sum is", s/1000000000, "GB, average is", s/NR }' all_files.txt
    sum is 844.626 GB, average is 4.80447e+08
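The one-liner above divides by `NR`, so any non-object line in the listing (such as a `DIR` entry) would skew the average. As a hedged variant, the sketch below counts only lines whose third column is a number. `summarize_listing` is a name made up for illustration, and the units (GB as 10^9 bytes, MB as 10^6 bytes) are a choice, not something the original specifies.

```shell
# summarize_listing FILE
# Print the object count, total size in GB (10^9 bytes), and mean size in MB
# (10^6 bytes) for a saved `s3cmd ls` listing. Hypothetical helper.
summarize_listing() {
    awk '$3 ~ /^[0-9]+$/ { n++; s += $3 }   # only lines with a numeric size column
         END {
             if (n == 0) { print "no objects found"; exit 1 }
             printf "%d objects, %.3f GB total, %.1f MB average\n",
                    n, s / 1000000000, s / n / 1000000
         }' "$1"
}

# Example use:
#   summarize_listing all_files.txt
```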
    
  • Install Tsunami UDP on CentOS 7

    Install dependencies:

    $ yum install cvs git gcc automake autoconf libtool -y
    

    Download Tsunami UDP:

    $ cd /tmp
    $ cvs -z3 -d:pserver:anonymous@tsunami-udp.cvs.sourceforge.net:/cvsroot/tsunami-udp co -P tsunami-udp
    $ cd tsunami-udp
    $ ./recompile.sh
    $ make install
    

    Then on the server side:

    $ tsunamid --port 46224 * # (Serves all files from current directory for copy)
    

    On the client side:

    $ tsunami connect <server_ip> get *
    

    Transfer the dataset back to S3:

    $ aws s3 cp --recursive /mnt/bigephemeral s3://<your-new-bucket>/
    

    Limitations:

    • Tsunami UDP transfers only individual files, not directories or subdirectories, so a directory tree must first be packed into a single tar file (plan for the additional storage this requires).
    • Multi-threading is not supported.
    • Multiple sessions are not supported: the client holds only one connection to the server at a time, so files cannot be transferred in parallel.
    • Failed file transfers cannot be resumed or retried.
    • Native encryption is not supported.
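Because directories have to be packed into one tar file and a failed transfer cannot be resumed, it may be worth recording a checksum before sending. This is a minimal sketch under those assumptions: `pack_dir` is a hypothetical helper, and using `sha256sum` for verification is a suggestion here, not part of Tsunami UDP.

```shell
# pack_dir SRC_DIR OUT_TAR
# Archive SRC_DIR (including subdirectories) into a single tar file and write
# a SHA-256 checksum file next to it. Hypothetical helper for illustration.
pack_dir() {
    src=$1
    out=$2
    # -C makes the paths inside the archive relative to the parent of SRC_DIR.
    tar -cf "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Record the checksum with a relative file name so it can be checked anywhere.
    ( cd "$(dirname "$out")" &&
      sha256sum "$(basename "$out")" > "$(basename "$out").sha256" )
}

# Example use (paths are placeholders):
#   pack_dir /mnt/bigephemeral/dataset /mnt/bigephemeral/dataset.tar
# After transferring both files, verify on the receiving side with:
#   sha256sum -c dataset.tar.sha256
```

The archive needs roughly as much free space as the directory itself, which is the additional storage mentioned in the first limitation.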

    Refs: