Big file transfer in Linux

  sonic0002        2013-01-10 05:55:50

It's very common to transfer files between two hosts, for example for backups. It is also a very simple task: scp or rsync handles it well. But a very big file can take a long time to transfer. How can we transfer a big file at high speed? Here we propose one solution.

Copy file

To copy one uncompressed file, we should follow the steps below:

  1. Compress data
  2. Send it to another host
  3. Uncompress the data
  4. Verify the data integrity

Compressing before sending reduces the amount of data on the wire, which saves bandwidth and usually shortens the transfer.
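The four steps can be tried out locally before a second host is involved. The following is a minimal sketch, assuming GNU coreutils (seq, sha256sum) is installed; a pipe stands in for the network, and the temporary files are placeholders rather than paths from this article.

```shell
#!/bin/sh
# Local dry run of the four steps; a pipe stands in for the network link.
src=$(mktemp)   # placeholder for the file to send
dst=$(mktemp)   # placeholder for the copy on the "remote" side
seq 1 100000 > "$src"                  # generate compressible sample data

# Steps 1-3: compress, send through the pipe, uncompress on the other end.
gzip -c "$src" | gunzip -c > "$dst"

# Step 4: verify integrity by comparing checksums of source and copy.
src_sum=$(sha256sum < "$src" | cut -d' ' -f1)
dst_sum=$(sha256sum < "$dst" | cut -d' ' -f1)
if [ "$src_sum" = "$dst_sum" ]; then
    echo "checksums match"
else
    echo "checksum mismatch" >&2
fi

rm -f "$src" "$dst"                    # clean up the temporary files
```

Between two real hosts, the second checksum would be computed on the receiving side, for example by running sha256sum there over ssh.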

With ZIP+SCP

We can combine gzip and SSH to achieve this.

gzip -c /home/yankay/data | ssh yankay01 "gunzip -c - > /home/yankay/data"

This command uses gzip to compress /home/yankay/data, sends the compressed stream to host yankay01 through ssh, and decompresses it there with gunzip.

The data file is 1.1GB and shrinks to 183MB after gzip compression. The above command takes 45.6s, an average throughput of 24.7MB/s. scp has compression capability as well, so we can write the above command as:

scp -C -c blowfish /home/yankay/data yankay01:/home/yankay/data

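The end result of both commands above is the same. The difference is that here -C tells scp to compress the data itself, and -c blowfish selects Blowfish as the encryption cipher, which is faster than the default cipher. (Blowfish is a cipher, not a compression algorithm; recent OpenSSH releases have dropped it, so a cipher that is still supported would have to be substituted there.)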

This command also takes about 45s, an average throughput of 24MB/s, which is not much of an improvement. It seems the bottleneck is not on the network side.

Then what is the bottleneck?

Performance analysis

We need to define some variables:

  • The compression ratio of the compression tool (compressed size divided by original size) is CompressRatio
  • The compression throughput is CompressSpeed MB/s
  • The throughput of the network is NetSpeed MB/s

Because we use a pipe, the performance of the pipeline depends on its slowest component, so the overall performance is:

speed = min(NetSpeed / CompressRatio, CompressSpeed)

When the compression throughput is less than the effective network throughput (NetSpeed / CompressRatio), the bottleneck is the compression; otherwise, it is the network.

We have our test data below. The last three columns give the overall speed from the formula at network speeds of 100MB/s, 62MB/s and 10MB/s; all speeds are in MB/s:

  Algorithm   Compression ratio   Compression throughput   100MB/s   62MB/s   10MB/s
  ZLIB        35.80%              9.6                      9.6       9.6      9.6
  LZO         54.40%              101.7                    101.7     101.7    18.4
  LIBLZF      54.60%              134.3                    134.3     113.6    18.3
  QUICKLZ     54.90%              183.4                    182.1     112.9    18.2
  FASTLZ      56.20%              134.4                    134.4     110.3    17.8
  SNAPPY      59.80%              189                      167.2     103.7    16.7
  NONE        100%                300                      100       62       10
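The model can be turned into a small calculator to see where the bottleneck falls for a given setup. This is only a sketch: the function names pipeline_speed and bottleneck are made up for illustration, and the sample inputs are the QuickLZ and ZLIB figures from the test data in this section.

```shell
#!/bin/sh
# pipeline_speed NET RATIO COMP: print min(NET / RATIO, COMP) in MB/s.
# NET   = NetSpeed in MB/s
# RATIO = CompressRatio as a fraction (0.549 for 54.9%)
# COMP  = CompressSpeed in MB/s
pipeline_speed() {
    awk -v n="$1" -v r="$2" -v c="$3" 'BEGIN {
        s = n / r                 # uncompressed data the network can carry
        if (c < s) s = c          # the slower stage limits the pipeline
        printf "%.1f\n", s
    }'
}

# bottleneck NET RATIO COMP: print which stage limits the pipeline.
bottleneck() {
    awk -v n="$1" -v r="$2" -v c="$3" 'BEGIN {
        if (c < n / r) print "compression"; else print "network"
    }'
}

# QuickLZ (54.9%, 183.4 MB/s) at the three network speeds, then ZLIB at 100 MB/s.
for net in 100 62 10; do
    echo "QuickLZ at $net MB/s: $(pipeline_speed "$net" 0.549 183.4) MB/s, limited by $(bottleneck "$net" 0.549 183.4)"
done
echo "ZLIB at 100 MB/s: $(pipeline_speed 100 0.358 9.6) MB/s, limited by $(bottleneck 100 0.358 9.6)"
```

With a slow compressor such as ZLIB the compression stage is the limit at every network speed tested, while a fast compressor pushes the limit back onto the network.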

We can see that when the network speed is 100MB/s, QuickLZ has the best performance. If we use SSH as the data transfer protocol, it will not achieve the best performance. At 10MB/s, all the fast algorithms have almost the same performance (about 17 to 18MB/s), because the network is the bottleneck and only the compression ratio matters.

For different data and hosts, the best algorithm is also different, but one thing is certain: the compressor should be fast enough that the bottleneck is on the network side.
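Since the numbers depend on the data and the host, it can help to measure CompressRatio and CompressSpeed on a sample of your own data. The sketch below assumes GNU date (for the %N nanosecond format) and uses gzip only because it is installed almost everywhere; measure_gzip is a hypothetical helper name, and the same approach should work for any compressor that writes to standard output.

```shell
#!/bin/sh
# measure_gzip FILE: print "CompressRatio CompressSpeed" for gzip on FILE.
# The ratio is compressed size / original size; the speed is in MB/s of input.
measure_gzip() {
    file=$1
    orig_bytes=$(wc -c < "$file")
    start=$(date +%s.%N)                      # GNU date: seconds.nanoseconds
    comp_bytes=$(gzip -c "$file" | wc -c)     # compress and count output bytes
    end=$(date +%s.%N)
    awk -v o="$orig_bytes" -v c="$comp_bytes" -v s="$start" -v e="$end" 'BEGIN {
        t = e - s
        if (t <= 0) t = 0.000001              # guard against a zero interval
        printf "%.3f %.1f\n", c / o, o / 1048576 / t
    }'
}

sample=$(mktemp)                 # placeholder sample file
seq 1 200000 > "$sample"         # stand-in data; use a slice of the real file instead
result=$(measure_gzip "$sample")
echo "CompressRatio CompressSpeed(MB/s): $result"
rm -f "$sample"
```

The timing includes reading the file and counting the output, so it is a rough estimate rather than a benchmark.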

Conclusion

According to the above analysis, we should not use SSH as the network transfer protocol; we can use nc (netcat) to improve the performance. And we can use qpress, a command-line tool based on QuickLZ, for the compression.

scp /usr/bin/qpress yankay01:/usr/bin/qpress
ssh yankay01 "nc -l 12345 |  qpress -dio > /home/yankay/data" &
qpress -o /home/yankay/data |nc yankay01 12345

The first line installs qpress on the remote machine, the second line listens on a port with nc and decompresses the incoming data, and the third line compresses the data and sends it.

The above commands take 2.8s to execute, an average throughput of 402MB/s, which is about 16 times faster than ZIP+SCP.
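The throughput and speedup figures can be cross-checked with simple arithmetic. In this sketch the helper names are made up for illustration, and the file size of 1126.4MB is an assumption (1.1GB taken as 1.1 x 1024MB).

```shell
#!/bin/sh
# throughput_mb_s SIZE_MB SECONDS: average throughput in MB/s.
throughput_mb_s() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f\n", m / t }'
}

# speedup OLD_SECONDS NEW_SECONDS: how many times faster the new run is.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

size_mb=1126.4   # assumption: 1.1 GB expressed as 1.1 * 1024 MB

echo "gzip over ssh:  $(throughput_mb_s "$size_mb" 45.6) MB/s"
echo "qpress over nc: $(throughput_mb_s "$size_mb" 2.8) MB/s"
echo "speedup:        $(speedup 45.6 2.8)x"
```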

Source : http://www.yankay.com/linux%E5%A4%A7%E6%96%87%E4%BB%B6%E4%BC%A0%E8%BE%93/
