In this blog post I’m going to talk about AWS DataSync, a tool that Amazon introduced at re:Invent 2018 to simplify moving your data between on-premises storage and the AWS cloud. The core reason for introducing this tool was that the growing migration of critical workloads to the cloud drove the need to move increasingly large datasets along with those workloads. AWS DataSync focuses on online data transfer scenarios, which include use cases such as migration of active application data, recurring transfers for data processing, and disaster recovery.
Online data transfer scenarios addressed by DataSync – Picture from AWS DataSync launch presentation at re:Invent 2018
Migrations of active application data mean that you need to deal with data that is in a constant state of flux, with no natural break point for a one-time transfer.
Transfers for timely in-cloud processing require automated, accelerated transfers that copy data to S3 or EFS securely and reliably for analysis (think of DNA sequencing, video production, GIS, or seismographic data, which these days often must be processed in the cloud).
Data replication requires periodic incremental copies to reduce bandwidth usage and minimize data loss in the event of an on-premises failure.
Targeting the scenarios mentioned above, AWS DataSync is an addition to Amazon’s set of online data transfer tools, which also includes S3 Transfer Acceleration (a feature of S3 storage that uses edge locations for S3-enabled applications), Kinesis Data Firehose (loading streaming data into S3), and AWS Transfer for SFTP (managed file transfers into S3).
DataSync addresses the problems of transferring large, active data sets by supporting incremental transfers and end-to-end data validation. More specifically, it automates large data set transfers between your on-premises storage and Amazon S3 or Amazon Elastic File System (EFS), with benefits in five areas: speed, ease of use, security and reliability, cloud integration, and cost efficiency.
Let’s look more carefully at DataSync features and benefits.
Fast data transfer. DataSync uses a purpose-built transfer protocol encapsulated in TLS, combined with a multithreaded design, to combat network latency and scale across multiple agents. A single agent can transfer data at up to 10 Gbps, and you can achieve higher speeds by using multiple agents.
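To get a feel for what the 10 Gbps figure means in practice, here is a small back-of-the-envelope sketch. It is only arithmetic, not anything DataSync provides: the function name, the `efficiency` factor, and the assumption that throughput scales linearly with the number of agents are all mine, and real transfers will also depend on compression, file sizes, and source storage speed.

```python
def estimate_transfer_hours(dataset_tib: float, agent_gbps: float = 10.0,
                            agents: int = 1, efficiency: float = 1.0) -> float:
    """Rough lower bound on transfer time for a dataset of `dataset_tib` TiB.

    Assumes (hypothetically) that agents scale linearly and that `efficiency`
    is the fraction of line rate actually achieved (1.0 = full line rate).
    """
    if dataset_tib < 0 or agent_gbps <= 0 or agents < 1 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    total_bits = dataset_tib * 2**40 * 8                   # TiB -> bytes -> bits
    bits_per_second = agent_gbps * 1e9 * agents * efficiency
    return total_bits / bits_per_second / 3600             # seconds -> hours


if __name__ == "__main__":
    # 100 TiB over one agent at full line rate, then over two agents.
    print(f"{estimate_transfer_hours(100):.1f} h")
    print(f"{estimate_transfer_hours(100, agents=2):.1f} h")
```

In other words, even at the full rate of a single agent, 100 TiB takes roughly a day to move, which is why the data-reduction features described next matter.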
Incremental transfers, inline compression, and sparse file detection reduce the amount of data transferred over the network. Read and write optimizations for Amazon S3 and EFS speed up I/O against these services, so you can get data into or out of them as fast as possible.
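The general idea behind an incremental transfer can be sketched in a few lines. This is an illustration of the concept only, under my own assumption that files are compared by size and modification time; it is not a description of how DataSync itself decides what to copy, and the names `FileMeta` and `files_to_transfer` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-file metadata: (size in bytes, modification time).
FileMeta = Tuple[int, float]


def files_to_transfer(source: Dict[str, FileMeta],
                      dest: Dict[str, FileMeta]) -> List[str]:
    """Return source paths that are missing from, or differ from, the destination."""
    return sorted(path for path, meta in source.items() if dest.get(path) != meta)


if __name__ == "__main__":
    source = {"a.dat": (10, 1.0), "b.dat": (20, 2.0), "c.dat": (5, 3.0)}
    dest = {"a.dat": (10, 1.0), "b.dat": (20, 1.5)}
    # b.dat changed and c.dat is new, so only those two would be copied.
    print(files_to_transfer(source, dest))
```

Only the changed and new files cross the network on each run, which is what keeps recurring transfers cheap compared to a full copy.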
Bundled together, these features allow transfers fast enough that your source storage speed may become the limiting factor. DataSync also lets you set a configurable throughput limit to cap the bandwidth that transfer tasks can consume (you can choose between the “Use available” and “Set bandwidth (MiB/s)” options).
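If you manage tasks through the SDK rather than the console, the same choice can be expressed as a task option. The sketch below builds that option as a plain dictionary; it assumes, based on my reading of the DataSync API reference, that the limit is the `BytesPerSecond` field of a task's `Options` and that `-1` corresponds to “Use available”. The helper name `bandwidth_options` is hypothetical.

```python
from typing import Optional

MIB = 1024 * 1024  # bytes in one MiB


def bandwidth_options(limit_mib_per_s: Optional[float] = None) -> dict:
    """Build a task Options fragment for a bandwidth limit.

    None mirrors the console's "Use available" choice (assumed to be -1 in the API);
    a number mirrors "Set bandwidth (MiB/s)" and is converted to bytes per second.
    """
    if limit_mib_per_s is None:
        return {"BytesPerSecond": -1}
    if limit_mib_per_s <= 0:
        raise ValueError("bandwidth limit must be positive")
    return {"BytesPerSecond": int(limit_mib_per_s * MIB)}


if __name__ == "__main__":
    print(bandwidth_options())     # no limit
    print(bandwidth_options(50))   # cap at 50 MiB/s
    # With boto3 installed and credentials configured, this could be applied as:
    #   boto3.client("datasync").update_task(TaskArn=task_arn,
    #                                        Options=bandwidth_options(50))
    # where task_arn is the ARN of an existing task.
```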
Easy to use. DataSync can be set up and managed using the AWS Console, CLI, or SDK, and does not involve deploying or managing any additional infrastructure in AWS. The agent receives fully managed updates and patches, and DataSync preserves metadata between storage systems and services.
Security and reliability. Data is encrypted in transit using TLS 1.2, and AWS KMS is supported for encryption at rest. For reliability, DataSync validates data in transit and at rest and recovers automatically from I/O errors and transmission failures.
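To make the idea of end-to-end validation concrete, here is a minimal sketch of checking a copy against its source by comparing checksums. It is an analogy, not DataSync’s actual mechanism, which the launch material does not spell out; the choice of SHA-256 and the function names are my own assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source: Path, destination: Path) -> bool:
    """Return True when both files have identical content hashes."""
    return sha256_of(source) == sha256_of(destination)
```

A managed service does this kind of check for you on every transfer, which is a large part of what distinguishes it from a hand-rolled copy script.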
Cloud integration. DataSync has good support for AWS management, identity, and compliance tools. You can monitor your data transfers using Amazon CloudWatch and use detailed CloudWatch logs to track data movement. Access control and usage auditing are possible with AWS Identity and Access Management (IAM) and AWS CloudTrail. DataSync is also PCI-DSS compliant and HIPAA eligible.
Cost-efficiency. There are no minimum commitments or upfront fees for using this service, and the price is $0.04 per GB (see the AWS DataSync Pricing information from Amazon for more details). Essentially, you pay only for the data you move. The AWS infrastructure is fully managed and scaled based on usage, and on-premises agents are automatically updated and patched.
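With a flat per-GB price, estimating the transfer fee is simple multiplication. The sketch below uses the $0.04 per GB figure quoted above as a default; the function name is hypothetical, the rate may differ by region or change over time, and storage or request charges at the destination are not included.

```python
def estimate_cost_usd(gb_transferred: float, price_per_gb: float = 0.04) -> float:
    """Estimate the DataSync transfer fee in US dollars, rounded to cents.

    The default rate is the per-GB price quoted in this post; check the
    current pricing page before relying on it.
    """
    if gb_transferred < 0 or price_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    return round(gb_transferred * price_per_gb, 2)


if __name__ == "__main__":
    # Fee for moving 5,000 GB at the quoted rate.
    print(estimate_cost_usd(5000))
```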
With AWS DataSync you can use the same service for one-time data migrations, recurring data processing workflows and automated replication for data protection and recovery. Working with DataSync involves three main steps:
DataSync Console Create Task Wizard
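For readers who prefer the SDK to the console wizard, the flow can be sketched roughly as follows. I am assuming the three steps are: register the deployed agent, create a task from a source and a destination location, and start the task. The function takes a boto3 DataSync client as a parameter; the agent name, task name, NFS source, and all argument values are placeholders of mine, and a real setup also needs the agent VM deployed and activated beforehand.

```python
def run_transfer(client, activation_key: str, nfs_host: str, nfs_path: str,
                 bucket_arn: str, role_arn: str) -> str:
    """Register an agent, create a task, start it; return the execution ARN.

    `client` is expected to behave like boto3.client("datasync").
    """
    # Step 1 (assumed): register the already-deployed on-premises agent.
    agent_arn = client.create_agent(
        ActivationKey=activation_key, AgentName="onprem-agent",
    )["AgentArn"]

    # Step 2 (assumed): describe source and destination, then create the task.
    source_arn = client.create_location_nfs(
        ServerHostname=nfs_host,
        Subdirectory=nfs_path,
        OnPremConfig={"AgentArns": [agent_arn]},
    )["LocationArn"]
    dest_arn = client.create_location_s3(
        S3BucketArn=bucket_arn,
        S3Config={"BucketAccessRoleArn": role_arn},
    )["LocationArn"]
    task_arn = client.create_task(
        SourceLocationArn=source_arn,
        DestinationLocationArn=dest_arn,
        Name="onprem-to-s3",
    )["TaskArn"]

    # Step 3 (assumed): start the transfer.
    return client.start_task_execution(TaskArn=task_arn)["TaskExecutionArn"]
```

With boto3 installed and credentials configured, you would call it as `run_transfer(boto3.client("datasync"), ...)`, passing your own activation key, NFS server, bucket ARN, and IAM role ARN.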
I will try to cover these steps and the practical use of this tool in more detail in my next blog post, as this one was intended as a general overview of AWS DataSync. As you can see, AWS DataSync is an online transfer service that simplifies, automates, and accelerates data transfers between on-premises storage and AWS, combining the speed and reliability of network acceleration software with cost efficiency comparable to that of open source tools.
I hope this blog post has been informative. If you want to learn more about AWS DataSync, you can refer to the following resources:
AWS DataSync User Guide
AWS DataSync – Automate & accelerate online data transfer (video from AWS re:Invent 2018)
AWS DataSync API Reference
AWS DataSync FAQs
Services by Mikhail Rodionov