Customers of Amazon Web Services and Informatica can deploy a big data solution that fully integrates Big Data Management with the AWS cloud platform and the Amazon EMR cluster.
Several different methods are available for deploying Big Data Management:
Hybrid deployment
Install and configure the Informatica domain and Big Data Management on-premises, and configure them to push processing to the Amazon EMR cluster.
Manual cloud deployment
Manually install and configure the Informatica domain and Big Data Management on AWS EC2 instances in the same region as your Amazon EMR cluster, or deploy the domain on-premises.
Marketplace cloud deployment
Run a Big Data Management deployment from the AWS marketplace to create an Informatica domain and an Amazon EMR cluster in the AWS cloud. You can then explore Big Data Management functionality through prepackaged mappings.
The Big Data Management marketplace solution on AWS creates and connects the following resources in the VPC:
Informatica domain server on an EC2 instance, with additional instances to contain nodes in the Data Integration Service grid
Informatica clients on a remote Windows server, on a public subnet
EMR cluster
Amazon S3 storage resources, including S3 hosts for source and target data
Amazon RDS relational databases for Informatica domain repositories and optionally for source and target data
AWS security and account management services
AWS regions and Lambda functions
The marketplace solution includes prepackaged mappings that demonstrate various Big Data Management functionality.
The following diagram shows the architecture of the Big Data Management on AWS marketplace solution:
The numbers in the architecture diagram correspond to items in the following list:
A virtual private cloud (VPC) to contain the Big Data Management deployment.
Availability zones.
Subnets to contain specific elements of the deployment. Create two private subnets, plus one public subnet if you want to use a remote Windows server for Informatica clients. Create each of the subnets in a different availability zone.
The Informatica domain, including the Model Repository Service and the Data Integration Service.
Amazon EMR cluster to process mappings and other jobs from the Data Integration Service.
Amazon RDS databases for Informatica domain repositories:
Domain repository database
Model repository
Monitoring Model repository
An Amazon Redshift data warehouse, to act as a repository for data sources and targets.
S3 storage, to act as a temporary location for files that the Data Integration Service moves between EC2 instances and the EMR cluster.
AWS Lambda functions.
Amazon CloudWatch.
Big Data Management clients in a separate EC2 instance in a public subnet. See Informatica clients for an explanation of each of these.
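The subnet layout in the list above (two private subnets, an optional public subnet for the remote Windows server, each in a different availability zone) can be sketched as a small planning helper. This is a minimal illustration only, not part of the marketplace solution: the function name plan_subnets, the VPC CIDR block 10.0.0.0/16, the /24 subnet size, and the availability zone names are all assumptions chosen for the example. It uses only the Python standard library and does not call AWS.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones, use_public_subnet=True, new_prefix=24):
    """Plan two private subnets, plus an optional public one, in distinct zones.

    Hypothetical helper for illustration; it only computes a layout and
    does not create any AWS resources.
    """
    roles = ["private", "private"]
    if use_public_subnet:
        # The public subnet hosts the remote Windows server for Informatica clients.
        roles.append("public")

    # Drop duplicate zone names while preserving order.
    distinct_zones = list(dict.fromkeys(zones))
    if len(distinct_zones) < len(roles):
        raise ValueError("need one distinct availability zone per subnet")

    # Carve consecutive, non-overlapping blocks out of the VPC address range.
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return [
        {"role": role, "zone": zone, "cidr": str(block)}
        for role, zone, block in zip(roles, distinct_zones, blocks)
    ]


if __name__ == "__main__":
    # Assumed example values; substitute the CIDR and zones of your own VPC.
    for subnet in plan_subnets(
        "10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"]
    ):
        print(subnet)
```

Calling the helper with use_public_subnet=False returns only the two private subnets, which matches a deployment that does not use a remote Windows server for Informatica clients.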