With the advent of big data technologies, many organizations are adopting a new information storage model called a data lake to solve data management challenges. The data lake model is being adopted for diverse use cases, such as business intelligence, analytics, regulatory compliance, and fraud detection.
A data lake is a shared repository of raw and enterprise data from a variety of sources. It is often built over a distributed Hadoop cluster, which provides an economical and scalable persistence and compute layer. Hadoop makes it possible to store large volumes of structured and unstructured data from various enterprise systems within and outside the organization. Data in the lake can include raw and refined data, master data and transactional data, log files, and machine data.
Organizations are also looking to provide ways for different kinds of users to access and work with all of the data in the enterprise, both within the Hadoop data lake and outside it. They want data analysts and data scientists to be able to use the data lake for ad-hoc self-service analytics to drive business innovation, without exposing the complexity of the underlying technologies or requiring coding skills. IT and data governance staff want to monitor data-related user activities in the enterprise. Without a strong data management and governance foundation enabled by intelligence, data lakes can turn into data swamps.
In version 10.1, Informatica introduces Intelligent Data Lake, a new product to help customers derive more value from their Hadoop-based data lake and make data available to all users in the organization.
Intelligent Data Lake is a collaborative self-service big data discovery and preparation solution for data analysts and data scientists. It enables analysts to rapidly discover raw data and turn it into insight, and it allows IT to ensure quality, visibility, and governance. With Intelligent Data Lake, analysts spend more time on analysis and less time on finding and preparing data.
Intelligent Data Lake provides the following benefits:
Data analysts can quickly and easily find and explore trusted data assets within the data lake and outside the data lake using semantic search and smart recommendations.
Data analysts can transform, cleanse, and enrich data in the data lake using an Excel-like spreadsheet interface in a self-service manner without the need for coding skills.
Data analysts can publish data and share knowledge with the rest of the community and analyze the data using their choice of BI or analytic tools.
IT and governance staff can monitor user activity related to data usage in the lake.
IT can track data lineage to verify that data is coming from the right sources and going to the right targets.
IT can enforce appropriate security and governance on the data lake.
IT can operationalize the work done by data analysts into a data delivery process that can be repeated and scheduled.
Intelligent Data Lake has the following features:
Search
Find data in the lake, as well as in other enterprise systems, using smart search and inference-based results.
Filter assets based on dynamic facets using system attributes and custom-defined classifications.
Explore
Get an overview of assets, including custom attributes, profiling statistics for data quality, data domains for business content, and usage information.
Add business context information by crowd-sourcing metadata enrichment and tagging.
Preview sample data, subject to user credentials, to get a sense of the data asset.
Get lineage of assets to understand where data is coming from and where it is going and to build trust in the data.
Know how the data asset is related to other assets in the enterprise based on associations with other tables or views, users, reports, and data domains.
Progressively discover additional assets with lineage and relationship views.
Acquire
Upload personal delimited files to the lake using a wizard-based interface.
Hive tables are automatically created for the uploads in an optimized format (see the sketch after this list).
Create, append to, or overwrite assets for uploaded data.
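The following sketch illustrates roughly what the upload flow amounts to behind the scenes, assuming PySpark with Hive support and Parquet as the optimized storage format. The file path, database, and table names are illustrative, not product APIs.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Read the analyst's delimited file, inferring column types from the data.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/uploads/q3_sales.csv"))

    # Persist it as a Hive table in a columnar format. The write mode mirrors
    # the create / append to / overwrite choices in the upload wizard:
    # "errorifexists" (create), "append", or "overwrite".
    (df.write
       .format("parquet")
       .mode("overwrite")
       .saveAsTable("analyst_sandbox.q3_sales"))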
Collaborate
Organize work by adding data assets to projects.
Add collaborators to projects with different roles, such as co-owner, editor, or viewer, and with different privileges.
Recommendations
Improve productivity by using recommendations based on the behavior and shared knowledge of other users.
Get recommendations for alternate assets that can be used in a project.
Get recommendations for additional assets that can be used in a project.
Recommendations change based on what is in the project.
Prepare
Use an Excel-like environment to interactively specify transformations using sample data.
See sheet-level and column-level overviews, including value distributions and numeric and date distributions.
Add transformations in the form of recipe steps and see the results immediately on the sheets.
Perform column-level data cleansing and data transformation using string, math, date, and logical operations.
Perform sheet-level operations to combine, merge, aggregate, or filter data.
Refresh the sample in the worksheet if the data in the underlying tables changes.
Derive sheets from existing sheets and get alerts when parent sheets change.
All transformation steps are stored in the recipe, which can be played back interactively (a conceptual sketch follows).
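As a conceptual illustration only, a recipe can be thought of as an ordered list of named transformation steps replayed against a sample. The product stores recipes internally; the step functions, column names, and helper below are hypothetical.

    import pandas as pd

    # Each step is (description, function); together they form the recipe.
    recipe = [
        ("trim whitespace in 'city'",  lambda df: df.assign(city=df["city"].str.strip())),
        ("uppercase 'country' codes",  lambda df: df.assign(country=df["country"].str.upper())),
        ("drop rows missing 'amount'", lambda df: df.dropna(subset=["amount"])),
    ]

    def play_back(sample: pd.DataFrame, steps) -> pd.DataFrame:
        """Apply each recipe step in order, as the interactive sheet does."""
        for description, step in steps:
            sample = step(sample)
            print(f"applied: {description}, rows now {len(sample)}")
        return sample

    sample = pd.DataFrame({"city": [" Berlin ", "Oslo"],
                           "country": ["de", "no"],
                           "amount": [120.0, None]})
    result = play_back(sample, recipe)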
Publish
Use the power of the underlying Hadoop system to run large-scale data transformation without coding or scripting.
Run data preparation steps on actual large data sets in the lake to create new data assets.
Publish the data in the lake as a Hive table in the desired database.
Create, append to, or overwrite assets for published data (see the sketch below).
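A rough sketch of what publication amounts to, assuming the recipe has been translated into Spark transformations that run on the full source table rather than the interactive sample. Database and table names are examples only.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Apply the prepared transformations to the complete source table,
    # not just the interactive sample.
    curated = (spark.table("analyst_sandbox.q3_sales")
               .withColumn("country", F.upper(F.col("country")))
               .dropna(subset=["amount"]))

    # Publish into the desired database; the write mode again maps to the
    # create / append to / overwrite options.
    spark.sql("CREATE DATABASE IF NOT EXISTS curated")
    curated.write.mode("append").saveAsTable("curated.q3_sales")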
Data Asset Operations
Export data from the lake to a CSV file.
Copy data into another database or table.
Delete the data asset if allowed by user credentials (rough equivalents of these operations are sketched below).
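For illustration, rough PySpark equivalents of these operations follow; in the product they are performed through the UI and are subject to user entitlements. All names are examples.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    asset = spark.table("curated.q3_sales")

    # Export the asset to CSV (coalesced to a single output file here).
    asset.coalesce(1).write.option("header", "true").csv("/exports/q3_sales")

    # Copy the data into another database/table.
    asset.write.mode("errorifexists").saveAsTable("archive.q3_sales_copy")

    # Delete the asset (succeeds only if the user's credentials allow it).
    spark.sql("DROP TABLE IF EXISTS curated.q3_sales")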
My Activities
Keep track of upload activities and their status.
Keep track of publications and their status.
View log files in case of errors and share with IT administrators if needed.
IT Monitoring
Keep track of user, data asset, and project activities by building reports on top of the audit database.
Find information such as the top active users, the top datasets by size, recent updates, the most reused assets, and the most active projects (an example query is sketched below).
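As an example of such a report, the query below computes the ten most active users over a hypothetical audit table. The schema shown here (audit.user_events with a user_name column) is assumed for illustration, not a documented product schema.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Top ten most active users by number of recorded audit events.
    (spark.table("audit.user_events")
     .groupBy("user_name")
     .agg(F.count("*").alias("events"))
     .orderBy(F.desc("events"))
     .limit(10)
     .show())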
IT Operationalization
Operationalize the ad-hoc work done by analysts.
Use Informatica Developer to customize and optimize the Informatica Big Data Management mappings translated from the recipes that analysts create.
Deploy, schedule, and monitor the Informatica Big Data Management mappings to ensure that data assets are delivered at the right time to the right destinations.
Make sure that the entitlements for access to the various databases and tables in the data lake conform to security policies.