AWS Analytics

Amazon Redshift:

  • Amazon Redshift is a fast, powerful, fully managed, petabyte-scale data warehouse service in the cloud that makes it simple and cost-effective to analyze all your data using standard SQL and existing Business Intelligence (BI) tools.
  • Customers can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of the cost of most other data warehousing solutions.
  • Redshift is designed for OLAP (analytical) workloads; RDS is designed for OLTP (transactional) workloads.
  • Extremely cost-effective compared with some on-premises data warehouse platforms. There is no upfront commitment; you can start small and grow as required.
    Single Node (160 GB): You can start with a single-node, 160 GB Redshift data warehouse.
    Multi-Node Deployment: A multi-node deployment (cluster) needs a leader node and one or more compute nodes.
    The leader node manages client connections, receives queries, and stores metadata.
    Compute nodes store data and perform queries and computations.
    You can have up to 128 compute nodes in a cluster.
  • Columnar data storage: Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. Analytical queries usually read only a few columns, so this layout requires fewer I/Os and greatly improves query performance.
  • Massively parallel processing (MPP): Amazon Redshift automatically distributes data and query load across all nodes. It is easy to add nodes to your data warehouse, which lets you maintain fast query performance as the warehouse grows.
  • Advanced compression: Because data is stored sequentially in columns, similar values sit together, which allows better compression, less storage space, and better performance. Redshift automatically selects the compression scheme.
  • Option to query data files directly on S3 via Redshift Spectrum.
    Redshift can be up to 10x faster than a traditional SQL database for analytical queries.
  • Amazon Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required.
  • Redshift uses replication and continuous backups to enhance availability and improve durability, and it can automatically recover from component and node failures. A cluster runs in a single AZ, but you can restore snapshots into another AZ.
    Alternatively, you can run data warehouse clusters in multiple AZs by loading data into two Amazon Redshift clusters in separate AZs from the same set of Amazon S3 input files.
    Redshift replicates your data within your data warehouse cluster and continuously backs up your data to Amazon S3.
  • High availability for Redshift:
    • Currently, Redshift does not support Multi-AZ deployments.
    • The best HA option is to use a multi-node cluster, which supports data replication and node recovery.
    • A single-node Redshift cluster does not support data replication; if a drive fails, you have to restore from a snapshot on S3.
    Redshift can asynchronously replicate your snapshots to S3 in another region for DR.
  • By default, Amazon Redshift retains backups for 1 day. You can configure this to be as long as 35 days.
    If you delete the cluster you can choose to have a final snapshot taken and retained. Manual backups are not automatically deleted when you delete a cluster.
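
The columnar storage and compression points above can be illustrated with a toy sketch. This is not how Redshift is implemented internally: the sample rows, the function names, and the choice of run-length encoding as the example scheme are all assumptions made for illustration. It only shows why totalling one column touches fewer values in a column layout, and why repeated adjacent values compress well.

```python
# Toy illustration only: real Redshift storage uses on-disk blocks, not Python lists.
# The sample data and helper names below are hypothetical.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "EU", "amount": 80.0},
    {"order_id": 3, "region": "US", "amount": 200.0},
    {"order_id": 4, "region": "US", "amount": 50.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# Row layout: every field of every row is read to total a single column.
values_read_row_layout = sum(len(record) for record in rows)
# Column layout: only the "amount" column is read.
values_read_column_layout = len(columns["amount"])

total_amount = sum(columns["amount"])
region_runs = run_length_encode(columns["region"])
```

In this sketch the row layout reads all twelve field values to compute the total, while the column layout reads only the four values in `amount`.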

Security:

  • You can load encrypted data from S3.
  • Supports SSL encryption in transit between client applications and the Redshift data warehouse cluster.
  • VPC for network isolation.
  • Encryption for data at rest (AES-256).
  • Audit logging and AWS CloudTrail integration.
  • Redshift takes care of key management, or you can manage your own keys through an HSM or AWS KMS.

Redshift Billing:

  • Compute node hours: you are charged one unit per compute node per hour. Only compute nodes are billed, not the leader node.
  • Backup storage: snapshots stored on S3.
  • Data transfer: there is no charge for data transfer between Redshift and S3 within a region, but other transfers may incur charges.
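
The compute-node billing rule can be sketched as simple arithmetic. This is a rough estimate only: the $0.25 hourly rate is the starting price quoted earlier in these notes, the 730-hour month is an assumption, and the function name is hypothetical. Backup storage and data transfer charges are not modeled.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def estimate_monthly_compute_cost(compute_nodes, hourly_rate_usd=0.25,
                                  hours=HOURS_PER_MONTH):
    """Estimate monthly compute cost in USD.

    Only compute nodes are counted; the leader node of a multi-node
    cluster is not billed, so it is left out of `compute_nodes`.
    """
    if compute_nodes < 1:
        raise ValueError("a cluster needs at least one compute node")
    return compute_nodes * hourly_rate_usd * hours


# Example: a cluster with four compute nodes at the quoted starting rate.
four_node_estimate = estimate_monthly_compute_cost(4)
```

Actual rates depend on node type and region, so treat the default rate as a placeholder.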

Redshift Limitations:
  • Redshift can store huge amounts of data but cannot ingest huge amounts of data in real time.
  • As a user, you cannot access the Redshift cluster nodes directly; you can only reach them through applications.
  • Currently, Redshift does not support Multi-AZ deployments.

Amazon EMR:

  • Amazon EMR is a web service that processes large amounts of data.
  • Amazon EMR uses a hosted Hadoop framework running on Amazon EC2 and Amazon S3, with Apache Hadoop as its data processing engine.
  • Mostly used for log analysis, financial analysis, and ETL (extract, transform, and load) activities.
  • Steps: programmatic tasks that process data.
  • Clusters: collections of EC2 instances provisioned by EMR to run your steps.
  • All nodes for a given cluster are launched in the same Availability Zone.
  • With EMR, you have access to the underlying OS (you can SSH in).

Amazon Kinesis:

  • Collect, process, and analyze real-time streaming data. Kinesis lets you analyze data as it arrives instead of waiting for the whole data set to be collected.
  • Use cases: real-time data from video games, e-commerce websites, telemetry (for example, weather data), IoT devices, and machine learning.
  • Clickstreams: collecting logs of where users click on a website helps improve the user experience.
  • Data is processed in “shards”, with each shard able to ingest 1,000 records per second.
  • There is a default limit of 500 shards.
  • Data is stored by default for 24 hours but can be configured for up to 7 days.
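
The shard figures above lend themselves to a quick sizing calculation. The sketch below uses only the two numbers stated in these notes (1,000 records per second per shard and a default limit of 500 shards); the function names are hypothetical, and record size in bytes, which also affects real capacity, is not modeled.

```python
import math

RECORDS_PER_SHARD_PER_SECOND = 1000  # figure taken from the notes above
DEFAULT_SHARD_LIMIT = 500            # figure taken from the notes above


def shards_needed(records_per_second):
    """Return the minimum shard count for a given ingest rate (at least 1)."""
    if records_per_second < 0:
        raise ValueError("records_per_second must not be negative")
    return max(1, math.ceil(records_per_second / RECORDS_PER_SHARD_PER_SECOND))


def fits_default_limit(records_per_second):
    """Check whether the required shard count stays within the default limit."""
    return shards_needed(records_per_second) <= DEFAULT_SHARD_LIMIT
```

For example, `shards_needed(2500)` rounds up to three shards, because two shards would cover only 2,000 records per second.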

There are four types of Kinesis services:

  • Kinesis Video Streams:
  • Amazon Kinesis Video Streams is a fully managed AWS service that lets you stream live video from devices to the AWS Cloud, or build applications for real-time video processing or batch-oriented video analytics. You can use the streams for analysis, real-time reporting, and machine learning.
  • Supports encryption at rest using server-side encryption (KMS) with a customer master key.
  • Stores data in shards. A stream can have multiple shards.
  • Kinesis Video Streams automatically provisions and elastically scales all the infrastructure needed to ingest streaming video data from millions of devices.
  • It durably stores, encrypts, and indexes video data in your streams and lets you access the data through easy-to-use APIs. Integrations include Amazon Rekognition (face detection), Amazon SageMaker (machine learning), TensorFlow, HLS-based video playback (HLS is Apple's HTTP Live Streaming protocol, which services such as YouTube also use for live streaming), and custom video processing.
  • Kinesis Data Streams:
    Kinesis Data Streams lets you build custom applications that process or analyze streaming data for specific needs such as accelerated log and data feed intake, real-time metrics and reporting, real-time data analytics, and complex stream processing.
    Use cases: processing and analyzing real-time application logs, clickstreams, and social media feeds.
  • Kinesis Data Streams stores data for later processing by applications. A data blob is the payload that your data producer adds to a stream; the maximum size of the data blob in one record is 1 MB.
  • Each shard can support up to 1,000 PUT records per second.
  • Partition keys are used to group data by shard within a stream.
  • Kinesis Data Streams uses a KMS master key for encryption.
    Kinesis Data Streams replicates data across three Availability Zones.
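
How a partition key groups records by shard can be sketched in a few lines. Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below assumes the hash space is split evenly across shards; the function names are hypothetical, and a real stream's ranges can be uneven after shards are split or merged.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_ranges(shard_count):
    """Split the hash space into equal contiguous (start, end) ranges, both inclusive.

    Assumption: an even split, which is only one possible layout.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    size = HASH_SPACE // shard_count
    ranges = []
    for index in range(shard_count):
        start = index * size
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if index == shard_count - 1 else (index + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    key = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= key <= end:
            return index
    raise ValueError("hash key outside all shard ranges")


ranges = shard_ranges(4)
first = shard_for("user-42", ranges)
second = shard_for("user-42", ranges)  # same key, so same shard
```

Because the mapping depends only on the key, all records with the same partition key land on the same shard, which preserves their relative order.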
  • Kinesis Data Firehose: Loads streaming data into data stores and data lakes (repositories of unprocessed data).
    Kinesis Data Firehose lets you load streaming data into data stores, data lakes, and analytics tools. It captures, transforms, and loads streaming data.
  • Enables real-time analytics with existing business intelligence tools.
    A Kinesis data stream can be used as the source for Kinesis Data Firehose.
  • Kinesis Data Firehose can invoke a Lambda function to transform data before delivering it to the destination.
    Firehose destinations include Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk.
  • For an Amazon Redshift destination, streaming data is delivered to an S3 bucket first; Kinesis Data Firehose then issues an Amazon Redshift COPY command to load the data from the S3 bucket into the Redshift cluster.
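
The Lambda transformation step mentioned above can be sketched as a minimal handler. Firehose passes the function a batch of records whose `data` field is base64-encoded, and expects each record back with the same `recordId`, a `result` status, and re-encoded `data`. The transformation itself (adding a `processed` flag to JSON objects) is a hypothetical example, not something the notes prescribe.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform a batch of Firehose records and report a status for each one."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
            payload["processed"] = True  # hypothetical transformation
            # A trailing newline keeps delivered records line-delimited.
            transformed = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(transformed).decode("ascii"),
            })
        except ValueError:
            # Invalid base64 or JSON: return the original data marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Returning a status for every incoming `recordId` matters, because Firehose matches the response to the batch it sent; records marked `ProcessingFailed` are treated as unsuccessfully processed rather than silently dropped.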
  • Kinesis Data Analytics: Processes real-time data with SQL or Java. Kinesis Data Analytics is serverless, so there are no servers to manage.
  • Kinesis Data Analytics is the easiest way to process and analyze real-time streaming data.
  • Can use standard SQL queries to process Kinesis data streams.
  • Use cases: generating time-series analytics, feeding real-time dashboards, and creating real-time alerts and notifications.
  • Can read data from Kinesis Data Streams and Kinesis Data Firehose, and output to S3, Redshift, Elasticsearch, and Kinesis Data Streams.
  • Sits on top of Kinesis Data Streams and Kinesis Data Firehose.
  • IAM can be used to give Kinesis Data Analytics permission to read records from sources and write to destinations.
  • With Amazon Kinesis Data Analytics for SQL Applications, you can process and analyze streaming data using standard SQL. The service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics.
  • To get started with Kinesis Data Analytics, you create a Kinesis data analytics application that continuously reads and processes streaming data. The service supports ingesting data from Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose streaming sources. Then, you author your SQL code using the interactive editor and test it with live streaming data. You can also configure destinations where you want Kinesis Data Analytics to send the results.
  • Kinesis Data Analytics supports Amazon Kinesis Data Firehose (Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk), AWS Lambda, and Amazon Kinesis Data Streams as destinations.
  • Amazon Athena:
  • Lets you analyze data in S3 using standard SQL queries. With Athena, there’s no need for complex extract, transform, and load (ETL) jobs to prepare your data for analysis.
  • Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
  • Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Athena is easy to use for anyone with SQL skills to quickly analyze large-scale datasets.
  • Amazon QuickSight:
  • Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy for you to deliver insights to everyone in your organization. QuickSight lets you create and publish interactive dashboards that can be accessed from browsers or mobile devices.
  • AWS Glue:
  • AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
  • AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.
  • AWS Data Pipeline:
  • AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
