Spark 2.0 Internals, Kafka, and NoSQL DBs
Spark is explored in great detail. The Spark programming paradigm is given due importance, and RDDs are explored as data structures, with their novelty brought into focus and contrasted with Hadoop MapReduce. All of this is done on a proper cluster. Most importantly, the course shows how Spark interacts with other components such as Cassandra, HBase, MongoDB, Kafka, and Storm. For example, how do we perform a distributed join in Spark when the data lives in multiple tables? Or how do we process real-time events from Kafka and then store the processed data in MongoDB for online consumption?
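By way of illustration, here is a minimal sketch of such a distributed join using Spark's pair RDDs in Java; the file paths, CSV field positions, and the local master URL are placeholder assumptions, not part of the course material:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class OrderCustomerJoin {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("OrderCustomerJoin").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // (customerId, name) pairs parsed from a CSV file -- hypothetical path and layout
            JavaPairRDD<String, String> customers = sc.textFile("hdfs:///data/customers.csv")
                    .mapToPair(line -> {
                        String[] f = line.split(",");
                        return new Tuple2<>(f[0], f[1]);
                    });

            // (customerId, amount) pairs parsed from a second file
            JavaPairRDD<String, Double> orders = sc.textFile("hdfs:///data/orders.csv")
                    .mapToPair(line -> {
                        String[] f = line.split(",");
                        return new Tuple2<>(f[0], Double.parseDouble(f[2]));
                    });

            // The join runs in a distributed fashion, shuffling both RDDs by key
            customers.join(orders)
                     .take(10)
                     .forEach(System.out::println);

            sc.stop();
        }
    }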
Intended Audience:
- Programmers
- Engineers
Key Skills:
- Spark: In a cluster
- Spark: Streaming
- Spark: Programming Model
- Spark: RDDs comparison with other techniques
- Spark: Detailed Architecture
- Spark: RDDs as Data Structure
Prerequisites:
- Good knowledge of Java 8 lambda expressions
- Good understanding of Hadoop
- Good understanding of Java
- Good understanding of HDFS
- Familiarity with Linux would help, as it is the platform used for the course
Instructional Method:
- This is an instructor-led course which provides lecture topics and the practical application of Spark, Spark Streaming and the underlying technologies. Most concepts are presented pictorially, and a detailed case study ties together the technologies, patterns, and design.
Distributing Data with HDFS
- Interfaces
- Hadoop Filesystems
- The Design of HDFS
- Parallel Copying with distcp
- Keeping an HDFS Cluster Balanced
- Hadoop Archives
- Using Hadoop Archives
- Limitations
- Data Flow
- Anatomy of a File Write
- Anatomy of a File Read
- Coherency Model
- The Command-Line Interface
- Basic Filesystem Operations
The Java Interface
- Querying the Filesystem
- Reading Data Using the FileSystem API (see the sketch after this list)
- Directories
- Deleting Data
- Reading Data from a Hadoop URL
- Writing Data
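A minimal sketch of reading a file through the Hadoop FileSystem API, in the spirit of the topics above; the namenode address and file path are placeholder values:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {
        public static void main(String[] args) throws Exception {
            String uri = "hdfs://namenode:8020/user/training/sample.txt"; // placeholder path
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(java.net.URI.create(uri), conf);
            InputStream in = null;
            try {
                in = fs.open(new Path(uri));                    // open returns an FSDataInputStream
                IOUtils.copyBytes(in, System.out, 4096, false); // stream the contents to stdout
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }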
Understanding Hadoop I/O
- Serialization
- Implementing a Custom Writable (see the sketch after this list)
- Serialization Frameworks
- The Writable Interface
- Writable Classes
- Avro
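A minimal sketch of a custom Writable with two illustrative fields; the class and field names are hypothetical:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A simple value type that Hadoop can serialize for shuffles and SequenceFiles
    public class PageViewWritable implements Writable {
        private long timestamp;
        private int durationMs;

        public PageViewWritable() { }                      // required no-arg constructor

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(timestamp);                      // fields written in a fixed order
            out.writeInt(durationMs);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            timestamp = in.readLong();                     // fields read back in the same order
            durationMs = in.readInt();
        }
    }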
Data Integrity
- ChecksumFileSystem
- LocalFileSystem
- Data Integrity in HDFS
ORC Files
- Large size enables efficient read of columns
- New types (datetime, decimal)
- Encoding specific to the column type
- Default stripe size is 250 MB
- A single file as output of each task
- Split files without scanning for markers
- Bound the amount of memory required for reading or writing
- Lowers pressure on the NameNode
- Dramatically simplifies integration with Hive
- Break file into sets of rows called a stripe
- Complex types (struct, list, map, union)
- Support for the Hive type model
ORC Files: Footer
- Count, min, max, and sum for each column
- Types, number of rows
- Contains a list of stripes
ORC Files: Index
- Required for skipping rows
- Position in each stream
- Min and max for each column
- Currently every 10,000 rows
- Could include bit field or bloom filter
ORC Files: Postscript
- Contains compression parameters
- Size of compressed footer
ORC Files: Data
- Directory of stream locations
- Required for table scan
Parquet
- Nested Encoding
- Configurations
- Error recovery
- Extensibility
- Nulls
- File format
- Data Pages
- Motivation
- Unit of parallelization
- Logical Types
- Metadata
- Modules
- Column chunks
- Separating metadata and column data
- Checksumming
- Types
File-Based Data Structures
- MapFile
- SequenceFile
Compression
- Codecs
- Using Compression in MapReduce
- Compression and Input Splits
Spark Introduction
- GraphX
- MLlib
- Spark SQL
- Data Processing Applications
- Spark Streaming
- What Is Apache Spark?
- Data Science Tasks
- Spark Core
- Storage Layers for Spark
- Who Uses Spark, and for What?
- A Unified Stack
- Cluster Managers
RDDs
- Expressing Existing Programming Models
- Fault Recovery
- Interpreter Integration
- Memory Management
- Implementation
- MapReduce
- RDD Operations in Spark
- Iterative MapReduce
- Console Log Mining (see the sketch after this list)
- Google’s Pregel
- User Applications Built with Spark
- Behavior with Insufficient Memory
- Support for Checkpointing
- A Fault-Tolerant Abstraction
- Evaluation
- Job Scheduling
- Spark Programming Interface
- Advantages of the RDD Model
- Understanding the Speedup
- Leveraging RDDs for Debugging
- Iterative Machine Learning Applications
- Explaining the Expressivity of RDDs
- Representing RDDs
- Applications Not Suitable for RDDs
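The console log mining use case mentioned above can be sketched in a few lines of Java; the log path and the ERROR/timeout patterns are illustrative assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LogMining {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("LogMining").setMaster("local[*]"));

            // Lazily defined lineage: nothing runs until an action is called
            JavaRDD<String> errors = sc.textFile("hdfs:///logs/app.log")   // placeholder path
                                       .filter(line -> line.contains("ERROR"));

            errors.cache();                                 // keep the filtered RDD in memory for reuse

            long total = errors.count();                    // first action materializes and caches the RDD
            long timeouts = errors.filter(l -> l.contains("timeout")).count(); // reuses the cached data

            System.out.println(total + " errors, " + timeouts + " timeouts");
            sc.stop();
        }
    }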
RDD Internals: Part-2
- Sorting Data
- Operations That Affect Partitioning
- Determining an RDD’s Partitioner
- Grouping Data
- Motivation
- Aggregations
- Data Partitioning (Advanced)
- Actions Available on Pair RDDs
- Joins
- Creating Pair RDDs
- Operations That Benefit from Partitioning
- Transformations on Pair RDDs
- Example: PageRank
- Custom Partitioners
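A sketch of pair-RDD aggregation, explicit partitioning, and a join that reuses the partitioner; the input path, key layout, and partition count are illustrative assumptions:

    import java.util.Arrays;
    import org.apache.spark.HashPartitioner;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class PairRddPartitioning {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("PairRddPartitioning").setMaster("local[*]"));

            // Per-user event counts, hash-partitioned once and cached for repeated use
            JavaPairRDD<String, Integer> eventCounts = sc.textFile("hdfs:///data/events.txt") // placeholder
                    .mapToPair(line -> {
                        String[] f = line.split("\t");
                        return new Tuple2<>(f[0], 1);
                    })
                    .reduceByKey(Integer::sum)
                    .partitionBy(new HashPartitioner(8))
                    .cache();

            // A small lookup table joined against the pre-partitioned RDD; only this side is
            // shuffled, because eventCounts already carries a partitioner
            JavaPairRDD<String, String> userNames = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("u1", "Alice"), new Tuple2<>("u2", "Bob")));

            eventCounts.join(userNames).collect().forEach(System.out::println);
            sc.stop();
        }
    }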
Data Ingress and Egress
- Hadoop Input and Output Formats
- File Formats
- Local/"Regular" FS
- Text Files
- Java Database Connectivity
- Structured Data with Spark SQL
- Elasticsearch
- File Compression
- Apache Hive
- Cassandra
- Object Files
- Comma-Separated Values and Tab-Separated Values
- HBase
- Databases
- Filesystems
- SequenceFiles
- JSON
- HDFS
- Motivation
- JSON
- Amazon S3
Running on a Cluster
- Scheduling Within and Between Spark Applications
- Spark Runtime Architecture
- A Scala Spark Application Built with sbt
- Packaging Your Code and Dependencies
- Launching a Program
- A Java Spark Application Built with Maven
- Hadoop YARN
- Deploying Applications with spark-submit
- The Driver
- Standalone Cluster Manager
- Cluster Managers
- Executors
- Amazon EC2
- Cluster Manager
- Dependency Conflicts
- Apache Mesos
- Which Cluster Manager to Use?
Spark Internals
Spark: YARN Mode
- Resource Manager
- Node Manager
- Workers
- Containers
- Threads
- Task
- Executors
- Application Master
- Multiple Applications
- Tuning Parameters
Spark: Local Mode
- Spark Caching
- With Serialization
- Off-heap
- In Memory
Spark Serialization
Standalone Mode
- Task
- Multiple Applications
- Executors
- Tuning Parameters
- Workers
- Threads
Advanced Spark Programming
- Working on a Per-Partition Basis
- Optimizing Broadcasts
- Accumulators
- Custom Accumulators
- Accumulators and Fault Tolerance
- Numeric RDD Operations
- Piping to External Programs
- Broadcast Variables
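A sketch combining a broadcast variable with a LongAccumulator (Spark 2.x API); the stop-word list and input path are illustrative assumptions:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import org.apache.spark.util.LongAccumulator;

    public class BroadcastAndAccumulator {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("BroadcastAndAccumulator").setMaster("local[*]"));

            // Small lookup set shipped once per executor instead of once per task
            Broadcast<Set<String>> stopWords =
                    sc.broadcast(new HashSet<>(Arrays.asList("the", "a", "an")));

            // Driver-visible counter updated from the workers; note that updates made inside
            // transformations can be re-applied if a task is retried
            LongAccumulator dropped = sc.sc().longAccumulator("droppedWords");

            long kept = sc.textFile("hdfs:///data/words.txt")                  // placeholder path
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .filter(w -> {
                        if (stopWords.value().contains(w)) { dropped.add(1); return false; }
                        return true;
                    })
                    .count();

            System.out.println("kept=" + kept + " dropped=" + dropped.value());
            sc.stop();
        }
    }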
Spark Streaming
- Stateless Transformations
- Output Operations
- Checkpointing
- Core Sources
- Receiver Fault Tolerance
- Worker Fault Tolerance
- Stateful Transformations
- Batch and Window Sizes
- Architecture and Abstraction
- Performance Considerations
- Streaming UI
- Driver Fault Tolerance
- Multiple Sources and Cluster Sizing
- Processing Guarantees
- A Simple Example
- Input Sources
- Additional Sources
- Transformations
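A minimal streaming sketch in the spirit of the simple example above: word counts over 5-second batches from a socket source; the host and port are placeholders:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Text arriving on a local socket; host and port are placeholders
            JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

            lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                 .mapToPair(word -> new Tuple2<>(word, 1))
                 .reduceByKey(Integer::sum)
                 .print();                                  // output operation: print the head of each batch

            jssc.start();                                   // start receiving and processing
            jssc.awaitTermination();
        }
    }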
Spark SQL
- User-Defined Functions
- Long-Lived Tables and Queries
- Spark SQL Performance
- Apache Hive
- Loading and Saving Data
- Performance Tuning Options
- Parquet
- Initializing Spark SQL
- Caching
- SchemaRDDs
- JSON
- From RDDs
- Linking with Spark SQL
- Spark SQL UDFs
- Using Spark SQL in Applications
- Basic Query Example
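A sketch of the basic query flow with the Spark 2.0 SparkSession API; the JSON path and column names are illustrative assumptions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlBasics {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSqlBasics")
                    .master("local[*]")
                    .getOrCreate();

            // Infer a schema from a JSON file (placeholder path) and register it as a temp view
            Dataset<Row> people = spark.read().json("hdfs:///data/people.json");
            people.printSchema();
            people.createOrReplaceTempView("people");

            // Run a plain SQL query against the view
            Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
            adults.show();

            spark.stop();
        }
    }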
Tuning and Debugging Spark
- Driver and Executor Logs
- Memory Management
- Finding Information
- Configuring Spark with SparkConf (see the sketch after this list)
- Key Performance Considerations
- Components of Execution: Jobs, Tasks, and Stages
- Spark Web UI
- Hardware Provisioning
- Level of Parallelism
- Serialization Format
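A sketch of configuring Spark through SparkConf; the property values are illustrative, not recommendations:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ConfiguredApp {
        public static void main(String[] args) {
            // Settings made here take precedence over spark-submit flags and spark-defaults.conf
            SparkConf conf = new SparkConf()
                    .setAppName("ConfiguredApp")
                    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                    .set("spark.default.parallelism", "64")   // illustrative values, tune per cluster
                    .set("spark.executor.memory", "2g");

            JavaSparkContext sc = new JavaSparkContext(conf);
            System.out.println(sc.getConf().toDebugString()); // inspect the effective configuration
            sc.stop();
        }
    }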
Metrics and Debugging
- Evaluating Spark jobs
- Monitoring tools for Spark
- Spark Web UI
- Memory consumption and resource allocation
- Job metrics
- Debugging & troubleshooting Spark jobs
- Monitoring Spark jobs
Hardware Provisioning
Level of Parallelism
Monitoring Spark
- Logging in Spark
- Spark History Server
- Spark Metrics
- Exploring the Spark Application UI
Finding Information
- Spark Administration & Best Practices
- Estimating cluster resource requirements
- Estimating Driver/Executor Memory Sizes
- Serialization Format
Kafka Internals
- Kafka Core Concepts
- Brokers
- Topics
- Producers
- Replicas
- Partitions
- Consumers
Operating Kafka
- P&S tuning
- Monitoring
- Deploying
- Architecture
- Hardware specs
Developing Kafka apps
- Serialization
- Compression
- Testing
- Case Study
- Reading from Kafka
- Writing to Kafka
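A sketch of writing to and reading from Kafka with the Java client; the broker address, topic, and group id are placeholders, and a real consumer would poll in a loop rather than once:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaRoundTrip {
        public static void main(String[] args) {
            // Producer: write one record to a topic
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("events", "user-1", "page_view"));
            }

            // Consumer: read records back as part of a consumer group
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "events-readers");
            c.put("auto.offset.reset", "earliest");          // start from the beginning if no committed offset
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(Collections.singletonList("events"));
                ConsumerRecords<String, String> records = consumer.poll(1000);   // wait up to 1s
                for (ConsumerRecord<String, String> r : records) {
                    System.out.println(r.key() + " -> " + r.value());
                }
            }
        }
    }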
Cassandra Internals
- Cassandra in a cluster
- Replication Strategies
- Seed Nodes
- Adding Nodes to a Cluster
- Node Configuration
- Cassandra Cluster Manager
- Creating a Cluster
- Dynamic Ring Participation
- Snitches
- Partitioners
The Cassandra Query Language
- Data Types
- CQL1
- The Relational Data Model
- CQL3
- CQL Types
- Cassandra’s Data Model
- Secondary Indexes
- CQL2
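A sketch of basic CQL statements executed through the DataStax Java driver (3.x API assumed); the contact point, keyspace, and table are illustrative:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CqlBasics {
        public static void main(String[] args) {
            // Contact point, keyspace, and table are placeholders
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {

                session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                        + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                        + "(user_id uuid PRIMARY KEY, name text, city text)");

                session.execute("INSERT INTO demo.users (user_id, name, city) "
                        + "VALUES (uuid(), 'Alice', 'Pune')");

                ResultSet rs = session.execute("SELECT name, city FROM demo.users");
                for (Row row : rs) {
                    System.out.println(row.getString("name") + " / " + row.getString("city"));
                }
            }
        }
    }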
Performance Tuning
- Memtables
- Commit Logs
- Caching
- Compaction
- Hinted Handoff
- JVM Settings
- Concurrency and Threading
- SSTables
- Networking and Timeouts
- Managing Performance
- Using cassandra-stress
The Cassandra Architecture
- System Keyspaces
- Partitioners
- Data Centers and Racks
- Staged Event-Driven Architecture (SEDA)
- Lightweight Transactions and Paxos
- Rings and Tokens
- Compaction
- Queries and Coordinator Nodes
- Caching
- Consistency Levels
- Hinted Handoff
- Bloom Filters
- Gossip and Failure Detection
- Anti-Entropy, Repair, and Merkle Trees
- Snitches
- Virtual Nodes
- Managers and Services
- Replication Strategies
- Memtables, SSTables, and Commit Logs
- Tombstones
Monitoring and Maintenance
- Logging
- Cassandra’s MBeans
- Backup and Recovery
- Maintenance Tools
- SSTable Utilities
- Basic Maintenance
- Adding Nodes
- Handling Node Failure
- Health Check
- Monitoring with nodetool
- Monitoring Cassandra with JMX
MongoDB Internals
- Indexing and query optimization
- Replication
- Sharding
Sharding
- Starting the Servers
- Adding a Shard from a Replica Set
- Splitting Chunks
- Chunk Ranges
- Configuring Sharding
- Sharding Data
- Understanding the Components of a Cluster
- When to Shard
- The Balancer
- Config Servers
- Adding Capacity
- The mongos Processes
- How MongoDB Tracks Cluster Data
Monitoring MongoDB Applications
- False Positives
- Seeing the Current Operations
- Documents
- Collections
- Calculating Sizes
- Finding Problematic Operations
- Preventing Phantom Operations
- Using mongotop and mongostat
- Killing Operations
- Using the System Profiler
- Databases
- Seeing What Your Application Is Doing
Durability
- What Journaling Does
- Replacing Data Files
- Durability with Replication
- Sneaky Unclean Shutdowns
- Planning Commit Batches
- Checking for Corruption
- What MongoDB Does Not Guarantee
- Repairing Data Files
- Setting Commit Intervals
- Turning Off Journaling
- The mongod.lock File
Advanced Sharding
- Controlling Data Distribution
- Location-Based Shard Keys
- Hashed Shard Keys for GridFS
- The Firehose Strategy
- Shard Key Limitations
- Ascending Shard Keys
- Shard Key Strategies
- Picturing Distributions
- Randomly Distributed Shard Keys
- Hashed Shard Key
- Choosing a Shard Key
- Shard Key Cardinality
- Manual Sharding
- Shard Key Rules and Guidelines
- Taking Stock of Your Usage
- Multi-Hotspot
- Using a Cluster for Multiple Databases and Collections
HBase
- Introduction
- Clients
- Concepts
- HBase vs RDBMS
Log-Structured Merge Trees
- Compaction
- Limitations of B+ Trees
- Limitations of Binary Trees
- Log-Structured Merge Tree as the Backbone of Storage
HBase Storage Architecture
- MemStore
- Read and Write Path
- Physical Architecture
- HFile
- WAL
- HFile Format
- HMaster and HRegionServer
- How Data is Stored in HFile
- Root Table and Meta Table
- Key Format
- Role of ZooKeeper
Future Directions
- MMap for Bloom Filters and Block Indexes
- Exploring Off-Heap Storage
Introduction
- Common Advantages
- Dynamo and Bigtable
- Tables, Column Families, Rows, and Columns
- Data Model
- Bigtable and HBase (C + P)
- What am I giving up?
- Schemaless
- Key/Value Stores
HBase Operations
- Access Patterns
- Batching
- Filters
- Put
- Gets
- Caching
- Scanning
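A sketch of Put, Get, and Scan with the HBase 1.x Java client; the table and column family names are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCrud {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {   // placeholder table

                // Put: write one cell into column family "info"
                Put put = new Put(Bytes.toBytes("user-1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
                table.put(put);

                // Get: read the row back by key
                Result row = table.get(new Get(Bytes.toBytes("user-1")));
                System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));

                // Scan: iterate rows; caching controls how many rows each RPC fetches
                Scan scan = new Scan();
                scan.setCaching(100);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }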
Designing HBase Tables and Schemas
- Time-Ordered Relations
- Pagination
- Concepts
- Key Design
- Partial Key Scans
- Tall-Narrow Versus Flat-Wide Tables
- Time Series Data
- Secondary Indexes
- Advanced Schemas