About the Book

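  Get expert guidance on designing end-to-end data management solutions with Apache Hadoop. While many other sources stop at explaining how to use the many components of the Hadoop ecosystem, this practice-focused book walks you through the architectural considerations needed to tie those components together into a complete application tailored to your particular use case.
  To reinforce these lessons, Part II of Hadoop Application Architectures (English reprint edition) provides detailed architecture case studies covering some of the most common Hadoop use cases.
  Whether you are designing a new Hadoop application or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures (English reprint edition) offers skilled guidance throughout the process.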
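  Factors to consider when using Hadoop to store and model data
  Practical guidance on moving data into and out of the system
  Data processing frameworks, including MapReduce, Spark, and Hive
  Common Hadoop processing patterns, such as removing duplicate records and using windowing analysis
  Giraph, GraphX, and other tools for large graph processing on Hadoop
  Workflow orchestration and scheduling tools, such as Apache Oozie
  Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
  Architecture examples for clickstream analysis, fraud detection, and data warehousing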

Table of Contents

Foreword
Preface

Part I. Architectural Considerations for Hadoop Applications
1. Data Modeling in Hadoop
Data Storage Options
Standard File Formats
Hadoop File Types
Serialization Formats
Columnar Formats
Compression
HDFS Schema Design
Location of HDFS Files
Advanced HDFS Schema Design
HDFS Schema Design Summary
HBase Schema Design
Row Key
Timestamp
Hops
Tables and Regions
Using Columns
Using Column Families
Time-to-Live
Managing Metadata
What Is Metadata?
Why Care About Metadata?
Where to Store Metadata?
Examples of Managing Metadata
Limitations of the Hive Metastore and HCatalog
Other Ways of Storing Metadata
Conclusion
2. Data Movement
Data Ingestion Considerations
Timeliness of Data Ingestion
Incremental Updates
Access Patterns
Original Source System and Data Structure
Transformations
Network Bottlenecks
Network Security
Push or Pull
Failure Handling
Level of Complexity
Data Ingestion Options
File Transfers
Considerations for File Transfers versus Other Ingest Methods
Sqoop: Batch Transfer Between Hadoop and Relational Databases
Flume: Event-Based Data Collection and Processing
Kafka
Data Extraction
Conclusion
3. Processing Data in Hadoop
MapReduce
MapReduce Overview
Example for MapReduce
When to Use MapReduce
Spark
Spark Overview
Overview of Spark Components
Basic Spark Concepts
Benefits of Using Spark
Spark Example
When to Use Spark
Abstractions
Pig
Pig Example
When to Use Pig
Crunch
Crunch Example
When to Use Crunch
Cascading
Cascading Example
When to Use Cascading
Hive
Hive Overview
Example of Hive Code
When to Use Hive
Impala
Impala Overview
Speed-Oriented Design
Impala Example
When to Use Impala
Conclusion
4. Common Hadoop Processing Patterns
Pattern: Removing Duplicate Records by Primary Key
Data Generation for Deduplication Example
Code Example: Spark Deduplication in Scala
Code Example: Deduplication in SQL
Pattern: Windowing Analysis
Data Generation for Windowing Analysis Example
Code Example: Peaks and Valleys in Spark
Code Example: Peaks and Valleys in SQL
Pattern: Time Series Modifications
Use HBase and Versioning
Use HBase with a RowKey of RecordKey and StartTime
Use HDFS and Rewrite the Whole Table
Use Partitions on HDFS for Current and Historical Records
Data Generation for Time Series Example
Code Example: Time Series in Spark
Code Example: Time Series in SQL
Conclusion
5. Graph Processing on Hadoop
What Is a Graph?
What Is Graph Processing?
How Do You Process a Graph in a Distributed System?
The Bulk Synchronous Parallel Model
BSP by Example
Giraph
Read and Partition the Data
Batch Process the Graph with BSP
Write the Graph Back to Disk
Putting It All Together
When Should You Use Giraph?
GraphX
Just Another RDD
GraphX Pregel Interface
vprog()
sendMessage()
mergeMessage()
Which Tool to Use?
Conclusion
6. Orchestration
Why We Need Workflow Orchestration
The Limits of Scripting
The Enterprise Job Scheduler and Hadoop
Orchestration Frameworks in the Hadoop Ecosystem
Oozie Terminology
Oozie Overview
Oozie Workflow
Workflow Patterns
Point-to-Point Workflow
Fan-Out Workflow
Capture-and-Decide Workflow
Parameterizing Workflows
Classpath Definition
Scheduling Patterns
Frequency Scheduling
Time and Data Triggers
Executing Workflows
Conclusion
7. Near-Real-Time Processing with Hadoop
Stream Processing
Apache Storm
Storm High-Level Architecture
Storm Topologies
Tuples and Streams
Spouts and Bolts
Stream Groupings
Reliability of Storm Applications
Exactly-Once Processing
Fault Tolerance
Integrating Storm with HDFS
Integrating Storm with HBase
Storm Example: Simple Moving Average
Evaluating Storm
Trident
Trident Example: Simple Moving Average
Evaluating Trident
Spark Streaming
Overview of Spark Streaming
Spark Streaming Example: Simple Count
Spark Streaming Example: Multiple Inputs
Spark Streaming Example: Maintaining State
Spark Streaming Example: Windowing
Spark Streaming Example: Streaming versus ETL Code
Evaluating Spark Streaming
Flume Interceptors
Which Tool to Use?
Low-Latency Enrichment, Validation, Alerting, and Ingestion
NRT Counting, Rolling Averages, and Iterative Processing
Complex Data Pipelines
Conclusion

Part II. Case Studies
8. Clickstream Analysis
Defining the Use Case
Using Hadoop for Clickstream Analysis
Design Overview
Storage
Ingestion
The Client Tier
The Collector Tier
Processing
Data Deduplication
Sessionization
Analyzing
Orchestration
Conclusion
9. Fraud Detection
Continuous Improvement
Taking Action
Architectural Requirements of Fraud Detection Systems
Introducing Our Use Case
High-Level Design
Client Architecture
Profile Storage and Retrieval
Caching
HBase Data Definition
Delivering Transaction Status: Approved or Denied?
Ingest
Path Between the Client and Flume
Near-Real-Time and Exploratory Analytics
Near-Real-Time Processing
Exploratory Analytics
What About Other Architectures?
Flume Interceptors
Kafka to Storm or Spark Streaming
External Business Rules Engine
Conclusion
10. Data Warehouse
Using Hadoop for Data Warehousing
Defining the Use Case
OLTP Schema
Data Warehouse: Introduction and Terminology
Data Warehousing with Hadoop
High-Level Design
Data Modeling and Storage
Ingestion
Data Processing and Access
Aggregations
Data Export
Orchestration
Conclusion
A. Joins in Impala

Index

Excerpt

  Hadoop Application Architectures (English reprint edition):
  Includes everything required for Hadoop applications to run, except data. This includes JAR files, Oozie workflow definitions, Hive HQL files, and more. The application code directory /app is used for application artifacts such as JARs for Oozie actions or Hive user-defined functions (UDFs). It is not always necessary to store such application artifacts in HDFS, but some Hadoop applications such as Oozie and Hive require storing shared code and configuration on HDFS so it can be used by code executing on any node of the cluster. This directory should have a subdirectory for each group and application, similar to the structure used in /etl. For a given application (say, Oozie), you would need a directory for each version of the artifacts you decide to store in HDFS, possibly tagging, via a symlink in HDFS, the latest artifact as latest and the currently used one as current. The directories containing the binary artifacts would be present under these versioned directories. This will look similar to: /app/<group>/<application>/. To continue our previous example, the JAR for the latest build of our aggregate preferences process would be in a directory structure like /app/BI/clickstream/latest/aggregate-preferences/uber-aggregate-preferences.jar.
  ……
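
  The directory convention in the excerpt can be illustrated with a small path-building sketch. It is not code from the book: the function name artifact_path and its parameter names are hypothetical, and the component order (group, application, version, artifact directory, artifact) is an assumption inferred from the excerpt's example path. It only builds the path string and does not talk to HDFS.

```python
import posixpath

APP_ROOT = "/app"  # application code directory named in the excerpt


def artifact_path(group, application, version, artifact_dir, artifact):
    """Build the HDFS path of one application artifact.

    The component order is an assumption inferred from the excerpt's
    example path, not a layout confirmed by the book.
    """
    parts = [group, application, version, artifact_dir, artifact]
    for part in parts:
        # Reject empty or nested components so each argument maps to one level.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join(APP_ROOT, *parts)


if __name__ == "__main__":
    # Reproduces the example path from the excerpt.
    print(artifact_path("BI", "clickstream", "latest",
                        "aggregate-preferences",
                        "uber-aggregate-preferences.jar"))
```

  Passing "current" instead of "latest" as the version would address the currently deployed build under the same assumed layout; posixpath is used so the result keeps forward slashes on any operating system.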
