Cloud Architect: Storage
AZ-304 Cloud Architect Preparation (3/9)
Azure Data Architecture Guide - Azure Architecture Center
This guide presents a structured approach for designing data-centric solutions on Microsoft Azure.
Storage services:
1. Blob (Object)
2. File
3. Queue
4. Disk
5. Table Storage

Hybrid storage solutions (use cases: DB transfer, share, back up, caching):
1. Azure File Sync
2. Data Box Gateway
3. Azure Stack Edge
Storing the audit information: in Blob Storage, as long as the storage account is in the same location as the Azure SQL server.
- OLTP: Online Transaction Processing
- OLAP: Online Analytical Processing
Big Data solution
A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.
The data may be processed in batch or in real time. Big data solutions typically involve a large amount of non-relational data, such as key-value data, JSON documents, or time series data. Often traditional RDBMS systems are not well-suited to store this type of data.
Azure Data Platform
- Azure Data Factory
- Azure Data Lake Gen2
- Azure Synapse Analytics
- Azure Databricks
- Azure Cosmos DB
- Azure Cognitive Services
- Azure Event Hubs
- Azure Stream Analytics
- Microsoft Power BI
- Azure Data Factory pipeline: pull data from a wide variety of databases, both on-premises and in the cloud. Pipelines can be triggered on a pre-defined schedule, in response to an event, or explicitly via REST APIs.
- Azure Data Factory pipeline > Azure Data Lake Store Gen 2: stage the data copied from the relational databases. You can save the data in delimited text format or compressed as Parquet files.
- Azure Synapse PolyBase: fast ingestion into your data warehouse tables.
- Power BI dataset: data visualization. It implements a semantic model to simplify the analysis of business data and relationships.
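Since the pipeline above can be triggered explicitly via REST APIs, here is a minimal sketch of calling the Data Factory "Pipelines - Create Run" endpoint. The subscription, resource-group, factory, and pipeline names are hypothetical, and a real call needs an Azure AD bearer token; the request is built but not sent here.

```python
# Sketch: triggering an Azure Data Factory pipeline run via the REST API.
# All resource names below are made-up placeholders for illustration.
import json
import urllib.request

API_VERSION = "2018-06-01"  # ADF REST API version

def create_run_url(subscription, resource_group, factory, pipeline):
    """Build the 'Pipelines - Create Run' endpoint URL."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory}"
        f"/pipelines/{pipeline}/createRun"
        f"?api-version={API_VERSION}"
    )

def trigger_run(url, bearer_token, parameters=None):
    """Prepare the POST request (sending it requires real credentials)."""
    return urllib.request.Request(
        url,
        data=json.dumps(parameters or {}).encode(),
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

url = create_run_url("my-sub", "my-rg", "my-factory", "copy-sales-data")
```

Calling `urllib.request.urlopen(trigger_run(url, token))` would return the new run ID as JSON.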
Semi-structured data sources
- Azure Data Factory pipelines to pull data from a wide variety of semi-structured data sources, both on-premises and in the cloud: CSV files, JSON files, and NoSQL databases such as Cosmos DB or MongoDB.
- Azure Data Factory pipeline > Azure Data Lake Store Gen 2: save the original data copied from the semi-structured data source.
- Azure Data Factory Mapping Data Flows or Azure Databricks notebooks: process the semi-structured data and apply the necessary transformations before data can be used for reporting.
- Azure Synapse PolyBase: fast ingestion into your data warehouse tables.
- Power BI dataset: data visualization. It implements a semantic model to simplify the analysis of business data and relationships.
Unstructured data sources
- Azure Data Factory pipelines
- Azure Data Factory pipeline > Azure Data Lake Store Gen 2
- Azure Databricks notebooks: process the unstructured data. The notebooks can make use of Cognitive Services APIs or invoke custom Azure Machine Learning service models to generate insights from the unstructured data. You can save the resulting dataset as Parquet files in the data lake.
- Azure Synapse PolyBase
- Power BI data set
- Azure Event Hubs: ingest data streams generated by a client application. Event Hubs ingests and stores streaming data while preserving the sequence of events received. Consumers can then connect to Event Hubs and retrieve the messages for processing.
- Configure Event Hubs Capture to save a copy of the events in your data lake. This is the "cold path" of the Lambda architecture pattern; it allows you to perform historical and trend analysis on the stream data saved in your data lake using tools such as Databricks notebooks.
- Stream Analytics job: the "hot path" of the Lambda architecture pattern, deriving insights from the stream data in transit. Define at least one input for the data stream coming from your Event Hub, one query to process the input data stream, and one Power BI output to which the query results will be sent.
- Use Power BI real-time datasets and dashboard capabilities to visualize the fast-changing insights generated by your Stream Analytics query.
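The hot/cold split above can be sketched in a few lines. This is a toy in-memory stand-in for Event Hubs, not the real SDK: events are ingested in order, the cold path appends raw events to a "data lake" for later trend analysis, and the hot path keeps a running aggregate in transit. Event shapes and field names are made up.

```python
# Toy Lambda-architecture sketch: one ordered stream feeds both paths.
from collections import deque

event_stream = deque()   # stand-in for an Event Hub partition
data_lake = []           # cold path: raw history, arrival order preserved
hot_totals = {}          # hot path: per-device running sums

def ingest(event):
    """Event Hubs preserves the sequence of received events."""
    event_stream.append(event)

def process():
    while event_stream:
        event = event_stream.popleft()
        data_lake.append(event)                 # Capture -> data lake (cold)
        device = event["device"]                # aggregate in transit (hot)
        hot_totals[device] = hot_totals.get(device, 0) + event["reading"]

for i, reading in enumerate([3, 5, 2]):
    ingest({"seq": i, "device": "sensor-1", "reading": reading})
process()
```

In the real architecture, the cold-path list corresponds to files written by Event Hubs Capture and the hot-path dictionary to a Stream Analytics aggregation query.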
1. Data Ops for modern data warehouse
A modern data warehouse (MDW) lets you easily bring all of your data together at any scale. It doesn’t matter if it’s structured, unstructured, or semi-structured data. You can gain insights from an MDW through analytical dashboards, operational reports, or advanced analytics for all your users.
→ Azure Data Factory (ADF), Azure Databricks, Azure Data Lake Storage (ADLS) Gen2, Azure Synapse Analytics, Azure Key Vault, Azure DevOps, Power BI
2. Hybrid ETL with Azure Data Factory
3. Master data management with CluedIn / Profisee
4. N-tier app with Cassandra
- Network and load balancing
- Virtual network and subnets
- Application gateway (layer 7 load balancer, WAF: web application firewall)
- Load balancers
- DDoS Protection
- Azure DNS
5. Windows N-tier applications
1. Storage Account
Azure Standard Storage delivers reliable, low-cost disk support for VMs running latency-insensitive workloads. It also supports blobs, tables, queues, and files. With Standard Storage, the data is stored on hard disk drives (HDDs). When working with VMs, you can use standard SSD and HDD disks for Dev/Test scenarios and less critical workloads, and premium SSD disks for mission-critical production applications. Standard Storage is available in all Azure regions.
In addition, we recommend that you create your Azure Storage account in the same data center as your SQL Server virtual machines to reduce transfer delays. When creating a storage account, disable geo-replication as consistent write order across multiple disks is not guaranteed. Instead, consider configuring a SQL Server disaster recovery technology between two Azure Data Centers. We should not use geo-redundant storage accounts for SQL Servers.
- Failover between replicas of the database must occur without any data loss.
- The database must remain available in the event of a zone outage.
- Costs must be minimized.
Azure Premium Storage delivers high-performance, low-latency disk support for virtual machines (VMs) with input/output (I/O)-intensive workloads. VM disks that use Premium Storage store data on solid-state drives (SSDs). To take advantage of the speed and performance of premium storage disks, you can migrate existing VM disks to Premium Storage.
Types of Storage Account
- General-purpose v2 accounts [Standard, Premium]: supported access tiers
→ Blob, File, Queue, Table, Disk, and Data Lake Gen2
→ LRS, GRS, RA-GRS, ZRS, GZRS, RA-GZRS
→ Access tiers are supported only on general-purpose v2 accounts
- General-purpose v1 accounts [Standard, Premium]: legacy account type (not recommended)
- Block Blob Storage accounts [Premium]: premium performance characteristics for block blobs and append blobs. Recommended for scenarios with high transaction rates, or scenarios that use smaller objects or require consistently low storage latency.
→ Blob(Block blobs and append blobs only)
→ LRS, ZRS
- File Storage accounts [Premium]: file-only storage accounts with premium performance characteristics, for enterprise or high-performance scale applications.
→ File only
→ LRS, ZRS
- BlobStorage accounts [Standard]: legacy blob-only account type (not recommended); use general-purpose v2 instead
→ LRS, GRS, RA-GRS
Azure Blob storage is Microsoft’s object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn’t adhere to a particular data model or definition, such as text or binary data.
- Long-term retention (LTR): you can store Azure SQL Database full backups in RA-GRS (read-access geo-redundant storage) blob storage for up to 10 years. You can then restore any backup as a new database.
→ data store for payment processing.
- Data retention at the workspace level can be configured from 30 to 730 days (2 years) for all workspaces unless they use the legacy Free pricing tier. Retention for individual data types can be set as low as 4 days.
- Hot — Optimized for storing data that is accessed frequently. (can be accessed immediately from the storage account.)
- Cool — Optimized for storing data that is infrequently accessed and stored for at least 30 days. (Can be accessed immediately from the storage account.) It provides low-cost storage for infrequently accessed data.
- Archive — Optimized for storing data that is rarely accessed and stored for at least 180 days, with flexible latency requirements (on the order of hours). Data in the archive access tier is stored offline. The archive tier offers the lowest storage costs but also the highest access costs and latency.
- Block Blobs: Block blobs are optimized for uploading large amounts of data efficiently. Block blobs are comprised of blocks, each of which is identified by a block ID. A block blob can include up to 50,000 blocks.
- Page Blobs: Page blobs are a collection of 512-byte pages optimized for random read and write operations. To create a page blob, you initialize the page blob and specify the maximum size the page blob will grow.
- Append Blobs: An append blob is comprised of blocks and is optimized for append operations. When you modify an append blob, blocks are added to the end of the blob only, via the Append Block operation.
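The block-blob mechanics above can be illustrated with a small sketch: data is split into blocks, each staged with a block ID, then committed as an ordered block list. The 50,000-block limit matches the note above; the 4 MiB default block size and the ID scheme here are illustrative assumptions, not the SDK's.

```python
# Sketch of staging a block blob upload: chunk the data, assign base64 block
# IDs, enforce the 50,000-block limit, and reassemble in block order.
import base64

MAX_BLOCKS = 50_000

def split_into_blocks(data: bytes, block_size: int = 4 * 1024 * 1024):
    """Return (block_id, chunk) pairs; IDs are zero-padded indices, base64-encoded."""
    blocks = [
        (base64.b64encode(f"{i:08d}".encode()).decode(), data[off:off + block_size])
        for i, off in enumerate(range(0, len(data), block_size))
    ]
    if len(blocks) > MAX_BLOCKS:
        raise ValueError(f"a block blob can include at most {MAX_BLOCKS} blocks")
    return blocks

# Tiny block size so the example produces several blocks.
blocks = split_into_blocks(b"0123456789", block_size=4)
reassembled = b"".join(chunk for _, chunk in blocks)
```

Committing the block list in ID order is what makes the blob readable as one object, which is why block IDs must be unique within a blob.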
- Immutable storage for Azure Blob storage enables users to store business-critical data objects in a WORM (Write Once, Read Many) state. This state makes the data non-erasable and non-modifiable for a user-specified interval. For the duration of the retention interval, blobs can be created and read, but cannot be modified or deleted. Immutable storage is available for general-purpose v2 and Blob storage accounts in all Azure regions.
- Only the hot and cool access tiers can be set at the account level. The archive access tier can only be set at the blob level.
- The hot and cool tiers support all redundancy options. The archive tier supports only LRS, GRS, and RA-GRS.
- Data storage limits are set at the account level and not per access tier.
- You create an Azure Blob storage container, and you configure a time-based retention policy and lock the policy.
- Rehydrate Blob Data
To read an archived blob, first bring it back to an online tier: either change the blob's access tier (rehydrate in place) or copy the archived blob to a new blob in an online tier.
- Standard priority: the rehydration request is processed in the order it was received and may take up to 15 hours.
- High priority: the request is prioritized over Standard requests and may finish in under 1 hour for objects under 10 GB in size.
- Zone-redundant storage (ZRS) copies your data synchronously across three Azure availability zones in the primary region. (logical separation)
- Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the primary region. LRS is the least expensive replication option, but is not recommended for applications requiring high availability. (physical location)
- Multiple storage account queues: use a separate queue per application so that each additional application can read the relevant transactions.
- Storage Explorer: Microsoft Azure Storage Explorer is a standalone app that makes it easy to work with Azure Storage data on Windows, macOS, and Linux.
- Cloud-hosted file shares(Azure Files)
- managed disks: Azure Managed Disks are high-performance, highly durable block storage designed to be used with Azure Virtual Machines. You can use multiple Managed Disks with each virtual machine.
- Can store files in JSON format
- Can be queried using SQL statements
- Table Service
- Be able to store at least 1 TB of data.
- Support multiple consistency levels.
- Guaranteed speed at any scale: gain unparalleled SLA-backed (Service Level Agreement) speed and throughput, fast global access, and instant elasticity.
- Simplified application development: build fast with open-source APIs, multiple SDKs, schemaless data, and no-ETL (extract, transform, load) analytics over operational data.
- Precisely defined, multiple consistency choices: when building globally distributed applications in Cosmos DB, you no longer have to make extreme tradeoffs between consistency, availability, latency, and throughput.
- Azure Cosmos DB comes with multiple APIs: SQL API, a JSON document database service that supports SQL queries.
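The schemaless point above is easiest to see with a toy example (this is not the Cosmos SDK; the documents and the tiny filter are made up). A SQL API query such as `SELECT * FROM c WHERE c.city = 'Seoul'` runs over JSON documents that need not share a schema:

```python
# Toy illustration of SQL-style querying over schemaless JSON documents.
docs = [
    {"id": "1", "city": "Seoul"},
    {"id": "2", "city": "Busan", "vip": True},   # extra field, no schema change
    {"id": "3", "city": "Seoul", "orders": 7},   # different extra field, also fine
]

def where_eq(documents, field, value):
    """Rough equivalent of: SELECT * FROM c WHERE c.<field> = <value>."""
    return [d for d in documents if d.get(field) == value]

seoul = where_eq(docs, "city", "Seoul")
```

Documents with fields the query does not mention are simply skipped or returned as-is, which is why adding a new field never requires a migration.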
2. Hybrid Storage & Data transfer
Azure File Sync
Use Azure File Sync to centralize your organization’s file shares in Azure Files, while keeping the flexibility, performance, and compatibility of an on-premises file server. Azure File Sync transforms Windows Server into a quick cache of your Azure file share. You need an Azure file share in the same region where you want to deploy Azure File Sync.
Azure Data Share
Share data of any format and size, from multiple sources, with other organizations. Easily manage what you share, who receives your data, and the terms of use. Data Share provides a familiar interface so you can see your data-sharing relationships at a glance. Share data in just a few clicks, or build your own applications using the REST API.
- Offline bulk transfer to Azure
- Transfer small datasets (online)
- SQL Server: Data Migration Assistant
- Table, NoSQL: Azure Cosmos DB Data Migration Tool
Data Migration Service
- Data Migration Assistant: migrate the data. It supports various versions of Microsoft SQL Server.
- Azure Cosmos DB Data Migration Tool: migration of data to Cosmos DB
- AzCopy: work with data in an Azure storage account
- Data Management Gateway: build a gateway to the on-premises infrastructure
- on-premises file server to Blob Storage: an Azure Import/Export job, Azure Data factory
- The log files are generated by user activity on Apache web servers. The log files are in a consistent format. Approximately 1 GB of logs is generated per day. Microsoft Power BI is used to display weekly reports of the user activity. → Replace Azure Data Factory with CRON jobs that use AzCopy.
- AzCopy is a command-line utility that you can use to copy blobs or files to or from a storage account.
- Cron is one of the most useful utilities found in any Unix-like operating system. It is used to schedule commands to run at a specific time.
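As a concrete (hypothetical) sketch of the CRON-plus-AzCopy approach above, a crontab entry could upload the rotated Apache log once a day. The log path, storage account name, container name, and SAS token below are placeholders, not values from this scenario:

```shell
# Hypothetical crontab entry: at 01:00 daily, copy yesterday's Apache log
# to a Blob storage container. <account> and <SAS> are placeholders.
0 1 * * * azcopy copy "/var/log/apache2/access.log.1" "https://<account>.blob.core.windows.net/weblogs?<SAS>"
```

Power BI can then read the accumulated blobs for the weekly activity reports.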
3. Azure Data Platform
Automate data movement using Azure Data Factory, then load data into Azure Data Lake Storage, transform and clean it using Azure Databricks, and make it available for analytics using Azure Synapse Analytics.
Azure Data Factory
- Self-hosted integration runtime (installed server-side): lets on-premises data be accepted as a data source in Azure Data Factory.
- Data Factory requires you to set up a pipeline. A pipeline is a logical grouping of activities that together perform a task.
- To pull on-premises data, you must install the self-hosted integration runtime. The Integration Runtime is a customer-managed data integration infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments.
- Provides an ETL (extract, transform, load) service.
- Azure Data Factory is Azure's cloud ETL service for serverless, scale-out data integration and data transformation. It offers a code-free UI for intuitive authoring, plus single-pane-of-glass monitoring and management.
- You can use Copy Activity in Azure Data Factory to copy data from and to Azure Data Lake Storage Gen2, and use Data Flow to transform data in Azure Data Lake Storage Gen2.
- The integration runtime (IR) is the compute infrastructure that Azure Data Factory uses to provide data-integration capabilities across different network environments. For details about IR, see Integration runtime overview.
Azure Data Lake Storage Gen2
Data Lake Storage Gen2 allows you to easily manage massive amounts of data. These documents can be accessed by end users. You can grant users access to the documents via access control (RBAC, or POSIX-like access control lists: ACLs).
- Microsoft has engineered a powerful solution that helps customers get their data to the Azure public cloud in a cost-effective, secure, and efficient manner, with powerful Azure and machine learning capabilities in play. The solution is called Data Box. Example: importing 70 TB to Azure.
Data Box Gateway
- Data Box Gateway is a virtual device based on a virtual machine provisioned in your virtualized environment or hypervisor. The virtual device resides on-premises, and you write data to it using the NFS and SMB protocols.
Azure Stack Edge
Hardware as a Service: accelerated AI / workload design, data transfer
→ Caching for high-performance computing (HPC) workloads
- Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform.
- Azure Databricks SQL Analytics provides an easy-to-use platform for analysts who want to run SQL queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.
- Azure Databricks Workspace provides an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers.
Asynchronous message Queueing
- Queue Storage
- Service Bus queues
Azure Service Bus supports a set of cloud-based, message-oriented middleware technologies, including reliable message queuing and durable publish/subscribe messaging. These brokered messaging capabilities can be thought of as decoupled messaging features that support publish-subscribe, temporal decoupling, and load-balancing scenarios using Service Bus messaging.
5. Azure SQL Database (DB)
Azure SQL Managed Instance(MI)
Azure SQL Database Managed Instance configured for Hybrid workloads. Use this topology if your Azure SQL Database Managed Instance is connected to your on-premises network. This approach provides the most simplified network routing and yields maximum data throughput during the migration.
- Provide automatic patching and version updates to SQL server
- Provide automatic backup services
- Provide high availability
- Encrypt all data in transit
- Provide a native virtual network with private IP addressing
- Be a single-tenant environment with dedicated underlying infrastructure
- Failover between replicas of the database must occur without any data loss; the database must remain available in the event of a zone outage; costs must be minimized.
Microsoft SQL Server Integration Services (SSIS)
- On premises: SQL Server, SSIS runtime hosted by SQL server, SSIS Scale Out, Custom solutions
- On Azure: SQL Database or SQL Database Managed Instance, with the Azure-SSIS Integration Runtime, a component of Azure Data Factory. Scaling options exist for the Azure-SSIS Integration Runtime. [Migrate the packages to Azure Data Factory.]
- You need to recommend a solution that facilitates the migration while minimizing changes to the existing packages. The solution must minimize costs.
- Azure SQL database(Store the SSISDB catalog): You can’t create the SSISDB Catalog database on Azure SQL Database at this time independently of creating the Azure-SSIS Integration Runtime in Azure Data Factory. The Azure-SSIS IR is the runtime environment that runs SSIS packages on Azure.
- Azure-SQL Server Integration Service Integration Runtime and self-hosted integration runtime: implement a runtime engine for package execution.
- You need to recommend a solution to host the SSIS (Microsoft SQL Server Integration Services) packages in Azure. The solution must ensure that the packages can target the SQL Database instances as their destinations.
- To host SSIS: Data Factory
The virtual core (vCore) purchasing model used by Azure SQL Database and Azure SQL Managed Instance provides several benefits:
- Higher compute, memory, I/O, and storage limits.
- Control over the hardware generation to better match compute and memory requirements of the workload.
- Pricing discounts for Azure Hybrid Benefit (AHB) and Reserved Instance (RI). Azure Hybrid Benefit is a licensing benefit that can significantly reduce the cost of running workloads in the cloud. It lets you use your on-premises Software Assurance-enabled Windows Server and SQL Server licenses on Azure.
- Greater transparency in the hardware details that power the compute, that facilitates planning for migrations from on-premises deployments.
- Use a minimum of 2 P30 disks (1 for log files and 1 for data files, including TempDB). For workloads requiring around 50,000 IOPS, consider using an Ultra SSD.
- Avoid using operating system or temporary disks for database storage and logging.
- Enable read caching on the disks hosting the data files and TempDB data files. → Read Only for data
- Do not enable caching on disk(s) hosting the log file. Important: stop the SQL Server service when changing the cache settings for an Azure VM disk. → None for logs
Caching policy for each disk:
- Log: None
- Data: Read Only
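The caching rules above can be captured in a tiny lookup. This is only a mnemonic sketch; the role names are assumptions, not an Azure API, and the real setting is applied on the VM disk itself (e.g. when attaching it):

```python
# Minimal sketch of the SQL-on-VM disk caching guidance: read caching for
# data and TempDB disks, no caching for the transaction-log disk.
CACHING_BY_ROLE = {
    "data": "ReadOnly",     # data files benefit from read caching
    "tempdb": "ReadOnly",   # TempDB data files likewise
    "log": "None",          # never cache the log disk
}

def recommended_caching(disk_role: str) -> str:
    try:
        return CACHING_BY_ROLE[disk_role]
    except KeyError:
        raise ValueError(f"unknown disk role: {disk_role}") from None
```

Remember from the note above: stop the SQL Server service before changing a disk's cache setting.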
AzCopy only copies files, not disks.
A proper solution to replicate the disks of the virtual machines is Azure Site Recovery.
You have an Azure Storage V2 account named Storage1 and archive data to Storage1. You need to ensure that the archived data cannot be deleted for five years. The solution must prevent administrators from deleting the data.
→ Immutable blob storage is required, not a file share.
→ Time-based retention policy support: users can set policies to store data for a specified interval. When a time-based retention policy is set, blobs can be created and read, but not modified or deleted. After the retention period has expired, blobs can be deleted but not overwritten.
1. Create a new container or select an existing container to store the blobs that need to be kept in the immutable state. The container must be in a general-purpose v2 or Blob storage account.
2. Select Access policy in the container settings. Then select Add policy under Immutable blob storage.
3. To enable time-based retention, select Time-based retention from the drop-down menu.
4. Enter the retention interval in days (acceptable values are 1 to 146000 days).
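The retention semantics in the steps above can be sketched as simple date arithmetic: within the retention interval a blob can be created and read but not deleted; once the interval expires, deletion is allowed. The function and variable names are illustrative only, and the 1-146,000-day bounds come from step 4:

```python
# Sketch of time-based retention policy semantics for immutable blob storage.
from datetime import datetime, timedelta

def can_delete(created: datetime, retention_days: int, now: datetime) -> bool:
    if not 1 <= retention_days <= 146_000:    # limits from the steps above
        raise ValueError("retention interval must be 1..146000 days")
    return now >= created + timedelta(days=retention_days)

created = datetime(2020, 1, 1)
locked_for_5y = 5 * 365   # the five-year requirement from the scenario
```

Locking the policy (as in the scenario) is what prevents even administrators from shortening the interval.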
To replicate table data to a second region while minimizing costs: use tables in an Azure Storage account that uses read-access geo-redundant storage (RA-GRS).
The application will use Azure Storage queues. You need to recommend a processing solution for the application that interacts with the storage queues, is scheduled by using a CRON job, and uploads messages every five minutes. → To interact with the queue, use .NET Core.
5TB of company files that are accessed rarely. You plan to copy the files to Azure Storage. The files must be available within 24 hours of being requested. Storage costs must be minimized.
→ Create a general-purpose v1 storage account. Create a blob container and copy the files to the blob container.
→ Create a general-purpose v2 storage account that is configured for the Cool default access tier. Create a file share in the storage account and copy the files to the file share.
The application consumes data from multiple databases. Application code references database tables using a combination of the server, database, and table name. You need to migrate the application data to Azure.
→ SQL Server Stretch Database
→ SQL Managed instance
Azure Content Delivery Network
Store Content close to end users
- A content delivery network (CDN) is a distributed network of servers that can efficiently deliver web content to users. CDNs store cached content on edge servers in point-of-presence (POP) locations that are close to end users, to minimize latency.
- Azure Content Delivery Network (CDN) offers developers a global solution for rapidly delivering high bandwidth content to users by caching their content at strategically placed physical nodes across the world. Azure CDN can also accelerate dynamic content, which cannot be cached, by leveraging various network optimizations using CDN POPs. For example, route optimization to bypass Border Gateway Protocol (BGP).
Azure Redis Cache
Store Content close to the application
- Azure Cache for Redis is based on the popular software Redis. It is typically used as a cache to improve the performance and scalability of systems that rely heavily on backend data stores. Performance is improved by temporarily copying frequently accessed data to fast storage located close to the application. With Azure Cache for Redis, this fast storage is kept in memory rather than loaded from disk by a database.
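The "copy frequently accessed data to fast storage" idea above is the cache-aside pattern, sketched here with a plain dict standing in for Redis (not the Redis client; names and values are illustrative): check the in-memory store first, and only fall back to the backend data store on a miss.

```python
# Cache-aside sketch: a dict stands in for Azure Cache for Redis.
cache = {}
db_calls = 0   # counts trips to the slow backend store

def query_database(key):
    """Stand-in for a slow backend data-store read."""
    global db_calls
    db_calls += 1
    return f"value-for-{key}"

def get(key):
    if key in cache:               # cache hit: served from memory
        return cache[key]
    value = query_database(key)    # cache miss: load from the data store...
    cache[key] = value             # ...and copy it into the cache for next time
    return value

first = get("user:42")    # miss: hits the database
second = get("user:42")   # hit: served from the cache
```

A real deployment would also set an expiry on each entry so cached values do not go stale indefinitely.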