docs: refine some expressions of cbdb overview en
lijiajia committed Jun 8, 2023
1 parent 800325a commit 1b9c007
Showing 2 changed files with 17 additions and 17 deletions.
32 changes: 16 additions & 16 deletions docs/cbdb-overview.md
@@ -7,11 +7,11 @@ title: Product Overview

Cloudberry Database, built on the PostgreSQL 14.4 core (released in mid-2022), stands as one of the most advanced and mature open-source MPP databases available. With a range of high concurrency and high availability features, it is capable of efficiently handling complex tasks, managing and processing massive amounts of data. It is now widely used in multiple fields.

- Outstanding performance: Cloudberry has remarkable advantages in data storage, high concurrency, high availability, linear scalability, responsiveness, ease of use, and cost-effectiveness. In the era of big data, Cloudberry excels in processing terabyte-level datasets, surpassing Hadoop in terms of stand-alone performance.
- Strong syntax compatibility: Cloudberry offers superior usability and functionality compared to the SQL engine on Hadoop, making it more accessible for novice users.
- Comprehensive tooling: Cloudberry boasts a comprehensive suite of tools, eliminating the need for extensive tool customization. This makes it an ideal solution for large-scale data warehouse projects, saving time and effort for users.
- Flexible deployment: Cloudberry supports flexible deployment options, including the traditional hardware deployment, as well as the multi-cloud and cross-cloud deployments.
- Cloudberry provides comprehensive support for diverse data types, formats, and storage media. Its flexibility ensures that it can effectively meet the various requirements of users.
- Outstanding performance: Cloudberry Database has remarkable advantages in data storage, high concurrency, high availability, linear scalability, responsiveness, ease of use, and cost-effectiveness. In the era of big data, Cloudberry Database excels in processing terabyte-level datasets, surpassing Hadoop in terms of stand-alone performance.
- Strong syntax compatibility: Cloudberry Database offers superior usability and functionality compared to the SQL engine on Hadoop, making it more accessible for novice users.
- Comprehensive tooling: Cloudberry Database boasts a comprehensive suite of tools, eliminating the need for extensive tool customization. This makes it an ideal solution for large-scale data warehouse projects, saving time and effort for users.
- Flexible deployment: Cloudberry Database supports flexible deployment options, including the traditional hardware deployment, as well as the multi-cloud and cross-cloud deployments.
- Cloudberry Database provides comprehensive support for diverse data types, formats, and storage media. Its flexibility ensures that it can effectively meet the various requirements of users.

This document introduces the product architecture, the mechanisms of Cloudberry Database's internal modules, and what those mean to users.

@@ -27,7 +27,7 @@ To the user, Cloudberry Database looks like a full-featured Relational Database

The architecture diagram of Cloudberry Database is shown below: [Insert Architecture Diagram]

![Cloudberry architecture](./media/cbdb-arch.png)
![Cloudberry Database architecture](./media/cbdb-arch.png)

Cloudberry Database is comprised of the following components:

@@ -41,17 +41,17 @@ Cloudberry Database is comprised of the following components:

Typically, a segment host runs 2 to 8 segment instances, depending on the processor, memory, storage, network interfaces and workload. Balancing the configuration of segment hosts is critical because Cloudberry Database achieves optimal performance by evenly distributing data and workload among segments, allowing all segments to initiate and complete tasks simultaneously.

- The **interconnection** serves as the network layer in the Cloudberry Database system architecture. It refers to the network infrastructure on which the communication of the master and segments depends, using a standard Ethernet switching structure.
- The **interconnect** serves as the network layer in the Cloudberry Database system architecture. It refers to the network infrastructure on which the communication of the master and segments depends, using a standard Ethernet switching structure.

For performance reasons, a 10 GB or faster network is recommended. By default, the interconnection module communicates using the UDP protocol with flow control (UDPIFC) to send messages over the network. Cloudberry Database performs packet validation beyond what UDP offers, which means that the reliability of Cloudberry Database is equivalent to using the TCP protocol, and the performance and scalability exceed that of TCP. If the interconnection is changed to the TCP protocol instead, Cloudberry Database's scalability is limited to 1000 segments. However, this restriction does not apply when UDPIFC is used as the default protocol.
For performance reasons, a 10 GB or faster network is recommended. By default, the interconnect module communicates using the UDP protocol with flow control (UDPIFC) to send messages over the network. Cloudberry Database performs packet validation beyond what UDP offers, which means that the reliability of Cloudberry Database is equivalent to using the TCP protocol, and the performance and scalability exceed that of TCP. If the interconnect is changed to the TCP protocol instead, Cloudberry Database's scalability is limited to 1000 segments. However, this restriction does not apply when UDPIFC is used as the default protocol.

- Cloudberry Database uses Multiversion Concurrency Control (MVCC) to ensure data consistency. When querying the database, each transaction sees only a snapshot of the data, which ensures that the current transaction does not see other transactions modifying the same records. Accordingly, MVCC provides transaction isolation for every database transaction.

MVCC minimizes lock contention to ensure performance in multi-user environments by avoiding explicit locking for database transactions. One remarkable advantage of MVCC over locks is that read and write operations do not conflict, and they never block each other.
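
The snapshot behaviour described in the MVCC paragraphs can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyMvccStore`, `RowVersion`, and their fields are hypothetical names invented for this example, write-write conflicts and rollback are ignored, and nothing here reflects how Cloudberry Database actually stores row versions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RowVersion:
    value: str
    created_by: int                    # id of the transaction that wrote this version
    deleted_by: Optional[int] = None   # id of the transaction that superseded it


@dataclass
class ToyMvccStore:
    """Toy multiversion store (illustrative only, not the real engine)."""
    versions: Dict[str, List[RowVersion]] = field(default_factory=dict)
    committed: set = field(default_factory=set)
    next_txid: int = 1

    def begin(self):
        """Start a transaction and return (txid, snapshot of committed txids)."""
        txid = self.next_txid
        self.next_txid += 1
        return txid, frozenset(self.committed)

    def write(self, txid, key, value):
        """Append a new version instead of overwriting the old one."""
        chain = self.versions.setdefault(key, [])
        if chain and chain[-1].deleted_by is None:
            chain[-1].deleted_by = txid
        chain.append(RowVersion(value, created_by=txid))

    def commit(self, txid):
        self.committed.add(txid)

    def read(self, txid, snapshot, key):
        """Return the newest version visible to this transaction's snapshot."""
        def visible(writer):
            return writer == txid or writer in snapshot

        for version in reversed(self.versions.get(key, [])):
            superseded = version.deleted_by is not None and visible(version.deleted_by)
            if visible(version.created_by) and not superseded:
                return version.value
        return None


store = ToyMvccStore()
t1, _ = store.begin()
store.write(t1, "k", "a")
store.commit(t1)

reader, reader_snapshot = store.begin()   # snapshot taken before the update below
writer, _ = store.begin()
store.write(writer, "k", "b")             # the writer is not blocked by the reader
store.commit(writer)

old_view = store.read(reader, reader_snapshot, "k")   # still the old value
late, late_snapshot = store.begin()
new_view = store.read(late, late_snapshot, "k")       # sees the committed update
```

The reader keeps seeing the value from its own snapshot while the writer proceeds without waiting, which is the "reads and writes do not block each other" property in miniature.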

## Data loading

Cloudberry Database supports massive parallel and persistent data loading through external table technology, and enables automatic conversion between character sets such as GBK and UTF8. Based on the MPP architecture, the Scatter-Gather Streaming<sup>TM</sup> technology provides linear expansion of performance. CBDB supports a variety of storage media such as external file servers, Hive, Hbase, HDFS, S3 and various file formats such as CSV, Text, JSON, ORC, Parquet, and supports compressed data file loading such as Zip. CBDB is used by DataStage, Informatica, Kettle for ETL tool integration.
Cloudberry Database supports massive parallel and persistent data loading through external table technology, and enables automatic conversion between character sets such as GBK and UTF8. Based on the MPP architecture, the Scatter-Gather Streaming<sup>TM</sup> technology provides linear expansion of performance. Cloudberry Database supports a variety of storage media such as external file servers, Hive, Hbase, HDFS, S3 and various file formats such as CSV, Text, JSON, ORC, Parquet, and supports compressed data file loading such as Zip. Cloudberry Database is used by DataStage, Informatica, Kettle for ETL tool integration.

Cloudberry Database also supports streaming data loading. For subscribed Kafka Topics, it launches multiple tasks in parallel to read partition data based on the configured maximum task value. After reading the data, the records are cached until a certain time or record count is reached. They are then loaded into Cloudberry Database using gpfdist to ensure data integrity and prevent duplication or loss. This capability is ideal for scenarios involving streaming data collection and real-time analysis. Cloudberry Database supports a data loading throughput of tens of millions of records per minute.
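
The "cache until a certain time or record count is reached" step can be sketched as a small micro-batcher. This is an assumption-laden illustration, not the actual loader: `MicroBatcher`, `flush_fn`, `max_records`, and `max_seconds` are hypothetical names, and the Kafka consumer and the gpfdist hand-off are deliberately left out.

```python
import time


class MicroBatcher:
    """Buffer records and flush when a size or age threshold is reached.

    Illustrative only: the flush callback stands in for whatever actually
    loads the batch into the database.
    """

    def __init__(self, flush_fn, max_records=1000, max_seconds=5.0,
                 clock=time.monotonic):
        self._flush_fn = flush_fn
        self._max_records = max_records
        self._max_seconds = max_seconds
        self._clock = clock
        self._buffer = []
        self._opened_at = None  # time the first record of the batch arrived

    def add(self, record):
        if not self._buffer:
            self._opened_at = self._clock()
        self._buffer.append(record)
        self.poll()

    def poll(self):
        """Flush if the batch is full or has been open for too long."""
        if not self._buffer:
            return
        too_many = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._opened_at >= self._max_seconds
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._opened_at = None
            self._flush_fn(batch)
```

A caller would invoke `add` for every consumed record and `poll` periodically, so that a slow trickle of records is still flushed once the time limit expires.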

@@ -68,7 +68,7 @@ Replication Table can be used for small tables, and users can specify custom Has
- Row storage: fast update speed, frequent query for most fields, more random row access.
- Columnar storage: Few field queries, significant savings in I/O operations, and frequent access to large data volumes.

Cloudberry can design the storage mode according to the application type, down to the finest granularity to partition, to achieve a table with multiple storage modes to optimize access performance. When a query is executed, the Cloudberry Database optimizer generates the corresponding optimal query plan based on statistical information and the storage form used by the user, without user intervention.
Cloudberry Database can design the storage mode according to the application type, down to the finest granularity to partition, to achieve a table with multiple storage modes to optimize access performance. When a query is executed, the Cloudberry Database optimizer generates the corresponding optimal query plan based on statistical information and the storage form used by the user, without user intervention.

![Row storage and column storage](./media/cbdb-row-col-storage.png)

@@ -77,9 +77,9 @@ Data compression can improve data processing performance. The compression ratio
- Zlib 1-9: high compression ratio, occupies more CPU resources, suitable for scenarios with strong CPU computing power.
- Zstandard 1~19: achieve the balance between CPU and compression ratio.
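
To get a feel for the level trade-off listed above, the following sketch compresses the same sample at several zlib levels using Python's standard `zlib` module. It is only an illustration of compression levels in general: the sample data and the helper name `compare_zlib_levels` are made up for this example, and it does not exercise Cloudberry Database's own table compression settings.

```python
import zlib


def compare_zlib_levels(data: bytes, levels=(1, 6, 9)):
    """Return {level: compressed_size_in_bytes} for the given zlib levels."""
    sizes = {}
    for level in levels:
        compressed = zlib.compress(data, level)
        # Compression must be lossless regardless of the level chosen.
        assert zlib.decompress(compressed) == data
        sizes[level] = len(compressed)
    return sizes


# Hypothetical, highly repetitive sample resembling log-style rows.
sample = b"2023-06-08,segment-01,INFO,query finished\n" * 5000

for level, size in compare_zlib_levels(sample).items():
    print(f"zlib level {level}: {len(sample)} -> {size} bytes")
```

Higher levels generally spend more CPU time searching for matches; how much extra space they save depends heavily on the data, so measuring on a representative sample is more reliable than assuming a fixed ratio.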

Data security is also very important. Cloudberry supports multiple databases, and data is not shared between databases. Cross-database access can be performed through DBLink. The logical organization of data within the database includes multiple types of data objects such as tables, views, indexes, and functions. Data access is supported across schemas.
Data security is also very important. Cloudberry Database supports multiple databases, and data is not shared between databases. Cross-database access can be performed through DBLink. The logical organization of data within the database includes multiple types of data objects such as tables, views, indexes, and functions. Data access is supported across schemas.

In terms of storage security, Cloudberry supports different storage modes, data redundancy and data encryption (AES 128, AES 192, AES 256, DES and national secret encryption). Cloudberry supports ciphertext authentication and various encryption algorithms such as SCRAM-SHA-256, MD5, LDAP, RADIUS. For different users, set various types of permissions on different levels of objects (such as schema, table, row, column, view and function). The permissions that can be set include select, update, execution, ownership and more.
In terms of storage security, Cloudberry Database supports different storage modes, data redundancy and data encryption (AES 128, AES 192, AES 256, DES and national secret encryption). Cloudberry Database supports ciphertext authentication and various encryption algorithms such as SCRAM-SHA-256, MD5, LDAP, RADIUS. For different users, set various types of permissions on different levels of objects (such as schema, table, row, column, view and function). The permissions that can be set include select, update, execution, ownership and more.

## Data analysis

@@ -89,15 +89,15 @@ In addition, Cloudberry Database integrates a large number of rich analysis comp

- Machine learning component MADlib on Cloudberry Database: SQL-driven, algorithm + computing power + data.
- PL language. Developers can write user-defined functions in R, Python, Perl, Java, PostgreSQL, and other languages.
- Based on the MPP engine, CBDB realizes high-performance, parallel computing, seamlessly integrates with SQL, and executes calculation and analysis on SQL results.
- Based on the MPP engine, Cloudberry Database realizes high-performance, parallel computing, seamlessly integrates with SQL, and executes calculation and analysis on SQL results.
- PostGIS, based on PostGIS 2.X with enterprise-level improvements, supports Cloudberry Database MPP architecture, integrates object storage, supports large-capacity objects, supports all spatial data types (such as geometry, geography, Raster), supports spatio-temporal index, and supports complex spatial and geographic location calculations, sphere length calculations, and spatial aggregation functions (such as contain, cover, intersect).
- The Cloudberry Database Text component supports accelerated document retrieval capabilities via ElasticSearch, which has significantly improved the performance of traditional GIN data text query by an order of magnitude, and supports multiple word segmentation, natural language processing, and query result rendering capabilities.

## Flexible workload management

- Connection pool PGBouncer (Connection level, supports high concurrency of Cloudberry clusters at the connection level): the database side manages sessions in a unified manner, controls how many users can access at the same time, avoids frequent creation and destruction of service processes. PGBouncer occupies small memory, supports high concurrency, and uses libevent for socket communication to achieve higher efficiency.
- Resource group (Session level, quantitative control of Cloudberry cluster resources at the session level): Sorts out workloads, analyzes the CPU and memory of the load, and publishes requirements, sets Resource Group based on workload analysis, monitors GP operation, dynamically adjusts RS, and uses rules to clean up idle sessions.
- Dynamically allocating resource groups (Query level, dynamically adjust CBDB cluster resources at the SQL level): before or during the execution of SQL statements, dynamically implement flexible and dynamic allocation of resources, give priority to specific queries to shorten their running time.
- Connection pool PGBouncer (Connection level, supports high concurrency of Cloudberry Database clusters at the connection level): the database side manages sessions in a unified manner, controls how many users can access at the same time, avoids frequent creation and destruction of service processes. PGBouncer occupies small memory, supports high concurrency, and uses libevent for socket communication to achieve higher efficiency.
- Resource group (Session level, quantitative control of Cloudberry Database cluster resources at the session level): Sorts out workloads, analyzes the CPU and memory of the load, and publishes requirements, sets Resource Group based on workload analysis, monitors GP operation, dynamically adjusts RS, and uses rules to clean up idle sessions.
- Dynamically allocating resource groups (Query level, dynamically adjust Cloudberry Database cluster resources at the SQL level): before or during the execution of SQL statements, dynamically implement flexible and dynamic allocation of resources, give priority to specific queries to shorten their running time.

## Highly compatible with third-party products

Original file line number Diff line number Diff line change
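@@ -41,7 +41,7 @@ HashData is comprised of the following components:

Typically, a segment host runs 2 to 8 segment instances, depending on the processor, memory, storage, network interfaces and workload. Segment hosts need a balanced configuration, because the key to optimal Cloudberry Database performance is distributing data and workload evenly among segments, so that all segments start a task at the same time and finish at the same time.

- The **internal interconnect (Interconnection)** is the network layer in the Cloudberry Database system architecture. It refers to the network infrastructure on which communication between the master and the segments depends, using a standard Ethernet switching structure.
- The **internal interconnect (Interconnect)** is the network layer in the Cloudberry Database system architecture. It refers to the network infrastructure on which communication between the master and the segments depends, using a standard Ethernet switching structure.

For performance reasons, a 10 GB or faster network is recommended. By default, the interconnect module communicates using the UDP protocol with flow control (UDPIFC) to send messages over the network. Cloudberry Database performs packet validation beyond what UDP provides, which means that reliability is equivalent to using the TCP protocol, while performance and scalability exceed those of TCP. If the interconnect is switched to the TCP protocol, Cloudberry Database's scalability is limited to 1000 segments. This limit does not apply when UDPIFC is used as the default protocol.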
