From 5af098a0dbe05cd259605973ba73c586fd70c7ae Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 9 Jun 2023 11:17:53 +0800
Subject: [PATCH 01/21] docs: add cbdb feature overview and architecture
---
docs/cbdb-architecture.md | 49 ++++++
docs/cbdb-overview.md | 50 +-----
.../current/cbdb-architecture.md | 39 +++++
.../current/cbdb-overview.md | 149 +++++++++++-------
4 files changed, 180 insertions(+), 107 deletions(-)
create mode 100644 docs/cbdb-architecture.md
create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
diff --git a/docs/cbdb-architecture.md b/docs/cbdb-architecture.md
new file mode 100644
index 000000000..88b33f40c
--- /dev/null
+++ b/docs/cbdb-architecture.md
@@ -0,0 +1,49 @@
+---
+id: cbdb-overview
+title: Cloudberry Database Overview
+---
+
+Learn about Cloudberry Database in just a few minutes.
+
+## Overview
+
+Cloudberry Database is an enterprise-level cloud-native database product that's highly elastic, performant, available, and cost-effective. It helps enterprise users handle and analyze large datasets of terabytes and petabytes with ease. HashData's vision is to break down the entry barriers for enterprises to build big data systems and bring out the complete potential of big data resources.
+
+Cloudberry Database's core technical features include:
+
+- Cloud-native architecture, without the architectural constraints seen in traditional MPP databases.
+- The computing engine is independent of data storage and metadata management, which provides
+multi-dimensional flexibility.
+- Second-level scaling and minute-level node self-healing and recovery, simplifying maintenance tasks.
+
+- Supports online upgrades and multi-active deployment.
+
+- Complete database capabilities, transactional consistency, and full compatibility with PostgreSQL and the
+Greenplum database.
+- Supports mainstream analytical tools for machine learning, graph computation, and spatiotemporal analysis.
+- Easily integrates with common ETL and BI tools.
+- Supports hybrid and integrated data warehouse and data lake solutions.
+
+Cloudberry Database consists of two main modules: a user module and a management module. The top layer of the user module is an independent metadata service layer, the middle is a stateless computing layer, and the bottom layer is a shared data storage layer. The management module incorporates the management console (Cloud Manager) that manages all metadata clusters and computing clusters, including cluster creation, startup and shutdown, and resource management, monitoring, and alerts.
+
+## Advantages
+
+- Cloud-native architecture: Cloudberry Database is built from scratch to leverage cloud computing advantages, ensuring high elasticity and availability by separating storage, computation, and metadata. This approach eliminates the architectural constraints of traditional MPP databases and delivers a robust big data platform.
+- Multi-dimensional elastic scaling: Cloudberry Database scales storage and compute resources independently, so throughput, response time, and data capacity can each be adjusted by scaling horizontally, vertically, or in storage.
+- Comprehensive database capability: Cloudberry Database supports UTF-8, GBK, and other encoding formats, multi-tenant management, relational data models, and standard SQL syntax. It provides strongly consistent ACID transactions, partition management, and popular interfaces such as JDBC and ODBC.
+- Rich analytical features: Cloudberry Database inherits advanced analytical features from PostgreSQL and Greenplum Database and has been customized for cloud platforms to support distributed machine learning and spatiotemporal analysis. It supports the ANSI SQL 2008 standard and the SQL 2003 OLAP extensions, as well as procedural languages including PL/pgSQL, PL/C, PL/Python, PL/Java, and PL/R.
+- Cloudberry Database natively supports Apache MADlib (an in-database machine learning library based on SQL) and PostGIS.
+- ETL and BI tool integration: Cloudberry Database integrates easily with many ETL and BI tools commonly used in the database industry.
+
+## Future work
+
+The open-source work on Cloudberry Database is ongoing. Detailed, well-organized documentation will be available in the coming months.
+
+You may visit the following channels to stay connected:
+
+- Website: [http://cloudberrydb.org](http://cloudberrydb.org)
+- Twitter: [http://twitter.com/cloudberrydb](http://twitter.com/cloudberrydb)
+- GitHub: [http://github.com/cloudberrydb](http://github.com/cloudberrydb)
+- Slack: [@cloudberrydb](https://communityinviter.com/apps/cloudberrydb/welcome)
+
+Thanks!
diff --git a/docs/cbdb-overview.md b/docs/cbdb-overview.md
index 88b33f40c..30404ce4c 100644
--- a/docs/cbdb-overview.md
+++ b/docs/cbdb-overview.md
@@ -1,49 +1 @@
----
-id: cbdb-overview
-title: Cloudberry Database Overview
----
-
-Learn about Cloudberry Database in just a few minutes.
-
-## Overview
-
-Cloudberry Database is an enterprise-level cloud-native database product that's highly elastic, performant, available, and cost-effective. It helps enterprise users handle and analyze large datasets of terabytes and petabytes with ease. HashData's vision is to break down the entry barriers for enterprises to build big data systems and bring out the complete potential of big data resources.
-
-Cloudberry Database's core technical features include:
-
-- Cloud-native architecture, without the architectural constraints seen in traditional MPP databases.
-- The computing engine is independent of data storage and metadata management, which provides
-multi-dimensional flexibility.
-- Second-level scaling and minute-level node self-healing and recovery, simplifying maintenance tasks.
-
-- Supports online upgrades and multi-active deployment.
-
-- Complete database capabilities, transactional consistency, and full compatibility with PostgreSQL and the
-Greenplum database.
-- Supports mainstream analytical tools for machine learning, graph computation, and spatiotemporal analysis.
-- Easily integrates with common ETL and BI tools.
-- Supports hybrid and integrated data warehouse and data lake solutions.
-
-Cloudberry Database consists of two main modules: a user module and a management module. The top layer of the user module is an independent metadata service layer, the middle is a stateless computing layer, and the bottom layer is a shared data storage layer. The management module incorporates the management console (Cloud Manager) that manages all metadata clusters and computing clusters, including cluster creation, startup and shutdown, and resource management, monitoring, and alerts.
-
-## Advantages
-
-- Cloud-native architecture: Cloudberry Database is built from scratch to leverage cloud computing advantages, ensuring high elasticity and availability by separating storage, computation, and metadata. This approach eliminates the architectural constraints of traditional MPP databases and delivers a robust big data platform.
-- Multi-dimensional elastic scaling: Cloudberry Database scales storage and compute resources independently, so throughput, response time, and data capacity can each be adjusted by scaling horizontally, vertically, or in storage.
-- Comprehensive database capability: Cloudberry Database supports UTF-8, GBK, and other encoding formats, multi-tenant management, relational data models, and standard SQL syntax. It provides strongly consistent ACID transactions, partition management, and popular interfaces such as JDBC and ODBC.
-- Rich analytical features: Cloudberry Database inherits advanced analytical features from PostgreSQL and Greenplum Database and has been customized for cloud platforms to support distributed machine learning and spatiotemporal analysis. It supports the ANSI SQL 2008 standard and the SQL 2003 OLAP extensions, as well as procedural languages including PL/pgSQL, PL/C, PL/Python, PL/Java, and PL/R.
-- Cloudberry Database natively supports Apache MADlib (an in-database machine learning library based on SQL) and PostGIS.
-- ETL and BI tool integration: Cloudberry Database integrates easily with many ETL and BI tools commonly used in the database industry.
-
-## Future work
-
-The open-source work on Cloudberry Database is ongoing. Detailed, well-organized documentation will be available in the coming months.
-
-You may visit the following channels to stay connected:
-
-- Website: [http://cloudberrydb.org](http://cloudberrydb.org)
-- Twitter: [http://twitter.com/cloudberrydb](http://twitter.com/cloudberrydb)
-- GitHub: [http://github.com/cloudberrydb](http://github.com/cloudberrydb)
-- Slack: [@cloudberrydb](https://communityinviter.com/apps/cloudberrydb/welcome)
-
-Thanks!
+TODO
\ No newline at end of file
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
new file mode 100644
index 000000000..03fbcce28
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
@@ -0,0 +1,39 @@
+---
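+title: Architecture
+---
+
+# Cloudberry Database Architecture
+
+This document describes the product architecture of Cloudberry Database and the implementation mechanisms of its internal modules.
+
+In most cases, Cloudberry Database is very similar to PostgreSQL in terms of SQL support, features, configuration options, and end-user functionality. Interacting with Cloudberry Database feels much like interacting with a single-node PostgreSQL instance.
+
+Cloudberry Database uses a massively parallel processing (MPP) architecture, which stores and processes large volumes of data by distributing the data and the processing workload across multiple servers or hosts.
+
+MPP, also known as a shared-nothing architecture, refers to a system in which multiple hosts cooperate to carry out an operation. Each host has its own processors, memory, disks, network resources, and operating system. Cloudberry Database uses this high-performance architecture to distribute the load of very large datasets and can use all of the system's resources in parallel to process a query.
+
+From the user's perspective, Cloudberry Database is a complete relational database management system (RDBMS). Physically, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances divide the work and cooperate, Cloudberry Database distributes data storage, computation, communication, and management across the cluster at different layers. Although Cloudberry Database is a cluster, it hides all the details of distribution and presents a single logical database to users. This encapsulation greatly reduces the workload of developers and operations staff.
+
+The architecture of Cloudberry Database is shown in the following diagram: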
+
+
+
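+HashData consists of the following components:
+
+- **Control node (Master)** is the entry point to the Cloudberry Database system. It accepts client connections and SQL queries and distributes work to the data node instances. Users interact with Cloudberry Database by connecting to the control node with a client program (such as psql) or an application programming interface (API) such as JDBC, ODBC, or libpq (the PostgreSQL C API).
+ - The control node is where the global system catalog resides. The global system catalog is a set of system tables that contain metadata about the Cloudberry Database system itself.
+ - The control node does not contain any user data; data resides only on the data node instances.
+ - The control node authenticates client connections, processes incoming SQL commands, distributes the workload among the data nodes, coordinates the results returned by each data node, and presents the final results to the client program.
+ - Cloudberry Database uses write-ahead logging (WAL) for control node/standby mirroring. With WAL-based logging, all modifications are written to the log before being written to disk, which ensures data integrity for any in-process operation.
+
+- **Data node (Segment)** instances are independent Postgres processes, each of which stores a portion of the data and performs the corresponding portion of query processing. When a user connects to the database through the control node and submits a query, processes are created on each data node to handle it. User-defined tables and their indexes are distributed across all available data nodes in Cloudberry Database; each data node holds a distinct portion of the data, and the processes that handle each portion run on the corresponding data node. Users interact with data nodes through the control node. Data nodes run on servers called data node hosts.
+
+ A data node host typically runs 2 to 8 data nodes, depending on its processors, memory, storage, network interfaces, and workload. Data node hosts need a balanced configuration, because the key to the best performance in Cloudberry Database is to distribute data and workload evenly across the data nodes so that all data nodes start working on a task at the same time and finish at the same time.
+
+- **Interconnect** is the networking layer of the Cloudberry Database architecture. It refers to the network infrastructure on which communication between the control node and the data nodes relies, and it uses a standard Ethernet switching fabric.
+
+ For performance reasons, a 10-Gigabit or faster network is recommended. By default, the interconnect uses the UDP protocol with flow control (UDPIFC) to send messages over the network. Cloudberry Database performs packet verification beyond what UDP provides, which means that reliability is equivalent to that of TCP, while performance and scalability exceed those of TCP. If the interconnect is changed to use TCP, the scalability of Cloudberry Database is limited to 1000 data nodes. With UDPIFC as the default protocol, this limit does not apply.
+
+- Cloudberry Database uses multiversion concurrency control (MVCC) to guarantee data consistency. This means that when querying the database, each transaction sees only a snapshot of the data, which ensures that the current transaction does not see changes made by other transactions to the same records. This provides transaction isolation for every database transaction.
+
+ By avoiding explicit locking for database transactions, MVCC minimizes lock contention to ensure performance in multi-user environments. The main advantage of MVCC over locking for concurrency control is that locks acquired for querying (reading) data do not conflict with locks acquired for writing data, so reads and writes never block each other.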
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index 9cb96cefa..c73aeebc7 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -1,11 +1,10 @@
---
-id: cbdb-overview
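-title: Product Introduction
+title: Feature and Scenario Overview
---
-# Introduction to Cloudberry Database
+# Cloudberry Database Feature and Scenario Overview
-Built on the latest PostgreSQL14.4 kernel (released in mid-2022), Cloudberry Database is one of the most advanced and mature open-source MPP databases available. It offers features such as high concurrency and high availability, and can run complex tasks quickly and efficiently to meet the needs of managing and computing over massive amounts of data. It is widely used in many fields.
+Built on the latest PostgreSQL 14.4 kernel, Cloudberry Database is one of the most advanced and mature open-source MPP databases available. It offers features such as high concurrency and high availability, and can run complex tasks quickly and efficiently to meet the needs of managing and computing over massive amounts of data. It is widely used in many fields.
- Excellent performance: Cloudberry has significant advantages in data storage, high concurrency, high availability, linear scalability, response speed, ease of use, and cost-effectiveness. In the big data era, Cloudberry performs well when processing terabyte-scale data, and its single-node performance is clearly better than that of Hadoop.
- Strong syntax compatibility: In terms of features and syntax, it is much easier to use than Hive, the SQL engine on Hadoop, so ordinary users can get started more easily.
@@ -13,99 +12,133 @@ Built on the latest PostgreSQL14.4 kernel (released in mid-2022)
- Flexible deployment: Cloudberry supports flexible deployment options, including traditional hardware deployment as well as multi-cloud and cross-cloud deployment.
- It provides comprehensive support for different data types, data formats, and storage media. This multi-level flexibility also lets Cloudberry Database better meet users' varied needs.
-This manual focuses on the product architecture of Cloudberry Database, the implementation mechanisms of its internal modules, and what they mean for users.
+This document describes the main features and main use cases of Cloudberry Database.
-## Product architecture
+## Main features
-In most cases, Cloudberry Database is very similar to PostgreSQL in terms of SQL support, features, configuration options, and end-user functionality. Interacting with Cloudberry Database feels much like interacting with a single-node PostgreSQL instance.
+### Efficient queries across scenarios
-Cloudberry Database uses a massively parallel processing (MPP) architecture, which stores and processes large volumes of data by distributing the data and the processing workload across multiple servers or hosts.
+Cloudberry Database is committed to providing efficient data query services. Whether in large-scale data analysis scenarios or in highly distributed environments, our solution delivers optimal query performance for you. This is made possible by the two flexible query optimization approaches in Cloudberry Database and a range of built-in optimization techniques.
-MPP, also known as a shared-nothing architecture, refers to a system in which multiple hosts cooperate to carry out an operation. Each host has its own processors, memory, disks, network resources, and operating system. Cloudberry Database uses this high-performance architecture to distribute the load of very large datasets and can use all of the system's resources in parallel to process a query.
+Cloudberry Database provides two optimizers to support effective queries in different scenarios:
-From the user's perspective, Cloudberry Database is a complete relational database management system (RDBMS). Physically, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances divide the work and cooperate, Cloudberry Database distributes data storage, computation, communication, and management across the cluster at different layers. Although Cloudberry Database is a cluster, it hides all the details of distribution and presents a single logical database to users. This encapsulation greatly reduces the workload of developers and operations staff.
+- **PostgreSQL-based optimizer**: This optimizer is built into Cloudberry Database and has been specially tuned to better support distributed environments. This means it can produce more efficient query plans when handling big data analysis tasks.
+- **GPORCA optimizer**: This is an open-source optimizer that has been specifically adapted to meet query optimization needs in distributed environments.
-The architecture of Cloudberry Database is shown in the following diagram:
+The techniques provided by Cloudberry Database include static and dynamic partition pruning, aggregate pushdown, join filtering, and a vectorized execution engine, all of which aim to give you the fastest and most accurate query results. In addition, with database administrators in mind, we provide rule-based logical query optimization and a strategy for generating the lowest-cost query path, so that you can flexibly adjust query plans according to your actual needs.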
-
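+### Polymorphic data storage
-HashData consists of the following components:
+Cloudberry Database can efficiently and flexibly meet the data storage needs of many application types and business requirements. The main data storage features of Cloudberry Database are as follows:
-- **Control node (Master)** is the entry point to the Cloudberry Database system. It accepts client connections and SQL queries and distributes work to the data node instances. Users interact with Cloudberry Database by connecting to the control node with a client program (such as psql) or an application programming interface (API) such as JDBC, ODBC, or libpq (the PostgreSQL C API).
- - The control node is where the global system catalog resides. The global system catalog is a set of system tables that contain metadata about the Cloudberry Database system itself.
- - The control node does not contain any user data; data resides only on the data node instances.
- - The control node authenticates client connections, processes incoming SQL commands, distributes the workload among the data nodes, coordinates the results returned by each data node, and presents the final results to the client program.
- - Cloudberry Database uses write-ahead logging (WAL) for control node/standby mirroring. With WAL-based logging, all modifications are written to the log before being written to disk, which ensures data integrity for any in-process operation.
+- **Even data distribution**: Data is distributed by hash or randomly, which makes better use of disk performance and addresses I/O bottlenecks.
+- **A choice of storage types**:
-- **Data node (Segement)** instances are independent Postgres processes, each of which stores a portion of the data and performs the corresponding portion of query processing. When a user connects to the database through the control node and submits a query, processes are created on each data node to handle it. User-defined tables and their indexes are distributed across all available data nodes in Cloudberry Database; each data node holds a distinct portion of the data, and the processes that handle each portion run on the corresponding data node. Users interact with data nodes through the control node. Data nodes run on servers called data node hosts.
+ - Row-oriented storage: suitable when most columns are queried frequently and random row access is common.
+ - Column-oriented storage: when you need to query only a few columns, this greatly reduces I/O, which makes it well suited to scenarios with frequent access to large volumes of data.
- A data node host typically runs 2 to 8 data nodes, depending on its processors, memory, storage, network interfaces, and workload. Data node hosts need a balanced configuration, because the key to the best performance in Cloudberry Database is to distribute data and workload evenly across the data nodes so that all data nodes start working on a task at the same time and finish at the same time.
+- **Specialized storage modes**: Cloudberry Database offers different storage modes, such as Heap storage, AO row storage, and AOCS column storage, to optimize performance for various application types. The storage mode can be chosen at a granularity as fine as the partition, so a single table can use multiple storage modes.
+- **Partitioned table support**: You can define how a table is partitioned based on specific conditions. At query time, the system automatically skips child tables that do not need to be scanned, which improves query efficiency.
+- **Efficient data compression**: Multiple compression algorithms are supported, such as Zlib levels 1-9 and Zstandard levels 1-19, to improve data processing performance while balancing CPU usage against compression ratio.
+- **Optimization for small tables**: You can choose to use a Replication Table, and you can specify a custom hash algorithm when creating a table to control data distribution more flexibly.
-- **Interconnect** is the networking layer of the Cloudberry Database architecture. It refers to the network infrastructure on which communication between the control node and the data nodes relies, and it uses a standard Ethernet switching fabric.
+Whichever storage mode you choose, the Cloudberry Database optimizer automatically generates the optimal query plan based on the storage format you use and on statistics. In addition, all storage modes support data compression, and different columns of the same table can use different compression algorithms.
- For performance reasons, a 10-Gigabit or faster network is recommended. By default, the interconnect uses the UDP protocol with flow control (UDPIFC) to send messages over the network. Cloudberry Database performs packet verification beyond what UDP provides, which means that reliability is equivalent to that of TCP, while performance and scalability exceed those of TCP. If the interconnect is changed to use TCP, the scalability of Cloudberry Database is limited to 1000 data nodes. With UDPIFC as the default protocol, this limit does not apply.
+As a result, Cloudberry Database provides an efficient, flexible solution for data storage and processing that can meet a wide range of business needs.
-- Cloudberry Database uses multiversion concurrency control (MVCC) to guarantee data consistency. This means that when querying the database, each transaction sees only a snapshot of the data, which ensures that the current transaction does not see changes made by other transactions to the same records. This provides transaction isolation for every database transaction.
+### Multi-layer data security
- By avoiding explicit locking for database transactions, MVCC minimizes lock contention to ensure performance in multi-user environments. The main advantage of MVCC over locking for concurrency control is that locks acquired for querying (reading) data do not conflict with locks acquired for writing data, so reads and writes never block each other.
+Cloudberry Database places strong emphasis on data security and provides comprehensive protection measures. These security features are designed to meet the needs of various database environments and provide multiple layers of protection, including: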
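-## Data loading
+- **Database isolation**: In Cloudberry Database, data is not shared between databases, which isolates databases from one another in a multi-database environment. If cross-database access is needed, you can use the DBLink feature.
-Through external table technology, Cloudberry Database supports large-scale parallel and continuous data loading, with automatic conversion between character sets such as GBK and UTF8. Because it is built on the MPP architecture, Scatter-Gather Streaming™ technology provides linear performance scaling. It supports multiple storage media, including external file servers, Hive, Hbase, HDFS, and S3, and multiple file formats, including CSV, Text, JSON, ORC, and Parquet. It supports loading compressed data files such as Zip, and is integrated with multiple ETL tools such as DataStage, Informatica, and Kettle.
+- **Internal data organization**: Within a database, data is logically organized into various data objects, such as tables, views, indexes, and functions, and data can be accessed across schemas.
-Cloudberry Database also supports streaming data loading. For subscribed Kafka topics, it starts multiple tasks, up to the configured maximum number of tasks, to read partition data in parallel. Records are cached after being read and, after a certain time or number of records, loaded into Cloudberry Database through gpfdist, which ensures that data is neither duplicated nor lost. This is used in streaming data collection and real-time analysis scenarios, and supports loading throughput of tens of millions of records per minute.
+- **Strong data storage security**: Cloudberry Database provides different storage modes to support data redundancy and uses various encryption methods (including AES 128, 192, and 256, DES, and Chinese national cryptographic algorithms) to secure stored data. It also supports ciphertext authentication, with methods including SCRAM-SHA-256, MD5, LDAP, and RADIUS.
-PXF is a built-in component of Cloudberry Database that maps external data sources to Cloudberry Database external tables, enabling a Data Fabric architecture. Based on the MPP engine, it provides parallel, high-speed data access and supports the management of and access to hybrid data ecosystems.
+- **User data protection**: Cloudberry Database provides function-based encryption and decryption as well as transparent data encryption and decryption. Transparent encryption and decryption is performed by the Cloudberry Database kernel and requires no action from the user. Supported data formats include Heap tables, AO row storage, and AOCS column storage. In addition to common encryption algorithms such as AES, Chinese national cryptographic algorithms are supported, and users can easily plug their own algorithms into transparent data encryption.
-
+- **Fine-grained permission settings**: To meet the permission needs of different users and objects at different levels (for example, schemas, tables, rows, columns, views, and functions), Cloudberry Database provides a rich set of permission options, including `SELECT`, `UPDATE`, execute permission, ownership, and so on.
-## Data storage and security
+In this way, Cloudberry Database provides comprehensive safeguards for data security. From storage security to user data protection and permission settings, it can meet a wide range of security requirements.
-Parallel computation in Cloudberry Database relies on data being evenly distributed in the storage layer; even data distribution is the key to parallel processing. Cloudberry Database provides two ways to distribute data in the storage layer, Hash and Random, which ensure that data is spread evenly across every disk so that each disk's performance is used, fundamentally solving I/O bottlenecks. Cloudberry Database also provides more flexible distribution methods.
+### Data loading
-For small tables, a Replication Table can be used. Users can specify a custom hash algorithm when creating a table to flexibly control data distribution. Both row-oriented and column-oriented storage are supported.
+Cloudberry Database provides a range of efficient and flexible data loading solutions to meet various data processing needs, including:
-- Row-oriented storage: fast updates; suitable when most columns are queried frequently and random row access is common.
-- Column-oriented storage: suitable for queries on a few columns; greatly reduces I/O; suited to frequent access to large volumes of data.
+- **Parallel and persistent data loading**: Through external table technology, Cloudberry Database supports large-scale parallel and persistent data loading, with automatic conversion between character sets, for example from GBK to UTF8. This makes data ingestion smoother.
-Cloudberry lets you design storage modes by application type, at a granularity as fine as the partition, so that a single table can use multiple storage modes for optimal access performance. At query execution time, the Cloudberry Database optimizer generates the optimal query plan based on the storage format in use and on statistics, with no user intervention required.
+- **Flexible support for data sources and file formats**: Whether data is stored on an external file server, Hive, Hbase, HDFS, S3, or other storage media, and whether it is in CSV, Text, JSON, ORC, Parquet, or other file formats, Cloudberry Database can load it. It can also load compressed data files such as Zip.
-
+- **Integration with multiple ETL tools**: Cloudberry Database has been integrated with multiple ETL tools, such as DataStage, Informatica, and Kettle, which makes data processing more convenient.
-Data compression improves data processing performance. The compression ratio depends on the compression algorithm and the data content; for mobile signaling, call detail records, and clickstream data, the compression ratio can exceed 20x. All storage modes support compression, and different columns of the same table can use different compression algorithms. Cloudberry Database provides multiple compression algorithms:
+- **Streaming data loading**: For subscribed Kafka topics, Cloudberry Database can start multiple parallel read tasks, cache the records that are read, and, after a certain time or number of records is reached, load them into the database through gpfdist. This approach ensures data integrity, with no duplicates and no loss, and is well suited to streaming data collection and real-time analysis scenarios. It supports loading throughput of tens of millions of records per minute.
-- Zlib 1-9: high compression ratio with higher CPU usage; suitable for scenarios with strong CPU capacity.
-- Zstandard 1~19: balances CPU usage and compression ratio.
+- **High-performance data access**: PXF is a built-in component of Cloudberry Database that maps external data sources to Cloudberry Database external tables, enabling parallel, high-speed data access. PXF supports the management of and access to hybrid data ecosystems and helps implement a Data Fabric architecture.
-Data security is also very important. Cloudberry supports multiple databases; data is not shared between databases, and cross-database access is available through DBLink. Within a database, data is logically organized into many kinds of data objects, such as tables, views, indexes, and functions, and data can be accessed across schemas.
+These features enable Cloudberry Database to provide efficient and stable data loading solutions across different data environments.
-For storage security, it supports different storage modes, data redundancy, and data encryption with AES 128, 192, and 256, DES, and Chinese national cryptographic algorithms. It supports ciphertext authentication with methods such as SCRAM-SHA-256, MD5, LDAP, and RADIUS. For different users, multiple types of permissions can be set on objects at different levels (such as schemas, tables, rows, columns, views, and functions), including SELECT, UPDATE, execute permission, ownership, and so on.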
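+### Multi-layer fault tolerance
-## Data analysis
+To ensure data security and service continuity, Cloudberry Database uses fault tolerance mechanisms at multiple levels:
-The Cloudberry Database kernel has a powerful built-in parallel optimizer and executor, is compatible with the PostgreSQL ecosystem, and supports technologies such as partition pruning, indexes (BTree, Bitmap, Hash, Brin, GIN, and others), and JIT (just-in-time compilation of expressions).
+- **Data page checksums**: At the underlying storage level, Cloudberry Database uses a checksum mechanism to detect bad blocks and ensure data integrity.
-In addition, Cloudberry Database integrates a rich set of analytics components:
+- **Mirror node configuration**: By configuring mirror nodes for data nodes, Cloudberry Database achieves high availability and failover. When an unrecoverable failure is detected on a primary node, the system automatically switches to the backup data node so that user queries are not affected.
-- Machine learning component. MADlib on Cloudberry Database: fully SQL-driven, combining algorithms, computing power, and data.
-- PL languages. Developers can write user-defined functions in languages such as R, Python, Perl, Java, and PostgreSQL.
-- Based on the MPP engine, it delivers high-performance parallel computing that integrates seamlessly with SQL to compute on and analyze SQL execution results.
-- PostGIS. Enterprise-grade improvements based on PostGIS 2.X support the Cloudberry Database MPP architecture and integrate object storage, so that large objects can be loaded into the database from OSS. All spatial data types (geometry, geography, Raster, and others) are supported, along with spatio-temporal indexes, complex spatial and geolocation calculations, spherical length calculations, and spatial aggregate functions (contains, covers, intersects, and others).
-- Cloudberry Database Text component. It supports using ElasticSearch to accelerate file retrieval, delivering an order-of-magnitude improvement in query performance over traditional GIN-based text queries, and supports multiple tokenizers, natural language processing, query result rendering, and more.
+- **Control node backup**: As with data nodes, a backup node can be configured for the control node to guard against failure of the primary control node. If the primary control node fails, the system automatically switches to the backup control node to ensure service continuity.
-## Flexible workload management
+Through these designs, Cloudberry Database is well prepared to prevent and respond to possible failure scenarios, which gives users a more reliable and stable service.
-- Connection pool PGBouncer (connection level; supports high concurrency for Cloudberry clusters at the connection level): sessions are managed centrally on the database side, which controls how many users can connect at the same time and avoids frequently creating and destroying server processes. It has a small memory footprint, supports high concurrency, and uses libevent for socket communication for greater efficiency.
-- Resource Group (session level; quantitatively controls Cloudberry cluster resources at the session level): identify typical workloads, analyze their CPU, memory, and concurrency requirements, set up Resource Groups based on that analysis, monitor GP as it runs, dynamically adjust the RS, and use rules to clean up idle sessions.
-- Dynamic resource group allocation (query level; dynamically adjusts CBDB cluster resources at the SQL level): resources are allocated flexibly and dynamically before or during the execution of an SQL statement to give priority to specific queries and shorten their run time.
+### Rich data analysis support
-## Highly compatible with third-party products
+Cloudberry Database provides powerful data analysis features that make data processing, querying, and analysis more efficient. The main data analysis features and related components are as follows:
-Cloudberry Database has good connectivity with BI tools, data mining and forecasting tools, ETL tools, J2EE/.NET applications, and other data sources and compute engines.
+- **Parallel optimizer and executor**: The Cloudberry Database kernel has a built-in parallel optimizer and executor. It is compatible with the PostgreSQL ecosystem and also supports partition pruning, multiple index types (including BTree, Bitmap, Hash, Brin, and GIN), and JIT (just-in-time compilation of expressions).
-
+- **Machine learning component (MADlib)**: Cloudberry Database integrates the MADlib component, which gives users fully SQL-driven machine learning and tightly combines algorithms, computing power, and data.
-## Cross-platform and domestic platform support
+- **Support for multiple programming languages**: Cloudberry Database gives developers a rich choice of programming languages, including R, Python, Perl, Java, and PostgreSQL, so that users can easily write user-defined functions.
+
+- **High-performance parallel computing based on the MPP engine**: The MPP engine of Cloudberry Database supports high-performance parallel computing and integrates seamlessly with SQL, so that SQL execution results can be computed on and analyzed quickly.
+
+- **PostGIS for geospatial data processing**: Cloudberry Database includes an enhanced version of PostGIS 2.X that supports its MPP architecture, which further improves its ability to process geospatial data. Key features include:
+
+ - Object storage integration: large volumes of geospatial data can be loaded directly into the database from object storage (OSS).
+ - Comprehensive spatial data type support: including the geometry, geography, and Raster spatial data types.
+ - Spatio-temporal indexes: spatio-temporal indexing effectively speeds up spatial and temporal queries.
+ - Complex spatial and geolocation calculations: including spherical length calculations and spatial aggregate functions (such as contains, covers, and intersects).
+
+- **Cloudberry Database Text component**: This component supports using ElasticSearch to accelerate file retrieval, delivering an order-of-magnitude improvement in query performance over traditional GIN-based text queries. It also supports multiple tokenizers, natural language processing, and query result rendering.
+
+With these data analysis features, Cloudberry Database can meet a wide variety of complex data processing, analysis, and query needs.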
+
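+### Flexible workload management
+
+Cloudberry Database provides comprehensive workload management features that aim to use and optimize database resources effectively to ensure efficient and stable operation. Workload management consists of the following three levels of control:
+
+- **Connection pool PGBouncer (connection-level management)**: Through the connection pool, Cloudberry Database centrally manages user connections and limits the number of concurrently active users, which improves efficiency and avoids wasting resources on frequently creating and destroying server processes. The connection pool has a small memory footprint, supports highly concurrent connections, and uses libevent for socket communication to improve efficiency.
+
+- **Resource Group (session-level management)**: Through resource groups, Cloudberry Database can analyze and classify typical workloads and quantify the CPU, memory, concurrency, and other resources that each workload needs. Based on the actual needs of each workload, you can then set up suitable resource groups and dynamically adjust resource usage to ensure overall efficiency. Rules can also be used to clean up idle sessions and release unneeded resources.
+
+- **Dynamic resource group allocation (SQL-level management)**: Through dynamic resource group allocation, Cloudberry Database can flexibly allocate resources before or during the execution of an SQL statement to give priority to specific queries and shorten their run time.
+
+This layered approach to workload management enables Cloudberry Database to handle highly concurrent requests more effectively and to safeguard database performance and stability while meeting different business needs.
+
+### Broad compatibility
+
+Cloudberry Database is compatible in many respects, which lets it work flexibly with a variety of tools, platforms, and languages. Its main compatibility features are as follows:
+
+- **SQL compatibility**: Cloudberry Database is compatible with PostgreSQL and Greenplum syntax and supports the SQL-92, SQL-99, and SQL 2003 standards, including SQL 2003 OLAP extensions such as window functions, `rollup`, and `cube`.
+
+- **Component compatibility**: Based on the PostgreSQL 14.4 kernel, Cloudberry Database is compatible with most commonly used PostgreSQL components and extensions.
+
+- **Tool and application compatibility**: It has good connectivity with a variety of BI tools, data mining and forecasting tools, ETL tools, and J2EE/.NET applications.
+
+- **Hardware platform compatibility**: It runs on multiple hardware architectures, including X86, ARM, Phytium, Kunpeng, and Hygon.
+
+- **Operating system compatibility**: It is compatible with multiple operating system environments, such as CentOS, Ubuntu, Kylin, and BC-Linux.
+
+## Use cases
-Cloudberry Database supports multiple hardware architectures, including X86, ARM, Phytium, Kunpeng, and Hygon, and multiple operating system environments, such as CentOS, Ubuntu, Kylin, and BC-Linux.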
From ba9dae556e5f4b728a20bf46e673782a7a25f012 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 9 Jun 2023 11:22:20 +0800
Subject: [PATCH 02/21] Delete cbdb-architecture.md
---
docs/cbdb-architecture.md | 49 ---------------------------------------
1 file changed, 49 deletions(-)
delete mode 100644 docs/cbdb-architecture.md
diff --git a/docs/cbdb-architecture.md b/docs/cbdb-architecture.md
deleted file mode 100644
index 88b33f40c..000000000
--- a/docs/cbdb-architecture.md
+++ /dev/null
@@ -1,49 +0,0 @@
----
-id: cbdb-overview
-title: Cloudberry Database Overview
----
-
-Learn about Cloudberry Database in just a few minutes.
-
-## Overview
-
-Cloudberry Database is an enterprise-level cloud-native database product that's highly elastic, performant, available, and cost-effective. It helps enterprise users handle and analyze large datasets of terabytes and petabytes with ease. HashData's vision is to break down the entry barriers for enterprises to build big data systems and bring out the complete potential of big data resources.
-
-Cloudberry Database's core technical features include:
-
-- Cloud-native architecture, without the architectural constraints seen in traditional MPP databases.
-- The computing engine is independent of data storage and metadata management, which provides
-multi-dimensional flexibility.
-- Second-level scaling and minute-level node self-healing and recovery, simplifying maintenance tasks.
-
-- Supports online upgrades and multi-active deployment.
-
-- Complete database capabilities, transactional consistency, and full compatibility with PostgreSQL and the
-Greenplum database.
-- Supports mainstream analytical tools for machine learning, graph computation, and spatiotemporal analysis.
-- Easily integrates with common ETL and BI tools.
-- Supports hybrid and integrated data warehouse and data lake solutions.
-
-Cloudberry Database consists of two main modules: a user module and a management module. The top layer of the user module is an independent metadata service layer, the middle is a stateless computing layer, and the bottom layer is a shared data storage layer. The management module incorporates the management console (Cloud Manager) that manages all metadata clusters and computing clusters, including cluster creation, startup and shutdown, and resource management, monitoring, and alerts.
-
-## Advantages
-
-- Cloud-native architecture: Cloudberry Database is built from scratch to leverage cloud computing advantages, ensuring high elasticity and availability by separating storage, computation, and metadata. This approach eliminates the architectural constraints of traditional MPP databases and delivers a robust big data platform.
-- Multi-dimensional elastic scaling: Cloudberry Database scales storage and compute resources independently, so throughput, response time, and data capacity can each be adjusted by scaling horizontally, vertically, or in storage.
-- Comprehensive database capability: Cloudberry Database supports UTF-8, GBK, and other encoding formats, multi-tenant management, relational data models, and standard SQL syntax. It provides strongly consistent ACID transactions, partition management, and popular interfaces such as JDBC and ODBC.
-- Rich analytical features: Cloudberry Database inherits advanced analytical features from PostgreSQL and Greenplum Database and has been customized for cloud platforms to support distributed machine learning and spatiotemporal analysis. It supports the ANSI SQL 2008 standard and the SQL 2003 OLAP extensions, as well as procedural languages including PL/pgSQL, PL/C, PL/Python, PL/Java, and PL/R.
-- Cloudberry Database natively supports Apache MADlib (an in-database machine learning library based on SQL) and PostGIS.
-- ETL and BI tool integration: Cloudberry Database integrates easily with many ETL and BI tools commonly used in the database industry.
-
-## Future work
-
-The open-source work on Cloudberry Database is ongoing. Detailed, well-organized documentation will be available in the coming months.
-
-You may visit the following channels to stay connected:
-
-- Website: [http://cloudberrydb.org](http://cloudberrydb.org)
-- Twitter: [http://twitter.com/cloudberrydb](http://twitter.com/cloudberrydb)
-- GitHub: [http://github.com/cloudberrydb](http://github.com/cloudberrydb)
-- Slack: [@cloudberrydb](https://communityinviter.com/apps/cloudberrydb/welcome)
-
-Thanks!
From e823de5553a857d25b859e9c2fd3b29910d54d43 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 9 Jun 2023 11:23:36 +0800
Subject: [PATCH 03/21] Update cbdb-overview.md
---
docs/cbdb-overview.md | 49 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/docs/cbdb-overview.md b/docs/cbdb-overview.md
index 30404ce4c..4cd42857d 100644
--- a/docs/cbdb-overview.md
+++ b/docs/cbdb-overview.md
@@ -1 +1,48 @@
-TODO
\ No newline at end of file
+---
+title: Cloudberry Database Overview
+---
+
+Learn about Cloudberry Database in just a few minutes.
+
+## Overview
+
+Cloudberry Database is an enterprise-level cloud-native database product that's highly elastic, performant, available, and cost-effective. It helps enterprise users handle and analyze large datasets of terabytes and petabytes with ease. HashData's vision is to break down the entry barriers for enterprises to build big data systems and bring out the complete potential of big data resources.
+
+Cloudberry Database's core technical features include:
+
+- Cloud-native architecture, without the architectural constraints seen in traditional MPP databases.
+- The computing engine is independent of data storage and metadata management, which provides
+multi-dimensional flexibility.
+- Second-level scaling and minute-level node self-healing and recovery, simplifying maintenance tasks.
+
+- Supports online upgrades and multi-active deployment.
+
+- Complete database capabilities, transactional consistency, and full compatibility with PostgreSQL and the
+Greenplum database.
+- Supports mainstream analytical tools for machine learning, graph computation, and spatiotemporal analysis.
+- Easily integrates with common ETL and BI tools.
+- Supports hybrid and integrated data warehouse and data lake solutions.
+
+Cloudberry Database consists of two main modules: a user module and a management module. The top layer of the user module is an independent metadata service layer, the middle is a stateless computing layer, and the bottom layer is a shared data storage layer. The management module incorporates the management console (Cloud Manager) that manages all metadata clusters and computing clusters, including cluster creation, startup and shutdown, and resource management, monitoring, and alerts.
+
+## Advantages
+
+- Cloud-native architecture: Cloudberry Database is built from scratch to leverage cloud computing advantages, ensuring high elasticity and availability by separating storage, computation, and metadata. This approach eliminates the architectural constraints of traditional MPP databases and delivers a robust big data platform.
+- Multi-dimensional elastic scaling: Cloudberry Database scales storage and compute resources independently, so throughput, response time, and data capacity can each be adjusted by scaling horizontally, vertically, or in storage.
+- Comprehensive database capability: Cloudberry Database supports UTF-8, GBK, and other encoding formats, multi-tenant management, relational data models, and standard SQL syntax. It provides strongly consistent ACID transactions, partition management, and popular interfaces such as JDBC and ODBC.
+- Rich analytical features: Cloudberry Database inherits advanced analytical features from PostgreSQL and Greenplum Database and has been customized for cloud platforms to support distributed machine learning and spatiotemporal analysis. It supports the ANSI SQL 2008 standard and the SQL 2003 OLAP extensions, as well as procedural languages including PL/pgSQL, PL/C, PL/Python, PL/Java, and PL/R.
+- Cloudberry Database natively supports Apache MADlib (an in-database machine learning library based on SQL) and PostGIS.
+- ETL and BI tool integration: Cloudberry Database integrates easily with many ETL and BI tools commonly used in the database industry.
+
+## Future work
+
+The open-source work on Cloudberry Database is ongoing. Detailed, well-organized documentation will be available in the coming months.
+
+You may visit the following channels to stay connected:
+
+- Website: [http://cloudberrydb.org](http://cloudberrydb.org)
+- Twitter: [http://twitter.com/cloudberrydb](http://twitter.com/cloudberrydb)
+- GitHub: [http://github.com/cloudberrydb](http://github.com/cloudberrydb)
+- Slack: [@cloudberrydb](https://communityinviter.com/apps/cloudberrydb/welcome)
+
+Thanks!
From b0860120cc3e10fe3f8147c3c81578a0692fc402 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 9 Jun 2023 11:34:00 +0800
Subject: [PATCH 04/21] refine
---
.../docusaurus-plugin-content-docs/current/cbdb-overview.md | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index c73aeebc7..ddb88f310 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -18,14 +18,14 @@ Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进
### 多场景高效查询
-Cloudberry Database 致力于提供高效的数据查询服务。无论是在大规模数据分析场景,还是在高度分布式的环境下,我们的解决方案都能为您提供最优的查询性能。这得益于 Cloudberry Database 的两种灵活的查询优化方式,以及一系列内置的优化技术。
+Cloudberry Database 致力于提供高效的数据查询服务。无论是在大规模数据分析场景,还是在高度分布式的环境下,我们的解决方案都能为你提供最优的查询性能。这得益于 Cloudberry Database 的两种灵活的查询优化方式,以及一系列内置的优化技术。
Cloudberry Database 提供了两种优化器以支持用户在不同场景下进行有效的查询:
- **基于 PostgreSQL 的优化器**:此优化器内置于 Cloudberry Database 中,经过特别调整以更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
- **GPORCA 优化器**:这是一个基于开源的优化器,经过特定适配,以满足分布式环境下的查询优化需求。
-Cloudberry Database 提供的技术包括分区静态和动态减裁、聚集下推、连接过滤和向量化执行引擎等,它们都旨在为您提供最快、最精确的查询结果。此外,我们还特别考虑了数据库管理员的需求,提供了基于规则的逻辑查询优化功能,以及代价最低的查询路径生成策略,使您可以根据实际需求灵活调整查询计划。
+Cloudberry Database 提供的技术包括分区静态和动态减裁、聚集下推、连接过滤和向量化执行引擎等,它们都旨在为你提供最快、最精确的查询结果。此外,我们还特别考虑了数据库管理员的需求,提供了基于规则的逻辑查询优化功能,以及代价最低的查询路径生成策略,使你可以根据实际需求灵活调整查询计划。
### 多态数据存储
@@ -35,7 +35,7 @@ Cloudberry Database 能够高效灵活地满足多种应用类型和业务需求
- **多种存储类型的选择**:
- 行式存储:适用于大多数字段频繁查询和随机行访问较多的情况。
- - 列式存储:当您需要对少数字段进行查询时,这种方式可以大幅节省 I/O 操作,非常适合大数据量频繁访问的场景。
+ - 列式存储:当你需要对少数字段进行查询时,这种方式可以大幅节省 I/O 操作,非常适合大数据量频繁访问的场景。
- **专门的存储模式**:Cloudberry Database 设计了 Heap 存储、AO 行存储、AOCS 列存储等不同的存储模式以优化各种应用类型的性能。在最细粒度到分区的层面,一张表可以实现多种存储模式。
- **支持分区表**:你可以根据特定条件定义表的分区方式。在查询时,系统将自动过滤不需要查询的子表,提高数据的查询效率。
From 1263098ac6f7f4dee1324e9c87d4161c4911e6a3 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 9 Jun 2023 11:41:45 +0800
Subject: [PATCH 05/21] remove user scenarios
---
.../current/cbdb-overview.md | 26 +++++++------------
1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index ddb88f310..87f3b7a86 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -1,8 +1,8 @@
---
-title: 功能和场景概览
+title: 特性概览
---
-# Cloudberry Database 功能和场景概览
+# Cloudberry Database 特性概览
Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进的成熟开源 MPP 数据库之一,具备高并发、高可用等多种特性,可以对复杂任务进行快速高效计算,以满足海量数据管理和计算的需求,目前在多个领域都有着广泛应用。
@@ -14,9 +14,7 @@ Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进
本文档介绍 Cloudberry Database 的主要功能以及主要使用场景。
-## 主要功能
-
-### 多场景高效查询
+## 多场景高效查询
Cloudberry Database 致力于提供高效的数据查询服务。无论是在大规模数据分析场景,还是在高度分布式的环境下,我们的解决方案都能为你提供最优的查询性能。这得益于 Cloudberry Database 的两种灵活的查询优化方式,以及一系列内置的优化技术。
@@ -27,7 +25,7 @@ Cloudberry Database 提供了两种优化器以支持用户在不同场景下进
Cloudberry Database 提供的技术包括分区静态和动态减裁、聚集下推、连接过滤和向量化执行引擎等,它们都旨在为你提供最快、最精确的查询结果。此外,我们还特别考虑了数据库管理员的需求,提供了基于规则的逻辑查询优化功能,以及代价最低的查询路径生成策略,使你可以根据实际需求灵活调整查询计划。
-### 多态数据存储
+## 多态数据存储
Cloudberry Database 能够高效灵活地满足多种应用类型和业务需求的数据存储需求。以下是 Cloudberry Database 提供的主要数据存储特性:
@@ -46,7 +44,7 @@ Cloudberry Database 能够高效灵活地满足多种应用类型和业务需求
因此,Cloudberry Database 提供了高效、灵活的数据存储和处理解决方案,能满足多种业务需求。
-### 多层次的数据安全防护
+## 多层次的数据安全防护
Cloudberry Database 着重强调数据安全性,提供了全方位的安全保护措施。这些安全特性被设计为满足各种数据库环境需求,并提供多层次的安全防护,包括:
@@ -62,7 +60,7 @@ Cloudberry Database 着重强调数据安全性,提供了全方位的安全保
因此,Cloudberry Database 为数据安全性提供了全方位的保障,无论是从数据存储的安全性,还是用户数据保护和权限设定,都能满足多种安全性要求。
-### 数据加载
+## 数据加载
Cloudberry Database 提供了一系列高效且灵活的数据加载解决方案,以满足各种数据处理需求,包括:
@@ -78,7 +76,7 @@ Cloudberry Database 提供了一系列高效且灵活的数据加载解决方案
这些特性使得 Cloudberry Database 可以在不同的数据环境中提供高效且稳定的数据加载解决方案。
-### 多层容错
+## 多层容错
Cloudberry Database 为了确保数据安全和服务的连续性,采取了多种多级的容错机制:
@@ -90,7 +88,7 @@ Cloudberry Database 为了确保数据安全和服务的连续性,采取了多
通过这些设计,Cloudberry Database 对可能的故障场景做出了充分的预防和应对,从而为用户提供了更加可靠、稳定的服务。
-### 丰富的数据分析支持
+## 丰富的数据分析支持
Cloudberry Database 提供了强大的数据分析功能,使得数据处理、查询和分析变得更加高效。下面是主要的数据分析功能和相关组件:
@@ -113,7 +111,7 @@ Cloudberry Database 提供了强大的数据分析功能,使得数据处理、
通过这些数据分析功能,Cloudberry Database 能够满足各类复杂的数据处理、分析和查询需求。
-### 灵活的工作负载管理
+## 灵活的工作负载管理
Cloudberry Database 提供了全面的工作负载管理功能,旨在有效地利用和优化数据库资源,以确保高效、稳定的运行。其工作负载管理主要包括以下三个层次的控制:
@@ -125,7 +123,7 @@ Cloudberry Database 提供了全面的工作负载管理功能,旨在有效地
这种分层的工作负载管理模式,使得 Cloudberry Database 能够更有效地处理高并发请求,保障数据库性能和稳定性,同时满足不同的业务需求。
-### 多种兼容性
+## 多种兼容性
Cloudberry Database 的兼容性表现在多个方面,这使得它能够灵活应对各种工具、平台和语言。以下是其主要的兼容性特点:
@@ -138,7 +136,3 @@ Cloudberry Database 的兼容性表现在多个方面,这使得它能够灵活
- **硬件平台兼容性**:能够在多种硬件架构下运行,包括 X86、ARM、飞腾、鲲鹏、海光等。
- **操作系统兼容性**:兼容多种操作系统环境,如 CentOS、Ubuntu、Kylin、BC-Linux 等。
-
-## 使用场景
-
-
From 1ff9bf58e2bf1af37a315f90d9fc232fe945b5ac Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 9 Jun 2023 11:49:08 +0800
Subject: [PATCH 06/21] add some blank files
---
docs/cbdb-architecture.md | 1 +
docs/cbdb-scenarios.md | 1 +
.../zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md | 1 +
sidebars.js | 2 +-
4 files changed, 4 insertions(+), 1 deletion(-)
create mode 100644 docs/cbdb-architecture.md
create mode 100644 docs/cbdb-scenarios.md
create mode 100644 i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
diff --git a/docs/cbdb-architecture.md b/docs/cbdb-architecture.md
new file mode 100644
index 000000000..30404ce4c
--- /dev/null
+++ b/docs/cbdb-architecture.md
@@ -0,0 +1 @@
+TODO
\ No newline at end of file
diff --git a/docs/cbdb-scenarios.md b/docs/cbdb-scenarios.md
new file mode 100644
index 000000000..30404ce4c
--- /dev/null
+++ b/docs/cbdb-scenarios.md
@@ -0,0 +1 @@
+TODO
\ No newline at end of file
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
new file mode 100644
index 000000000..30404ce4c
--- /dev/null
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
@@ -0,0 +1 @@
+TODO
\ No newline at end of file
diff --git a/sidebars.js b/sidebars.js
index 80ab78b61..648979075 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -19,7 +19,7 @@ const sidebars = {
{
type: 'category',
label: 'Introduction',
- items: ['cbdb-overview','cbdb-vs-gp-features']
+ items: ['cbdb-overview','cbdb-architecture','cbdb-scenarios','cbdb-vs-gp-features']
},
// {
From 5378ebfb825f1e92dee5128deb464ecc1d610d3c Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 9 Jun 2023 14:42:15 +0800
Subject: [PATCH 07/21] address comments
---
.../current/cbdb-overview.md | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index 87f3b7a86..77508016b 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -6,10 +6,10 @@ title: 特性概览
Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进的成熟开源 MPP 数据库之一,具备高并发、高可用等多种特性,可以对复杂任务进行快速高效计算,以满足海量数据管理和计算的需求,目前在多个领域都有着广泛应用。
-- 性能优秀: Cloudberry 在数据存储、高并发、高可用、线性扩展、反应速度、易用性和性价比等方面显著的优势。进入大数据时代以后,Cloudberry 在处理 TB 级别数据量上性能优秀,单机性能明显优于 Hadoop。
+- 性能优秀: Cloudberry Database 在数据存储、高并发、高可用、线性扩展、反应速度、易用性和性价比等方面显著的优势。进入大数据时代以后,Cloudberry Database 在处理 TB 级别数据量上性能优秀,单机性能明显优于 Hadoop。
- 语法兼容性强:在功能和语法上,远比 Hadoop 上的 SQL 引擎 Hive 易用,普通用户更加容易上手。
-- 工具完善: Cloudberry 有着完善的工具体系,用户无需投入太多时间和精力进行工具改造,因此适合作为大型数据仓库的解决方案。
-- 部署灵活: Cloudberry 支持灵活的部署方式,包括传统的硬件部署,支持多云和跨云部署。
+- 工具完善: Cloudberry Database 有着完善的工具体系,用户无需投入太多时间和精力进行工具改造,因此适合作为大型数据仓库的解决方案。
+- 部署灵活: Cloudberry Database 支持灵活的部署方式,包括传统的硬件部署,支持多云和跨云部署。
- 对不同数据类型、数据格式、存储介质都提供完善的支持。多层次的灵活性也让 Cloudberry Database可以更好地满足用户多方位的需求。
本文档介绍 Cloudberry Database 的主要功能以及主要使用场景。
@@ -23,7 +23,7 @@ Cloudberry Database 提供了两种优化器以支持用户在不同场景下进
- **基于 PostgreSQL 的优化器**:此优化器内置于 Cloudberry Database 中,经过特别调整以更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
- **GPORCA 优化器**:这是一个基于开源的优化器,经过特定适配,以满足分布式环境下的查询优化需求。
-Cloudberry Database 提供的技术包括分区静态和动态减裁、聚集下推、连接过滤和向量化执行引擎等,它们都旨在为你提供最快、最精确的查询结果。此外,我们还特别考虑了数据库管理员的需求,提供了基于规则的逻辑查询优化功能,以及代价最低的查询路径生成策略,使你可以根据实际需求灵活调整查询计划。
+Cloudberry Database 提供的技术包括分区静态和动态减裁、聚集下推、连接过滤等,它们都旨在为你提供最快、最精确的查询结果。此外,此外,针对不同的查询场景和查询语句,Cloudberry Database 提供了基于规则的查询优化手段和基于代价的查询优化手段以生成更高效的查询执行计划。
## 多态数据存储
From 12c65c9861f1d17955d233649737def09f81cd52 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 9 Jun 2023 14:43:54 +0800
Subject: [PATCH 08/21] address comment
---
i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index 77508016b..fb19f3e86 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -21,7 +21,7 @@ Cloudberry Database 致力于提供高效的数据查询服务。无论是在大
Cloudberry Database 提供了两种优化器以支持用户在不同场景下进行有效的查询:
- **基于 PostgreSQL 的优化器**:此优化器内置于 Cloudberry Database 中,经过特别调整以更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
-- **GPORCA 优化器**:这是一个基于开源的优化器,经过特定适配,以满足分布式环境下的查询优化需求。
+- **GPORCA 优化器**:这是一个开源的优化器,经过特定适配,以满足分布式环境下的查询优化需求。
Cloudberry Database 提供的技术包括分区静态和动态减裁、聚集下推、连接过滤等,它们都旨在为你提供最快、最精确的查询结果。此外,此外,针对不同的查询场景和查询语句,Cloudberry Database 提供了基于规则的查询优化手段和基于代价的查询优化手段以生成更高效的查询执行计划。
From 133c48b08d7b5155fd09f797d3373be186999170 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Mon, 12 Jun 2023 10:14:15 +0800
Subject: [PATCH 09/21] modify overview to make it simpler
---
docs/cbdb-architecture.md | 40 ++++++++++++++-
docs/cbdb-scenarios.md | 4 ++
.../current/cbdb-architecture.md | 2 -
.../current/cbdb-overview.md | 49 +++++--------------
.../current/cbdb-scenarios.md | 4 ++
5 files changed, 59 insertions(+), 40 deletions(-)
diff --git a/docs/cbdb-architecture.md b/docs/cbdb-architecture.md
index 30404ce4c..0ac93a32b 100644
--- a/docs/cbdb-architecture.md
+++ b/docs/cbdb-architecture.md
@@ -1 +1,39 @@
-TODO
\ No newline at end of file
+---
+title: Architecture
+---
+
+# Cloudberry Database Architecture
+
+This document introduces the product architecture and the implementation mechanism of the internal modules in Cloudberry Database.
+
+In most cases, Cloudberry Database is close to PostgreSQL in terms of SQL support, features, configuration options, and user functionalities. The experience users have with Cloudberry Database is similar to interacting with a standalone PostgreSQL system.
+
+Cloudberry Database uses MPP (Massively Parallel Processing) architecture to store and process large volumes of data by distributing data and computational workloads across multiple servers or hosts.
+
+MPP, also known as the shared-nothing architecture, refers to systems with multiple hosts that work together to perform a task. Each host has its own processor, memory, disk, network resources, and operating system. Cloudberry Database uses this high-performance architecture to distribute data loads and can use all system resources in parallel to process queries.
+
+From the user's point of view, Cloudberry Database is a complete Relational Database Management System (RDBMS). In physical terms, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances work together, Cloudberry Database carries out distributed cluster processing at different levels for data storage, computation, communication, and management. Although Cloudberry Database is a cluster, it hides all the distributed details from the user and provides a single logical database. This greatly eases the work of developers and operational staff.
+
+The architecture diagram of Cloudberry Database is as follows:
+
+
+
+- **Master node** (or control node) is the gateway to the Cloudberry Database system, which accepts client connections and SQL queries, and allocates tasks to data node instances. Users interact with Cloudberry Database by connecting to the master node using a client program (such as psql) or an application programming interface (API) (such as JDBC, ODBC, or libpq PostgreSQL C API).
+ - The master node is where the global system catalog located. The global system catalog is a set of system tables that contain metadata about the Cloudberry Database system itself.
+ - The master node does not store any user data, and data is stored only in the data node instances.
+ - The master node authenticates client connections, handles incoming SQL commands, allocates workload among data nodes, coordinates the results returned by each data node, and presents the final results to the client program.
+ - Cloudberry Database uses Write Ahead Logging (WAL) for master node/standby mirroring. In WAL-based logging, all modifications are first written to a log before being written to the disk, which ensures the data integrity of any in-process operation.
+
+- **Segment** (or data node) instances are individual Postgres processes, each storing a portion of the data and executing the corresponding part of the query. When a user connects to the database through the master node and submits a query request, a process is created on each data node to handle the query. User-defined tables and their indices are distributed across all available data nodes in Cloudberry Database, with each data node containing different parts of the data, and processes dealing with different parts of the data running on their respective data nodes. Users interact with the data nodes via the master node, and the data nodes run on servers called data node hosts.
+
+ Typically, a data node host runs 2 to 8 data nodes, depending on the processor, memory, storage, network interface, and workload. The configuration of the data node host needs to be balanced, because evenly distributing the data and workload among the data nodes is the key to achieving optimal performance with Cloudberry Database, which allows all data nodes to start processing a task and finish the work at the same time.
+
+- **Interconnect** is the network layer in the Cloudberry Database system architecture. Interconnect refers to the network infrastructure upon which the communication between the master node and the data nodes relies, which uses a standard Ethernet switching structure.
+
+  For performance reasons, a network of 10 Gbps or faster is recommended. By default, the Interconnect module uses the UDP protocol with flow control (UDPIFC) to send messages over the network. Cloudberry Database performs packet verification beyond what UDP itself provides, so reliability is equivalent to TCP while performance and scalability exceed it. If the Interconnect is changed to use TCP, the scalability of Cloudberry Database is limited to 1000 data nodes. This limit does not apply with the default UDPIFC protocol.
+
+- Cloudberry Database uses Multiversion Concurrency Control (MVCC) to ensure data consistency. This means that when querying the database, each transaction only sees a snapshot of the data, ensuring that current transactions do not see modifications made by other transactions on the same records. This provides transaction isolation for each transaction in the database.
+
+ MVCC minimizes lock contention to ensure performance in a multi-user environment. This is done by avoiding explicit locking for database transactions.
+
+  In concurrency control, MVCC does not introduce conflicts between query (read) locks and write locks, so read and write operations do not block each other. This is the biggest advantage of MVCC over a lock-based mechanism.
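The Segment bullet in the architecture text above says that evenly distributing data and workload among the data nodes is the key to performance. A minimal sketch of that idea follows, assuming rows are assigned to segments by hashing a distribution key. The SHA-256 hash and the names `segment_for`, `distribute`, and `skew` are hypothetical choices for illustration only; they are not Cloudberry Database's actual hash function or API.

```python
import hashlib
from collections import Counter


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index.

    Uses SHA-256 purely for illustration; the database's real hash differs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute(keys, num_segments):
    """Return the number of rows that land on each segment."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(i, 0) for i in range(num_segments)]


def skew(counts):
    """Ratio of the busiest segment to the average load (1.0 is perfectly even)."""
    avg = sum(counts) / len(counts)
    return max(counts) / avg


if __name__ == "__main__":
    # A high-cardinality key spreads rows almost evenly across 8 segments.
    unique_keys = [f"order-{i}" for i in range(100_000)]
    even = distribute(unique_keys, 8)
    print("unique key:", even, "skew =", round(skew(even), 2))

    # A key with only 3 distinct values can occupy at most 3 segments.
    few_values = [f"region-{i % 3}" for i in range(100_000)]
    uneven = distribute(few_values, 8)
    print("low-cardinality key:", uneven, "skew =", round(skew(uneven), 2))
```

Because every segment must finish its share before a query completes, the busiest segment sets the pace. This is why a distribution key with many distinct values is generally preferable to one with few.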
diff --git a/docs/cbdb-scenarios.md b/docs/cbdb-scenarios.md
index 30404ce4c..4ab7328c4 100644
--- a/docs/cbdb-scenarios.md
+++ b/docs/cbdb-scenarios.md
@@ -1 +1,5 @@
+---
+title: User Scenarios
+---
+
TODO
\ No newline at end of file
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
index 03fbcce28..a200cb821 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
@@ -18,8 +18,6 @@ Cloudberry Database 架构图如下所示:

-HashData 由如下组件构成:
-
- **控制节点 (Master)** 是 Cloudberry Database 数据库系统的入口,它接受客户端连接和 SQL 查询,并将工作分配给数据节点实例。用户与 Cloudberry Database 进行交互,使用客户端程序(例如 psql)或应用程序编程接口(API)(例如 JDBC、ODBC 或 libpq PostgreSQL C API)连接到控制节点。
- 控制节点是全局系统目录所在的位置,全局系统目录是一组系统表,其中包含有关 Cloudberry Database 数据库系统本身的元数据。
- 控制节点不包含任何用户数据,数据只保存在数据节点实例上。
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index fb19f3e86..c0bca46a4 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -16,17 +16,20 @@ Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进
## 多场景高效查询
-Cloudberry Database 致力于提供高效的数据查询服务。无论是在大规模数据分析场景,还是在高度分布式的环境下,我们的解决方案都能为你提供最优的查询性能。这得益于 Cloudberry Database 的两种灵活的查询优化方式,以及一系列内置的优化技术。
+Cloudberry Database 致力于提供高效的数据查询服务,支持用户在不同场景下进行有效的查询:
-Cloudberry Database 提供了两种优化器以支持用户在不同场景下进行有效的查询:
+- **大数据分析环境**:Cloudberry Database 使用内置的 PostgreSQL 的优化器,可更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
+- **分布式环境**:采用开源优化器 GPORCA 优化器,经过特定适配,可满足分布式环境下的查询优化需求。
-- **基于 PostgreSQL 的优化器**:此优化器内置于 Cloudberry Database 中,经过特别调整以更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
-- **GPORCA 优化器**:这是一个开源的优化器,经过特定适配,以满足分布式环境下的查询优化需求。
-
-Cloudberry Database 提供的技术包括分区静态和动态减裁、聚集下推、连接过滤等,它们都旨在为你提供最快、最精确的查询结果。此外,此外,针对不同的查询场景和查询语句,Cloudberry Database 提供了基于规则的查询优化手段和基于代价的查询优化手段以生成更高效的查询执行计划。
+Cloudberry Database 还提供分区静态和动态减裁、聚集下推、连接过滤等技术,旨在提供最快、最精确的查询结果。此外,针对不同的查询场景和查询语句,Cloudberry Database 提供了基于规则的查询优化手段和基于代价的查询优化手段以生成更高效的查询执行计划。
## 多态数据存储
+Cloudberry Database 支持多种不同的存储格式,包括 Heap 存储、AO 行存储、AOCS 列存储,用于不同的应用场景。同时,Cloudberry Database 还支持分区表,用户可以按照某个条件定义表的分区方式,查询时根据查询条件自动过滤不需要查询的子表,提高数据的查询效率。
+
+
+更多详情
+
Cloudberry Database 能够高效灵活地满足多种应用类型和业务需求的数据存储需求。以下是 Cloudberry Database 提供的主要数据存储特性:
- **均匀的数据分布**:通过 Hash 和 Random 的方式进行数据分布,可以更好地利用磁盘性能并解决 I/O 瓶颈问题。
@@ -40,41 +43,15 @@ Cloudberry Database 能够高效灵活地满足多种应用类型和业务需求
- **高效的数据压缩功能**:支持多种压缩算法,如 Zlib 1-9 和 Zstandard 1~19,以提高数据处理性能,并保持 CPU 与压缩比的平衡。
- **对小表的优化**:你可以选择使用 Replication Table,并在创建表时指定自定义 Hash 算法,更灵活地控制数据分布。
-无论选择哪种存储模式,Cloudberry Database 优化器都会根据你所使用的存储形态和统计信息自动生成最优的查询计划。此外,所有存储模式都支持数据压缩功能,且一张表的不同列可以支持不同的压缩算法。
-
-因此,Cloudberry Database 提供了高效、灵活的数据存储和处理解决方案,能满足多种业务需求。
+
## 多层次的数据安全防护
-Cloudberry Database 着重强调数据安全性,提供了全方位的安全保护措施。这些安全特性被设计为满足各种数据库环境需求,并提供多层次的安全防护,包括:
-
-- **数据库隔离**:在 Cloudberry Database 中,数据在各数据库间不共享,实现了多数据库环境的隔离。如果需要进行跨数据库访问,可以使用 DBLink 功能。
-
-- **内部数据组织**:数据库内部的数据逻辑组织包括多种数据对象,如表、视图、索引、函数等,而数据访问则可以跨 Schema 进行。
-
-- **强大的数据存储安全性**:Cloudberry Database 提供了不同的存储模式以支持数据冗余,并采用各种加密方法(包括 AES 128、192、256,DES,以及国密加密等)以确保数据存储的安全性。此外,还支持密文认证,包括 SCRAM-SHA-256、MD5、LDAP、RADIUS 等加密算法。
-
-- **用户数据保护**:Cloudberry Database 提供了函数加密解密,以及透明数据加密解密。透明数据加密解密的过程由 Cloudberry Database 内核完成,用户无需进行任何操作。可以支持的数据格式包括 Heap 表,AO 行存储,AOCS 列存储。除了常见的 AES 等加密算法,也特别支持国密算法,使用户可以方便地扩展自己的算法到透明数据加密中。
-
-- **详细的权限设定**:为了满足不同用户和不同级别的对象(例如:Schema、表、行、列、视图、函数等)的权限需求,Cloudberry Database 提供了丰富的权限设定选项,包括 `SELECT`、`UPDATE`、执行权、所有权等等。
-
-因此,Cloudberry Database 为数据安全性提供了全方位的保障,无论是从数据存储的安全性,还是用户数据保护和权限设定,都能满足多种安全性要求。
+Cloudberry Database 加强对用户数据的保护,支持函数加密解密,以及透明数据加密和解密。透明数据加密解密指在用户不感知的情况下,加密解密过程由 Cloudberry Database 内核完成,目前可以支持的数据格式包括 Heap 表、AO 行存储、AOCS 列存储。同时加密算法除了常用的 AES 等算法以外,还特别支持国密算法,用户可以方便的扩展自己的算法到透明数据加密中。
## 数据加载
-Cloudberry Database 提供了一系列高效且灵活的数据加载解决方案,以满足各种数据处理需求,包括:
-
-- **并行化和持久化的数据加载**:通过外部表技术,Cloudberry Database 支持大批量并行和持久化的数据加载,实现字符集间的自动转换,例如从 GBK 到 UTF8。这一功能使得数据输入变得更为流畅。
-
-- **灵活的数据源和文件格式支持**:无论数据存储在外部文件服务器、Hive、Hbase、HDFS 还是 S3 等多种存储介质,或是处于 CSV、Text、JSON、ORC、Parquet 等多种文件格式,Cloudberry Database 都能提供支持。并且,该数据库也可以加载 Zip 等压缩数据文件。
-
-- **集成多款 ETL 工具**:DataStage、Informatica、Kettle 等多款 ETL 工具都已集成到 Cloudberry Database 中,提升数据处理的便利性。
-
-- **支持流式数据加载**:Cloudberry Database 可针对订阅的 Kafka Topic 启动多个并行读取任务,将读取后的记录缓存,到达一定时间或记录数后,通过 gpfdist 加载到数据库中。这种方式可以确保数据的完整性,不重也不丢,非常适用于流数据采集和实时分析场景。支持达到每分钟几千万的数据加载吞吐量。
-
-- **高性能的数据访问**:PXF 是 Cloudberry Database 的内置组件,可以将外部数据源映射到 Cloudberry Database 的外部表,实现并行和高速的数据访问。PXF 支持混合数据生态的管理和访问,帮助实现 Data Fabric 架构。
-
-这些特性使得 Cloudberry Database 可以在不同的数据环境中提供高效且稳定的数据加载解决方案。
+Cloudberry Database 提供了一系列高效且灵活的数据加载解决方案,以满足各种数据处理需求,包括并行化和持久化的数据加载、支持灵活的数据源和文件格式、集成多款 ETL 工具、支持流式数据加载、提供高性能的数据访问。
## 多层容错
@@ -86,8 +63,6 @@ Cloudberry Database 为了确保数据安全和服务的连续性,采取了多
- **控制节点的备份**:类似于数据节点,控制节点也可以配置备份节点,以防止主控制节点发生故障。一旦主控制节点发生故障,系统将自动切换到备份控制节点,确保服务的连续性。
-通过这些设计,Cloudberry Database 对可能的故障场景做出了充分的预防和应对,从而为用户提供了更加可靠、稳定的服务。
-
## 丰富的数据分析支持
Cloudberry Database 提供了强大的数据分析功能,使得数据处理、查询和分析变得更加高效。下面是主要的数据分析功能和相关组件:
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
index 30404ce4c..1a4f3f169 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
@@ -1 +1,5 @@
+---
+title: 使用场景
+---
+
TODO
\ No newline at end of file
From 25e04fc1b553795ec01f1e94ae56d8dac111735c Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Mon, 12 Jun 2023 11:00:48 +0800
Subject: [PATCH 10/21] refine feature overview
---
docs/cbdb-architecture.md | 2 +-
.../current/cbdb-overview.md | 84 ++++++++++++++-----
2 files changed, 66 insertions(+), 20 deletions(-)
diff --git a/docs/cbdb-architecture.md b/docs/cbdb-architecture.md
index 0ac93a32b..6a4156888 100644
--- a/docs/cbdb-architecture.md
+++ b/docs/cbdb-architecture.md
@@ -16,7 +16,7 @@ From the user's point of view, Cloudberry Database is a complete Relational Data
The architecture diagram of Cloudberry Database is as follows:
-
+
- **Master node** (or control node) is the gateway to the Cloudberry Database system, which accepts client connections and SQL queries, and allocates tasks to data node instances. Users interact with Cloudberry Database by connecting to the master node using a client program (such as psql) or an application programming interface (API) (such as JDBC, ODBC, or libpq PostgreSQL C API).
- The master node is where the global system catalog located. The global system catalog is a set of system tables that contain metadata about the Cloudberry Database system itself.
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index c0bca46a4..28dcbe9ea 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -8,29 +8,26 @@ Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进
- 性能优秀: Cloudberry Database 在数据存储、高并发、高可用、线性扩展、反应速度、易用性和性价比等方面显著的优势。进入大数据时代以后,Cloudberry Database 在处理 TB 级别数据量上性能优秀,单机性能明显优于 Hadoop。
- 语法兼容性强:在功能和语法上,远比 Hadoop 上的 SQL 引擎 Hive 易用,普通用户更加容易上手。
-- 工具完善: Cloudberry Database 有着完善的工具体系,用户无需投入太多时间和精力进行工具改造,因此适合作为大型数据仓库的解决方案。
-- 部署灵活: Cloudberry Database 支持灵活的部署方式,包括传统的硬件部署,支持多云和跨云部署。
-- 对不同数据类型、数据格式、存储介质都提供完善的支持。多层次的灵活性也让 Cloudberry Database可以更好地满足用户多方位的需求。
-
-本文档介绍 Cloudberry Database 的主要功能以及主要使用场景。
+- 工具完善: Cloudberry Database 有着完善的工具体系,用户无需投入太多时间和精力进行工具改造,适合作为大型数据仓库的解决方案。
+- 部署灵活:Cloudberry Database 支持灵活的部署方式,包括传统的硬件部署,支持多云和跨云部署。
+- 对不同数据类型、数据格式、存储介质都提供完善的支持,多层次地满足用户多方位的需求。
## 多场景高效查询
-Cloudberry Database 致力于提供高效的数据查询服务,支持用户在不同场景下进行有效的查询:
+- Cloudberry Database 支持用户在大数据分析环境和分布式环境下进行有效的查询:
-- **大数据分析环境**:Cloudberry Database 使用内置的 PostgreSQL 的优化器,可更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
-- **分布式环境**:采用开源优化器 GPORCA 优化器,经过特定适配,可满足分布式环境下的查询优化需求。
+ - **大数据分析环境**:Cloudberry Database 使用内置的 PostgreSQL 的优化器,可更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
+ - **分布式环境**:采用开源优化器 GPORCA 优化器,经过特定适配,可满足分布式环境下的查询优化需求。
-Cloudberry Database 还提供分区静态和动态减裁、聚集下推、连接过滤等技术,旨在提供最快、最精确的查询结果。此外,针对不同的查询场景和查询语句,Cloudberry Database 提供了基于规则的查询优化手段和基于代价的查询优化手段以生成更高效的查询执行计划。
+- 提供分区静态和动态减裁、聚集下推、连接过滤等技术,以帮助用户获得最快、最精确的查询结果。
+- 提供了基于规则的查询优化手段和基于代价的查询优化手段,帮助用户生成更高效的查询执行计划。
## 多态数据存储
Cloudberry Database 支持多种不同的存储格式,包括 Heap 存储、AO 行存储、AOCS 列存储,用于不同的应用场景。同时,Cloudberry Database 还支持分区表,用户可以按照某个条件定义表的分区方式,查询时根据查询条件自动过滤不需要查询的子表,提高数据的查询效率。
-更多详情
-
-Cloudberry Database 能够高效灵活地满足多种应用类型和业务需求的数据存储需求。以下是 Cloudberry Database 提供的主要数据存储特性:
+点击以查看主要数据存储特性
- **均匀的数据分布**:通过 Hash 和 Random 的方式进行数据分布,可以更好地利用磁盘性能并解决 I/O 瓶颈问题。
- **多种存储类型的选择**:
@@ -49,13 +46,49 @@ Cloudberry Database 能够高效灵活地满足多种应用类型和业务需求
Cloudberry Database 加强对用户数据的保护,支持函数加密解密,以及透明数据加密和解密。透明数据加密解密指在用户不感知的情况下,加密解密过程由 Cloudberry Database 内核完成,目前可以支持的数据格式包括 Heap 表、AO 行存储、AOCS 列存储。同时加密算法除了常用的 AES 等算法以外,还特别支持国密算法,用户可以方便的扩展自己的算法到透明数据加密中。
+
+点击查看详情
+
+Cloudberry Database 着重强调数据安全性,提供了全方位的安全保护措施。这些安全特性被设计为满足各种数据库环境需求,并提供多层次的安全防护,包括:
+
+- **数据库隔离**:在 Cloudberry Database 中,数据在各数据库间不共享,实现了多数据库环境的隔离。如果需要进行跨数据库访问,可以使用 DBLink 功能。
+
+- **内部数据组织**:数据库内部的数据逻辑组织包括多种数据对象,如表、视图、索引、函数等,而数据访问则可以跨 Schema 进行。
+
+- **强大的数据存储安全性**:Cloudberry Database 提供了不同的存储模式以支持数据冗余,并采用各种加密方法(包括 AES 128、192、256,DES,以及国密加密等)以确保数据存储的安全性。此外,还支持密文认证,包括 SCRAM-SHA-256、MD5、LDAP、RADIUS 等加密算法。
+
+- **用户数据保护**:Cloudberry Database 提供了函数加密解密,以及透明数据加密解密。透明数据加密解密的过程由 Cloudberry Database 内核完成,用户无需进行任何操作。可以支持的数据格式包括 Heap 表,AO 行存储,AOCS 列存储。除了常见的 AES 等加密算法,也特别支持国密算法,使用户可以方便地扩展自己的算法到透明数据加密中。
+
+- **详细的权限设定**:为了满足不同用户和不同级别的对象(例如:Schema、表、行、列、视图、函数等)的权限需求,Cloudberry Database 提供了丰富的权限设定选项,包括 `SELECT`、`UPDATE`、执行权、所有权等等。
+
+
+
## 数据加载
Cloudberry Database 提供了一系列高效且灵活的数据加载解决方案,以满足各种数据处理需求,包括并行化和持久化的数据加载、支持灵活的数据源和文件格式、集成多款 ETL 工具、支持流式数据加载、提供高性能的数据访问。
+
+点击查看数据加载方案详情
+
+- **并行化和持久化的数据加载**:通过外部表技术,Cloudberry Database 支持大批量并行和持久化的数据加载,实现字符集间的自动转换,例如从 GBK 到 UTF8。这一功能使得数据输入变得更为流畅。
+
+- **灵活的数据源和文件格式支持**:无论数据存储在外部文件服务器、Hive、Hbase、HDFS 还是 S3 等多种存储介质,或是处于 CSV、Text、JSON、ORC、Parquet 等多种文件格式,Cloudberry Database 都能提供支持。并且,该数据库也可以加载 Zip 等压缩数据文件。
+
+- **集成多款 ETL 工具**:DataStage、Informatica、Kettle 等多款 ETL 工具都已集成到 Cloudberry Database 中,提升数据处理的便利性。
+
+- **支持流式数据加载**:Cloudberry Database 可针对订阅的 Kafka Topic 启动多个并行读取任务,将读取后的记录缓存,到达一定时间或记录数后,通过 gpfdist 加载到数据库中。这种方式可以确保数据的完整性,不重也不丢,非常适用于流数据采集和实时分析场景。支持达到每分钟几千万的数据加载吞吐量。
+
+- **高性能的数据访问**:PXF 是 Cloudberry Database 的内置组件,可以将外部数据源映射到 Cloudberry Database 的外部表,实现并行和高速的数据访问。PXF 支持混合数据生态的管理和访问,帮助实现 Data Fabric 架构。
+
+
+
## 多层容错
-Cloudberry Database 为了确保数据安全和服务的连续性,采取了多种多级的容错机制:
+Cloudberry Database 为了确保数据安全和服务的连续性,采取了数据页面、Checksum、镜像节点配置、控制节点备份的多级容错机制。
+
+
+
+点击查看详细信息
- **数据页面的 Checksum**:在底层存储上,Cloudberry Database 使用 Checksum 机制进行坏块检测,保证数据的完整性。
@@ -63,9 +96,14 @@ Cloudberry Database 为了确保数据安全和服务的连续性,采取了多
- **控制节点的备份**:类似于数据节点,控制节点也可以配置备份节点,以防止主控制节点发生故障。一旦主控制节点发生故障,系统将自动切换到备份控制节点,确保服务的连续性。
+
+
## 丰富的数据分析支持
-Cloudberry Database 提供了强大的数据分析功能,使得数据处理、查询和分析变得更加高效。下面是主要的数据分析功能和相关组件:
+Cloudberry Database 提供了强大的数据分析功能,使得数据处理、查询和分析变得更加高效,满足各类复杂的数据处理、分析和查询需求。
+
+
+点击以查看主要的数据分析功能和组件
- **并行优化器和执行器**:Cloudberry Database 内核内置了并行优化器和执行器,不仅能够兼容 PostgreSQL 生态,还支持数据分区裁剪、多种索引技术(包括 BTree,Bitmap,Hash,Brin,GIN等),以及 JIT(表达式即时编译处理)等。
@@ -84,23 +122,29 @@ Cloudberry Database 提供了强大的数据分析功能,使得数据处理、
- **Cloudberry Database Text 组件**:这个组件支持利用 ElasticSearch 加速文件检索能力,相比传统的 GIN 数据文本查询性能有数量级的提升,支持多种分词,自然语言处理,以及查询结果渲染等。
-通过这些数据分析功能,Cloudberry Database 能够满足各类复杂的数据处理、分析和查询需求。
+
## 灵活的工作负载管理
-Cloudberry Database 提供了全面的工作负载管理功能,旨在有效地利用和优化数据库资源,以确保高效、稳定的运行。其工作负载管理主要包括以下三个层次的控制:
+Cloudberry Database 提供了全面的工作负载管理功能,旨在有效地利用和优化数据库资源,以确保高效、稳定的运行。其工作负载管理主要包括连接级别管理、会话级别管理、SQL 级别管理三个层次的控制。
+
+
+点击以查看详情
- **连接池 PGBouncer(连接级别管理)**:通过连接池,Cloudberry Database 对用户接入进行统一管理,限制同时活跃的用户数量,以提高效率并避免因频繁创建和销毁服务进程而浪费资源。连接池具有较小的内存占用,并能够支持高并发连接,使用 libevent 进行 Socket 通信以提高通信效率。
- **资源组 Resource Group(会话级别管理)**:通过资源组,Cloudberry Database 能够分析并分类典型的工作负载,量化每个工作负载所需的 CPU、内存、并发度等资源。这样,根据工作负载的实际需求,可以设定适合的资源组,并动态调整资源使用,以确保整体运行效率。同时,还可以利用规则清理空闲的会话,释放不必要的资源。
-- **动态资源组分配(SQL级别管理)**:通过动态资源组分配,Cloudberry Database 能够在 SQL 语句执行前或执行过程中灵活地分配资源,以便优待特定的查询,缩短其运行时间。
+- **动态资源组分配(SQL 级别管理)**:通过动态资源组分配,Cloudberry Database 能够在 SQL 语句执行前或执行过程中灵活地分配资源,以便优待特定的查询,缩短其运行时间。
-这种分层的工作负载管理模式,使得 Cloudberry Database 能够更有效地处理高并发请求,保障数据库性能和稳定性,同时满足不同的业务需求。
+
## 多种兼容性
-Cloudberry Database 的兼容性表现在多个方面,这使得它能够灵活应对各种工具、平台和语言。以下是其主要的兼容性特点:
+Cloudberry Database 的兼容性表现在 SQL 语法、组件、工具和程序、硬件平台和操作系统等多个方面,这使得它能够灵活应对各种工具、平台和语言。
+
+
+点击以查看详情
- **SQL 兼容性**:Cloudberry Database 兼容 PostgreSQL 和 Greenplum 语法,支持 SQL-92,SQL-99,以及 SQL 2003 标准,包括 SQL 2003 OLAP 扩展,如窗口函数,`rollup`,`cube` 等。
@@ -111,3 +155,5 @@ Cloudberry Database 的兼容性表现在多个方面,这使得它能够灵活
- **硬件平台兼容性**:能够在多种硬件架构下运行,包括 X86、ARM、飞腾、鲲鹏、海光等。
- **操作系统兼容性**:兼容多种操作系统环境,如 CentOS、Ubuntu、Kylin、BC-Linux 等。
+
+
From dc2571598babf258429b46cce07b5991df45fc21 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Tue, 13 Jun 2023 18:04:37 +0800
Subject: [PATCH 11/21] refine architecture translation
---
docs/cbdb-architecture.md | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/docs/cbdb-architecture.md b/docs/cbdb-architecture.md
index 6a4156888..ab5324f51 100644
--- a/docs/cbdb-architecture.md
+++ b/docs/cbdb-architecture.md
@@ -6,31 +6,31 @@ title: Architecture
This document introduces the product architecture and the implementation mechanism of the internal modules in Cloudberry Database.
-In most cases, Cloudberry Database is close to PostgreSQL in terms of SQL support, features, configuration options, and user functionalities. The experience users have with Cloudberry Database is similar to interacting with a standalone PostgreSQL system.
+In most cases, Cloudberry Database is similar to PostgreSQL in terms of SQL support, features, configuration options, and user functionalities. The experience users have with Cloudberry Database is similar to interacting with a standalone PostgreSQL system.
-Cloudberry Database uses MPP (Massively Parallel Processing) architecture to store and process large volumes of data by distributing data and computational workloads across multiple servers or hosts.
+Cloudberry Database uses MPP (Massively Parallel Processing) architecture to store and process large volumes of data, by distributing data and computing workloads across multiple servers or hosts.
MPP, also known as the shared-nothing architecture, refers to systems with multiple hosts that work together to perform a task. Each host has its own processor, memory, disk, network resources, and operating system. Cloudberry Database uses this high-performance architecture to distribute data loads and can use all system resources in parallel to process queries.
-From the user's point of view, Cloudberry Database is a complete Relational Database Management System (RDBMS). In physical terms, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances work together, Cloudberry Database carries out distributed cluster processing at different levels for data storage, computation, communication, and management. Although Cloudberry Database is a cluster, it hides all the distributed details from the user and provides a single logical database. This greatly eases the work of developers and operational staff.
+From users' view, Cloudberry Database is a complete relational database management system (RDBMS). In a physical view, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances work together, Cloudberry Database performs distributed cluster processing at different levels for data storage, computing, communication, and management. Although Cloudberry Database is a cluster, it hides all the distributed details from the user and provides a single logical database. This greatly eases the work of developers and operational staff.
The architecture diagram of Cloudberry Database is as follows:
- **Master node** (or control node) is the gateway to the Cloudberry Database system, which accepts client connections and SQL queries, and allocates tasks to data node instances. Users interact with Cloudberry Database by connecting to the master node using a client program (such as psql) or an application programming interface (API) (such as JDBC, ODBC, or libpq PostgreSQL C API).
- - The master node is where the global system catalog located. The global system catalog is a set of system tables that contain metadata about the Cloudberry Database system itself.
- - The master node does not store any user data, and data is stored only in the data node instances.
- - The master node authenticates client connections, handles incoming SQL commands, allocates workload among data nodes, coordinates the results returned by each data node, and presents the final results to the client program.
+ - The master node holds the global system catalog, a set of system tables that record the metadata of Cloudberry Database.
+ - The master node does not store any user data. User data is stored only in the data node instances.
+ - The master node performs authentication for client connections, processes SQL commands, distributes workload among segments, coordinates the results returned by each segment, and presents the final results to the client program.
- Cloudberry Database uses Write Ahead Logging (WAL) for master node/standby mirroring. In WAL-based logging, all modifications are first written to a log before being written to the disk, which ensures the data integrity of any in-process operation.
-- **Segment** (or data node) instances are individual Postgres processes, each storing a portion of the data and executing the corresponding part of the query. When a user connects to the database through the master node and submits a query request, a process is created on each data node to handle the query. User-defined tables and their indices are distributed across all available data nodes in Cloudberry Database, with each data node containing different parts of the data, and processes dealing with different parts of the data running on their respective data nodes. Users interact with the data nodes via the master node, and the data nodes run on servers called data node hosts.
+- **Segment** (or data node) instances are individual Postgres processes, each storing a portion of the data and executing the corresponding part of the query. When a user connects to the database through the master node and submits a query request, a process is created on each segment to handle the query. User-defined tables and their indexes are distributed across the available segments, and each segment contains a distinct portion of the data. The processes that handle each portion of the data run on the corresponding segment. Users interact with segments through the master node, and segments run on servers known as segment hosts.
- Typically, a data node host runs 2 to 8 data nodes, depending on the processor, memory, storage, network interface, and workload. The configuration of the data node host needs to be balanced, because evenly distributing the data and workload among the data nodes is the key to achieving optimal performance with Cloudberry Database, which allows all data nodes to start processing a task and finish the work at the same time.
+ Typically, a segment host runs 2 to 8 segments, depending on the processor, memory, storage, network interface, and workload. The configuration of segment hosts needs to be balanced, because evenly distributing the data and workload among segments is the key to achieving optimal performance with Cloudberry Database: it allows all segments to start processing a task and finish the work at the same time.
-- **Interconnect** is the network layer in the Cloudberry Database system architecture. Interconnect refers to the network infrastructure upon which the communication between the master node and the data nodes relies, which uses a standard Ethernet switching structure.
+- **Interconnect** is the network layer in the Cloudberry Database system architecture. Interconnect refers to the network infrastructure upon which the communication between the master node and the segments relies, which uses a standard Ethernet switching structure.
- For performance reasons, it is recommended to use a network with a speed of 10 GB or faster. By default, the Interconnect module uses the UDP protocol with flow control (UDPIFC) for communication to send messages through the network. The data packet verification performed by Cloudberry Database exceeds the scope provided by UDP, meaning that its reliability is equivalent to using the TCP protocol, and its performance and scalability surpass the TCP protocol. If the Interconnect is changed to use the TCP protocol, the scalability of Cloudberry Database is limited to 1000 data nodes. This limit does not apply when using UDPIFC as the default protocol.
+ For performance reasons, it is recommended to use a network with a speed of 10 Gbps or faster. By default, the Interconnect module uses the UDP protocol with flow control (UDPIFC) to send messages through the network. The data packet verification performed by Cloudberry Database goes beyond what UDP provides, meaning that its reliability is equivalent to that of the TCP protocol, while its performance and scalability surpass those of TCP. If the Interconnect is changed to use the TCP protocol, the scalability of Cloudberry Database is limited to 1000 segments. This limit does not apply when UDPIFC is used as the default protocol.
- Cloudberry Database uses Multiversion Concurrency Control (MVCC) to ensure data consistency. This means that when querying the database, each transaction only sees a snapshot of the data, ensuring that current transactions do not see modifications made by other transactions on the same records. This provides transaction isolation for each transaction in the database.
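The snapshot behavior described above can be illustrated with a toy model. This is a minimal sketch in Python, not Cloudberry Database's implementation: the `RowVersion` class, the `xmin`/`xmax` field names (borrowed from PostgreSQL terminology), and the representation of a snapshot as a set of committed transaction IDs are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class RowVersion:
    value: str
    xmin: int                   # ID of the transaction that created this version
    xmax: Optional[int] = None  # ID of the transaction that replaced or deleted it


def visible(row: RowVersion, committed: FrozenSet[int]) -> bool:
    """A version is visible if its creator is committed in the snapshot
    and its deleter (if any) is not."""
    if row.xmin not in committed:
        return False
    return row.xmax is None or row.xmax not in committed


def read(rows: List[RowVersion], committed: FrozenSet[int]) -> List[str]:
    """Return the values a reader with the given snapshot would see."""
    return [r.value for r in rows if visible(r, committed)]


# Transaction 1 inserted "v1"; transaction 2 later replaced it with "v2".
rows = [RowVersion("v1", xmin=1, xmax=2), RowVersion("v2", xmin=2)]

snapshot_before = frozenset({1})     # taken before transaction 2 committed
snapshot_after = frozenset({1, 2})   # taken after transaction 2 committed

old_view = read(rows, snapshot_before)
new_view = read(rows, snapshot_after)
```

A reader holding `snapshot_before` keeps seeing the old version even though a newer one exists, which is the isolation property the paragraph describes.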
From 1e072cbf5a013d3252ec332fd16f0991e318a54a Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 16 Jun 2023 10:16:51 +0800
Subject: [PATCH 12/21] refne
---
docs/cbdb-overview.md | 169 ++++++++++++++----
.../current/cbdb-overview.md | 16 +-
2 files changed, 144 insertions(+), 41 deletions(-)
diff --git a/docs/cbdb-overview.md b/docs/cbdb-overview.md
index 4cd42857d..a99b59a55 100644
--- a/docs/cbdb-overview.md
+++ b/docs/cbdb-overview.md
@@ -1,48 +1,155 @@
---
-title: Cloudberry DataBase Overview
+title: Feature Overview
---
-Learn about Cloudberry Database in just a few minutes.
+# Cloudberry Database Feature Overview
-## Overview
+Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进的成熟开源 MPP 数据库之一,具备高并发、高可用等多种特性,可以对复杂任务进行快速高效计算,以满足海量数据管理和计算的需求,目前在多个领域都有着广泛应用。
-Cloudberry Database is an enterprise-level cloud-native database product that's highly elastic, performant, available, and cost-effective. It helps enterprise users handle and analyze large datasets of terabytes and petabytes with ease. HashData's vision is to break down the entry barriers for enterprises to build big data systems and bring out the complete potential of big data resources.
+本文档从总体上介绍 Cloudberry Database 的特性。
-Cloudberry Database's core technical features include:
+## 多场景高效查询
-- Cloud-native architecture, without the architectural constraints seen in traditional MPP databases.
-- The computing engine is independent of data storage and metadata management, which provides
-multi-dimensional flexibility.
-- Second-level scaling and minute-level node self-healing and recovery, simplifying maintenance tasks.
+- Cloudberry Database 支持用户在大数据分析环境和分布式环境下进行有效的查询:
-Supports online upgrades and multi-active deployment.
+ - **大数据分析环境**:Cloudberry Database 使用内置的 PostgreSQL 的优化器,可更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
+ - **分布式环境**:采用开源优化器 GPORCA 优化器,经过特定适配,可满足分布式环境下的查询优化需求。
-- Complete database capabilities, transactional consistency, and full compatibility with PostgreSQL and the
-Greenplum database.
-- Supports mainstream analytical tools for machine learning, graph computation, and spatiotemporal analysis.
-- Easily integrates with common ETL and BI tools.
-- Supports hybrid and integrated data warehouse and data lake solutions.
+- 提供分区静态和动态减裁、聚集下推、连接过滤等技术,以帮助用户获得最快、最精确的查询结果。
+- 提供了基于规则的查询优化手段和基于代价的查询优化手段,帮助用户生成更高效的查询执行计划。
-Cloudberry Database consists of two main modules: a user module and a management module. The top layer of the user module is an independent metadata service layer, the middle is a stateless computing layer, and the bottom layer is a shared data storage layer. The management module incorporates the management console (Cloud Manager) that manages all metadata clusters and computing clusters, including cluster creation, startup and shutdown, and resource management, monitoring, and alerts.
+## 多态数据存储
-## Advantages
+Cloudberry Database 支持多种不同的存储格式,包括 Heap 存储、AO 行存储、AOCS 列存储,用于不同的应用场景。同时,Cloudberry Database 还支持分区表,用户可以按照某个条件定义表的分区方式,查询时根据查询条件自动过滤不需要查询的子表,提高数据的查询效率。
-- Cloud-native architecture: Cloudberry Database is built from scratch to leverage cloud computing advantages, ensuring high elasticity and availability by segregating storage, computation, and metadata. This approach eliminates the architectural constraints of traditional MPP databases and delivers a robust big data platform.
-- Multi-dimensional elastic scaling capability: Cloudberry Database scales storage and computation resources independently to adjust throughput, response time, and data capacity horizontally, vertically, and in storage.
-- Comprehensive database capability: Cloudberry Database supports UTF-8, GBK, and other encoding formats, multi-tenant management, relational data models, and standard SQL syntax. It facilitates strong consistency in ACID transactions and provides partition management and popular interfaces like JDBC and ODBC.
-- Rich analytical features: The database and data warehouse products, PostgreSQL and Greenplum Database, bring advanced analytical features to Cloudberry Database, which has been customized for cloud platforms to support distributed machine learning and spatiotemporal analysis. It has the ANSI SQL 2008 standard and the 2003 OLAP extension and supports languages, including PL/Pgsql, PL/C, PL/Python, PL/Java, and PL/R.
-- Cloudberry Database natively supports Apache MADlib (an in-database machine learning library based on SQL) and PostGIS.
-- ETL and BI Tool Integration: Cloudberry Database easily merges with many ETL and BI tools typically used in the database industry.
+
+点击以查看详情
-## Future work
+- **均匀的数据分布**:通过 Hash 和 Random 的方式进行数据分布,可以更好地利用磁盘性能并解决 I/O 瓶颈问题。
+- **多种存储类型的选择**:
-Our Cloudberry Database's open-source work is ongoing. Detailed and organized documentation will be available in the upcoming months.
+ - 行式存储:适用于大多数字段频繁查询和随机行访问较多的情况。
+ - 列式存储:当你需要对少数字段进行查询时,这种方式可以大幅节省 I/O 操作,非常适合大数据量频繁访问的场景。
-You may visit the following channels to stay connected:
+- **专门的存储模式**:Cloudberry Database 设计了 Heap 存储、AO 行存储、AOCS 列存储等不同的存储模式以优化各种应用类型的性能。在最细粒度到分区的层面,一张表可以实现多种存储模式。
+- **支持分区表**:你可以根据特定条件定义表的分区方式。在查询时,系统将自动过滤不需要查询的子表,提高数据的查询效率。
+- **高效的数据压缩功能**:支持多种压缩算法,如 Zlib 1-9 和 Zstandard 1~19,以提高数据处理性能,并保持 CPU 与压缩比的平衡。
+- **对小表的优化**:你可以选择使用 Replication Table,并在创建表时指定自定义 Hash 算法,更灵活地控制数据分布。
-- Website: [http://cloudberrydb.org](http://cloudberrydb.org)
-- Twitter: [http://twitter.com/cloudberrydb](http://twitter.com/cloudberrydb)
-- GitHub: [http://github.com/cloudberrydb](http://github.com/cloudberrydb)
-- Slack: [@cloudberrydb](https://communityinviter.com/apps/cloudberrydb/welcome)
+
-Thanks!
+## 多层次的数据安全防护
+
+Cloudberry Database 加强对用户数据的保护,支持函数加密解密,以及透明数据加密和解密。透明数据加密解密指在用户不感知的情况下,加密解密过程由 Cloudberry Database 内核完成,目前可以支持的数据格式包括 Heap 表、AO 行存储、AOCS 列存储。同时加密算法除了常用的 AES 等算法以外,还特别支持国密算法,用户可以方便的扩展自己的算法到透明数据加密中。
+
+
+点击以查看详情
+
+Cloudberry Database 着重强调数据安全性,提供了全方位的安全保护措施。这些安全特性被设计为满足各种数据库环境需求,并提供多层次的安全防护,包括:
+
+- **数据库隔离**:在 Cloudberry Database 中,数据在各数据库间不共享,实现了多数据库环境的隔离。如果需要进行跨数据库访问,可以使用 DBLink 功能。
+
+- **内部数据组织**:数据库内部的数据逻辑组织包括多种数据对象,如表、视图、索引、函数等,而数据访问则可以跨 Schema 进行。
+
+- **强大的数据存储安全性**:Cloudberry Database 提供了不同的存储模式以支持数据冗余,并采用各种加密方法(包括 AES 128、192、256,DES,以及国密加密等)以确保数据存储的安全性。此外,还支持密文认证,包括 SCRAM-SHA-256、MD5、LDAP、RADIUS 等加密算法。
+
+- **用户数据保护**:Cloudberry Database 提供了函数加密解密,以及透明数据加密解密。透明数据加密解密的过程由 Cloudberry Database 内核完成,用户无需进行任何操作。可以支持的数据格式包括 Heap 表,AO 行存储,AOCS 列存储。除了常见的 AES 等加密算法,也特别支持国密算法,使用户可以方便地扩展自己的算法到透明数据加密中。
+
+- **详细的权限设定**:为了满足不同用户和不同级别的对象(例如:Schema、表、行、列、视图、函数等)的权限需求,Cloudberry Database 提供了丰富的权限设定选项,包括 `SELECT`、`UPDATE`、执行权、所有权等等。
+
+
+
+## 数据加载
+
+Cloudberry Database 提供了一系列高效且灵活的数据加载解决方案,以满足各种数据处理需求,包括并行化和持久化的数据加载、支持灵活的数据源和文件格式、集成多款 ETL 工具、支持流式数据加载、提供高性能的数据访问。
+
+
+点击以查看详情
+
+- **并行化和持久化的数据加载**:通过外部表技术,Cloudberry Database 支持大批量并行和持久化的数据加载,实现字符集间的自动转换,例如从 GBK 到 UTF8。这一功能使得数据输入变得更为流畅。
+
+- **灵活的数据源和文件格式支持**:无论数据存储在外部文件服务器、Hive、Hbase、HDFS 还是 S3 等多种存储介质,或是处于 CSV、Text、JSON、ORC、Parquet 等多种文件格式,Cloudberry Database 都能提供支持。并且,该数据库也可以加载 Zip 等压缩数据文件。
+
+- **集成多款 ETL 工具**:DataStage、Informatica、Kettle 等多款 ETL 工具都已集成到 Cloudberry Database 中,提升数据处理的便利性。
+
+- **支持流式数据加载**:Cloudberry Database 可针对订阅的 Kafka Topic 启动多个并行读取任务,将读取后的记录缓存,到达一定时间或记录数后,通过 gpfdist 加载到数据库中。这种方式可以确保数据的完整性,不重也不丢,非常适用于流数据采集和实时分析场景。支持达到每分钟几千万的数据加载吞吐量。
+
+- **高性能的数据访问**:PXF 是 Cloudberry Database 的内置组件,可以将外部数据源映射到 Cloudberry Database 的外部表,实现并行和高速的数据访问。PXF 支持混合数据生态的管理和访问,帮助实现 Data Fabric 架构。
+
+
+
+## 多层容错
+
+Cloudberry Database 为了确保数据安全和服务的连续性,采取了数据页面、Checksum、镜像节点配置、控制节点备份的多级容错机制。
+
+
+
+点击以查看详情
+
+- **数据页面的 Checksum**:在底层存储上,Cloudberry Database 使用 Checksum 机制进行坏块检测,保证数据的完整性。
+
+- **镜像节点配置**:通过在数据节点间配置镜像节点,Cloudberry Database 能实现服务的高可用和故障切换。一旦检测到主节点发生不可恢复故障,系统会自动切换到备份数据节点,确保用户查询不会受到影响。
+
+- **控制节点的备份**:类似于数据节点,控制节点也可以配置备份节点,以防止主控制节点发生故障。一旦主控制节点发生故障,系统将自动切换到备份控制节点,确保服务的连续性。
+
+
+
+## 丰富的数据分析支持
+
+Cloudberry Database 提供了强大的数据分析功能,使得数据处理、查询和分析变得更加高效,满足各类复杂的数据处理、分析和查询需求。
+
+
+点击以查看详情
+
+- **并行优化器和执行器**:Cloudberry Database 内核内置了并行优化器和执行器,不仅能够兼容 PostgreSQL 生态,还支持数据分区裁剪、多种索引技术(包括 BTree,Bitmap,Hash,Brin,GIN等),以及 JIT(表达式即时编译处理)等。
+
+- **机器学习组件 - MADlib**:Cloudberry Database 集成了 MADlib 组件,为用户提供了全 SQL 驱动的机器学习功能,让算法、算力和数据能够深度融合。
+
+- **支持多种编程语言**:Cloudberry Database 为开发者提供了丰富的编程语言选择,包括 R、Python、Perl、Java和 PostgreSQL 等,使得用户可以方便地编写自定义函数。
+
+- **基于 MPP 引擎的高性能并行计算**:Cloudberry Database 的 MPP 引擎支持高性能并行计算,与 SQL 无缝集成,可以针对 SQL 执行结果进行快速的计算和分析。
+
+- **PostGIS 地理数据处理**:Cloudberry Database 引入了升级版的 PostGIS 2.X,支持其 MPP 架构,进一步提升了对地理空间数据的处理能力。主要特性包括:
+
+ - 集成对象存储:支持大容量地理空间数据从对象存储(OSS)直接加载入库。
+ - 全面的空间数据类型支持:包括 geometry(几何)、geography(地理)、Raster(栅格)等空间数据类型。
+ - 时空索引:提供时空索引技术,可以有效加速空间和时间相关的查询。
+ - 复杂的空间和地理位置计算:包括球体长度计算以及空间聚集函数(如包含、覆盖、相交等)。
+
+- **Cloudberry Database Text 组件**:这个组件支持利用 ElasticSearch 加速文件检索能力,相比传统的 GIN 数据文本查询性能有数量级的提升,支持多种分词,自然语言处理,以及查询结果渲染等。
+
+
+
+## 灵活的工作负载管理
+
+Cloudberry Database 提供了全面的工作负载管理功能,旨在有效地利用和优化数据库资源,以确保高效、稳定的运行。其工作负载管理主要包括连接级别管理、会话级别管理、SQL 级别管理三个层次的控制。
+
+
+点击以查看详情
+
+- **连接池 PGBouncer(连接级别管理)**:通过连接池,Cloudberry Database 对用户接入进行统一管理,限制同时活跃的用户数量,以提高效率并避免因频繁创建和销毁服务进程而浪费资源。连接池具有较小的内存占用,并能够支持高并发连接,使用 libevent 进行 Socket 通信以提高通信效率。
+
+- **资源组 Resource Group(会话级别管理)**:通过资源组,Cloudberry Database 能够分析并分类典型的工作负载,量化每个工作负载所需的 CPU、内存、并发度等资源。这样,根据工作负载的实际需求,可以设定适合的资源组,并动态调整资源使用,以确保整体运行效率。同时,还可以利用规则清理空闲的会话,释放不必要的资源。
+
+- **动态资源组分配(SQL 级别管理)**:通过动态资源组分配,Cloudberry Database 能够在 SQL 语句执行前或执行过程中灵活地分配资源,以便优待特定的查询,缩短其运行时间。
+
+
+
+## 多种兼容性
+
+Cloudberry Database 的兼容性表现在 SQL 语法、组件、工具和程序、硬件平台和操作系统等多个方面,这使得它能够灵活应对各种工具、平台和语言。
+
+
+点击以查看详情
+
+- **SQL 兼容性**:Cloudberry Database 兼容 PostgreSQL 和 Greenplum 语法,支持 SQL-92,SQL-99,以及 SQL 2003 标准,包括 SQL 2003 OLAP 扩展,如窗口函数,`rollup`,`cube` 等。
+
+- **组件兼容性**:基于 PostgreSQL 14.4 内核,Cloudberry Database 兼容市面上常用的大多数 PostgreSQL 组件和扩展。
+
+- **工具和程序兼容性**:与多种 BI 工具、挖掘预测工具、ETL 工具,以及 J2EE/.NET 应用程序都有良好的连通性。
+
+- **硬件平台兼容性**:能够在多种硬件架构下运行,包括 X86、ARM、飞腾、鲲鹏、海光等。
+
+- **操作系统兼容性**:兼容多种操作系统环境,如 CentOS、Ubuntu、Kylin、BC-Linux 等。
+
+
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index 28dcbe9ea..4f92fa3cb 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -6,11 +6,7 @@ title: 特性概览
Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进的成熟开源 MPP 数据库之一,具备高并发、高可用等多种特性,可以对复杂任务进行快速高效计算,以满足海量数据管理和计算的需求,目前在多个领域都有着广泛应用。
-- 性能优秀: Cloudberry Database 在数据存储、高并发、高可用、线性扩展、反应速度、易用性和性价比等方面显著的优势。进入大数据时代以后,Cloudberry Database 在处理 TB 级别数据量上性能优秀,单机性能明显优于 Hadoop。
-- 语法兼容性强:在功能和语法上,远比 Hadoop 上的 SQL 引擎 Hive 易用,普通用户更加容易上手。
-- 工具完善: Cloudberry Database 有着完善的工具体系,用户无需投入太多时间和精力进行工具改造,适合作为大型数据仓库的解决方案。
-- 部署灵活:Cloudberry Database 支持灵活的部署方式,包括传统的硬件部署,支持多云和跨云部署。
-- 对不同数据类型、数据格式、存储介质都提供完善的支持,多层次地满足用户多方位的需求。
+本文档从总体上介绍 Cloudberry Database 的特性。
## 多场景高效查询
@@ -27,7 +23,7 @@ Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进
Cloudberry Database 支持多种不同的存储格式,包括 Heap 存储、AO 行存储、AOCS 列存储,用于不同的应用场景。同时,Cloudberry Database 还支持分区表,用户可以按照某个条件定义表的分区方式,查询时根据查询条件自动过滤不需要查询的子表,提高数据的查询效率。
-点击以查看主要数据存储特性
+点击以查看详情
- **均匀的数据分布**:通过 Hash 和 Random 的方式进行数据分布,可以更好地利用磁盘性能并解决 I/O 瓶颈问题。
- **多种存储类型的选择**:
@@ -47,7 +43,7 @@ Cloudberry Database 支持多种不同的存储格式,包括 Heap 存储、AO
Cloudberry Database 加强对用户数据的保护,支持函数加密解密,以及透明数据加密和解密。透明数据加密解密指在用户不感知的情况下,加密解密过程由 Cloudberry Database 内核完成,目前可以支持的数据格式包括 Heap 表、AO 行存储、AOCS 列存储。同时加密算法除了常用的 AES 等算法以外,还特别支持国密算法,用户可以方便的扩展自己的算法到透明数据加密中。
-点击查看详情
+点击以查看详情
Cloudberry Database 着重强调数据安全性,提供了全方位的安全保护措施。这些安全特性被设计为满足各种数据库环境需求,并提供多层次的安全防护,包括:
@@ -68,7 +64,7 @@ Cloudberry Database 着重强调数据安全性,提供了全方位的安全保
Cloudberry Database 提供了一系列高效且灵活的数据加载解决方案,以满足各种数据处理需求,包括并行化和持久化的数据加载、支持灵活的数据源和文件格式、集成多款 ETL 工具、支持流式数据加载、提供高性能的数据访问。
-点击查看数据加载方案详情
+点击以查看详情
- **并行化和持久化的数据加载**:通过外部表技术,Cloudberry Database 支持大批量并行和持久化的数据加载,实现字符集间的自动转换,例如从 GBK 到 UTF8。这一功能使得数据输入变得更为流畅。
@@ -88,7 +84,7 @@ Cloudberry Database 为了确保数据安全和服务的连续性,采取了数
-点击查看详细信息
+点击以查看详情
- **数据页面的 Checksum**:在底层存储上,Cloudberry Database 使用 Checksum 机制进行坏块检测,保证数据的完整性。
@@ -103,7 +99,7 @@ Cloudberry Database 为了确保数据安全和服务的连续性,采取了数
Cloudberry Database 提供了强大的数据分析功能,使得数据处理、查询和分析变得更加高效,满足各类复杂的数据处理、分析和查询需求。
-点击以查看主要的数据分析功能和组件
+点击以查看详情
- **并行优化器和执行器**:Cloudberry Database 内核内置了并行优化器和执行器,不仅能够兼容 PostgreSQL 生态,还支持数据分区裁剪、多种索引技术(包括 BTree,Bitmap,Hash,Brin,GIN等),以及 JIT(表达式即时编译处理)等。
From fe9b968e2d46cae33f7e1223eaf9138666d58519 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Fri, 16 Jun 2023 15:27:48 +0800
Subject: [PATCH 13/21] add cbdb feature overview
---
docs/cbdb-overview.md | 142 ++++++++++++++++++++----------------------
1 file changed, 69 insertions(+), 73 deletions(-)
diff --git a/docs/cbdb-overview.md b/docs/cbdb-overview.md
index a99b59a55..2b681c432 100644
--- a/docs/cbdb-overview.md
+++ b/docs/cbdb-overview.md
@@ -4,152 +4,148 @@ title: Feature Overview
# Cloudberry Database Feature Overview
-Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进的成熟开源 MPP 数据库之一,具备高并发、高可用等多种特性,可以对复杂任务进行快速高效计算,以满足海量数据管理和计算的需求,目前在多个领域都有着广泛应用。
+Cloudberry Database, built on the latest PostgreSQL 14.4 kernel, is one of the most advanced and mature open-source MPP databases available today. It comes with multiple features, including high concurrency and high availability. It can perform quick and efficient computing for complex tasks, meeting the demands of managing and computing vast amounts of data. It is widely applied in multiple fields.
-本文档从总体上介绍 Cloudberry Database 的特性。
+This document gives a general introduction to the features of Cloudberry Database.
-## 多场景高效查询
+## Efficient queries in different scenarios
-- Cloudberry Database 支持用户在大数据分析环境和分布式环境下进行有效的查询:
+- Cloudberry Database allows you to perform efficient queries in big data analysis environments and distributed environments:
- - **大数据分析环境**:Cloudberry Database 使用内置的 PostgreSQL 的优化器,可更好地支持分布式环境。这意味着它能够在处理大数据分析任务时产生更高效的查询计划。
- - **分布式环境**:采用开源优化器 GPORCA 优化器,经过特定适配,可满足分布式环境下的查询优化需求。
+ - **Big data analysis environment**: Cloudberry Database uses the built-in PostgreSQL optimizer, which offers better support for distributed environments. This means that it can generate more efficient query plans when handling big data analysis tasks.
+ - **Distributed environment**: Cloudberry Database includes GPORCA, an open-source optimizer that has been specially adapted to meet query optimization needs in distributed environments.
-- 提供分区静态和动态减裁、聚集下推、连接过滤等技术,以帮助用户获得最快、最精确的查询结果。
-- 提供了基于规则的查询优化手段和基于代价的查询优化手段,帮助用户生成更高效的查询执行计划。
+- Technologies such as static and dynamic partition pruning, aggregate push-down, and join filtering are used to help you get fast and accurate query results.
+- Both rule-based and cost-based query optimization methods are provided to help you generate more efficient query execution plans.
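As an illustration of static partition pruning mentioned above, the following Python sketch keeps only the partitions whose range overlaps the query predicate. It is a simplified model under stated assumptions: the `Partition` class, the quarterly `sales_*` partition names, and the `prune` helper are hypothetical and do not reflect how the optimizer is implemented internally.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Partition:
    name: str
    start: date  # inclusive lower bound
    end: date    # exclusive upper bound


def prune(partitions: List[Partition], lo: date, hi: date) -> List[str]:
    """Keep partitions whose [start, end) range overlaps the query range [lo, hi)."""
    return [p.name for p in partitions if p.start < hi and lo < p.end]


partitions = [
    Partition("sales_2023_q1", date(2023, 1, 1), date(2023, 4, 1)),
    Partition("sales_2023_q2", date(2023, 4, 1), date(2023, 7, 1)),
    Partition("sales_2023_q3", date(2023, 7, 1), date(2023, 10, 1)),
]

# Models: WHERE sale_date >= '2023-05-01' AND sale_date < '2023-06-01'
to_scan = prune(partitions, date(2023, 5, 1), date(2023, 6, 1))
```

Only the second-quarter partition survives, so the other two are never read. Dynamic pruning applies the same overlap test, but with bounds that become known only at execution time.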
-## 多态数据存储
+## Polymorphic data storage
-Cloudberry Database 支持多种不同的存储格式,包括 Heap 存储、AO 行存储、AOCS 列存储,用于不同的应用场景。同时,Cloudberry Database 还支持分区表,用户可以按照某个条件定义表的分区方式,查询时根据查询条件自动过滤不需要查询的子表,提高数据的查询效率。
+For different scenarios, Cloudberry Database supports multiple storage formats, including Heap storage, AO row storage, and AOCS column storage. Cloudberry Database also supports partitioned tables. You can define the partitioning of a table based on certain conditions. When executing a query, Cloudberry Database automatically filters out the sub-tables that the query does not need, which improves query efficiency.
-点击以查看详情
+Click to view details
-- **均匀的数据分布**:通过 Hash 和 Random 的方式进行数据分布,可以更好地利用磁盘性能并解决 I/O 瓶颈问题。
-- **多种存储类型的选择**:
+- **Even data distribution**: By using Hash and Random methods for data distribution, Cloudberry Database takes better advantage of disk performance and solves I/O bottleneck issues.
+- **Choice of multiple storage types**:
- - 行式存储:适用于大多数字段频繁查询和随机行访问较多的情况。
- - 列式存储:当你需要对少数字段进行查询时,这种方式可以大幅节省 I/O 操作,非常适合大数据量频繁访问的场景。
+ - Row-based storage: Suitable for scenarios where most fields are frequently queried, and there are many random row accesses.
+ - Column-based storage: When you need to query a small number of fields, this method can greatly save I/O operations, making it ideal for scenarios where large amounts of data are accessed frequently.
-- **专门的存储模式**:Cloudberry Database 设计了 Heap 存储、AO 行存储、AOCS 列存储等不同的存储模式以优化各种应用类型的性能。在最细粒度到分区的层面,一张表可以实现多种存储模式。
-- **支持分区表**:你可以根据特定条件定义表的分区方式。在查询时,系统将自动过滤不需要查询的子表,提高数据的查询效率。
-- **高效的数据压缩功能**:支持多种压缩算法,如 Zlib 1-9 和 Zstandard 1~19,以提高数据处理性能,并保持 CPU 与压缩比的平衡。
-- **对小表的优化**:你可以选择使用 Replication Table,并在创建表时指定自定义 Hash 算法,更灵活地控制数据分布。
+- **Special storage modes**: Cloudberry Database has different storage modes, such as Heap storage, AO row storage, and AOCS column storage, to optimize the performance of different types of applications. Storage modes can be set at a granularity as fine as the partition, so a single table can use multiple storage modes.
+- **Support for partitioned tables**: You can define the partitioning of a table based on specific conditions. During querying, the system will automatically filter out the sub-tables that are not needed for the query, thus enhancing query efficiency.
+- **Efficient data compression function**: Cloudberry Database supports multiple compression algorithms, such as Zlib (levels 1-9) and Zstandard (levels 1-19), to enhance data processing performance and maintain a balance between CPU usage and compression ratio.
+- **Optimization for small tables**: You can choose to use replicated tables, and you can specify a custom Hash algorithm when creating a table, allowing for more flexible control of data distribution.
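To make hash-based data distribution more concrete, here is a minimal Python sketch that maps distribution keys to segments with a stable hash. It rests on assumptions: `NUM_SEGMENTS`, the `order-*` keys, and the use of CRC32 are stand-ins chosen for illustration, and CRC32 is not claimed to be the hash function that Cloudberry Database uses.

```python
import zlib
from collections import Counter

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution key to a segment index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


# Distribute 10,000 hypothetical rows by their key and count rows per segment.
keys = [f"order-{i}" for i in range(10_000)]
rows_per_segment = Counter(segment_for(k) for k in keys)

for segment, count in sorted(rows_per_segment.items()):
    print(f"segment {segment}: {count} rows")
```

Because the mapping depends only on the key, rows with the same key always land on the same segment, which is what lets joins on the distribution key run locally. A key with few distinct values would skew the counts, which is the case that random distribution is meant to address.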
-## 多层次的数据安全防护
+## Multi-layer data security
-Cloudberry Database 加强对用户数据的保护,支持函数加密解密,以及透明数据加密和解密。透明数据加密解密指在用户不感知的情况下,加密解密过程由 Cloudberry Database 内核完成,目前可以支持的数据格式包括 Heap 表、AO 行存储、AOCS 列存储。同时加密算法除了常用的 AES 等算法以外,还特别支持国密算法,用户可以方便的扩展自己的算法到透明数据加密中。
+Cloudberry Database enhances user data protection by supporting function-based encryption and decryption as well as transparent data encryption (TDE). With TDE, the Cloudberry Database kernel encrypts and decrypts data without users noticing. The data formats that TDE supports include Heap tables, AO row storage, and AOCS column storage. In addition to common encryption algorithms like AES, Cloudberry Database also supports Chinese national cryptographic (SM) algorithms, and you can easily extend the transparent data encryption process with your own algorithms.
-点击以查看详情
+Click to view details
-Cloudberry Database 着重强调数据安全性,提供了全方位的安全保护措施。这些安全特性被设计为满足各种数据库环境需求,并提供多层次的安全防护,包括:
+Cloudberry Database focuses on data security and provides security protection measures. These security measures are designed to satisfy different database environment needs and offer multi-layer security protection:
-- **数据库隔离**:在 Cloudberry Database 中,数据在各数据库间不共享,实现了多数据库环境的隔离。如果需要进行跨数据库访问,可以使用 DBLink 功能。
-
-- **内部数据组织**:数据库内部的数据逻辑组织包括多种数据对象,如表、视图、索引、函数等,而数据访问则可以跨 Schema 进行。
-
-- **强大的数据存储安全性**:Cloudberry Database 提供了不同的存储模式以支持数据冗余,并采用各种加密方法(包括 AES 128、192、256,DES,以及国密加密等)以确保数据存储的安全性。此外,还支持密文认证,包括 SCRAM-SHA-256、MD5、LDAP、RADIUS 等加密算法。
-
-- **用户数据保护**:Cloudberry Database 提供了函数加密解密,以及透明数据加密解密。透明数据加密解密的过程由 Cloudberry Database 内核完成,用户无需进行任何操作。可以支持的数据格式包括 Heap 表,AO 行存储,AOCS 列存储。除了常见的 AES 等加密算法,也特别支持国密算法,使用户可以方便地扩展自己的算法到透明数据加密中。
-
-- **详细的权限设定**:为了满足不同用户和不同级别的对象(例如:Schema、表、行、列、视图、函数等)的权限需求,Cloudberry Database 提供了丰富的权限设定选项,包括 `SELECT`、`UPDATE`、执行权、所有权等等。
+- **Database isolation**: In Cloudberry Database, data is not shared between databases, which achieves isolation in a multi-database environment. If cross-database access is needed, you can use the DBLink feature.
+- **Internal data organization**: The logical organization of data in the database includes data objects such as tables, views, indexes, and functions. Data access can be performed across schemas.
+- **Data storage security**: Cloudberry Database offers different storage modes to support data redundancy. It uses encryption methods including AES (128-, 192-, and 256-bit), DES, and Chinese national cryptographic (SM) algorithms to secure data storage. It also supports authentication methods such as SCRAM-SHA-256, MD5, LDAP, and RADIUS.
+- **User data protection**: Cloudberry Database supports function-based encryption and decryption, and transparent data encryption and decryption. Transparent encryption is implemented by the Cloudberry Database kernel without any user interaction. It supports data formats such as Heap tables, AO row storage, and AOCS column storage. In addition to common encryption algorithms like AES, it also supports Chinese national cryptographic (SM) algorithms, and you can easily add your own algorithms to transparent data encryption.
+- **Detailed permission settings**: To satisfy different users and objects (like schemas, tables, rows, columns, views, functions), Cloudberry Database provides a range of permission setting options. These include `SELECT`, `UPDATE`, execution rights, and ownership.
-## 数据加载
+## Data loading
-Cloudberry Database 提供了一系列高效且灵活的数据加载解决方案,以满足各种数据处理需求,包括并行化和持久化的数据加载、支持灵活的数据源和文件格式、集成多款 ETL 工具、支持流式数据加载、提供高性能的数据访问。
+Cloudberry Database provides a series of efficient and flexible data loading solutions to meet various data processing needs, including parallel and persistent data loading, support for flexible data sources and file formats, integration of multiple ETL tools, and support for stream data loading and high-performance data access.
-点击以查看详情
+Click to view details
-- **并行化和持久化的数据加载**:通过外部表技术,Cloudberry Database 支持大批量并行和持久化的数据加载,实现字符集间的自动转换,例如从 GBK 到 UTF8。这一功能使得数据输入变得更为流畅。
+- **Parallel and persistent data loading**: Using external table technology, Cloudberry Database supports parallel and persistent loading of large batches of data, and performs automatic conversion between character sets, such as from GBK to UTF-8. This feature makes data ingestion much smoother.
-- **灵活的数据源和文件格式支持**:无论数据存储在外部文件服务器、Hive、Hbase、HDFS 还是 S3 等多种存储介质,或是处于 CSV、Text、JSON、ORC、Parquet 等多种文件格式,Cloudberry Database 都能提供支持。并且,该数据库也可以加载 Zip 等压缩数据文件。
+- **Flexible data source and file format support**: Cloudberry Database supports data sources such as external file servers, Hive, HBase, HDFS, and S3, and supports data formats such as CSV, Text, JSON, ORC, and Parquet. In addition, the database can load compressed data files such as Zip files.
-- **集成多款 ETL 工具**:DataStage、Informatica、Kettle 等多款 ETL 工具都已集成到 Cloudberry Database 中,提升数据处理的便利性。
+- **Integration with multiple ETL tools**: DataStage, Informatica, Kettle, and other ETL tools have been integrated with Cloudberry Database to facilitate data processing.
-- **支持流式数据加载**:Cloudberry Database 可针对订阅的 Kafka Topic 启动多个并行读取任务,将读取后的记录缓存,到达一定时间或记录数后,通过 gpfdist 加载到数据库中。这种方式可以确保数据的完整性,不重也不丢,非常适用于流数据采集和实时分析场景。支持达到每分钟几千万的数据加载吞吐量。
+- **Support for stream data loading**: Cloudberry Database can start multiple parallel read tasks for a subscribed Kafka topic, cache the records it reads, and load them into the database via gpfdist once a time limit or record count is reached. This method ensures data integrity without duplication or loss, and is well suited to stream data collection and real-time analysis scenarios. Cloudberry Database supports a data loading throughput of tens of millions of records per minute.
-- **高性能的数据访问**:PXF 是 Cloudberry Database 的内置组件,可以将外部数据源映射到 Cloudberry Database 的外部表,实现并行和高速的数据访问。PXF 支持混合数据生态的管理和访问,帮助实现 Data Fabric 架构。
+- **High-performance data access**: PXF is a built-in component of Cloudberry Database that maps external data sources to external tables of Cloudberry Database to achieve parallel and high-speed data access. PXF supports the management of and access to hybrid data ecosystems and helps realize the Data Fabric architecture.
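The buffering rule used for stream loading (flush once a record count or a time limit is reached) can be sketched as follows. This is a hedged Python illustration only: the `MicroBatcher` class, its parameters, and the `flush` callback are hypothetical names, and the sketch does not talk to Kafka or gpfdist, nor does it implement the no-duplication, no-loss guarantee.

```python
from typing import Callable, List, Optional


class MicroBatcher:
    """Buffer records and flush when the batch is full or too old."""

    def __init__(self, flush: Callable[[List[str]], None],
                 max_records: int, max_age_seconds: float) -> None:
        self.flush = flush
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.buffer: List[str] = []
        self.first_arrival: Optional[float] = None

    def add(self, record: str, now: float) -> None:
        """Buffer one record; `now` is a timestamp in seconds."""
        if not self.buffer:
            self.first_arrival = now
        self.buffer.append(record)
        self._maybe_flush(now)

    def tick(self, now: float) -> None:
        """Called periodically so that old batches flush without new input."""
        self._maybe_flush(now)

    def _maybe_flush(self, now: float) -> None:
        if not self.buffer or self.first_arrival is None:
            return
        full = len(self.buffer) >= self.max_records
        stale = now - self.first_arrival >= self.max_age_seconds
        if full or stale:
            self.flush(self.buffer)
            self.buffer = []


loaded: List[List[str]] = []  # stands in for the load step
batcher = MicroBatcher(loaded.append, max_records=3, max_age_seconds=5.0)

batcher.add("r1", now=0.0)
batcher.add("r2", now=1.0)
batcher.add("r3", now=2.0)   # count threshold reached, first batch flushed
batcher.add("r4", now=3.0)
batcher.tick(now=9.0)        # age threshold reached, second batch flushed
```

Passing the clock in as `now` keeps the sketch deterministic. A real loader would read the system clock and would also need to track consumed offsets to avoid duplicates after a restart.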
-## 多层容错
+## Multi-layer fault tolerance
-Cloudberry Database 为了确保数据安全和服务的连续性,采取了数据页面、Checksum、镜像节点配置、控制节点备份的多级容错机制。
+To ensure data security and service continuity, Cloudberry Database adopts a multi-level fault-tolerance mechanism that consists of data page checksums, mirror node configuration, and control node backup.
-点击以查看详情
+Click to view details
-- **数据页面的 Checksum**:在底层存储上,Cloudberry Database 使用 Checksum 机制进行坏块检测,保证数据的完整性。
+- **Checksum of data page**: In the underlying storage, Cloudberry Database uses the checksum mechanism to detect bad blocks to ensure data integrity.
-- **镜像节点配置**:通过在数据节点间配置镜像节点,Cloudberry Database 能实现服务的高可用和故障切换。一旦检测到主节点发生不可恢复故障,系统会自动切换到备份数据节点,确保用户查询不会受到影响。
+- **Mirror node configuration**: By configuring mirror nodes among segments (or data nodes), Cloudberry Database can achieve high availability and failover of services. Once an unrecoverable failure of a primary segment is detected, the system automatically switches to the mirror segment to ensure that user queries are not affected.
-- **控制节点的备份**:类似于数据节点,控制节点也可以配置备份节点,以防止主控制节点发生故障。一旦主控制节点发生故障,系统将自动切换到备份控制节点,确保服务的连续性。
+- **Backup of control nodes**: Similar to segments, a master node (or control node) can also be configured with a backup node (or standby node) in case the master node fails. Once the master node fails, the system automatically switches to the standby node to ensure the continuity of services.
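The checksum-based bad block detection listed above can be illustrated with a short Python sketch. It assumes a toy page layout (a 4-byte CRC32 followed by the payload); the `write_page` and `verify_page` helpers are hypothetical, and neither the layout nor the checksum algorithm is claimed to match the on-disk format of Cloudberry Database.

```python
import zlib


def write_page(payload: bytes) -> bytes:
    """Return the payload prefixed with its 4-byte CRC32 checksum."""
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(4, "big") + payload


def verify_page(page: bytes) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    stored = int.from_bytes(page[:4], "big")
    return zlib.crc32(page[4:]) == stored


page = write_page(b"user data" * 100)
ok_before = verify_page(page)

# Simulate a bad block by flipping a single bit inside the payload.
corrupted = bytearray(page)
corrupted[100] ^= 0x01
ok_after = verify_page(bytes(corrupted))
```

When verification fails, the page can be treated as damaged instead of being returned to the query, and recovery can fall back to the mirror copy.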
-## 丰富的数据分析支持
+## Rich data analysis support
-Cloudberry Database 提供了强大的数据分析功能,使得数据处理、查询和分析变得更加高效,满足各类复杂的数据处理、分析和查询需求。
+Cloudberry Database provides powerful data analysis features. These features make data processing, query, and analysis more efficient, and meet a wide range of complex data processing, analysis, and query requirements.
-点击以查看详情
+Click to view details
-- **并行优化器和执行器**:Cloudberry Database 内核内置了并行优化器和执行器,不仅能够兼容 PostgreSQL 生态,还支持数据分区裁剪、多种索引技术(包括 BTree,Bitmap,Hash,Brin,GIN等),以及 JIT(表达式即时编译处理)等。
+- **Parallel optimizer and executor**: The Cloudberry Database kernel has a built-in parallel optimizer and executor, which is compatible with the PostgreSQL ecosystem and also supports data partition pruning, multiple indexing technologies (including B-Tree, Bitmap, Hash, BRIN, and GIN), and JIT (just-in-time compilation of expressions).
-- **机器学习组件 - MADlib**:Cloudberry Database 集成了 MADlib 组件,为用户提供了全 SQL 驱动的机器学习功能,让算法、算力和数据能够深度融合。
+- **Machine learning components - MADlib**: Cloudberry Database integrates MADlib components, providing users with fully SQL-driven machine learning features, enabling deep integration of algorithms, computing power, and data.
-- **支持多种编程语言**:Cloudberry Database 为开发者提供了丰富的编程语言选择,包括 R、Python、Perl、Java和 PostgreSQL 等,使得用户可以方便地编写自定义函数。
+- **Support for multiple programming languages**: Cloudberry Database provides developers with a rich choice of programming languages, including R, Python, Perl, Java, and PL/pgSQL, so that they can easily write custom functions.
-- **基于 MPP 引擎的高性能并行计算**:Cloudberry Database 的 MPP 引擎支持高性能并行计算,与 SQL 无缝集成,可以针对 SQL 执行结果进行快速的计算和分析。
+- **High-performance parallel computing based on MPP engine**: Cloudberry Database's MPP engine supports high-performance parallel computing, seamlessly integrated with SQL, and can perform fast computing and analysis on SQL execution results.
-- **PostGIS 地理数据处理**:Cloudberry Database 引入了升级版的 PostGIS 2.X,支持其 MPP 架构,进一步提升了对地理空间数据的处理能力。主要特性包括:
+- **PostGIS geographic data processing**: Cloudberry Database introduces an upgraded version of PostGIS 2.X that supports the MPP architecture of Cloudberry Database, which further improves the processing capability of geospatial data. Key features include:
- - 集成对象存储:支持大容量地理空间数据从对象存储(OSS)直接加载入库。
- - 全面的空间数据类型支持:包括 geometry(几何)、geography(地理)、Raster(栅格)等空间数据类型。
- - 时空索引:提供时空索引技术,可以有效加速空间和时间相关的查询。
- - 复杂的空间和地理位置计算:包括球体长度计算以及空间聚集函数(如包含、覆盖、相交等)。
+ - Integrated object storage: supports directly loading large-capacity geospatial data from object storage (OSS) into the database.
+ - Comprehensive spatial data type support: including geometry, geography, and raster.
+ - Spatio-temporal index: Provides spatio-temporal index technology, which can effectively accelerate space- and time-related queries.
+ - Complex spatial and geographic calculations: including sphere length calculations as well as spatial aggregation functions (such as contain, cover, intersect).
-- **Cloudberry Database Text 组件**:这个组件支持利用 ElasticSearch 加速文件检索能力,相比传统的 GIN 数据文本查询性能有数量级的提升,支持多种分词,自然语言处理,以及查询结果渲染等。
+- **Cloudberry Database text component**: This component uses Elasticsearch to accelerate file retrieval. Compared with traditional GIN-based text queries, it improves query performance by an order of magnitude. It supports multiple word segmentation methods, natural language processing, and query result rendering.
-## 灵活的工作负载管理
+## Flexible workload management
-Cloudberry Database 提供了全面的工作负载管理功能,旨在有效地利用和优化数据库资源,以确保高效、稳定的运行。其工作负载管理主要包括连接级别管理、会话级别管理、SQL 级别管理三个层次的控制。
+Cloudberry Database provides comprehensive workload management capabilities designed to effectively utilize and optimize database resources to ensure efficient and stable operations. Its workload management includes three levels of control: connection level management, session level management, and SQL level management.
-点击以查看详情
+Click to view details
-- **连接池 PGBouncer(连接级别管理)**:通过连接池,Cloudberry Database 对用户接入进行统一管理,限制同时活跃的用户数量,以提高效率并避免因频繁创建和销毁服务进程而浪费资源。连接池具有较小的内存占用,并能够支持高并发连接,使用 libevent 进行 Socket 通信以提高通信效率。
+- **Connection pool PgBouncer (connection-level management)**: Through the connection pool, Cloudberry Database manages user access in a unified manner and limits the number of concurrently active users, which improves efficiency and avoids the resource waste caused by frequently creating and destroying service processes. The connection pool has a small memory footprint, supports a high number of concurrent connections, and uses libevent for socket communication to improve communication efficiency.
-- **资源组 Resource Group(会话级别管理)**:通过资源组,Cloudberry Database 能够分析并分类典型的工作负载,量化每个工作负载所需的 CPU、内存、并发度等资源。这样,根据工作负载的实际需求,可以设定适合的资源组,并动态调整资源使用,以确保整体运行效率。同时,还可以利用规则清理空闲的会话,释放不必要的资源。
+- **Resource Group (session-level management)**: Through resource groups, Cloudberry Database can analyze and categorize typical workloads, and quantify the CPU, memory, concurrency and other resources required by each workload. In this way, according to the actual needs of the workload, you can set a suitable resource group and dynamically adjust the resource usage to ensure the overall operating efficiency. At the same time, you can use rules to clean up idle sessions and release unnecessary resources.
-- **动态资源组分配(SQL 级别管理)**:通过动态资源组分配,Cloudberry Database 能够在 SQL 语句执行前或执行过程中灵活地分配资源,以便优待特定的查询,缩短其运行时间。
+- **Dynamic resource group allocation (SQL-level management)**: Through dynamic resource group allocation, Cloudberry Database can flexibly allocate resources before or during the execution of SQL statements, which can give priority to specific queries and shorten their execution time.
-## 多种兼容性
+## Multiple compatibility
-Cloudberry Database 的兼容性表现在 SQL 语法、组件、工具和程序、硬件平台和操作系统等多个方面,这使得它能够灵活应对各种工具、平台和语言。
+The compatibility of Cloudberry Database is manifested in multiple aspects such as SQL syntax, components, tools and programs, hardware platforms and operating systems. This makes the database flexible enough to deal with different tools, platforms and languages.
-点击以查看详情
+Click to view details
-- **SQL 兼容性**:Cloudberry Database 兼容 PostgreSQL 和 Greenplum 语法,支持 SQL-92,SQL-99,以及 SQL 2003 标准,包括 SQL 2003 OLAP 扩展,如窗口函数,`rollup`,`cube` 等。
+- **SQL compatibility**: Cloudberry Database is compatible with PostgreSQL and Greenplum syntax, supports SQL-92, SQL-99, and SQL 2003 standards, including SQL 2003 OLAP extensions, such as window functions, `rollup`, and `cube`.
-- **组件兼容性**:基于 PostgreSQL 14.4 内核,Cloudberry Database 兼容市面上常用的大多数 PostgreSQL 组件和扩展。
+- **Component compatibility**: Based on the PostgreSQL 14.4 kernel, Cloudberry Database is compatible with most commonly used PostgreSQL components and extensions.
-- **工具和程序兼容性**:与多种 BI 工具、挖掘预测工具、ETL 工具,以及 J2EE/.NET 应用程序都有良好的连通性。
+- **Tool and program compatibility**: Good connectivity with various BI tools, data mining and forecasting tools, ETL tools, and J2EE/.NET applications.
-- **硬件平台兼容性**:能够在多种硬件架构下运行,包括 X86、ARM、飞腾、鲲鹏、海光等。
+- **Hardware platform compatibility**: Can run on a variety of hardware architectures, including x86, ARM, Phytium, Kunpeng, and Hygon.
-- **操作系统兼容性**:兼容多种操作系统环境,如 CentOS、Ubuntu、Kylin、BC-Linux 等。
+- **Operating system compatibility**: Compatible with multiple operating system environments, such as CentOS, Ubuntu, Kylin, and BC-Linux.
From 9708868780641cbef43575690821740ab4df0c1e Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Mon, 19 Jun 2023 17:16:26 +0800
Subject: [PATCH 14/21] Apply suggestions from code review
Co-authored-by: IdaLee666 <59157379+IdaLee666@users.noreply.github.com>
---
docs/cbdb-overview.md | 42 +++++++++----------
.../current/cbdb-overview.md | 4 +-
2 files changed, 23 insertions(+), 23 deletions(-)
diff --git a/docs/cbdb-overview.md b/docs/cbdb-overview.md
index 2b681c432..660cc8616 100644
--- a/docs/cbdb-overview.md
+++ b/docs/cbdb-overview.md
@@ -4,7 +4,7 @@ title: Feature Overview
# Cloudberry Database Feature Overview
-Cloudberry Database, built on the latest PostgreSQL 14.4 kernel, is one of the most advanced and mature open-source MPP databases available today. It comes with multiple features, including high concurrency and high availability. It can perform quick and efficient computing for complex tasks, meeting the demands of managing and computing vast amounts of data. It is widely applied in multiple fields.
+Cloudberry Database, built on the latest PostgreSQL 14.4 kernel, is one of the most advanced and mature open-source MPP databases available. It comes with multiple features, including high concurrency and high availability. It can perform quick and efficient computing for complex tasks, meeting the demands of managing and computing vast amounts of data. It is widely applied in multiple fields.
This document gives a general introduction to the features of Cloudberry Database.
@@ -26,32 +26,32 @@ For different scenarios, Cloudberry Database supports multiple storage formats,
Click to see details
- **Even data distribution**: By using Hash and Random methods for data distribution, Cloudberry Database takes better advantage of disk performance and solves I/O bottleneck issues.
-- **Choice of multiple storage types**:
+- **Storage types**:
- Row-based storage: Suitable for scenarios where most fields are frequently queried, and there are many random row accesses.
- Column-based storage: When you need to query a small number of fields, this method can greatly save I/O operations, making it ideal for scenarios where large amounts of data are accessed frequently.
-- **Special storage modes**: Cloudberry Database has different storage modes such as Heap storage, AO row storage, AOCS column storage to optimize the performance of different types of applications. At the finest granularity level of partitioning, a table can have multiple storage modes.
-- **Support for partitioned tables**: You can define the partitioning of a table based on specific conditions. During querying, the system will automatically filter out the sub-tables that are not needed for the query, thus enhancing query efficiency.
-- **Efficient data compression function**: Cloudberry Database supports multiple compression algorithms, such as Zlib 1-9 and Zstandard 1~19, to enhance data processing performance and maintain a balance between CPU and compression ratio.
+- **Specialized storage modes**: Cloudberry Database has different storage modes, such as Heap storage, AO row storage, and AOCS column storage, to optimize the performance of different types of applications. At the finest granularity level of partitioning, a table can have multiple storage modes.
+- **Support for partitioned tables**: You can define the partitioning of a table based on specific conditions. During querying, the system will automatically filter out the sub-tables that are not needed for the query to improve query efficiency.
+- **Efficient data compression function**: Cloudberry Database supports multiple compression algorithms, such as Zlib (levels 1-9) and Zstandard (levels 1-19), to improve data processing performance and maintain a balance between CPU usage and compression ratio.
- **Optimization for small tables**: You can choose to use the Replication Table and specify a custom Hash algorithm when creating the table, allowing for more flexible control of data distribution.
## Multi-layer data security
-Cloudberry Database enhances user data protection by supporting function encryption and transparent data encryption (TDE). This means that the Cloudberry Database kernel performs these processes invisibly to users. The data formats subject to this encryption include Heap tables, AO row storage, and AOCS column storage. In addition to common encryption algorithms like AES, Cloudberry Database also supports national secret algorithms, allowing seamless integration of your own algorithms into the transparent data encryption process.
+Cloudberry Database enhances user data protection by supporting function encryption and transparent data encryption (TDE). TDE means that the Cloudberry Database kernel performs these processes invisibly to users. The data formats subject to TDE include Heap tables, AO row storage, and AOCS column storage. In addition to common encryption algorithms like AES, Cloudberry Database also supports national secret algorithms, allowing seamless integration of your own algorithms into the TDE process.
Click to view details
Cloudberry Database focuses on data security and provides security protection measures. These security measures are designed to satisfy different database environment needs and offer multi-layer security protection:
-- **Database isolation**: In Cloudberry Database, data is not shared between databases, which achieves isolation in a multi-database environment. If cross-database access is needed, you can use the DBLink feature.
+- **Database isolation**: In Cloudberry Database, data is not shared between databases, which achieves isolation in a multi-database environment. If cross-database access is required, you can use the DBLink feature.
- **Internal data organization**: The logical organization of data in the database includes data objects such as tables, views, indexes, and functions. Data access can be performed across schemas.
-- **Data storage security**: Cloudberry Database offers different storage modes to support data redundancy. It uses encryption methods including AES 128, 192, 256, DES, and national secret encryption to secure data storage. It also supports ciphertext authentication, which includes encryption algorithms like SCRAM-SHA-256, MD5, LDAP, RADIUS.
-- **User data protection**: Cloudberry Database supports function encryption and decryption, and transparent data encryption and decryption. The process is implemented by the Cloudberry Database kernel without any user interaction. It supports data formats such as Heap tables, AO row storage, and AOCS column storage. In addition to common encryption algorithms like AES, it also supports national secret algorithms, allowing you to easily add your own algorithms into transparent data encryption.
-- **Detailed permission settings**: To satisfy different users and objects (like schemas, tables, rows, columns, views, functions), Cloudberry Database provides a range of permission setting options. These include `SELECT`, `UPDATE`, execution rights, and ownership.
+- **Data storage security**: Cloudberry Database offers different storage modes to support data redundancy. It uses encryption methods including AES 128, AES 192, AES 256, DES, and national secret encryption to secure data storage. It also supports ciphertext authentication, with methods such as SCRAM-SHA-256, MD5, LDAP, and RADIUS.
+- **User data protection**: Cloudberry Database supports function encryption and decryption, and transparent data encryption and decryption. The process is implemented by the Cloudberry Database kernel without any user interaction. It supports data formats such as Heap tables, AO row storage, and AOCS column storage. In addition to common encryption algorithms like AES, Cloudberry Database also supports national secret algorithms, allowing you to easily add your own algorithms into transparent data encryption.
+- **Detailed permission settings**: To satisfy different users and objects (like schemas, tables, rows, columns, views, functions), Cloudberry Database provides a range of permission setting options, including `SELECT`, `UPDATE`, execution, and ownership.
@@ -62,13 +62,13 @@ Cloudberry Database provides a series of efficient and flexible data loading sol
Click to view details
-- **Parallel and persistent data loading**: Using external table technology, Cloudberry Database supports loading massive parallel and persistent data, and performs automatic conversion between character sets, such as from GBK to UTF-8. This feature makes data entry much smoother.
+- **Parallel and persistent data loading**: Cloudberry Database supports massive parallel and persistent data loading through external table technology, and performs automatic conversion between character sets, such as from GBK to UTF-8. This feature makes data entry much smoother.
- **Flexible data source and file format support**: Cloudberry Database supports data sources such as external file servers, Hive, Hbase, HDFS or S3, and supports data formats such as CSV, Text, JSON, ORC, and Parquet. In addition, the database can also load compressed data files such as Zip.
-- **Integrate multiple ETL tools**: DataStage, Informatica, Kettle and other ETL tools have been integrated into Cloudberry Database to facilitate data processing.
+- **Integrate multiple ETL tools**: Cloudberry Database is integrated with ETL tools such as DataStage, Informatica, and Kettle to facilitate data processing.
-- **Support stream data loading**: Cloudberry Database can start multiple parallel read tasks for the subscribed Kafka topic, cache the read records, and load the records into the database via gpfdist after a certain time or number of records. This method can ensure the integrity of data without duplication or loss, and is very suitable for stream data collection and real-time analysis scenarios. Cloudberry Database supports data loading throughput of tens of millions per minute.
+- **Support stream data loading**: Cloudberry Database can start multiple parallel read tasks for the subscribed Kafka topic, cache the read records, and load the records into the database via gpfdist after a certain time or number of records. This method can ensure the integrity of data without duplication or loss, and is suitable for stream data collection and real-time analysis scenarios. Cloudberry Database supports data loading throughput of tens of millions per minute.
- **High-performance data access**: PXF is a built-in component of Cloudberry Database, which can map external data sources to external tables of Cloudberry Database to achieve parallel and high-speed data access. PXF supports the management and access of hybrid data ecology and helps realize the Data Fabric architecture.
@@ -84,7 +84,7 @@ To ensure data security and service continuity, Cloudberry Database adopts a mul
- **Checksum of data page**: In the underlying storage, Cloudberry Database uses the checksum mechanism to detect bad blocks to ensure data integrity.
-- **Mirror node configuration**: By configuring mirror nodes among segments (or data nodes), Cloudberry Database can achieve high availability and failover of services. Once an unrecoverable failure of the primary node is detected, the system will automatically switch to the backup segment to ensure that user queries will not be affected.
+- **Mirror node configuration**: By configuring mirror nodes among segments (or data nodes), Cloudberry Database can achieve high availability and failover of services. Once an unrecoverable failure of a primary segment is detected, the system automatically switches to its mirror segment to ensure that user queries are not affected.
- **Backup of control nodes**: Similar to segments, master nodes (or control nodes) can also be configured as backup nodes or standby nodes in case the master node fails. Once the master node fails, the system will automatically switch to the standby node to ensure the continuity of services.
@@ -99,15 +99,15 @@ Cloudberry Database provides powerful data analysis features. These features mak
- **Parallel optimizer and executor**: The Cloudberry Database kernel has a built-in parallel optimizer and executor, which is not only compatible with the PostgreSQL ecosystem, but also supports data partition pruning and multiple indexing technologies (including B-Tree, Bitmap, Hash, Brin, GIN), and JIT (expression just-in-time compilation processing).
-- **Machine learning components - MADlib**: Cloudberry Database integrates MADlib components, providing users with fully SQL-driven machine learning features, enabling deep integration of algorithms, computing power, and data.
+- **Machine learning component (MADlib)**: Cloudberry Database integrates the MADlib component, providing users with fully SQL-driven machine learning features and enabling deep integration of algorithms, computing power, and data.
-- **Support multiple programming languages**: Cloudberry Database provides developers with a rich choice of programming languages, including R, Python, Perl, Java, and PostgreSQL, so that they can easily write custom functions.
+- **Support multiple programming languages**: Cloudberry Database provides developers with a rich set of programming languages, including R, Python, Perl, Java, and PostgreSQL, so that they can easily write custom functions.
-- **High-performance parallel computing based on MPP engine**: Cloudberry Database's MPP engine supports high-performance parallel computing, seamlessly integrated with SQL, and can perform fast computing and analysis on SQL execution results.
+- **High-performance parallel computing based on MPP engine**: The MPP engine of Cloudberry Database supports high-performance parallel computing, integrates seamlessly with SQL, and can perform fast computing and analysis on SQL execution results.
- **PostGIS geographic data processing**: Cloudberry Database introduces an upgraded version of PostGIS 2.X, supports its MPP architecture, and further improves the processing capability of geospatial data. Key features include:
- - Integrated object storage: supports directly loading large-capacity geospatial data from object storage (OSS) into the database.
+ - Support for object storage: supports directly loading large-capacity geospatial data from object storage (OSS) into the database.
- Comprehensive spatial data type support: including geometry, geography, and raster.
- Spatio-temporal index: Provides spatio-temporal index technology, which can effectively accelerate space- and time-related queries.
- Complex spatial and geographic calculations: including sphere length calculations as well as spatial aggregation functions (such as contain, cover, intersect).
@@ -125,15 +125,15 @@ Cloudberry Database provides comprehensive workload management capabilities desi
- **Connection pool PGBouncer (connection-level management)**: Through the connection pool, Cloudberry Database manages user access in a unified manner, and limits the number of concurrently active users to improve efficiency, and avoid wasting resources caused by frequently creating and destructing service processes. The connection pool has a small memory footprint and can support high concurrent connections, using libevent for Socket communication to improve communication efficiency.
-- **Resource Group (session-level management)**: Through resource groups, Cloudberry Database can analyze and categorize typical workloads, and quantify the CPU, memory, concurrency and other resources required by each workload. In this way, according to the actual needs of the workload, you can set a suitable resource group and dynamically adjust the resource usage to ensure the overall operating efficiency. At the same time, you can use rules to clean up idle sessions and release unnecessary resources.
+- **Resource Group (session-level management)**: Through resource groups, Cloudberry Database can analyze and categorize typical workloads, and quantify the CPU, memory, concurrency and other resources required by each workload. In this way, according to the actual requirements of the workload, you can set a suitable resource group and dynamically adjust the resource usage to ensure the overall operating efficiency. At the same time, you can use rules to clean up idle sessions and release unnecessary resources.
-- **Dynamic resource group allocation (SQL-level management)**: Through dynamic resource group allocation, Cloudberry Database can flexibly allocate resources before or during the execution of SQL statements, which can give priority to specific queries and shorten their execution time.
+- **Dynamic resource group allocation (SQL-level management)**: Through dynamic resource group allocation, Cloudberry Database can flexibly allocate resources before or during the execution of SQL statements, which can give priority to specific queries and shorten the execution time.
## Multiple compatibility
-The compatibility of Cloudberry Database is manifested in multiple aspects such as SQL syntax, components, tools and programs, hardware platforms and operating systems. This makes the database flexible enough to deal with different tools, platforms and languages.
+The compatibility of Cloudberry Database is reflected in multiple aspects such as SQL syntax, components, tools and programs, hardware platforms and operating systems. This makes the database flexible enough to deal with different tools, platforms and languages.
Click to view details
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
index 4f92fa3cb..2302fb417 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-overview.md
@@ -1,8 +1,8 @@
---
-title: 特性概览
+title: 产品特性
---
-# Cloudberry Database 特性概览
+# Cloudberry Database 产品特性
Cloudberry Database 基于最新的 PostgreSQL 14.4 内核,是当前最先进的成熟开源 MPP 数据库之一,具备高并发、高可用等多种特性,可以对复杂任务进行快速高效计算,以满足海量数据管理和计算的需求,目前在多个领域都有着广泛应用。
From 0dc729ecee9c77a020d88a8f227a3c4dadb0f2e3 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Mon, 19 Jun 2023 17:16:56 +0800
Subject: [PATCH 15/21] Update docs/cbdb-overview.md
Co-authored-by: IdaLee666 <59157379+IdaLee666@users.noreply.github.com>
---
docs/cbdb-overview.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/cbdb-overview.md b/docs/cbdb-overview.md
index 660cc8616..addc3dfab 100644
--- a/docs/cbdb-overview.md
+++ b/docs/cbdb-overview.md
@@ -109,7 +109,7 @@ Cloudberry Database provides powerful data analysis features. These features mak
- Support for object storage: supports directly loading large-capacity geospatial data from object storage (OSS) into the database.
- Comprehensive spatial data type support: including geometry, geography, and raster.
- - Spatio-temporal index: Provides spatio-temporal index technology, which can effectively accelerate space- and time-related queries.
+ - Spatio-temporal index: Provides spatio-temporal index technology, which can effectively accelerate spatial and temporal queries.
- Complex spatial and geographic calculations: including sphere length calculations as well as spatial aggregation functions (such as contain, cover, intersect).
- **Cloudberry Database text component**: This component supports using ElasticSearch to accelerate file retrieval capabilities. Compared with traditional GIN data text query performance, this component has an order of magnitude improvement. It supports multiple word segmentation, natural language processing, and query result rendering.
From f1bfe0f20e654cdc84e4ab497cd3fdc4418d5d03 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Sun, 25 Jun 2023 15:00:20 +0800
Subject: [PATCH 16/21] add cbdb scenarios
---
.../current/cbdb-scenarios.md | 33 ++++++++++++++++++-
1 file changed, 32 insertions(+), 1 deletion(-)
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
index 1a4f3f169..d173a64f9 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
@@ -2,4 +2,35 @@
title: 使用场景
---
-TODO
\ No newline at end of file
+本文档介绍 Cloudberry Database 的使用场景。
+
+**场景一:离线批处理数据仓库和数据集市建设 (Data Warehousing and Data Marts)**
+
+- 构建高性能的 Cloudberry Database 数据仓库和数据集市,用于存储和查询大规模数据集,包含贴源层、明细层、汇总层等等,支持贴源模型建设、范式化模型建设、维度表和事实表建设等等,支持多种方式将源数据加载至数据仓库。
+- 支持多种类型的数据加工处理。
+- 支持高并发、高性能、低运维的数据仓库和数据集市建设。
+- 支持复杂的数据分析和查询需求,包括数据聚合、多维分析、关联查询等。
+
+**场景二:实时数据仓库建设**
+
+- 支持高时效的数据仓库建设,支持流式数据的采集和处理,实现数据实时分析。
+
+**场景三:数据中台建设**
+
+- 支持数据中台中MPP数据平台的建设,支持分布式并行处理架构。
+- 支持数据中台数据仓库的建设,支持多种主流ETL工具的对接。
+
+**场景四:湖仓一体建设**
+
+- 支持企业湖仓一体建设,支持数据湖和数据仓库之间数据高效的互访。
+
+**场景五:现有MPP数据库替换**
+
+- 支持非国产数据库的替换,例如 Oracle、Teradata、Greenplum、Vertica 等。
+- 支持其他类型MPP数据库的替换,例如 Gbase8a、GaussDB 等。
+
+场景六:地理信息系统 (GIS) 应用 (Geographic Information System Applications)
+
+- 在 Cloudberry Database 上构建地理信息系统 (GIS) 应用。
+- 存储和查询地理位置数据,支持空间数据分析、地理编码和地图可视化等功能。
+- 可以应用于城市规划、地理分析、地图导航等领域。
From 2ea499c781a87efe1ebe93ed505ece1b985bd054 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Sun, 25 Jun 2023 15:01:15 +0800
Subject: [PATCH 17/21] Update cbdb-scenarios.md
---
.../zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
index d173a64f9..eb7440a5a 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
@@ -29,7 +29,7 @@ title: 使用场景
- 支持非国产数据库的替换,例如 Oracle、Teradata、Greenplum、Vertica 等。
- 支持其他类型MPP数据库的替换,例如 Gbase8a、GaussDB 等。
-场景六:地理信息系统 (GIS) 应用 (Geographic Information System Applications)
+**场景六:地理信息系统 (GIS) 应用 (Geographic Information System Applications)**
- 在 Cloudberry Database 上构建地理信息系统 (GIS) 应用。
- 存储和查询地理位置数据,支持空间数据分析、地理编码和地图可视化等功能。
From de064868905b9f07c77043708c301a8a0be78d65 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Tue, 27 Jun 2023 14:48:39 +0800
Subject: [PATCH 18/21] Apply suggestions from ljj
Co-authored-by: IdaLee666 <59157379+IdaLee666@users.noreply.github.com>
---
docs/cbdb-architecture.md | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/docs/cbdb-architecture.md b/docs/cbdb-architecture.md
index ab5324f51..fdc5544f5 100644
--- a/docs/cbdb-architecture.md
+++ b/docs/cbdb-architecture.md
@@ -6,13 +6,13 @@ title: Architecture
This document introduces the product architecture and the implementation mechanism of the internal modules in Cloudberry Database.
-In most cases, Cloudberry Database is similar to PostgreSQL in terms of SQL support, features, configuration options, and user functionalities. The experience users have with Cloudberry Database is similar to interacting with a standalone PostgreSQL system.
+In most cases, Cloudberry Database is similar to PostgreSQL in terms of SQL support, features, configuration options, and user functionalities. Users can interact with Cloudberry Database in a similar way to how they interact with a standalone PostgreSQL system.
Cloudberry Database uses MPP (Massively Parallel Processing) architecture to store and process large volumes of data, by distributing data and computing workloads across multiple servers or hosts.
-MPP, also known as the shared-nothing architecture, refers to systems with multiple hosts that work together to perform a task. Each host has its own processor, memory, disk, network resources, and operating system. Cloudberry Database uses this high-performance architecture to distribute data loads and can use all system resources in parallel to process queries.
+MPP, known as the shared-nothing architecture, refers to systems with multiple hosts that work together to perform a task. Each host has its own processor, memory, disk, network resources, and operating system. Cloudberry Database uses this high-performance architecture to distribute data loads and can use all system resources in parallel to process queries.
-From users' view, Cloudberry Database is a complete relational database management system (RDBMS). In a physical view, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances work together, Cloudberry Database performs distributed cluster processing at different levels for data storage, computing, communication, and management. Although Cloudberry Database is a cluster, it hides all the distributed details from the user and provides a single logical database. This greatly eases the work of developers and operational staff.
+From users' view, Cloudberry Database is a complete relational database management system (RDBMS). In a physical view, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances work together, Cloudberry Database performs distributed cluster processing at different levels for data storage, computing, communication, and management. Cloudberry Database hides the complex details of the distributed system, giving users a single logical database view. This greatly eases the work of developers and operational staff.
The architecture diagram of Cloudberry Database is as follows:
@@ -20,9 +20,9 @@ The architecture diagram of Cloudberry Database is as follows:
- **Master node** (or control node) is the gateway to the Cloudberry Database system, which accepts client connections and SQL queries, and allocates tasks to data node instances. Users interact with Cloudberry Database by connecting to the master node using a client program (such as psql) or an application programming interface (API) (such as JDBC, ODBC, or libpq PostgreSQL C API).
- The master node acts as the global system directory, containing a set of system tables that record the metadata of Cloudberry Database.
- - The master node does not store any user data. User data is stored only in the data node instances.
- - The master node performs authentication for client connections, processes SQL commands, distributes workload among segments, coordinates the results returned by each segment, and presents the final results to the client program.
- - Cloudberry Database uses Write Ahead Logging (WAL) for master node/standby mirroring. In WAL-based logging, all modifications are first written to a log before being written to the disk, which ensures the data integrity of any in-process operation.
+ - The master node does not store user data. User data is stored only in data node instances.
+ - The master node performs authentication for client connections, processes SQL commands, distributes workload among segments, coordinates the results returned by each segment, and returns the final results to the client program.
+ - Cloudberry Database uses Write Ahead Logging (WAL) for master/standby mirroring. In WAL-based logging, all modifications are first written to a log before being written to the disk, which ensures the data integrity of in-process operations.
- **Segment** (or data node) instances are individual Postgres processes, each storing a portion of the data and executing the corresponding part of the query. When a user connects to the database through the master node and submits a query request, a process is created on each segment node to handle the query. User-defined tables and their indexes are distributed across the available segments, and each segment node contains distinct portions of the data. The processes of data processing runs in the corresponding segment. Users interact with segments through the master, and the segment operate on servers known as the segment host.
@@ -30,10 +30,10 @@ The architecture diagram of Cloudberry Database is as follows:
- **Interconnect** is the network layer in the Cloudberry Database system architecture. Interconnect refers to the network infrastructure upon which the communication between the master node and the segments relies, which uses a standard Ethernet switching structure.
- For performance reasons, it is recommended to use a network with a speed of 10 GB or faster. By default, the Interconnect module uses the UDP protocol with flow control (UDPIFC) for communication to send messages through the network. The data packet verification performed by Cloudberry Database exceeds the scope provided by UDP, meaning that its reliability is equivalent to using the TCP protocol, and its performance and scalability surpass the TCP protocol. If the Interconnect is changed to use the TCP protocol, the scalability of Cloudberry Database is limited to 1000 segments. This limit does not apply when UDPIFC is used as the default protocol.
+ For performance reasons, a 10-Gigabit or faster network is recommended. By default, the Interconnect module uses the UDP protocol with flow control (UDPIFC) to send messages through the network. The data packet verification performed by Cloudberry Database exceeds the scope provided by UDP, which means that its reliability is equivalent to that of the TCP protocol, and its performance and scalability surpass the TCP protocol. If the Interconnect is changed to the TCP protocol instead, the scalability of Cloudberry Database is limited to 1000 segments. This limit does not apply when UDPIFC is used as the default protocol.
-- Cloudberry Database uses Multiversion Concurrency Control (MVCC) to ensure data consistency. This means that when querying the database, each transaction only sees a snapshot of the data, ensuring that current transactions do not see modifications made by other transactions on the same records. This provides transaction isolation for each transaction in the database.
+- Cloudberry Database uses Multiversion Concurrency Control (MVCC) to ensure data consistency. When querying the database, each transaction only sees a snapshot of the data, ensuring that current transactions do not see modifications made by other transactions on the same records. In this way, MVCC provides transaction isolation in the database.
MVCC minimizes lock contention to ensure performance in a multi-user environment. This is done by avoiding explicit locking for database transactions.
- In concurrency control, MVCC does not introduce conflicts for query (read) locks and write locks. In addition, read and write operations do not block each other. This is the biggest advantages of using MVCC over using lock mechanism.
+ In concurrency control, MVCC does not introduce conflicts for query (read) locks and write locks. In addition, read and write operations do not block each other. This is the biggest advantage of MVCC over the lock mechanism.
From 92db88574cc70f8cd28e04cdac7e52a5a74d66e4 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Tue, 27 Jun 2023 14:54:52 +0800
Subject: [PATCH 19/21] address comments from my-ship-it
---
.../docusaurus-plugin-content-docs/current/cbdb-architecture.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
index a200cb821..30acb3648 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-architecture.md
@@ -10,7 +10,7 @@ title: 架构介绍
Cloudberry Database 采用 MPP 架构技术,通过在多个服务器或主机之间分配数据和处理工作负载来存储和处理大量数据。
-MPP 也称为无共享体系架构,是指具有多台主机的系统,这些主机协作执行一项操作。每台主机都有自己的处理器、内存、磁盘、网络资源和操作系统。Cloudberry Database使用这种高性能的系统架构来分配海量数据的负载,并且可以并行使用系统的所有资源来处理查询。
+MPP 也称为大规模并行处理架构,是指具有多台主机的系统,这些主机协作执行同一操作。每台主机都有自己的处理器、内存、磁盘、网络资源和操作系统。Cloudberry Database 使用这种高性能的系统架构来分配海量数据的负载,并且可以并行使用系统的所有资源来处理查询。
从用户角度来看,Cloudberry Database 是一个完备的关系数据库管理系统 (RDBMS)。从物理层面来看,它内含多个 PostgreSQL 实例。为了实现多个独立 PostgreSQL 实例的分工和合作,Cloudberry Database 在不同层面对数据存储、计算、通信和管理进行了分布式集群化处理。Cloudberry Database 虽然是一个集群,然而对用户而言,它封装了所有分布式的细节,为用户提供了单个逻辑数据库。这种封装极大地解放了开发人员和运维人员的工作。
From 55eda0ddede5468b85b9edb2c2aaf6bcbcaed1df Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Tue, 27 Jun 2023 17:35:33 +0800
Subject: [PATCH 20/21] fix misspelling
---
docs/cbdb-vs-gp-features.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/cbdb-vs-gp-features.md b/docs/cbdb-vs-gp-features.md
index deafc6a39..1efe68506 100644
--- a/docs/cbdb-vs-gp-features.md
+++ b/docs/cbdb-vs-gp-features.md
@@ -1,8 +1,8 @@
---
-title: Comparision with Greenplum Features
+title: Comparison with Greenplum Features
---
-# Comparision with Greenplum Features
+# Comparison with Greenplum Features
Cloudberry Database is 100% compatible with Greenplum, and provides all the Greenplum features you need.
From 1785a5ab38da75e12fdf3e27d6efbb4742f36366 Mon Sep 17 00:00:00 2001
From: TomShawn <41534398+TomShawn@users.noreply.github.com>
Date: Thu, 29 Jun 2023 17:00:26 +0800
Subject: [PATCH 21/21] add scenarios for en
---
docs/cbdb-scenarios.md | 33 ++++++++++++++++++-
.../current/cbdb-scenarios.md | 8 ++---
2 files changed, 36 insertions(+), 5 deletions(-)
diff --git a/docs/cbdb-scenarios.md b/docs/cbdb-scenarios.md
index 4ab7328c4..8f079c510 100644
--- a/docs/cbdb-scenarios.md
+++ b/docs/cbdb-scenarios.md
@@ -2,4 +2,35 @@
title: User Scenarios
---
-TODO
\ No newline at end of file
+This document introduces the use cases of Cloudberry Database.
+
+**Scenario 1: Building offline batch-processing data warehouses and data marts**
+
+- Builds high-performance data warehouses and data marts on Cloudberry Database for storing and querying large-scale datasets, including the Operational Data Store (ODS), Data Warehouse Detail (DWD), and Data Warehouse Summary (DWS) layers. Supports building source models, normalized models, dimension tables, fact tables, and more, with multiple ways to load source data into the data warehouse.
+- Supports multiple types of data processing.
+- Supports building data warehouses and data marts with high concurrency, high performance, and low maintenance cost.
+- Supports complex data analysis and query needs, including data aggregation, multi-dimensional analysis, and correlated queries.
+
+**Scenario 2: Building real-time data warehouses**
+
+- Supports building real-time data warehouses, and supports collecting and processing streaming data for real-time analysis.
+
+**Scenario 3: Building a data middle platform**
+
+- Supports building an MPP data platform within the data middle platform, with a distributed parallel processing architecture.
+- Supports building data warehouses within the data middle platform, and integrates with mainstream ETL tools.
+
+**Scenario 4: Building an integrated data lake and data warehouse**
+
+- Supports building enterprise-level lake-warehouse integration, with efficient data exchange between the data lake and the data warehouse.
+
+**Scenario 5: Replacing existing MPP databases**
+
+- Supports replacing common databases, such as Oracle, Teradata, Greenplum, and Vertica.
+- Supports replacing other types of MPP databases, such as GBase 8a and GaussDB.
+
+**Scenario 6: Geographic Information System (GIS) applications**
+
+- Builds Geographic Information System (GIS) applications on Cloudberry Database.
+- Stores and queries geographic location data. Supports spatial data analysis, geocoding, and map visualization.
+- Can be applied to city planning, geographic analysis, and map navigation.
diff --git a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
index eb7440a5a..f6d145f4b 100644
--- a/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/cbdb-scenarios.md
@@ -17,17 +17,17 @@ title: 使用场景
**场景三:数据中台建设**
-- 支持数据中台中MPP数据平台的建设,支持分布式并行处理架构。
-- 支持数据中台数据仓库的建设,支持多种主流ETL工具的对接。
+- 支持数据中台中 MPP 数据平台的建设,支持分布式并行处理架构。
+- 支持数据中台数据仓库的建设,支持多种主流 ETL 工具的对接。
**场景四:湖仓一体建设**
- 支持企业湖仓一体建设,支持数据湖和数据仓库之间数据高效的互访。
-**场景五:现有MPP数据库替换**
+**场景五:现有 MPP 数据库替换**
- 支持非国产数据库的替换,例如 Oracle、TeraData、Greenplum、Vertical 等。
-- 支持其他类型MPP数据库的替换,例如 Gbase8a、GaussDB 等。
+- 支持其他类型 MPP 数据库的替换,例如 Gbase 8a、GaussDB 等。
**场景六:地理信息系统 (GIS) 应用 (Geographic Information System Applications)**