
Cherry-pick Resgroup V2 and toolkit from GPDB #531

Merged
merged 51 commits into from
Aug 19, 2024

Conversation

jiaqizho
Contributor

fix #ISSUE_Number


Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it breaks backward or forward compatibility, please clarify.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists to review before/when submitting your pull request:

  • Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution(One-time setup).
  • Learn the coding contribution guide, including our code conventions, workflow and more.
  • List your communication in the GitHub Issues or Discussions (if any, or as needed).
  • Document changes.
  • Add tests for the change
  • Pass make installcheck
  • Pass make -C src/test installcheck-cbdb-parallel
  • Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

@CLAassistant

CLAassistant commented Jul 23, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 16 committers have signed the CLA.

✅ jiaqizho
❌ fairyfar
❌ RMTT
❌ SmartKeyerror
❌ bmdoil
❌ andr-sokolov
❌ huansong
❌ beeender
❌ zxuejing
❌ Tao-T
❌ kainwen
❌ hyongtao-db
❌ adam8157
❌ l-wang
❌ yanwr1
❌ dreamedcheng
You have signed the CLA already but the status is still pending? Let us recheck it.

@jiaqizho jiaqizho force-pushed the cp-gp-723-final branch 2 times, most recently from f812bdb to 1ad3214 Compare July 23, 2024 09:32
@jiaqizho
Contributor Author

@tuhaihe why is this CI not running? Won't --amend && push --force trigger CI?

@jiaqizho jiaqizho force-pushed the cp-gp-723-final branch 4 times, most recently from 1575a4a to d71e335 Compare July 25, 2024 06:56
@avamingli avamingli added the cherry-pick upstream commits label Jul 30, 2024
@jiaqizho jiaqizho force-pushed the cp-gp-723-final branch 2 times, most recently from 6fb3f29 to bd51bc4 Compare August 1, 2024 07:19
@jiaqizho jiaqizho force-pushed the cp-gp-723-final branch 4 times, most recently from 4683313 to b5f0d48 Compare August 14, 2024 07:22
SmartKeyerror and others added 13 commits August 14, 2024 15:27
After we changed the semantics of the resource group at #14562, it's time to add the implementation
of Linux Cgroup v2.

In the Linux kernel, there is no difference between Cgroup v1 and Cgroup v2 in the internal implementation;
the only change is the pseudo-file interface. Thanks to #14343, we just need to add a new file
and implement a new `cgroupOpsRoutine`, which can be found at `get_group_routine_v2()`.

Originally we assumed that developers would have different Cgroup mount directories, so we
generated test files dynamically. That turned out to be over-engineered, so we moved some test files
from `input/` to `sql/`, which is more convenient for developers.

The implementation lives in `src/backend/utils/resgroup/cgroup-ops-linux-v2.c`. Cgroup version 2 uses
the "unified hierarchy": the pseudo-files of all controllers live in one directory, instead of separate
directories such as "cpu,cpuacct", "cpuset", "memory", etc.

This is the biggest difference between them, and it directly decides how we configure them. Here is an
example:

```
smart@stable:/sys/fs/cgroup$ ls
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd  unified

[root@localhost gpdb]# ls
cgroup.controllers  cgroup.max.descendants  cgroup.type    cpu.stat         cpuset.cpus            io.max         pids.current
cgroup.events       cgroup.procs            cpu.idle       cpu.uclamp.max   cpuset.cpus.effective  io.pressure    pids.events
cgroup.freeze       cgroup.stat             cpu.max        cpu.uclamp.min   cpuset.cpus.partition  io.prio.class  pids.max
cgroup.kill         cgroup.subtree_control  cpu.max.burst  cpu.weight       cpuset.mems            io.stat
cgroup.max.depth    cgroup.threads          cpu.pressure   cpu.weight.nice  cpuset.mems.effective  io.weight
```
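The two listings above show the practical difference: v2 exposes a single `cgroup.controllers` file at the mount root, while v1 has per-controller subdirectories. A minimal, hypothetical Python sketch of a version probe based on that distinction (not code from this patch; the function name is invented):

```python
import os

def detect_cgroup_version(mount_point="/sys/fs/cgroup"):
    """Return 2 if the unified hierarchy is mounted at mount_point, else 1.

    Cgroup v2 exposes a single `cgroup.controllers` pseudo-file at the
    root, while v1 instead has per-controller subdirectories such as
    `cpu`, `memory`, and `cpuset`.
    """
    if os.path.exists(os.path.join(mount_point, "cgroup.controllers")):
        return 2
    return 1
```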

The document of Cgroup v2 can be found at:

- [Control Group v2](https://docs.kernel.org/admin-guide/cgroup-v2.html)
- [Using cgroups-v2 to control distribution of CPU time for applications](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel)

For now, if we want to enable Cgroup v2, we can use:

```
gpconfig -c gp_resource_manager -v group-v2
```

to configure the Greenplum database, and if we want to use Cgroup v1, just use:

```
gpconfig -c gp_resource_manager -v group
```
1. `pg_resgroup.dat` has defined reserved resource group oids, so there is no need to redefine them.
2. `DEFAULTAUXILIARY_OID` was renamed to `SYSTEMRESGROUP_OID`.
We need to bypass queries that touch only `catalog` tables in resource
groups in some situations, e.g. database GUI clients (DBeaver), which
will run some catalog queries to fetch metadata.

The code just walks the parse tree and finds each `RangeVar` to compare its
schemaname with `pg_catalog`; only a few cases are considered
for now. Because every query is walked once resource groups are enabled,
the overhead should be as small as possible.

The following catalog query will be bypassed:
```sql
SELECT n.oid,n.*,d.description FROM pg_catalog.pg_namespace n
LEFT OUTER JOIN pg_catalog.pg_description d ON d.objoid=n.oid AND
d.objsubid=0 AND d.classoid='pg_namespace'::regclass
WHERE nspname='public' ORDER BY nspname;
```
But queries without any `RangeVar` will be ignored, such as:
```sql
SELECT 1;
SELECT some_udf();
```
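The rule above can be modeled in a few lines — a hypothetical Python illustration of the check, not the actual C code; `is_pure_catalog_query` and its input shape are invented for clarity:

```python
# Hypothetical model of the bypass check: a query is bypassed only when
# every RangeVar it references resolves to the pg_catalog schema.
CATALOG_SCHEMA = "pg_catalog"

def is_pure_catalog_query(range_vars):
    """range_vars: (schemaname, relname) pairs collected from the parse
    tree. Queries with no RangeVar at all (SELECT 1, SELECT some_udf())
    are not bypassed, matching the behavior described above."""
    if not range_vars:
        return False
    return all(schema == CATALOG_SCHEMA for schema, _ in range_vars)
```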
…290)

In resource group mode, if the user requests more than statement_mem,
grant that; otherwise set statement_mem to the query's memory limit.
Since commit c25db0c supports resgroup with memory_limit, add it to the
gp_resgroup_config view.
1. Add views to get "existing" relation files in the database,
  including the default, global and user tablespaces. Note that
  this won't expose files outside of the data directory, as we
  only use pg_ls_dir to get the file list, which has no access
  to any files outside the data directories (including the
  user tablespace directory).

2. Add views to get "expected" relation files in the database,
  using the knowledge from the catalog.

3. Using 1 and 2, construct views to get the missing files (i.e.
  files that are expected but do not exist) and orphaned files (i.e.
  files that exist unexpectedly).

4. Create views to run the above views in MPP. Also, we
  support checking extended data files for AO/CO tables.

5. Add regress tests.

To use:
```
  -- checking non-extended files
  select * from gp_toolkit.gp_check_missing_files;
  select * from gp_toolkit.gp_check_orphaned_files;

  -- checking all data files including the extended data files
  -- (e.g. 12345.1, 99999.2). These do not count supporting files
  -- such as .fsm .vm etc. And currently we only support checking
  -- extended data files for AO/CO tables, not heap.
  select * from gp_toolkit.gp_check_missing_files_ext;
  select * from gp_toolkit.gp_check_orphaned_files_ext;
```

Note:
* As mentioned, we currently support checking all the non-extended data
  files and the extended data files of AO/CO tables. The main reason
  to separate these two is performance: constructing the expected file
  list for AO/CO segments runs dynamic SQL on each aoseg/aocsseg table
  and could be slow, so we only do that when really required.
* For heap tables, we currently have no way to get the expected number
  of data files for a certain table: we cannot use pg_relation_size
  because that in turn depends on the number of data files itself.
  So we always skip their extended files for now.
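The relationship between the "expected" and "existing" views above is plain set arithmetic; a hedged Python sketch of the idea (the names are illustrative — the real work is done by the gp_toolkit views):

```python
def classify_relation_files(expected, existing):
    """Given the 'expected' file set (derived from the catalog) and the
    'existing' file set (from pg_ls_dir), return (missing, orphaned):
    missing files are expected but absent, orphaned files are present
    but unexpected."""
    expected, existing = set(expected), set(existing)
    return expected - existing, existing - expected
```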
In the case of insert-select, the value column contained information about
the coordinator only. The error was fixed by setting the proexeclocation property of
the function to i (PROEXECLOCATION_INITPLAN).
https://github.com/greenplum-db/gpdb/issues/14154

Execution of the "SELECT * FROM pg_resgroup_get_status_kv(NULL);" query resulted
in a segmentation fault. A NULL check has been added.

The dumpResGroupInfo() function was called twice per query. Now it is called
only once.

Simplification of the function code.
This commit introduces a new function and 2 new views to gp_toolkit
that summarize column size and compression ratio for column-oriented tables.

Gather column size and compression ratio for given AOCO table.
- gp_toolkit.get_column_size(oid)

Gather column size and compression ratio for column-oriented tables from
all segments
- gp_toolkit.gp_column_size

Summary view of gp_column_size. Aggregates column size and compression
ratio for column-oriented tables from all segments.
- gp_toolkit.gp_column_size_summary

Authored-by: Brent Doil <[email protected]>
…tatus() function (#14959)

In the case of insert-select, the cpu_usage field contains information about
the master only. This leads to an error when analyzing the table, as the
contents of the field do not conform to JSON format.
The error was fixed by setting the proexeclocation property of the function
to i (PROEXECLOCATION_INITPLAN).

https://github.com/greenplum-db/gpdb/issues/14154
Add memory_usage into the function pg_resgroup_get_status() result.

postgres=# select * from gp_toolkit.gp_resgroup_status_per_host;
    rsgname    | groupid | hostname | cpu_usage | memory_usage
---------------+---------+----------+-----------+--------------
 admin_group   |    6438 | zero     |      0.09 |         5.55
 default_group |    6437 | zero     |      0.00 |         0.00
 system_group  |    6441 | zero     |      0.01 |        40.55
(3 rows)

It returns the memory_usage in megabytes, to help customers address memory problems.

For Linux Cgroup v1, the memory usage is read from `memory.usage_in_bytes`. For Linux Cgroup
v2, it's read from `memory.current`.
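As a rough illustration of the version-specific read described above (a hypothetical helper, not the actual implementation, which lives in the server's cgroup ops routines):

```python
import os

def resgroup_memory_usage_mb(cgroup_dir, version):
    """Read the group's memory usage from the version-specific
    pseudo-file (memory.usage_in_bytes for v1, memory.current for v2)
    and convert bytes to megabytes, as the view reports."""
    name = "memory.usage_in_bytes" if version == 1 else "memory.current"
    with open(os.path.join(cgroup_dir, name)) as f:
        return round(int(f.read().strip()) / (1024 * 1024), 2)
```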
Greenplum introduced auto_stats a long time ago; please refer to
https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/bAyw2KBP6yE/m/hmoWikrPAgAJ
for the details and rationale of auto_stats.

auto_stats() is now invoked at the following 7 places:
  1. ProcessQuery()
  2. _SPI_pquery()
  3. postquel_end()
  4. ATExecExpandTableCTAS()
  5. ATExecSetDistributedBy()
  6. DoCopy()
  7. ExecCreateTableAs()

Previously, places 2 and 3 were hard-coded as inside-function,
and places 1 and 4~7 as not-inside-function.
Places 4~7 do not cover the case where COPY or CTAS
is called inside a procedural language.

Since auto_stats will be removed in the future, for now let's
just do a simple fix instead of a big refactor.

To correctly pass the inFunction parameter for auto_stats()
at places 4~7, we introduce executor_run_nesting_level to mark
whether the program is already under ExecutorRun(). Places 4~7 are
taken directly as utility statements and will not call ExecutorRun()
unless they are inside a procedural language. This technique is like
the extension `auto_explain`.

For places 1~3 the context is clear, so we do not need to check
executor_run_nesting_level.

NOTE: Greenplum now supports CREATE PROCEDURE, invoked
as CALL proc() instead of SELECT proc(). To handle this case,
we also keep a static module variable marking the nesting level
of standard_ProcessUtility; a nesting level >= 2 means we are
inside a procedure call.
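The executor_run_nesting_level trick can be sketched in a few lines of Python — a toy model of the idea, not the GPDB code; `executor_run` and `calls` are invented names:

```python
# Toy model of the executor_run_nesting_level idea (same trick as the
# auto_explain extension): a module-level counter is incremented on
# entry to ExecutorRun, so any re-entrant call knows it is running
# inside a function.
executor_run_nesting_level = 0
calls = []  # records the inFunction flag observed by each invocation

def executor_run(body):
    """body() stands in for plan execution and may call executor_run
    again, e.g. when a UDF in the plan runs its own SQL."""
    global executor_run_nesting_level
    calls.append(executor_run_nesting_level > 0)  # inFunction?
    executor_run_nesting_level += 1
    try:
        body()
    finally:
        executor_run_nesting_level -= 1
```

The outermost call observes nesting level 0 (not inside a function); a nested call observes a positive level.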
When gp_resource_group_bypass_catalog_query is on, unassign the pure
catalog queries from their resource group, and bypass them.

Traverse the flattened rangetable entries to decide whether the query
is a pure-catalog query.

In addition, it is hard to see whether a query contains a UDF, and
traversing the whole plan is expensive, so UDFs are ignored even though
the SQL commands inside a UDF may include non-catalog queries. For the
special case 'select UDF;', however, the query won't be bypassed. We
need not consider the SQL commands inside the UDF: they will also be
bypassed, or will use the same resgroup as the outer query.
SmartKeyerror and others added 16 commits August 14, 2024 15:27
* Revise some docs about resource group

* fix typo

* address review comments

* Update workload_mgmt_resgroups.html.md

few corrections

* update cgroup v2 doc

* Update workload_mgmt_resgroups.html.md

formatting

* Update workload_mgmt_resgroups.html.md

---------

Co-authored-by: mperezfuster <[email protected]>
Without the fix, the view gp_toolkit.gp_workfile_entries will display
the same pid of a query across all segments.

This change needed a version bump for gp_toolkit.
* Add cgroup disk io monitor function.

Add cgroup disk io monitor function like 'systemd-cgtop'.

Implementation:
 1. read the content of 'io.stat' under the cgroup folder
 2. sleep for an interval (1 second)
 3. read the content of 'io.stat' again
 4. calculate IO speed from the diff of the results of steps 1 and 3

Create a new view under gp_toolkit:
    gp_toolkit.gp_resgroup_iostats_per_host

Create a new UDF to get 'io.stat' diff on segment:
    gp_toolkit.__gp_resgroup_iostats()
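The four steps above amount to parsing two `io.stat` snapshots and dividing the counter deltas by the interval. A hypothetical Python sketch (the real UDF is implemented in the server; the `io.stat` sample here is abbreviated to the rbytes/wbytes counters):

```python
def parse_io_stat(text):
    """Parse cgroup v2 io.stat content into {device: {counter: int}},
    e.g. '8:0 rbytes=1000 wbytes=2000' -> {'8:0': {'rbytes': 1000, ...}}."""
    stats = {}
    for line in text.strip().splitlines():
        dev, *pairs = line.split()
        stats[dev] = {k: int(v) for k, v in (p.split("=") for p in pairs)}
    return stats

def io_speed(snapshot1, snapshot2, interval=1.0):
    """Per-device rates (counter units per second) from two io.stat
    snapshots taken `interval` seconds apart."""
    before, after = parse_io_stat(snapshot1), parse_io_stat(snapshot2)
    return {dev: {k: (after[dev][k] - v) / interval
                  for k, v in counters.items()}
            for dev, counters in before.items()}
```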
Cgroups, which resource groups depend on, are a Linux kernel
feature.
PORTNAME is defined in Makefile.global, we can only use it after the
definition. But for extensions we also need to define the OBJS before
all the PGXS stuff.

This commit checks the operating system via `uname -s` instead; other
systems like Windows are not covered since the Greenplum server is not
supported on them anyway.
An oversight in d24424a76fc79f2229e93aca63f032082dd47a71. Fixing now.

Also let all installcheck regress tests process the regress init_file which
includes many tweaks that are needed for GPDB.
Move the test to the end of the schedule in order to have a better chance
of catching abnormalities.
When checking orphaned files with a numbered extension, we used to check
exactly whether they are expected in catalogs (e.g. aoseg/aocsseg tables),
and report a file as orphaned as long as it's not in the catalog.
However, this produces false reports: in many cases, like
TRUNCATE, we might have dropped the catalog entry but not yet physically
removed the file. More importantly, the use case for checking orphaned files
is usually a user wanting to make sure no file is left behind when a relation
is dropped. This is different from checking missing files, where usually the
user wants to verify that *all* files expected to be present for a relation
are indeed present.

Therefore, now we only check if the base relfilenode of a relation file is
expected or not. Only if the base relfilenode is unexpected, we would report
the file as orphaned. As a result, gp_check_orphaned_files can report extended
orphaned files as well, so we do not need the slow and error-prone
gp_check_orphaned_files_ext any more.

P.S. for checking missing files we still keep the two views w/wo '_ext'
to offer more flexibility in different scenarios.
Currently when we plan a subquery, we refer to the subquery RTE's
forceDistRandom field in order to understand if it is a gp_dist_random.
If so we will set its locus to CdbLocusType_Strewn. In utility mode,
that could result in motion being created which it couldn't handle
(see #15238).

Therefore, now we don't force the random distribution in utility mode
and let the subquery decide its locus (which might involve
gp_dist_random again, but we'll follow the same logic there).

Fixes #15238
This commit mainly improves the gp_check_orphaned_files view in the sense that,
since ultimately its user is likely going to remove the reported orphaned files,
we would not like that to cause any potential issues due to causes like:

* Main relation files associated with dropped tables are kept until the
  next CHECKPOINT in order to prevent potential issues with crash recovery (see
  comments for mdunlink()). So removing them could cause issues.
* Relation files created during an ongoing transaction could be
  recognized as orphaned: another session would see an old pg_class.relfilenode,
  so it would think the new relfilenode is orphaned. If one removes that file, we
  might have data loss.

So accordingly, the improvements are:
* We should force a CHECKPOINT prior to collecting the orphaned file list.
* We should exclude other activities that might change pg_class.relfilenode
  while running the view. This is done by locking pg_class in SHARE mode (which
  blocks writes but allows read) with "nowait" flag (which allows the lock
  attempt to immediately return so we are not blocked forever). We should also
  check pg_stat_activity to make sure that there is no idle transaction (because
  the idle transaction might've already modified pg_class and released the
  lock). In the new view we do that by simply making sure there's no concurrent
  client sessions.

These steps need to be written in a function, so rewrite the
gp_check_orphaned_files view to SELECT from a new UDF.

Also improve the view results by adding the relative path of each file being
reported, for convenience in taking further action on the files.

For the test, adjusted a few places so that the new changes won't cause flakiness.
We did that when we made gp_toolkit an extension. All objects in the extension
are under 'gp_toolkit' schema, but the extension itself is created under
'public' schema (pg_extension reflects that and there's a pg_depend linkage).
However, conventionally the 'public' schema should not contain anything other than
objects the users themselves created. And in many cases a user would opt to drop the
'public' schema - and they currently have to drop the gp_toolkit extension before doing
that, which is not nice.

So now we do not create the schema within the extension, but do that via the
control file, which implicitly creates the schema and uses it for the extension.
This removes the dependency on 'public' and also conforms to the conventions
better. Because we cannot remove "CREATE SCHEMA gp_toolkit" from gp_toolkit--1.0.sql,
create a new default extension version 1.3 to achieve that.

This fixes it for new GPDB installations but existing GPDB7 installations will
continue to have gp_toolkit registered under the 'public' schema which is a
decision based on both impact (not a bug but rather a nice-to-have feature) and
the state of 7X (not many existing installations). If they re-create the gp_toolkit
extension after upgrading to a version containing this change, everything would
be aligned with new installations.

Restore a regress test case that drops the 'public' schema w/o dropping
anything else.

P.S. we continue to explicitly specify the schema name 'gp_toolkit.XXX' in the
extension script, just for more clarity as many views/functions use other
views/functions internally and need to have the right schema to look at things.
The test was flaky with a diff like:

```
--- /tmp/build/e18b2f02/gpdb_src/gpcontrib/gp_toolkit/expected/gp_toolkit.out	2023-11-07 20:06:36.276652825 +0000
+++ /tmp/build/e18b2f02/gpdb_src/gpcontrib/gp_toolkit/results/gp_toolkit.out	2023-11-07 20:06:36.304655494 +0000
@@ -399,9 +399,11 @@
 -- However it will show there is a difference if segment guc is different
 -- primary_conninfo can be different due to failovers in isolation2 tests
 select * from gp_toolkit.gp_param_settings_seg_value_diffs where psdname != 'primary_conninfo';
- psdname | psdvalue | psdcount
----------+----------+----------
-(0 rows)
+      psdname      |           psdvalue            | psdcount
+-------------------+-------------------------------+----------
+ primary_slot_name |                               |        2
+ primary_slot_name | internal_wal_replication_slot |        1
+(2 rows)

 -- gp_locks_on_relation
```

The reason is that primary_slot_name, just like primary_conninfo, can be
different between primary segments when some primary has previously been a
mirror, so these mirror-specific GUCs persist. We might want to clean them up
properly in the future, but for now just ignore these GUCs in the test.
As the title says, fix a typo: `timestap` to `timestamp`.

Signed-off-by: Yongtao Huang <[email protected]>
Provide a superuser-only UDF gp_toolkit.gp_move_orphaned_files to move orphaned
files found by gp_toolkit.gp_check_orphaned_files into a user-specified location,
where they can be removed or backed up by the user later.

gp_toolkit.gp_move_orphaned_files(target_directory text)

After moving, the file will be renamed to something that indicates its original
location in the data directory. For example, if the file '12345' in the
default tablespace is orphaned on primary segment 2, it will be moved like this:

Original location:
<data_directory>/base/13700/12345

After moving:
<target_directory>/seg2_base_13700_12345

The function will return each moved file's before/after paths, and whether the
move succeeded. E.g.:

postgres=# select * from gp_toolkit.gp_move_orphaned_files('/home/csteam/workspace/gpdb7/files');
 gp_segment_id | move_success |           oldpath          |         newpath
---------------+--------------+----------------------------+-----------------------------------
            -1 | t            | /data_dir/base/13715/99999 | /target_dir/seg-1_base_13715_99999
             1 | t            | /data_dir/base/13715/99999 | /target_dir/seg1_base_13715_99999
             2 | t            | /data_dir/base/13715/99999 | /target_dir/seg2_base_13715_99999
(3 rows)
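The renaming scheme shown above can be sketched as a simple path transformation — a hypothetical helper mirroring the documented naming, not the UDF's actual code:

```python
def moved_file_name(seg_id, relative_path):
    """Map a data-directory-relative path to the flat name used in the
    target directory, e.g. seg 2 + 'base/13700/12345' ->
    'seg2_base_13700_12345'. The coordinator uses segment id -1."""
    return "seg{}_{}".format(seg_id, relative_path.replace("/", "_"))
```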
This commit introduces the following UDFs and view to help users tackle
various partition maintenance tasks.

Helper UDFs:

pg_partition_rank()
   - Rank of a non-default range partition among siblings (position of
     partition in a hypothetical sorted array of siblings)
pg_partition_range_from()
   - Lower range bound of a range partition
pg_partition_range_to()
   - Higher range bound of a range partition
pg_partition_bound_value()
   - Textual representation of the range bounds of a partition
pg_partition_isdefault()
   - Checks if given partition is a default partition
pg_partition_lowest_child()
   - Finds the lowest ranked child of given partition
pg_partition_highest_child()
   - Finds the highest ranked child of a given partition

View:

gp_partitions: Equivalent of legacy view pg_partitions from older major
versions of GPDB, minus some columns (which don't really apply to the
current world view anymore):

partitionname
partitionposition
partitionstartinclusive
partitionendinclusive
partitioneveryclause

Notable differences from 6X for this view:
(1) The root is level = 0, its immediate children are level 1 and so on.
(2) Immediate children of the root have parentpartitiontablename equal
    to that of the root (instead of being null).

Co-authored-by: Soumyadeep Chakraborty <[email protected]>
Reviewed-by: Huansong Fu <[email protected]>
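For instance, pg_partition_rank's documented behavior — the position of a non-default range partition among siblings sorted by lower bound — can be modeled as follows (a hypothetical Python sketch; the real UDF works against catalog data, and the sibling representation here is invented):

```python
def pg_partition_rank(partition_name, siblings):
    """Rank (1-based) of a non-default range partition among its
    siblings, ordered by lower bound. Default partitions (represented
    here with lower bound None) are excluded and have no rank."""
    ranked = sorted((s for s in siblings if s["lower"] is not None),
                    key=lambda s: s["lower"])
    for rank, s in enumerate(ranked, start=1):
        if s["name"] == partition_name:
            return rank
    return None
```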
… gp_resgroup_status_per_host (#15688)

change "rsgname" to "groupname" on the view of gp_resgroup_status
and gp_resgroup_status_per_host, make them consistent.
@jiaqizho jiaqizho changed the title [DNM]Cherry-pick Resgroup V2 and toolkit from GPDB Cherry-pick Resgroup V2 and toolkit from GPDB Aug 14, 2024
@jiaqizho jiaqizho requested a review from my-ship-it August 14, 2024 09:27
Re-enable the resgroup v1 and v2 test cases. Also fix a
`hash_create` in `cgroup_io_limit.c`
- the view `gp_stat_activity` is missing; temporarily add it into system_view.sql
- allow non-root partitions to get a tupledesc the same as GP's
- change some of the outputs in ICW tests
…iled

In the test case `local_directory_table_mixed`, we created the tablespace
`directory_tblspc`, which depends on the local dir `@testtablespace@`.
But at the end of the test we remove the local dir `@testtablespace@`,
which causes an error when replaying the WAL log on the standby.

Also, we should not use `@testtablespace@` itself, but a subdirectory
under it.
@my-ship-it my-ship-it merged commit 1542162 into apache:main Aug 19, 2024
10 of 11 checks passed
Labels
cherry-pick — cherry-pick upstream commits