Cherry-pick Resgroup V2 and toolkit from GPDB #531
Conversation
@tuhaihe why is this CI not running?
After we changed the semantics of resource groups in #14562, it is time to add the Linux Cgroup v2 implementation. In the Linux kernel there is no difference between Cgroup v1 and Cgroup v2 in the internal implementation; the only change is the pseudo-file interface. Thanks to #14343, we just need to add a new file and implement a new `cgroupOpsRoutine`, which can be found at `get_group_routine_v2()`. The implementation lives in `src/backend/utils/resgroup/cgroup-ops-linux-v2.c`.

Initially we assumed that developers would have different Cgroup mount directories, so we generated test files dynamically. That turned out to be over-design, so some test files were moved from `input/` to `sql/`, which is more convenient for developers.

Cgroup v2 uses the "unified hierarchy": the pseudo-files of all controllers live in a single directory, instead of per-controller directories such as "cpu,cpuacct", "memory", etc. This is the biggest difference between the two versions, and it directly determines how we configure them. Here is an example:

```
# Cgroup v1: one directory per controller
smart@stable:/sys/fs/cgroup$ ls
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory
net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd  unified

# Cgroup v2: a single unified directory
[root@localhost gpdb]# ls
cgroup.controllers  cgroup.max.descendants  cgroup.type  cpu.stat  cpuset.cpus
io.max  pids.current  cgroup.events  cgroup.procs  cpu.idle  cpu.uclamp.max
cpuset.cpus.effective  io.pressure  pids.events  cgroup.freeze  cgroup.stat
cpu.max  cpu.uclamp.min  cpuset.cpus.partition  io.prio.class  pids.max
cgroup.kill  cgroup.subtree_control  cpu.max.burst  cpu.weight  cpuset.mems
io.stat  cgroup.max.depth  cgroup.threads  cpu.pressure  cpu.weight.nice
cpuset.mems.effective  io.weight
```

The Cgroup v2 documentation can be found at:
- [Control Group v2](https://docs.kernel.org/admin-guide/cgroup-v2.html)
- [Using cgroups-v2 to control distribution of CPU time for applications](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel)

For now, to enable Cgroup v2, configure the Greenplum database with:
```
gpconfig -c gp_resource_manager -v group-v2
```
and to use Cgroup v1, just use:
```
gpconfig -c gp_resource_manager -v group
```
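As an aside, the "unified hierarchy" layout shown above also gives a simple way to tell which cgroup version a host is running: on v2, the mount root itself contains the `cgroup.controllers` pseudo-file. A minimal Python sketch (the helper is illustrative, not GPDB code; the mount path is a parameter rather than a hard-coded `/sys/fs/cgroup`):

```python
import os

def detect_cgroup_version(mount_root):
    """Guess the cgroup hierarchy version from the layout of the mount root.

    On a v2 (unified) hierarchy the mount root itself contains the
    cgroup.controllers pseudo-file; on v1 it does not.
    """
    if os.path.exists(os.path.join(mount_root, "cgroup.controllers")):
        return 2
    return 1
```
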
1. `pg_resgroup.dat` already defines the reserved resource group OIDs, so there is no need to redefine them.
2. `DEFAULTAUXILIARY_OID` was renamed to `SYSTEMRESGROUP_OID`.
We need to bypass queries that touch only `catalog` tables in resource groups in some situations, e.g. database GUI clients (such as DBeaver) run a number of catalog queries to fetch metadata. The code simply walks the parse tree, finds each `RangeVar`, and compares its schemaname with `pg_catalog`; only a few cases are considered for now. Because every query is walked once resource groups are enabled, the overhead should be as small as possible. The following catalog query will be bypassed:

```sql
SELECT n.oid, n.*, d.description
FROM pg_catalog.pg_namespace n
LEFT OUTER JOIN pg_catalog.pg_description d
  ON d.objoid = n.oid AND d.objsubid = 0 AND d.classoid = 'pg_namespace'::regclass
WHERE nspname = 'public'
ORDER BY nspname;
```

But queries without any `RangeVar`s will be ignored, such as:

```sql
SELECT 1;
SELECT some_udf();
```
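The decision rule described above can be sketched in a few lines. This is a hypothetical model, not the actual GPDB parse-tree walker: assume the walk has already collected every `RangeVar` as a `(schemaname, relname)` pair.

```python
def is_pure_catalog_query(range_vars):
    """Decide whether a query may bypass resource groups.

    range_vars: list of (schemaname, relname) tuples collected from the
    parse tree. A query with no RangeVars at all (SELECT 1,
    SELECT some_udf()) is ignored, i.e. not bypassed; otherwise the
    query is bypassed only when every RangeVar is in pg_catalog.
    """
    if not range_vars:
        return False
    return all(schema == "pg_catalog" for schema, _ in range_vars)
```
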
…290) In resource group mode, if the user requests more than statement_mem, grant that; otherwise set statement_mem to the query's memory limit.
Since commit c25db0c supports resgroup memory_limit, add it to the gp_resgroup_config view.
1. Add views to get "existing" relation files in the database, including the default, global, and user tablespaces. Note that this won't expose files outside of the data directory, as we only use pg_ls_dir to get the file list, which cannot reach any files outside the data directories (including the user tablespace directory).
2. Add views to get "expected" relation files in the database, using knowledge from the catalog.
3. Using 1 and 2, construct views to get the missing files (i.e. files that are expected but do not exist) and orphaned files (i.e. files that exist unexpectedly).
4. Create views to run the above views in MPP. We also support checking extended data files for AO/CO tables.
5. Add regress tests.

To use:
```
-- checking non-extended files
select * from gp_toolkit.gp_check_missing_files;
select * from gp_toolkit.gp_check_orphaned_files;

-- checking all data files including the extended data files
-- (e.g. 12345.1, 99999.2). These do not count supporting files
-- such as .fsm, .vm etc. And currently we only support checking
-- extended data files for AO/CO tables, not heap.
select * from gp_toolkit.gp_check_missing_files_ext;
select * from gp_toolkit.gp_check_orphaned_files_ext;
```

Note:
* As mentioned, we currently support checking all the non-extended data files and the extended data files of AO/CO tables. The main reason to separate the two is performance: constructing the expected file list for AO/CO segments runs dynamic SQL on each aoseg/aocsseg table and could be slow, so we only do that when really required.
* For heap tables, we currently have no way to get the expected number of data files for a given table: we cannot use pg_relation_size because that in turn depends on the number of data files itself. So we always skip their extended files for now.
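At its core, steps 1–3 amount to a set difference between the "expected" and "existing" file lists. A minimal sketch of that comparison (the helper and file names are illustrative, not the actual view definitions):

```python
def check_files(expected, existing):
    """Compare expected vs. existing relation file names.

    Returns (missing, orphaned): files that are expected but absent,
    and files that are present but unexpected.
    """
    expected, existing = set(expected), set(existing)
    missing = sorted(expected - existing)
    orphaned = sorted(existing - expected)
    return missing, orphaned
```
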
In the case of insert-select, the value column contained information about the coordinator only. The error was fixed by setting the proexeclocation property of the function to 'i' (PROEXECLOCATION_INITPLAN). https://github.com/greenplum-db/gpdb/issues/14154 Execution of the query "SELECT * FROM pg_resgroup_get_status_kv(NULL);" resulted in a segmentation fault; a NULL check has been added. The dumpResGroupInfo() function was called twice per query; now it is called only once. Also simplified the function code.
This commit introduces a new function and 2 new views to gp_toolkit that summarize column size and compression ratio for column-oriented tables.

- gp_toolkit.get_column_size(oid): gathers column size and compression ratio for a given AOCO table.
- gp_toolkit.gp_column_size: gathers column size and compression ratio for column-oriented tables from all segments.
- gp_toolkit.gp_column_size_summary: summary view of gp_column_size; aggregates column size and compression ratio for column-oriented tables across all segments.

Authored-by: Brent Doil <[email protected]>
…tatus() function (#14959) In the case of insert-select, the cpu_usage field contains information about the master only. This leads to an error when analyzing the table, as the contents of the field do not correspond to JSON format. The error was fixed by setting the proexeclocation property of the function to 'i' (PROEXECLOCATION_INITPLAN). https://github.com/greenplum-db/gpdb/issues/14154
Add memory_usage to the pg_resgroup_get_status() function result.

```
postgres=# select * from gp_toolkit.gp_resgroup_status_per_host;
    rsgname    | groupid | hostname | cpu_usage | memory_usage
---------------+---------+----------+-----------+--------------
 admin_group   |    6438 | zero     |      0.09 |         5.55
 default_group |    6437 | zero     |      0.00 |         0.00
 system_group  |    6441 | zero     |      0.01 |        40.55
(3 rows)
```

It returns memory_usage in megabytes, to help customers diagnose memory problems. For Linux Cgroup v1, the memory usage is read from `memory.usage_in_bytes`; for Linux Cgroup v2, it is read from `memory.current`.
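The read path can be sketched as follows. The pseudo-file names (`memory.usage_in_bytes`, `memory.current`) are the real cgroup interface files; the helper itself and its parameters are illustrative, assuming the per-group cgroup directory is passed in:

```python
import os

def read_group_memory_mb(group_dir, cgroup_version):
    """Read a resource group's memory usage from its cgroup directory
    and convert bytes to megabytes.

    Cgroup v1 exposes memory.usage_in_bytes; v2 exposes memory.current.
    Both contain a single integer byte count.
    """
    fname = "memory.usage_in_bytes" if cgroup_version == 1 else "memory.current"
    with open(os.path.join(group_dir, fname)) as f:
        usage_bytes = int(f.read().strip())
    return usage_bytes / (1024 * 1024)
```
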
Greenplum has had auto_stats for a long time; please refer to https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/bAyw2KBP6yE/m/hmoWikrPAgAJ for the details and decisions around auto_stats. auto_stats() is now invoked at the following 7 places:
1. ProcessQuery()
2. _SPI_pquery()
3. postquel_end()
4. ATExecExpandTableCTAS()
5. ATExecSetDistributedBy()
6. DoCopy()
7. ExecCreateTableAs()

Previously, places 2 and 3 were hard-coded as inside-function, and places 1 and 4~7 were hard-coded as not-inside-function. Places 4~7 do not cover the case where COPY or CTAS is called inside a procedural language. Since auto_stats will be removed in the future, for now let's just do a simple fix instead of a big refactor. To correctly pass the inFunction parameter to auto_stats() at places 4~7, we introduce executor_run_nesting_level to mark whether the program is already under ExecutorRun(). Places 4~7 are taken directly as utility statements and will not call ExecutorRun() unless they are inside a procedural language. This technique is similar to the `auto_explain` extension. For places 1~3, the context is clear and we do not need to check executor_run_nesting_level.

NOTE: Greenplum now supports creating procedures and invoking them as `call proc()` instead of `select proc()`. To handle this case, we also keep a static module variable to track the nesting level of standard_ProcessUtility; if the nesting level >= 2, it means we are inside a procedure call.
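The nesting-level trick can be modeled in a few lines. This is a hypothetical Python stand-in for the C mechanism, not the actual GPDB code: a module counter is incremented on entry to ExecutorRun and decremented on exit, so any code reached while the counter is positive knows it runs "inside" a query (and hence inside a function or procedural-language call).

```python
# Module-level counter mirroring executor_run_nesting_level.
executor_run_nesting_level = 0

def executor_run(body):
    """Stand-in for ExecutorRun: track nesting around the query body."""
    global executor_run_nesting_level
    executor_run_nesting_level += 1
    try:
        return body()
    finally:
        executor_run_nesting_level -= 1

def in_function():
    """What COPY/CTAS-style utility code would pass as inFunction:
    true only when we are already under an ExecutorRun call."""
    return executor_run_nesting_level > 0
```
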
When gp_resource_group_bypass_catalog_query is on, unassign pure-catalog queries from their resource group and bypass them. We traverse the flattened rangetable entries to decide whether a query is a pure-catalog query. In addition, it is hard to see whether a query contains a UDF, and traversing the whole plan is expensive, so UDFs are ignored even though the SQL commands inside a UDF may contain non-catalog queries. For the special case `select UDF;`, the query won't be bypassed. We do not need to consider the SQL commands inside the UDF: they will also be bypassed, or will use the same resgroup as the outer query.
* Revise some docs about resource groups
* Fix typo
* Address review comments
* Update workload_mgmt_resgroups.html.md: a few corrections
* Update cgroup v2 doc
* Update workload_mgmt_resgroups.html.md: formatting
* Update workload_mgmt_resgroups.html.md

Co-authored-by: mperezfuster <[email protected]>
Without the fix, the view gp_toolkit.gp_workfile_entries will display the same pid of a query across all segments. This change needed a version bump for gp_toolkit.
* Add a cgroup disk I/O monitor function, similar to 'systemd-cgtop'.

Implementation:
1. Read the content of 'io.stat' under the cgroup folder.
2. Sleep for an interval (1 second).
3. Read the content of 'io.stat' again.
4. Calculate the I/O speed from the diff between the results of steps 1 and 3.

Create a new view under gp_toolkit: gp_toolkit.gp_resgroup_iostats_per_host
Create a new UDF to get the 'io.stat' diff on a segment: gp_toolkit.__gp_resgroup_iostats()
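The four steps boil down to sampling the counters twice and dividing the deltas by the interval. A minimal Python sketch, assuming cgroup v2's `io.stat` line format (`MAJ:MIN rbytes=... wbytes=... rios=... wios=...`); the helpers are illustrative, not the UDF implementation:

```python
def parse_io_stat(text):
    """Parse cgroup v2 io.stat content into {device: {counter: value}}."""
    stats = {}
    for line in text.strip().splitlines():
        dev, *fields = line.split()
        stats[dev] = {k: int(v) for k, v in (f.split("=") for f in fields)}
    return stats

def io_speed(before, after, interval_s):
    """Per-device counter deltas divided by the sampling interval."""
    speed = {}
    for dev, counters in after.items():
        prev = before.get(dev, {})
        speed[dev] = {k: (v - prev.get(k, 0)) / interval_s
                      for k, v in counters.items()}
    return speed
```
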
Cgroups, which resource groups depend on, are a Linux kernel feature.
PORTNAME is defined in Makefile.global, so we can only use it after that definition; but for extensions we also need to define OBJS before all the PGXS stuff. This commit checks the operating system via `uname -s` instead. Other systems like Windows are not covered, since the Greenplum server is not supported on them anyway.
An oversight in d24424a76fc79f2229e93aca63f032082dd47a71. Fixing now. Also let all installcheck regress tests process the regress init_file which includes many tweaks that are needed for GPDB.
Move the test to the end of the schedule in order to have a better chance of catching anomalies.
When checking orphaned files with a numbered extension, we used to check exactly whether they are expected in the catalogs (e.g. aoseg/aocsseg tables), and report a file as orphaned as long as it is not in the catalog. However, this is prone to false reporting: in many cases, like TRUNCATE, we might have dropped the catalog entry but not yet physically removed the file. More importantly, the use case for orphaned-file checking is usually a user who wants to make sure no file is left behind when a relation is dropped. This is different from checking missing files, where the user usually wants to verify that *all* files expected to be present for a relation are indeed present. Therefore, we now only check whether the base relfilenode of a relation file is expected; only if the base relfilenode is unexpected do we report the file as orphaned. As a result, gp_check_orphaned_files can report extended orphaned files as well, so we no longer need the slow and error-prone gp_check_orphaned_files_ext. P.S. for checking missing files we still keep the two views w/wo '_ext' to offer more flexibility in different scenarios.
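The new rule can be sketched as: strip any numbered extension from the file name and test only the base relfilenode against the expected set. The helper names below are illustrative, not the view's actual SQL:

```python
def base_relfilenode(filename):
    """'12345.1' -> '12345'; a plain '12345' is returned unchanged."""
    base, _, _ext = filename.partition(".")
    return base

def is_orphaned(filename, expected_relfilenodes):
    """Report a file as orphaned only when its *base* relfilenode is
    unexpected; extension files of a known relation are never reported."""
    return base_relfilenode(filename) not in expected_relfilenodes
```
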
Currently, when we plan a subquery, we refer to the subquery RTE's forceDistRandom field to understand whether it is a gp_dist_random; if so, we set its locus to CdbLocusType_Strewn. In utility mode, that could result in a motion being created that utility mode cannot handle (see #15238). Therefore, we now do not force the random distribution in utility mode and let the subquery decide its locus (which might involve gp_dist_random again, but we follow the same logic there). Fixes #15238.
This commit mainly improves the gp_check_orphaned_files view: since its user is ultimately likely to remove the reported orphaned files, we would not like that removal to cause any potential issues such as:

* Main relation files associated with dropped tables are kept until the next CHECKPOINT in order to prevent potential issues with crash recovery (see the comments for mdunlink()), so removing them would cause problems.
* Relation files created during an ongoing transaction could be recognized as orphaned: another session sees the old pg_class.relfilenode, so it thinks the new relfilenode is orphaned. If one removes that file, we might have data loss.

Accordingly, the improvements are:

* Force a CHECKPOINT prior to collecting the orphaned file list.
* Exclude other activities that might change pg_class.relfilenode while running the view. This is done by locking pg_class in SHARE mode (which blocks writes but allows reads) with the "nowait" flag (so the lock attempt returns immediately instead of blocking forever). We should also check pg_stat_activity to make sure there is no idle transaction (because an idle transaction might have already modified pg_class and released the lock). In the new view we do that by simply making sure there are no concurrent client sessions.

These steps need to be written in a function, so we rewrite the gp_check_orphaned_files view to SELECT from a new UDF. Also improve the view results by adding the relative path of each reported file, for convenience when taking further action on the files. For the test, adjusted a few places so that the new changes won't cause flakiness.
We did that when we made gp_toolkit an extension. All objects in the extension are under the 'gp_toolkit' schema, but the extension itself was created under the 'public' schema (pg_extension reflects that, and there is a pg_depend linkage). However, conventionally the 'public' schema should not contain anything other than what users themselves created, and in many cases a user would opt to drop the 'public' schema — they would then have to drop the gp_toolkit extension first, which is not nice. So now we do not create the schema within the extension; instead we do it via the control file, which implicitly creates the schema and uses it for the extension. This removes the dependency on 'public' and also conforms better to convention. Because we cannot remove "CREATE SCHEMA gp_toolkit" from gp_toolkit--1.0.sql, create a new default extension version 1.3 to achieve that. This fixes it for new GPDB installations, but existing GPDB7 installations will continue to have gp_toolkit registered under the 'public' schema — a decision based on both the impact (not a bug, but rather a nice-to-have feature) and the state of 7X (not many existing installations). If they re-create the gp_toolkit extension after upgrading to a version containing this change, everything will be aligned with new installations. Restore a regress test case that drops the 'public' schema without dropping anything else. P.S. we continue to explicitly specify the schema name 'gp_toolkit.XXX' in the extension script, just for clarity, as many views/functions use other views/functions internally and need to look at the right schema.
The test was flaky with a diff like:
```
--- /tmp/build/e18b2f02/gpdb_src/gpcontrib/gp_toolkit/expected/gp_toolkit.out 2023-11-07 20:06:36.276652825 +0000
+++ /tmp/build/e18b2f02/gpdb_src/gpcontrib/gp_toolkit/results/gp_toolkit.out 2023-11-07 20:06:36.304655494 +0000
@@ -399,9 +399,11 @@
 -- However it will show there is a difference if segment guc is different
 -- primary_conninfo can be different due to failovers in isolation2 tests
 select * from gp_toolkit.gp_param_settings_seg_value_diffs where psdname != 'primary_conninfo';
- psdname | psdvalue | psdcount
----------+----------+----------
-(0 rows)
+      psdname      |           psdvalue            | psdcount
+-------------------+-------------------------------+----------
+ primary_slot_name |                               |        2
+ primary_slot_name | internal_wal_replication_slot |        1
+(2 rows)
 -- gp_locks_on_relation
```
The reason is that primary_slot_name, just like primary_conninfo, can differ between primary segments when some primary has previously been a mirror, so these mirror-specific GUCs persist. We might want to clean them up properly in the future, but for now just ignore these GUCs in the test.
As the title said, fix a typo `timestap` to `timestamp`. Signed-off-by: Yongtao Huang <[email protected]>
Provide a superuser-only UDF gp_toolkit.gp_move_orphaned_files to move orphaned files found by gp_toolkit.gp_check_orphaned_files into a user-specified location, which the user can later remove or back up:

```
gp_toolkit.gp_move_orphaned_files(target_directory text)
```

After moving, each file is renamed to something that indicates its original location in the data directory. For example, if the file '12345' in the default tablespace is orphaned on primary segment 2, it is moved like this:

```
Original location: <data_directory>/base/13700/12345
After moving:      <target_directory>/seg2_base_13700_12345
```

The function returns each moved file's before/after paths and whether the move succeeded. E.g.:

```
postgres=# select * from gp_toolkit.gp_move_orphaned_files('/home/csteam/workspace/gpdb7/files');
 gp_segment_id | move_success |          oldpath           |              newpath
---------------+--------------+----------------------------+------------------------------------
            -1 | t            | /data_dir/base/13715/99999 | /target_dir/seg-1_base_13715_99999
             1 | t            | /data_dir/base/13715/99999 | /target_dir/seg1_base_13715_99999
             2 | t            | /data_dir/base/13715/99999 | /target_dir/seg2_base_13715_99999
(3 rows)
```
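The renaming scheme amounts to flattening the file's path relative to the data directory into underscores and prefixing the segment id. A hypothetical sketch of that mapping (the helper is illustrative, not the UDF's code):

```python
def orphaned_file_new_name(segment_id, rel_path):
    """Map a path relative to the data directory, e.g. 'base/13700/12345'
    on segment 2, to the moved name 'seg2_base_13700_12345'."""
    return "seg{}_{}".format(segment_id, rel_path.replace("/", "_"))
```
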
This commit introduces the following UDFs and view to help users tackle various partition maintenance tasks.

Helper UDFs:
- pg_partition_rank(): rank of a non-default range partition among its siblings (position of the partition in a hypothetical sorted array of siblings)
- pg_partition_range_from(): lower range bound of a range partition
- pg_partition_range_to(): higher range bound of a range partition
- pg_partition_bound_value(): textual representation of the range bounds of a partition
- pg_partition_isdefault(): checks whether a given partition is a default partition
- pg_partition_lowest_child(): finds the lowest-ranked child of a given partition
- pg_partition_highest_child(): finds the highest-ranked child of a given partition

View:
- gp_partitions: equivalent of the legacy view pg_partitions from older major versions of GPDB, minus some columns which don't really apply to the current world view anymore: partitionname, partitionposition, partitionstartinclusive, partitionendinclusive, partitioneveryclause.

Notable differences from 6X for this view:
(1) The root is level = 0, its immediate children are level 1, and so on.
(2) Immediate children of the root have parentpartitiontablename equal to that of the root (instead of being null).

Co-authored-by: Soumyadeep Chakraborty <[email protected]>
Reviewed-by: Huansong Fu <[email protected]>
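To illustrate what pg_partition_rank() computes, here is a hypothetical sketch: sort the non-default siblings by their lower bound and report the 1-based position. The data shape is an assumption made for the example, not the catalog representation:

```python
def partition_rank(partitions, name):
    """partitions: list of (name, lower_bound, is_default) tuples.

    Rank a non-default range partition among its siblings by sorting
    the non-default ones on their lower bound; default partitions
    (and unknown names) have no rank.
    """
    ranked = sorted((p for p in partitions if not p[2]), key=lambda p: p[1])
    for i, p in enumerate(ranked, start=1):
        if p[0] == name:
            return i
    return None
```
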
… gp_resgroup_status_per_host (#15688) Rename "rsgname" to "groupname" in the gp_resgroup_status and gp_resgroup_status_per_host views, to make them consistent.
Re-enable the resgroup v1 and v2 test cases. Also fix a `hash_create` call in `cgroup_io_limit.c`.
- the view `gp_stat_activity` is missing; temporarily add it into system_view.sql
- allow non-root partitions to get a tupledesc that is the same as in GP
- change some of the outputs in ICW tests
…iled In the test case `local_directory_table_mixed`, we created the tablespace `directory_tblspc`, which depends on the local dir `@testtablespace@`. But at the end of the test we remove the local dir `@testtablespace@`, which causes an error when replaying the WAL log on the standby. Also, we should not use `@testtablespace@` directly, but a subdirectory under it.