[OKD SCOS 4.17] Only 4 CPU cores are recognized, no matter the actual number of cores of the Node #2119
This problem seems related to hyperthreading. Can you start a terminal on your SNO node (using
Additionally, I noticed that you posted your initial
Hello @melledouwsma and thank you for your reply. I executed the commands and this is the output:
I have recreated the cluster, so the credentials do not matter any more. Thanks for pointing it out :-)
I also ran into this issue with 4.17. The workaround we've just now implemented is to deploy 2 additional MachineConfig objects, one per role (`$role` = `master` and `worker`). When they are deployed, OpenShift will cordon/drain and reboot the nodes; when they come back online, SMT is enabled.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: $role
  name: 99-$role-enable-hyperthreading
spec:
  kernelArguments:
    - mitigations=auto
```

This results in `/proc/cmdline` having `mitigations=auto` at the end, which overrides the earlier `mitigations=auto,nosmt`. I believe this issue stems from a change in CoreOS from `mitigations=auto` to `mitigations=auto,nosmt`; see for example this discussion: coreos/fedora-coreos-tracker#181. I don't know exactly when that CoreOS change happened or when it trickled into OpenShift/OKD.

I think another consequence of this is that the minimum vCPU requirement for control plane nodes doubled with OKD 4.17. I've installed OKD 4.15 on AWS successfully with both t3.xlarge and m6i.xlarge control plane nodes (both 4 vCPU and 16 GB RAM), but with the OKD 4.17 version of openshift-install the installation kept failing until I increased the control plane EC2 instance type to c6i.2xlarge (8 vCPU and 16 GB RAM).

I've reported this issue to Red Hat as well, because as of right now the `hyperthreading: Enabled` setting in the openshift-install install-config.yaml doesn't seem to have any effect: the resulting machines run with `mitigations=auto,nosmt` instead of `mitigations=auto`. Common Linux distros like RHEL (and its derivatives) and Amazon Linux all run with `mitigations=auto` by default, but CoreOS decided to disable SMT by default.
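The override works because the kernel honors the last occurrence of a repeated boot parameter. A quick way to see which value wins, shown here against a hypothetical cmdline string (on a real node, read `/proc/cmdline` instead, e.g. via `oc debug node/<node>`):

```shell
# Hypothetical kernel command line; on a node, use: cat /proc/cmdline
cmdline="BOOT_IMAGE=(hd0,gpt3)/ostree/... mitigations=auto,nosmt root=UUID=... mitigations=auto"

# The kernel applies the LAST occurrence of a repeated parameter,
# so the final mitigations= entry is the effective one:
echo "$cmdline" | tr ' ' '\n' | grep '^mitigations=' | tail -n1
# → mitigations=auto
```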
@bergner thank you very much for your reply.
I have a bastion host on AWS, and I did the following to prepare for the creation of a new SNO OKD v4.17 SCOS node:
I used this install-config.yaml file (the "t3a.2xlarge" EC2 instance type has 8 vCPUs and 32 GB of memory):
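For reference, the SMT-related setting in install-config.yaml is the `hyperthreading` field on each machine pool. A minimal illustrative fragment (field names are from the OpenShift installer configuration schema; the rest of the config is omitted here):

```yaml
controlPlane:
  name: master
  hyperthreading: Enabled   # the default; intended to keep SMT on
compute:
- name: worker
  hyperthreading: Enabled
```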
Then, I created the SNO OKD node:
Result output:
So far, everything seems good.
The Bug:
When I log in to the OKD web console, I see in the Cluster Utilization section that only 4 CPU cores are available, instead of 8. I tried the same process with the "m5.2xlarge" EC2 instance type; still, only 4 cores are recognized instead of 8.
This is indeed a problem, because I also tested the same install configuration and EC2 type with an OpenShift installation, and OpenShift recognizes all 8 CPU cores.
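The exact count of 4 is consistent with SMT being disabled by a `nosmt` kernel argument: these instance types expose 8 vCPUs as 4 physical cores with 2 threads each, so with SMT off only one thread per core is online. A sanity check of the arithmetic (the topology values below are assumptions based on AWS's published instance specs):

```shell
# t3a.2xlarge / m5.2xlarge topology (per AWS specs): 4 physical cores, 2 threads each
physical_cores=4
threads_per_core=2
echo "SMT off: $((physical_cores)) CPUs online"
echo "SMT on:  $((physical_cores * threads_per_core)) CPUs online"
# → SMT off: 4 CPUs online
# → SMT on:  8 CPUs online
```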
Could you suggest how to fix this issue?