We have inconsistent sets of options passed to the playbooks during our
CI runs.
Don't run ansible-playbook directly, instead factorize the execution in
a bash function using all the common flags.
Also remove various ENABLE_* variables and instead directly test for the
relevant conditions at execution time, as this makes it more obvious and
does not force one to go back and forth in the script.
With CentOS, kubespray currently produces the following warning:
[WARNING]: TASK: bootstrap-os : Enable Oracle Linux repo: The loop variable
'item' is already in use. You should set the `loop_var` value in the
`loop_control` option for the task to something else to avoid variable
collisions and unexpected behavior.
This could bites us in nasty ways, so fix it.
c58497cde (Refactor bootstrap-os (#10983), 2024-03-27) refactored the
boostrap-os include but didn't adapt the amazon linux tasks to the
actual ID of amazon linux ('amzn')
Re-enable the CI so we can avoid that kind of breakage.
if node.projectcalico.org already existe patch node to set asNumber
instead of apply resource to prevent remove of existing fields feed by
calico-node pods
✅ Closes: 11096
When using the kube_node_instancers_with_disks* variables, there were
no configuration block using those to provision disks with the
VirtualBox provider.
This commit fixes it.
Some packages requirements depends on inventory variables
(`kube_proxy_mode` in that case but it could apply to others).
As the case seems pretty rare, instead of adding complexity to pkgs, we
add an escape hatch to use jinja conditions.
That should be revisited if we find ourselves shoehorning lots of logic
in this later on.
Uses the logic introduced in the previous patch to convert all
kubernetes/preinstall/vars/* os specific files to the `pkgs`
dictionary.
Some niceties for devs:
- always validate the `pkgs` variable to catch mistakes in CI.
- ensure that `pkgs` is always sorted. This makes it easier to find the
packages you're looking for.
Adds infrastructure to install OS packages depending not only on OS
(family, versions, etc) but on groups.
All the informations related to a particular package should reside in
the `pkgs` dictionnary, which takes inspiration from the `downloads`
dictionary structure.
Since the structure we're setting in place for installing packages has
some complexity, add a JSON schema to avoid frustrating errors when
modifying the informations (adding/removing packages install).
This reverts commit 4b0a134bc9.
The mentionned PR break scale.yml. This goes back to the status quo
until a proper fix can be provided, at which point we'll reapply the
PR.
Upgrade Snapshot controller installed for all supported Kubernetes
versions to v7.0.2. Also update the manifests used to deploy the
Snapshot controller.
* feat: add user facing variable with default
* feat: remove rolebinding to anonymous users after init and upgrade
* feat: use file discovery for secondary control plane nodes
* feat: use file discovery for nodes
* fix: do not fail if rolebinding does not exist
* docs: add warning about kube_api_anonymous_auth
* style: improve readability of delegate_to parameter
* refactor: rename discovery kubeconfig file
* test: enable new variable in hardening and upgrade test cases
* docs: add option to config parameters
* test: multiple instances and upgrade
* Move fedora ansible python install to bootstrap-os
* /bin/dir is set in bootstrap-os
* Removing ansible_os_family workarounds
Support for these distributions was merged in Ansible, no need to
override it ourselves now.
https://github.com/ansible/ansible/pull/69324 openEuler
https://github.com/ansible/ansible/pull/77275/ UnionTech OS Server 20
https://github.com/ansible/ansible/pull/78232/ Kylin
* Don't unconditionnaly set VARIANT_ID=coreos in os-release
WTF, this is so wrong.
Furthermore, is_fedora_coreos is already handled in boostrap-os
* Handle Clearlinux generically
Followup of 4eec302e86 (since we're using
package module anyway, let's get rid of the custom task)
* Remove leftover files for Coreos
Coreos was replaced by flatcar in 058438a25 but the file was copied
instead of moved.
* Remove workarounds for resolved ansible issues
* boostrap: Use first_found to include per distro
Using directly ID and VARIANT_ID with first_found allow for less manual
includes.
Distro "families" are simply handled by symlinks.
* boostrap: don't set ansible_python_interpreter
- Allows users to override the chosen python_interpreter with group_vars
easily (group_vars have lesser precedence than facts)
- Allows us to use vars at the task scope to use a virtual env
Ansible python discovery has improved, so those workarounds should not
be necessary anymore.
Special workaround for Flatcar, due to upstream ansible not willing to
support it.
* upgrade ansible version
Needed for with_first_found to work correctly:
https://github.com/ansible/ansible/issues/70772 fixed in 2.16
* Remove unused google cloud cloud_playbook
* Fix dpkg_selection on non-existing packages
Needed since ansible-core>2.16, see:
f10d11bcdc
* scripts: ignore download_hash download failures
Binary names on github releases often change and this script might break
because of that, this commit allow to ignore these failures as a mean to
be able to run the script anyway.
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* scripts: use sha256sums for crio as well
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* scripts: add ppc64le support for crio
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
The new version brings the following improvements:
- remove having to resort to python python to limit tags (it it slower than
the sh equivalent as python has a somewhat significant startup time).
- Introduce a concept of min version so that it can only get Kubernetes
version supported by Kubespray.
- Fix an issue with kata changing their file scheme (the arch
specifically)
- Now download sha256/sha256sum files if provided rather than
downloading the full file and computing the hash
- A few minor style tweaks
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr.fr>
Fixes bug for retrieving images with tags containing image digests.
Script now gets images from jobs and cronjobs as well.
New env variable DESTINATION_REGISTRY to push to another registry
instead of local registry.
New env variable IMAGES_FROM_FILE to pull images listed in a file
instead of getting images from a running k8s environment.
New env variable REGISTRY_PORT to override port (default is 5000).
Under the original code, leader election failed for ingress controllers
as a result of mismatch between election-id in the controller config,
and the resourceName in the relevant rule of role 'ingress-nginx'.
This appeared in the controller logs.
To fix the issue, a command-line option was added to container
execution (--election-id=...).
Now, the election-id agrees with the resourceName provided in
the role-ingress-nginx.yml file. A comment in that file was
changed to reflect the new logic.
Co-authored-by: Vasilis Samoladas <vsam@softnet.tuc.gr>
Co-authored-by: Mohamed Omar Zaian <mohamedzaian@gmail.com>
This should avoid permissions problems when the user creating the
directory and the user creating the content are different (when
containers images are saved by root for instances, because the user
can't use the container runtime).
* Refactor of kubeadm images listing
Instead of setting multiples facts, we directly create the dict we need from
kubeadm output.
* Remove useless 'default' filters in roles/download
* Only download kubeadm images where needed
The current state waiting method is bad to implement.
When changing the deployment version, which is execute with the upgrade_cluster in the previous ansible task: "Kubernetes Apps | Install and configure MetalLB", next ansible task: "Kubernetes Apps | Wait for MetalLB controller to be running" may fall with an error.
* [kubernetes] Make kubernetes 1.29.1 default
* [cri-o]: support cri-o 1.29
Use "crio status" instead of "crio-status" for cri-o >=1.29.0
* Remove GAed feature gates SecCompDefault
The SecCompDefault feature gate was removed since k8s 1.29
https://github.com/kubernetes/kubernetes/pull/121246
* Update upgrades.md
modify env serial to have real rolling upgrades
* Update upgrades.md
change section for serial
* Update docs/upgrades.md
Co-authored-by: Kundan Kumar <kundan.kumar@india.nec.com>
---------
Co-authored-by: Kundan Kumar <kundan.kumar@india.nec.com>
Handle all old dns domains:
- for nodelocaldns: in the same server block as the current dns_domain
- for coredns: uffix rewrite of each of the old dns domains to the
current one
The old version of the script downloaded all binaries and generated file checksums locally.
This was a slow process since all binaries of all architectures needed to be downloaded.
The new version simply downloads the .sha256 files containing the binary checksum in text
form which saves a lot of traffic and time.
* Enable configuring mountOptions, reclaimPolicy and volumeBindingMode for cinder-csi StorageClasses
* Check if class.mount_options is defined at all, before generating the option list
In the case were vagrant is not invoked directly from the repository,
but from another location, and the Vagrantfile is "included" into
another, we need to be able to specify where the location of the vagrant
directory is, as of now it's hardcoded relative to the Vagrantfile
location. This commit fix it.
* ignore_unreachable for etcd dir cleanup
ignore_errors ignores errors occur within "file" module. However, when
the target node is offline, the playbook will still fail at this task
with node "unreachable" state. Setting "ignore_unreachable: true" allows
the playbook to bypass offline nodes and move on to proceed recovery
tasks on remaining online nodes.
* Re-arrange control plane recovery runbook steps
* Remove suggestion to manually update IP addresses
The suggestion was added in 48a182844c 4
years ago. But a new task added 2 years ago, in
ee0f1e9d58, automatically update API
server arg with updated etcd node ip addresses. This suggestion is no
longer needed.
* ci: redefine multinode to node-etcd-client
This should allow to catch several class of problem rather than just
one -> from network plugin such as calico or cilium talking directly to
the etcd.
* Dynamically define etcd host range
This has two benefits:
- We don't play the etcd role twice for no reason
- We have access to the whole cluster (if needed) to use things like
group_by.
* Convert the bug-report template to issue form
* Convert the enchancement issue template to form
* Convert "Failing Test" template to issue form
* github: Remove support request template, direct to slack instead
* Remove checks for docs using exact tags
Instead use a more generic documentation for installing kubespray as a
collection from git.
* Check that we upgraded galaxy.yml to next version
This is only intented to check for human error. The version in galaxy
should be the next (which does not mean the same if we're on master or a
release branch).
* Set collection version to KUBESPRAY_NEXT_VERSION
* update cilium configmap template for new routing mode and tunnel-protocol options
Ryan Lonergan ryan.tlonergan@gmail.com
* add rbac for new cilium crd in 1.14
Ryan Lonergan ryan.tlonergan@gmail.com
* add conditional for cni-install.sh that's no longer included in cilium 1.14
Ryan Lonergan ryan.tlonergan@gmail.com
* Update roles/network_plugin/cilium/templates/cilium/ds.yml.j2
Co-authored-by: Cyclinder <qifeng.guo@daocloud.io>
---------
Co-authored-by: Cyclinder <qifeng.guo@daocloud.io>
* Disable control plane allocating podCIDR for nodes when using calico
Calico does not use the .spec.podCIDR field for its IP address
management.
Furthermore, it can false positives from the kube controller manager if
kube_network_node_prefix and calico_pool_blocksize are unaligned, which
is the case with the default shipped by kubespray.
If the subnets obtained from using kube_network_node_prefix are bigger,
this would result at some point in the control plane thinking it does
not have subnets left for a new node, while calico will work without
problems.
Explicitely set a default value of false for calico_ipam_host_local to
facilitate its use in templates.
* Don't default to kube_network_node_prefix for calico_pool_blocksize
They have different semantics: kube_network_node_prefix is intended to
be the size of the subnet for all pods on a node, while there can be
more than on calico block of the specified size (they are allocated on
demand).
Besides, this commit does not actually change anything, because the
current code is buggy: we don't ever default to
kube_network_node_prefix, since the variable is defined in the role
defaults.
We take advantage of group_by to create the list of nodes needing new
certs, instead of manually looping inside a Jinja template.
This should make the role more readable and less susceptible to
white space problems.
* Decouple role kubespray-defaults from download
Avoids doing re-importing the download role on every invocation of
kubespray-defaults (and skipping everything).
This has a measurable effect on playbook performance.
* Update docs refering to moved download defaults
* Mask systemd swap.target do disable swap
This is a more generic way to disable swap, since it pulls .swap units
in systemd distributions; fstab is only one way to generate .swap units.
* Unconditionally disable swap
We only care to disable it (the "swapon" registered variable is not used
anywhere else.
This allows to get rid of the ignore_errors, since this was added
because swapon.stdout does not exist in check_mode (see issue #6642).
* Don't explicitly disable swapOnZram
We're already masking the swap.target, which would pull the zram unit,
hence no need to handle zram-generator specifically.
Allow to fail early (pre-commit time) for jinja error, rather than
waiting until executing the playbook and the invalid template.
I could not find a simple jinja pre-commit hook in the wild.
* Clean up redondant defaulting
drain_{timeout,grace_period}_after_failure don't exist at this point, so
they always default.
* Remove useless facts
The drain_*_after_failure are never used
* Try both conntrack modules instead of checking kernel version
Depending on kernel distributor, the kernel version might not be a
correct indicator of the conntrack module use.
Instead, we check both (and use the first found).
* Use modproble.persistent rather than manual persistence
When installed as an ansible collection, roles in
ansible_play_role_names will be designated by their FQDN (i.e
'kubernetes-sigs.kubespray.<role-name>).
It means we need to check for both when checking for roles in the play.
* Validate systemd unit files
This ensure that we fail early if we have a bad systemd unit file
(syntax error, using a version not available in the local version, etc)
* Hack to check systemd version for service files validation
factory-reset.target was introduced in system 250, same version as the
aliasing feature we need for verifying systemd services with ansible.
So we only actually executes the validation if that target is present.
This is an horrible hack which should be reverted as soon as we drop
support for distributions with systemd<250.
* ansible: upgrade to version >= 2.15.5
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* tests: update requirements
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* contrib/openstack: fix wrong gitignore pattern
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* tests: add missing tzdata requirement
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* tests: remove some molecules tests
Those doesn't work in Ansible 2.15. Ansible can't load builtin now
apparently and these tests are not worth it.
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
Sets ignore_unreachable: true to `Gather ansible_default_ipv4 from all hosts`
task from fallback_ips.yml
Without this scale.yml will fail if a single node in the cluster is down, which
for large clusters happens often.
Remove cri-o apt repo job has state present but need absent
Uninstall CRI-O packages job has undefined variable crio_packages
replaced by list of packages
* metallb --lb-class cmd arg to support multiple load balancer implementations
* removed loadbalancer_class from metallb_config; metallb_loadbalancer_class in role defaults
* Use RandomizedDelaySec to spread out control certificates renewal plane
If the number of control plane node is superior to 6, using (index * 10
minutes) will fail (03:60:00 is not a valid timestamp).
Compared to just fixing the jinja expression (to use a modulo for
example), this should avoid having two control planes certificates
update node being triggered at the same time.
* Make k8s-certs-renew.timer Persistent
If the control plane happens to be offline during the scheduled
certificates renewal (node failure or anything like that), we still want
the renewal to happen.
* containerd: refactor handlers to use 'listen'
* cri-dockerd: refactor handlers to use 'listen'
* cri-o: refactor handlers to use 'listen'
* docker: refactor handlers to use 'listen'
* etcd: refactor handlers to use 'listen'
* control-plane: refactor handlers to use 'listen'
* kubeadm: refactor handlers to use 'listen'
* node: refactor handlers to use 'listen'
* preinstall: refactor handlers to use 'listen'
* calico: refactor handlers to use 'listen'
* kube-router: refactor handlers to use 'listen'
* macvlan: refactor handlers to use 'listen'
It was not 'false', which made some tasks (e.g. using systemd-resolved
template) to effectively remove default search domains; caused DNS loop
after rebooting the node/restarting cluster, so localdns service didn't
run correctly.
This make native ansible features (dry-run, changed state) easier to
have, and should have a minimal performance impact, since it only runs
on the etcd members.
* Specify the runc path when we use the containerd container engine
and change the bin_dir path.
Signed-off-by: Jin Li <qlijin@gmail.com>
* Update roles/container-engine/containerd/templates/config.toml.j2
Co-authored-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
---------
Signed-off-by: Jin Li <qlijin@gmail.com>
Co-authored-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
The blockSize attribute from Calico IPPool resources cannot be changed
once set [1]. Consequently, we use the one currently defined when
configuring the existing IPPool, avoiding upgrade errors by trying to
change it.
In particular, this can be useful when calico_pool_blocksize default
changes in kubespray, which would otherwise force users to add an
explicit setting to their inventories.
[1]: https://docs.tigera.io/calico/latest/reference/resources/ippool#spec
* modify variables.tf to accept AMI attributes via variables
* update README to guide users on utilizing variable-driven AMI configuration
* fix markdown lint error
This allows this task to work with a forks count > 10 and the default
configuration of sshd, which is to limit sessions to 10. (see
MaxSessions in sshd_config).
Since this is a delegate_to task, it connects to the same host (first
etcd) for each node in the cluster, thus easily going above 10.
Raising the ssh connection attempts allow for more robustness, without
decreasing the forks count or serialising the tasks, which could slow
the task (or the playbook as a whole, if decreasing forks).
* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane
* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane
* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane
Refactor NRI (Node Resource Interface) activation in CRI-O and
containerd. Introduce a shared variable, nri_enabled, to streamline
the process. Currently, enabling NRI requires a separate update of
defaults for each container runtime independently, without any
verification of NRI support for the specific version of containerd
or CRI-O in use.
With this commit, the previous approach is replaced. Now, a single
variable, nri_enabled, handles this functionality. Also, this commit
separates the responsibility of verifying NRI supported versions of
containerd and CRI-O from cluster administrators, and leaves it to
Ansible.
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
* [containerd] Add Configuration option for Node Resource Interface
Node Resource Interface (NRI) is a common is a common framework for
plugging domain or vendor-specific custom logic into container
runtime like containerd. With this commit, we introduce the
containerd_disable_nri configuration flag, providing cluster
administrators the flexibility to opt in or out (defaulted to 'out')
of this feature in containerd. In line with containerd's default
configuration, NRI is disabled by default in this containerd role
defaults.
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
* [cri-o] Add configuration option for Node Resource Interface
Node Resource Interface (NRI) is a common is a common framework for
plugging domain or vendor-specific custom logic into container
runtimes like containerd/crio. With this commit, we introduce the
crio_enable_nri configuration flag, providing cluster
administrators the flexibility to opt in or out (defaulted to 'out')
of this feature in cri-o runtime. In line with crio's default
configuration, NRI is disabled by default in this cri-o role
defaults.
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
---------
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
* terraform-openstack: Updated extra partitions to use empty list by default
* terraform-openstack: Added possibility to enable dhcp flag critical on one interface
when people run playbook with option `--tags=kubelet`, the kubelet config may changed, because some variables used in task populating `kubelet-config.yml` could be different with running task(`Fetch facts`)
* Fix containerd_registries in config_path for mirrors and remove nerdctl global insecure_registry setting
* Make containerd hosts.toml mode 0640
* Add containerd_registries_mirrors and keep containerd_registries to pass packet_debian11-calico-upgrade
Set owner/group to root/root when unarchiving kata-containers binary to prevent kata-containers binaries/directories and especially / from getting chowned to 1001:123, the file owner specified in the kata-containers archive
* tests: replace fedora35 with fedora37
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: replace fedora36 with fedora38
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* docs: update fedora version in docs
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* molecule: upgrade fedora version
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: upgrade fedora images for vagrant and kubevirt
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* vagrant: workaround to fix private network ip address in fedora
Fedora stop supporting syconfig network script so we added a workaround
here
https://github.com/hashicorp/vagrant/issues/12762#issuecomment-1535957837
to fix it.
* netowrkmanager: do not configure dns if using systemd-resolved
We should not configure dns if we point to systemd-resolved.
Systemd-resolved is using NetworkManager to infer the upstream DNS
server so if we set NetworkManager to 127.0.0.53 it will prevent
systemd-resolved to get the correct network DNS server.
Thus if we are in this case we just don't set this setting.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* image-builder: update centos7 image
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* gitlab-ci: mark fedora packet jobs as allow failure
Fedora networking is still broken on Packet, let's mark it as allow
failure for now.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Update reset.yml
reset confirmation user input fix
* Update reset.yml
added default for non-interactive run in ci/cd
* fix reset_confirmation in reset.yml
* skip reset_confirmation promtp when reset_confirmation is defined via extra-vars option (for tests)
* check both string type and object type with user_input for reset_confirmation var
* reset_confirmation_prompt in conjunction with reset_confirmation
improvement inspired by:
https://github.com/kubernetes-sigs/kubespray/pull/10288#issuecomment-1637056880
* project: update all dependencies including ansible
Upgrade to ansible 7.x and ansible-core 2.14.x. There seems to be issue
with ansible 8/ansible-core 2.15 so we remain on those versions for now.
It's quite a big bump already anyway.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: install aws galaxy collection
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* ansible-lint: disable various rules after ansible upgrade
Temporarily disable a bunch of linting action following ansible upgrade.
Those should be taken care of separately.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve deprecated-module ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve no-free-form ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve schema[meta] ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve schema[playbook] ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve schema[tasks] ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve risky-file-permissions ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve risky-shell-pipe ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: remove deprecated warn args
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: use fqcn for non builtin tasks
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve syntax-check[missing-file] for contrib playbook
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: use arithmetic inside jinja to fix ansible 6 upgrade
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: cleanup stale packet namespace automatically
Cancelled job on Gitlab can produce stale VMs as the delete playbook
will never be executed. This commits allow removing old vms by getting
all the namespace created from the same branch with an older pipeline
id.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: cleanup stale packet namespace after 2 hours
This ensure that we don't have any packet namespace remaining for more
than 2 hours. All the jobs complete usually within 30min-1hour so 2
hours is enough to detect a stale namespace.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: ignore vm cleanup failure
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: use pipeline_id var instead of fetching namespace for cleanup packet vm
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
New line was not inserted between image and imagePullPolicy for some
reasons with the jinja. Simplifying this altogether should fix this.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Fix upgrade-path for kubelet-csr-approver
Fixes an error when you enable kubelet-csr-approver when upgrading.
It hangs waiting for the certificate to be approved since the
kubelet-csr-approver is not installed yet.
* Add missing package when using helm role
Molecule 5.0 require ansible-core 2.12.10.
So this commit we update ansible-core from 2.12.5 to 2.12.10.
We also drop supporting two ansible-core version. Also we now use the "oldest"
still supported ansible-core version as both 2.11 is EOL and not
supported by molecule.
tests/molecule: remove linting in molecule to support molecule 5
tests/molecule: remove role name check for molecule 5 support
Kubespray doesn't use ansible galaxy style naming so we have to disable
that check.
contrib/inventory_builder: fix tox.ini for tox4
tests/molecule: fix get_playbook in testinfra tests
tests: upgrade most tests requirements
Exclude ansible-lint for now, I will do that in a separate PR.
tests/molecule: force kvm driver option
If we don't do this it fallbacks to qemu emulated on our CI for some
reasons.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Fix `The task includes an option with an undefined variable` for 1.27
* delete old flag --container-runtime
Signed-off-by: Victor Login <batazor@evrone.com>
---------
Signed-off-by: Victor Login <batazor@evrone.com>
Add option to configure class as the default class
Add option to disable wathcing for ingresses without class
Remove redundant if that always evaluates to true
Fix default value missing for ingress_nginx_default
``
"msg": "Failed to template loop_control.label: 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'item'. 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'item'", "skip_reason": "Conditional result was False"}
``
fixes case when multus should NOT be included.
Calling bootstrap in facts.yaml so that we can always collect facts even on
new nodes. This is useful when you want to add nodes to an inventory
beforehand and then collect facts and scale the cluster with the scale
playbook and --limits. With dynamic inventory sometimes it might be more
difficult to add the nodes after running the facts playbook in this
specific situation.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
This feature no longer works on Ansible 6 / ansible-core 2.13. We do not
support these version officially yet but this will help for the future
upgrade and may help some people running those inadvertently.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
- Test with new version: 37.20230322.3.0. Both containerd and
cri-o is tested
- bugfix: when we use crio and the var bin_dir is changed,
there will be some error about the new bin dir.
* remove-debian9-support
* Add six module into openstack-cleanup/requirements.txt (#10099)
To fix tf-elastx_cleanup job which was failed with the following error:
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/generic/password.py", line 16, in <module>
from keystoneauth1.identity import v3
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/v3/__init__.py", line 27, in <module>
from keystoneauth1.identity.v3.oauth2_mtls_client_credential import * # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/v3/oauth2_mtls_client_credential.py", line 17, in <module>
import six
ModuleNotFoundError: No module named 'six'
---------
Co-authored-by: Kenichi Omichi <ken1ohmichi@gmail.com>
According to the canal github[1] the repo is not maintained over 5 years.
In addition, the README says
```
Originally, we thought we might more deeply integrate the two projects
(possibly even going as far as a rebranding!). However, over time it
became clear that that wasn't really necessary to fulfil our goal of
making them work well together. Ultimately, we decided to focus on
adding features to both projects rather than doing work just to
combine them.
```
So it is difficult to support canal by Kubespray at this situation.
[1]: https://github.com/projectcalico/canal
To fix tf-elastx_cleanup job which was failed with the following error:
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/generic/password.py", line 16, in <module>
from keystoneauth1.identity import v3
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/v3/__init__.py", line 27, in <module>
from keystoneauth1.identity.v3.oauth2_mtls_client_credential import * # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/v3/oauth2_mtls_client_credential.py", line 17, in <module>
import six
ModuleNotFoundError: No module named 'six'
* Drop CI jobs related to canal
According to the canal github[1] the repo is not maintained over 5 years.
In addition, the README says
Originally, we thought we might more deeply integrate the two projects
(possibly even going as far as a rebranding!). However, over time it
became clear that that wasn't really necessary to fulfil our goal of
making them work well together. Ultimately, we decided to focus on
adding features to both projects rather than doing work just to
combine them.
So we don't need to run CI jobs related to the canal at this situation.
[1]: https://github.com/projectcalico/canal
* Update ci.md
* chore(helm-apps): fix README example
README shows a non-working example according to the specs for this role.
* Add support for kubelet-csr-approver
Co-Authored-By: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Add tests for kubelet-csr-approver
Co-Authored-By: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Add Documentation for Kubelet CSR Approver
Co-Authored-By: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Co-authored-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Remove deprecated provider and fix flatcar configs
* Refactor for DRYness
* Add missing line endings
* Enable tests for hetzner terraform in CI
* Add missing inventory for CI tests
* Fix: vSphere Error: `Apply a CSI secret manifest`
This PR will fix an issue that you will see on 2nd deploy when deploying External vSphere
How to re-produce:
1. Set custom `vsphere_csi_namespace: "vmware-system-csi"`
2. Deploy as usual
3. Observe no errors
4. Deploy 2nd time without `reset`
5. Playbook fails with:
```
TASK [kubernetes-apps/csi_driver/vsphere : vSphere CSI Driver | Apply a CSI secret manifest]
fatal: [node-00]: FAILED! => changed=true
censored: 'the output has been hidden due to the fact that ''no_log: true'' was specified for this result'
```
* create namespace if does not exist
* lint fix
* try to fix lint errors
* fix `too few spaces before comment`
* change the order of applied manifests
* typo
* [cilium] fix rbac and upgrade hubble v0.11.0 (#3)
* [cilium] fix rbac for LB bgp ipam
* [cilium] Upgrade Hubble to v0.11.0 and add mTLS between Hubble UI and Hubble Relay
* fix dns domain hubble for tls
---------
Co-authored-by: Thuon Jeremy <d107869@olinfra1.infra.bdm.outscale.c1.dav.fr>
* Fix blank line
---------
Co-authored-by: Thuon Jeremy <d107869@olinfra1.infra.bdm.outscale.c1.dav.fr>
Cilium 1.13.1 changed how the cilium-cni binary gets placed in /opt/cni/bin,
so that it takes place in an init container rather than in the main agent.
This commit removes the variable `use_localhost_as_kubeapi_loadbalancer`
and rather detects that we are in a situation where we can use the
localhost apiserver loadbalancer (meaning that we use the localhost load
balancer and that the same ports are used for both the load balancer and
the kube-apiserver).
This also cleanups the calico code to use `kube_apiserver_global_endpoint`
rather than implementing the same logic all over again.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* node: fix default kubelet/runtime cgroups when kube_reserved is false (default)
Commit 1c4db6132d introduced a notion of
kube_reserved. This introduced a breaking change defaulting to use
kube.slice for the container_manager and the kubelet as if kube_reserved
was always enabled whereas it is disabled by default.
This commit fixes this by bringing back system.slice whenever
kube_reserved is disabled.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* inventory/sample: change false for kube_reserved as its the default
Changing the commented value in sample inventory to the actual default
value.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* network_plugin/custom_cni: add CNI to apply provided manifests
Add a new simple custom_cni to install provided Kubernetes manifests.
This could be useful to use manifests directly provided by a CNI when
there are not support by Kubespray (i.e.: helm chart or any other manifests
generation method).
Co-authored-by: James Landrein <james.landrein@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* network_plugin/custom_cni: add test with cilium
Co-authored-by: James Landrein <james.landrein@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Co-authored-by: James Landrein <james.landrein@proton.ch>
crun_bin_dir was used to specify the destination of the crun binary during the
download process. This path must match with the value provided in the CRI-O
configuration file. So changing its value to bin_dir helps to mismatch errors.
Signed-off-by: Victor Morales <chipahuac@hotmail.com>
requirements-$ANSIBLE_VERSION.yml doesn't exist in Kubespray repo.
That was for supporting ansible 2.10-, and now Kubespray supports
2.11+. So this drops the part to avoid confusion.
Reducing the number of layers, increasing readability, reducing the size of the image (how much I can’t check, it’s impossible for me to build due to the unavailability of the vagrant repository)
* Fix yq install in argocd role: use download_file instead of get_url
* Fix use download_file instead of get_url to download argocd-install manifest in argocd role
* Fix order and add arm64 checksum
* Fix: Failed to template loop_control.label: 'None'
Since Kubespray v2.21.0, the commit a98ab40434 removes copying
contrib/ accidentaly. The contrib/ contains useful tools like offline
tools etc. This adds the contrib/ to Dockerfile again.
1.5.7 was released Aug 2, 2021 and 1.6.1 came out on Dec 13, 2022.
There's been a good amount of new features, improvements and fixes since
1.5.7 and the changelogs for each version are available in the docs:
https://ara.readthedocs.io/en/latest/changelog-release-notes.html
This fixes the CrashLoopBackoff error that appears because envoy
configuration has changed a lot and upstream removed the envoy proxy to
use nginx only instead. Those changes are based on upstream cilium helm.
It is quite confusing that there's an all-caps, bolded comment that seems to imply that `etcd_download_url` is relevant only when not using host-based deployment. The opposite is true: of course the artifact download URL is relevant and required for host-based etcd.
Perhaps the entire comment can be read in a different way, and should perhaps be reworded entirely, cf. 374438a3d6/docs/offline-environment.md (L38)
Removing the "**DON'T**" matches the way the other comments in this file are written and matches my personal interpretation.
This commit uses a kubeadm join config to pull down cert for etcd in
workers nodes (which is needed in some circumstances, for instance with
calico or cilium).
The previous way didn't allow us to pass certain parameters which was
typically given in the config in other kubeadm invokations in Kubespray.
This made kubeadm produced some errors for some edge cases.
For example, in our deployment we don't have a default route and even
though it's only to download the certificates, kubeadm produce an error
`unable to select an IP from default routes` (these command are kubeadm
controlplane command, so kubeadm does some additional checks). This is
fixed by specifying `advertiseAddress` within the kubeadm config.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Add a variable `calico_felix_floatingIPs` which permit to enable calico feature `floatingIPs`
(disabled per default).
Signed-off-by: MatthieuFin <matthieu2717@gmail.com>
#9679
In 6db6c8678c, this was disabled becaue
kubesrpay gave too much permissions that were not needed. This commit
re-enable back this option by default and also removes the extra
permissions that kubespray gave that were in fact not needed.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* optimize cgroups settings for node reserved
* fix
* set cgroup slice for multi container engine
* set cgroup slice for crio
* add reserved cgroups variables to sample files
* Compatible with cgroup path for different container managers
* add cgroups doc
* fix markdown
At the upstream calico development, the v3.21 branch is not updated
over 2 monthes. In addition, unnecessary error message is output at
Kubespray deployment due to different URLs for calico v3.21 or v3.22+
This drops the v3.21 support to solve the issue.
Without minimal cluster configuration, even on a one node control plane,
the health check of the ovn-cental container always fails as it queries the
cluster/status.
Crio registries configuration changed from crio_registries_mirrors to
crio_registries. The configuration in the test was however forgotten.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* feat(): Add wireguard backend to flannel cni
As described in the flannel docs:
https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md#wireguard
This does not support optional configuration methods like:
- setting a psk (will be autogenerated by default)
- chang listening ports
- change mode (defaults to 'separate')
- change PersistentKeepaliveInterval (defaults to 0)
* Add supported backends to flannel docs
* Fix markdown in docs
This fixes a task failure in the rescue block that uncordons nodes after an unsuccessful drain. The issue occurs when `kube_override_hostname` is set and does not match `inventory_hostname`.
The `containerd-common` role is responsible for gathering OS specific variables from the vars directory of the roles that include or import it. `containerd-common` is imported via role dependency by a total of two roles, `container-engine/docker`, and `container-engine/containerd`.
containerd-common is needed by both the docker and containerd roles as a dependency when:
- containerd is selected as the container engine
- a docker install is detected and needs to be removed
- apt is the package manager
However, by default, roles can not be invoked more than once in the same play, unless `allow_duplicates: true` is set for that role. This results in the failure of the `containerd | Remove containerd repository` task, since only the docker vars will be loaded in the play, and `containerd_repo_info.repos`, normally populated by containerd/vars, is left empty.
This change sets `allow_duplicates: true` for `containerd-common` which fixes the currently failing containerd tasks if docker was detected and removed in the same play.
During the reset, restart network was not completing in distros
like RHEL/CentOS/AlmaLinux with major version higher than 8.
Example:
kubespray> ansible-playbook -i inventory/mydomain/hosts.yml reset.yml -b -v
fatal: [mynode]: FAILED! => {"changed": false, "msg": "Could not find the requested service network: host"}
Signed-off-by: Douglas Schilling Landgraf <dlandgra@redhat.com>
Signed-off-by: Douglas Schilling Landgraf <dlandgra@redhat.com>
There is a wrong directory path to all.yml and vsphere.yml. The wrong directory is `inventory/sample/group_vars/all.yml` and `inventory/sample/group_vars/all/vsphere.yml` which should be `inventory/sample/group_vars/all/all.yml` and `inventory/sample/group_vars/all/vsphere.yml`.
When operating kubespray from kubespray image with docker run,
we need to checkout the specific kubespray version as the same as
the image, because the sample inventory contains kubernetes version
and the version of master branch could not be supported on the released
kubespray, for example.
The certified-conformance mode took 2+ hours and that was too long
by comparing Quick mode which was specified previously.
So this updates the mode to Quick again.
The latest version of sonobuoy is v0.56.11.
This updates the version to the latest.
As the file name, this makes it use certified-conformance mode
clearly for the latest version of sonobuoy.
by setting a default runtime spec with a patch for RLIMIT_NOFILE.
- Introduces containerd_base_runtime_spec_rlimit_nofile.
- Generates base_runtime_spec on-the-fly, to use the containerd version
of the node.
- Update and re-work the documentation:
- Update links
- Fix formatting (especially for lists)
- Remove documentation about `useAlphaApi`,
a flag only for k8s versions < v1.10
- Attempt to clarify the doc
- Update to version 1.5.0
- Remove PodSecurityPolicy (deprecated in k8s v1.21+)
- Update ClusterRole following upstream
(cf https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/pull/292)
- Add nodeSelector to DaemonSet (following upstream)
We made all vagrant jobs non-voting because those jobs were not stable.
However the setting allowed a pull request which broke vagrant jobs
completely merged into the master branch.
To avoid such situation, this makes one of vagrant jobs voting.
Let's see the stability of the job.
The install-cni-plugin image was not updated to the corresponding
arch when building the different DS.
Fixes issue #9460
Signed-off-by: Fred Rolland <frolland@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
`test` is is a internal Python package (see [doc]), and as such should not be
used here. It make tests fail in some environments.
[doc]: https://docs.python.org/3/library/test.html
* Fix inconsistent handling of admission plugin list
* Adjust hardening doc with the normalized admission plugin list
* Add pre-check for admission plugins format change
* Ignore checking admission plugins value when variable is not defined
On hardening environments, cert-manager pods could not be created
from the corresponding deployments. This adds the securityContext
to solve the issue.
To verify the hardening method works always.
The configuration comes from docs/hardening.md
Fix yaml format of hardening.yml
Add condition to skip 040 test for hardening
busybox container requires a root permission for ping.
For testing hardening method at CI, we need to switch to another image
which doesn't require the root permission for network testing.
On kubernetes/kubernetes repo, we are using agnhost which doesn't
require it. So this makes the test use aghhost image.
In addition, this updates the test manifest to specify securityContext
without any privilege.
* Avoid MetalLB speaker image download when metallb_speaker_enabled is set to
* Move metallb_speaker_enabled var to allow outside metalLB role references
* Move metallb_speaker_enabled var to allow outside metalLB role references
* Improve metallb_speaker_enabled default values
When trying to add a hardening CI job by copying configuration from
hardening.md, yamllint CI job deleted invalid format.
This fixes it for maintaining the CI job.
* Fix: install policy controller on kdd too
* Remove the calico_policy_version condition altogether
* Install policy controller both on canal and calico under same condition
* Support kubeadm patches in v1beta3
* Update kubeadm patches sample files in inventory
* Fix pre-commit syntax
* Set kubeadm_patches enabled to false in sample inventory
* Add optional NAT support in calico router mode
* Add a blank line in front of lists
* Remove mutual exclusivity: NAT and router mode
* Ignore router mode from NAT
* Update calico doc
Since the commit fad296616c cri_dockerd_enabled
has not been used. But the packet_ubuntu22-aio-docker.yml still contains
the configuration and causes confusions.
This removes the configuration for cleanup.
It seems that PR #8839 broke `calico_datastore: etcd` when it removed ipamconfig support for etcd mode.
This PR fixes some failing tasks when `calico_datastore == etcd`, but it does not restore ipamconfig support for calico in etcd mode. If someone wants to restore ipamconfig support for `calico_datastore: etcd` please submit a follow up PR for that.
For the following configuration
```
containerd_insecure_registries:
docker.io:
- dockerhubcache.example.com
```
the rendered /etc/containerd/config.toml contains
```
[plugins."io.containerd.grpc.v1.cri".registry.configs."docker.io".tls]
insecure_skip_verify = true
```
but it needs to be
```
[plugins."io.containerd.grpc.v1.cri".registry.configs."dockerhubcache.example.com".tls]
insecure_skip_verify = true
```
* Add the option to enable default Pod Security Configuration
Enable Pod Security in all namespaces by default with the option to
exempt some namespaces. Without the change only namespaces explicitly
configured will receive the admission plugin treatment.
* Fix the PR according to code review comments
* Revert the latest changes
- leave the empty file when kube_pod_security_use_default, but add comment explaining the empty file
- don't attempt magic at conditionally adding PodSecurity to kube_apiserver_admission_plugins_needs_configuration
* Add 'flush ip6tables' task in reset role
If enable_dual_stack_networks is set to true and ip6 is defined,ip6tables will be created. But when reset the kubernetes cluster, kubespray doesn't flush ip6tables.
* [CI] fix molecule tests on opensuse by upgrading to 15.4 (#9175)
* [CI] fix molecule tests on opensuse by upgrading to 15.4
* [opensuse] use correct python crytography package name depending on distribution version
Co-authored-by: Cristian Calin <6627509+cristicalin@users.noreply.github.com>
This condition blocks the creation of the `etcd` user in certain conditions.
Specifically, when you have a `etcd_deployment_type: kubeadm` and `kube_owner: root`.
Being the `root` user already present on the system, this will not be a problem (due to the idempotency of ansible).
Today we have many contributions to contrib/offline/ and some PRs
contained invalid coding style for those scripts.
This enables shellcheck to make such invalid coding style easily.
* Update main.yaml
* remove version in dpkg_selection name
* make lint happy
* Fix typo
* add comment / remove useless contition
* remove dpkg hold in reset tasks
* This release removes support for Kubernetes v1.19.0
* This release adds support for Kubernetes v1.24.0
* Starting with this release, we will need permissions on the coordination.k8s.io/leases resource for leaderelection lock
The commit 1ce2f04 tried to merge multiple SUSE OS checks including
"openSUSE Leap" and "openSUSE Tumbleweed" into a single SUSE, but
that was a perfect change.
Then the commit c16efc9 tried to fix it for "openSUSE Leap", but it
didn't take care of "openSUSE Tumbleweed".
Then this adds "openSUSE Tumbleweed" to the OS check.
* Fix vcloud-csi bug related to #9046
Signed-off-by: yasintahaerol <yasintahaerol@gmail.com>
* add supervisor-fss-namespace=kube-system flag to vsphere-csi-controller-deployment
Signed-off-by: yasintahaerol <yasintahaerol@gmail.com>
This adds target components on check_readme_versions.sh after
merging https://github.com/kubernetes-sigs/kubespray/pull/9044
In addition, this fixes typo on check_readme_versions.sh
This adds `foo_version` variables for some components because
check_readme_versions.sh verifies the corresponding version for
`<component name>_version` from main.yml. This change also makes
consistency in the main.yml. In long-term, we will be able to
remove the existing `foo_image_tag` variables, but that is not now
for backwards compatibility for users.
During code-review, reviwers needed to take care of README.md also
should be updated when the pull request updated component versions.
This adds the corresponding check to reduce reviwer's burden.
* Added new configuration item for extra tolerations in policy controllers
Signed-off-by: Sébastien Masset <smt.masset@gmail.com>
* Added new configuration item for extra tolerations in DNS autoscaler
Signed-off-by: Sébastien Masset <smt.masset@gmail.com>
* Aligned existing handling of extra DNS tolerations
Signed-off-by: Sébastien Masset <smt.masset@gmail.com>
* feat: make kubernetes owner parametrized
* docs: update hardening guide with configuration for CIS 1.1.19
* fix: set etcd data directory permissions to be compliant to CIS 1.1.12
* extra admission controls now don't have a version in their file names
eventratelimit.v1beta2.yaml.j2 -> eventratelimit.yaml.j2
* cri_socket variable includes the unix:// prefix to be conformat with
upstream
When running molecule jobs, we saw the folloing warning message:
[DEPRECATION WARNING]: [defaults]callback_whitelist option, normalizing names
to new standard, use callbacks_enabled instead. This feature will be removed
from ansible-core in version 2.15. Deprecation warnings can be disabled by
setting deprecation_warnings=False in ansible.cfg.
callbacks_enabled has been added since Ansible 2.11 and Kubespray is using
Ansible 2.12 at master branch. So we can use callbacks_enabled safely to
avoid the warning message.
* Allow disabling calico CNI logs with calico_cni_log_file_path
Calico CNI logs up to 1G if it log a lot with current default settings:
log_file_max_size 100 Max file size in MB log files can reach before they are rotated.
log_file_max_age 30 Max age in days that old log files will be kept on the host before they are removed.
log_file_max_count 10 Max number of rotated log files allowed on the host before they are cleaned up.
See https://projectcalico.docs.tigera.io/reference/cni-plugin/configuration#logging
To save disk space, make the path configurable and allow disabling this log by setting
`calico_cni_log_file_path: false`
* Fix markdown
* Update roles/network_plugin/canal/templates/cni-canal.conflist.j2
Co-authored-by: Kenichi Omichi <ken1ohmichi@gmail.com>
Co-authored-by: Kenichi Omichi <ken1ohmichi@gmail.com>
This reverts commit e375678674.
The workaround of explicitly specifying root for the kubelet unit was
for pulling images from private registry. Kubernetes now have a
dedicated mechanism with imagePullSecret.
Current Kubespray supports the Kubernetes version 1.21 or upper with
`kube_version_min_required: v1.21.0`
Then kube_version v1.20- related code is not used at all.
This deletes those code for cleanup.
* [etcd] Add extra documentation for `etcd_memory_limit` and `etcd_quota_backend_bytes`
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [etcd] Add support for setting ETCD_MAX_REQUEST_BYTES
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [upcloud] add option to use preconfigured cpu/mem plan
* [upcloud] add option to use firewall rules for API server/SSH access
* [upcloud] add option to use managed load balancer
* [cilium] Separate templates for cilium, cilium-operator, and hubble installations
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Update cilium-operator templates
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Allow using custom args and mounting extra volumes for the Cilium Operator
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Update the cilium configmap to filter out the deprecated variables, and add the new variables
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Add an option to use Wireguard encryption on Cilium 1.10 and up
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Update cilium-agent templates
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Bump Cilium version to 1.11.3
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* Assert that IP range is enough for the nodes
Co-authored-by: Necatican Yıldırım <necaticanyildirim@gmail.com>
* Fixed whitespace
* Fixed errors
* Fixed errors
Co-authored-by: Necatican Yıldırım <necaticanyildirim@gmail.com>
The version 1.4 of containerd has been End of Life since March 3, 2022
as https://containerd.io/releases/#support-horizon
It is nice to drop the support from Kubespray also to follow containerd.
tests/requirements.txt links to tests/requirements-2.12.txt, so
Kubespray uses ansible 2.12 by default for testing. However we
forgot to update testcases_prepare.sh to use ansible 2.12.
This updates testcases_prepare to use ansible 2.12.
* Force containerd service unmasking
Force systemd to unmask and start service when adding containerd service
* Eliminate restart and move unmasking step
Switch to start instead of restart
Move unmasking to restart handler
* Add unmasking to similar container runtimes
* Add missing service names
aufs-tools was required for docker.io package originally,
but Kubespray installs docker-ce package instead today.
In addition, Ubuntu 20.04 doesn't provide aufs-tools as [1].
Then this removes aufs-tools from Ubuntu requirement.
[1]: https://bugs.launchpad.net/ubuntu/+source/aufs-tools/+bug/1947004
* Update verbs for volumeattachments resource
Update verbs for volumeattachments resource so that the kubelet can create volumeattachments and mount volumes when deploying Kubernetes on VMware vSphere.
* Update verbs for volumeattachments resource
Update verbs for volumeattachments resource to match upstream
* Update vsphere-csi-controller-rbac.yml.j2
* - add ability to specify the network_zone in hetzner terraform
- Export the network id from hetzner terraform the the generated inventory.ini
* - Add with_networks variable to allow different deployments of hcloud controller manager
- Add network id to hcloud controller secret (added via the inventory)
- Don't include extra_args if it's not set
Current ansible.tags 'facts' is for skipping actual Kubespray deployment
at vagrant CI because the deployment takes much time. However the static
'facts' skips the deployment for normal usage of vagrant also.
That causes confusions.
This adds VAGRANT_ANSIBLE_TAGS to skip the deployment for vagrant CI.
The quotations in the variable nerdctl_extra_flags are not required for the `nerdctl_image_pull_command` and throw the following error when executing the cluster-playbook with `container_insecure_registries` set:
unknown flag: --insecure-registry\\\"
This happens as the complete nerdctl_image_pull_command string variable gets split into an array string for the cmd task. The escaped quotation doesn't get escaped properly and is added to the cmd-string array as part of the command. This leads to a wrong written insecure-registry flag, which throws this error.
Due to missing quotation of nerdctl_extra_flags, ansible-playbook was failed:
Using module file /usr/local/lib/python3.6/dist-packages/ansible/modules/command.py
Pipelining is enabled.
[..]
File "/usr/lib/python3.8/shlex.py", line 191, in read_token
raise ValueError("No closing quotation")
This fixes the issue.
T-Eberle investigated the issue and found the solution.
Thank you T-Eberle!
* [ansible] make ansible 5.x the new default version and move different versions tested to nightly jobs
* [CI] jobs were missing proper ansible cleanup
If running Kubespray on static IP environments, a task was failed like:
TASK [kubernetes/preinstall : Configure dhclient hooks for resolv.conf (RH-only)]
fatal: [ak8s2]: FAILED! => {
"changed": false, "checksum": "..",
"msg": "Destination directory /etc/dhcp/dhclient.d does not exist"}
This adds a check for dhclientconffile for running 0100-dhclient-hooks to
run the task only if dhcpclient is enabled.
* terraform/openstack: Use path.module for ansible_bastion_template.txt
This extends on #7643 by not using path.root, but switching to path.module
to allow use of the terraform code as a module itself. This change then keeps
all calls to the template file stable even for that use-case.
* terraform/openstack: Make sed calls fail on errors
By using a single call with two replacements to use of sed will create proper exit codes
and allowing for errors to be recognized by terraform.
When running cluster.yml for new machines what containerd is already
install but Kubernetes cluster were not installed before, the task
"remove-node | List nodes" is failed like
"changed": false,
"cmd": [
"/usr/local/bin/kubectl", "--kubeconfig",
"/etc/kubernetes/admin.conf", "get", "nodes", "-o",
"go-template={{ range .items }}{{ .metadata.name }}
{{ "\n" }}{{ end }}"
],
..
"stderr": "error: stat /etc/kubernetes/admin.conf: no such file or directory",
That was due to lack to check the existing Kubernetes cluster exists
or not before running "kubectl drain" command.
This adds the check to avoid the issue.
* [calico] make vxlan encapsulation the default
* don't enable ipip encapsulation by default
* set calico_network_backend by default to vxlan
* update sample inventory and documentation
* [CI] pin default calico parameters for upgrade tests to ensure proper upgrade
* [CI] improve netchecker connectivity testing
* [CI] show logs for tests
* [calico] tweak task name
* [CI] Don't run the provisioner from vagrant since we run it in testcases_run.sh
* [CI] move kube-router tests to vagrant to avoid network connectivity issues during netchecker check
* service proxy mode still fails connectivity tests so keeping it manual mode
* [kube-router] account for containerd use-case
* Sketch of helm-apps role interface
* helm-apps: Early implementation and settings
* helm-apps: Fix README.md example playbook
* fixup! Sketch of helm-apps role interface
* Make the argument specs more explicit
* Remove exposed options from hardcoded default
* Simplify example playbook in README.md
- Define directly the roles parameters
- Add an example of option override for one chart only
* Use release instead of charts
Make explicit that the role is mananing releases, not charts.
Simplify parameters naming
* Add epoch to docker-ce and docker-ce-cli packages to ensure docker upgrade
* Split container-engine redhat vars to support legacy RHEL 7 version management
* Support ansible_distribution_major_version when disvering vars with ansible_os_family
* Update ansible-lint to 5.4.0 (#8607)
It seems that the Rich version 11.0.0 has a breaking change.
So need to update ansible-lint to 5.3.2 or later.
* Fix for ansible-lint no-changed-when rule (#8607)
* There is an issue with etcd v3.5.0 where it resurrects ancient members see: https://github.com/etcd-io/etcd/issues/13196
This issue is clearly fixed in etcd v3.5.2
* Just keep the checksums
* [containerd] add hashes for 1.6.1
* [contained] make 1.6.1 the default
* [containerd] add hashes for 1.5.10
* [containerd] add hashes for 1.4.13
* [nerdct] bump to 0.17.1
* [containerd] add checksums for 1.6.0
* [containerd] promote 1.6.0 as the new default
* [runc] promote 1.1.0 as the new default to allow arm deployments out of the box
* [nerdctl] bump to 0.17.0 to align with containerd 1.6.0
* [reset] allow crictl stopp and rmp commands to fail
As far as I can tell this is simply a typo that has existed from the beginning. Having it this way around (`etcd` group as a child and thus subset of `k8s_cluster`) mirrors what is written in the preceeding sentence.
When trying to run print_hostnames of inventory.py, it outputs the following
error:
$ CONFIG_FILE=./test-hosts.yaml python3 ./inventory.py print_hostnames
Traceback (most recent call last):
File "./inventory.py", line 472, in <module>
sys.exit(main())
File "./inventory.py", line 467, in main
KubesprayInventory(argv, CONFIG_FILE)
File "./inventory.py", line 92, in __init__
self.parse_command(changed_hosts[0], changed_hosts[1:])
File "./inventory.py", line 415, in parse_command
self.print_hostnames()
File "./inventory.py", line 455, in print_hostnames
print(' '.join(self.yaml_config['all']['hosts'].keys()))
KeyError: 'all'
because it is missed to load a hosts config file before printing hostnames.
This fixes the issue.
* [calico] upgrade 3.19.x to 3.19.4
* [calico] upgrade 3.20.x to 3.20.4
* [calico] upgrade 3.21.x to 3.21.4 and make it the default
* [calico] add 3.22.0 checksums
* [calico] account for path changes in calico 3.21.4 crd archive and above
Fixed the problem when call ansible-playbook contrib/mitogen/mitogen.yml
"The error was: 'dict object' has no attribute 'section'"
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Use openstack_networking_port_v2 and openstack_networking_floatingip_associate_v2
to attach floating ips. This gives us more flexibility on disabling port security
when binding instances directly on provider networks in private cloud scenario.
If kubelet is run with systemd (as it always is when using kubespray),
it starts in systemd's /system.slice/kubelet.service cgroup.
This commit prevents a creation and usage of a second unrelated cgroup.
* terraform/gcp: Do not create unused subnetworks
By default terraform creates a subnetwork in each 39 regions
* terraform/gcp: Upgrade to latest google provider
... where "one of source_tags, source_ranges, or source_service_accounts must be defined"
* Use sysctl_file_path variable for all sysctl_file locations
* Add sysctl_file_path variable to kubespay-defaults
* Remove previously used sysctl file locations if present
* Use explicit filename in roles/kubernetes/node/defaults/main.yml
* Defaults: use explicit value
targetRef on endpoints surfaces as
__meta_kubernetes_endpoint_address_target_kind/__meta_kubernetes_endpoint_address_target_name
in prometheus and gets converted to the label `node` by
prometheus-operator
* Fix terraform Warning
Version constraints inside provider configuration blocks are deprecated
Terraform 0.13 and earlier allowed provider version constraints inside the
provider configuration block, but that is now deprecated and will be removed
in a future version of Terraform. To silence this warning, move the provider
version constraint into the required_providers block.
* Fix terraform Warning: Quoted references are deprecated
* terraform: Update GCP Ubuntu to latest LTS
The tf-elastx_cleanup test job was failed with error message:
Port xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx cannot be deleted
directly via the port API: has device owner network:router_interface.
That means necessary to remove a subnet from the router before
deleting the port.
This adds a method to removes a subnet from the router automatically.
This allow to workaround #8375 by using image_command_tool=crictl
when containerd_registries is used for containerd.
Also changes image_info_command_on_localhost for docker to return digests.
This fixes the following types of failures:
- empty-string-compare
- literal-compare
- risky-file-permissions
- risky-shell-pipe
- var-spacing
In addition, this changes .gitlab-ci/lint.yml to block the same issue
by using the same method at Kubespray CI.
When running ansible-lint directly, we can see a lot of warning
message like
risky-file-permissions File permissions unset or incorrect
This fixes the warning messages.
All container image versions were defined in download/defaults/main.yml
except containerd.
The inconsistency caused the offline script(generate_list.sh) could not
output the URL of containerd image.
This moves the definition into a valid file.
In addition, this adds host_os to generate_list.sh for downloading
krew from a valid URL.
* Update config.toml.j2
i think this commit code is not completed works
exam registry address : a.com:5000
insecure registry must be http://a.com:5000
but this code add insecure a.com:5000 (without http://)
If there is no http, containerd accesses with https even if insecure_skip_verify = true
solution is code edit
* Update config.toml.j2
* Update containerd.yml
* Update containerd.yml
* Update containerd.yml
* Update config.toml.j2
I was deploy my cluster with separate etcd cluster and not intersect with kube_control_plane or kube_node. And I want to run etcd cluster in docker but still used containerd to make container runtime for all other nodes. Therefore, I was added note to this doc for everyone
Thank !
- Use builtin task scheduling of ansible (same task on each host)
instead of manual looping on master
Benefits:
- One less play in remove-node.yml playbook
- Parralel node drain
- Drain parameters (timeout, grace period, retries,
allow_ungraceful_removal) can be adjusted separately for each node
with ansible variables
* Ensure entries for 1.23 are added for supported_versions vars
* cri-o: add support for kubernetes 1.23 but still use cri-o 1.22
* kubescheduler-config: diferentiate config versions based on kube_version
* registry: service add clusterIP, nodePort, loadBalancer support
* modify camelcase name to underscore
* Add registry service type compatibility check
* containerd: change default resolvconf_mode to host_resolvconf
* Wait for kube-apiserver to come back after pod refresh
* Handle resolv.conf gracefully
* Retain currently configured DNS entries to ensure we don't break the resolvers
* Suse uses wickedd for network management so no dhcp hooks
* Molecule: increase ansible timeout
* CI: Increase ansible timeout to 120s for Packet jobs
* Improve control plane scale flow (#13)
* Added version 1.20.10 of K8s
* Setting first_kube_control_plane to a existing one
* Setting first_kube_control_plane to a existing one
* change first_kube_master for first_kube_control_plane
* Ansible-lint changes
If trying to pull k8scsi/csi-resizer image from gcr.io, we face the error
like:
$ docker pull gcr.io/k8scsi/csi-resizer:v1.0.0
Error response from daemon: Head https://gcr.io/v2/k8scsi/csi-resizer/
manifests/v1.0.0: unknown: Project 'project:k8scsi' not found or deleted.
$
We can pull the image from quay.io instead.
This fixes the issue.
* containerd: add hashes for 1.5.8 and 1.4.12 and make 1.5.8 the new default
* containerd: make nerdctl mandatory for container_manager = containerd
* nerdctl: bump to version 0.14.0
* containerd: use nerdctl for image manipulation
* OpenSuSE: install basic nerdctl dependencies
* set ingress-nginx default terminationGracePeriodSeconds to 5 min for the drain of connection
* Add ingress_nginx_termination_grace_period_seconds at sample inventory
install containerd on centos need to binary download it
but offline.yml has no that value
binary download url default in
roles/download/defaults/main.yml:runc_download_url: "https://github.com/opencontainers/runc/releases/download/{{ runc_version }}/runc.{{ image_arch }}"
roles/download/defaults/main.yml:containerd_download_url: "https://github.com/containerd/containerd/releases/download/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
if i use default offlie.yml, it's error from task download files
because runc,containerd down url is none offline
i want fix this
just add 2 new line
* Docs: update CONTRIBUTING.md
* Docs: clean up outdated roadmap and point to github issues instead
* Docs: update note on kubelet_cgroup_driver
* Docs: update kata containers docs with note about cgroup driver
* Docs: note about CI specific overrides
* Defaults: replace docker with containerd as our default container_manager
* CI: Use docker for download_localhost test
* Defaults: with container_manager=containerd we need etcd_deployment_type=host
* CI: Run weave jobs with docker
* CI: Vagrant don't download_force_cache
* CI: Fix upgrade tests
* should run compatible with old settings, this means docker
* we need to run with a distro that has at least modern containerd,
this means move from debian9 to debian10 to allow `containerd_version`
to match between 2.17 and master
* add metallb auto-assign property for main IP range & update addons.yml for sample inventory
* add new line at the end of file roles\kubernetes-apps\metallb\defaults\main.yml
* set default value for matallb_auto_assign = true
* Fixes various issues in vSphere Terraform code
Provided to address various shortcomings and to fix the following
issue in upstream Kubespray:
https://github.com/kubernetes-sigs/kubespray/issues/8176
* Resolves Terraform formatting issues
* Sets default prefix to human-readable name
* Documents new default prefix in README
* Ansible: separate requirements files for supported ansible versions
* Ansible: allow using ansible 2.11
* CI: Exercise Ansible 2.9 and Ansible 2.11 in a basic AIO CI job
* CI: Allow running a reset test outside of idempotency tests and running it in stage1
* CI: move ubuntu18-calico-aio job to stage2 and relay only on ubuntu20 with the variously supported ansible versions for stage1
* CI: add capability to install collections or roles from ansible-galaxy to mitigate missing behavior in older ansible versions
* Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net
* Kata-containers: use similar testing strategy as gvisor
* Kata-Containers: adjust values for 2.2.0 defaults
Make CI tests actually pass
* Kata-Containers: bump to 2.2.2 to fix sandbox_cgroup_only issue
* Limit kubectl delete node to k8s nodes
This avoids the use of `kubectl delete node` when removing etcd nodes
which are not part of the cluser (separate etcd)
* Take errors into account when deleting node
There should not be error now that we're limiting the deletion to nodes
actually in the cluster
* Retrying on error
* Ensure addon-resizer 1.8.11 only effective at arch amd64.
k8s.gcr.io/addon-resizer:1.8.11 returns the amd64 image which is not executable at arm64.
Disable addon-resizer when the platform is not amd64.
When metrics-server upgrade and use addon-resizer:2.3, then revert this
commit and `image_arch` will determine the `addon_resizer_image_tag`.
* Add metrics_server_resizer architectures check
* nginx-ingress: bump to 1.0.4
* Disable builtin ssl_session_cache solving the problem with OpenSSL consuming memory.
* Print warning only instead of error if no IngressClass permission is available.
* nginx-ingress: bump to 1.0.4 in the README
* Disable builtin ssl_session_cache solving the problem with OpenSSL consuming memory.
* Print warning only instead of error if no IngressClass permission is available.
* Containerd: download containerd from upstream instead of using distro specific packages
split runc download to separate role
make bootstrap-os role deploy container-selinux and seccomp libraries
clean up package manager provided containerd
move variables to docker role that are no longer common with containerd
* Containerd: make molecule testing more relevant
* replace ubuntu18 with ubuntu20
* add centos8 and debian11 to molecule tests
* run kubernetes/preinstall role to ensure relevancy
of test including dependency packages
* CI: adjust test scenarios for downloaded containerd
kube-bench scan outputs warning related to Calico like:
* text: "Ensure that the Container Network Interface file
permissions are set to 644 or more restrictive (Manual)"
* text: "Ensure that the Container Network Interface file
ownership is set to root:root (Manual)"
This fixes these warnings.
* netchecker: update images to 1.2.2 from Mirantis which is slightly less ancinet than the l23networks images
* Netchecker: use local etcd instead of kubernetes v1beta1 crds which are no longer suported by kube 1.22+
* Add Rocky as a known OS
* Make sure Rocky includes bootstrap-centos.yml
* Update docs with Rocky Linux
* Rocky Linux wireguard and EPEL
* Rocky Linux in the list of supported distributions
If the etcd cluster is separate and the etcd_deployment_type is "host",
there is no need for a container engine on the etcd nodes
Do not rely on a 'default(true)' filter, but define a proper default in
kubespray-defaults depending on etcd deployment method and if internal
or external etcd is used
to remove deprecation warning:
> Flag --feature-gates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag.
Kubespray deployment failed when using containerd backend on nodes that apparmor was not installed or previously removed. This PR ensure apparmor is installed by adding it into required_pkgs var.
* Fix invalid link to Ansible documentation
* Fix invalid link to mitogen doc page
* Fix invalid link to calico doc page
* Fix all invalid links to doc pages
The addon-resizer container can reduce resource limits of cpu and
memory of metrics-server container in the pod, and that caused
OOMKilled.
In addition, the original metrics-server manifest doesn't contain
the addon-resizer container as [1].
So this adds metrics_server_resizer option to control the addon-resizer
container deployment and the default value is false to make it stable
for most environments.
[1]: 527679e5e8/manifests/base/deployment.yaml
"allowPrivilegeEscalation: false" blocks deploying metrics-server
on CentOS7. In addition, the original metrics-server manifest doesn't
contain it as [1]. This removes it.
[1]: 527679e5e8/manifests/base/deployment.yaml
* Kata-Containers: add 2.2.0 hashes and make default
* Kata-Containers: replace 2.1.0 with bugfix version 2.1.1
* Kata-Containers: move to q35 a more modern VM architecture as 'pc' is removed in 2.2.0
Kubespray deployment failed when using containerd backend on nodes that apparmor was not installed or previously removed. This PR ensure apparmor is installed by adding it into required_pkgs var.
The path of kubeconfig should be configurable, and its default value
is /etc/kubernetes/admin.conf. Most paths of the file are configurable
but some were not. This make those configurable.
* Calico: make calico_min_version check relevant
* Calico: only check currently installed version against the oldest supported version by the previous release
Modify connection_strings_etcd to only return etcd nodes - not master nodes - since this results in duplicate hosts in the generated Ansible inventory and is unnecessary.
On Debian 11, `ipset` just recommend `iptables` so on the system that apt is configured with `APT::Install-Recommends "0";` iptables will not install automatically.
* Fix: adding new ips with inventory builder (#7577)
* moved conflig loading logic
to after checking whether the config
should be loaded, and added check for
whether the config should be loaded
* added check for removing nodes from config
if the user wants to remove a node, we
need to load the config
* Fix tox errors
* Fix missing file mode (risky-file-permissions)
Found this using ansible-lint.
Signed-off-by: Bryan Hundven <bryanhundven@gmail.com>
* Fix another missing file mode (risky-file-permissions)
This one fixes `/etc/crio/config.json`
Signed-off-by: Bryan Hundven <bryanhundven@gmail.com>
* CSI: update CSI snapshot CRDs
* CSI: update snapshot controller tag version with kubernetes specific versions
* CSI: allow enabling csi_snapshot_controller independent of Cinder CSI
* CSI: Align csi-snapshot-controller with upstream and use a Deployment instead of a StatefulSet
When using Calico with:
- `calico_network_backend: vxlan`,
- `calico_ipip_mode: "Never"`,
- `calico_vxlan_mode: "Always"`,
the `FelixConfiguration` object has `ipipEnabled: true`, when it should be false:
This is caused by an error in the `| bool` conversion in the install task:
when `calico_ipip_mode` is `Never`,
`{{ calico_ipip_mode != 'Never' | bool }}` evaluates to `true`:
* Fedora and RHEL use etc_t and the convention is <type_name>_t
* Docs: specify all values for preinstall_selinux_state
* CI: Add Fedora 34 with SELinux in enforcing mode
using --no-cache-dir flag in pip install ,make sure downloaded packages
by pip don't cached on system . This is a best practice which make sure
to fetch from repo instead of using local cached one . Further , in case
of Docker Containers , by restricting caching , we can reduce image size.
In term of stats , it depends upon the number of python packages
multiplied by their respective size . e.g for heavy packages with a lot
of dependencies it reduce a lot by don't caching pip packages.
Further , more detail information can be found at
https://medium.com/sciforce/strategies-of-docker-images-optimization-2ca9cc5719b6
Signed-off-by: Pratik Raj <rajpratik71@gmail.com>
Fix task 'Cert Manager | Wait for Webhook pods become ready' failed due to webhook pods don't exist yet by using `retries..until` trick like kubernetes-sigs/kubespray#7842
This fix should be removed in the future if the kubernetes/kubernetes#83242 is resolved.
Signed-off-by: rtsp <git@rtsp.us>
The main functions are wrapped by a sys.exit function which expects and
argument. The curent implementation isn't returning values in all cases.
This change ensures main functions return a value in all cases.
Fix task 'Cert Manager | Apply ClusterIssuer manifest' failed due to service/endpoints updating delayed even though the wekhook pod status is ready.
Signed-off-by: rtsp <git@rtsp.us>
Changes:
* ClusterRole updated according to the latest manifests from
https://github.com/kubernetes/cloud-provider-vsphere
* vSphere CPI/CSI default versions bumped and
tested successfully on K8S 1.21.1
* vSphere documentation updated
Signed-off-by: Vitaliy D <vi7alya@gmail.com>
non-kubeadm mode has been removed since ddffdb63bf
2.5 years ago. The non-kubeadm makes unnecessary confusion today, then
this updates the documentation.
Previously IDs of container images were gotten from tar files of container
images but that way was wrong. If multiple json files are contained in a
tar file, the script got multiple IDs and tried to pass these IDs on
`docker tag` command. Then the command was failed.
This updates the script to get image IDs from `docker image inspect` command
to fix this issue.
In addition, this adds a check a registry container exists already or not
before deploying registry container to avoid a container conflict failure.
* CRI-O: Install libseccomp2 from backports on Debian 10
libseccomp2 is a required dependency of cri-o-runc package
The one provided in Debian 10 repositories is outdated
* 7816: Remove useless when condition
As this condition is handled by block
* fix(misc): terraform/aws
- handles deployment with a single availability zone
- handles deployment with more than two availability zone
- handles etcd collocation with control-plane nodes (`aws_etcd_num=0`)
- allows to set a bastion instances count (`aws_bastion_num`)
- allows to set bastion/etcd/control-plane/workers rootfs volume size
- removes variables from terraform.tfvars that were not re-used
- adds .terraform.lock.hcl to .gitignore
- changes/updates base image from ubuntu-18.03 to debian-10
tested by a few coworkers of mine, and myself: thanks for the outstanding
work, on both those terraform samples and kubespray playbooks.
I did not test ubuntu deployments, I could still swap from buster to
focal. LMK.
* fix(gitlab-ci)
AFAIU, terraform.tfvars indentation should be fixed for / no diff
returned running `terraform fmt -check -diff`
https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/1445622114
To download necessary files in advance for offline deployment,
we can see all file URLs with contrib/offline/generate_list.sh
Most URLs are downloadable, but gvisor's one is not because the
URL is a part of full URLs for gvisor.
To download gvisor's files from the URLs directory, this separates
into two URLs for runsc and the shim.
tf-elax_ubuntu18-calico is so flake today. The test job is failed
due to SSH connectivity check error after deploying virtual machines
which are used for Kubernetes nodes.
This allows failure on the job to see the test situation without
pull request merger failures.
When running the script, I faced the following error but it was
difficult to know the root problem due to lack of error handling.
docker tag" requires exactly 2 arguments.
See 'docker tag --help'.
Usage: docker tag SOURCE_IMAGE[:TAG] TARGET_IMAGE[:TAG]
Create a tag TARGET_IMAGE that refers to SOURCE_IMAGE
To investigate such errors easily, this adds an error handling.
* csi-driver: Added possibility to use application credentials for cinder
* external-cloud-controller: Added env vars for openstack application credentials
* set selinux type t_etc if selinux state is enforcing
* workaround with update repo is no longer needed
remove comments about failing playbook
* grubby is not available in distros using ostree
* remove docker support because removed in fcos
update install script example with live rootfs
* do not call grubby on ostree based distro
* update docs enabling containerd on fedora coreos
* Ansible: move to Ansible 3.4.0 which uses ansible-base 2.10.10
* Docs: add a note about ansible upgrade post 2.9.x
* CI: ensure ansible is removed before ansible 3.x is installed to avoid pip failures
* Ansible: use newer ansible-lint
* Fix ansible-lint 5.0.11 found issues
* syntax issues
* risky-file-permissions
* var-naming
* role-name
* molecule tests
* Mitogen: use 0.3.0rc1 which adds support for ansible 2.10+
* Pin ansible-base to 2.10.11 to get package fix on RHEL8
* terraform/openstack: Use path.root for ansible_bastion_template.txt
The path.root variable points to the root module path. Using this
instead of a relative path makes less assumptions about the current
working directory.
* terraform/openstack: Add group_vars_path variable
Previously, the group_vars path was assumed to be in CWD. The
default value for the group_vars_path variable is still relative
to CWD and thus should be backwards compatible if unset.
* Calico: align manifests with upstream
* allow enabling typha prometheus metrics
* Calico: enable eBPF support
* manage the kubernetes-services-endpoint configmap
* Calico: document the use of eBPF dataplane
* Calico: improve checks before deployment
* enforce disabling kube-proxy when using eBPF dataplane
* ensure calico_version is supported
* Kata: add Kata 2.x checksums and adjust download urls for 2.x
* Kata: drop 1.x version which is no longer supported
* Kata: set default version 2.1.0
* Packet->Equinix Metal rename #6901
Updates throughout to reflect #6901 renaming for Packet to Equinix Metal.
* Rename Packet to Equinix Metal throughout the project #6901
Packet is renamed to Equinix Metal in more contexts including
documentation links. The Terraform provider used is still the Packet
provider. The environment variables and configuration options still
refer to the Packet name.
Signed-off-by: Marques Johansson <mjohansson@equinix.com>
Co-authored-by: Edward Vielmetti <ed@packet.net>
* Calico: add v3.19.1 hashes
* enable liveness probe for calico-kube-controllers
3.19.1
* Calico: drop support for v3.16.x
* Calico: promote v3.18.3 as default
* Override the default value of containerd's root, state, and oom_score configurations
* Add tests data for containerd_storage_dir, containerd_state_dir and containerd_oom_score variables
Basically we need to make necessary TCP/UDP ports open.
However the necessary ports are so many, and sometimes it is difficult
to figure out that is due to firewall issues or not if facing deployment
issues.
To distinguish a root problem on such situation, this adds contrib
playbook to disable the service firewall for Kubespray development
and test.
* add support for using ansible 2.10.x for deploying kubespray
* move dns-autoscaler-clusterrole{binding}.yml to files/ folder
* note that ansible 2.10 is now experimentally supported
* coredns: move files to templates like before #4341
* add initial MetalLB docs
* metallb allow disabling the deployment of the metallb speaker
* calico>=3.18 allow using calico to advertise service loadbalancer IPs
* Document the use of MetalLB and Calico
* clean MetalLB docs
Since K8S 1.21, BoundServiceAccountTokenVolume feature gate is in beta stage, thus activated by default (anyone who follows CSI guidelines has enabled AllAlpha and faced the issue before 1.21).
With this feature, SA tokens are regenerated every hour.
As a consequence for Calico CNI, token in /etc/cni/net.d/calico-kubeconfig copied from /var/run/secrets/kubernetes.io/serviceaccount in install-cni initContainer expires after one hour and any pod creation fails due to unauthorization.
Calico pods need to be restarted so that /etc/cni/net.d/calico-kubeconfig is updated with the new SA token.
follow new naming conventions for gcr's coredns image.
starting from 1.21 kubeadm assumes it to be `coredns/coredns`:
this causes the kubeadm deployment being unable to pull image, beacuse `v`
was also added in image tag, until the role `kubernetes-apps` ovverides
it with the old name, which is only compatible with <=1.7.
Backward comptability with kubeadm <=1.20 is mantained checking
kubernetes version and falling back to old names (`coredns:1.xx`) when
the version is less than 1.21
* rename ansible groups to use _ instead of -
k8s-cluster -> k8s_cluster
k8s-node -> k8s_node
calico-rr -> calico_rr
no-floating -> no_floating
Note: kube-node,k8s-cluster groups in upgrade CI
need clean-up after v2.16 is tagged
* ensure old groups are mapped to the new ones
* crio: add supported versions 1.20 and 1.21 and align default with k8s version
* cri-o: drop versions 1.17 and 1.18 from version matrix
* update note on cri-o version alignment
* calico: drop support for version 3.15
* drop check for calico version >= 3.3, we are at 3.16 minimum now
* we moved to calico 3.16+ so we can default to /opt/cni/bin/install
* AlmaLinux: ansible>2.9.19 is needed to know about AlmaLinux
* AlmaLinux: identify as a centos derrivative
* AlmaLinux: add AlmaLinux to checks for CentOS
* Use ansible_os_family to compare family and not distribution
As the official document[1], the parameter keepcache should be
'0' or '1' as string. To avoid the following warning message,
this fixes the parameter value:
[WARNING]: The value False (type bool) in a string field was
converted to u'False' (type string). If this does not look
like what you expect, quote the entire value to ensure it
does not change.
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/yum_repository_module.html
Context: Load-balancing in Exoscale is performed by associating many
workers with the same EIP. This works, however, the workers cannot access
themselves via the EIP, which is needed at least for cert-managers
"self-test".
Problem: The old iptables based workaround felt fragile and disappointed
me at least once.
New solution: Add the EIP to a loopback interface on each worker.
* Add containerd_extra_args
This is useful for custom containerd config, e.g. auth
Signed-off-by: Zhong Jianxin <azuwis@gmail.com>
* Make containerd config.toml mode 0640
It may contain sensitive information like password
Signed-off-by: Zhong Jianxin <azuwis@gmail.com>
This PR is to move the cilium kvstore options to the configmap
rather than specifying them in the deployment as args. This
is not technically necessary but keeping all the options in
one place is probably not a bad idea.
Tested with cilium 1.9.5.
When attempting a fresh install without cilium_ipsec_enabled I ran
into the following error:
failed: [k8m01] (item={'name': 'cilium', 'file': 'cilium-secret.yml', 'type': 'secret', 'when': 'cilium_ipsec_enabled'}) =>
{"ansible_loop_var": "item", "changed": false, "item": {"file": "cilium-secret.yml", "name": "cilium", "type": "secret",
"when": "cilium_ipsec_enabled"},"msg": "AnsibleUndefinedVariable: 'cilium_ipsec_key' is undefined"}
Moving the when condition from the item level to the task level solved
the issue.
* Add KubeSchedulerConfiguration for k8s 1.19 and up
With release of version 1.19.0 of kubernetes KubeSchedulerConfiguration
was graduated to beta. It allows to extend different stages of
scheduling with profiles. Such effect is achieved by using plugins and
extensions.
This patch adds KubeSchedulerConfiguration for versions 1.19 and later.
Configuration is set to k8s defaults or to kubespray vars. Moving those
defaults to new vars will be done in following patch.
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
* KubeSchedulerConfiguration: add defaults
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
* Add documentation for audit webhook variables
* Enclose the value of audit_webhook_server_url in a codeblock
* Add default value for audit_webhook_batch_max_wait
Starting with Cilium v1.9 the default ipam mode has changed to "Cluster
Scope". See:
https://docs.cilium.io/en/v1.9/concepts/networking/ipam/
With this ipam mode Cilium handles assigning subnets to nodes to use
for pod ip addresses. The default Kubespray deploy uses the Kube
Controller Manager for this (the --allocate-node-cidrs
kube-controller-manager flag is set). This makes the proper ipam mode
for kubespray using cilium v1.9+ "kubernetes".
Tested with Cilium 1.9.5.
This PR also mounts the cilium-config ConfigMap for this variable
to be read properly.
In the future we can probably remove the kvstore and kvstore-opt
Cilium Operator args since they can be in the ConfigMap. I will tackle
that after this merges.
When upgrading cilium from 1.8.8 to 1.9.5 I ran into the following
error:
level=error msg="Unable to update CRD" error="customresourcedefinitions.apiextensions.k8s.io
\"ciliumnodes.cilium.io\" is forbidden: User \"system:serviceaccount:kube-system:cilium-operator\"
cannot update resource \"customresourcedefinitions\" in API group \"apiextensions.k8s.io\" at the
cluster scope" name=CiliumNode/v2 subsys=k8s
The fix was to add the update verb to the clusterrole. I also added
create to match the clusterrole created by the cilium helm chart.
DNSSEC is off by default on ubuntu/bionic64 (18.04) as per resolved.conf(5).
These tasks are artefacts of obsolete infra configuration, and no longer needed.
Further removing these tasks resolves the issue that the tasks always reports
'changed' and bounces systemd-resolved unneccesarily, even if there was no
actual modification of /etc/systemd/resolved.conf.
* Remove contrib/vault
This is marked as broken since 2018 / 3dcb914607
This still reference apiserver.pem, not used since ddffdb63bf
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Finish nuking vault from the codebase
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
This replaces kube-master with kube_control_plane because of [1]:
The Kubernetes project is moving away from wording that is
considered offensive. A new working group WG Naming was created
to track this work, and the word "master" was declared as offensive.
A proposal was formalized for replacing the word "master" with
"control plane". This means it should be removed from source code,
documentation, and user-facing configuration from Kubernetes and
its sub-projects.
NOTE: The reason why this changes it to kube_control_plane not
kube-control-plane is for valid group names on ansible.
[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/2067-rename-master-label-taint/README.md#motivation
While at it remove force_certificate_regeneration
This boolean only forced the renewal of the apiserver certs
Either manually use k8s-certs-renew.sh or set auto_renew_certificates
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Add crun download_url and checksum
* Change versioning format to crun native versioning
* Download crun using download_file.yml
* Get crun version from download defaults
* Delegate crun binary copy task to crun role
* Download Calico KDD CRDs
* Replace kustomize with lineinfile and use ansible assemble module
* Replace find+lineinfile by sed in shell module to avoid nested loop
* add condition on sed
* use block for kdd tasks + remove supernumerary kdd manifest apply in start "Start Calico resources"
"The error was: 'proxy_disable_env' is undefined\n\nThe error appears to
be in '<censored>scale.yml': line 72, column 7"
Fixes 067db686f6
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
upgrades.md explains how to do upgrade from v1.4.3 to v1.4.6 as an
example. The versions are a little old, and the doc readers would
have a concern the upgrade works fine or not.
This updates versions after verifying the way works fine by hands.
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* Updates to README.md and main.tf files
* formatting and updating readme
* added a .terraform_validate CI job
* fixed format issue
* added sample inventory
* added symbolic link to group_vars
* added missing tf variables and minor fixes
* added text formatting
* minor formatting fixes
* add nodeselector and tolerations for metallb
* remove unnecessary commented lines in metallb template
* set default speaker toleration to match original manifest
When privileged is enabled for a container, all the `/dev/*` block
devices from the host are mounted into the guest. The
`privileged_without_host_devices` flag prevents host devices from
being passed to privileged containers.
More information:
* https://github.com/containerd/cri/pull/1225
* 1d0f68156b
The important action in kubeadm-version.yml is the templating of the configuration,
not finding / setting the version
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
kubeadm is the default for a long time now,
and admin.conf is created by it, so let kubeadm handle it
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Using `kubeadm init phase kubeconfig all` breaks kubelet client certificate rotation
as we are missing `kubeadm init phase kubelet-finalize all` to point to `kubelet-client-current.pem`
kubeconfig format is stable so let's just use lineinfile,
this will avoid other future breakage
This revert to the logic before 6fe2248314
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
On CentOS 8 they seem to be ignored by default, but better be extra safe
This also make it easy to exclude other network plugin interfaces
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Add terraform scripts for vSphere
* Fixup: Add terraform scripts for vSphere
* Add inventory generation
* Use machines var to provide IPs
* Add README file
* Add default.tfvars file
* Fix newlines at the end of files
* Remove master.count and worker.count variables
* Fixup cloud-init formatting
* Fixes after initial review
* Add warning about disabled DHCP
* Fixes after second review
* Add sample-inventory
* use external_openstack_lbaas_use_octavia for template openstack-cloud-config
* Delete external_openstack_lbaas_use_octavia from default values. Added description and default values of variables to docs
* markdown fix
* make this simple
* set external_openstack_lbaas_use_octavia in default values
* duplicated variable in doc
This replaces KUBE_MASTERS with KUBE_CONTROL_HOSTS because of [1]:
```
The Kubernetes project is moving away from wording that is
considered offensive. A new working group WG Naming was created
to track this work, and the word "master" was declared as offensive.
A proposal was formalized for replacing the word "master" with
"control plane". This means it should be removed from source code,
documentation, and user-facing configuration from Kubernetes and
its sub-projects.
```
[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/2067-rename-master-label-taint/README.md#motivation
Since a790935d02 all proxy users
should be properly configured
Now when you have *_PROXY vars in your environment it can leads to failure
if NO_PROXY is not correct, or to persistent configuration changes
as seen with kubeadm in 1c5391dda7
Instead of playing constant whack-a-bug, inject empty *_PROXY vars everywhere
at the play level, and override at the task level when needed
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Before this commit, we were gathering:
1 !all
7 network
7 hardware
After we are gathering:
1 !all
1 network
1 hardware
ansible_distribution_major_version is gathered by '!all'
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Move proxy_env to kubespray-defaults/defaults
There is no reasons to use set_facts here
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Ensure kubeadm doesn't use proxy
*_proxy variables might be present in the environment (/etc/environment, bash profile, ...)
When this is the case we end up with those proxy configuration in /etc/kubernetes/manifests/kube-*.yaml manifests
We cannot unset env variables, but kubeadm is nice enough to ignore empty vars
93d288e2a4/cmd/kubeadm/app/util/env.go (L27)
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Ubuntu 18.04 crio package ships with 'mountopt = "nodev,metacopy=on"'
even if GA kernel is 4.15 (HWE Kernel can be more recent)
Fedora package ships without metacopy=on
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
By default Ansible stat module compute checksum, list extended attributes and find mime type
To find all stat invocations that really use one of those:
git grep -F stat. | grep -vE 'stat.(islnk|exists|lnk_source|writeable)'
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
`containerd.io` is the companion package of `docker-ce` and is the
proper package name. This is needed to avoid apt upgrade/dist-upgrade
from breaking kubernetes.
Running remove-node.yml tasks for clean up cluster on Fedora CoreOS.
The task failed to restart network daemon (task name: "reset | Restart network").
Fedora CoreOS is essentially using NetworkManager, but this task returns network.
Signed-off-by: Takashi IIGUNI <iiguni.tks@gmail.com>
* Add unique annotation on coredns deployment and only remove existing deployment if annotation is missing.
* Ignore errors when gathering coredns deployment details to handle case where it doesn't exist yet
* Remove run_once, deletegate_to and add to when statement
* Added force_etcd_cert_refresh var to maintain existing functionality. Broke out etcd node cert syncing from member and admin cert sync logic. Now first etcd will sync node certs to other etcd members on every run to keep all etcds up to date after adding additional worker nodes to the cluster
* Updated etcd cert check tasks to better detect when new certificates need to be generated
* Move usage of force_etcd_cert_refresh var to gen_certs fact set
* Force etcd cert generation per server if force_etcd_cert_refresh is set to true
* Include gathering of node certs even if k8s-cluster member and in etcd group.
* Removed run_once due to when statement
Helm v3.5.2 is a security (patch) release. Users are strongly
recommended to update to this release. It fixes two security issues in
upstream dependencies and one security issue in the Helm codebase.
See https://github.com/helm/helm/releases/tag/v3.5.2
This makes the docker role work the same as the containerd role.
Being able to override this is needed when you have your own debian
repository. E.g. when performing an airgapped installation
* contrib/terraform/exoscale: Rework SSH public keys
Exoscale has a few limitations with `exoscale_ssh_keypair` resources.
Creating several clusters with these scripts may lead to an error like:
```
Error: API error ParamError 431 (InvalidParameterValueException 4350): The key pair "lj-sc-ssh-key" already has this fingerprint
```
This patch reworks handling of SSH public keys. Specifically, we rely on
the more cloud-agnostic way of configuring SSH public keys via
`cloud-init`.
* contrib/terraform/exoscale: terraform fmt
* contrib/terraform/exoscale: Add terraform validate
* contrib/terraform/exoscale: Inline public SSH keys
The Terraform scripts need to install some SSH key, so that Kubespray
(i.e., the "Ansible part") can take over. Initially, we pointed the
Terraform scripts to `~/.ssh/id_rsa.pub`. This proved to be suboptimal:
Operators sharing responbility for a cluster risk unnecessarily replacing resources.
Therefore, it has been determined that it's best to inline the public
SSH keys. The chosen variable `ssh_public_keys` provides some uniformity
with `contrib/azurerm`.
* Fix Terraform Exoscale test
* Fix Terraform 0.14 test
* update local-path-storage config template to version v0.0.19
* changes local_path_provisioner image tag to v0.0.19
* removes copy paste example from rancher local-path-provisioner repo
According to the following recommendation, this moves the directory
to control-plane:
The Kubernetes project is moving away from wording that is considered
offensive. A new working group WG Naming was created to track this work,
and the word "master" was declared as offensive. A proposal was formalized
for replacing the word "master" with "control plane".
Fixes the following error when using Bastion Node with the sample config.
```
fatal: [bastion]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'bastion'\n\nThe error appears to be in '/home/felix/inovex/kubespray/roles/bastion-ssh-config/tasks/main.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: set bastion host IP\n ^ here\n"}
```
Previous check for presence of NM assumed "systemctl show
NetworkManager" would exit with a nonzero status code, which seems not
the case anymore with recent Flatcar Container Linux.
This new check also checks the activeness of network manager, as
`is-active` implies presence.
Signed-off-by Jorik Jonker <jorik@kippendief.biz>
This was introduced in 143e2272ff
Extra repo is enabled by default in CentOS, and is not the right repo for EL8
Instead of adding a CentOS repo to RHEL, enable the needed RHEL repos with rhsm_repository
For RHEL 7, we need the "extras" repo for container-selinux
For RHEL 8, we need the "appstream" repo for container-selinux, ipvsadm and socat
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Only checking the kubernetes api on the first master when upgrading is not enough.
Each master needs to be checked before it's upgrade.
Signed-off-by: Rick Haan <rickhaan94@gmail.com>
yum_repository expect really different params, so nothing to factor here
Ubuntu is not an ansible_os_family, the OS family for Ubuntu is Debian
Check for ansible_pkg_mgr == apt
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
we don't need rpm_key, so nothing to factor here
Ubuntu is not an ansible_os_family, the OS family for Ubuntu is Debian
Check for ansible_pkg_mgr == apt
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Here the desciption from Ansible docs
Corresponds to the --force-yes to apt-get and implies allow_unauthenticated: yes
This option will disable checking both the packages' signatures and the certificates of the web servers they are downloaded from.
This option *is not* the equivalent of passing the -f flag to apt-get on the command line
**This is a destructive operation with the potential to destroy your system, and it should almost never be used.** Please also see man apt-get for more information.
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
about: Support request or question relating to Kubespray
labels: kind/support
---
<!--
STOP -- PLEASE READ!
GitHub is not the right place for support requests.
If you're looking for help, check [Stack Overflow](https://stackoverflow.com/questions/tagged/kubespray) and the [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/).
You can also post your question on the [Kubernetes Slack](http://slack.k8s.io/) or the [Discuss Kubernetes](https://discuss.kubernetes.io/) forum.
If the matter is security related, please disclose it privately via https://kubernetes.io/security/.
## How to become a contributor and submit your own code
@ -6,11 +6,23 @@
It is recommended to use filter to manage the GitHub email notification, see [examples for setting filters to Kubernetes Github notifications](https://github.com/kubernetes/community/blob/master/communication/best-practices.md#examples-for-setting-filters-to-kubernetes-github-notifications)
To install development dependencies you can use`pip install -r tests/requirements.txt`
To install development dependencies you can set up a python virtual env with the necessary dependencies:
```ShellSession
virtualenv venv
source venv/bin/activate
pip install -r tests/requirements.txt
ansible-galaxy install -r tests/requirements.yml
```
#### Linting
Kubespray uses `yamllint` and `ansible-lint`. To run them locally use `yamllint .` and `ansible-lint`
Kubespray uses [pre-commit](https://pre-commit.com) hook configuration to run several linters, please install this tool and use it to run validation tests before submitting a PR.
```ShellSession
pre-commit install
pre-commit run -a # To run pre-commit hook on all files in the repository, even if they were not modified
```
#### Molecule
@ -27,5 +39,9 @@ Vagrant with VirtualBox or libvirt driver helps you to quickly spin test cluster
1. Submit an issue describing your proposed change to the repo in question.
2. The [repo owners](OWNERS) will respond to your issue promptly.
3. Fork the desired repo, develop and test your code changes.
4.Sign the CNCF CLA (<https://git.k8s.io/community/CLA.md#the-contributor-license-agreement>)
5.Submit a pull request.
4.Install [pre-commit](https://pre-commit.com) and install it in your development repo.
5.Addess any pre-commit validation failures.
6. Sign the CNCF CLA (<https://git.k8s.io/community/CLA.md#the-contributor-license-agreement>)
7. Submit a pull request.
8. Work with the reviewers on their suggestions.
9. Ensure to rebase to the HEAD of your target branch and squash un-necessary commits (<https://blog.carbonfive.com/always-squash-and-rebase-your-git-commits/>) before final merger of your contribution.
If you have questions, check the documentation at [kubespray.io](https://kubespray.io) and join us on the [kubernetes slack](https://kubernetes.slack.com), channel **\#kubespray**.
You can get your invite [here](http://slack.k8s.io/)
- Can be deployed on **[AWS](docs/aws.md), GCE, [Azure](docs/azure.md), [OpenStack](docs/openstack.md), [vSphere](docs/vsphere.md), [Packet](docs/packet.md) (bare metal), Oracle Cloud Infrastructure (Experimental), or Baremetal**
- Can be deployed on **[AWS](docs/cloud_providers/aws.md), GCE, [Azure](docs/cloud_providers/azure.md), [OpenStack](docs/cloud_providers/openstack.md), [vSphere](docs/cloud_providers/vsphere.md), [Equinix Metal](docs/cloud_providers/equinix-metal.md) (bare metal), Oracle Cloud Infrastructure (Experimental), or Baremetal**
- **Highly available** cluster
- **Composable** (Choice of the network plugin for instance)
- Supports most popular **Linux distributions**
@ -13,16 +13,16 @@ You can get your invite [here](http://slack.k8s.io/)
## Quick Start
To deploy the cluster you can use :
Below are several ways to use Kubespray to deploy a Kubernetes cluster.
### Ansible
#### Usage
```ShellSession
# Install dependencies from ``requirements.txt``
sudo pip3 install -r requirements.txt
Install Ansible according to [Ansible installation guide](/docs/ansible/ansible.md#installing-ansible)
then run the following steps:
```ShellSession
# Copy ``inventory/sample`` as ``inventory/mycluster``
Note: When Ansible is already installed via system packages on the control machine, other python packages installed via `sudo pip install -r requirements.txt` will go to a different directory tree (e.g. `/usr/local/lib/python2.7/dist-packages` on Ubuntu) from Ansible's (e.g. `/usr/lib/python2.7/dist-packages/ansible` still on Ubuntu).
As a consequence, `ansible-playbook` command will fail with:
Note: When Ansible is already installed via system packages on the control node,
Python packages installed via `sudo pip install -r requirements.txt` will go to
a different directory tree (e.g. `/usr/local/lib/python2.7/dist-packages` on
Ubuntu) from Ansible's (e.g. `/usr/lib/python2.7/dist-packages/ansible` still on
Ubuntu). As a consequence, the `ansible-playbook` command will fail with:
```raw
ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.
```
probably pointing on a task depending on a module present in requirements.txt (i.e. "unseal vault").
This likely indicates that a task depends on a module present in ``requirements.txt``.
One way of solving this would be to uninstall the Ansible package and then, to install it via pip but it is not always possible.
A workaround consists of setting `ANSIBLE_LIBRARY` and `ANSIBLE_MODULE_UTILS` environment variables respectively to the `ansible/modules` and `ansible/module_utils` subdirectories of pip packages installation location, which can be found in the Location field of the output of `pip show [package]` before executing `ansible-playbook`.
One way of addressing this is to uninstall the system Ansible package then
reinstall Ansible via ``pip``, but this not always possible and one must
take care regarding package versions.
A workaround consists of setting the `ANSIBLE_LIBRARY`
and `ANSIBLE_MODULE_UTILS` environment variables respectively to
the `ansible/modules` and `ansible/module_utils` subdirectories of the ``pip``
installation location, which is the ``Location`` shown by running
`pip show [package]` before executing `ansible-playbook`.
A simple way to ensure you get all the correct version of Ansible is to use
the [pre-built docker image from Quay](https://quay.io/repository/kubespray/kubespray?tab=tags).
You will then need to use [bind mounts](https://docs.docker.com/storage/bind-mounts/)
to access the inventory and SSH key in the container, like this:
```ShellSession
git checkout v2.25.0
docker pull quay.io/kubespray/kubespray:v2.25.0
docker run --rm -it --mount type=bind,source="$(pwd)"/inventory/sample,dst=/inventory \
See [here](docs/ansible/ansible_collection.md) if you wish to use this repository as an Ansible collection
### Vagrant
For Vagrant we need to install python dependencies for provisioning tasks.
Check if Python and pip are installed:
For Vagrant we need to install Python dependencies for provisioning tasks.
Check that ``Python`` and ``pip`` are installed:
```ShellSession
python -V && pip -V
```
If this returns the version of the software, you're good to go. If not, download and install Python from here <https://www.python.org/downloads/source/>
Install the necessary requirements
Install Ansible according to [Ansible installation guide](/docs/ansible/ansible.md#installing-ansible)
then run the following step:
```ShellSession
sudo pip install -r requirements.txt
vagrant up
```
## Documents
- [Requirements](#requirements)
- [Kubespray vs ...](docs/comparisons.md)
- [Getting started](docs/getting-started.md)
- [Setting up your first cluster](docs/setting-up-your-first-cluster.md)
- [Ansible inventory and tags](docs/ansible.md)
- [Integration with existing ansible repo](docs/integration.md)
- [Deployment data variables](docs/vars.md)
- [DNS stack](docs/dns-stack.md)
- [HA mode](docs/ha-mode.md)
- [Kubespray vs ...](docs/getting_started/comparisons.md)
Note: The list of available docker version is 18.09, 19.03 and 20.10. The recommended docker version is 19.03. The kubelet might break on docker's non-standard version numbering (it no longer uses semantic versioning). To ensure auto-updates don't break your cluster look into e.g. yum versionlock plugin or apt pin).
## Container Runtime Notes
- The cri-o version should be aligned with the respective kubernetes version (i.e. kube_version=1.20.x, crio_version=1.20)
## Requirements
- **Minimum required version of Kubernetes is v1.18**
- **Ansible v2.9.x, Jinja 2.11+ and python-netaddr is installed on the machine that will run Ansible commands, Ansible 2.10.x is not supported for now**
- The target servers must have **access to the Internet** in order to pull docker images. Otherwise, additional configuration is required (See [Offline Environment](docs/offline-environment.md))
- **Minimum required version of Kubernetes is v1.28**
- **Ansible v2.14+, Jinja 2.11+ and python-netaddr is installed on the machine that will run Ansible commands**
- The target servers must have **access to the Internet** in order to pull docker images. Otherwise, additional configuration is required (See [Offline Environment](docs/operations/offline-environment.md))
- The target servers are configured to allow **IPv4 forwarding**.
- If using IPv6 for pods and services, the target servers are configured to allow **IPv6 forwarding**.
- The **firewalls are not managed**, you'll need to implement your own rules the way you used to.
in order to avoid any issue during deployment you should disable your firewall.
- If kubespray is ran from non-root user account, correct privilege escalation method
- If kubespray is run from non-root user account, correct privilege escalation method
should be configured in the target servers. Then the `ansible_become` flag
or command parameters `--become or -b` should be specified.
@ -164,45 +222,44 @@ These limits are safe guarded by Kubespray. Actual requirements for your workloa
## Network Plugins
You can choose between 10 network plugins. (default: `calico`, except Vagrant uses `flannel`)
You can choose among ten network plugins. (default: `calico`, except Vagrant uses `flannel`)
- [Calico](https://docs.projectcalico.org/latest/introduction/) is a networking and network policy provider. Calico supports a flexible set of networking options
- [Calico](https://docs.tigera.io/calico/latest/about/) is a networking and network policy provider. Calico supports a flexible set of networking options
designed to give you the most efficient networking across a range of situations, including non-overlay
and overlay networks, with or without BGP. Calico uses the same engine to enforce network policy for hosts,
pods, and (if using Istio and Envoy) applications at the service mesh layer.
- [canal](https://github.com/projectcalico/canal): a composition of calico and flannel plugins.
- [cilium](http://docs.cilium.io/en/latest/): layer 3/4 networking (as well as layer 7 to protect and secure application protocols), supports dynamic insertion of BPF bytecode into the Linux kernel to implement security services, networking and visibility logic.
- [ovn4nfv](docs/ovn4nfv.md): [ovn4nfv-k8s-plugins](https://github.com/opnfv/ovn4nfv-k8s-plugin) is the network controller, OVS agent and CNI server to offer basic SFC and OVN overlay networking.
- [weave](docs/weave.md): Weave is a lightweight container overlay network that doesn't require an external K/V database cluster.
- [weave](docs/CNI/weave.md): Weave is a lightweight container overlay network that doesn't require an external K/V database cluster.
(Please refer to `weave` [troubleshooting documentation](https://www.weave.works/docs/net/latest/troubleshooting/)).
- [kube-ovn](docs/kube-ovn.md): Kube-OVN integrates the OVN-based Network Virtualization with Kubernetes. It offers an advanced Container Network Fabric for Enterprises.
- [kube-ovn](docs/CNI/kube-ovn.md): Kube-OVN integrates the OVN-based Network Virtualization with Kubernetes. It offers an advanced Container Network Fabric for Enterprises.
- [kube-router](docs/kube-router.md): Kube-router is a L3 CNI for Kubernetes networking aiming to provide operational
- [kube-router](docs/CNI/kube-router.md): Kube-router is a L3 CNI for Kubernetes networking aiming to provide operational
simplicity and high performance: it uses IPVS to provide Kube Services Proxy (if setup to replace kube-proxy),
iptables for network policies, and BGP for ods L3 networking (with optionally BGP peering with out-of-cluster BGP peers).
It can also optionally advertise routes to Kubernetes cluster Pods CIDRs, ClusterIPs, ExternalIPs and LoadBalancerIPs.
- [macvlan](docs/macvlan.md): Macvlan is a Linux network driver. Pods have their own unique Mac and Ip address, connected directly the physical (layer 2) network.
- [macvlan](docs/CNI/macvlan.md): Macvlan is a Linux network driver. Pods have their own unique Mac and Ip address, connected directly the physical (layer 2) network.
- [multus](docs/multus.md): Multus is a meta CNI plugin that provides multiple network interface support to pods. For each interface Multus delegates CNI calls to secondary CNI plugins such as Calico, macvlan, etc.
- [multus](docs/CNI/multus.md): Multus is a meta CNI plugin that provides multiple network interface support to pods. For each interface Multus delegates CNI calls to secondary CNI plugins such as Calico, macvlan, etc.
The choice is defined with the variable `kube_network_plugin`. There is also an
- [custom_cni](roles/network-plugin/custom_cni/) : You can specify some manifests that will be applied to the clusters to bring you own CNI and use non-supported ones by Kubespray.
See `tests/files/custom_cni/README.md` and `tests/files/custom_cni/values.yaml`for an example with a CNI provided by a Helm Chart.
The network plugin to use is defined by the variable `kube_network_plugin`. There is also an
option to leverage built-in cloud provider networking instead.
See also [Network checker](docs/netcheck.md).
See also [Network checker](docs/advanced/netcheck.md).
## Ingress Plugins
- [ambassador](docs/ambassador.md): the Ambassador Ingress Controller and API gateway.
- [nginx](https://kubernetes.github.io/ingress-nginx): the NGINX Ingress Controller.
- [metallb](docs/ingress/metallb.md): the MetalLB bare-metal service LoadBalancer provider.
The Kubespray Project is released on an as-needed basis. The process is as follows:
1. An issue is proposing a new release with a changelog since the last release
2. At least one of the [approvers](OWNERS_ALIASES) must approve this release
3. The `kube_version_min_required` variable is set to `n-1`
4. Remove hashes for [EOL versions](https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md) of kubernetes from `*_checksums` variables.
5.An approver creates [new release in GitHub](https://github.com/kubernetes-sigs/kubespray/releases/new) using a version and tag name like `vX.Y.Z` and attaching the release notes
6. An approver creates a release branch in the form `release-X.Y`
7.The corresponding version of [quay.io/kubespray/kubespray:vX.Y.Z](https://quay.io/repository/kubespray/kubespray) and [quay.io/kubespray/vagrant:vX.Y.Z](https://quay.io/repository/kubespray/vagrant) docker images are built and tagged
8.The `KUBESPRAY_VERSION` variable is updated in `.gitlab-ci.yml`
9.The release issue is closed
10.An announcement email is sent to `kubernetes-dev@googlegroups.com` with the subject `[ANNOUNCE] Kubespray $VERSION is released`
11.The topic of the #kubespray channel is updated with `vX.Y.Z is released! | ...`
1. An issue is proposing a new release with a changelog since the last release. Please see [a good sample issue](https://github.com/kubernetes-sigs/kubespray/issues/8325)
1. At least one of the [approvers](OWNERS_ALIASES) must approve this release
1. (Only for major releases) The `kube_version_min_required` variable is set to `n-1`
1. (Only for major releases) Remove hashes for [EOL versions](https://github.com/kubernetes/website/blob/main/content/en/releases/patch-releases.md) of kubernetes from `*_checksums` variables.
1.Create the release note with [Kubernetes Release Notes Generator](https://github.com/kubernetes/release/blob/master/cmd/release-notes/README.md). See the following `Release note creation` section for the details.
1. An approver creates [new release in GitHub](https://github.com/kubernetes-sigs/kubespray/releases/new) using a version and tag name like `vX.Y.Z` and attaching the release notes
1.(Only for major releases) An approver creates a release branch in the form `release-X.Y`
1.(For major releases) On the `master` branch: bump the version in `galaxy.yml` to the next expected major release (X.y.0 with y = Y + 1), make a Pull Request.
1.(For minor releases) On the `release-X.Y` branch: bump the version in `galaxy.yml` to the next expected minor release (X.Y.z with z = Z + 1), make a Pull Request.
1.The corresponding version of [quay.io/kubespray/kubespray:vX.Y.Z](https://quay.io/repository/kubespray/kubespray) and [quay.io/kubespray/vagrant:vX.Y.Z](https://quay.io/repository/kubespray/vagrant) container images are built and tagged. See the following `Container image creation` section for the details.
1.(Only for major releases) The `KUBESPRAY_VERSION` in `.gitlab-ci.yml` is upgraded to the version we just released # TODO clarify this, this variable is for testing upgrades.
1. The release issue is closed
1. An announcement email is sent to `dev@kubernetes.io` with the subject `[ANNOUNCE] Kubespray $VERSION is released`
1. The topic of the #kubespray channel is updated with `vX.Y.Z is released! | ...`
## Major/minor releases and milestones
@ -46,3 +49,37 @@ The Kubespray Project is released on an as-needed basis. The process is as follo
then Kubespray v2.1.0 may be bound to only minor changes to `kube_version`, like v1.5.1
and *any* changes to other components, like etcd v4, or calico 1.2.3.
And Kubespray v3.x.x shall be bound to `kube_version: 2.x.x` respectively.
If the release note file(/tmp/kubespray-release-note) contains "### Uncategorized" pull requests, those pull requests don't have a valid kind label(`kind/feature`, etc.).
It is necessary to put a valid label on each pull request and run the above release-notes command again to get a better release note
## Container image creation
The container image `quay.io/kubespray/kubespray:vX.Y.Z` can be created from Dockerfile of the kubespray root directory:
- name:Create dind node containers from "containers" inventory section
docker_container:
community.docker.docker_container:
image:"{{ distro_image }}"
name:"{{ item }}"
state:started
@ -69,7 +69,7 @@
# Running systemd-machine-id-setup doesn't create a unique id for each node container on Debian,
# handle manually
- name:Re-create unique machine-id (as we may just get what comes in the docker image), needed by some CNIs for mac address seeding (notably weave)# noqa 301
- name:Re-create unique machine-id (as we may just get what comes in the docker image), needed by some CNIs for mac address seeding (notably weave)
@ -8,18 +8,22 @@ Installs and configures GlusterFS on Linux.
For GlusterFS to connect between servers, TCP ports `24007`, `24008`, and `24009`/`49152`+ (that port, plus an additional incremented port for each additional server in the cluster; the latter if GlusterFS is version 3.4+), and TCP/UDP port `111` must be open. You can open these using whatever firewall you wish (this can easily be configured using the `geerlingguy.firewall` role).
This role performs basic installation and setup of Gluster, but it does not configure or mount bricks (volumes), since that step is easier to do in a series of plays in your own playbook. Ansible 1.9+ includes the [`gluster_volume`](https://docs.ansible.com/gluster_volume_module.html) module to ease the management of Gluster volumes.
This role performs basic installation and setup of Gluster, but it does not configure or mount bricks (volumes), since that step is easier to do in a series of plays in your own playbook. Ansible 1.9+ includes the [`gluster_volume`](https://docs.ansible.com/ansible/latest/collections/gluster/gluster/gluster_volume_module.html) module to ease the management of Gluster volumes.
## Role Variables
Available variables are listed below, along with default values (see `defaults/main.yml`):
```yaml
glusterfs_default_release:""
```
You can specify a `default_release` for apt on Debian/Ubuntu by overriding this variable. This is helpful if you need a different package or version for the main GlusterFS packages (e.g. GlusterFS 3.5.x instead of 3.2.x with the `wheezy-backports` default release on Debian Wheezy).
```yaml
glusterfs_ppa_use:yes
glusterfs_ppa_version:"3.5"
```
For Ubuntu, specify whether to use the official Gluster PPA, and which version of the PPA to use. See Gluster's [Getting Started Guide](https://docs.gluster.org/en/latest/Quick-Start-Guide/Quickstart/) for more info.
@ -29,9 +33,11 @@ None.
## Example Playbook
```yaml
- hosts:server
roles:
- geerlingguy.glusterfs
```
For a real-world use example, read through [Simple GlusterFS Setup with Ansible](http://www.jeffgeerling.com/blog/simple-glusterfs-setup-ansible), a blog post by this role's author, which is included in Chapter 8 of [Ansible for DevOps](https://www.ansiblefordevops.com/).
when:inventory_hostname == groups['kube-master'][0] and groups['gfs-cluster'] is defined and hostvars[groups['gfs-cluster'][0]].gluster_disk_size_gb is defined
when:inventory_hostname == groups['kube_control_plane'][0] and groups['gfs-cluster'] is defined and hostvars[groups['gfs-cluster'][0]].gluster_disk_size_gb is defined
- name:Kubernetes Apps | Set GlusterFS endpoint and PV
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.