cgroup ID is currently allocated using a dedicated per-hierarchy idr
and used internally and exposed through tracepoints and bpf. This is
confusing because there are tracepoints and other interfaces which use
the cgroupfs ino as IDs.
The preceding changes made kn->id exposed as ino as 64bit ino on
supported archs or ino+gen (low 32bits as ino, high gen). There's no
reason for cgroup to use different IDs. The kernfs IDs are unique and
userland can easily discover them and map them back to paths using
standard file operations.
This patch replaces cgroup IDs with kernfs IDs.
* cgroup_id() is added and all cgroup ID users are converted to use it.
* kernfs_node creation is moved to earlier during cgroup init so that
cgroup_id() is available during init.
* While at it, s/cgroup/cgrp/ in psi helpers for consistency.
* Fallback ID value is changed to 1 to be consistent with root cgroup
ID.
Change-Id: Iab2ee05e4e75671c3ee6799e7e2b3358394fec66
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
btrfs is going to use css_put() and wbc helpers to improve cgroup
writeback support. Add dummy css_get() definition and export wbc
helpers to prepare for module and !CONFIG_CGROUP builds.
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Change-Id: I2fef472a0ce04d6bb32213500f57c72045b13633
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value. I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to. Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.
This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper. Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen. This simplifies ID handling less cumbersome and will
allow using 64bit inos on supported archs.
This patch doesn't make any functional changes.
Change-Id: I289fc21fdfd22b7c7cae73626665b0cb100a0c5f
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>
Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets (or intentionally avoids) to detach a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks a possibility to implement
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will be ever released.
A dying cgroup can not contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.
Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.
The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, had such a need arise in the future.
Change-Id: I522bed27d3800bf9276272348898ed5b393fa5f2
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Changes in 4.19.269
arm: dts: rockchip: fix node name for hym8563 rtc
ARM: dts: rockchip: fix ir-receiver node names
ARM: 9251/1: perf: Fix stacktraces for tracepoint events in THUMB2 kernels
ARM: 9266/1: mm: fix no-MMU ZERO_PAGE() implementation
ARM: dts: rockchip: disable arm_global_timer on rk3066 and rk3188
9p/fd: Use P9_HDRSZ for header size
ALSA: seq: Fix function prototype mismatch in snd_seq_expand_var_event
ASoC: soc-pcm: Add NULL check in BE reparenting
regulator: twl6030: fix get status of twl6032 regulators
fbcon: Use kzalloc() in fbcon_prepare_logo()
9p/xen: check logical size for buffer size
net: usb: qmi_wwan: add u-blox 0x1342 composition
xen/netback: Ensure protocol headers don't fall in the non-linear area
xen/netback: do some code cleanup
xen/netback: don't call kfree_skb() with interrupts disabled
rcutorture: Automatically create initrd directory
media: v4l2-dv-timings.c: fix too strict blanking sanity checks
memcg: fix possible use-after-free in memcg_write_event_control()
KVM: s390: vsie: Fix the initialization of the epoch extension (epdx) field
HID: hid-lg4ff: Add check for empty lbuf
HID: core: fix shift-out-of-bounds in hid_report_raw_event
ieee802154: cc2520: Fix error return code in cc2520_hw_init()
ca8210: Fix crash by zero initializing data
gpio: amd8111: Fix PCI device reference count leak
e1000e: Fix TX dispatch condition
igb: Allocate MSI-X vector when testing
Bluetooth: 6LoWPAN: add missing hci_dev_put() in get_l2cap_conn()
Bluetooth: Fix not cleanup led when bt_init fails
selftests: rtnetlink: correct xfrm policy rule in kci_test_ipsec_offload
mac802154: fix missing INIT_LIST_HEAD in ieee802154_if_add()
net: encx24j600: Add parentheses to fix precedence
net: encx24j600: Fix invalid logic in reading of MISTAT register
xen-netfront: Fix NULL sring after live migration
net: mvneta: Prevent out of bounds read in mvneta_config_rss()
i40e: Fix not setting default xps_cpus after reset
i40e: Fix for VF MAC address 0
i40e: Disallow ip4 and ip6 l4_4_bytes
NFC: nci: Bounds check struct nfc_target arrays
nvme initialize core quirks before calling nvme_init_subsystem
net: stmmac: fix "snps,axi-config" node property parsing
net: hisilicon: Fix potential use-after-free in hisi_femac_rx()
net: hisilicon: Fix potential use-after-free in hix5hd2_rx()
tipc: Fix potential OOB in tipc_link_proto_rcv()
ethernet: aeroflex: fix potential skb leak in greth_init_rings()
xen/netback: fix build warning
net: plip: don't call kfree_skb/dev_kfree_skb() under spin_lock_irq()
ipv6: avoid use-after-free in ip6_fragment()
net: mvneta: Fix an out of bounds check
can: esd_usb: Allow REC and TEC to return to zero
Linux 4.19.269
Change-Id: Ie79e55bad6376d2314d44479ef1ec3c546d24030
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 4a7ba45b1a435e7097ca0f79a847d0949d0eb088 upstream.
memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call. As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file. Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.
Prior to 347c4a8747 ("memcg: remove cgroup_event->cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses. The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently dropped
the file type check with it allowing any file to slip through. With the
invarients broken, the d_name and parent accesses can now race against
renames and removals of arbitrary files and cause use-after-free's.
Fix the bug by resurrecting the file type check in __file_cft(). Now that
cgroupfs is implemented through kernfs, checking the file operations needs
to go through a layer of indirection. Instead, let's check the superblock
and dentry type.
Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org
Fixes: 347c4a8747 ("memcg: remove cgroup_event->cft")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org> [3.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
PSI accounts stalls for each cgroup separately and aggregates it at each
level of the hierarchy. This causes additional overhead with psi_avgs_work
being called for each cgroup in the hierarchy. psi_avgs_work has been
highly optimized, however on systems with large number of cgroups the
overhead becomes noticeable.
Systems which use PSI only at the system level could avoid this overhead
if PSI can be configured to skip per-cgroup stall accounting.
Add "cgroup_disable=pressure" kernel command-line option to allow
requesting system-wide only pressure stall accounting. When set, it
keeps system-wide accounting under /proc/pressure/ but skips accounting
for individual cgroups and does not expose PSI nodes in cgroup hierarchy.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/patchwork/patch/1435705
(cherry picked from commit 3958e2d0c34e18c41b60dc01832bd670a59ef70f
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git tj)
Conflicts:
include/linux/cgroup-defs.h
kernel/cgroup/cgroup.c
1. Trivial merge conflict in cgroup-defs.h due to missing CFTYPE_DEBUG
2. Changed flags to (CFTYPE_NOT_ON_ROOT | CFTYPE_PRESSURE) in cgroup.c
because in 4.19 psi files were allowed only in non-root cgroups.
Bug: 178872719
Bug: 191734423
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ifc8fbc52f9a1131d7c2668edbb44c525c76c3360
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa
(cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873)
cgroup-defs.h: use the struct cgroup_freezer_state and the
freezer field from definitions in I6221a975c04f06249a4f8d693852776ae08a8d8e
sched.h: use the frozen field defined in
I6221a975c04f06249a4f8d693852776ae08a8d8e
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Changes in 4.19.134
perf: Make perf able to build with latest libbfd
net: rmnet: fix lower interface leak
genetlink: remove genl_bind
ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg
l2tp: remove skb_dst_set() from l2tp_xmit_skb()
llc: make sure applications use ARPHRD_ETHER
net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb
net_sched: fix a memory leak in atm_tc_init()
net: usb: qmi_wwan: add support for Quectel EG95 LTE modem
tcp: fix SO_RCVLOWAT possible hangs under high mem pressure
tcp: make sure listeners don't initialize congestion-control state
tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()
tcp: md5: do not send silly options in SYNCOOKIES
tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers
tcp: md5: allow changing MD5 keys in all socket states
cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
cgroup: Fix sock_cgroup_data on big-endian.
sched: consistently handle layer3 header accesses in the presence of VLANs
vlan: consolidate VLAN parsing code and limit max parsing depth
drm/msm: fix potential memleak in error branch
drm/exynos: fix ref count leak in mic_pre_enable
m68k: nommu: register start of the memory with memblock
m68k: mm: fix node memblock init
arm64/alternatives: use subsections for replacement sequences
tpm_tis: extra chip->ops check on error path in tpm_tis_core_init
gfs2: read-only mounts should grab the sd_freeze_gl glock
i2c: eg20t: Load module automatically if ID matches
arm64/alternatives: don't patch up internal branches
iio:magnetometer:ak8974: Fix alignment and data leak issues
iio:humidity:hdc100x Fix alignment and data leak issues
iio: magnetometer: ak8974: Fix runtime PM imbalance on error
iio: mma8452: Add missed iio_device_unregister() call in mma8452_probe()
iio: pressure: zpa2326: handle pm_runtime_get_sync failure
iio:humidity:hts221 Fix alignment and data leak issues
iio:pressure:ms5611 Fix buffer element alignment
iio:health:afe4403 Fix timestamp alignment and prevent data leak.
spi: fix initial SPI_SR value in spi-fsl-dspi
spi: spi-fsl-dspi: Fix lockup if device is shutdown during SPI transfer
net: dsa: bcm_sf2: Fix node reference count
of: of_mdio: Correct loop scanning logic
Revert "usb/ohci-platform: Fix a warning when hibernating"
Revert "usb/xhci-plat: Set PM runtime as active on resume"
Revert "usb/ehci-platform: Set PM runtime as active on resume"
net: sfp: add support for module quirks
net: sfp: add some quirks for GPON modules
HID: quirks: Remove ITE 8595 entry from hid_have_special_driver
ARM: at91: pm: add quirk for sam9x60's ulp1
scsi: sr: remove references to BLK_DEV_SR_VENDOR, leave it enabled
ALSA: usb-audio: Create a registration quirk for Kingston HyperX Amp (0951:16d8)
doc: dt: bindings: usb: dwc3: Update entries for disabling SS instances in park mode
mmc: sdhci: do not enable card detect interrupt for gpio cd type
ALSA: usb-audio: Rewrite registration quirk handling
ACPI: video: Use native backlight on Acer Aspire 5783z
ALSA: usb-audio: Add registration quirk for Kingston HyperX Cloud Alpha S
Input: mms114 - add extra compatible for mms345l
ACPI: video: Use native backlight on Acer TravelMate 5735Z
ALSA: usb-audio: Add registration quirk for Kingston HyperX Cloud Flight S
iio:health:afe4404 Fix timestamp alignment and prevent data leak.
phy: sun4i-usb: fix dereference of pointer phy0 before it is null checked
arm64: dts: meson: add missing gxl rng clock
spi: spi-sun6i: sun6i_spi_transfer_one(): fix setting of clock rate
usb: gadget: udc: atmel: fix uninitialized read in debug printk
staging: comedi: verify array index is correct before using it
Revert "thermal: mediatek: fix register index error"
ARM: dts: socfpga: Align L2 cache-controller nodename with dtschema
regmap: debugfs: Don't sleep while atomic for fast_io regmaps
copy_xstate_to_kernel: Fix typo which caused GDB regression
apparmor: ensure that dfa state tables have entries
perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode
soc: qcom: rpmh: Update dirty flag only when data changes
soc: qcom: rpmh: Invalidate SLEEP and WAKE TCSes before flushing new data
soc: qcom: rpmh-rsc: Clear active mode configuration for wake TCS
soc: qcom: rpmh-rsc: Allow using free WAKE TCS for active request
mtd: rawnand: marvell: Use nand_cleanup() when the device is not yet registered
mtd: rawnand: marvell: Fix probe error path
mtd: rawnand: timings: Fix default tR_max and tCCS_min timings
mtd: rawnand: brcmnand: fix CS0 layout
mtd: rawnand: oxnas: Keep track of registered devices
mtd: rawnand: oxnas: Unregister all devices on error
mtd: rawnand: oxnas: Release all devices in the _remove() path
slimbus: core: Fix mismatch in of_node_get/put
HID: magicmouse: do not set up autorepeat
HID: quirks: Always poll Obins Anne Pro 2 keyboard
HID: quirks: Ignore Simply Automated UPB PIM
ALSA: line6: Perform sanity check for each URB creation
ALSA: line6: Sync the pending work cancel at disconnection
ALSA: usb-audio: Fix race against the error recovery URB submission
ALSA: hda/realtek - change to suitable link model for ASUS platform
ALSA: hda/realtek - Enable Speaker for ASUS UX533 and UX534
USB: c67x00: fix use after free in c67x00_giveback_urb
usb: dwc2: Fix shutdown callback in platform
usb: chipidea: core: add wakeup support for extcon
usb: gadget: function: fix missing spinlock in f_uac1_legacy
USB: serial: iuu_phoenix: fix memory corruption
USB: serial: cypress_m8: enable Simply Automated UPB PIM
USB: serial: ch341: add new Product ID for CH340
USB: serial: option: add GosunCn GM500 series
USB: serial: option: add Quectel EG95 LTE modem
virt: vbox: Fix VBGL_IOCTL_VMMDEV_REQUEST_BIG and _LOG req numbers to match upstream
virt: vbox: Fix guest capabilities mask check
virtio: virtio_console: add missing MODULE_DEVICE_TABLE() for rproc serial
serial: mxs-auart: add missed iounmap() in probe failure and remove
ovl: inode reference leak in ovl_is_inuse true case.
ovl: relax WARN_ON() when decoding lower directory file handle
ovl: fix unneeded call to ovl_change_flags()
fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
Revert "zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()"
mei: bus: don't clean driver pointer
Input: i8042 - add Lenovo XiaoXin Air 12 to i8042 nomux list
uio_pdrv_genirq: fix use without device tree and no interrupt
timer: Prevent base->clk from moving backward
timer: Fix wheel index calculation on last level
MIPS: Fix build for LTS kernel caused by backporting lpj adjustment
riscv: use 16KB kernel stack on 64-bit
hwmon: (emc2103) fix unable to change fan pwm1_enable attribute
powerpc/book3s64/pkeys: Fix pkey_access_permitted() for execute disable pkey
intel_th: pci: Add Jasper Lake CPU support
intel_th: pci: Add Tiger Lake PCH-H support
intel_th: pci: Add Emmitsburg PCH support
intel_th: Fix a NULL dereference when hub driver is not loaded
dmaengine: fsl-edma: Fix NULL pointer exception in fsl_edma_tx_handler
misc: atmel-ssc: lock with mutex instead of spinlock
thermal/drivers/cpufreq_cooling: Fix wrong frequency converted from power
arm64: ptrace: Override SPSR.SS when single-stepping is enabled
arm64: ptrace: Consistently use pseudo-singlestep exceptions
arm64: compat: Ensure upper 32 bits of x0 are zero on syscall return
sched: Fix unreliable rseq cpu_id for new tasks
sched/fair: handle case of task_h_load() returning 0
genirq/affinity: Handle affinity setting on inactive interrupts correctly
printk: queue wake_up_klogd irq_work only if per-CPU areas are ready
libceph: don't omit recovery_deletes in target_copy()
rxrpc: Fix trace string
spi: sprd: switch the sequence of setting WDG_LOAD_LOW and _HIGH
Linux 4.19.134
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ieeb9e03f4a2d51aeebe3a3eadd9c1b93a26088a0
[ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ]
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.
sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.
The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
This bug seems to be introduced since the beginning, commit
d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit 090e28b229af
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9c974c77246460fa6a92c18554c3311c8c83c160 upstream.
PF_EXITING is set earlier than actual removal from css_set when a task
is exitting. This can confuse cgroup.procs readers who see no PF_EXITING
tasks, however, rmdir is checking against css_set membership so it can
transitionally fail with EBUSY.
Fix this by listing tasks that weren't unlinked from css_set active
lists.
It may happen that other users of the task iterator (without
CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
is equal to the state before commit c03cd7738a83 ("cgroup: Include dying
leaders with live threads in PROCS iterations") but it may be reviewed
later.
Reported-by: Suren Baghdasaryan <surenb@google.com>
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
PF_EXITING is set earlier than actual removal from css_set when a task
is exitting. This can confuse cgroup.procs readers who see no PF_EXITING
tasks, however, rmdir is checking against css_set membership so it can
transitionally fail with EBUSY.
Fix this by listing tasks that weren't unlinked from css_set active
lists.
It may happen that other users of the task iterator (without
CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
is equal to the state before commit c03cd7738a83 ("cgroup: Include dying
leaders with live threads in PROCS iterations") but it may be reviewed
later.
Reported-by: Suren Baghdasaryan <surenb@google.com>
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
(cherry picked from commit 9c974c77246460fa6a92c18554c3311c8c83c160)
Bug: 141213848
Bug: 146758430
Test: test_cgcore_destroy from linux-kselftest
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Iac57661b931129ed1e44b89675f8115bb89084ff
(cherry picked from commit 21ee296526c70d6dc3c64639406f156f39b80fd0)
cgroup already uses floating point for percent[ile] numbers and there
are several controllers which want to take them as input. Add a
generic parse helper to handle inputs.
Update the interface convention documentation about the use of
percentage numbers. While at it, also clarify the default time unit.
Bug: 120440300
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit a5e112e6424adb77d953eac20e6936b952fd6b32)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ic1fcf21d7955eb8edd2e8e91517bca6aef41694f
Signed-off-by: Quentin Perret <qperret@google.com>
Changes in 4.19.66
scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
gcc-9: don't warn about uninitialized variable
driver core: Establish order of operations for device_add and device_del via bitflag
drivers/base: Introduce kill_device()
libnvdimm/bus: Prevent duplicate device_unregister() calls
libnvdimm/region: Register badblocks before namespaces
libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
HID: wacom: fix bit shift for Cintiq Companion 2
HID: Add quirk for HP X1200 PIXART OEM mouse
IB: directly cast the sockaddr union to aockaddr
atm: iphase: Fix Spectre v1 vulnerability
bnx2x: Disable multi-cos feature.
ife: error out when nla attributes are empty
ip6_gre: reload ipv6h in prepare_ip6gre_xmit_ipv6
ip6_tunnel: fix possible use-after-free on xmit
ipip: validate header length in ipip_tunnel_xmit
mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
mvpp2: fix panic on module removal
mvpp2: refactor MTU change code
net: bridge: delete local fdb on device init failure
net: bridge: mcast: don't delete permanent entries when fast leave is enabled
net: fix ifindex collision during namespace removal
net/mlx5e: always initialize frag->last_in_page
net/mlx5: Use reversed order when unregister devices
net: phylink: Fix flow control for fixed-link
net: qualcomm: rmnet: Fix incorrect UL checksum offload logic
net: sched: Fix a possible null-pointer dereference in dequeue_func()
net sched: update vlan action for batched events operations
net: sched: use temporary variable for actions indexes
net/smc: do not schedule tx_work in SMC_CLOSED state
NFC: nfcmrvl: fix gpio-handling regression
ocelot: Cancel delayed work before wq destruction
tipc: compat: allow tipc commands without arguments
tun: mark small packets as owned by the tap sock
net/mlx5: Fix modify_cq_in alignment
net/mlx5e: Prevent encap flow counter update async to user query
r8169: don't use MSI before RTL8168d
compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
cgroup: Call cgroup_release() before __exit_signal()
cgroup: Implement css_task_iter_skip()
cgroup: Include dying leaders with live threads in PROCS iterations
cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
cgroup: Fix css_task_iter_advance_css_set() cset skip condition
spi: bcm2835: Fix 3-wire mode if DMA is enabled
Linux 4.19.66
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id33ce169af8bf14a3791040b4cf923832ce84f6c
commit c03cd7738a83b13739f00546166969342c8ff014 upstream.
CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
this means that a process with dying leader and live threads will be
skipped. IOW, cgroup.procs might be empty while cgroup.threads isn't,
which is confusing to say the least.
Fix it by making cset track dying tasks and include dying leaders with
live threads in PROCS iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream.
When a task is moved out of a cset, task iterators pointing to the
task are advanced using the normal css_task_iter_advance() call. This
is fine but we'll be tracking dying tasks on csets and thus moving
tasks from cset->tasks to (to be added) cset->dying_tasks. When we
remove a task from cset->tasks, if we advance the iterators, they may
move over to the next cset before we had the chance to add the task
back on the dying list, which can allow the task to escape iteration.
This patch separates out skipping from advancing. Skipping only moves
the affected iterators to the next pointer rather than fully advancing
it and the following advancing will recognize that the cursor has
already been moved forward and do the rest of advancing. This ensures
that when a task moves from one list to another in its cset, as long
as it moves in the right direction, it's always visible to iteration.
This doesn't cause any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.19.53
drm/nouveau: add kconfig option to turn off nouveau legacy contexts. (v3)
nouveau: Fix build with CONFIG_NOUVEAU_LEGACY_CTX_SUPPORT disabled
HID: multitouch: handle faulty Elo touch device
HID: wacom: Don't set tool type until we're in range
HID: wacom: Don't report anything prior to the tool entering range
HID: wacom: Send BTN_TOUCH in response to INTUOSP2_BT eraser contact
HID: wacom: Correct button numbering 2nd-gen Intuos Pro over Bluetooth
HID: wacom: Sync INTUOSP2_BT touch state after each frame if necessary
Revert "ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops"
ALSA: oxfw: allow PCM capture for Stanton SCS.1m
ALSA: hda/realtek - Update headset mode for ALC256
ALSA: firewire-motu: fix destruction of data for isochronous resources
libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk
mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node
fs/ocfs2: fix race in ocfs2_dentry_attach_lock()
mm/vmscan.c: fix trying to reclaim unevictable LRU page
signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO
ptrace: restore smp_rmb() in __ptrace_may_access()
iommu/arm-smmu: Avoid constant zero in TLBI writes
i2c: acorn: fix i2c warning
bcache: fix stack corruption by PRECEDING_KEY()
bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached
cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
ASoC: cs42xx8: Add regcache mask dirty
ASoC: fsl_asrc: Fix the issue about unsupported rate
drm/i915/sdvo: Implement proper HDMI audio support for SDVO
x86/uaccess, kcov: Disable stack protector
ALSA: seq: Protect in-kernel ioctl calls with mutex
ALSA: seq: Fix race of get-subscription call vs port-delete ioctls
Revert "ALSA: seq: Protect in-kernel ioctl calls with mutex"
s390/kasan: fix strncpy_from_user kasan checks
Drivers: misc: fix out-of-bounds access in function param_set_kgdbts_var
f2fs: fix to avoid accessing xattr across the boundary
scsi: qedi: remove memset/memcpy to nfunc and use func instead
scsi: qedi: remove set but not used variables 'cdev' and 'udev'
scsi: lpfc: correct rcu unlock issue in lpfc_nvme_info_show
scsi: lpfc: add check for loss of ndlp when sending RRQ
arm64/mm: Inhibit huge-vmap with ptdump
nvme: fix srcu locking on error return in nvme_get_ns_from_disk
nvme: remove the ifdef around nvme_nvm_ioctl
nvme: merge nvme_ns_ioctl into nvme_ioctl
nvme: release namespace SRCU protection before performing controller ioctls
nvme: fix memory leak for power latency tolerance
platform/x86: pmc_atom: Add Lex 3I380D industrial PC to critclk_systems DMI table
platform/x86: pmc_atom: Add several Beckhoff Automation boards to critclk_systems DMI table
scsi: bnx2fc: fix incorrect cast to u64 on shift operation
libnvdimm: Fix compilation warnings with W=1
selftests: fib_rule_tests: fix local IPv4 address typo
selftests/timers: Add missing fflush(stdout) calls
tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts
usbnet: ipheth: fix racing condition
KVM: arm/arm64: Move cc/it checks under hyp's Makefile to avoid instrumentation
KVM: x86/pmu: mask the result of rdpmc according to the width of the counters
KVM: x86/pmu: do not mask the value that is written to fixed PMUs
KVM: s390: fix memory slot handling for KVM_SET_USER_MEMORY_REGION
tools/kvm_stat: fix fields filter for child events
drm/vmwgfx: integer underflow in vmw_cmd_dx_set_shader() leading to an invalid read
drm/vmwgfx: NULL pointer dereference from vmw_cmd_dx_view_define()
usb: dwc2: Fix DMA cache alignment issues
usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression)
USB: Fix chipmunk-like voice when using Logitech C270 for recording audio.
USB: usb-storage: Add new ID to ums-realtek
USB: serial: pl2303: add Allied Telesis VT-Kit3
USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode
USB: serial: option: add Telit 0x1260 and 0x1261 compositions
timekeeping: Repair ktime_get_coarse*() granularity
RAS/CEC: Convert the timer callback to a workqueue
RAS/CEC: Fix binary search function
x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
x86/kasan: Fix boot with 5-level paging and KASAN
x86/mm/KASLR: Compute the size of the vmemmap section properly
x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
drm/edid: abstract override/firmware EDID retrieval
drm: add fallback override/firmware EDID modes workaround
rtc: pcf8523: don't return invalid date when battery is low
Linux 4.19.53
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 18fa84a2db0e15b02baa5d94bdb5bd509175d2f6 upstream.
A PF_EXITING task can stay associated with an offline css. If such
task calls task_get_css(), it can get stuck indefinitely. This can be
triggered by BSD process accounting which writes to a file with
PF_EXITING set when racing against memcg disable as in the backtrace
at the end.
After this change, task_get_css() may return a css which was already
offline when the function was called. None of the existing users are
affected by this change.
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched detected stalls on CPUs/tasks:
...
NMI backtrace for cpu 0
...
Call Trace:
<IRQ>
dump_stack+0x46/0x68
nmi_cpu_backtrace.cold.2+0x13/0x57
nmi_trigger_cpumask_backtrace+0xba/0xca
rcu_dump_cpu_stacks+0x9e/0xce
rcu_check_callbacks.cold.74+0x2af/0x433
update_process_times+0x28/0x60
tick_sched_timer+0x34/0x70
__hrtimer_run_queues+0xee/0x250
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x56/0x110
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
...
btrfs_file_write_iter+0x31b/0x563
__vfs_write+0xfa/0x140
__kernel_write+0x4f/0x100
do_acct_process+0x495/0x580
acct_process+0xb9/0xdb
do_exit+0x748/0xa00
do_group_exit+0x3a/0xa0
get_signal+0x254/0x560
do_signal+0x23/0x5c0
exit_to_usermode_loop+0x5d/0xa0
prepare_exit_to_usermode+0x53/0x80
retint_user+0x8/0x8
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.2+
Fixes: ec438699a9 ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.19.34
arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
ext4: cleanup bh release code in ext4_ind_remove_space()
tty/serial: atmel: Add is_half_duplex helper
tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
CIFS: fix POSIX lock leak and invalid ptr deref
h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
tracing: kdb: Fix ftdump to not sleep
net/mlx5: Avoid panic when setting vport rate
net/mlx5: Avoid panic when setting vport mac, getting vport config
gpio: gpio-omap: fix level interrupt idling
include/linux/relay.h: fix percpu annotation in struct rchan
sysctl: handle overflow for file-max
net: stmmac: Avoid sometimes uninitialized Clang warnings
enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
libbpf: force fixdep compilation at the start of the build
scsi: hisi_sas: Set PHY linkrate when disconnected
scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO
iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver
x86/hyperv: Fix kernel panic when kexec on HyperV
perf c2c: Fix c2c report for empty numa node
mm/sparse: fix a bad comparison
mm/cma.c: cma_declare_contiguous: correct err handling
mm/page_ext.c: fix an imbalance with kmemleak
mm, swap: bounds check swap_info array accesses to avoid NULL derefs
mm,oom: don't kill global init via memory.oom.group
memcg: killed threads should not invoke memcg OOM killer
mm, mempolicy: fix uninit memory access
mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
mm/slab.c: kmemleak no scan alien caches
ocfs2: fix a panic problem caused by o2cb_ctl
f2fs: do not use mutex lock in atomic context
fs/file.c: initialize init_files.resize_wait
page_poison: play nicely with KASAN
cifs: use correct format characters
dm thin: add sanity checks to thin-pool and external snapshot creation
f2fs: fix to check inline_xattr_size boundary correctly
cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED
cifs: Fix NULL pointer dereference of devname
netfilter: nf_tables: check the result of dereferencing base_chain->stats
netfilter: conntrack: tcp: only close if RST matches exact sequence
jbd2: fix invalid descriptor block checksum
fs: fix guard_bio_eod to check for real EOD errors
tools lib traceevent: Fix buffer overflow in arg_eval
PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove()
wil6210: check null pointer in _wil_cfg80211_merge_extra_ies
mt76: fix a leaked reference by adding a missing of_node_put
crypto: crypto4xx - add missing of_node_put after of_device_is_available
crypto: cavium/zip - fix collision with generic cra_driver_name
usb: chipidea: Grab the (legacy) USB PHY by phandle first
powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
coresight: etm4x: Add support to enable ETMv4.2
serial: 8250_pxa: honor the port number from devicetree
ARM: 8840/1: use a raw_spinlock_t in unwind
iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables
powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
btrfs: qgroup: Make qgroup async transaction commit more aggressive
mmc: omap: fix the maximum timeout setting
net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat
e1000e: Fix -Wformat-truncation warnings
mlxsw: spectrum: Avoid -Wformat-truncation warnings
platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN
platform/mellanox: mlxreg-hotplug: Fix KASAN warning
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
IB/mlx4: Increase the timeout for CM cache
clk: fractional-divider: check parent rate only if flag is set
perf annotate: Fix getting source line failure
ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of()
cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
efi: cper: Fix possible out-of-bounds access
s390/ism: ignore some errors during deregistration
scsi: megaraid_sas: return error when create DMA pool failed
scsi: fcoe: make use of fip_mode enum complete
drm/amd/display: Clear stream->mode_changed after commit
perf test: Fix failure of 'evsel-tp-sched' test on s390
mwifiex: don't advertise IBSS features without FW support
perf report: Don't shadow inlined symbol with different addr range
SoC: imx-sgtl5000: add missing put_device()
media: ov7740: fix runtime pm initialization
media: sh_veu: Correct return type for mem2mem buffer helpers
media: s5p-jpeg: Correct return type for mem2mem buffer helpers
media: rockchip/rga: Correct return type for mem2mem buffer helpers
media: s5p-g2d: Correct return type for mem2mem buffer helpers
media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
media: mtk-jpeg: Correct return type for mem2mem buffer helpers
mt76: usb: do not run mt76u_queues_deinit twice
xen/gntdev: Do not destroy context while dma-bufs are in use
vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
HID: intel-ish-hid: avoid binding wrong ishtp_cl_device
cgroup, rstat: Don't flush subtree root unless necessary
jbd2: fix race when writing superblock
leds: lp55xx: fix null deref on firmware load failure
perf report: Add s390 diagnosic sampling descriptor size
iwlwifi: pcie: fix emergency path
ACPI / video: Refactor and fix dmi_is_desktop()
selftests: skip seccomp get_metadata test if not real root
kprobes: Prohibit probing on bsearch()
kprobes: Prohibit probing on RCU debug routine
netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
ARM: 8833/1: Ensure that NEON code always compiles with Clang
ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins
ALSA: PCM: check if ops are defined before suspending PCM
ath10k: fix shadow register implementation for WCN3990
usb: f_fs: Avoid crash due to out-of-scope stack ptr access
sched/topology: Fix percpu data types in struct sd_data & struct s_data
bcache: fix input overflow to cache set sysfs file io_error_halflife
bcache: fix input overflow to sequential_cutoff
bcache: fix potential div-zero error of writeback_rate_i_term_inverse
bcache: improve sysfs_strtoul_clamp()
genirq: Avoid summation loops for /proc/stat
net: marvell: mvpp2: fix stuck in-band SGMII negotiation
iw_cxgb4: fix srqidx leak during connection abort
net: phy: consider latched link-down status in polling mode
fbdev: fbmem: fix memory access if logo is bigger than the screen
cdrom: Fix race condition in cdrom_sysctl_register
drm: rcar-du: add missing of_node_put
drm/amd/display: Don't re-program planes for DPMS changes
drm/amd/display: Disconnect mpcc when changing tg
perf/aux: Make perf_event accessible to setup_aux()
e1000e: fix cyclic resets at link up with active tx
e1000e: Exclude device from suspend direct complete optimization
platform/x86: intel_pmc_core: Fix PCH IP sts reading
i2c: of: Try to find an I2C adapter matching the parent
staging: spi: mt7621: Add return code check on device_reset()
iwlwifi: mvm: fix RFH config command with >=10 CPUs
ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
efi/memattr: Don't bail on zero VA if it equals the region's PA
sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
drm/vkms: Bugfix extra vblank frame
ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation
efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
soc: qcom: gsbi: Fix error handling in gsbi_probe()
mt7601u: bump supported EEPROM version
ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of
ARM: avoid Cortex-A9 livelock on tight dmb loops
block, bfq: fix in-service-queue check for queue merging
bpf: fix missing prototype warnings
selftests/bpf: skip verifier tests for unsupported program types
powerpc/64s: Clear on-stack exception marker upon exception return
cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state
tty: increase the default flip buffer limit to 2*640K
powerpc/pseries: Perform full re-add of CPU for topology update post-migration
drm/amd/display: Enable vblank interrupt during CRC capture
ALSA: dice: add support for Solid State Logic Duende Classic/Mini
usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded
platform/x86: intel-hid: Missing power button release on some Dell models
perf script python: Use PyBytes for attr in trace-event-python
perf script python: Add trace_context extension module to sys.modules
media: mt9m111: set initial frame size other than 0x0
hwrng: virtio - Avoid repeated init of completion
soc/tegra: fuse: Fix illegal free of IO base address
HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit
f2fs: UBSAN: set boolean value iostat_enable correctly
hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
cpu/hotplug: Mute hotplug lockdep during init
dmaengine: imx-dma: fix warning comparison of distinct pointer types
dmaengine: qcom_hidma: assign channel cookie correctly
dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_*
netfilter: physdev: relax br_netfilter dependency
media: rcar-vin: Allow independent VIN link enablement
media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins
drm: Auto-set allow_fb_modifiers when given modifiers at plane init
drm/nouveau: Stop using drm_crtc_force_disable
x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
selinux: do not override context on context mounts
brcmfmac: Use firmware_request_nowarn for the clm_blob
wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
x86/build: Mark per-CPU symbols as absolute explicitly for LLD
drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup
clk: meson: clean-up clock registration
clk: rockchip: fix frac settings of GPLL clock for rk3328
dmaengine: tegra: avoid overflow of byte tracking
Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device
drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
net: stmmac: Avoid one more sometimes uninitialized Clang warning
ACPI / video: Extend chassis-type detection with a "Lunch Box" check
bcache: fix potential div-zero error of writeback_rate_p_term_inverse
kprobes/x86: Blacklist non-attachable interrupt functions
Linux 4.19.34
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]
The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
needs pids_free() to uncharge the pid.
However, ->free() is called from __put_task_struct()->cgroup_free() and this
is too late. Even the trivial program which does
for (;;) {
int pid = fork();
assert(pid >= 0);
if (pid)
wait(NULL);
else
exit(0);
}
can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
implies an RCU gp after the task/pid goes away and before the final put().
Test-case:
mkdir -p /tmp/CG
mount -t cgroup2 none /tmp/CG
echo '+pids' > /tmp/CG/cgroup.subtree_control
mkdir /tmp/CG/PID
echo 2 > /tmp/CG/PID/pids.max
perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
echo $! > /tmp/CG/PID/cgroup.procs
Without this patch the forking process fails soon after migration.
Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
into the new helper, cgroup_release(), called by release_task() which actually
frees the pid(s).
Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.
This patch implements pressure stall tracking for cgroups. In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.
Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2ce7135adc9ad081aa3c49744144376ac74fea60)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
== Problem description ==
It's useful to be able to identify cgroup associated with skb in TC so
that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
helper can help with this.
Though in real life cgroup hierarchy and hierarchy to apply a policy to
don't map 1:1.
It's often the case that there is a container and corresponding cgroup,
but there are many more sub-cgroups inside container, e.g. because it's
delegated to containerized application to control resources for its
subsystems, or to separate application inside container from infra that
belongs to containerization system (e.g. sshd).
At the same time it may be useful to apply a policy to container as a
whole.
If multiple containers like this are run on a host (what is often the
case) and many of them have sub-cgroups, it may not be possible to apply
per-container policy in TC with existing helpers such as
bpf_skb_under_cgroup or bpf_skb_cgroup_id:
* bpf_skb_cgroup_id will return id of immediate cgroup associated with
skb, i.e. if it's a sub-cgroup inside container, it can't be used to
identify container's cgroup;
* bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
i.e. if there are N containers on a host and a policy has to be
applied to M of them (0 <= M <= N), it'd require M calls to
bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
load new BPF program.
== Solution ==
The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
used to get id of cgroup v2 that is an ancestor of cgroup associated
with skb at specified level of cgroup hierarchy.
That way admin can place all containers on one level of cgroup hierarchy
(what is a good practice in general and already used in many
configurations) and identify specific cgroup on this level no matter
what sub-cgroup skb is associated with.
E.g. if there is a cgroup hierarchy:
root/
root/container1/
root/container1/app11/
root/container1/app11/sub-app-a/
root/container1/app12/
root/container2/
root/container2/app21/
root/container2/app22/
root/container2/app22/sub-app-b/
, then having skb associated with root/container1/app11/sub-app-a/ it's
possible to get ancestor at level 1, what is container1 and apply policy
for this container, or apply another policy if it's container2.
Policies can be kept e.g. in a hash map where key is a container cgroup
id and value is an action.
Levels where container cgroups are created are usually known in advance
whether cgroup hierarchy inside container may be hard to predict
especially in case when its creation is delegated to containerized
application.
== Implementation details ==
The helper gets ancestor by walking parents up to specified level.
Another option would be to get different kind of "id" from
cgroup->ancestor_ids[level] and use it with idr_find() to get struct
cgroup for ancestor. But that would require radix lookup what doesn't
seem to be better (at least it's not obviously better).
Format of return value of the new helper is same as that of
bpf_skb_cgroup_id.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, rstat flush path is protected with a mutex which is fine as
all the existing users are from interface file show path. However,
rstat is being generalized for use by controllers and flushing from
atomic contexts will be necessary.
This patch replaces cgroup_rstat_mutex with a spinlock and adds a
irq-safe flush function - cgroup_rstat_flush_irqsafe(). Explicit
yield handling is added to the flush path so that other flush
functions can yield to other threads and flushers.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_rstat is being generalized so that controllers can use it too.
This patch factors out and exposes the following interface functions.
* cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
consistency.
* cgroup_rstat_flush_hold/release(): Factored out from base stat
implementation.
* cgroup_rstat_flush(): Verbatim expose.
While at it, drop assert on cgroup_rstat_mutex in
cgroup_base_stat_flush() as it crosses layers and make a minor comment
update.
v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup updates from Tejun Heo:
"Cgroup2 cpu controller support is finally merged.
- Basic cpu statistics support to allow monitoring by default without
the CPU controller enabled.
- cgroup2 cpu controller support.
- /sys/kernel/cgroup files to help dealing with new / optional
features"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: export list of cgroups v2 features using sysfs
cgroup: export list of delegatable control files using sysfs
cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
MAINTAINERS: relocate cpuset.c
cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
sched: Implement interface for cgroup unified hierarchy
sched: Misc preps for cgroup unified hierarchy interface
sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
cgroup: statically initialize init_css_set->dfl_cgrp
cgroup: Implement cgroup2 basic CPU usage accounting
cpuacct: Introduce cgroup_account_cputime[_field]()
sched/cputime: Expose cputime_adjust()
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The basic cpu stat is currently shown with "cpu." prefix in
cgroup.stat, and the same information is duplicated in cpu.stat when
cpu controller is enabled. This is ugly and not very scalable as we
want to expand the coverage of stat information which is always
available.
This patch makes cgroup core always create "cpu.stat" file and show
the basic cpu stat there and calls the cpu controller to show the
extra stats when enabled. This ensures that the same information
isn't presented in multiple places and makes future expansion of basic
stats easier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to combination of two factors -
1. enabling cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usages on. cpuacct
controller is effectively used to designate a hierarchy to track CPU
usages on.
cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usages by default. While we can use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired - it collects and exposes two sources of truth which don't
agree with each other and some of the exposed statistics don't make
much sense. Also, it propagates all the way up the hierarchy on each
accounting event which is unnecessary.
This patch adds basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usages using it.
* All accountings are done per-cpu and don't propagate immediately.
It just bumps the per-cgroup per-cpu counters and links to the
parent's updated list if not already on it.
* On a read, the per-cpu counters are collected into the global ones
and then propagated upwards. Only the per-cpu counters which have
changed since the last read are propagated.
* CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
prefix. Total usage is collected from scheduling events. User/sys
breakdown is sourced from tick sampling and adjusted to the usage
using cputime_adjust().
This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).
v2: Minor changes and documentation updates as suggested by Waiman and
Roman.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge()
and cgroup_account_field(). This doesn't introduce any functional
changes and will be used to add cgroup basic resource accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
Misc trivial changes to prepare for future changes. No functional
difference.
* Expose cgroup_get(), cgroup_tryget() and cgroup_parent().
* Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.
* Rename cgroup_stats_show() to cgroup_stat_show() for consistency
with the file name.
Signed-off-by: Tejun Heo <tj@kernel.org>
By default we output cgroup id in blktrace. This adds an option to
display cgroup path. Since get cgroup path is a relativly heavy
operation, we don't enable it by default.
with the option enabled, blktrace will output something like this:
dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd]
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an API to export cgroup fhandle info. We don't export a full 'struct
file_handle', there are unrequired info. Sepcifically, cgroup is always
a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type fhandle,
we only need export the inode number and generation number just like
what generic_fh_to_dentry does. And we can avoid the overhead of getting
an inode too, since kernfs_node_id (ino and generation) has all the info
required.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
inode number and generation can identify a kernfs node. We are going to
export the identification by exportfs operations, so put ino and
generation into a separate structure. It's convenient when later patches
use the identification.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the dom_cgrp to which the processes of the subtree conceptually belong
and domain-level resource consumptions not tied to any specific task
are charged. In the subtree, threads won't be subject to process
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.
This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a dom_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets. "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes. This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.
Task iteration is implemented nested in css_set iteration. If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.
v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new
enable-threaded-per-cgroup behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup v2 is in the process of growing thread granularity support. A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.
The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged. Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.
This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.
* cgroup->dom_cgrp points to self for normal and thread roots. For
proper thread subtree members, points to the dom_cgrp (the thread
root).
* css_set->dom_cset points to self if for normal and thread roots. If
threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
The dom_cgrp serves as the resource domain and keeps the matching
csses available. The dom_cset holds those csses and makes them
easily accessible.
* All threaded csets are linked on their dom_csets to enable iteration
of all threaded tasks.
* cgroup->nr_threaded_children keeps track of the number of threaded
children.
This patch adds the above but doesn't actually use them yet. The
following patches will build on top.
v4: ->nr_threaded_children added.
v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new
enable-threaded-per-cgroup behavior.
v2: Added cgroup_is_threaded() helper.
Signed-off-by: Tejun Heo <tj@kernel.org>
css_task_iter currently always walks all tasks. With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration. As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
is not specified, it walks all tasks as before. When asserted, the
iterator only walks the group leaders.
For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".
While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr(). As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgrp->populated_cnt counts both local (the cgroup's populated
css_sets) and subtree proper (populated children) so that it's only
zero when the whole subtree, including self, is empty.
This patch splits the counter into two so that local and children
populated states are tracked separately. It allows finer-grained
tests on the state of the hierarchy which will be used to replace
css_set walking local populated test.
Signed-off-by: Tejun Heo <tj@kernel.org>
In most cases, a cgroup controller don't care about the liftimes of
cgroups. For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.
However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not. Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period. The effects of cgroup
removals are delayed when seen from userland.
This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending. This gets rid of the userland visible
delays.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup updates from Tejun Heo:
"Nothing major. Two notable fixes are Li's second stab at fixing the
long-standing race condition in the mount path and suppression of
spurious warning from cgroup_get(). All other changes are trivial"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark cgroup_get() with __maybe_unused
cgroup: avoid attaching a cgroup root to two different superblocks, take 2
cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
cgroup: move cgroup_subsys_state parent field for cache locality
cpuset: Remove cpuset_update_active_cpus()'s parameter.
cgroup: switch to BUG_ON()
cgroup: drop duplicate header nsproxy.h
kernel: convert css_set.refcount from atomic_t to refcount_t
kernel: convert cgroup_namespace.count from atomic_t to refcount_t
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup updates from Tejun Heo:
- tracepoints for basic cgroup management operations added
- kernfs and cgroup path formatting functions updated to behave in the
style of strlcpy()
- non-critical bug fixes
* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
cpuset: fix error handling regression in proc_cpuset_show()
cgroup: add tracepoints for basic operations
cgroup: make cgroup_path() and friends behave in the style of strlcpy()
kernfs: remove kernfs_path_len()
kernfs: make kernfs_path*() behave in the style of strlcpy()
kernfs: add dummy implementation of kernfs_path_from_node()
Pull namespace updates from Eric Biederman:
"This set of changes is a number of smaller things that have been
overlooked in other development cycles focused on more fundamental
change. The devpts changes are small things that were a distraction
until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
trivial regression fix to autofs for the unprivileged mount changes
that went in last cycle. A pair of ioctls has been added by Andrey
Vagin making it is possible to discover the relationships between
namespaces when referring to them through file descriptors.
The big user visible change is starting to add simple resource limits
to catch programs that misbehave. With namespaces in general and user
namespaces in particular allowing users to use more kinds of
resources, it has become important to have something to limit errant
programs. Because the purpose of these limits is to catch errant
programs the code needs to be inexpensive to use as it always on, and
the default limits need to be high enough that well behaved programs
on well behaved systems don't encounter them.
To this end, after some review I have implemented per user per user
namespace limits, and use them to limit the number of namespaces. The
limits being per user mean that one user can not exhause the limits of
another user. The limits being per user namespace allow contexts where
the limit is 0 and security conscious folks can remove from their
threat anlysis the code used to manage namespaces (as they have
historically done as it root only). At the same time the limits being
per user namespace allow other parts of the system to use namespaces.
Namespaces are increasingly being used in application sand boxing
scenarios so an all or nothing disable for the entire system for the
security conscious folks makes increasing use of these sandboxes
impossible.
There is also added a limit on the maximum number of mounts present in
a single mount namespace. It is nontrivial to guess what a reasonable
system wide limit on the number of mount structure in the kernel would
be, especially as it various based on how a system is using
containers. A limit on the number of mounts in a mount namespace
however is much easier to understand and set. In most cases in
practice only about 1000 mounts are used. Given that some autofs
scenarious have the potential to be 30,000 to 50,000 mounts I have set
the default limit for the number of mounts at 100,000 which is well
above every known set of users but low enough that the mount hash
tables don't degrade unreaonsably.
These limits are a start. I expect this estabilishes a pattern that
other limits for resources that namespaces use will follow. There has
been interest in making inotify event limits per user per user
namespace as well as interest expressed in making details about what
is going on in the kernel more visible"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
autofs: Fix automounts by using current_real_cred()->uid
mnt: Add a per mount namespace limit on the number of mounts
netns: move {inc,dec}_net_namespaces into #ifdef
nsfs: Simplify __ns_get_path
tools/testing: add a test to check nsfs ioctl-s
nsfs: add ioctl to get a parent namespace
nsfs: add ioctl to get an owning user namespace for ns file descriptor
kernel: add a helper to get an owning user namespace for a namespace
devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
devpts: Remove sync_filesystems
devpts: Make devpts_kill_sb safe if fsi is NULL
devpts: Simplify devpts_mount by using mount_nodev
devpts: Move the creation of /dev/pts/ptmx into fill_super
devpts: Move parse_mount_options into fill_super
userns: When the per user per user namespace limit is reached return ENOSPC
userns; Document per user per user namespace limits.
mntns: Add a limit on the number of mount namespaces.
netns: Add a limit on the number of net namespaces
cgroupns: Add a limit on the number of cgroup namespaces
ipcns: Add a limit on the number of ipc namespaces
...
This commit adds an inline function to cgroup.h to check whether a given
task is under a given cgroup hierarchy. This is to avoid having to put
ifdefs in .c files to gate access to cgroups. When cgroups are disabled
this always returns true.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
cgroup_path() and friends used to format the path from the end and
thus the resulting path usually didn't start at the start of the
passed in buffer. Also, when the buffer was too small, the partial
result was truncated from the head rather than tail and there was no
way to tell how long the full path would be. These make the functions
less robust and more awkward to use.
With recent updates to kernfs_path(), cgroup_path() and friends can be
made to behave in strlcpy() style.
* cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
always return the length of the full path. If buffer is too small,
it contains nul terminated truncated output.
* All users updated accordingly.
v2: cgroup_path() usage in kernel/sched/debug.c converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
kernfs_path*() functions always return the length of the full path but
the path content is undefined if the length is larger than the
provided buffer. This makes its behavior different from strlcpy() and
requires error handling in all its users even when they don't care
about truncation. In addition, the implementation can actully be
simplified by making it behave properly in strlcpy() style.
* Update kernfs_path_from_node_locked() to always fill up the buffer
with path. If the buffer is not large enough, the output is
truncated and terminated.
* kernfs_path() no longer needs error handling. Make it a simple
inline wrapper around kernfs_path_from_node().
* sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
Updated accordingly.
* cgroup_path()'s use of kernfs_path() updated to retain the old
behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>