Discussion:
[PATCH v3 2/7] block_dev: Check permissions towards block device inode when mounting
Seth Forshee
2015-11-17 16:39:05 UTC
Permalink
Unprivileged users should not be able to mount block devices when
they lack sufficient privileges towards the block device inode.
Update blkdev_get_by_path() to validate that the user has the
required access to the inode at the specified path. The check
will be skipped for CAP_SYS_ADMIN, so privileged mounts will
continue working as before.

Signed-off-by: Seth Forshee <***@canonical.com>
---
fs/block_dev.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f1f0aa7214a3..54d94cd64577 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1394,9 +1394,14 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
void *holder)
{
struct block_device *bdev;
+ int perm = 0;
int err;

- bdev = lookup_bdev(path, 0);
+ if (mode & FMODE_READ)
+ perm |= MAY_READ;
+ if (mode & FMODE_WRITE)
+ perm |= MAY_WRITE;
+ bdev = lookup_bdev(path, perm);
if (IS_ERR(bdev))
return bdev;
--
1.9.1
Seth Forshee
2015-11-17 16:39:07 UTC
Permalink
From: Andy Lutomirski <***@amacapital.net>

If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem. Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.

Signed-off-by: Andy Lutomirski <***@amacapital.net>
Signed-off-by: Seth Forshee <***@canonical.com>
---
fs/exec.c | 2 +-
fs/namespace.c | 13 +++++++++++++
include/linux/mount.h | 1 +
security/commoncap.c | 2 +-
security/selinux/hooks.c | 2 +-
5 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index b06623a9347f..ea7311d72cc3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1295,7 +1295,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
bprm->cred->euid = current_euid();
bprm->cred->egid = current_egid();

- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return;

if (task_no_new_privs(current))
diff --git a/fs/namespace.c b/fs/namespace.c
index da70f7c4ece1..2101ce7b96ab 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3276,6 +3276,19 @@ found:
return visible;
}

+bool mnt_may_suid(struct vfsmount *mnt)
+{
+ /*
+ * Foreign mounts (accessed via fchdir or through /proc
+ * symlinks) are always treated as if they are nosuid. This
+ * prevents namespaces from trusting potentially unsafe
+ * suid/sgid bits, file caps, or security labels that originate
+ * in other namespaces.
+ */
+ return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) &&
+ in_userns(current_user_ns(), mnt->mnt_sb->s_user_ns);
+}
+
static struct ns_common *mntns_get(struct task_struct *task)
{
struct ns_common *ns = NULL;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..54a594d49733 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
extern struct vfsmount *mntget(struct vfsmount *mnt);
extern struct vfsmount *mnt_clone_internal(struct path *path);
extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool mnt_may_suid(struct vfsmount *mnt);

struct path;
extern struct vfsmount *clone_private_mount(struct path *path);
diff --git a/security/commoncap.c b/security/commoncap.c
index 400aa224b491..6243aef5860e 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -448,7 +448,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
if (!file_caps_enabled)
return 0;

- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return 0;
if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
return 0;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index e4369d86e588..de05207eb665 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2171,7 +2171,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm,
const struct task_security_struct *new_tsec)
{
int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
- int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
+ int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
int rc;

if (!nnp && !nosuid)
--
1.9.1
James Morris
2015-11-18 00:00:53 UTC
Permalink
Post by Seth Forshee
If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem. Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.
This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.
This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.
As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.
On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.
As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.
Acked-by: James Morris <***@oracle.com>
--
James Morris
<***@namei.org>
Seth Forshee
2015-11-17 16:39:06 UTC
Permalink
Unprivileged users should not be able to mount mtd block devices
when they lack sufficient privileges towards the block device
inode. Update mount_mtd() to validate that the user has the
required access to the inode at the specified path. The check
will be skipped for CAP_SYS_ADMIN, so privileged mounts will
continue working as before.

Signed-off-by: Seth Forshee <***@canonical.com>
---
drivers/mtd/mtdsuper.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index b5b60e1af31c..5d7e7705fed8 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -125,6 +125,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
#ifdef CONFIG_BLOCK
struct block_device *bdev;
int ret, major;
+ int perm;
#endif
int mtdnr;

@@ -176,7 +177,10 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
/* try the old way - the hack where we allowed users to mount
* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
*/
- bdev = lookup_bdev(dev_name, 0);
+ perm = MAY_READ;
+ if (!(flags & MS_RDONLY))
+ perm |= MAY_WRITE;
+ bdev = lookup_bdev(dev_name, perm);
if (IS_ERR(bdev)) {
ret = PTR_ERR(bdev);
pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
--
1.9.1
Seth Forshee
2015-11-17 16:39:09 UTC
Permalink
All current callers of in_userns pass current_user_ns as the
first argument. Simplify by replacing in_userns with
current_in_userns which checks whether current_user_ns is in the
namespace supplied as an argument.

Signed-off-by: Seth Forshee <***@canonical.com>
---
fs/namespace.c | 2 +-
include/linux/user_namespace.h | 6 ++----
kernel/user_namespace.c | 6 +++---
security/commoncap.c | 2 +-
4 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2101ce7b96ab..18fc58760aec 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3286,7 +3286,7 @@ bool mnt_may_suid(struct vfsmount *mnt)
* in other namespaces.
*/
return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) &&
- in_userns(current_user_ns(), mnt->mnt_sb->s_user_ns);
+ current_in_userns(mnt->mnt_sb->s_user_ns);
}

static struct ns_common *mntns_get(struct task_struct *task)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index a43faa727124..9217169c64cb 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -72,8 +72,7 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
extern int proc_setgroups_show(struct seq_file *m, void *v);
extern bool userns_may_setgroups(const struct user_namespace *ns);
-extern bool in_userns(const struct user_namespace *ns,
- const struct user_namespace *target_ns);
+extern bool current_in_userns(const struct user_namespace *target_ns);
#else

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -103,8 +102,7 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
return true;
}

-static inline bool in_userns(const struct user_namespace *ns,
- const struct user_namespace *target_ns)
+static inline bool current_in_userns(const struct user_namespace *target_ns)
{
return true;
}
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 69fbc377357b..5960edc7e644 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -949,10 +949,10 @@ bool userns_may_setgroups(const struct user_namespace *ns)
* Returns true if @ns is the same namespace as or a descendant of
* @target_ns.
*/
-bool in_userns(const struct user_namespace *ns,
- const struct user_namespace *target_ns)
+bool current_in_userns(const struct user_namespace *target_ns)
{
- for (; ns; ns = ns->parent) {
+ struct user_namespace *ns;
+ for (ns = current_user_ns(); ns; ns = ns->parent) {
if (ns == target_ns)
return true;
}
diff --git a/security/commoncap.c b/security/commoncap.c
index 6243aef5860e..2119421613f6 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -450,7 +450,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c

if (!mnt_may_suid(bprm->file->f_path.mnt))
return 0;
- if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
+ if (!current_in_userns(bprm->file->f_path.mnt->mnt_sb->s_user_ns))
return 0;

rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
--
1.9.1
James Morris
2015-11-18 00:03:58 UTC
Permalink
Post by Seth Forshee
All current callers of in_userns pass current_user_ns as the
first argument. Simplify by replacing in_userns with
current_in_userns which checks whether current_user_ns is in the
namespace supplied as an argument.
Nice cleanup.


Acked-by: James Morris <***@oracle.com>
--
James Morris
<***@namei.org>
Seth Forshee
2015-11-17 16:39:10 UTC
Permalink
The SMACK64, SMACK64EXEC, and SMACK64MMAP labels are all handled
differently in untrusted mounts. This is confusing and
potentically problematic. Change this to handle them all the same
way that SMACK64 is currently handled; that is, read the label
from disk and check it at use time. For SMACK64 and SMACK64MMAP
access is denied if the label does not match smk_root. To be
consistent with suid, a SMACK64EXEC label which does not match
smk_root will still allow execution of the file but will not run
with the label supplied in the xattr.

Signed-off-by: Seth Forshee <***@canonical.com>
---
security/smack/smack_lsm.c | 29 +++++++++++++++++++----------
1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 621200f86b56..9b7ff781df9a 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -891,6 +891,7 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm)
struct inode *inode = file_inode(bprm->file);
struct task_smack *bsp = bprm->cred->security;
struct inode_smack *isp;
+ struct superblock_smack *sbsp;
int rc;

if (bprm->cred_prepared)
@@ -900,6 +901,11 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm)
if (isp->smk_task == NULL || isp->smk_task == bsp->smk_task)
return 0;

+ sbsp = inode->i_sb->s_security;
+ if ((sbsp->smk_flags & SMK_SB_UNTRUSTED) &&
+ isp->smk_task != sbsp->smk_root)
+ return 0;
+
if (bprm->unsafe & (LSM_UNSAFE_PTRACE | LSM_UNSAFE_PTRACE_CAP)) {
struct task_struct *tracer;
rc = 0;
@@ -1703,6 +1709,7 @@ static int smack_mmap_file(struct file *file,
struct task_smack *tsp;
struct smack_known *okp;
struct inode_smack *isp;
+ struct superblock_smack *sbsp;
int may;
int mmay;
int tmay;
@@ -1714,6 +1721,10 @@ static int smack_mmap_file(struct file *file,
isp = file_inode(file)->i_security;
if (isp->smk_mmap == NULL)
return 0;
+ sbsp = file_inode(file)->i_sb->s_security;
+ if (sbsp->smk_flags & SMK_SB_UNTRUSTED &&
+ isp->smk_mmap != sbsp->smk_root)
+ return -EACCES;
mkp = isp->smk_mmap;

tsp = current_security();
@@ -3492,16 +3503,14 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
if (rc >= 0)
transflag = SMK_INODE_TRANSMUTE;
}
- if (!(sbsp->smk_flags & SMK_SB_UNTRUSTED)) {
- /*
- * Don't let the exec or mmap label be "*" or "@".
- */
- skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
- if (IS_ERR(skp) || skp == &smack_known_star ||
- skp == &smack_known_web)
- skp = NULL;
- isp->smk_task = skp;
- }
+ /*
+ * Don't let the exec or mmap label be "*" or "@".
+ */
+ skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+ if (IS_ERR(skp) || skp == &smack_known_star ||
+ skp == &smack_known_web)
+ skp = NULL;
+ isp->smk_task = skp;

skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
if (IS_ERR(skp) || skp == &smack_known_star ||
--
1.9.1
Casey Schaufler
2015-11-17 18:24:10 UTC
Permalink
Post by Seth Forshee
The SMACK64, SMACK64EXEC, and SMACK64MMAP labels are all handled
differently in untrusted mounts. This is confusing and
potentically problematic. Change this to handle them all the same
way that SMACK64 is currently handled; that is, read the label
from disk and check it at use time. For SMACK64 and SMACK64MMAP
access is denied if the label does not match smk_root. To be
consistent with suid, a SMACK64EXEC label which does not match
smk_root will still allow execution of the file but will not run
with the label supplied in the xattr.
---
security/smack/smack_lsm.c | 29 +++++++++++++++++++----------
1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 621200f86b56..9b7ff781df9a 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -891,6 +891,7 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm)
struct inode *inode = file_inode(bprm->file);
struct task_smack *bsp = bprm->cred->security;
struct inode_smack *isp;
+ struct superblock_smack *sbsp;
int rc;
if (bprm->cred_prepared)
@@ -900,6 +901,11 @@ static int smack_bprm_set_creds(struct linux_binprm *bprm)
if (isp->smk_task == NULL || isp->smk_task == bsp->smk_task)
return 0;
+ sbsp = inode->i_sb->s_security;
+ if ((sbsp->smk_flags & SMK_SB_UNTRUSTED) &&
+ isp->smk_task != sbsp->smk_root)
+ return 0;
+
if (bprm->unsafe & (LSM_UNSAFE_PTRACE | LSM_UNSAFE_PTRACE_CAP)) {
struct task_struct *tracer;
rc = 0;
@@ -1703,6 +1709,7 @@ static int smack_mmap_file(struct file *file,
struct task_smack *tsp;
struct smack_known *okp;
struct inode_smack *isp;
+ struct superblock_smack *sbsp;
int may;
int mmay;
int tmay;
@@ -1714,6 +1721,10 @@ static int smack_mmap_file(struct file *file,
isp = file_inode(file)->i_security;
if (isp->smk_mmap == NULL)
return 0;
+ sbsp = file_inode(file)->i_sb->s_security;
+ if (sbsp->smk_flags & SMK_SB_UNTRUSTED &&
+ isp->smk_mmap != sbsp->smk_root)
+ return -EACCES;
mkp = isp->smk_mmap;
tsp = current_security();
@@ -3492,16 +3503,14 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
if (rc >= 0)
transflag = SMK_INODE_TRANSMUTE;
}
- if (!(sbsp->smk_flags & SMK_SB_UNTRUSTED)) {
- /*
- */
- skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
- if (IS_ERR(skp) || skp == &smack_known_star ||
- skp == &smack_known_web)
- skp = NULL;
- isp->smk_task = skp;
- }
+ /*
+ */
+ skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+ if (IS_ERR(skp) || skp == &smack_known_star ||
+ skp == &smack_known_web)
+ skp = NULL;
+ isp->smk_task = skp;
skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
if (IS_ERR(skp) || skp == &smack_known_star ||
James Morris
2015-11-18 00:12:51 UTC
Permalink
Post by Seth Forshee
+ sbsp = inode->i_sb->s_security;
+ if ((sbsp->smk_flags & SMK_SB_UNTRUSTED) &&
Where is SMK_SB_UNTRUSTED defined?

I can't see it in this patch series, mainline or security next.
--
James Morris
<***@namei.org>
Seth Forshee
2015-11-18 00:50:52 UTC
Permalink
Post by James Morris
Post by Seth Forshee
+ sbsp = inode->i_sb->s_security;
+ if ((sbsp->smk_flags & SMK_SB_UNTRUSTED) &&
Where is SMK_SB_UNTRUSTED defined?
I can't see it in this patch series, mainline or security next.
Eric's already applied a few of my patches to a branch in his tree,
including the one that defines this. See the for-testing branch of
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git,
specifically
https://git.kernel.org/cgit/linux/kernel/git/ebiederm/user-namespace.git/commit/?id=36979551f2345d4e20714c1fc53d6a15c87f8b64.

Thanks,
Seth
Seth Forshee
2015-11-17 16:39:04 UTC
Permalink
When looking up a block device by path no permission check is
done to verify that the user has access to the block device inode
at the specified path. In some cases it may be necessary to
check permissions towards the inode, such as allowing
unprivileged users to mount block devices in user namespaces.

Add an argument to lookup_bdev() to optionally perform this
permission check. A value of 0 skips the permission check and
behaves the same as before. A non-zero value specifies the mask
of access rights required towards the inode at the specified
path. The check is always skipped if the user has CAP_SYS_ADMIN.

All callers of lookup_bdev() currently pass a mask of 0, so this
patch results in no functional change. Subsequent patches will
add permission checks where appropriate.

Signed-off-by: Seth Forshee <***@canonical.com>
---
drivers/md/bcache/super.c | 2 +-
drivers/md/dm-table.c | 2 +-
drivers/mtd/mtdsuper.c | 2 +-
fs/block_dev.c | 13 ++++++++++---
fs/quota/quota.c | 2 +-
include/linux/fs.h | 2 +-
6 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 679a093a3bf6..e8287b0d1dac 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1926,7 +1926,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
sb);
if (IS_ERR(bdev)) {
if (bdev == ERR_PTR(-EBUSY)) {
- bdev = lookup_bdev(strim(path));
+ bdev = lookup_bdev(strim(path), 0);
mutex_lock(&bch_register_lock);
if (!IS_ERR(bdev) && bch_is_open(bdev))
err = "device already registered";
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index e76ed003769e..35bb3ea4cbe2 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -380,7 +380,7 @@ int dm_get_device(struct dm_target *ti, const char *path, fmode_t mode,
BUG_ON(!t);

/* convert the path to a device */
- bdev = lookup_bdev(path);
+ bdev = lookup_bdev(path, 0);
if (IS_ERR(bdev)) {
dev = name_to_dev_t(path);
if (!dev)
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 20c02a3b7417..b5b60e1af31c 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -176,7 +176,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
/* try the old way - the hack where we allowed users to mount
* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
*/
- bdev = lookup_bdev(dev_name);
+ bdev = lookup_bdev(dev_name, 0);
if (IS_ERR(bdev)) {
ret = PTR_ERR(bdev);
pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 26cee058dc02..f1f0aa7214a3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1396,7 +1396,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
struct block_device *bdev;
int err;

- bdev = lookup_bdev(path);
+ bdev = lookup_bdev(path, 0);
if (IS_ERR(bdev))
return bdev;

@@ -1706,12 +1706,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
/**
* lookup_bdev - lookup a struct block_device by name
* @pathname: special file representing the block device
+ * @mask: rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
*
* Get a reference to the blockdevice at @pathname in the current
* namespace if possible and return it. Return ERR_PTR(error)
- * otherwise.
+ * otherwise. If @mask is non-zero, check for access rights to the
+ * inode at @pathname.
*/
-struct block_device *lookup_bdev(const char *pathname)
+struct block_device *lookup_bdev(const char *pathname, int mask)
{
struct block_device *bdev;
struct inode *inode;
@@ -1726,6 +1728,11 @@ struct block_device *lookup_bdev(const char *pathname)
return ERR_PTR(error);

inode = d_backing_inode(path.dentry);
+ if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
+ error = __inode_permission(inode, mask);
+ if (error)
+ goto fail;
+ }
error = -ENOTBLK;
if (!S_ISBLK(inode->i_mode))
goto fail;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 3746367098fd..a40eaecbd5cc 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -733,7 +733,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)

if (IS_ERR(tmp))
return ERR_CAST(tmp);
- bdev = lookup_bdev(tmp->name);
+ bdev = lookup_bdev(tmp->name, 0);
putname(tmp);
if (IS_ERR(bdev))
return ERR_CAST(bdev);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 458ee7b213be..cc18dfb0b98e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2388,7 +2388,7 @@ static inline void unregister_chrdev(unsigned int major, const char *name)
#define BLKDEV_MAJOR_HASH_SIZE 255
extern const char *__bdevname(dev_t, char *buffer);
extern const char *bdevname(struct block_device *bdev, char *buffer);
-extern struct block_device *lookup_bdev(const char *);
+extern struct block_device *lookup_bdev(const char *, int mask);
extern void blkdev_show(struct seq_file *,off_t);

#else
--
1.9.1
Seth Forshee
2015-11-17 16:39:08 UTC
Permalink
Security labels from unprivileged mounts in user namespaces must
be ignored. Force superblocks from user namespaces whose labeling
behavior is to use xattrs to use mountpoint labeling instead.
For the mountpoint label, default to converting the current task
context into a form suitable for file objects, but also allow the
policy writer to specify a different label through policy
transition rules.

Pieced together from code snippets provided by Stephen Smalley.

Signed-off-by: Seth Forshee <***@canonical.com>
Acked-by: Stephen Smalley <***@tycho.nsa.gov>
---
security/selinux/hooks.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index de05207eb665..09be1dc21e58 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -756,6 +756,28 @@ static int selinux_set_mnt_opts(struct super_block *sb,
goto out;
}
}
+
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid) {
+ rc = -EACCES;
+ goto out;
+ }
+ if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
+ sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
+ rc = security_transition_sid(current_sid(), current_sid(),
+ SECCLASS_FILE, NULL,
+ &sbsec->mntpoint_sid);
+ if (rc)
+ goto out;
+ }
+ goto out_set_opts;
+ }
+
/* sets the context of the superblock for the fs being mounted. */
if (fscontext_sid) {
rc = may_context_mount_sb_relabel(fscontext_sid, sbsec, cred);
@@ -824,6 +846,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}

+out_set_opts:
rc = sb_finish_set_opts(sb);
out:
mutex_unlock(&sbsec->lock);
--
1.9.1
James Morris
2015-11-18 00:02:25 UTC
Permalink
Post by Seth Forshee
Security labels from unprivileged mounts in user namespaces must
be ignored. Force superblocks from user namespaces whose labeling
behavior is to use xattrs to use mountpoint labeling instead.
For the mountpoint label, default to converting the current task
context into a form suitable for file objects, but also allow the
policy writer to specify a different label through policy
transition rules.
Pieced together from code snippets provided by Stephen Smalley.
Acked-by: James Morris <***@oracle.com>
--
James Morris
<***@namei.org>
Al Viro
2015-11-17 17:05:56 UTC
Permalink
Hi Eric,
Here's another update to my patches for user namespace mounts, based on
your for-testing branch. These patches add safeguards necessary to allow
unprivileged mounts and update SELinux and Smack to safely handle
device-backed mounts from unprivileged users.
The v2 posting received very little in the way of feedback, so changes
are minimal. I've made a trivial style change to the Smack changes at
Casey's request, and I've added Stephen's ack for the SELinux changes.
Would you mind explaining which filesystem types do you plan to allow?
SELinux and the rest of Linux S&M bunch do fuck-all for attacks via
handcrafted fs image fed to the code in fs driver that does not expect
a given kind of inconsistencies.

As it is, validation of on-disk metadata is not particularly strong;
what's more, protection against concurrent malicious *changes* of
fs image (via direct writes by root) is simply inexistent.

So what is that about?
Seth Forshee
2015-11-17 17:25:51 UTC
Permalink
Post by Al Viro
Hi Eric,
Here's another update to my patches for user namespace mounts, based on
your for-testing branch. These patches add safeguards necessary to allow
unprivileged mounts and update SELinux and Smack to safely handle
device-backed mounts from unprivileged users.
The v2 posting received very little in the way of feedback, so changes
are minimal. I've made a trivial style change to the Smack changes at
Casey's request, and I've added Stephen's ack for the SELinux changes.
Would you mind explaining which filesystem types do you plan to allow?
SELinux and the rest of Linux S&M bunch do fuck-all for attacks via
handcrafted fs image fed to the code in fs driver that does not expect
a given kind of inconsistencies.
As it is, validation of on-disk metadata is not particularly strong;
what's more, protection against concurrent malicious *changes* of
fs image (via direct writes by root) is simply inexistent.
So what is that about?
The first target is fuse, which won't be vulnerable to those attacks.

Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.

Seth
Serge E. Hallyn
2015-11-17 17:45:00 UTC
Permalink
Post by Seth Forshee
Post by Al Viro
Hi Eric,
Here's another update to my patches for user namespace mounts, based on
your for-testing branch. These patches add safeguards necessary to allow
unprivileged mounts and update SELinux and Smack to safely handle
device-backed mounts from unprivileged users.
The v2 posting received very little in the way of feedback, so changes
are minimal. I've made a trivial style change to the Smack changes at
Casey's request, and I've added Stephen's ack for the SELinux changes.
Would you mind explaining which filesystem types do you plan to allow?
SELinux and the rest of Linux S&M bunch do fuck-all for attacks via
handcrafted fs image fed to the code in fs driver that does not expect
a given kind of inconsistencies.
As it is, validation of on-disk metadata is not particularly strong;
what's more, protection against concurrent malicious *changes* of
fs image (via direct writes by root) is simply inexistent.
So what is that about?
The first target is fuse, which won't be vulnerable to those attacks.
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
Hi,

Not privately, but during the 2014 kernel summit. The only documentation
of it I've seen is at the bottom of Paul's summary at
http://lwn.net/Articles/609376/ .
Al Viro
2015-11-17 17:55:06 UTC
Permalink
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Seth Forshee
2015-11-17 18:34:44 UTC
Permalink
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Richard Weinberger
2015-11-17 19:12:31 UTC
Permalink
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.

[1] https://lkml.org/lkml/2015/11/3/706
--
Thanks,
//richard
Seth Forshee
2015-11-17 19:21:41 UTC
Permalink
Post by Richard Weinberger
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.
I haven't looked at this project, but I'm guessing that programs must be
written specifically to make use of it? I.e. you can't just use the
mount syscall, and thus all existing software still doesn't work?
Post by Richard Weinberger
[1] https://lkml.org/lkml/2015/11/3/706
--
Thanks,
//richard
Octavian Purdila
2015-11-17 19:25:37 UTC
Permalink
On Tue, Nov 17, 2015 at 9:21 PM, Seth Forshee
Post by Seth Forshee
Post by Richard Weinberger
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.
I haven't looked at this project, but I'm guessing that programs must be
written specifically to make use of it? I.e. you can't just use the
mount syscall, and thus all existing software still doesn't work?
The projects includes a lklfuse program that uses fuse to mount a
fileystem image.
Richard Weinberger
2015-11-17 20:12:16 UTC
Permalink
Post by Octavian Purdila
On Tue, Nov 17, 2015 at 9:21 PM, Seth Forshee
Post by Seth Forshee
Post by Richard Weinberger
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.
I haven't looked at this project, but I'm guessing that programs must be
written specifically to make use of it? I.e. you can't just use the
mount syscall, and thus all existing software still doesn't work?
The projects includes a lklfuse program that uses fuse to mount a
fileystem image.
Cool. I gave it a try.
It seems to work fine, but only if I run it in foreground (using -d)
otherwise fuse blocks every filesystem request.

Thanks,
//richard
Octavian Purdila
2015-11-17 22:00:17 UTC
Permalink
Post by Richard Weinberger
Post by Octavian Purdila
On Tue, Nov 17, 2015 at 9:21 PM, Seth Forshee
Post by Seth Forshee
Post by Richard Weinberger
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.
I haven't looked at this project, but I'm guessing that programs must be
written specifically to make use of it? I.e. you can't just use the
mount syscall, and thus all existing software still doesn't work?
The projects includes a lklfuse program that uses fuse to mount a
fileystem image.
Cool. I gave it a try.
It seems to work fine, but only if I run it in foreground (using -d)
otherwise fuse blocks every filesystem request.
Now it should work in the background as well, thanks for reporting the issue.
Seth Forshee
2015-11-19 15:23:47 UTC
Permalink
Post by Octavian Purdila
Post by Richard Weinberger
Post by Octavian Purdila
On Tue, Nov 17, 2015 at 9:21 PM, Seth Forshee
Post by Seth Forshee
Post by Richard Weinberger
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.
I haven't looked at this project, but I'm guessing that programs must be
written specifically to make use of it? I.e. you can't just use the
mount syscall, and thus all existing software still doesn't work?
The projects includes a lklfuse program that uses fuse to mount a
fileystem image.
Cool. I gave it a try.
It seems to work fine, but only if I run it in foreground (using -d)
otherwise fuse blocks every filesystem request.
Now it should work in the background as well, thanks for reporting the issue.
I'm playing with lklfuse now, it's surprisingly easy to get up and
running. I did have a few problems though that I thought you'd like to
know about.

Unfortunately I still can't run it in background mode, I get a segfault.

It's working fine on light workloads, but I'm having issues when I start
trying to stress it. In a couple runs of the stress-ng filesystem
stressors I saw both stress-ng and lklfuse get stuck in uninterruptible
sleep during the first run, and during the second I got some OOM errors
in lklfuse followed by I/O errors and eventually a journal error that
cause the filesystem to go read-only.

The command I used for the first run was:

stress-ng --class filesystem --all 0

And for the second:

stress-ng --class filesystem --seq 0 -v -t 60

There really wasn't anything interesting in the lklfuse output for the
first run, but for the second run I pasted the output here:
http://paste.ubuntu.com/13346993/

I still need to compare this to other fuse filesystems since I haven't
tried this kind of stress test on any others.
Octavian Purdila
2015-11-19 16:19:10 UTC
Permalink
On Thu, Nov 19, 2015 at 5:23 PM, Seth Forshee
Post by Seth Forshee
Post by Octavian Purdila
Post by Richard Weinberger
Post by Octavian Purdila
On Tue, Nov 17, 2015 at 9:21 PM, Seth Forshee
Post by Seth Forshee
Post by Richard Weinberger
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.
I haven't looked at this project, but I'm guessing that programs must be
written specifically to make use of it? I.e. you can't just use the
mount syscall, and thus all existing software still doesn't work?
The projects includes a lklfuse program that uses fuse to mount a
fileystem image.
Cool. I gave it a try.
It seems to work fine, but only if I run it in foreground (using -d)
otherwise fuse blocks every filesystem request.
Now it should work in the background as well, thanks for reporting the issue.
Hi Seth,
Post by Seth Forshee
I'm playing with lklfuse now, it's surprisingly easy to get up and
running. I did have a few problems though that I thought you'd like to
know about.
Great, thanks for giving it a try and reporting the issues.
Post by Seth Forshee
Unfortunately I still can't run it in background mode, I get a segfault.
I got it to reproduce as well now. Not sure why how it worked before,
probably a race condition between lkl initialization and fuse calls.
Post by Seth Forshee
It's working fine on light workloads, but I'm having issues when I start
trying to stress it. In a couple runs of the stress-ng filesystem
stressors I saw both stress-ng and lklfuse get stuck in uninterruptible
sleep during the first run, and during the second I got some OOM errors
in lklfuse followed by I/O errors and eventually a journal error that
cause the filesystem to go read-only.
stress-ng --class filesystem --all 0
I will reproduce it and take a look.
Post by Seth Forshee
stress-ng --class filesystem --seq 0 -v -t 60
There really wasn't anything interesting in the lklfuse output for the
http://paste.ubuntu.com/13346993/
lklfuse allocates a fixed 100MB to the kernel and this is probably not
enough. For the short term I can add a parameter to lklfuse that
allows the user to specify the amount of memory to allocate to lkl. A
better fix would probably be to dynamically adjust the memory size of
lkl. I am thinking of using the ballon virtio driver or the memory
hotplug infrastructure. Any other suggestions?

I created a couple of issues in github [1] that you can track if you
want - I want to avoid spamming the list with reporting progress on
them.

[1] https://github.com/lkl/linux/issues
Seth Forshee
2015-11-19 16:31:34 UTC
Permalink
Post by Octavian Purdila
On Thu, Nov 19, 2015 at 5:23 PM, Seth Forshee
Post by Seth Forshee
Post by Octavian Purdila
Post by Richard Weinberger
Post by Octavian Purdila
On Tue, Nov 17, 2015 at 9:21 PM, Seth Forshee
Post by Seth Forshee
Post by Richard Weinberger
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.
I haven't looked at this project, but I'm guessing that programs must be
written specifically to make use of it? I.e. you can't just use the
mount syscall, and thus all existing software still doesn't work?
The projects includes a lklfuse program that uses fuse to mount a
fileystem image.
Cool. I gave it a try.
It seems to work fine, but only if I run it in foreground (using -d)
otherwise fuse blocks every filesystem request.
Now it should work in the background as well, thanks for reporting the issue.
Hi Seth,
Post by Seth Forshee
I'm playing with lklfuse now, it's surprisingly easy to get up and
running. I did have a few problems though that I thought you'd like to
know about.
Great, thanks for giving it a try and reporting the issues.
No problem, looks like a promising project.
Post by Octavian Purdila
Post by Seth Forshee
Unfortunately I still can't run it in background mode, I get a segfault.
I got it to reproduce as well now. Not sure why how it worked before,
probably a race condition between lkl initialization and fuse calls.
Post by Seth Forshee
It's working fine on light workloads, but I'm having issues when I start
trying to stress it. In a couple runs of the stress-ng filesystem
stressors I saw both stress-ng and lklfuse get stuck in uninterruptible
sleep during the first run, and during the second I got some OOM errors
in lklfuse followed by I/O errors and eventually a journal error that
cause the filesystem to go read-only.
stress-ng --class filesystem --all 0
I will reproduce it and take a look.
Post by Seth Forshee
stress-ng --class filesystem --seq 0 -v -t 60
There really wasn't anything interesting in the lklfuse output for the
http://paste.ubuntu.com/13346993/
lklfuse allocates a fixed 100MB to the kernel and this is probably not
enough. For the short term I can add a parameter to lklfuse that
allows the user to specify the amount of memory to allocate to lkl. A
better fix would probably be to dynamically adjust the memory size of
lkl. I am thinking of using the ballon virtio driver or the memory
hotplug infrastructure. Any other suggestions?
I created a couple of issues in github [1] that you can track if you
want - I want to avoid spamming the list with reporting progress on
them.
Makes sense, I'm watching those issues now and will direct any furure
discussion there. Thanks!
Serge E. Hallyn
2015-11-20 17:33:29 UTC
Permalink
Post by Octavian Purdila
On Thu, Nov 19, 2015 at 5:23 PM, Seth Forshee
Post by Seth Forshee
stress-ng --class filesystem --seq 0 -v -t 60
There really wasn't anything interesting in the lklfuse output for the
http://paste.ubuntu.com/13346993/
lklfuse allocates a fixed 100MB to the kernel and this is probably not
enough. For the short term I can add a parameter to lklfuse that
allows the user to specify the amount of memory to allocate to lkl. A
better fix would probably be to dynamically adjust the memory size of
lkl. I am thinking of using the ballon virtio driver or the memory
hotplug infrastructure. Any other suggestions?
Hi Octavian,

Like Seth said this was very nice to install and run. But I think I'm
doing something wrong. I cloned the lkl/linux git tree into a 9G ext4
loopback file, both one mounted with lkl and one mounted as loopback.
The results were worse than I expected - loopback was 0m10.078s for
clone and 0m0.877s for rm -rf; lkl was 5m5.625s for clone and 0m44.332s
for rm -rf. I assume this has to do with the 100MB buffer and subsequent
thrashing? How should I tune this to make it perform better?

No crashes from this, though.

thanks,
-serge

Richard Weinberger
2015-11-17 19:26:59 UTC
Permalink
Post by Seth Forshee
Post by Richard Weinberger
On Tue, Nov 17, 2015 at 7:34 PM, Seth Forshee
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
Do we *really* need to enable unprivileged mounting of kernel filesystems?
What about just enabling fuse and implement ext4 and friends as fuse
filesystems?
Using the approaching Linux Kernel Libary[1] this is easy.
I haven't looked at this project, but I'm guessing that programs must be
written specifically to make use of it? I.e. you can't just use the
mount syscall, and thus all existing software still doesn't work?
You can easy bridge fuse and the LKL, I did already a hacky PoC. Worked nicely.
Octavian's tree has some nice examples, and AFAIK he is currently working
on a "mount any kernel fs via fuse" tool.

Thanks,
//richard
Theodore Ts'o
2015-11-18 19:10:45 UTC
Permalink
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
I will fix bugs about static attacks. That is, it's interesting to me
that a buggy file system (no matter how it is created), not cause the
kernel to crash --- and privilege escalation attacks tend to be
strongly related to those bugs where we're not doing strong enough
checking.

Protecting against a malicious user which changes the image under the
file system is a whole other kettle of fish. I am not at all user you
can do this without completely sacrificing performance or making the
code impossible to maintain. So my comments do *not* extend to
protecting against a malicious user who is changing the block device
underneath the kernel.

If you want to submit patches to make the kernel more robust against
these attacks, I'm certainly willing to look at the patches. But I'm
certainly not guaranteeing that they will go in, and I'm certainly not
promising to fix all vulnerabilities that you might find that are
caused by a malicious block device. Sorry, that's too much buying a
pig in a poke....

- Ted
Seth Forshee
2015-11-18 19:28:55 UTC
Permalink
Post by Theodore Ts'o
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
I will fix bugs about static attacks. That is, it's interesting to me
that a buggy file system (no matter how it is created), not cause the
kernel to crash --- and privilege escalation attacks tend to be
strongly related to those bugs where we're not doing strong enough
checking.
Protecting against a malicious user which changes the image under the
file system is a whole other kettle of fish. I am not at all user you
can do this without completely sacrificing performance or making the
code impossible to maintain. So my comments do *not* extend to
protecting against a malicious user who is changing the block device
underneath the kernel.
If you want to submit patches to make the kernel more robust against
these attacks, I'm certainly willing to look at the patches. But I'm
certainly not guaranteeing that they will go in, and I'm certainly not
promising to fix all vulnerabilities that you might find that are
caused by a malicious block device. Sorry, that's too much buying a
pig in a poke....
Thanks Ted. My plan right now is to explore the possibility of blocking
writes to the backing store from userspace while it's mounted.
Serge Hallyn
2015-11-18 19:32:40 UTC
Permalink
Post by Theodore Ts'o
Post by Seth Forshee
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
Right now only static attacks, change-image-under-mounted-fs attacks
will be next.
I will fix bugs about static attacks. That is, it's interesting to me
that a buggy file system (no matter how it is created), not cause the
kernel to crash --- and privilege escalation attacks tend to be
strongly related to those bugs where we're not doing strong enough
checking.
Protecting against a malicious user which changes the image under the
file system is a whole other kettle of fish. I am not at all user you
can do this without completely sacrificing performance or making the
code impossible to maintain. So my comments do *not* extend to
protecting against a malicious user who is changing the block device
underneath the kernel.
Yup, thanks, Ted. I think the only sane thing to do is work on making the
mounted files immutable. Guarding against under-mounted-writes seems
crazy. Well, actually it seems like a fascinating problem, and maybe
solvable without fs changes, but not in scope here.
Post by Theodore Ts'o
If you want to submit patches to make the kernel more robust against
these attacks, I'm certainly willing to look at the patches. But I'm
certainly not guaranteeing that they will go in, and I'm certainly not
promising to fix all vulnerabilities that you might find that are
caused by a malicious block device. Sorry, that's too much buying a
pig in a poke....
- Ted
Austin S Hemmelgarn
2015-11-17 19:02:09 UTC
Permalink
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd need
some new concept of a userspace immutable file (that is, one where
nobody can write to it except the kernel, and only the kernel can change
it between regular access and this new state), and then have the kernel
set an image (or block device) to this state when a filesystem is
mounted from it (this introduces all kinds of other issues too however,
for example stuff that allows an online fsck on the device will stop
working, as will many un-deletion tools).

The only other option would be to force the FS to cache all metadata in
memory, and validate between the cache and what's on disk on every
access, which is not realistic for any real world system.

It's unfeasible from a practical standpoint to expect filesystems to
assume that stuff they write might change under them due to malicious
intent of a third party. Some filesystems may be more resilient to this
kind of attack (ZFS and BTRFS in some configurations come to mind), but
a determined attacker can still circumvent those protections (on at
least BTRFS, it's not all that hard to cause a sub-tree of the
filesystem to disappear with at most two 64k blocks being written to the
block device directly, and there is no way that this can be prevented
short of what I suggest above).
Seth Forshee
2015-11-17 19:16:06 UTC
Permalink
Post by Austin S Hemmelgarn
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd
need some new concept of a userspace immutable file (that is, one
where nobody can write to it except the kernel, and only the kernel
can change it between regular access and this new state), and then
have the kernel set an image (or block device) to this state when a
filesystem is mounted from it (this introduces all kinds of other
issues too however, for example stuff that allows an online fsck on
the device will stop working, as will many un-deletion tools).
Yeah, Serge and I were just tossing that idea around on irc. If we can
make that work then it's probably the best solution.

From a naive perspective it seems like all we really have to do is make
the block device inode immutable to userspace when it is mounted. And
the parent block device if it's a partition, which might be a bit
troublesome. We'd have to ensure that writes couldn't happen via any fds
already open when the device was mounted too.

We'd need some cooperation from the loop driver too I suppose, to make
sure the file backing the loop device is also immutable.
Austin S Hemmelgarn
2015-11-17 20:54:50 UTC
Permalink
Post by Seth Forshee
Post by Austin S Hemmelgarn
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd
need some new concept of a userspace immutable file (that is, one
where nobody can write to it except the kernel, and only the kernel
can change it between regular access and this new state), and then
have the kernel set an image (or block device) to this state when a
filesystem is mounted from it (this introduces all kinds of other
issues too however, for example stuff that allows an online fsck on
the device will stop working, as will many un-deletion tools).
Yeah, Serge and I were just tossing that idea around on irc. If we can
make that work then it's probably the best solution.
From a naive perspective it seems like all we really have to do is make
the block device inode immutable to userspace when it is mounted. And
the parent block device if it's a partition, which might be a bit
troublesome. We'd have to ensure that writes couldn't happen via any fds
already open when the device was mounted too.
We'd need some cooperation from the loop driver too I suppose, to make
sure the file backing the loop device is also immutable.
From a completeness perspective, you'd also need to hook into DM, MD,
and bcache to handle their backing devices. There's not much we could
do about iSCSI/ATAoE/NBD devices, and I think being able to restrict
stuff calling mount from a userns to only be able to use FUSE would
still be useful (FWIW, GRUB2 has a tool to use FUSE for testing it's own
filesystem drivers, which I use regularly when I just need a read-only
mount).
Seth Forshee
2015-11-17 21:32:55 UTC
Permalink
Post by Austin S Hemmelgarn
Post by Seth Forshee
Post by Austin S Hemmelgarn
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd
need some new concept of a userspace immutable file (that is, one
where nobody can write to it except the kernel, and only the kernel
can change it between regular access and this new state), and then
have the kernel set an image (or block device) to this state when a
filesystem is mounted from it (this introduces all kinds of other
issues too however, for example stuff that allows an online fsck on
the device will stop working, as will many un-deletion tools).
Yeah, Serge and I were just tossing that idea around on irc. If we can
make that work then it's probably the best solution.
From a naive perspective it seems like all we really have to do is make
the block device inode immutable to userspace when it is mounted. And
the parent block device if it's a partition, which might be a bit
troublesome. We'd have to ensure that writes couldn't happen via any fds
already open when the device was mounted too.
We'd need some cooperation from the loop driver too I suppose, to make
sure the file backing the loop device is also immutable.
From a completeness perspective, you'd also need to hook into DM,
MD, and bcache to handle their backing devices. There's not much we
could do about iSCSI/ATAoE/NBD devices, and I think being able to
But really no one would be able to mount any of those without
intervention from a privileged user anyway. The same is true today of
loop devices, but I have some patches to change that.
Post by Austin S Hemmelgarn
restrict stuff calling mount from a userns to only be able to use
FUSE would still be useful (FWIW, GRUB2 has a tool to use FUSE for
testing it's own filesystem drivers, which I use regularly when I
just need a read-only mount).
Agreed, fuse alone is very useful, though there is a performance
penalty.
Austin S Hemmelgarn
2015-11-18 12:23:48 UTC
Permalink
Post by Seth Forshee
Post by Austin S Hemmelgarn
Post by Seth Forshee
Post by Austin S Hemmelgarn
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd
need some new concept of a userspace immutable file (that is, one
where nobody can write to it except the kernel, and only the kernel
can change it between regular access and this new state), and then
have the kernel set an image (or block device) to this state when a
filesystem is mounted from it (this introduces all kinds of other
issues too however, for example stuff that allows an online fsck on
the device will stop working, as will many un-deletion tools).
Yeah, Serge and I were just tossing that idea around on irc. If we can
make that work then it's probably the best solution.
From a naive perspective it seems like all we really have to do is make
the block device inode immutable to userspace when it is mounted. And
the parent block device if it's a partition, which might be a bit
troublesome. We'd have to ensure that writes couldn't happen via any fds
already open when the device was mounted too.
We'd need some cooperation from the loop driver too I suppose, to make
sure the file backing the loop device is also immutable.
From a completeness perspective, you'd also need to hook into DM,
MD, and bcache to handle their backing devices. There's not much we
could do about iSCSI/ATAoE/NBD devices, and I think being able to
But really no one would be able to mount any of those without
intervention from a privileged user anyway. The same is true today of
loop devices, but I have some patches to change that.
Um, no, depending on how your system is configured, it's fully possible
to mount those as a regular user with no administrative interaction
required. All that needs done is some udev rules (or something
equivalent) to add ACL's to the device nodes allowing regular users to
access them (and there are systems I've seen that are naive enough to do
this). And I can almost assure you that there will be someone out there
who decides to directly expose such block devices to a user namespace.
Seth Forshee
2015-11-18 14:22:38 UTC
Permalink
Post by Austin S Hemmelgarn
Post by Seth Forshee
Post by Austin S Hemmelgarn
Post by Seth Forshee
Post by Austin S Hemmelgarn
Post by Al Viro
Post by Seth Forshee
Shortly after that I plan to follow with support for ext4. I've been
fuzzing ext4 for a while now and it has held up well, and I'm currently
working on hand-crafted attacks. Ted has commented privately (to others,
not to me personally) that he will fix bugs for such attacks, though I
haven't seen any public comments to that effect.
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd
need some new concept of a userspace immutable file (that is, one
where nobody can write to it except the kernel, and only the kernel
can change it between regular access and this new state), and then
have the kernel set an image (or block device) to this state when a
filesystem is mounted from it (this introduces all kinds of other
issues too however, for example stuff that allows an online fsck on
the device will stop working, as will many un-deletion tools).
Yeah, Serge and I were just tossing that idea around on irc. If we can
make that work then it's probably the best solution.
From a naive perspective it seems like all we really have to do is make
the block device inode immutable to userspace when it is mounted. And
the parent block device if it's a partition, which might be a bit
troublesome. We'd have to ensure that writes couldn't happen via any fds
already open when the device was mounted too.
We'd need some cooperation from the loop driver too I suppose, to make
sure the file backing the loop device is also immutable.
From a completeness perspective, you'd also need to hook into DM,
MD, and bcache to handle their backing devices. There's not much we
could do about iSCSI/ATAoE/NBD devices, and I think being able to
But really no one would be able to mount any of those without
intervention from a privileged user anyway. The same is true today of
loop devices, but I have some patches to change that.
Um, no, depending on how your system is configured, it's fully
possible to mount those as a regular user with no administrative
interaction required. All that needs done is some udev rules (or
something equivalent) to add ACL's to the device nodes allowing
regular users to access them (and there are systems I've seen that
are naive enough to do this). And I can almost assure you that
there will be someone out there who decides to directly expose such
block devices to a user namespace.
But it still requires the admin set it up that way, no? And aren't
privileges required to set up those devices in the first place?

I'm not saying that it wouldn't be a good idea to lock down the backing
stores for those types of devices too, just that it isn't something that
a regular user could exploit without an admin doing something to
facilitate it.
Al Viro
2015-11-18 14:58:18 UTC
Permalink
Post by Seth Forshee
But it still requires the admin set it up that way, no? And aren't
privileges required to set up those devices in the first place?
I'm not saying that it wouldn't be a good idea to lock down the backing
stores for those types of devices too, just that it isn't something that
a regular user could exploit without an admin doing something to
facilitate it.
Sigh... If it boils down to "all admins within all containers must be
trusted not to try and break out" (along with "roothole in any container
escalates to kernel-mode code execution on host"), then what the fuck
is the *point* of bothering with containers, userns, etc. in the first
place? If your model is basically "you want isolation, just use kvm",
fine, but where's the place for userns in all that?

And if you are talking about the _host_ admin, then WTF not have him just
mount what's needed as part of setup and to hell with mounting those
inside the container?

Look at that from the hosting company POV - they are offering a bunch of
virtual machines on one physical system. And you want the admins on those
virtual machines independent from the host admin. Fine, but then you
really need to keep them unable to screw each other or gain kernel-mode
execution on the host.

Again, what's the point of all that? I assumed the model where containers
do, you know, contain what's in them, regardless of trust. You guys seem
to assume something different and I really wonder what it _is_...
Seth Forshee
2015-11-18 15:05:12 UTC
Permalink
Post by Al Viro
Post by Seth Forshee
But it still requires the admin set it up that way, no? And aren't
privileges required to set up those devices in the first place?
I'm not saying that it wouldn't be a good idea to lock down the backing
stores for those types of devices too, just that it isn't something that
a regular user could exploit without an admin doing something to
facilitate it.
Sigh... If it boils down to "all admins within all containers must be
trusted not to try and break out" (along with "roothole in any container
escalates to kernel-mode code execution on host"), then what the fuck
is the *point* of bothering with containers, userns, etc. in the first
place? If your model is basically "you want isolation, just use kvm",
fine, but where's the place for userns in all that?
And if you are talking about the _host_ admin, then WTF not have him just
mount what's needed as part of setup and to hell with mounting those
inside the container?
Yes, the host admin. I'm not talking about trusting the admin inside the
container at all.

From my perspective the idea is essentially to allow mounting with fuse
or with ext4 using "mount -o loop ..." within a container.
Al Viro
2015-11-18 15:13:35 UTC
Permalink
Post by Seth Forshee
Yes, the host admin. I'm not talking about trusting the admin inside the
container at all.
Then why not have the same host admin just plain mount it when setting the
container up and be done with that? From the host namespace, before spawning
the docker instance or whatever framework you are using. IDGI...
Richard Weinberger
2015-11-18 15:19:56 UTC
Permalink
Post by Al Viro
Post by Seth Forshee
Yes, the host admin. I'm not talking about trusting the admin inside the
container at all.
Then why not have the same host admin just plain mount it when setting the
container up and be done with that? From the host namespace, before spawning
the docker instance or whatever framework you are using. IDGI...
Because hosting companies sell containers as "full virtual machines"
and customers expect to be able mount stuff like disk images they upload.
--
Thanks,
//richard
James Morris
2015-11-19 07:47:00 UTC
Permalink
Post by Richard Weinberger
Post by Al Viro
Post by Seth Forshee
Yes, the host admin. I'm not talking about trusting the admin inside the
container at all.
Then why not have the same host admin just plain mount it when setting the
container up and be done with that? From the host namespace, before spawning
the docker instance or whatever framework you are using. IDGI...
Because hosting companies sell containers as "full virtual machines"
and customers expect to be able mount stuff like disk images they upload.
I don't think this is a valid reason for merging functionality into the
kernel.
--
James Morris
<***@namei.org>
Richard Weinberger
2015-11-19 07:53:26 UTC
Permalink
Post by James Morris
Post by Richard Weinberger
Post by Al Viro
Post by Seth Forshee
Yes, the host admin. I'm not talking about trusting the admin inside the
container at all.
Then why not have the same host admin just plain mount it when setting the
container up and be done with that? From the host namespace, before spawning
the docker instance or whatever framework you are using. IDGI...
Because hosting companies sell containers as "full virtual machines"
and customers expect to be able mount stuff like disk images they upload.
I don't think this is a valid reason for merging functionality into the
kernel.
Erm, I don't want this in the kernel. That's why I've proposed the lklfuse approach.

Thanks,
//richard
Serge E. Hallyn
2015-11-19 14:21:13 UTC
Permalink
Post by Richard Weinberger
Post by James Morris
Post by Richard Weinberger
Post by Al Viro
Post by Seth Forshee
Yes, the host admin. I'm not talking about trusting the admin inside the
container at all.
Then why not have the same host admin just plain mount it when setting the
container up and be done with that? From the host namespace, before spawning
the docker instance or whatever framework you are using. IDGI...
Because hosting companies sell containers as "full virtual machines"
and customers expect to be able mount stuff like disk images they upload.
I don't think this is a valid reason for merging functionality into the
kernel.
Erm, I don't want this in the kernel. That's why I've proposed the lklfuse approach.
That would require the fuse-in-containers functionality, right?
Richard Weinberger
2015-11-19 15:04:06 UTC
Permalink
Post by Serge E. Hallyn
Post by Richard Weinberger
Post by James Morris
Post by Richard Weinberger
Post by Al Viro
Post by Seth Forshee
Yes, the host admin. I'm not talking about trusting the admin inside the
container at all.
Then why not have the same host admin just plain mount it when setting the
container up and be done with that? From the host namespace, before spawning
the docker instance or whatever framework you are using. IDGI...
Because hosting companies sell containers as "full virtual machines"
and customers expect to be able mount stuff like disk images they upload.
I don't think this is a valid reason for merging functionality into the
kernel.
Erm, I don't want this in the kernel. That's why I've proposed the lklfuse approach.
That would require the fuse-in-containers functionality, right?
Correct.
Still a less large attack surface. :-)

Thanks,
//richard
Colin Walters
2015-11-19 14:37:53 UTC
Permalink
Post by Richard Weinberger
Erm, I don't want this in the kernel. That's why I've proposed the lklfuse approach.
I already said this before but just to repeat, since I'm confused:

How would "lklfuse" be different from http://libguestfs.org/
which we at Red Hat (and a number of other organizations)
use quite widely now for build systems, debugging etc.

In the end it's just running the kernel in KVM with a custom protocol,
with support for non-filesystem things like "install a bootloader",
and it already supports FUSE.

I'm pretty firmly with Al here - the attack surface increase here
is too great, and we'd likely turn this off if it even did make it
into the kernel.
Richard Weinberger
2015-11-19 14:49:00 UTC
Permalink
Post by Colin Walters
Post by Richard Weinberger
Erm, I don't want this in the kernel. That's why I've proposed the lklfuse approach.
How would "lklfuse" be different from http://libguestfs.org/
which we at Red Hat (and a number of other organizations)
use quite widely now for build systems, debugging etc.
Currently libguestfs has a rather huge overhead because it
boots a full virtual machine and hence a lot of communication
is needed.
With LKL you can use Linux as Library and link it to fuse.
AFAIK Richard added already a LKL backend to libguestfs. :-)
Post by Colin Walters
In the end it's just running the kernel in KVM with a custom protocol,
with support for non-filesystem things like "install a bootloader",
and it already supports FUSE.
I'm pretty firmly with Al here - the attack surface increase here
is too great, and we'd likely turn this off if it even did make it
into the kernel.
Agreed. This is why I'm promoting the fuse solution.

Thanks,
//richard
Richard W.M. Jones
2015-11-19 15:17:33 UTC
Permalink
Post by Richard Weinberger
Post by Colin Walters
Post by Richard Weinberger
Erm, I don't want this in the kernel. That's why I've proposed the lklfuse approach.
How would "lklfuse" be different from http://libguestfs.org/
which we at Red Hat (and a number of other organizations)
use quite widely now for build systems, debugging etc.
Currently libguestfs has a rather huge overhead because it
boots a full virtual machine and hence a lot of communication
is needed.
With LKL you can use Linux as Library and link it to fuse.
AFAIK Richard added already a LKL backend to libguestfs. :-)
Right. For the longer story, see:

https://rwmj.wordpress.com/2015/11/07/linux-kernel-library-backend-for-libguestfs/#content

Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
Serge E. Hallyn
2015-11-19 14:58:13 UTC
Permalink
Post by Al Viro
Post by Seth Forshee
Yes, the host admin. I'm not talking about trusting the admin inside the
container at all.
Then why not have the same host admin just plain mount it when setting the
container up and be done with that? From the host namespace, before spawning
the docker instance or whatever framework you are using. IDGI...
fwiw one example use case is building a livecd or vm image from inside
container, on a system where each piece of functionality is
compartmentalized in a separate container.

For specific programs we can engineer them to call out to a helper
in the init user_ns, but that doesn't work for existing tools.

James Bottomley a year or two ago had mentioned the idea of using
seccomp to have the mount syscall in a container trap into a
init_user_ns helper, but we'd have to ptrace the whole container
for its whole lifetime to do that, or use SECCOMP_RET_TRAP with
a LD_PRELOAD that does goes through a proxy to do the remote
request (which becomes infeasible bc LD_PRELOAD won't stick).

If we did enable anything but FUSE I would hope each fs would have
a separate sysctl to enable non-init-userns mounts. Then the admin
could enable them just as they do the automount of whatever garbage
is on a usb stick found lying on the trade floor.

-serge
Nikolay Borisov
2015-11-18 15:36:49 UTC
Permalink
Post by Al Viro
Post by Seth Forshee
But it still requires the admin set it up that way, no? And aren't
privileges required to set up those devices in the first place?
I'm not saying that it wouldn't be a good idea to lock down the backing
stores for those types of devices too, just that it isn't something that
a regular user could exploit without an admin doing something to
facilitate it.
Sigh... If it boils down to "all admins within all containers must be
trusted not to try and break out" (along with "roothole in any container
escalates to kernel-mode code execution on host"), then what the fuck
is the *point* of bothering with containers, userns, etc. in the first
place? If your model is basically "you want isolation, just use kvm",
fine, but where's the place for userns in all that?
And if you are talking about the _host_ admin, then WTF not have him just
mount what's needed as part of setup and to hell with mounting those
inside the container?
Look at that from the hosting company POV - they are offering a bunch of
virtual machines on one physical system. And you want the admins on those
virtual machines independent from the host admin. Fine, but then you
really need to keep them unable to screw each other or gain kernel-mode
execution on the host.
Actually from the POV of a hosting company there's also the use case of
wanting to use container as substitutes for virtual machines (of course
we are a long way from that). But being able to do what those patches
enable (i.e. what Seth has pointed to with mount -o loop) is beneficial
and desirable.
Post by Al Viro
Again, what's the point of all that? I assumed the model where containers
do, you know, contain what's in them, regardless of trust. You guys seem
to assume something different and I really wonder what it _is_...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Austin S Hemmelgarn
2015-11-18 15:34:57 UTC
Permalink
Post by Al Viro
Post by Seth Forshee
But it still requires the admin set it up that way, no? And aren't
privileges required to set up those devices in the first place?
I'm not saying that it wouldn't be a good idea to lock down the backing
stores for those types of devices too, just that it isn't something that
a regular user could exploit without an admin doing something to
facilitate it.
Sigh... If it boils down to "all admins within all containers must be
trusted not to try and break out" (along with "roothole in any container
escalates to kernel-mode code execution on host"), then what the fuck
is the *point* of bothering with containers, userns, etc. in the first
place? If your model is basically "you want isolation, just use kvm",
fine, but where's the place for userns in all that?
In this case, Seth is referring to the host admin, not the container admin.
Post by Al Viro
And if you are talking about the _host_ admin, then WTF not have him just
mount what's needed as part of setup and to hell with mounting those
inside the container?
This is decidedly non-trivial to handle in some cases. IIRC, one of the
particular things that sparked this in the first place was the Chrome
Native Client having to have CAP_SYS_ADMIN or SUID set on it to handle
setting up it's own sandbox, which is not something that should ever be
set on an executable that runs untrusted code (which is the whole point
of NaCl).
Post by Al Viro
Look at that from the hosting company POV - they are offering a bunch of
virtual machines on one physical system. And you want the admins on those
virtual machines independent from the host admin. Fine, but then you
really need to keep them unable to screw each other or gain kernel-mode
execution on the host.
Again, what's the point of all that? I assumed the model where containers
do, you know, contain what's in them, regardless of trust. You guys seem
to assume something different and I really wonder what it _is_...
Yes, hosting and isolation of untrusted code are valid uses for
containers, which is why I suggested the ability to disallow mounts
other than FUSE, and make that the default state. There are other
perfectly valid uses for them as well, and for me the two I'm
particularly interested in are safe deployment of a new system from an
existing system (ala Gentoo or Arch installation, or manual installation
of *BSD), and running non-native distros without virtualization (On a
single user system, virtualization is overkill when all you want is a
Debian or Fedora or Arch testing environment and don't care about their
specific kernel features).
Al Viro
2015-11-17 19:30:12 UTC
Permalink
Post by Austin S Hemmelgarn
Post by Al Viro
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd
need some new concept of a userspace immutable file (that is, one
where nobody can write to it except the kernel, and only the kernel
can change it between regular access and this new state), and then
have the kernel set an image (or block device) to this state when a
filesystem is mounted from it (this introduces all kinds of other
issues too however, for example stuff that allows an online fsck on
the device will stop working, as will many un-deletion tools).
The only other option would be to force the FS to cache all metadata
in memory, and validate between the cache and what's on disk on
every access, which is not realistic for any real world system.
Doctor, it hurt when I do it...

IOW, the other option is to refuse attempting this insanity. Fuse probably
can be handled, but being able to mount (with kernel-space drivera) an
arbitrary ext4 image is equivalent to being able to do anything and it's
going to stay that way for the forseeable future. You are talking about
a large pile of code that deals with rather convoluted data structure,
had not been written with validation in mind *and* keeps being developed.
What's more, that code runs with maximal priveleges there are.

This is absolutely insane, no matter how much LSM snake oil you slatter on
the whole thing. All of a sudden you are exposing a huge attack surface
in the place where it would hurt most and as the consolation we are offered
basically "Ted is willing to fix holes when they are found".

I know that security community tends to be less than sane, but this really
takes the damn cake.

Al, still not quite able to believe this is not a badly mistimed AFD posting...
Austin S Hemmelgarn
2015-11-17 20:39:16 UTC
Permalink
Post by Al Viro
Post by Austin S Hemmelgarn
Post by Al Viro
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd
need some new concept of a userspace immutable file (that is, one
where nobody can write to it except the kernel, and only the kernel
can change it between regular access and this new state), and then
have the kernel set an image (or block device) to this state when a
filesystem is mounted from it (this introduces all kinds of other
issues too however, for example stuff that allows an online fsck on
the device will stop working, as will many un-deletion tools).
The only other option would be to force the FS to cache all metadata
in memory, and validate between the cache and what's on disk on
every access, which is not realistic for any real world system.
Doctor, it hurt when I do it...
IOW, the other option is to refuse attempting this insanity. Fuse probably
can be handled, but being able to mount (with kernel-space drivera) an
arbitrary ext4 image is equivalent to being able to do anything and it's
going to stay that way for the forseeable future. You are talking about
a large pile of code that deals with rather convoluted data structure,
had not been written with validation in mind *and* keeps being developed.
What's more, that code runs with maximal priveleges there are.
Without factoring in unprivileged mounts, for cases when mounting from a
block device, you shouldn't have to worry about a malicious third party,
because their ability to modify it under the filesystem implies that
they either already have root privileges, or they have direct access to
hardware, and in both cases, your system is already compromised to a
degree that makes the reliability of your filesystem irrelevant. Under
the same circumstances with filesystem images, the same statement applies.

However, while unprivileged mounts make validation more important, there
is still the fact that if you don't trust someone, then you shouldn't be
letting them have access to your system. No amount of sandboxing short
of full isolation can solve that, period. Yes people will try to crack
the system, but no matter how much sandboxing there is, it will not be
unbreakable unless nobody can access it.

The attack surface is already there, it's just hard to get to. There's
a reason I never mount anything I didn't create myself and can't prove
chain of custody on without at least running fsck on it first, and then
only mount it with the kernel if there isn't a FUSE driver for it that I
trust (and GRUB provides a lot of those).
Post by Al Viro
This is absolutely insane, no matter how much LSM snake oil you slatter on
the whole thing. All of a sudden you are exposing a huge attack surface
in the place where it would hurt most and as the consolation we are offered
basically "Ted is willing to fix holes when they are found".
For the context of static image attacks, anything that's found _needs_
to be fixed regardless, and unless you can find some way to actually
prevent attacks on mounted filesystems that doesn't involve a complete
re-write of the filesystem drivers, then there's not much we can do
about it. Yes, unprivileged mounts expose an attack surface, but so
does userspace access to the network stack, and so do a lot of other
features that are considered essential in a modern general purpose
operating system.
Al Viro
2015-11-17 21:05:42 UTC
Permalink
Post by Austin S Hemmelgarn
Post by Al Viro
This is absolutely insane, no matter how much LSM snake oil you slatter on
the whole thing. All of a sudden you are exposing a huge attack surface
in the place where it would hurt most and as the consolation we are offered
basically "Ted is willing to fix holes when they are found".
For the context of static image attacks, anything that's found
_needs_ to be fixed regardless, and unless you can find some way to
actually prevent attacks on mounted filesystems that doesn't involve
a complete re-write of the filesystem drivers, then there's not much
we can do about it. Yes, unprivileged mounts expose an attack
surface, but so does userspace access to the network stack, and so
do a lot of other features that are considered essential in a modern
general purpose operating system.
"X is exposes an attack surface. Y exposes a diferent attack surface.
Y is considered important. Therefore X is important enough to implement it"

Right...
Seth Forshee
2015-11-17 22:01:25 UTC
Permalink
Post by Al Viro
Post by Austin S Hemmelgarn
Post by Al Viro
This is absolutely insane, no matter how much LSM snake oil you slatter on
the whole thing. All of a sudden you are exposing a huge attack surface
in the place where it would hurt most and as the consolation we are offered
basically "Ted is willing to fix holes when they are found".
None of the LSM changes are intended to protect against attacks from
these sorts of attacks at all, so that's irrelevant.

As I said before, I'm also working to find holes up front. That plus a
commitment from the maintainer seems like a good start at least. What
bar would you set for a given filesystem to be considered "safe enough"?
Post by Al Viro
Post by Austin S Hemmelgarn
For the context of static image attacks, anything that's found
_needs_ to be fixed regardless, and unless you can find some way to
actually prevent attacks on mounted filesystems that doesn't involve
a complete re-write of the filesystem drivers, then there's not much
we can do about it. Yes, unprivileged mounts expose an attack
surface, but so does userspace access to the network stack, and so
do a lot of other features that are considered essential in a modern
general purpose operating system.
"X is exposes an attack surface. Y exposes a diferent attack surface.
Y is considered important. Therefore X is important enough to implement it"
Right...
That isn't the argument he made. I would summarize the argument as,
"Saying that X exposes an attack surface isn't by itself enough to
reject X, otherwise we wouldn't expose anything (such as example Y)."

You believe that the attack surface is too large, and that's
understandable. Is it your opinion that this is a fundamental problem
for an in-kernel filesystem driver, i.e. that we can never be confident
enough in an in-kernel filesystem parser to allow untrusted data? If
not, what would it take to establish a level of confidence that you
would be comfortable with?
Austin S Hemmelgarn
2015-11-18 12:46:53 UTC
Permalink
Post by Seth Forshee
Post by Al Viro
Post by Austin S Hemmelgarn
Post by Al Viro
This is absolutely insane, no matter how much LSM snake oil you slatter on
the whole thing. All of a sudden you are exposing a huge attack surface
in the place where it would hurt most and as the consolation we are offered
basically "Ted is willing to fix holes when they are found".
None of the LSM changes are intended to protect against attacks from
these sorts of attacks at all, so that's irrelevant.
As I said before, I'm also working to find holes up front. That plus a
commitment from the maintainer seems like a good start at least. What
bar would you set for a given filesystem to be considered "safe enough"?
Post by Al Viro
Post by Austin S Hemmelgarn
For the context of static image attacks, anything that's foun
_needs_ to be fixed regardless, and unless you can find some way to
actually prevent attacks on mounted filesystems that doesn't involve
a complete re-write of the filesystem drivers, then there's not much
we can do about it. Yes, unprivileged mounts expose an attack
surface, but so does userspace access to the network stack, and so
do a lot of other features that are considered essential in a modern
general purpose operating system.
"X is exposes an attack surface. Y exposes a diferent attack surface.
Y is considered important. Therefore X is important enough to implement it"
Right...
That isn't the argument he made. I would summarize the argument as,
"Saying that X exposes an attack surface isn't by itself enough to
reject X, otherwise we wouldn't expose anything (such as example Y)."
It's good to see someone understood my meaning...
Post by Seth Forshee
You believe that the attack surface is too large, and that's
understandable. Is it your opinion that this is a fundamental problem
for an in-kernel filesystem driver, i.e. that we can never be confident
enough in an in-kernel filesystem parser to allow untrusted data? If
not, what would it take to establish a level of confidence that you
would be comfortable with?
While I can't speak for Al's opinion on this, I would like to point out
Post by Seth Forshee
It's unfeasible from a practical standpoint to expect filesystems to
assume that stuff they write might change under them due to malicious
intent of a third party.
We can't protect against everything, not without making the system
completely unusable for general purpose computing. There is always some
degree of trust involved in usage of a computer, the OS has to trust
that the hardware works correctly, the administrator has to trust the OS
to behave correctly, and the users have to trust the administrator. The
administrator also needs to have at least some trust in the users,
otherwise he shouldn't be allowing them to use the system.

Perhaps we should have an option that can only be enabled on creation of
the userns that would allow it to use regular kernel mounts, and without
that option we default to only allowing FUSE and a couple of virtual
filesystems (like /proc and devtmpfs).
Seth Forshee
2015-11-18 14:30:22 UTC
Permalink
Post by Austin S Hemmelgarn
Post by Seth Forshee
Post by Al Viro
Post by Austin S Hemmelgarn
Post by Al Viro
This is absolutely insane, no matter how much LSM snake oil you slatter on
the whole thing. All of a sudden you are exposing a huge attack surface
in the place where it would hurt most and as the consolation we are offered
basically "Ted is willing to fix holes when they are found".
None of the LSM changes are intended to protect against attacks from
these sorts of attacks at all, so that's irrelevant.
As I said before, I'm also working to find holes up front. That plus a
commitment from the maintainer seems like a good start at least. What
bar would you set for a given filesystem to be considered "safe enough"?
Post by Al Viro
Post by Austin S Hemmelgarn
For the context of static image attacks, anything that's foun
_needs_ to be fixed regardless, and unless you can find some way to
actually prevent attacks on mounted filesystems that doesn't involve
a complete re-write of the filesystem drivers, then there's not much
we can do about it. Yes, unprivileged mounts expose an attack
surface, but so does userspace access to the network stack, and so
do a lot of other features that are considered essential in a modern
general purpose operating system.
"X is exposes an attack surface. Y exposes a diferent attack surface.
Y is considered important. Therefore X is important enough to implement it"
Right...
That isn't the argument he made. I would summarize the argument as,
"Saying that X exposes an attack surface isn't by itself enough to
reject X, otherwise we wouldn't expose anything (such as example Y)."
It's good to see someone understood my meaning...
Post by Seth Forshee
You believe that the attack surface is too large, and that's
understandable. Is it your opinion that this is a fundamental problem
for an in-kernel filesystem driver, i.e. that we can never be confident
enough in an in-kernel filesystem parser to allow untrusted data? If
not, what would it take to establish a level of confidence that you
would be comfortable with?
While I can't speak for Al's opinion on this, I would like to point
Post by Seth Forshee
It's unfeasible from a practical standpoint to expect filesystems
to > assume that stuff they write might change under them due to
malicious > intent of a third party.
So maybe the first requirement is that the user cannot modify the
backing store directly while the device is mounted.
Post by Austin S Hemmelgarn
We can't protect against everything, not without making the system
completely unusable for general purpose computing. There is always
some degree of trust involved in usage of a computer, the OS has to
trust that the hardware works correctly, the administrator has to
trust the OS to behave correctly, and the users have to trust the
administrator. The administrator also needs to have at least some
trust in the users, otherwise he shouldn't be allowing them to use
the system.
Perhaps we should have an option that can only be enabled on
creation of the userns that would allow it to use regular kernel
mounts, and without that option we default to only allowing FUSE and
a couple of virtual filesystems (like /proc and devtmpfs).
I've considered the idea of something more global like a sysctl, or a
per-filesystem knob in sysfs. I guess a per-container knob is another
option, I'm not sure what interface we use to expose it though.
Austin S Hemmelgarn
2015-11-18 15:38:58 UTC
Permalink
Post by Seth Forshee
Post by Austin S Hemmelgarn
Post by Seth Forshee
Post by Al Viro
Post by Austin S Hemmelgarn
Post by Al Viro
This is absolutely insane, no matter how much LSM snake oil you slatter on
the whole thing. All of a sudden you are exposing a huge attack surface
in the place where it would hurt most and as the consolation we are offered
basically "Ted is willing to fix holes when they are found".
None of the LSM changes are intended to protect against attacks from
these sorts of attacks at all, so that's irrelevant.
As I said before, I'm also working to find holes up front. That plus a
commitment from the maintainer seems like a good start at least. What
bar would you set for a given filesystem to be considered "safe enough"?
Post by Al Viro
Post by Austin S Hemmelgarn
For the context of static image attacks, anything that's foun
_needs_ to be fixed regardless, and unless you can find some way to
actually prevent attacks on mounted filesystems that doesn't involve
a complete re-write of the filesystem drivers, then there's not much
we can do about it. Yes, unprivileged mounts expose an attack
surface, but so does userspace access to the network stack, and so
do a lot of other features that are considered essential in a modern
general purpose operating system.
"X is exposes an attack surface. Y exposes a diferent attack surface.
Y is considered important. Therefore X is important enough to implement it"
Right...
That isn't the argument he made. I would summarize the argument as,
"Saying that X exposes an attack surface isn't by itself enough to
reject X, otherwise we wouldn't expose anything (such as example Y)."
It's good to see someone understood my meaning...
Post by Seth Forshee
You believe that the attack surface is too large, and that's
understandable. Is it your opinion that this is a fundamental problem
for an in-kernel filesystem driver, i.e. that we can never be confident
enough in an in-kernel filesystem parser to allow untrusted data? If
not, what would it take to establish a level of confidence that you
would be comfortable with?
While I can't speak for Al's opinion on this, I would like to point
Post by Seth Forshee
It's unfeasible from a practical standpoint to expect filesystems
to > assume that stuff they write might change under them due to
malicious > intent of a third party.
So maybe the first requirement is that the user cannot modify the
backing store directly while the device is mounted.
Post by Austin S Hemmelgarn
We can't protect against everything, not without making the system
completely unusable for general purpose computing. There is always
some degree of trust involved in usage of a computer, the OS has to
trust that the hardware works correctly, the administrator has to
trust the OS to behave correctly, and the users have to trust the
administrator. The administrator also needs to have at least some
trust in the users, otherwise he shouldn't be allowing them to use
the system.
Perhaps we should have an option that can only be enabled on
creation of the userns that would allow it to use regular kernel
mounts, and without that option we default to only allowing FUSE and
a couple of virtual filesystems (like /proc and devtmpfs).
I've considered the idea of something more global like a sysctl, or a
per-filesystem knob in sysfs. I guess a per-container knob is another
option, I'm not sure what interface we use to expose it though.
The most useful way I can see of implementing this would be to have an
option on container creation that controls whether kernel mounts are
allowed or not (possibly have it allow any of {no mounts, only FUSE
mounts, all mounts}), and then have a sysctl to set the default for
containers created without this option (and possibly one to force all
containers to ignore the option, and just use the default).
J. Bruce Fields
2015-11-18 18:44:08 UTC
Permalink
Post by Al Viro
Post by Austin S Hemmelgarn
Post by Al Viro
_Static_ attacks, or change-image-under-mounted-fs attacks?
To properly protect against attacks on mounted filesystems, we'd
need some new concept of a userspace immutable file (that is, one
where nobody can write to it except the kernel, and only the kernel
can change it between regular access and this new state), and then
have the kernel set an image (or block device) to this state when a
filesystem is mounted from it (this introduces all kinds of other
issues too however, for example stuff that allows an online fsck on
the device will stop working, as will many un-deletion tools).
The only other option would be to force the FS to cache all metadata
in memory, and validate between the cache and what's on disk on
every access, which is not realistic for any real world system.
Doctor, it hurt when I do it...
IOW, the other option is to refuse attempting this insanity. Fuse probably
can be handled, but being able to mount (with kernel-space drivera) an
arbitrary ext4 image is equivalent to being able to do anything and it's
going to stay that way for the forseeable future.
What about the filesystems that desktop users commonly mount? (fat,
isofs, udf?)

--b.
Loading...