Skip to content

Initial Rust support#4

Merged
ojeda merged 4 commits intorustfrom
rust-initial
Sep 12, 2020
Merged

Initial Rust support#4
ojeda merged 4 commits intorustfrom
rust-initial

Conversation

@ojeda
Copy link
Member

@ojeda ojeda commented Sep 4, 2020

Currently we have several lines of work:

  • Integrating with the kernel tree and build system (Nick's & mine, both uploaded as branches, based on rustc; and another one I have been working on, based on cargo).
  • Bindings and the first bits of functionality (Alex's & Geoffrey's, based on cargo).

This patch effectively merges the work we have been doing and integrates it in the latest mainline kernel tree.

This does not mean anything needs to stay as-is, but this gives us a working, common base to work and experiment upon. Hopefully, it will also attract external people to join!

As a summary, I added:

  • cargo integration with the kernel Makefiles:

    • Virtual cargo workspace to have a single lock file and to share deps between cargo jobs.
    • Picks the same optimization level as configured for C.
    • Verbose output on V=1.
    • A cargoclean target to clean all the Rust-related artifacts.
  • Initial support for built-in modules (i.e. Y as well as M):

    • It is a hack, we need to implement a few things (see the TODOs), but it is enough to start playing with things that depend on MODULE.
    • Passes --cfg module to built-in modules to be able to compile conditionally inside Rust.
    • Increased KSYM_NAME_LEN length to avoid warnings due to Rust long mangled symbols.
  • Rust infrastructure in a new top level folder rust/:

    • A kernel package which contains the sources from Alex & Geoffrey's work plus some changes:
      • Adapted build.rs.
      • Removed the THIS_MODULE emulation until it is implemented.
      • Removed Makefile logic and the code that cfg-depended on kernel version (no need in mainline).
      • Moved the helpers to be triggered via normal compilation, renamed them under rust_* and exported via EXPORT_SYMBOL instead.
      • Added a prelude.
    • A shlex package which serves as an example of an "inline" dependency (i.e. package checked out copy to avoid the network)
  • The example driver was setup at drivers/char/rust_example/.

  • Misc

    • The beginning of Documentation/rust/ with a quick start guide.
    • MAINTAINERS entry.
    • SPDXs for all files.

Other notes that aren't in TODOs:

  • We could minimize the network requirements (use bindgen binary, use more inline dependencies...), but it is not clear it would be a good idea for the moment due to core/alloc/compiler-builtins.

  • The intention of rust/ is to have a place to put extra dependencies and split the kernel package into several in the future if it grows. It could resemble the usual kernel tree structure.

  • With several drivers being built-in, cargo recompiles kernel and triggers duplicate symbol errors when merging thin archives. We need to make it only compile kernel once.

  • When the above works, then make's -j calling concurrent cargos (e.g. several drivers at the same time) should be OK, I think. According to Fix running Cargo concurrently rust-lang/cargo#2486 it shouldn't miscompile anything, but if they start locking on each other we will have make jobs waiting that could have been doing something else.

Signed-off-by: Miguel Ojeda miguel.ojeda.sandonis@gmail.com

Currently we have several lines of work:

  - Integrating with the kernel tree and build system (Nick's & mine,
    both uploaded as branches, based on `rustc`; and another one I have
    been working on, based on `cargo`).

  - Bindings and the first bits of functionality (Alex's & Geoffrey's,
    based on `cargo`).

This patch effectively merges the work we have been doing and
integrates it in the latest mainline kernel tree.

This does *not* mean anything needs to stay as-is, but this gives us
a working, common base to work and experiment upon. Hopefully, it will
also attract external people to join!

As a summary, I added:

  - `cargo` integration with the kernel `Makefile`s:
      + Virtual `cargo` workspace to have a single lock file and to share
        deps between `cargo` jobs.
      + Picks the same optimization level as configured for C.
      + Verbose output on `V=1`.
      + A `cargoclean` target to clean all the Rust-related artifacts.

  - Initial support for built-in modules (i.e. `Y` as well as `M`):
      + It is a hack, we need to implement a few things (see the `TODO`s),
        but it is enough to start playing with things that depend on `MODULE`.
      + Passes `--cfg module` to built-in modules to be able to compile
        conditionally inside Rust.
      + Increased `KSYM_NAME_LEN` length to avoid warnings due to Rust
        long mangled symbols.

  - Rust infrastructure in a new top level folder `rust/`:
      + A `kernel` package which contains the sources from Alex &
        Geoffrey's work plus some changes:
          * Adapted `build.rs`.
          * Removed the `THIS_MODULE` emulation until it is implemented.
          * Removed `Makefile` logic and the code that `cfg`-depended
            on kernel version (no need in mainline).
          * Moved the helpers to be triggered via normal compilation,
            renamed them under `rust_*` and exported via
            `EXPORT_SYMBOL` instead.
          * Added a prelude.
      + A `shlex` package which serves as an example of an "inline"
        dependency (i.e. package checked out copy to avoid the network)

  - The example driver was setup at `drivers/char/rust_example/`.

  - Misc
      + The beginning of `Documentation/rust/` with a quick start guide.
      + `MAINTAINERS` entry.
      + SPDXs for all files.

Other notes that aren't in `TODO`s:

  - We could minimize the network requirements (use `bindgen` binary, use more
    inline dependencies...), but it is not clear it would be a good idea for
    the moment due to `core`/`alloc`/`compiler-builtins`.

  - The intention of `rust/` is to have a place to put extra dependencies
    and split the `kernel` package into several in the future if it grows.
    It could resemble the usual kernel tree structure.

  - With several drivers being built-in, `cargo` recompiles `kernel`
    and triggers duplicate symbol errors when merging thin archives.
    We need to make it only compile `kernel` once.

  - When the above works, then `make`'s `-j` calling concurrent `cargo`s
    (e.g. several drivers at the same time) should be OK, I think.
    According to rust-lang/cargo#2486  it shouldn't
    miscompile anything, but if they start locking on each other
    we will have make jobs waiting that could have been doing something else.

Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
@ojeda
Copy link
Member Author

ojeda commented Sep 4, 2020

  • I added TODOs in some places of the code I touched, so take a look.
  • shlex is just an example (most likely we will let cargo do its job instead).
  • Concerning credit (especially for Geoffrey & Alex): I wrote "Rust for Linux Contributors" everywhere for the moment, but do not worry: we will mention everyone in the eventual PR, note individual contributions (including per patch/file if needed), etc.

Copy link
Member

@alex alex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the code that was directly copied too much. And my Makefile skills are subpar -- I barely understand how this KBuild integration works! But overall this looks amazing, thanks for driving this forward.

@alex
Copy link
Member

alex commented Sep 4, 2020

I'm not worried about credit at all. I'm so excited to see this moving forward!

@ojeda
Copy link
Member Author

ojeda commented Sep 4, 2020

I didn't review the code that was directly copied too much.

Yeah, I could have put an initial commit with just those files to make it show easier on GitHub's interface. Let me paste the manual diff here instead.

And my Makefile skills are subpar -- I barely understand how this KBuild integration works!

I don't think anybody understands it... ;) Later on we can ask Masahiro to give us a hand (e.g. to remove the hack I put there to make the distinction between built-in and module).

But overall this looks amazing, thanks for driving this forward.

You're welcome! And thanks for taking a look so quickly :-)

@ojeda
Copy link
Member Author

ojeda commented Sep 5, 2020

@alex @geofft The manual diff between the original files and the new ones (minus the repetitive SPDX ones):

diff -ur ./bindings_helper.h ../../linux/rust/kernel/src/bindings_helper.h
@@ -6,6 +8,5 @@
 #include <linux/uaccess.h>
 #include <linux/version.h>
 
-// Bindgen gets confused at certain things
-//
+// bindgen gets confused at certain things
 const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
diff -ur ./bindings.rs ../../linux/rust/kernel/src/bindings.rs
@@ -5,10 +7,10 @@
     non_snake_case,
     improper_ctypes
 )]
-mod bindings {
+mod bindings_raw {
     use crate::c_types;
-    include!(concat!(env!("OUT_DIR"), "/bindings.rs"));
+    include!("bindings_gen.rs");
 }
-pub use bindings::*;
+pub use bindings_raw::*;
 
 pub const GFP_KERNEL: gfp_t = BINDINGS_GFP_KERNEL;
diff -ur ./chrdev.rs ../../linux/rust/kernel/src/chrdev.rs
@@ -56,7 +58,8 @@
         for (i, file_op) in self.file_ops.iter().enumerate() {
             unsafe {
                 bindings::cdev_init(&mut cdevs[i], *file_op);
-                cdevs[i].owner = &mut bindings::__this_module;
+                // TODO: proper `THIS_MODULE` handling
+                cdevs[i].owner = core::ptr::null_mut();
                 let rc = bindings::cdev_add(&mut cdevs[i], dev + i as bindings::dev_t, 1);
                 if rc != 0 {
                     // Clean up the ones that were allocated.
diff -ur ./file_operations.rs ../../linux/rust/kernel/src/file_operations.rs
@@ -160,18 +162,10 @@
             None
         },
 
-        #[cfg(not(kernel_4_9_0_or_greater))]
-        aio_fsync: None,
         check_flags: None,
-        #[cfg(all(kernel_4_5_0_or_greater, not(kernel_4_20_0_or_greater)))]
-        clone_file_range: None,
         compat_ioctl: None,
-        #[cfg(kernel_4_5_0_or_greater)]
         copy_file_range: None,
-        #[cfg(all(kernel_4_5_0_or_greater, not(kernel_4_20_0_or_greater)))]
-        dedupe_file_range: None,
         fallocate: None,
-        #[cfg(kernel_4_19_0_or_greater)]
         fadvise: None,
         fasync: None,
         flock: None,
@@ -179,22 +173,16 @@
         fsync: None,
         get_unmapped_area: None,
         iterate: None,
-        #[cfg(kernel_4_7_0_or_greater)]
         iterate_shared: None,
-        #[cfg(kernel_5_1_0_or_greater)]
         iopoll: None,
         lock: None,
         mmap: None,
-        #[cfg(kernel_4_15_0_or_greater)]
         mmap_supported_flags: 0,
         owner: ptr::null_mut(),
         poll: None,
         read_iter: None,
-        #[cfg(kernel_4_20_0_or_greater)]
         remap_file_range: None,
         sendpage: None,
-        #[cfg(kernel_aufs_setfl)]
-        setfl: None,
         setlease: None,
         show_fdinfo: None,
         splice_read: None,
diff -ur ./filesystem.rs ../../linux/rust/kernel/src/filesystem.rs
@@ -61,7 +63,8 @@
     let mut fs_registration = Registration {
         ptr: Box::new(bindings::file_system_type {
             name: T::NAME.as_ptr() as *const i8,
-            owner: unsafe { &mut bindings::__this_module },
+            // TODO: proper `THIS_MODULE` handling
+            owner: core::ptr::null_mut(),
             fs_flags: T::FLAGS.bits(),
             mount: Some(mount_callback::<T>),
             kill_sb: Some(bindings::kill_litter_super),
diff -ur ./lib.rs ../../linux/rust/kernel/src/lib.rs
@@ -1,3 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! The `kernel` crate
+
 #![no_std]
 #![feature(allocator_api, alloc_error_handler, const_raw_ptr_deref)]
 
@@ -12,8 +16,8 @@
 mod error;
 pub mod file_operations;
 pub mod filesystem;
+pub mod prelude;
 pub mod printk;
-#[cfg(kernel_4_13_0_or_greater)]
 pub mod random;
 pub mod sysctl;
 mod types;
@@ -27,17 +31,18 @@
 ///
 /// Example:
 /// ```rust,no_run
-/// use linux_kernel_module;
+/// use kernel::prelude::*;
+///
 /// struct MyKernelModule;
-/// impl linux_kernel_module::KernelModule for MyKernelModule {
-///     fn init() -> linux_kernel_module::KernelResult<Self> {
+/// impl KernelModule for MyKernelModule {
+///     fn init() -> KernelResult<Self> {
 ///         Ok(MyKernelModule)
 ///     }
 /// }
 ///
-/// linux_kernel_module::kernel_module!(
+/// kernel_module!(
 ///     MyKernelModule,
-///     author: b"Fish in a Barrel Contributors",
+///     author: b"Rust for Linux Contributors",
 ///     description: b"My very own kernel module!",
 ///     license: b"GPL"
 /// );
@@ -45,6 +50,14 @@
 macro_rules! kernel_module {
     ($module:ty, $($name:ident : $value:expr),*) => {
         static mut __MOD: Option<$module> = None;
+
+        // TODO: find a proper way to emulate the C macro, including
+        // dealing with `HAVE_ARCH_PREL32_RELOCATIONS`
+        #[cfg(not(module))]
+        #[link_section = ".initcall6.init"]
+        #[used]
+        pub static __initcall: extern "C" fn() -> $crate::c_types::c_int = init_module;
+
         #[no_mangle]
         pub extern "C" fn init_module() -> $crate::c_types::c_int {
             match <$module as $crate::KernelModule>::init() {
@@ -85,6 +98,7 @@
     // Most of the alternatives (e.g. .as_bytes() as a const fn) give
     // you a pointer, not an array, which isn't right.
 
+    // TODO: `modules.builtin.modinfo` etc. is missing the prefix (module name)
     (@attribute author, $value:expr) => {
         #[link_section = ".modinfo"]
         #[used]
@@ -133,13 +147,13 @@
 }
 
 extern "C" {
-    fn bug_helper() -> !;
+    fn rust_helper_BUG() -> !;
 }
 
 #[panic_handler]
 fn panic(_info: &PanicInfo) -> ! {
     unsafe {
-        bug_helper();
+        rust_helper_BUG();
     }
 }
diff -ur ./random.rs ../../linux/rust/kernel/src/random.rs
@@ -21,9 +23,7 @@
 }
 
 /// Fills `dest` with random bytes generated from the kernel's CSPRNG. If the
-/// CSPRNG is not yet seeded, returns an `Err(EAGAIN)` immediately. Only
-/// available on 4.19 and later kernels.
-#[cfg(kernel_4_19_0_or_greater)]
+/// CSPRNG is not yet seeded, returns an `Err(EAGAIN)` immediately.
 pub fn getrandom_nonblock(dest: &mut [u8]) -> error::KernelResult<()> {
     if !unsafe { bindings::rng_is_initialized() } {
         return Err(error::Error::EAGAIN);
diff -ur ./user_ptr.rs ../../linux/rust/kernel/src/user_ptr.rs
@@ -7,7 +9,7 @@
 use crate::error;
 
 extern "C" {
-    fn access_ok_helper(addr: *const c_types::c_void, len: c_types::c_ulong) -> c_types::c_int;
+    fn rust_helper_access_ok(addr: *const c_types::c_void, len: c_types::c_ulong) -> c_types::c_int;
 }
 
 /// A reference to an area in userspace memory, which can be either
@@ -55,7 +57,7 @@
         ptr: *mut c_types::c_void,
         length: usize,
     ) -> error::KernelResult<UserSlicePtr> {
-        if access_ok_helper(ptr, length as c_types::c_ulong) == 0 {
+        if rust_helper_access_ok(ptr, length as c_types::c_ulong) == 0 {
             return Err(error::Error::EFAULT);
         }
         Ok(UserSlicePtr(ptr, length))

@ojeda
Copy link
Member Author

ojeda commented Sep 5, 2020

build.rs diff:

@@ -1,6 +1,7 @@
-use std::io::{BufRead, BufReader};
+// SPDX-License-Identifier: GPL-2.0
+
 use std::path::PathBuf;
-use std::{env, fs};
+use std::env;
 
 const INCLUDED_TYPES: &[&str] = &["file_system_type", "mode_t", "umode_t", "ctl_table"];
 const INCLUDED_FUNCTIONS: &[&str] = &[
@@ -55,56 +56,6 @@
     "xregs_state",
 ];
 
-fn handle_kernel_version_cfg(bindings_path: &PathBuf) {
-    let f = BufReader::new(fs::File::open(bindings_path).unwrap());
-    let mut version = None;
-    for line in f.lines() {
-        let line = line.unwrap();
-        if let Some(type_and_value) = line.split("pub const LINUX_VERSION_CODE").nth(1) {
-            if let Some(value) = type_and_value.split('=').nth(1) {
-                let raw_version = value.split(';').next().unwrap();
-                version = Some(raw_version.trim().parse::<u64>().unwrap());
-                break;
-            }
-        }
-    }
-    let version = version.expect("Couldn't find kernel version");
-    let (major, minor) = match version.to_be_bytes() {
-        [0, 0, 0, 0, 0, major, minor, _patch] => (major, minor),
-        _ => panic!("unable to parse LINUX_VERSION_CODE {:x}", version),
-    };
-
-    if major >= 6 {
-        panic!("Please update build.rs with the last 5.x version");
-        // Change this block to major >= 7, copy the below block for
-        // major >= 6, fill in unimplemented!() for major >= 5
-    }
-    if major >= 5 {
-        for x in 0..=if major > 5 { unimplemented!() } else { minor } {
-            println!("cargo:rustc-cfg=kernel_5_{}_0_or_greater", x);
-        }
-    }
-    if major >= 4 {
-        // We don't currently support anything older than 4.4
-        for x in 4..=if major > 4 { 20 } else { minor } {
-            println!("cargo:rustc-cfg=kernel_4_{}_0_or_greater", x);
-        }
-    }
-}
-
-fn handle_kernel_symbols_cfg(symvers_path: &PathBuf) {
-    let f = BufReader::new(fs::File::open(symvers_path).unwrap());
-    for line in f.lines() {
-        let line = line.unwrap();
-        if let Some(symbol) = line.split_ascii_whitespace().nth(1) {
-            if symbol == "setfl" {
-                println!("cargo:rustc-cfg=kernel_aufs_setfl");
-                break;
-            }
-        }
-    }
-}
-
 // Takes the CFLAGS from the kernel Makefile and changes all the include paths to be absolute
 // instead of relative.
 fn prepare_cflags(cflags: &str, kernel_dir: &str) -> Vec<String> {
@@ -112,6 +63,11 @@
     let mut cflag_iter = cflag_parts.iter();
     let mut kernel_args = vec![];
     while let Some(arg) = cflag_iter.next() {
+        // TODO: bindgen complains
+        if arg.starts_with("-Wp,-MMD") {
+            continue;
+        }
+
         if arg.starts_with("-I") && !arg.starts_with("-I/") {
             kernel_args.push(format!("-I{}/{}", kernel_dir, &arg[2..]));
         } else if arg == "-include" {
@@ -131,15 +87,12 @@
 
 fn main() {
     println!("cargo:rerun-if-env-changed=CC");
-    println!("cargo:rerun-if-env-changed=KDIR");
-    println!("cargo:rerun-if-env-changed=c_flags");
+    println!("cargo:rerun-if-env-changed=RUST_BINDGEN_CFLAGS");
 
-    let kernel_dir = env::var("KDIR").expect("Must be invoked from kernel makefile");
-    let kernel_cflags = env::var("c_flags").expect("Add 'export c_flags' to Kbuild");
-    let kbuild_cflags_module =
-        env::var("KBUILD_CFLAGS_MODULE").expect("Must be invoked from kernel makefile");
+    let kernel_dir = "../../";
+    let cflags = env::var("RUST_BINDGEN_CFLAGS")
+        .expect("Must be invoked from kernel makefile");
 
-    let cflags = format!("{} {}", kernel_cflags, kbuild_cflags_module);
     let kernel_args = prepare_cflags(&cflags, &kernel_dir);
 
     let target = env::var("TARGET").unwrap();
@@ -173,22 +126,8 @@
     }
     let bindings = builder.generate().expect("Unable to generate bindings");
 
-    let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());
+    let out_path = PathBuf::from("src/bindings_gen.rs");
     bindings
-        .write_to_file(out_path.join("bindings.rs"))
+        .write_to_file(out_path)
         .expect("Couldn't write bindings!");
-
-    handle_kernel_version_cfg(&out_path.join("bindings.rs"));
-    handle_kernel_symbols_cfg(&PathBuf::from(&kernel_dir).join("Module.symvers"));
-
-    let mut builder = cc::Build::new();
-    builder.compiler(env::var("CC").unwrap_or_else(|_| "clang".to_string()));
-    builder.target(&target);
-    builder.warnings(false);
-    println!("cargo:rerun-if-changed=src/helpers.c");
-    builder.file("src/helpers.c");
-    for arg in kernel_args.iter() {
-        builder.flag(&arg);
-    }
-    builder.compile("helpers");
 }

@ojeda
Copy link
Member Author

ojeda commented Sep 5, 2020

And the driver:

@@ -1,35 +1,33 @@
-#![no_std]
-
-extern crate alloc;
+// SPDX-License-Identifier: GPL-2.0
 
-use alloc::borrow::ToOwned;
-use alloc::string::String;
+#![no_std]
 
-use linux_kernel_module::println;
+use kernel::prelude::*;
 
-struct HelloWorldModule {
+struct RustExample {
     message: String,
 }
 
-impl linux_kernel_module::KernelModule for HelloWorldModule {
-    fn init() -> linux_kernel_module::KernelResult<Self> {
-        println!("Hello kernel module!");
-        Ok(HelloWorldModule {
+impl KernelModule for RustExample {
+    fn init() -> KernelResult<Self> {
+        println!("Rust Example (init)");
+        println!("Am I built-in? {}", !cfg!(module));
+        Ok(RustExample {
             message: "on the heap!".to_owned(),
         })
     }
 }
 
-impl Drop for HelloWorldModule {
+impl Drop for RustExample {
     fn drop(&mut self) {
         println!("My message is {}", self.message);
-        println!("Goodbye kernel module!");
+        println!("Rust Example (exit)");
     }
 }
 
-linux_kernel_module::kernel_module!(
-    HelloWorldModule,
-    author: b"Fish in a Barrel Contributors",
-    description: b"An extremely simple kernel module",
-    license: b"GPL"
+kernel_module!(
+    RustExample,
+    author: b"Rust for Linux Contributors",
+    description: b"An example kernel module written in Rust",
+    license: b"GPL v2"
 );

Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
@geofft
Copy link

geofft commented Sep 6, 2020

This is awesome!

I'm going to try to split this out into patches mostly so it's easier for me to review and also because I suspect it'll be better for submission. There might also be some changes we can "upstream" into linux-kernel-module-rust.

I had the Kbuild stuff on my mind because of fishinabarrel/linux-kernel-module-rust#266 - do your Kbudild rulesx make .cmd files for us? If so, I might try to roll this approach back into our out-of-tree Makefiles....

Agree re credit, I assume it'll be easy to stick some Co-authored-by lines or whatever, I don't really have a strong opinion about whose name is on it :)

@alex
Copy link
Member

alex commented Sep 6, 2020

You'll probably want to cherry-pick fishinabarrel/linux-kernel-module-rust@abc9791 which further reduces the need for nightly features.

Copy link
Member

@alex alex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments

@ojeda
Copy link
Member Author

ojeda commented Sep 7, 2020

This is awesome!

Thanks!

I'm going to try to split this out into patches mostly so it's easier for me to review and also because I suspect it'll be better for submission.

Please note that this repo/branch/commits is only meant for development within GitHub -- later I'll have to squash everything on my side and prepare the proper PR anyway (see #1). So all these dev-commits/messages/merges will go away.

There might also be some changes we can "upstream" into linux-kernel-module-rust.

Not much (I intentionally modified as few things as possible in your original files). In particular:

  • The build.rs things I removed are still needed on your side because you support several kernel versions etc.
  • The example driver changes are intended to fit the new structure etc., so those aren't needed on your side.
  • The built-in modules support is not used for external out-of-tree modules.
  • The kbuild changes require modifying kbuild itself which you can't do.

Things that I think you could pick, however, are:

  • The prelude idea.
  • The SPDX lines (it is good to teach people convention even if they are out-of-tree).
  • Building of helpers.c using kbuild rather than inside build.rs.

I should have probably left the prelude and SPDX changes for another PR to make it even easier to diff; I wasn't consistent on the strategy there.

I had the Kbuild stuff on my mind because of fishinabarrel/linux-kernel-module-rust#266. do your Kbudild rulesx make .cmd files for us? If so, I might try to roll this approach back into our out-of-tree Makefiles....

Kbuild files are intended for out-of-tree modules. Inside the kernel we just call it the Makefile, but it is still all kbuild (i.e. the build system)..cmd files are generated for the usual things kbuild does it currently (e.g. clang, ld).

What I did is teach kbuild (in a hacky way at some points) how to handle our Rust stuff in a way that adding new Rust modules looks like any other C module and as easy to use as possible, i.e. you only need to write:

# SPDX-License-Identifier: GPL-2.0

obj-$(CONFIG_RUST_EXAMPLE) += rust_example.o

Agree re credit, I assume it'll be easy to stick some Co-authored-by lines or whatever,

Yeah, exactly. Co-developed-by and Signed-off-by is the usual way.

Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
@alex
Copy link
Member

alex commented Sep 7, 2020

FWIW, I managed to emulate the module_init codegen for builtin modules with something like this:

#![feature(global_asm)]

global_asm!(
    r#".section ".initcall6.init", "a"
    __initcall_foo6:
        .long   foo - .
        .previous
    "#
);

#[no_mangle]
fn foo() -> i32 {
    4
}

Converting to a macro that's parameterized on the function name is left as an exercise to the reader :-)

Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
@ojeda
Copy link
Member Author

ojeda commented Sep 8, 2020

I am starting to wonder whether trying to emulate all the C module support is too much duplication. Perhaps we should generate and compile a C file for each defined Rust module on the fly (without the module writer having to write anything different).

Anyway, for the moment I added support to do conditional compilation on the kernel configuration variables (which is anyway good to have!) and used that to fully emulate the macro. That frees us from the arch/ change which was a blocker.

@ojeda
Copy link
Member Author

ojeda commented Sep 12, 2020

It has been a week so I am merging this so that we can start sending PRs etc.

@ojeda ojeda merged commit e280b81 into rust Sep 12, 2020
@ojeda ojeda deleted the rust-initial branch September 12, 2020 13:37
@nickdesaulniers
Copy link
Member

Sorry, guys, I'm slammed at work. Hopefully I can help out more in the future; just excessively busy right now... :(

@kees
Copy link
Member

kees commented Sep 16, 2020

Is cargo pulling stuff from the network? Is there a way to make this work with only local crates, i.e. installed by the distro or pre-installed by the user prior to attempting to build the kernel?

@ojeda
Copy link
Member Author

ojeda commented Sep 17, 2020

Sorry, guys, I'm slammed at work. Hopefully I can help out more in the future; just excessively busy right now... :(

No worries!

Is cargo pulling stuff from the network? Is there a way to make this work with only local crates, i.e. installed by the distro or pre-installed by the user prior to attempting to build the kernel?

Yeah, it does. For the moment, it is downloading them, although the versions are locked (i.e. no surprise updates, which means no unknown/arbitrary code downloaded when running make). Having said that:

  • It is possible to pre-download them by cargo fetch (I mention it in Documentation/rust/quick-start.rst).
  • It is possible to require to pre-download them. This means failing the build otherwise. Harsh, but this would keep the expectation that no network connections are made while building, which I suspect will be a popular option. In a way, those dependencies are a "tool", like rustc and cargo themselves, that should be available, so it makes sense to require it.

To recap the situation of the dependencies for everyone: we can make it work with local code (the rust/shlex I added is an example of that), but there are 2 big ones that are harder to manage:

  • Two compiler dependencies: it is not clear to me if newer/older rustcs will always be able to compile those. If not, it might mean we have to fix the version of the compiler, which is not great. In theory we could have a fork or our own version of those and make it work with any support compiler version, but I think for the moment it would be going too far (e.g. in the future we might want to have our own tailored alloc).
  • The bindings generator: it uses a lot of dependencies, but we could either commit the generated code (so only needed if updating it -- the best option, except at the beginning people won't be happy to have to run this if they change some kernel API that we use) or save all of them as third-party code (not great, but at least they are not tied to the compiler, so it should never break).

Another topic is avoiding the connections to crates.io and github.com. For that, we could propose to host mirrors at kernel.org.

@joshtriplett
Copy link
Member

joshtriplett commented Sep 17, 2020

For crate dependencies, we should vendor those into the kernel tree under a specific path, and then configure that path as a directory source, along with passing the appropriate cargo options to never touch the network.

For bindgen/cbindgen, I'm wondering if we can get distributions to package it, and then just expect people to have it installed like cargo and rustc.

@alex
Copy link
Member

alex commented Sep 17, 2020 via email

@ojeda ojeda mentioned this pull request Oct 11, 2020
AltF02 pushed a commit to AltF02/linux that referenced this pull request Oct 5, 2021
FD uses xyarray__entry that may return NULL if an index is out of
bounds. If NULL is returned then a segv happens as FD unconditionally
dereferences the pointer. This was happening in a case of with perf
iostat as shown below. The fix is to make FD an "int*" rather than an
int and handle the NULL case as either invalid input or a closed fd.

  $ sudo gdb --args perf stat --iostat  list
  ...
  Breakpoint 1, perf_evsel__alloc_fd (evsel=0x5555560951a0, ncpus=1, nthreads=1) at evsel.c:50
  50      {
  (gdb) bt
   #0  perf_evsel__alloc_fd (evsel=0x5555560951a0, ncpus=1, nthreads=1) at evsel.c:50
   Rust-for-Linux#1  0x000055555585c188 in evsel__open_cpu (evsel=0x5555560951a0, cpus=0x555556093410,
      threads=0x555556086fb0, start_cpu=0, end_cpu=1) at util/evsel.c:1792
   Rust-for-Linux#2  0x000055555585cfb2 in evsel__open (evsel=0x5555560951a0, cpus=0x0, threads=0x555556086fb0)
      at util/evsel.c:2045
   Rust-for-Linux#3  0x000055555585d0db in evsel__open_per_thread (evsel=0x5555560951a0, threads=0x555556086fb0)
      at util/evsel.c:2065
   Rust-for-Linux#4  0x00005555558ece64 in create_perf_stat_counter (evsel=0x5555560951a0,
      config=0x555555c34700 <stat_config>, target=0x555555c2f1c0 <target>, cpu=0) at util/stat.c:590
   Rust-for-Linux#5  0x000055555578e927 in __run_perf_stat (argc=1, argv=0x7fffffffe4a0, run_idx=0)
      at builtin-stat.c:833
   Rust-for-Linux#6  0x000055555578f3c6 in run_perf_stat (argc=1, argv=0x7fffffffe4a0, run_idx=0)
      at builtin-stat.c:1048
   Rust-for-Linux#7  0x0000555555792ee5 in cmd_stat (argc=1, argv=0x7fffffffe4a0) at builtin-stat.c:2534
   Rust-for-Linux#8  0x0000555555835ed3 in run_builtin (p=0x555555c3f540 <commands+288>, argc=3,
      argv=0x7fffffffe4a0) at perf.c:313
   Rust-for-Linux#9  0x0000555555836154 in handle_internal_command (argc=3, argv=0x7fffffffe4a0) at perf.c:365
   Rust-for-Linux#10 0x000055555583629f in run_argv (argcp=0x7fffffffe2ec, argv=0x7fffffffe2e0) at perf.c:409
   Rust-for-Linux#11 0x0000555555836692 in main (argc=3, argv=0x7fffffffe4a0) at perf.c:539
  ...
  (gdb) c
  Continuing.
  Error:
  The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (uncore_iio_0/event=0x83,umask=0x04,ch_mask=0xF,fc_mask=0x07/).
  /bin/dmesg | grep -i perf may provide additional information.

  Program received signal SIGSEGV, Segmentation fault.
  0x00005555559b03ea in perf_evsel__close_fd_cpu (evsel=0x5555560951a0, cpu=1) at evsel.c:166
  166                     if (FD(evsel, cpu, thread) >= 0)

v3. fixes a bug in perf_evsel__run_ioctl where the sense of a branch was
    backward.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20210918054440.2350466-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
AltF02 pushed a commit to AltF02/linux that referenced this pull request Oct 9, 2021
Pablo Neira Ayuso says:

====================
Netfilter fixes for net (v2)

The following patchset contains Netfilter fixes for net:

1) Move back the defrag users fields to the global netns_nf area.
   Kernel fails to boot if conntrack is builtin and kernel is booted
   with: nf_conntrack.enable_hooks=1. From Florian Westphal.

2) Rule event notification is missing relevant context such as
   the position handle and the NLM_F_APPEND flag.

3) Rule replacement is expanded to add + delete using the existing
   rule handle, reverse order of this operation so it makes sense
   from rule notification standpoint.

4) Propagate to userspace the NLM_F_CREATE and NLM_F_EXCL flags
   from the rule notification path.

Patches Rust-for-Linux#2, Rust-for-Linux#3 and Rust-for-Linux#4 are used by 'nft monitor' and 'iptables-monitor'
userspace utilities which are not correctly representing the following
operations through netlink notifications:

- rule insertions
- rule addition/insertion from position handle
- create table/chain/set/map/flowtable/...
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
ojeda pushed a commit that referenced this pull request Oct 26, 2021
There is no devfreq on a3xx at the moment since gpu_busy is not
implemented. This means that msm_devfreq_init() will return early
and the entire devfreq setup is skipped.

However, msm_devfreq_active() and msm_devfreq_idle() are still called
unconditionally later, causing a NULL pointer dereference:

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
  Internal error: Oops: 96000004 [#1] PREEMPT SMP
  CPU: 0 PID: 133 Comm: ring0 Not tainted 5.15.0-rc1 #4
  Hardware name: Longcheer L8150 (DT)
  pc : mutex_lock_io+0x2bc/0x2f0
  lr : msm_devfreq_active+0x3c/0xe0 [msm]
  Call trace:
   mutex_lock_io+0x2bc/0x2f0
   msm_gpu_submit+0x164/0x180 [msm]
   msm_job_run+0x54/0xe0 [msm]
   drm_sched_main+0x2b0/0x4a0 [gpu_sched]
   kthread+0x154/0x160
   ret_from_fork+0x10/0x20

Fix this by adding a check in msm_devfreq_active/idle() which ensures
that devfreq was actually initialized earlier.

Fixes: 9bc9557 ("drm/msm: Devfreq tuning")
Reported-by: Nikita Travkin <nikita@trvn.ru>
Tested-by: Nikita Travkin <nikita@trvn.ru>
Signed-off-by: Stephan Gerhold <stephan@gerhold.net>
Link: https://lore.kernel.org/r/20210913164556.16284-1-stephan@gerhold.net
Signed-off-by: Rob Clark <robdclark@chromium.org>
ojeda pushed a commit that referenced this pull request Nov 24, 2021
[BUG]
The following script can cause btrfs to crash:

  $ mount -o compress-force=lzo $DEV /mnt
  $ dd if=/dev/urandom of=/mnt/foo bs=4k count=1
  $ sync

The call trace looks like this:

  general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4
  Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
  RIP: 0010:__memcpy+0x12/0x20
  Call Trace:
   lzo_compress_pages+0x236/0x540 [btrfs]
   btrfs_compress_pages+0xaa/0xf0 [btrfs]
   compress_file_range+0x431/0x8e0 [btrfs]
   async_cow_start+0x12/0x30 [btrfs]
   btrfs_work_helper+0xf6/0x3e0 [btrfs]
   process_one_work+0x294/0x5d0
   worker_thread+0x55/0x3c0
   kthread+0x140/0x170
   ret_from_fork+0x22/0x30
  ---[ end trace 63c3c0f131e61982 ]---

[CAUSE]
In lzo_compress_pages(), parameter @out_pages is not only an output
parameter (for the number of compressed pages), but also an input
parameter, as the upper limit of compressed pages we can utilize.

In commit d408880 ("btrfs: subpage: make lzo_compress_pages()
compatible"), the refactoring doesn't take @out_pages as an input, thus
completely ignoring the limit.

And for compress-force case, we could hit incompressible data that
compressed size would go beyond the page limit, and cause the above
crash.

[FIX]
Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(),
and check if we're beyond the limit before accessing the pages.

Note: this also fixes crash on 32bit architectures that was suspected to
be caused by merge of btrfs patches to 5.16-rc1. Reported in
https://lore.kernel.org/all/20211104115001.GU20319@twin.jikos.cz/ .

Reported-by: Omar Sandoval <osandov@fb.com>
Fixes: d408880 ("btrfs: subpage: make lzo_compress_pages() compatible")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
m-falkowski1 pushed a commit to m-falkowski1/linux that referenced this pull request Nov 30, 2021
The exit function fixes a memory leak with the src field as detected by
leak sanitizer. An example of which is:

Indirect leak of 25133184 byte(s) in 207 object(s) allocated from:
    #0 0x7f199ecfe987 in __interceptor_calloc libsanitizer/asan/asan_malloc_linux.cpp:154
    Rust-for-Linux#1 0x55defe638224 in annotated_source__alloc_histograms util/annotate.c:803
    Rust-for-Linux#2 0x55defe6397e4 in symbol__hists util/annotate.c:952
    Rust-for-Linux#3 0x55defe639908 in symbol__inc_addr_samples util/annotate.c:968
    Rust-for-Linux#4 0x55defe63aa29 in hist_entry__inc_addr_samples util/annotate.c:1119
    Rust-for-Linux#5 0x55defe499a79 in hist_iter__report_callback tools/perf/builtin-report.c:182
    Rust-for-Linux#6 0x55defe7a859d in hist_entry_iter__add util/hist.c:1236
    Rust-for-Linux#7 0x55defe49aa63 in process_sample_event tools/perf/builtin-report.c:315
    Rust-for-Linux#8 0x55defe731bc8 in evlist__deliver_sample util/session.c:1473
    Rust-for-Linux#9 0x55defe731e38 in machines__deliver_event util/session.c:1510
    Rust-for-Linux#10 0x55defe732a23 in perf_session__deliver_event util/session.c:1590
    Rust-for-Linux#11 0x55defe72951e in ordered_events__deliver_event util/session.c:183
    Rust-for-Linux#12 0x55defe740082 in do_flush util/ordered-events.c:244
    Rust-for-Linux#13 0x55defe7407cb in __ordered_events__flush util/ordered-events.c:323
    Rust-for-Linux#14 0x55defe740a61 in ordered_events__flush util/ordered-events.c:341
    Rust-for-Linux#15 0x55defe73837f in __perf_session__process_events util/session.c:2390
    Rust-for-Linux#16 0x55defe7385ff in perf_session__process_events util/session.c:2420
    ...

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Liška <mliska@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20211112035124.94327-3-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Dec 20, 2021
Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
but  when the function returns in line 1184 (#4) victim_name allocated
by line 1169 (#3) is not freed, which will lead to a memory leak.
There is a similar snippet of code in this function as allocating a memory
chunk for victim_name in line 1104 (#1) as well as releasing the memory
in line 1116 (#2).

We should kfree() victim_name when the return value of backref_in_log()
is less than zero and before the function returns in line 1184 (#4).

1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
1058 				  struct btrfs_root *root,
1059 				  struct btrfs_path *path,
1060 				  struct btrfs_root *log_root,
1061 				  struct btrfs_inode *dir,
1062 				  struct btrfs_inode *inode,
1063 				  u64 inode_objectid, u64 parent_objectid,
1064 				  u64 ref_index, char *name, int namelen,
1065 				  int *search_done)
1066 {

1104 	victim_name = kmalloc(victim_name_len, GFP_NOFS);
	// #1: kmalloc (victim_name-1)
1105 	if (!victim_name)
1106 		return -ENOMEM;

1112	ret = backref_in_log(log_root, &search_key,
1113			parent_objectid, victim_name,
1114			victim_name_len);
1115	if (ret < 0) {
1116		kfree(victim_name); // #2: kfree (victim_name-1)
1117		return ret;
1118	} else if (!ret) {

1169 	victim_name = kmalloc(victim_name_len, GFP_NOFS);
	// #3: kmalloc (victim_name-2)
1170 	if (!victim_name)
1171 		return -ENOMEM;

1180 	ret = backref_in_log(log_root, &search_key,
1181 			parent_objectid, victim_name,
1182 			victim_name_len);
1183 	if (ret < 0) {
1184 		return ret; // #4: missing kfree (victim_name-2)
1185 	} else if (!ret) {

1241 	return 0;
1242 }

Fixes: d3316c8 ("btrfs: Properly handle backref_in_log retval")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ojeda pushed a commit that referenced this pull request Dec 20, 2021
The fixed commit attempts to close inject.output even if it was never
opened e.g.

  $ perf record uname
  Linux
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.002 MB perf.data (7 samples) ]
  $ perf inject -i perf.data --vm-time-correlation=dry-run
  Segmentation fault (core dumped)
  $ gdb --quiet perf
  Reading symbols from perf...
  (gdb) r inject -i perf.data --vm-time-correlation=dry-run
  Starting program: /home/ahunter/bin/perf inject -i perf.data --vm-time-correlation=dry-run
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

  Program received signal SIGSEGV, Segmentation fault.
  0x00007eff8afeef5b in _IO_new_fclose (fp=0x0) at iofclose.c:48
  48      iofclose.c: No such file or directory.
  (gdb) bt
  #0  0x00007eff8afeef5b in _IO_new_fclose (fp=0x0) at iofclose.c:48
  #1  0x0000557fc7b74f92 in perf_data__close (data=data@entry=0x7ffcdafa6578) at util/data.c:376
  #2  0x0000557fc7a6b807 in cmd_inject (argc=<optimized out>, argv=<optimized out>) at builtin-inject.c:1085
  #3  0x0000557fc7ac4783 in run_builtin (p=0x557fc8074878 <commands+600>, argc=4, argv=0x7ffcdafb6a60) at perf.c:313
  #4  0x0000557fc7a25d5c in handle_internal_command (argv=<optimized out>, argc=<optimized out>) at perf.c:365
  #5  run_argv (argcp=<optimized out>, argv=<optimized out>) at perf.c:409
  #6  main (argc=4, argv=0x7ffcdafb6a60) at perf.c:539
  (gdb)

Fixes: 02e6246 ("perf inject: Close inject.output on exit")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Riccardo Mancini <rickyman7@gmail.com>
Cc: stable@vger.kernel.org
Link: http://lore.kernel.org/lkml/20211213084829.114772-2-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Dec 20, 2021
The fixed commit attempts to get the output file descriptor even if the
file was never opened e.g.

  $ perf record uname
  Linux
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.002 MB perf.data (7 samples) ]
  $ perf inject -i perf.data --vm-time-correlation=dry-run
  Segmentation fault (core dumped)
  $ gdb --quiet perf
  Reading symbols from perf...
  (gdb) r inject -i perf.data --vm-time-correlation=dry-run
  Starting program: /home/ahunter/bin/perf inject -i perf.data --vm-time-correlation=dry-run
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

  Program received signal SIGSEGV, Segmentation fault.
  __GI___fileno (fp=0x0) at fileno.c:35
  35      fileno.c: No such file or directory.
  (gdb) bt
  #0  __GI___fileno (fp=0x0) at fileno.c:35
  #1  0x00005621e48dd987 in perf_data__fd (data=0x7fff4c68bd08) at util/data.h:72
  #2  perf_data__fd (data=0x7fff4c68bd08) at util/data.h:69
  #3  cmd_inject (argc=<optimized out>, argv=0x7fff4c69c1f0) at builtin-inject.c:1017
  #4  0x00005621e4936783 in run_builtin (p=0x5621e4ee6878 <commands+600>, argc=4, argv=0x7fff4c69c1f0) at perf.c:313
  #5  0x00005621e4897d5c in handle_internal_command (argv=<optimized out>, argc=<optimized out>) at perf.c:365
  #6  run_argv (argcp=<optimized out>, argv=<optimized out>) at perf.c:409
  #7  main (argc=4, argv=0x7fff4c69c1f0) at perf.c:539
  (gdb)

Fixes: 0ae0389 ("perf tools: Pass a fd to perf_file_header__read_pipe()")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Riccardo Mancini <rickyman7@gmail.com>
Cc: stable@vger.kernel.org
Link: http://lore.kernel.org/lkml/20211213084829.114772-3-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Jan 30, 2022
arm32 uses software to simulate the instruction replaced
by kprobe. some instructions may be simulated by constructing
assembly functions. therefore, before executing instruction
simulation, it is necessary to construct assembly function
execution environment in C language through binding registers.
after kasan is enabled, the register binding relationship will
be destroyed, resulting in instruction simulation errors and
causing kernel panic.

the kprobe emulate instruction function is distributed in three
files: actions-common.c actions-arm.c actions-thumb.c, so disable
KASAN when compiling these files.

for example, use kprobe insert on cap_capable+20 after kasan
enabled, the cap_capable assembly code is as follows:
<cap_capable>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e1a05000	mov	r5, r0
e280006c	add	r0, r0, #108    ; 0x6c
e1a0400	mov	r4, r1
e1a06002	mov	r6, r2
e59fa090	ldr	sl, [pc, #144]  ;
ebfc7bf8	bl	c03aa4b4 <__asan_load4>
e595706c	ldr	r7, [r5, #108]  ; 0x6c
e2859014	add	r9, r5, #20
......
The emulate_ldr assembly code after enabling kasan is as follows:
c06f1384 <emulate_ldr>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e282803c	add	r8, r2, #60     ; 0x3c
e1a05000	mov	r5, r0
e7e37855	ubfx	r7, r5, #16, #4
e1a00008	mov	r0, r8
e1a09001	mov	r9, r1
e1a04002	mov	r4, r2
ebf35462	bl	c03c6530 <__asan_load4>
e357000f	cmp	r7, #15
e7e36655	ubfx	r6, r5, #12, #4
e205a00f	and	sl, r5, #15
0a000001	beq	c06f13bc <emulate_ldr+0x38>
e0840107	add	r0, r4, r7, lsl #2
ebf3545c	bl	c03c6530 <__asan_load4>
e084010a	add	r0, r4, sl, lsl #2
ebf3545a	bl	c03c6530 <__asan_load4>
e2890010	add	r0, r9, #16
ebf35458	bl	c03c6530 <__asan_load4>
e5990010	ldr	r0, [r9, #16]
e12fff30	blx	r0
e356000f	cm	r6, #15
1a000014	bne	c06f1430 <emulate_ldr+0xac>
e1a06000	mov	r6, r0
e2840040	add	r0, r4, #64     ; 0x40
......

when running in emulate_ldr to simulate the ldr instruction, panic
occurred, and the log is as follows:
Unable to handle kernel NULL pointer dereference at virtual address
00000090
pgd = ecb46400
[00000090] *pgd=2e0fa003, *pmd=00000000
Internal error: Oops: 206 [#1] SMP ARM
PC is at cap_capable+0x14/0xb0
LR is at emulate_ldr+0x50/0xc0
psr: 600d0293 sp : ecd63af8  ip : 00000004  fp : c0a7c30c
r10: 00000000  r9 : c30897f4  r8 : ecd63cd4
r7 : 0000000f  r6 : 0000000a  r5 : e59fa090  r4 : ecd63c98
r3 : c06ae294  r2 : 00000000  r1 : b7611300  r0 : bf4ec008
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 32c5387d  Table: 2d546400  DAC: 55555555
Process bash (pid: 1643, stack limit = 0xecd60190)
(cap_capable) from (kprobe_handler+0x218/0x340)
(kprobe_handler) from (kprobe_trap_handler+0x24/0x48)
(kprobe_trap_handler) from (do_undefinstr+0x13c/0x364)
(do_undefinstr) from (__und_svc_finish+0x0/0x30)
(__und_svc_finish) from (cap_capable+0x18/0xb0)
(cap_capable) from (cap_vm_enough_memory+0x38/0x48)
(cap_vm_enough_memory) from
(security_vm_enough_memory_mm+0x48/0x6c)
(security_vm_enough_memory_mm) from
(copy_process.constprop.5+0x16b4/0x25c8)
(copy_process.constprop.5) from (_do_fork+0xe8/0x55c)
(_do_fork) from (SyS_clone+0x1c/0x24)
(SyS_clone) from (__sys_trace_return+0x0/0x10)
Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7)

Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support")
Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM")
Signed-off-by: huangshaobo <huangshaobo6@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
ojeda pushed a commit that referenced this pull request Feb 28, 2022
Rolf Eike Beer reported the following bug:

[1274934.746891] Bad Address (null pointer deref?): Code=15 (Data TLB miss fault) at addr 0000004140000018
[1274934.746891] CPU: 3 PID: 5549 Comm: cmake Not tainted 5.15.4-gentoo-parisc64 #4
[1274934.746891] Hardware name: 9000/785/C8000
[1274934.746891]
[1274934.746891]      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[1274934.746891] PSW: 00001000000001001111111000001110 Not tainted
[1274934.746891] r00-03  000000ff0804fe0e 0000000040bc9bc0 00000000406760e4 0000004140000000
[1274934.746891] r04-07  0000000040b693c0 0000004140000000 000000004a2b08b0 0000000000000001
[1274934.746891] r08-11  0000000041f98810 0000000000000000 000000004a0a7000 0000000000000001
[1274934.746891] r12-15  0000000040bddbc0 0000000040c0cbc0 0000000040bddbc0 0000000040bddbc0
[1274934.746891] r16-19  0000000040bde3c0 0000000040bddbc0 0000000040bde3c0 0000000000000007
[1274934.746891] r20-23  0000000000000006 000000004a368950 0000000000000000 0000000000000001
[1274934.746891] r24-27  0000000000001fff 000000000800000e 000000004a1710f0 0000000040b693c0
[1274934.746891] r28-31  0000000000000001 0000000041f988b0 0000000041f98840 000000004a171118
[1274934.746891] sr00-03  00000000066e5800 0000000000000000 0000000000000000 00000000066e5800
[1274934.746891] sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[1274934.746891]
[1274934.746891] IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000406760e8 00000000406760ec
[1274934.746891]  IIR: 48780030    ISR: 0000000000000000  IOR: 0000004140000018
[1274934.746891]  CPU:        3   CR30: 00000040e3a9c000 CR31: ffffffffffffffff
[1274934.746891]  ORIG_R28: 0000000040acdd58
[1274934.746891]  IAOQ[0]: sba_unmap_sg+0xb0/0x118
[1274934.746891]  IAOQ[1]: sba_unmap_sg+0xb4/0x118
[1274934.746891]  RP(r2): sba_unmap_sg+0xac/0x118
[1274934.746891] Backtrace:
[1274934.746891]  [<00000000402740cc>] dma_unmap_sg_attrs+0x6c/0x70
[1274934.746891]  [<000000004074d6bc>] scsi_dma_unmap+0x54/0x60
[1274934.746891]  [<00000000407a3488>] mptscsih_io_done+0x150/0xd70
[1274934.746891]  [<0000000040798600>] mpt_interrupt+0x168/0xa68
[1274934.746891]  [<0000000040255a48>] __handle_irq_event_percpu+0xc8/0x278
[1274934.746891]  [<0000000040255c34>] handle_irq_event_percpu+0x3c/0xd8
[1274934.746891]  [<000000004025ecb4>] handle_percpu_irq+0xb4/0xf0
[1274934.746891]  [<00000000402548e0>] generic_handle_irq+0x50/0x70
[1274934.746891]  [<000000004019a254>] call_on_stack+0x18/0x24
[1274934.746891]
[1274934.746891] Kernel panic - not syncing: Bad Address (null pointer deref?)

The bug is caused by overrunning the sglist and incorrectly testing
sg_dma_len(sglist) before nents. Normally this doesn't cause a crash,
but in this case sglist crossed a page boundary. This occurs in the
following code:

	while (sg_dma_len(sglist) && nents--) {

The fix is simply to test nents first and move the decrement of nents
into the loop.

Reported-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org
Signed-off-by: Helge Deller <deller@gmx.de>
ojeda pushed a commit that referenced this pull request Feb 28, 2022
When cifs_get_root() fails during cifs_smb3_do_mount() we call
deactivate_locked_super() which eventually will call delayed_free() which
will free the context.
In this situation we should not proceed to enter the out: section in
cifs_smb3_do_mount() and free the same resources a second time.

[Thu Feb 10 12:59:06 2022] BUG: KASAN: use-after-free in rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022] Read of size 8 at addr ffff888364f4d110 by task swapper/1/0

[Thu Feb 10 12:59:06 2022] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           OE     5.17.0-rc3+ #4
[Thu Feb 10 12:59:06 2022] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[Thu Feb 10 12:59:06 2022] Call Trace:
[Thu Feb 10 12:59:06 2022]  <IRQ>
[Thu Feb 10 12:59:06 2022]  dump_stack_lvl+0x5d/0x78
[Thu Feb 10 12:59:06 2022]  print_address_description.constprop.0+0x24/0x150
[Thu Feb 10 12:59:06 2022]  ? rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022]  kasan_report.cold+0x7d/0x117
[Thu Feb 10 12:59:06 2022]  ? rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022]  __asan_load8+0x86/0xa0
[Thu Feb 10 12:59:06 2022]  rcu_cblist_dequeue+0x32/0x60
[Thu Feb 10 12:59:06 2022]  rcu_core+0x547/0xca0
[Thu Feb 10 12:59:06 2022]  ? call_rcu+0x3c0/0x3c0
[Thu Feb 10 12:59:06 2022]  ? __this_cpu_preempt_check+0x13/0x20
[Thu Feb 10 12:59:06 2022]  ? lock_is_held_type+0xea/0x140
[Thu Feb 10 12:59:06 2022]  rcu_core_si+0xe/0x10
[Thu Feb 10 12:59:06 2022]  __do_softirq+0x1d4/0x67b
[Thu Feb 10 12:59:06 2022]  __irq_exit_rcu+0x100/0x150
[Thu Feb 10 12:59:06 2022]  irq_exit_rcu+0xe/0x30
[Thu Feb 10 12:59:06 2022]  sysvec_hyperv_stimer0+0x9d/0xc0
...
[Thu Feb 10 12:59:07 2022] Freed by task 58179:
[Thu Feb 10 12:59:07 2022]  kasan_save_stack+0x26/0x50
[Thu Feb 10 12:59:07 2022]  kasan_set_track+0x25/0x30
[Thu Feb 10 12:59:07 2022]  kasan_set_free_info+0x24/0x40
[Thu Feb 10 12:59:07 2022]  ____kasan_slab_free+0x137/0x170
[Thu Feb 10 12:59:07 2022]  __kasan_slab_free+0x12/0x20
[Thu Feb 10 12:59:07 2022]  slab_free_freelist_hook+0xb3/0x1d0
[Thu Feb 10 12:59:07 2022]  kfree+0xcd/0x520
[Thu Feb 10 12:59:07 2022]  cifs_smb3_do_mount+0x149/0xbe0 [cifs]
[Thu Feb 10 12:59:07 2022]  smb3_get_tree+0x1a0/0x2e0 [cifs]
[Thu Feb 10 12:59:07 2022]  vfs_get_tree+0x52/0x140
[Thu Feb 10 12:59:07 2022]  path_mount+0x635/0x10c0
[Thu Feb 10 12:59:07 2022]  __x64_sys_mount+0x1bf/0x210
[Thu Feb 10 12:59:07 2022]  do_syscall_64+0x5c/0xc0
[Thu Feb 10 12:59:07 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae

[Thu Feb 10 12:59:07 2022] Last potentially related work creation:
[Thu Feb 10 12:59:07 2022]  kasan_save_stack+0x26/0x50
[Thu Feb 10 12:59:07 2022]  __kasan_record_aux_stack+0xb6/0xc0
[Thu Feb 10 12:59:07 2022]  kasan_record_aux_stack_noalloc+0xb/0x10
[Thu Feb 10 12:59:07 2022]  call_rcu+0x76/0x3c0
[Thu Feb 10 12:59:07 2022]  cifs_umount+0xce/0xe0 [cifs]
[Thu Feb 10 12:59:07 2022]  cifs_kill_sb+0xc8/0xe0 [cifs]
[Thu Feb 10 12:59:07 2022]  deactivate_locked_super+0x5d/0xd0
[Thu Feb 10 12:59:07 2022]  cifs_smb3_do_mount+0xab9/0xbe0 [cifs]
[Thu Feb 10 12:59:07 2022]  smb3_get_tree+0x1a0/0x2e0 [cifs]
[Thu Feb 10 12:59:07 2022]  vfs_get_tree+0x52/0x140
[Thu Feb 10 12:59:07 2022]  path_mount+0x635/0x10c0
[Thu Feb 10 12:59:07 2022]  __x64_sys_mount+0x1bf/0x210
[Thu Feb 10 12:59:07 2022]  do_syscall_64+0x5c/0xc0
[Thu Feb 10 12:59:07 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Reported-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
ojeda pushed a commit that referenced this pull request Feb 28, 2022
I saw the below splatting after the host suspended and resumed.

   WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
   CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G        W IOE     5.17.0-rc3+ #4
   RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
   Call Trace:
    <TASK>
    syscore_resume+0x90/0x340
    suspend_devices_and_enter+0xaee/0xe90
    pm_suspend.cold+0x36b/0x3c2
    state_store+0x82/0xf0
    kernfs_fop_write_iter+0x1b6/0x260
    new_sync_write+0x258/0x370
    vfs_write+0x33f/0x510
    ksys_write+0xc9/0x160
    do_syscall_64+0x3b/0xc0
    entry_SYSCALL_64_after_hwframe+0x44/0xae

lockdep_is_held() can return -1 when lockdep is disabled which triggers
this warning. Let's use lockdep_assert_not_held() which can detect
incorrect calls while holding a lock and it also avoids false negatives
when lockdep is disabled.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1644920142-81249-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 14, 2022
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.17, take #4

- Correctly synchronise PMR and co on PSCI CPU_SUSPEND

- Skip tests that depend on GICv3 when the HW isn't available
ojeda pushed a commit that referenced this pull request Apr 12, 2022
We've got a mess on our hands.

1. xfs_trans_commit() cannot cancel transactions because the mount is
shut down - that causes dirty, aborted, unlogged log items to sit
unpinned in memory and potentially get written to disk before the
log is shut down. Hence xfs_trans_commit() can only abort
transactions when xlog_is_shutdown() is true.

2. xfs_force_shutdown() is used in places to cause the current
modification to be aborted via xfs_trans_commit() because it may be
impractical or impossible to cancel the transaction directly, and
hence xfs_trans_commit() must cancel transactions when
xfs_is_shutdown() is true in this situation. But we can't do that
because of #1.

3. Log IO errors cause log shutdowns by calling xfs_force_shutdown()
to shut down the mount and then the log from log IO completion.

4. xfs_force_shutdown() can result in a log force being issued,
which has to wait for log IO completion before it will mark the log
as shut down. If #3 races with some other shutdown trigger that runs
a log force, we rely on xfs_force_shutdown() silently ignoring #3
and avoiding shutting down the log until the failed log force
completes.

5. To ensure #2 always works, we have to ensure that
xfs_force_shutdown() does not return until the the log is shut down.
But in the case of #4, this will result in a deadlock because the
log Io completion will block waiting for a log force to complete
which is blocked waiting for log IO to complete....

So the very first thing we have to do here to untangle this mess is
dissociate log shutdown triggers from mount shutdowns. We already
have xlog_forced_shutdown, which will atomically transistion to the
log a shutdown state. Due to internal asserts it cannot be called
multiple times, but was done simply because the only place that
could call it was xfs_do_force_shutdown() (i.e. the mount shutdown!)
and that could only call it once and once only.  So the first thing
we do is remove the asserts.

We then convert all the internal log shutdown triggers to call
xlog_force_shutdown() directly instead of xfs_force_shutdown(). This
allows the log shutdown triggers to shut down the log without
needing to care about mount based shutdown constraints. This means
we shut down the log independently of the mount and the mount may
not notice this until it's next attempt to read or modify metadata.
At that point (e.g. xfs_trans_commit()) it will see that the log is
shutdown, error out and shutdown the mount.

To ensure that all the unmount behaviours and asserts track
correctly as a result of a log shutdown, propagate the shutdown up
to the mount if it is not already set. This keeps the mount and log
state in sync, and saves a huge amount of hassle where code fails
because of a log shutdown but only checks for mount shutdowns and
hence ends up doing the wrong thing. Cleaning up that mess is
an exercise for another day.

This enables us to address the other problems noted above in
followup patches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
ojeda pushed a commit that referenced this pull request Apr 12, 2022
After rx/tx ring buffer size is changed, kernel panic occurs when
it acts XDP_TX or XDP_REDIRECT.

When tx/rx ring buffer size is changed(ethtool -G), sfc driver
reallocates and reinitializes rx and tx queues and their buffer
(tx_queue->buffer).
But it misses reinitializing xdp queues(efx->xdp_tx_queues).
So, while it is acting XDP_TX or XDP_REDIRECT, it uses the uninitialized
tx_queue->buffer.

A new function efx_set_xdp_channels() is separated from efx_set_channels()
to handle only xdp queues.

Splat looks like:
   BUG: kernel NULL pointer dereference, address: 000000000000002a
   #PF: supervisor write access in kernel mode
   #PF: error_code(0x0002) - not-present page
   PGD 0 P4D 0
   Oops: 0002 [#4] PREEMPT SMP NOPTI
   RIP: 0010:efx_tx_map_chunk+0x54/0x90 [sfc]
   CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D           5.17.0+ #55 e8beeee8289528f11357029357cf
   Code: 48 8b 8d a8 01 00 00 48 8d 14 52 4c 8d 2c d0 44 89 e0 48 85 c9 74 0e 44 89 e2 4c 89 f6 48 80
   RSP: 0018:ffff92f121e45c60 EFLAGS: 00010297
   RIP: 0010:efx_tx_map_chunk+0x54/0x90 [sfc]
   RAX: 0000000000000040 RBX: ffff92ea506895c0 RCX: ffffffffc0330870
   RDX: 0000000000000001 RSI: 00000001139b10ce RDI: ffff92ea506895c0
   RBP: ffffffffc0358a80 R08: 00000001139b110d R09: 0000000000000000
   R10: 0000000000000001 R11: ffff92ea414c0088 R12: 0000000000000040
   R13: 0000000000000018 R14: 00000001139b10ce R15: ffff92ea506895c0
   FS:  0000000000000000(0000) GS:ffff92f121ec0000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   Code: 48 8b 8d a8 01 00 00 48 8d 14 52 4c 8d 2c d0 44 89 e0 48 85 c9 74 0e 44 89 e2 4c 89 f6 48 80
   CR2: 000000000000002a CR3: 00000003e6810004 CR4: 00000000007706e0
   RSP: 0018:ffff92f121e85c60 EFLAGS: 00010297
   PKRU: 55555554
   RAX: 0000000000000040 RBX: ffff92ea50689700 RCX: ffffffffc0330870
   RDX: 0000000000000001 RSI: 00000001145a90ce RDI: ffff92ea50689700
   RBP: ffffffffc0358a80 R08: 00000001145a910d R09: 0000000000000000
   R10: 0000000000000001 R11: ffff92ea414c0088 R12: 0000000000000040
   R13: 0000000000000018 R14: 00000001145a90ce R15: ffff92ea50689700
   FS:  0000000000000000(0000) GS:ffff92f121e80000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 000000000000002a CR3: 00000003e6810005 CR4: 00000000007706e0
   PKRU: 55555554
   Call Trace:
    <IRQ>
    efx_xdp_tx_buffers+0x12b/0x3d0 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
    __efx_rx_packet+0x5c3/0x930 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
    efx_rx_packet+0x28c/0x2e0 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
    efx_ef10_ev_process+0x5f8/0xf40 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
    ? enqueue_task_fair+0x95/0x550
    efx_poll+0xc4/0x360 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]

Fixes: 3990a8f ("sfc: allocate channels for XDP tx queues")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ojeda pushed a commit that referenced this pull request Apr 12, 2022
As guest_irq is coming from KVM_IRQFD API call, it may trigger
crash in svm_update_pi_irte() due to out-of-bounds:

crash> bt
PID: 22218  TASK: ffff951a6ad74980  CPU: 73  COMMAND: "vcpu8"
 #0 [ffffb1ba6707fa40] machine_kexec at ffffffff8565b397
 #1 [ffffb1ba6707fa90] __crash_kexec at ffffffff85788a6d
 #2 [ffffb1ba6707fb58] crash_kexec at ffffffff8578995d
 #3 [ffffb1ba6707fb70] oops_end at ffffffff85623c0d
 #4 [ffffb1ba6707fb90] no_context at ffffffff856692c9
 #5 [ffffb1ba6707fbf8] exc_page_fault at ffffffff85f95b51
 #6 [ffffb1ba6707fc50] asm_exc_page_fault at ffffffff86000ace
    [exception RIP: svm_update_pi_irte+227]
    RIP: ffffffffc0761b53  RSP: ffffb1ba6707fd08  RFLAGS: 00010086
    RAX: ffffb1ba6707fd78  RBX: ffffb1ba66d91000  RCX: 0000000000000001
    RDX: 00003c803f63f1c0  RSI: 000000000000019a  RDI: ffffb1ba66db2ab8
    RBP: 000000000000019a   R8: 0000000000000040   R9: ffff94ca41b82200
    R10: ffffffffffffffcf  R11: 0000000000000001  R12: 0000000000000001
    R13: 0000000000000001  R14: ffffffffffffffcf  R15: 000000000000005f
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb1ba6707fdb8] kvm_irq_routing_update at ffffffffc09f19a1 [kvm]
 #8 [ffffb1ba6707fde0] kvm_set_irq_routing at ffffffffc09f2133 [kvm]
 #9 [ffffb1ba6707fe18] kvm_vm_ioctl at ffffffffc09ef544 [kvm]
    RIP: 00007f143c36488b  RSP: 00007f143a4e04b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007f05780041d0  RCX: 00007f143c36488b
    RDX: 00007f05780041d0  RSI: 000000004008ae6a  RDI: 0000000000000020
    RBP: 00000000000004e8   R8: 0000000000000008   R9: 00007f05780041e0
    R10: 00007f0578004560  R11: 0000000000000246  R12: 00000000000004e0
    R13: 000000000000001a  R14: 00007f1424001c60  R15: 00007f0578003bc0
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

Vmx have been fix this in commit 3a8b067 (KVM: VMX: Do not BUG() on
out-of-bounds guest IRQ), so we can just copy source from that to fix
this.

Co-developed-by: Yi Liu <liu.yi24@zte.com.cn>
Signed-off-by: Yi Liu <liu.yi24@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Message-Id: <20220309113025.44469-1-wang.yi59@zte.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ojeda pushed a commit that referenced this pull request May 4, 2022
There is possible circular locking dependency detected on event_mutex
(see below logs). This is due to set fail safe mode is done at
dp_panel_read_sink_caps() within event_mutex scope. To break this
possible circular locking, this patch move setting fail safe mode
out of event_mutex scope.

[   23.958078] ======================================================
[   23.964430] WARNING: possible circular locking dependency detected
[   23.970777] 5.17.0-rc2-lockdep-00088-g05241de1f69e #148 Not tainted
[   23.977219] ------------------------------------------------------
[   23.983570] DrmThread/1574 is trying to acquire lock:
[   23.988763] ffffff808423aab0 (&dp->event_mutex){+.+.}-{3:3}, at: msm_dp_displ                                                                             ay_enable+0x58/0x164
[   23.997895]
[   23.997895] but task is already holding lock:
[   24.003895] ffffff808420b280 (&kms->commit_lock[i]/1){+.+.}-{3:3}, at: lock_c                                                                             rtcs+0x80/0x8c
[   24.012495]
[   24.012495] which lock already depends on the new lock.
[   24.012495]
[   24.020886]
[   24.020886] the existing dependency chain (in reverse order) is:
[   24.028570]
[   24.028570] -> #5 (&kms->commit_lock[i]/1){+.+.}-{3:3}:
[   24.035472]        __mutex_lock+0xc8/0x384
[   24.039695]        mutex_lock_nested+0x54/0x74
[   24.044272]        lock_crtcs+0x80/0x8c
[   24.048222]        msm_atomic_commit_tail+0x1e8/0x3d0
[   24.053413]        commit_tail+0x7c/0xfc
[   24.057452]        drm_atomic_helper_commit+0x158/0x15c
[   24.062826]        drm_atomic_commit+0x60/0x74
[   24.067403]        drm_mode_atomic_ioctl+0x6b0/0x908
[   24.072508]        drm_ioctl_kernel+0xe8/0x168
[   24.077086]        drm_ioctl+0x320/0x370
[   24.081123]        drm_compat_ioctl+0x40/0xdc
[   24.085602]        __arm64_compat_sys_ioctl+0xe0/0x150
[   24.090895]        invoke_syscall+0x80/0x114
[   24.095294]        el0_svc_common.constprop.3+0xc4/0xf8
[   24.100668]        do_el0_svc_compat+0x2c/0x54
[   24.105242]        el0_svc_compat+0x4c/0xe4
[   24.109548]        el0t_32_sync_handler+0xc4/0xf4
[   24.114381]        el0t_32_sync+0x178
[   24.118688]
[   24.118688] -> #4 (&kms->commit_lock[i]){+.+.}-{3:3}:
[   24.125408]        __mutex_lock+0xc8/0x384
[   24.129628]        mutex_lock_nested+0x54/0x74
[   24.134204]        lock_crtcs+0x80/0x8c
[   24.138155]        msm_atomic_commit_tail+0x1e8/0x3d0
[   24.143345]        commit_tail+0x7c/0xfc
[   24.147382]        drm_atomic_helper_commit+0x158/0x15c
[   24.152755]        drm_atomic_commit+0x60/0x74
[   24.157323]        drm_atomic_helper_set_config+0x68/0x90
[   24.162869]        drm_mode_setcrtc+0x394/0x648
[   24.167535]        drm_ioctl_kernel+0xe8/0x168
[   24.172102]        drm_ioctl+0x320/0x370
[   24.176135]        drm_compat_ioctl+0x40/0xdc
[   24.180621]        __arm64_compat_sys_ioctl+0xe0/0x150
[   24.185904]        invoke_syscall+0x80/0x114
[   24.190302]        el0_svc_common.constprop.3+0xc4/0xf8
[   24.195673]        do_el0_svc_compat+0x2c/0x54
[   24.200241]        el0_svc_compat+0x4c/0xe4
[   24.204544]        el0t_32_sync_handler+0xc4/0xf4
[   24.209378]        el0t_32_sync+0x174/0x178
[   24.213680] -> #3 (crtc_ww_class_mutex){+.+.}-{3:3}:
[   24.220308]        __ww_mutex_lock.constprop.20+0xe8/0x878
[   24.225951]        ww_mutex_lock+0x60/0xd0
[   24.230166]        modeset_lock+0x190/0x19c
[   24.234467]        drm_modeset_lock+0x34/0x54
[   24.238953]        drmm_mode_config_init+0x550/0x764
[   24.244065]        msm_drm_bind+0x170/0x59c
[   24.248374]        try_to_bring_up_master+0x244/0x294
[   24.253572]        __component_add+0xf4/0x14c
[   24.258057]        component_add+0x2c/0x38
[   24.262273]        dsi_dev_attach+0x2c/0x38
[   24.266575]        dsi_host_attach+0xc4/0x120
[   24.271060]        mipi_dsi_attach+0x34/0x48
[   24.275456]        devm_mipi_dsi_attach+0x28/0x68
[   24.280298]        ti_sn_bridge_probe+0x2b4/0x2dc
[   24.285137]        auxiliary_bus_probe+0x78/0x90
[   24.289893]        really_probe+0x1e4/0x3d8
[   24.294194]        __driver_probe_device+0x14c/0x164
[   24.299298]        driver_probe_device+0x54/0xf8
[   24.304043]        __device_attach_driver+0xb4/0x118
[   24.309145]        bus_for_each_drv+0xb0/0xd4
[   24.313628]        __device_attach+0xcc/0x158
[   24.318112]        device_initial_probe+0x24/0x30
[   24.322954]        bus_probe_device+0x38/0x9c
[   24.327439]        deferred_probe_work_func+0xd4/0xf0
[   24.332628]        process_one_work+0x2f0/0x498
[   24.337289]        process_scheduled_works+0x44/0x48
[   24.342391]        worker_thread+0x1e4/0x26c
[   24.346788]        kthread+0xe4/0xf4
[   24.350470]        ret_from_fork+0x10/0x20
[   24.354683]
[   24.354683]
[   24.354683] -> #2 (crtc_ww_class_acquire){+.+.}-{0:0}:
[   24.361489]        drm_modeset_acquire_init+0xe4/0x138
[   24.366777]        drm_helper_probe_detect_ctx+0x44/0x114
[   24.372327]        check_connector_changed+0xbc/0x198
[   24.377517]        drm_helper_hpd_irq_event+0xcc/0x11c
[   24.382804]        dsi_hpd_worker+0x24/0x30
[   24.387104]        process_one_work+0x2f0/0x498
[   24.391762]        worker_thread+0x1d0/0x26c
[   24.396158]        kthread+0xe4/0xf4
[   24.399840]        ret_from_fork+0x10/0x20
[   24.404053]
[   24.404053] -> #1 (&dev->mode_config.mutex){+.+.}-{3:3}:
[   24.411032]        __mutex_lock+0xc8/0x384
[   24.415247]        mutex_lock_nested+0x54/0x74
[   24.419819]        dp_panel_read_sink_caps+0x23c/0x26c
[   24.425108]        dp_display_process_hpd_high+0x34/0xd4
[   24.430570]        dp_display_usbpd_configure_cb+0x30/0x3c
[   24.436205]        hpd_event_thread+0x2ac/0x550
[   24.440864]        kthread+0xe4/0xf4
[   24.444544]        ret_from_fork+0x10/0x20
[   24.448757]
[   24.448757] -> #0 (&dp->event_mutex){+.+.}-{3:3}:
[   24.455116]        __lock_acquire+0xe2c/0x10d8
[   24.459690]        lock_acquire+0x1ac/0x2d0
[   24.463988]        __mutex_lock+0xc8/0x384
[   24.468201]        mutex_lock_nested+0x54/0x74
[   24.472773]        msm_dp_display_enable+0x58/0x164
[   24.477789]        dp_bridge_enable+0x24/0x30
[   24.482273]        drm_atomic_bridge_chain_enable+0x78/0x9c
[   24.488006]        drm_atomic_helper_commit_modeset_enables+0x1bc/0x244
[   24.494801]        msm_atomic_commit_tail+0x248/0x3d0
[   24.499992]        commit_tail+0x7c/0xfc
[   24.504031]        drm_atomic_helper_commit+0x158/0x15c
[   24.509404]        drm_atomic_commit+0x60/0x74
[   24.513976]        drm_mode_atomic_ioctl+0x6b0/0x908
[   24.519079]        drm_ioctl_kernel+0xe8/0x168
[   24.523650]        drm_ioctl+0x320/0x370
[   24.527689]        drm_compat_ioctl+0x40/0xdc
[   24.532175]        __arm64_compat_sys_ioctl+0xe0/0x150
[   24.537463]        invoke_syscall+0x80/0x114
[   24.541861]        el0_svc_common.constprop.3+0xc4/0xf8
[   24.547235]        do_el0_svc_compat+0x2c/0x54
[   24.551806]        el0_svc_compat+0x4c/0xe4
[   24.556106]        el0t_32_sync_handler+0xc4/0xf4
[   24.560948]        el0t_32_sync+0x174/0x178

Changes in v2:
-- add circular lockiing trace

Fixes: d4aca42 ("drm/msm/dp:  always add fail-safe mode into connector mode list")
Signed-off-by: Kuogee Hsieh <quic_khsieh@quicinc.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Patchwork: https://patchwork.freedesktop.org/patch/481396/
Link: https://lore.kernel.org/r/1649451894-554-1-git-send-email-quic_khsieh@quicinc.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
ojeda pushed a commit that referenced this pull request May 4, 2022
…tion

During a scrub, or device replace, we can race with block group removal
and allocation and trigger the following assertion failure:

[7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
[7526.387351] ------------[ cut here ]------------
[7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
[7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
[7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
[7526.393520] Code: f3 48 c7 c7 20 (...)
[7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
[7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
[7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
[7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
[7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
[7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
[7526.402874] FS:  00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
[7526.404038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
[7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7526.408169] Call Trace:
[7526.408529]  <TASK>
[7526.408839]  scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
[7526.409690]  ? do_wait_intr_irq+0xb0/0xb0
[7526.410276]  btrfs_scrub_dev+0x226/0x620 [btrfs]
[7526.410995]  ? preempt_count_add+0x49/0xa0
[7526.411592]  btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
[7526.412278]  ? __fget_files+0xc9/0x1b0
[7526.412825]  ? kvm_sched_clock_read+0x14/0x40
[7526.413459]  ? lock_release+0x155/0x4a0
[7526.414022]  ? __x64_sys_ioctl+0x83/0xb0
[7526.414601]  __x64_sys_ioctl+0x83/0xb0
[7526.415150]  do_syscall_64+0x3b/0xc0
[7526.415675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[7526.416408] RIP: 0033:0x7f6c99d34397
[7526.416931] Code: 3c 1c e8 1c ff (...)
[7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
[7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
[7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
[7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
[7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
[7526.425950]  </TASK>

That assertion is relatively new, introduced with commit d04fbe1
("btrfs: scrub: cleanup the argument list of scrub_chunk()").

The block group we get at scrub_enumerate_chunks() can actually have a
start address that is smaller then the chunk offset we extracted from a
device extent item we got from the commit root of the device tree.
This is very rare, but it can happen due to a race with block group
removal and allocation. For example, the following steps show how this
can happen:

1) We are at transaction T, and we have the following blocks groups,
   sorted by their logical start address:

   [ bg A, start address A, length 1G (data) ]
   [ bg B, start address B, length 1G (data) ]
   (...)
   [ bg W, start address W, length 1G (data) ]

     --> logical address space hole of 256M,
         there used to be a 256M metadata block group here

   [ bg Y, start address Y, length 256M (metadata) ]

      --> Y matches W's end offset + 256M

   Block group Y is the block group with the highest logical address in
   the whole filesystem;

2) Block group Y is deleted and its extent mapping is removed by the call
   to remove_extent_mapping() made from btrfs_remove_block_group().

   So after this point, the last element of the mapping red black tree,
   its rightmost node, is the mapping for block group W;

3) While still at transaction T, a new data block group is allocated,
   with a length of 1G. When creating the block group we do a call to
   find_next_chunk(), which returns the logical start address for the
   new block group. This calls returns X, which corresponds to the
   end offset of the last block group, the rightmost node in the mapping
   red black tree (fs_info->mapping_tree), plus one.

   So we get a new block group that starts at logical address X and with
   a length of 1G. It spans over the whole logical range of the old block
   group Y, that was previously removed in the same transaction.

   However the device extent allocated to block group X is not the same
   device extent that was used by block group Y, and it also does not
   overlap that extent, which must be always the case because we allocate
   extents by searching through the commit root of the device tree
   (otherwise it could corrupt a filesystem after a power failure or
   an unclean shutdown in general), so the extent allocator is behaving
   as expected;

4) We have a task running scrub, currently at scrub_enumerate_chunks().
   There it searches for device extent items in the device tree, using
   its commit root. It finds a device extent item that was used by
   block group Y, and it extracts the value Y from that item into the
   local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();

   It then calls btrfs_lookup_block_group() to find block group for
   the logical address Y - since there's currently no block group that
   starts at that logical address, it returns block group X, because
   its range contains Y.

   This results in triggering the assertion:

      ASSERT(cache->start == chunk_offset);

   right before calling scrub_chunk(), as cache->start is X and
   chunk_offset is Y.

This is more likely to happen of filesystems not larger than 50G, because
for these filesystems we use a 256M size for metadata block groups and
a 1G size for data block groups, while for filesystems larger than 50G,
we use a 1G size for both data and metadata block groups (except for
zoned filesystems). It could also happen on any filesystem size due to
the fact that system block groups are always smaller (32M) than both
data and metadata block groups, but these are not frequently deleted, so
much less likely to trigger the race.

So make scrub skip any block group with a start offset that is less than
the value we expect, as that means it's a new block group that was created
in the current transaction. It's pointless to continue and try to scrub
its extents, because scrub searches for extents using the commit root, so
it won't find any. For a device replace, skip it as well for the same
reasons, and we don't need to worry about the possibility of extents of
the new block group not being to the new device, because we have the write
duplication setup done through btrfs_map_block().

Fixes: d04fbe1 ("btrfs: scrub: cleanup the argument list of scrub_chunk()")
CC: stable@vger.kernel.org # 5.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ojeda pushed a commit that referenced this pull request May 4, 2022
While handling PCI errors (AER flow) driver tries to
disable NAPI [napi_disable()] after NAPI is deleted
[__netif_napi_del()] which causes unexpected system
hang/crash.

System message log shows the following:
=======================================
[ 3222.537510] EEH: Detected PCI bus error on PHB#384-PE#800000 [ 3222.537511] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
[ 3222.537512] EEH: Notify device drivers to shutdown [ 3222.537513] EEH: Beginning: 'error_detected(IO frozen)'
[ 3222.537514] EEH: PE#800000 (PCI 0384:80:00.0): Invoking
bnx2x->error_detected(IO frozen)
[ 3222.537516] bnx2x: [bnx2x_io_error_detected:14236(eth14)]IO error detected [ 3222.537650] EEH: PE#800000 (PCI 0384:80:00.0): bnx2x driver reports:
'need reset'
[ 3222.537651] EEH: PE#800000 (PCI 0384:80:00.1): Invoking
bnx2x->error_detected(IO frozen)
[ 3222.537651] bnx2x: [bnx2x_io_error_detected:14236(eth13)]IO error detected [ 3222.537729] EEH: PE#800000 (PCI 0384:80:00.1): bnx2x driver reports:
'need reset'
[ 3222.537729] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 3222.537890] EEH: Collect temporary log [ 3222.583481] EEH: of node=0384:80:00.0 [ 3222.583519] EEH: PCI device/vendor: 168e14e4 [ 3222.583557] EEH: PCI cmd/status register: 00100140 [ 3222.583557] EEH: PCI-E capabilities and status follow:
[ 3222.583744] EEH: PCI-E 00: 00020010 012c8da 00095d5e 00455c82 [ 3222.583892] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 [ 3222.583893] EEH: PCI-E 20: 00000000 [ 3222.583893] EEH: PCI-E AER capability register set follows:
[ 3222.584079] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030 [ 3222.584230] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000 [ 3222.584378] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 [ 3222.584416] EEH: PCI-E AER 30: 00000000 00000000 [ 3222.584416] EEH: of node=0384:80:00.1 [ 3222.584454] EEH: PCI device/vendor: 168e14e4 [ 3222.584491] EEH: PCI cmd/status register: 00100140 [ 3222.584492] EEH: PCI-E capabilities and status follow:
[ 3222.584677] EEH: PCI-E 00: 00020010 012c8da 00095d5e 00455c82 [ 3222.584825] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 [ 3222.584826] EEH: PCI-E 20: 00000000 [ 3222.584826] EEH: PCI-E AER capability register set follows:
[ 3222.585011] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030 [ 3222.585160] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000 [ 3222.585309] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 [ 3222.585347] EEH: PCI-E AER 30: 00000000 00000000 [ 3222.586872] RTAS: event: 5, Type: Platform Error (224), Severity: 2 [ 3222.586873] EEH: Reset without hotplug activity [ 3224.762767] EEH: Beginning: 'slot_reset'
[ 3224.762770] EEH: PE#800000 (PCI 0384:80:00.0): Invoking
bnx2x->slot_reset()
[ 3224.762771] bnx2x: [bnx2x_io_slot_reset:14271(eth14)]IO slot reset initializing...
[ 3224.762887] bnx2x 0384:80:00.0: enabling device (0140 -> 0142) [ 3224.768157] bnx2x: [bnx2x_io_slot_reset:14287(eth14)]IO slot reset
--> driver unload

Uninterruptible tasks
=====================
crash> ps | grep UN
     213      2  11  c000000004c89e00  UN   0.0       0      0  [eehd]
     215      2   0  c000000004c80000  UN   0.0       0      0
[kworker/0:2]
    2196      1  28  c000000004504f00  UN   0.1   15936  11136  wickedd
    4287      1   9  c00000020d076800  UN   0.0    4032   3008  agetty
    4289      1  20  c00000020d056680  UN   0.0    7232   3840  agetty
   32423      2  26  c00000020038c580  UN   0.0       0      0
[kworker/26:3]
   32871   4241  27  c0000002609ddd00  UN   0.1   18624  11648  sshd
   32920  10130  16  c00000027284a100  UN   0.1   48512  12608  sendmail
   33092  32987   0  c000000205218b00  UN   0.1   48512  12608  sendmail
   33154   4567  16  c000000260e51780  UN   0.1   48832  12864  pickup
   33209   4241  36  c000000270cb6500  UN   0.1   18624  11712  sshd
   33473  33283   0  c000000205211480  UN   0.1   48512  12672  sendmail
   33531   4241  37  c00000023c902780  UN   0.1   18624  11648  sshd

EEH handler hung while bnx2x sleeping and holding RTNL lock
===========================================================
crash> bt 213
PID: 213    TASK: c000000004c89e00  CPU: 11  COMMAND: "eehd"
  #0 [c000000004d477e0] __schedule at c000000000c70808
  #1 [c000000004d478b0] schedule at c000000000c70ee0
  #2 [c000000004d478e0] schedule_timeout at c000000000c76dec
  #3 [c000000004d479c0] msleep at c0000000002120cc
  #4 [c000000004d479f0] napi_disable at c000000000a06448
                                        ^^^^^^^^^^^^^^^^
  #5 [c000000004d47a30] bnx2x_netif_stop at c0080000018dba94 [bnx2x]
  #6 [c000000004d47a60] bnx2x_io_slot_reset at c0080000018a551c [bnx2x]
  #7 [c000000004d47b20] eeh_report_reset at c00000000004c9bc
  #8 [c000000004d47b90] eeh_pe_report at c00000000004d1a8
  #9 [c000000004d47c40] eeh_handle_normal_event at c00000000004da64

And the sleeping source code
============================
crash> dis -ls c000000000a06448
FILE: ../net/core/dev.c
LINE: 6702

   6697  {
   6698          might_sleep();
   6699          set_bit(NAPI_STATE_DISABLE, &n->state);
   6700
   6701          while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
* 6702                  msleep(1);
   6703          while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
   6704                  msleep(1);
   6705
   6706          hrtimer_cancel(&n->timer);
   6707
   6708          clear_bit(NAPI_STATE_DISABLE, &n->state);
   6709  }

EEH calls into bnx2x twice based on the system log above, first through
bnx2x_io_error_detected() and then bnx2x_io_slot_reset(), and executes
the following call chains:

bnx2x_io_error_detected()
  +-> bnx2x_eeh_nic_unload()
       +-> bnx2x_del_all_napi()
            +-> __netif_napi_del()

bnx2x_io_slot_reset()
  +-> bnx2x_netif_stop()
       +-> bnx2x_napi_disable()
            +->napi_disable()

Fix this by correcting the sequence of NAPI APIs usage,
that is delete the NAPI after disabling it.

Fixes: 7fa6f34 ("bnx2x: AER revised")
Reported-by: David Christensen <drc@linux.vnet.ibm.com>
Tested-by: David Christensen <drc@linux.vnet.ibm.com>
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Link: https://lore.kernel.org/r/20220426153913.6966-1-manishc@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this pull request May 22, 2022
As reported by Alan, the CFI (Call Frame Information) in the VDSO time
routines is incorrect since commit ce7d805 ("powerpc/vdso: Prepare
for switching VDSO to generic C implementation.").

DWARF has a concept called the CFA (Canonical Frame Address), which on
powerpc is calculated as an offset from the stack pointer (r1). That
means when the stack pointer is changed there must be a corresponding
CFI directive to update the calculation of the CFA.

The current code is missing those directives for the changes to r1,
which prevents gdb from being able to generate a backtrace from inside
VDSO functions, eg:

  Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb) bt
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  #1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  #2  0x00007fffffffd960 in ?? ()
  #3  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  Backtrace stopped: frame did not save the PC

Alan helpfully describes some rules for correctly maintaining the CFI information:

  1) Every adjustment to the current frame address reg (ie. r1) must be
     described, and exactly at the instruction where r1 changes. Why?
     Because stack unwinding might want to access previous frames.

  2) If a function changes LR or any non-volatile register, the save
     location for those regs must be given. The CFI can be at any
     instruction after the saves up to the point that the reg is
     changed.
     (Exception: LR save should be described before a bl. not after)

  3) If asychronous unwind info is needed then restores of LR and
     non-volatile regs must also be described. The CFI can be at any
     instruction after the reg is restored up to the point where the
     save location is (potentially) trashed.

Fix the inability to backtrace by adding CFI directives describing the
changes to r1, ie. satisfying rule 1.

Also change the information for LR to point to the copy saved on the
stack, not the value in r0 that will be overwritten by the function
call.

Finally, add CFI directives describing the save/restore of r2.

With the fix gdb can correctly back trace and navigate up and down the stack:

  Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb) bt
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  #1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  #2  0x0000000100015b60 in gettime ()
  #3  0x000000010000c8bc in print_long_format ()
  #4  0x000000010000d180 in print_current_files ()
  #5  0x00000001000054ac in main ()
  (gdb) up
  #1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  (gdb)
  #2  0x0000000100015b60 in gettime ()
  (gdb)
  #3  0x000000010000c8bc in print_long_format ()
  (gdb)
  #4  0x000000010000d180 in print_current_files ()
  (gdb)
  #5  0x00000001000054ac in main ()
  (gdb)
  Initial frame selected; you cannot go up.
  (gdb) down
  #4  0x000000010000d180 in print_current_files ()
  (gdb)
  #3  0x000000010000c8bc in print_long_format ()
  (gdb)
  #2  0x0000000100015b60 in gettime ()
  (gdb)
  #1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  (gdb)
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb)

Fixes: ce7d805 ("powerpc/vdso: Prepare for switching VDSO to generic C implementation.")
Cc: stable@vger.kernel.org # v5.11+
Reported-by: Alan Modra <amodra@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Link: https://lore.kernel.org/r/20220502125010.1319370-1-mpe@ellerman.id.au
ojeda pushed a commit that referenced this pull request May 22, 2022
Do not allow to write timestamps on RX rings if PF is being configured.
When PF is being configured RX rings can be freed or rebuilt. If at the
same time timestamps are updated, the kernel will crash by dereferencing
null RX ring pointer.

PID: 1449   TASK: ff187d28ed658040  CPU: 34  COMMAND: "ice-ptp-0000:51"
 #0 [ff1966a94a713bb0] machine_kexec at ffffffff9d05a0be
 #1 [ff1966a94a713c08] __crash_kexec at ffffffff9d192e9d
 #2 [ff1966a94a713cd0] crash_kexec at ffffffff9d1941bd
 #3 [ff1966a94a713ce8] oops_end at ffffffff9d01bd54
 #4 [ff1966a94a713d08] no_context at ffffffff9d06bda4
 #5 [ff1966a94a713d60] __bad_area_nosemaphore at ffffffff9d06c10c
 #6 [ff1966a94a713da8] do_page_fault at ffffffff9d06cae4
 #7 [ff1966a94a713de0] page_fault at ffffffff9da0107e
    [exception RIP: ice_ptp_update_cached_phctime+91]
    RIP: ffffffffc076db8b  RSP: ff1966a94a713e98  RFLAGS: 00010246
    RAX: 16e3db9c6b7ccae4  RBX: ff187d269dd3c180  RCX: ff187d269cd4d018
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ff187d269cfcc644   R8: ff187d339b9641b0   R9: 0000000000000000
    R10: 0000000000000002  R11: 0000000000000000  R12: ff187d269cfcc648
    R13: ffffffff9f128784  R14: ffffffff9d101b70  R15: ff187d269cfcc640
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ff1966a94a713ea0] ice_ptp_periodic_work at ffffffffc076dbef [ice]
 #9 [ff1966a94a713ee0] kthread_worker_fn at ffffffff9d101c1b
 #10 [ff1966a94a713f10] kthread at ffffffff9d101b4d
 #11 [ff1966a94a713f50] ret_from_fork at ffffffff9da0023f

Fixes: 77a7811 ("ice: enable receive hardware timestamping")
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Tested-by: Dave Cain <dcain@redhat.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ojeda pushed a commit that referenced this pull request Jul 11, 2022
This was missed in c3ed222 ("NFSv4: Fix free of uninitialized
nfs4_label on referral lookup.") and causes a panic when mounting
with '-o trunkdiscovery':

PID: 1604   TASK: ffff93dac3520000  CPU: 3   COMMAND: "mount.nfs"
 #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee
 #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd
 #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed
 #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d
 #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e
    [exception RIP: nfs_fattr_init+0x5]
    RIP: ffffffffc0c18265  RSP: ffffb79140f73b08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff93dac304a800  RCX: 0000000000000000
    RDX: ffffb79140f73bb0  RSI: ffff93dadc8cbb40  RDI: d03ee11cfaf6bd50
    RBP: ffffb79140f73be8   R8: ffffffffc0691560   R9: 0000000000000006
    R10: ffff93db3ffd3df8  R11: 0000000000000000  R12: ffff93dac4040000
    R13: ffff93dac2848e00  R14: ffffb79140f73b60  R15: ffffb79140f73b30
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4]
 #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4]
 #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4]
 #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs]
 #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs]
    RIP: 00007f6254fce26e  RSP: 00007ffc69496ac8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f6254fce26e
    RDX: 00005600220a82a0  RSI: 00005600220a64d0  RDI: 00005600220a6520
    RBP: 00007ffc69496c50   R8: 00005600220a8710   R9: 003035322e323231
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc69496c50
    R13: 00005600220a8440  R14: 0000000000000010  R15: 0000560020650ef9
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Fixes: c3ed222 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
ojeda pushed a commit that referenced this pull request Jul 25, 2022
…tion

Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.

Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:

 #1> mkdir -p /sys/fs/cgroup/misc/a/b
 #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
 #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
 #4> PID=$!
 #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
 #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.

After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:

 1. cgroup_migrate_add_src() is called on B for the leader.

 2. cgroup_migrate_add_src() is called on A for the other threads.

 3. cgroup_migrate_prepare_dst() is called. It scans the src list.

 4. It notices that B wants to migrate to A, so it tries to A to the dst
    list but realizes that its ->mg_preload_node is already busy.

 5. and then it notices A wants to migrate to A as it's an identity
    migration, it culls it by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

 6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.

This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.

This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9 ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
ojeda pushed a commit that referenced this pull request Jul 25, 2022
…abled

It was brought up that on ARMv7, that because the FUNCTION_TRACER does not
use nops to keep function tracing disabled because of the use of a link
register, it does have some performance impact.

The start of functions when -pg is used to compile the kernel is:

	push    {lr}
	bl      8010e7c0 <__gnu_mcount_nc>

When function tracing is tuned off, it becomes:

	push    {lr}
	add   sp, sp, #4

Which just puts the stack back to its normal location. But these two
instructions at the start of every function does incur some overhead.

Be more honest in the Kconfig FUNCTION_TRACER description and specify that
the overhead being in the noise was x86 specific, but other architectures
may vary.

Link: https://lore.kernel.org/all/20220705105416.GE5208@pengutronix.de/
Link: https://lkml.kernel.org/r/20220706161231.085a83da@gandalf.local.home

Reported-by: Sascha Hauer <sha@pengutronix.de>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
ojeda pushed a commit that referenced this pull request Jul 25, 2022
On powerpc, 'perf trace' is crashing with a SIGSEGV when trying to
process a perf.data file created with 'perf trace record -p':

  #0  0x00000001225b8988 in syscall_arg__scnprintf_augmented_string <snip> at builtin-trace.c:1492
  #1  syscall_arg__scnprintf_filename <snip> at builtin-trace.c:1492
  #2  syscall_arg__scnprintf_filename <snip> at builtin-trace.c:1486
  #3  0x00000001225bdd9c in syscall_arg_fmt__scnprintf_val <snip> at builtin-trace.c:1973
  #4  syscall__scnprintf_args <snip> at builtin-trace.c:2041
  #5  0x00000001225bff04 in trace__sys_enter <snip> at builtin-trace.c:2319

That points to the below code in tools/perf/builtin-trace.c:
	/*
	 * If this is raw_syscalls.sys_enter, then it always comes with the 6 possible
	 * arguments, even if the syscall being handled, say "openat", uses only 4 arguments
	 * this breaks syscall__augmented_args() check for augmented args, as we calculate
	 * syscall->args_size using each syscalls:sys_enter_NAME tracefs format file,
	 * so when handling, say the openat syscall, we end up getting 6 args for the
	 * raw_syscalls:sys_enter event, when we expected just 4, we end up mistakenly
	 * thinking that the extra 2 u64 args are the augmented filename, so just check
	 * here and avoid using augmented syscalls when the evsel is the raw_syscalls one.
	 */
	if (evsel != trace->syscalls.events.sys_enter)
		augmented_args = syscall__augmented_args(sc, sample, &augmented_args_size, trace->raw_augmented_syscalls_args_size);

As the comment points out, we should not be trying to augment the args
for raw_syscalls. However, when processing a perf.data file, we are not
initializing those properly. Fix the same.

Reported-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/20220707090900.572584-1-naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
wedsonaf pushed a commit to wedsonaf/linux that referenced this pull request Oct 5, 2022
The SRv6 layer allows defining HMAC data that can later be used to sign IPv6
Segment Routing Headers. This configuration is realised via netlink through
four attributes: SEG6_ATTR_HMACKEYID, SEG6_ATTR_SECRET, SEG6_ATTR_SECRETLEN and
SEG6_ATTR_ALGID. Because the SECRETLEN attribute is decoupled from the actual
length of the SECRET attribute, it is possible to provide invalid combinations
(e.g., secret = "", secretlen = 64). This case is not checked in the code and
with an appropriately crafted netlink message, an out-of-bounds read of up
to 64 bytes (max secret length) can occur past the skb end pointer and into
skb_shared_info:

Breakpoint 1, seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
208		memcpy(hinfo->secret, secret, slen);
(gdb) bt
 #0  seg6_genl_sethmac (skb=<optimized out>, info=<optimized out>) at net/ipv6/seg6.c:208
 Rust-for-Linux#1  0xffffffff81e012e9 in genl_family_rcv_msg_doit (skb=skb@entry=0xffff88800b1f9f00, nlh=nlh@entry=0xffff88800b1b7600,
    extack=extack@entry=0xffffc90000ba7af0, ops=ops@entry=0xffffc90000ba7a80, hdrlen=4, net=0xffffffff84237580 <init_net>, family=<optimized out>,
    family=<optimized out>) at net/netlink/genetlink.c:731
 Rust-for-Linux#2  0xffffffff81e01435 in genl_family_rcv_msg (extack=0xffffc90000ba7af0, nlh=0xffff88800b1b7600, skb=0xffff88800b1f9f00,
    family=0xffffffff82fef6c0 <seg6_genl_family>) at net/netlink/genetlink.c:775
 Rust-for-Linux#3  genl_rcv_msg (skb=0xffff88800b1f9f00, nlh=0xffff88800b1b7600, extack=0xffffc90000ba7af0) at net/netlink/genetlink.c:792
 Rust-for-Linux#4  0xffffffff81dfffc3 in netlink_rcv_skb (skb=skb@entry=0xffff88800b1f9f00, cb=cb@entry=0xffffffff81e01350 <genl_rcv_msg>)
    at net/netlink/af_netlink.c:2501
 Rust-for-Linux#5  0xffffffff81e00919 in genl_rcv (skb=0xffff88800b1f9f00) at net/netlink/genetlink.c:803
 Rust-for-Linux#6  0xffffffff81dff6ae in netlink_unicast_kernel (ssk=0xffff888010eec800, skb=0xffff88800b1f9f00, sk=0xffff888004aed000)
    at net/netlink/af_netlink.c:1319
 Rust-for-Linux#7  netlink_unicast (ssk=ssk@entry=0xffff888010eec800, skb=skb@entry=0xffff88800b1f9f00, portid=portid@entry=0, nonblock=<optimized out>)
    at net/netlink/af_netlink.c:1345
 Rust-for-Linux#8  0xffffffff81dff9a4 in netlink_sendmsg (sock=<optimized out>, msg=0xffffc90000ba7e48, len=<optimized out>) at net/netlink/af_netlink.c:1921
...
(gdb) p/x ((struct sk_buff *)0xffff88800b1f9f00)->head + ((struct sk_buff *)0xffff88800b1f9f00)->end
$1 = 0xffff88800b1b76c0
(gdb) p/x secret
$2 = 0xffff88800b1b76c0
(gdb) p slen
$3 = 64 '@'

The OOB data can then be read back from userspace by dumping HMAC state. This
commit fixes this by ensuring SECRETLEN cannot exceed the actual length of
SECRET.

Reported-by: Lucas Leong <wmliang.tw@gmail.com>
Tested: verified that EINVAL is correctly returned when secretlen > len(secret)
Fixes: 4f4853d ("ipv6: sr: implement API to control SR HMAC structure")
Signed-off-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wedsonaf pushed a commit to wedsonaf/linux that referenced this pull request Oct 5, 2022
Fix a NULL dereference of the struct bonding.rr_tx_counter member because
if a bond is initially created with an initial mode != zero (Round Robin)
the memory required for the counter is never created and when the mode is
changed there is never any attempt to verify the memory is allocated upon
switching modes.

This causes the following Oops on an aarch64 machine:
    [  334.686773] Unable to handle kernel paging request at virtual address ffff2c91ac905000
    [  334.694703] Mem abort info:
    [  334.697486]   ESR = 0x0000000096000004
    [  334.701234]   EC = 0x25: DABT (current EL), IL = 32 bits
    [  334.706536]   SET = 0, FnV = 0
    [  334.709579]   EA = 0, S1PTW = 0
    [  334.712719]   FSC = 0x04: level 0 translation fault
    [  334.717586] Data abort info:
    [  334.720454]   ISV = 0, ISS = 0x00000004
    [  334.724288]   CM = 0, WnR = 0
    [  334.727244] swapper pgtable: 4k pages, 48-bit VAs, pgdp=000008044d662000
    [  334.733944] [ffff2c91ac905000] pgd=0000000000000000, p4d=0000000000000000
    [  334.740734] Internal error: Oops: 96000004 [Rust-for-Linux#1] SMP
    [  334.745602] Modules linked in: bonding tls veth rfkill sunrpc arm_spe_pmu vfat fat acpi_ipmi ipmi_ssif ixgbe igb i40e mdio ipmi_devintf ipmi_msghandler arm_cmn arm_dsu_pmu cppc_cpufreq acpi_tad fuse zram crct10dif_ce ast ghash_ce sbsa_gwdt nvme drm_vram_helper drm_ttm_helper nvme_core ttm xgene_hwmon
    [  334.772217] CPU: 7 PID: 2214 Comm: ping Not tainted 6.0.0-rc4-00133-g64ae13ed4784 Rust-for-Linux#4
    [  334.779950] Hardware name: GIGABYTE R272-P31-00/MP32-AR1-00, BIOS F18v (SCP: 1.08.20211002) 12/01/2021
    [  334.789244] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [  334.796196] pc : bond_rr_gen_slave_id+0x40/0x124 [bonding]
    [  334.801691] lr : bond_xmit_roundrobin_slave_get+0x38/0xdc [bonding]
    [  334.807962] sp : ffff8000221733e0
    [  334.811265] x29: ffff8000221733e0 x28: ffffdbac8572d198 x27: ffff80002217357c
    [  334.818392] x26: 000000000000002a x25: ffffdbacb33ee000 x24: ffff07ff980fa000
    [  334.825519] x23: ffffdbacb2e398ba x22: ffff07ff98102000 x21: ffff07ff981029c0
    [  334.832646] x20: 0000000000000001 x19: ffff07ff981029c0 x18: 0000000000000014
    [  334.839773] x17: 0000000000000000 x16: ffffdbacb1004364 x15: 0000aaaabe2f5a62
    [  334.846899] x14: ffff07ff8e55d968 x13: ffff07ff8e55db30 x12: 0000000000000000
    [  334.854026] x11: ffffdbacb21532e8 x10: 0000000000000001 x9 : ffffdbac857178ec
    [  334.861153] x8 : ffff07ff9f6e5a28 x7 : 0000000000000000 x6 : 000000007c2b3742
    [  334.868279] x5 : ffff2c91ac905000 x4 : ffff2c91ac905000 x3 : ffff07ff9f554400
    [  334.875406] x2 : ffff2c91ac905000 x1 : 0000000000000001 x0 : ffff07ff981029c0
    [  334.882532] Call trace:
    [  334.884967]  bond_rr_gen_slave_id+0x40/0x124 [bonding]
    [  334.890109]  bond_xmit_roundrobin_slave_get+0x38/0xdc [bonding]
    [  334.896033]  __bond_start_xmit+0x128/0x3a0 [bonding]
    [  334.901001]  bond_start_xmit+0x54/0xb0 [bonding]
    [  334.905622]  dev_hard_start_xmit+0xb4/0x220
    [  334.909798]  __dev_queue_xmit+0x1a0/0x720
    [  334.913799]  arp_xmit+0x3c/0xbc
    [  334.916932]  arp_send_dst+0x98/0xd0
    [  334.920410]  arp_solicit+0xe8/0x230
    [  334.923888]  neigh_probe+0x60/0xb0
    [  334.927279]  __neigh_event_send+0x3b0/0x470
    [  334.931453]  neigh_resolve_output+0x70/0x90
    [  334.935626]  ip_finish_output2+0x158/0x514
    [  334.939714]  __ip_finish_output+0xac/0x1a4
    [  334.943800]  ip_finish_output+0x40/0xfc
    [  334.947626]  ip_output+0xf8/0x1a4
    [  334.950931]  ip_send_skb+0x5c/0x100
    [  334.954410]  ip_push_pending_frames+0x3c/0x60
    [  334.958758]  raw_sendmsg+0x458/0x6d0
    [  334.962325]  inet_sendmsg+0x50/0x80
    [  334.965805]  sock_sendmsg+0x60/0x6c
    [  334.969286]  __sys_sendto+0xc8/0x134
    [  334.972853]  __arm64_sys_sendto+0x34/0x4c
    [  334.976854]  invoke_syscall+0x78/0x100
    [  334.980594]  el0_svc_common.constprop.0+0x4c/0xf4
    [  334.985287]  do_el0_svc+0x38/0x4c
    [  334.988591]  el0_svc+0x34/0x10c
    [  334.991724]  el0t_64_sync_handler+0x11c/0x150
    [  334.996072]  el0t_64_sync+0x190/0x194
    [  334.999726] Code: b9001062 f9403c02 d53cd044 8b040042 (b8210040)
    [  335.005810] ---[ end trace 0000000000000000 ]---
    [  335.010416] Kernel panic - not syncing: Oops: Fatal exception in interrupt
    [  335.017279] SMP: stopping secondary CPUs
    [  335.021374] Kernel Offset: 0x5baca8eb0000 from 0xffff800008000000
    [  335.027456] PHYS_OFFSET: 0x80000000
    [  335.030932] CPU features: 0x0000,0085c029,19805c82
    [  335.035713] Memory Limit: none
    [  335.038756] Rebooting in 180 seconds..

The fix is to allocate the memory in bond_open() which is guaranteed
to be called before any packets are processed.

Fixes: 848ca91 ("net: bonding: Use per-cpu rr_tx_counter")
CC: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wedsonaf pushed a commit to wedsonaf/linux that referenced this pull request Oct 5, 2022
When walking through an inode extents, the ext4_ext_binsearch_idx() function
assumes that the extent header has been previously validated.  However, there
are no checks that verify that the number of entries (eh->eh_entries) is
non-zero when depth is > 0.  And this will lead to problems because the
EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:

[  135.245946] ------------[ cut here ]------------
[  135.247579] kernel BUG at fs/ext4/extents.c:2258!
[  135.249045] invalid opcode: 0000 [Rust-for-Linux#1] PREEMPT SMP
[  135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ Rust-for-Linux#4
[  135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
[  135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
[  135.256475] Code:
[  135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
[  135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
[  135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
[  135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
[  135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
[  135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
[  135.272394] FS:  00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[  135.274510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
[  135.277952] Call Trace:
[  135.278635]  <TASK>
[  135.279247]  ? preempt_count_add+0x6d/0xa0
[  135.280358]  ? percpu_counter_add_batch+0x55/0xb0
[  135.281612]  ? _raw_read_unlock+0x18/0x30
[  135.282704]  ext4_map_blocks+0x294/0x5a0
[  135.283745]  ? xa_load+0x6f/0xa0
[  135.284562]  ext4_mpage_readpages+0x3d6/0x770
[  135.285646]  read_pages+0x67/0x1d0
[  135.286492]  ? folio_add_lru+0x51/0x80
[  135.287441]  page_cache_ra_unbounded+0x124/0x170
[  135.288510]  filemap_get_pages+0x23d/0x5a0
[  135.289457]  ? path_openat+0xa72/0xdd0
[  135.290332]  filemap_read+0xbf/0x300
[  135.291158]  ? _raw_spin_lock_irqsave+0x17/0x40
[  135.292192]  new_sync_read+0x103/0x170
[  135.293014]  vfs_read+0x15d/0x180
[  135.293745]  ksys_read+0xa1/0xe0
[  135.294461]  do_syscall_64+0x3c/0x80
[  135.295284]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch simply adds an extra check in __ext4_ext_check(), verifying that
eh_entries is not 0 when eh_depth is > 0.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
Cc: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Darksonn pushed a commit to Darksonn/linux that referenced this pull request Oct 19, 2022
…-bindings"

Conor Dooley <mail@conchuod.ie> says:

From: Conor Dooley <conor.dooley@microchip.com>

The device trees produced automatically for the virt and spike machines
fail dt-validate on several grounds. Some of these need to be fixed in
the linux kernel's dt-bindings, but others are caused by bugs in QEMU.

Patches been sent that fix the QEMU issues [0], but a couple of them
need to be fixed in the kernel's dt-bindings. The first patches add
compatibles for "riscv,{clint,plic}0" which are present in drivers and
the auto generated QEMU dtbs.

Thanks to Rob Herring for reporting these issues [1],
Conor.

To reproduce the errors:
./build/qemu-system-riscv64 -nographic -machine virt,dumpdtb=qemu.dtb
dt-validate -p /path/to/linux/kernel/Documentation/devicetree/bindings/processed-schema.json qemu.dtb
(The processed schema needs to be generated first)

0 - https://lore.kernel.org/linux-riscv/20220810184612.157317-1-mail@conchuod.ie/
1 - https://lore.kernel.org/linux-riscv/20220803170552.GA2250266-robh@kernel.org/

* fix-dt-validate:
  dt-bindings: riscv: add new riscv,isa strings for emulators
  dt-bindings: interrupt-controller: sifive,plic: add legacy riscv compatible
  dt-bindings: timer: sifive,clint: add legacy riscv compatible

Link: https://lore.kernel.org/r/20220823183319.3314940-1-mail@conchuod.ie
[Palmer: some cover letter pruning, and dropped Rust-for-Linux#4 as suggested.]
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
ojeda pushed a commit that referenced this pull request Jan 16, 2023
…ata and not linked with libtraceevent

When we have a perf.data file with tracepoints, such as:

  # perf evlist -f
  probe_perf:lzma_decompress_to_file
  # Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events
  #

We end up segfaulting when using perf built with NO_LIBTRACEEVENT=1 by
trying to find an evsel with a NULL 'event_name' variable:

  (gdb) run report --stdio -f
  Starting program: /root/bin/perf report --stdio -f

  Program received signal SIGSEGV, Segmentation fault.
  0x000000000055219d in find_evsel (evlist=0xfda7b0, event_name=0x0) at util/sort.c:2830
  warning: Source file is more recent than executable.
  2830		if (event_name[0] == '%') {
  Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-11.fc36.x86_64 cyrus-sasl-lib-2.1.27-18.fc36.x86_64 elfutils-debuginfod-client-0.188-3.fc36.x86_64 elfutils-libelf-0.188-3.fc36.x86_64 elfutils-libs-0.188-3.fc36.x86_64 glibc-2.35-20.fc36.x86_64 keyutils-libs-1.6.1-4.fc36.x86_64 krb5-libs-1.19.2-12.fc36.x86_64 libbrotli-1.0.9-7.fc36.x86_64 libcap-2.48-4.fc36.x86_64 libcom_err-1.46.5-2.fc36.x86_64 libcurl-7.82.0-12.fc36.x86_64 libevent-2.1.12-6.fc36.x86_64 libgcc-12.2.1-4.fc36.x86_64 libidn2-2.3.4-1.fc36.x86_64 libnghttp2-1.51.0-1.fc36.x86_64 libpsl-0.21.1-5.fc36.x86_64 libselinux-3.3-4.fc36.x86_64 libssh-0.9.6-4.fc36.x86_64 libstdc++-12.2.1-4.fc36.x86_64 libunistring-1.0-1.fc36.x86_64 libunwind-1.6.2-2.fc36.x86_64 libxcrypt-4.4.33-4.fc36.x86_64 libzstd-1.5.2-2.fc36.x86_64 numactl-libs-2.0.14-5.fc36.x86_64 opencsd-1.2.0-1.fc36.x86_64 openldap-2.6.3-1.fc36.x86_64 openssl-libs-3.0.5-2.fc36.x86_64 slang-2.3.2-11.fc36.x86_64 xz-libs-5.2.5-9.fc36.x86_64 zlib-1.2.11-33.fc36.x86_64
  (gdb) bt
  #0  0x000000000055219d in find_evsel (evlist=0xfda7b0, event_name=0x0) at util/sort.c:2830
  #1  0x0000000000552416 in add_dynamic_entry (evlist=0xfda7b0, tok=0xffb6eb "trace", level=2) at util/sort.c:2976
  #2  0x0000000000552d26 in sort_dimension__add (list=0xf93e00 <perf_hpp_list>, tok=0xffb6eb "trace", evlist=0xfda7b0, level=2) at util/sort.c:3193
  #3  0x0000000000552e1c in setup_sort_list (list=0xf93e00 <perf_hpp_list>, str=0xffb6eb "trace", evlist=0xfda7b0) at util/sort.c:3227
  #4  0x00000000005532fa in __setup_sorting (evlist=0xfda7b0) at util/sort.c:3381
  #5  0x0000000000553cdc in setup_sorting (evlist=0xfda7b0) at util/sort.c:3608
  #6  0x000000000042eb9f in cmd_report (argc=0, argv=0x7fffffffe470) at builtin-report.c:1596
  #7  0x00000000004aee7e in run_builtin (p=0xf64ca0 <commands+288>, argc=3, argv=0x7fffffffe470) at perf.c:330
  #8  0x00000000004af0f2 in handle_internal_command (argc=3, argv=0x7fffffffe470) at perf.c:384
  #9  0x00000000004af241 in run_argv (argcp=0x7fffffffe29c, argv=0x7fffffffe290) at perf.c:428
  #10 0x00000000004af5fc in main (argc=3, argv=0x7fffffffe470) at perf.c:562
  (gdb)

So check if we have tracepoint events in add_dynamic_entry() and bail
out instead:

  # perf report --stdio -f
  This perf binary isn't linked with libtraceevent, can't process probe_perf:lzma_decompress_to_file
  Error:
  Unknown --sort key: `trace'
  #

Fixes: 378ef0f ("perf build: Use libtraceevent from the system")
Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/Y7MDb7kRaHZB6APC@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants