Tracking Issue for os_str_slice
#118485
Comments
This was discussed somewhat in the ACP (rust-lang/libs-team#306). Copying over the parts I find relevant :) Starting with "lowest common denominator invariants" can always be relaxed later, as we'd be switching cases from asserting to not-asserting. The other |
Looking over #118484, this looks like it'll have a lot of overhead to enforce the invariants when building higher level operations on top that would go away with native support for those operations. This is mostly an observation as I'm not sure what else we can do for now while pattern API support is at a stand-still. |
There's a fast path for splitting on ASCII, which you'll always take when doing traditional options parsing. My hunch is that that's the common case in general, that even if part of the string is Unicode or raw bytes you'll typically be hunting for some bit of ASCII syntax.

It can be made a lot faster on Windows (see #118484 (comment)), and I want to try that if I can get a local Windows environment set up. (EDIT: the

If we relax the requirements for Unix then it just becomes a normal slicing operation there. That's one reason that feels attractive to me. Otherwise we have to keep doing something like the current validation.

The ASCII fast path can be made faster by skipping bounds checks and giving it its own function. Here's what I get by mucking around in compiler explorer:

But I don't know how far it's worth going, and if we choose relaxed checks then we can just get rid of it.
|
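To make the fast path discussed above concrete, here is a minimal sketch of a boundary check in stable Rust. The function name `is_boundary` is hypothetical and this is not the standard library's code; it assumes the rule that a split point is valid at the start or end of the string, or immediately before or after a valid non-empty UTF-8 substring.

```rust
/// Hypothetical sketch (not the std implementation): is `index` an
/// acceptable split point in the encoded bytes of an OS string?
fn is_boundary(bytes: &[u8], index: usize) -> bool {
    // The start and end of the string are always fine.
    if index == 0 || index == bytes.len() {
        return true;
    }
    if index > bytes.len() {
        return false;
    }
    // ASCII fast path: an ASCII byte is itself a valid UTF-8 substring,
    // so a cut directly before or after one is always allowed.
    if bytes[index - 1].is_ascii() || bytes[index].is_ascii() {
        return true;
    }
    // Slow path, part 1: does a valid UTF-8 character start at `index`?
    let after = &bytes[index..bytes.len().min(index + 4)];
    match std::str::from_utf8(after) {
        Ok(_) => return true,
        Err(e) if e.valid_up_to() > 0 => return true,
        Err(_) => {}
    }
    // Slow path, part 2: does a valid multi-byte UTF-8 character end at `index`?
    (2..=4usize.min(index)).any(|len| std::str::from_utf8(&bytes[index - len..index]).is_ok())
}
```

When hunting for ASCII syntax such as `=` or `:`, only the first three checks run, which is why the strict validation is cheap in the common case.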
With strict checks it would technically be optimal for user code to have a function like this:

    use std::ffi::OsStr;

    fn slice_os_str(s: &OsStr, start: usize, end: usize) -> &OsStr {
        #[cfg(all(target_vendor = "fortanix", target_env = "sgx"))]
        use std::os::fortanix_sgx::ffi::OsStrExt;
        #[cfg(target_os = "hermit")]
        use std::os::hermit::ffi::OsStrExt;
        #[cfg(target_os = "solid")]
        use std::os::solid::ffi::OsStrExt;
        #[cfg(unix)]
        use std::os::unix::ffi::OsStrExt;
        #[cfg(target_os = "wasi")]
        use std::os::wasi::ffi::OsStrExt;
        #[cfg(target_os = "xous")]
        use std::os::xous::ffi::OsStrExt;

        // Platforms where OsStr is arbitrary bytes: plain byte slicing, no validation.
        #[cfg(any(
            unix,
            target_os = "wasi",
            target_os = "hermit",
            all(target_vendor = "fortanix", target_env = "sgx"),
            target_os = "solid",
            target_os = "xous"
        ))]
        return OsStr::from_bytes(&s.as_bytes()[start..end]);

        // Everywhere else: fall back to the checked slicing method.
        #[cfg(not(any(
            unix,
            target_os = "wasi",
            target_os = "hermit",
            all(target_vendor = "fortanix", target_env = "sgx"),
            target_os = "solid",
            target_os = "xous"
        )))]
        return s.slice_encoded_bytes(start..end);
    }

And that's a little sad.
So all in all I'm back to preferring to skip the check on these platforms. |
Regarding performance, the question I asked myself is why we can't have a subset of the Pattern API that doesn't take |
My thought is that we should have such an API but that this method will be easier to get stabilized. It's a small and unopinionated MVP. |
That doesn't make them mutually exclusive. My point was that for more performance critical code, we can look to #109350 while we can have #109350 mirrors an API on one type into another, is related to an approved RFC, and trims down the biggest, most contentious part of that RFC. My hope is that it can be an API that is relatively quick to stabilize. It hasn't gotten much attention but I'm looking into that. |
With respect to the restrictions, I am generally inclined toward @blyxxyz's point of view here where we shouldn't need them on Unix because its representation is already set in stone. With that said, keeping uniform restrictions does make the behavior easier to explain and reason about. (I think y'all mentioned that.) But I think most importantly for me anyway, if we start with uniform restrictions, we can always relax them later when we've got some solid use cases motivating us to do that. For std, I am generally inclined to the conservative posture because of our unique stability constraints. |
Add substring API for `OsStr`

This adds a method for taking a substring of an `OsStr`, which in combination with [`OsStr::as_encoded_bytes()`](https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.as_encoded_bytes) makes it possible to implement most string operations in safe code.

API:

```rust
impl OsStr {
    pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self;
}
```

Motivation, examples and research at rust-lang/libs-team#306.

Tracking issue: rust-lang#118485

cc `@epage` r? libs-api
Rollup merge of rust-lang#118569 - blyxxyz:platform-os-str-slice, r=Mark-Simulacrum

Move `OsStr::slice_encoded_bytes` validation to platform modules

This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms.

The existing implementation is still used on Unix and most other platforms and is now optimized a little better.

Tracking issue: rust-lang#118485

cc `@epage`, `@BurntSushi`
Was poking around One thing that feels obvious as an alternative here: why is there no For a project of mine, I'm looking to try and use This could maybe also be proposed as an alternative to this function, to allow for better safe code that relies on this method. It's effectively the same as processing any other UTF-8 string as bytes and then recombining the pieces into valid strings, except it's a superset of UTF-8 instead. So, it feels like this would be a reasonable alternative. |
I covered
It might still be reasonable but |
I mean, that's extremely fair. I hadn't fully read the ACP before commenting, so, I probably should have done that. I guess that my main aversion here is that this API essentially relies on doing something that Rust really dislikes, which is keeping track of a bunch of indices into a string instead of just splitting the string itself. Although I suppose that you could just slice twice to split the string and keep doing that as you go, although it would be awkward. Will have to mess around and see if the code is alright. |
It is a little clunky but I think not fatally so.

Tracking indices:

    let path = OsStr::new("foo:bar:baz");
    let mut parts = Vec::new();
    let mut idx = 0;
    for part in path.as_encoded_bytes().split(|&b| b == b':') {
        parts.push(path.slice_encoded_bytes(idx..idx + part.len()));
        idx += part.len() + 1;
    }

Slicing as you go:

    let mut rest = path;
    while let Some(idx) = rest.as_encoded_bytes().iter().position(|&b| b == b':') {
        parts.push(rest.slice_encoded_bytes(..idx));
        rest = rest.slice_encoded_bytes(idx + 1..);
    }
    parts.push(rest);

If you need these operations often you can write a helper like

I'm not sure about the balance between using this directly and using it as a safe primitive for a fully-featured helper. I plan to experiment on uutils as a case study. An API along the lines of |
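As an illustration of what a reusable helper built on this pattern might look like, here is a hypothetical `split_once_ascii` function (the name and signature are assumptions, not anything proposed in this thread). So that it compiles on stable Rust, this sketch uses the unsafe `OsStr::from_encoded_bytes_unchecked` as a stand-in; with `#![feature(os_str_slice)]` the two halves could presumably be produced with `slice_encoded_bytes(..idx)` and `slice_encoded_bytes(idx + 1..)` with no unsafe code.

```rust
use std::ffi::OsStr;

/// Hypothetical helper (not part of std): split `s` at the first
/// occurrence of the ASCII byte `delim`, dropping the delimiter.
fn split_once_ascii(s: &OsStr, delim: u8) -> Option<(&OsStr, &OsStr)> {
    assert!(delim.is_ascii(), "delimiter must be ASCII");
    let bytes = s.as_encoded_bytes();
    let idx = bytes.iter().position(|&b| b == delim)?;
    let (before, after) = (&bytes[..idx], &bytes[idx + 1..]);
    // SAFETY: both halves come from `as_encoded_bytes` and are cut
    // directly next to an ASCII byte, which is a valid non-empty UTF-8
    // substring, so both cuts are on permitted boundaries.
    unsafe {
        Some((
            OsStr::from_encoded_bytes_unchecked(before),
            OsStr::from_encoded_bytes_unchecked(after),
        ))
    }
}
```

A caller parsing `key=value` style arguments would call `split_once_ascii(arg, b'=')` and handle the `None` case for arguments without a delimiter.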
Not sure if this covers all your use cases, but are you aware of |
Wasn't aware, but those only support That said, the example @blyxxyz gave does look like it works well here. |
Feature gate: `#![feature(os_str_slice)]`

This is a tracking issue for an API for taking substrings of `OsStr`, which in combination with `OsStr::as_encoded_bytes()` would make it possible to implement most string operations in (portable) safe code.

Public API

Steps / History

`OsStr` libs-team#306
`OsStr` #118484, Move `OsStr::slice_encoded_bytes` validation to platform modules #118569

Unresolved Questions

`OsStr` is already fully specified to be arbitrary bytes by means of the `OsStrExt` trait. Should we:

Footnotes

https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html