Hi, I’m working on a similar tool and can’t figure out what makes ripgrep so fast. Actually, on a single file, ripgrep is almost as fast at parsing as a simple naive program that only reads the same file with a BufReader. I’d like to understand what mechanisms make ripgrep so fast at parsing single independent files. I believe the way it reads files and buffers data before parsing them is very important, but I couldn’t figure out where to look in its source code. Any hints? I naively believed that buffered reading was already the best it could do, but obviously it does something more. Thanks!

Edit: thanks for the answers and the resources! I’m looking at them :)


Comments

Shnatsel • 145 points • 2022-02-12

The author has written an article about that: https://blog.burntsushi.net/ripgrep/

Memory mapping can help avoid the BufRead overhead, and it is discussed in the article. But IIRC there are more manual ways to read faster without memory-mapping, too. I’ll see if I can find them.

https://github.com/uutils/coreutils also makes several programs faster than their GNU equivalents, and might be more approachable than the source code of ripgrep. IIRC their head and cut implementations are faster than the GNU ones.

almandin_jv • 20 points • 2022-02-12

Thanks, I’ve read that already. It covers the higher-level aspects (what other languages do, what the regex engine has to do to improve performance), but it doesn’t go into finer detail (which is fine in the context of the article).

TheRealMasonMac • 24 points • 2022-02-12

scc took a lot of inspiration from ripgrep, and this post goes into a little more detail about what it does. Maybe it’ll help?

https://boyter.org/posts/sloc-cloc-code/

Shnatsel • 17 points • 2022-02-12

After a bit of digging I’ve found a tip on fast reading from stdin from the author of ripgrep:

https://users.rust-lang.org/t/fast-i-o-in-rust/61714/4
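
From memory, the gist of that tip is to lock stdin once and read big chunks, rather than going line by line through the unlocked handle. Roughly (a sketch, not the exact code from that thread):

    use std::io::{self, Read};

    fn main() -> io::Result<()> {
        // Lock stdin once up front; the unlocked handle re-locks on
        // every read call.
        let mut stdin = io::stdin().lock();
        let mut buf = vec![0u8; 64 * 1024];
        let mut total = 0usize;
        loop {
            let n = stdin.read(&mut buf)?;
            if n == 0 {
                break;
            }
            total += n; // process buf[..n] here
        }
        eprintln!("read {total} bytes");
        Ok(())
    }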

matu3ba • 1 point • 2022-02-13

Path traversal is potentially unoptimized, i.e. one can go faster by not checking for conditions such as other processes interfering (e.g. moving directories around).

The performance of POSIX/Linux nftw, Rust’s walkdir, and find is comparable, which is also what burntsushi reports. One option for Rust would be what Zig’s libstd does: offer higher performance with fewer conveniences/checks (e.g. no sorting): https://github.com/romkatv/gitstatus/blob/master/docs/listdir.md

burntsushi • 21 points • 2022-02-13

I started work on this years ago to bring the technique to walkdir, but burnt out on it. I got far enough to see that the gains weren’t as huge as reported in that link, so it’s possible I was doing something wrong.

But path traversal is definitely not unoptimized. Firstly, it is parallel in ripgrep. Secondly, a big part of it is gitignore matching, and that is heavily optimized as well. :)

burntsushi • 45 points • 2022-02-13

If you’re constraining yourself to the case of searching a single file, then we can leave out discussion of directory traversal and gitignores. So that makes it easier.

In terms of reading the file, there are two strategies. The uninteresting one is “just use memory maps.” These work well in a single file use case and tend to be faster than other approaches on Linux at least, but can get slower in some cases (when you’re doing a lot of memory map operations) and otherwise simply don’t work in every case (like for streams or virtual files). So you need a fallback anyway.

That fallback is the line buffer: https://github.com/BurntSushi/ripgrep/blob/0b36942f680bfa9ae88a564f2636aa8286470073/crates/searcher/src/line_buffer.rs

It’s not terribly complicated. The only thing it does (other than binary data detection) is to keep track of lines split across buffers. In other words, it provides an API to access a chunk of contiguous lines in a file without ever offering an incomplete line.
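
A minimal sketch of just that idea (the real line buffer linked above also does binary detection and more careful buffer management):

    use memchr::memrchr;
    use std::io::{self, Read};

    struct LineChunks<R> {
        rdr: R,
        buf: Vec<u8>,
        pending: usize, // bytes of the incomplete last line kept in `buf`
    }

    impl<R: Read> LineChunks<R> {
        /// Returns the next chunk of complete lines, or None at EOF.
        fn next_chunk(&mut self) -> io::Result<Option<&[u8]>> {
            // Roll last call's incomplete tail to the front.
            let tail_start = self.buf.len() - self.pending;
            self.buf.copy_within(tail_start.., 0);
            self.buf.truncate(self.pending);
            loop {
                // Refill after the pending bytes, growing as needed.
                let start = self.buf.len();
                self.buf.resize(start + 64 * 1024, 0);
                let n = self.rdr.read(&mut self.buf[start..])?;
                self.buf.truncate(start + n);
                if n == 0 {
                    // EOF: whatever is left is a final, unterminated line.
                    self.pending = 0;
                    return Ok(if self.buf.is_empty() { None } else { Some(&self.buf) });
                }
                if let Some(i) = memrchr(b'\n', &self.buf) {
                    // Hand out everything up to and including the last
                    // newline; keep the rest for the next call.
                    self.pending = self.buf.len() - (i + 1);
                    return Ok(Some(&self.buf[..=i]));
                }
                // No newline yet: loop around and read more.
            }
        }
    }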

Other than that though, the “file reading” strategies are mostly about getting out of the way and amortizing costs. The real secret sauce at this point is SIMD. That’s going to yield the biggest difference between a naive “compare one byte at a time” loop and a vectorized memchr routine. That’s why my article talks about the SIMD stuff. It is truly the difference maker in straight-line search.

almandin_jv • 4 points • 2022-02-13

Thanks for your answer! I saw that stuff in your article, and I understand that the regex library already has these SIMD implementations, am I wrong? However, I always feel that rg is faster at matching patterns over files than any program I can write that just reads the file without doing anything!

    let mmap = unsafe { Mmap::map(&file) }.unwrap();
    let split = mmap.split(|c| *c == b'\n');
    split.for_each(|_| {});

is still 3 times slower than rg matching a simple pattern over a 1 GB file. Do I understand correctly that it actually searches over the file without caring about newlines, and when it finds an “area of interest”, extracts the matching line once to do some regex stuff on it?

burntsushi • 14 points • 2022-02-13

Well that code appears to be doing a byte-at-a-time loop, which is exactly what is slow compared to SIMD. Try modifying your code to split lines using memchr.
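
Something like this, with the memchr crate (untested sketch):

    use memchr::memchr_iter;

    fn for_each_line(haystack: &[u8], mut f: impl FnMut(&[u8])) {
        let mut start = 0;
        // memchr_iter finds every '\n' with a vectorized scan instead
        // of testing one byte at a time.
        for nl in memchr_iter(b'\n', haystack) {
            f(&haystack[start..nl]);
            start = nl + 1;
        }
        if start < haystack.len() {
            f(&haystack[start..]); // trailing line with no '\n'
        }
    }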

But yes, you’re right, there is another trick here. ripgrep does not typically iterate over every line in a file. It just searches groups of lines as one big chunk. If a match is found, it is only then that the corresponding line is found. This optimization rests on the assumption that matches tend to be rare. It is overall slower (but only a little bit) if there is actually a match on every line.

So yes, it seems like your understanding is correct. But as I mentioned, that is not the only difference from your code. Your code (likely) doesn’t use SIMD.
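
In sketch form, with a naive substring scan standing in for the real vectorized searcher:

    use memchr::{memchr, memrchr};

    /// Search the whole chunk first; compute line boundaries only
    /// once a match is actually found.
    fn find_matching_line<'a>(chunk: &'a [u8], pattern: &[u8]) -> Option<&'a [u8]> {
        let pos = chunk
            .windows(pattern.len())
            .position(|w| w == pattern)?; // stand-in for the real searcher
        let start = memrchr(b'\n', &chunk[..pos]).map_or(0, |i| i + 1);
        let end = memchr(b'\n', &chunk[pos..]).map_or(chunk.len(), |i| pos + i);
        Some(&chunk[start..end])
    }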

duckerude • 71 points • 2022-02-12

A BufReader is not automatically the fastest way to read. It’s important to avoid allocations and to avoid copies.

If you call .lines() on a BufReader then every line you read will be allocated into a fresh String. That’s convenient, but inefficient. An improvement could be to use a single String and repeatedly .clear() it and .read_line() into it.
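
For example (sketch; "big.txt" is a placeholder path):

    use std::fs::File;
    use std::io::{self, BufRead, BufReader};

    fn main() -> io::Result<()> {
        let mut reader = BufReader::new(File::open("big.txt")?);
        let mut line = String::new();
        // One allocation reused for every line, instead of a fresh
        // String per line as with .lines().
        while reader.read_line(&mut line)? != 0 {
            // process `line` here; it keeps its trailing '\n'
            line.clear();
        }
        Ok(())
    }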

But even if you do that you’re still copying the data. First the BufReader reads it into its buffer, and then it copies it from the buffer into the String. Sometimes it’s possible to avoid that.

It may or may not be doable for your use case, but consider creating your own buffer (a Vec<u8>, bigger can be faster), doing simple Read::read()s into that, and processing the data immediately. Then it doesn’t have to be copied around as much. You can see something like that in jsonxf, for example. The big downside is that you have to be able to process incomplete data, because the read might end right in the middle of something. (I usually don’t go this far.)
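
A sketch of that shape, here just counting newlines (which conveniently doesn't care where a read happens to end):

    use std::fs::File;
    use std::io::{self, Read};

    fn count_newlines(path: &str) -> io::Result<usize> {
        let mut file = File::open(path)?;
        let mut buf = vec![0u8; 1 << 20]; // one big reusable buffer
        let mut count = 0;
        loop {
            // Plain read() straight into our buffer: no BufReader,
            // no per-line copies or allocations.
            let n = file.read(&mut buf)?;
            if n == 0 {
                break;
            }
            count += buf[..n].iter().filter(|&&b| b == b'\n').count();
        }
        Ok(count)
    }

Anything line-oriented would additionally have to stitch together records that straddle two reads, which is the downside mentioned above.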

…or just use mmap. It doesn’t work on everything, and you have to promise the file won’t change while you read, but it’s fast and convenient.

Freeky • 19 points • 2022-02-13

The bstr crate includes a BufRead extension trait which adds functions for efficient line iteration, lending out slices of the internal buffer wherever possible, and using a growable auxiliary buffer only for trailing lines and those which don’t fit.
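
Usage looks roughly like this (sketch; "big.txt" is a placeholder path):

    use bstr::io::BufReadExt;
    use std::fs::File;
    use std::io::{self, BufReader};

    fn main() -> io::Result<()> {
        let mut reader = BufReader::new(File::open("big.txt")?);
        let mut count = 0u64;
        // Lines arrive as &[u8] slices, with no per-line String
        // allocation.
        reader.for_byte_line(|_line| {
            count += 1;
            Ok(true) // return Ok(false) to stop early
        })?;
        println!("{count} lines");
        Ok(())
    }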

Before writing those I made linereader, which only supports a fixed buffer - this does have the disadvantage you mention of potentially having partial lines to deal with, but that also comes with the benefit of predictable memory use, so it depends on what your requirements are.

SorteKanin • 1 point • 2022-02-13

Why isn’t .lines() implemented so that it uses a single string? Is that not possible?

duckerude • 14 points • 2022-02-13

It’s not possible for an Iterator to yield items that only live until the next call to .next(). See this explanation.

You can have an API that only uses a single String under the hood, but it can’t be an Iterator so you can’t use it in a for loop or call the iterator methods.
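
A sketch of what such an API looks like:

    use std::io::{self, BufRead};

    struct LineReader<R> {
        rdr: R,
        buf: String,
    }

    impl<R: BufRead> LineReader<R> {
        // Note the signature: the returned &str borrows `self`, which
        // is exactly what Iterator::next can't express today.
        fn next_line(&mut self) -> io::Result<Option<&str>> {
            self.buf.clear();
            if self.rdr.read_line(&mut self.buf)? == 0 {
                Ok(None)
            } else {
                Ok(Some(&self.buf))
            }
        }
    }

You can drive it with `while let Some(line) = reader.next_line()? { ... }`, just not with a for loop or the iterator adapters.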

SorteKanin • 9 points • 2022-02-13

So if I understand correctly, it’s not impossible but would require GATs and a redefinition of the iterator trait?

matthieum • 17 points • 2022-02-13

Indeed.

It is therefore more likely that a new trait would appear once GATs are stable; possibly a StreamingIterator or LendingIterator.
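
With GATs, the core of such a trait would look something like this (a sketch; the names are hypothetical):

    trait LendingIterator {
        // The item type may borrow from the iterator itself.
        type Item<'a>
        where
            Self: 'a;

        fn next(&mut self) -> Option<Self::Item<'_>>;
    }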

SorteKanin • 5 points • 2022-02-13

But this trait would be strictly better than Iterator, right? There would never be a need for Iterator if such a trait existed, right? It would be nice if an edition could change the trait directly.

matthieum • 18 points • 2022-02-13

Not necessarily better, no.

Specifically, the problem of a lending iterator is that the iterator itself is borrowed as long as you use the item you got. This makes it impossible to get two items at a time, like for chunks() or windows(), which some algorithms require.
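
For contrast, this is fine with today's Iterator precisely because items don't borrow the iterator:

    fn main() {
        let mut it = [1, 2, 3].into_iter();
        // Holding two items at once: fine, they are owned values.
        let a = it.next();
        let b = it.next();
        println!("{:?} {:?}", a, b);
        // With a lending iterator, `a` would still borrow `it` at the
        // second call, so this pattern (and windows()/chunks() built
        // on it) wouldn't compile.
    }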

Hence there’s a trade-off. You gain some capabilities, but lose some others in return.

hniksic • 4 points • 2022-02-13

Specifically, the problem of a lending iterator is that the iterator itself is borrowed as long as you use the item you got.

That’s true, but it is my understanding (intuition) that with GATs it should be possible to write a trait where the implementor chooses whether they give away ownership of items returned by next(), or they “lend” them, depending on how they define the Item the iterator returns.

I don’t know if that is correct, but I suspect that that’s what the GP means by the hypothetical trait being “strictly better” than Iterator.
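
If that intuition is right, the choice would look roughly like this (sketch, repeating the hypothetical trait from upthread):

    trait LendingIterator {
        type Item<'a>
        where
            Self: 'a;
        fn next(&mut self) -> Option<Self::Item<'_>>;
    }

    // Owning impl: Item ignores the lifetime, so callers may keep
    // items around independently of the iterator.
    struct Counter(u32);
    impl LendingIterator for Counter {
        type Item<'a> = u32 where Self: 'a;
        fn next(&mut self) -> Option<u32> {
            self.0 += 1;
            Some(self.0)
        }
    }

    // Lending impl: Item borrows the iterator's internal state.
    struct Repeat(String);
    impl LendingIterator for Repeat {
        type Item<'a> = &'a str where Self: 'a;
        fn next(&mut self) -> Option<&str> {
            Some(&self.0)
        }
    }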

matthieum • 5 points • 2022-02-13

That’s true, but it is my understanding (intuition) that with GATs it should be possible to write a trait where the implementor chooses whether they give away ownership of items returned by next(), or they “lend” them, depending on how they define the Item the iterator returns.

This is my understanding too, but it doesn’t help in generic contexts.

That is, if you have a method with a signature of:

    fn foo<I: LendingIterator>(mut iterator: I) {
        let e: I::Item<'_> = iterator.next(); // e borrows iterator
    }

Then within the method you don’t know the concrete implementation, and must assume that the iterator is borrowed.

Unless maybe with a for<'a> Item<'a>: 'static bound?

nightcracker • 3 points • 2022-02-13

There’s a workaround not listed on that page that works in current day Rust, and that’s to store the state of the iterator in an external RefCell and return a Ref to it from the iterator. It does incur some runtime overhead (that could likely be eliminated during loop unrolling, as it sets/unsets a flag then immediately checks it), and a runtime failure condition if someone did in fact keep the Ref around in the next iteration, but otherwise works.
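
A sketch of that workaround: the buffer lives in an external RefCell, so the yielded Refs are tied to the cell rather than to a borrow of the iterator:

    use std::cell::{Ref, RefCell};

    struct Lines<'a> {
        buf: &'a RefCell<String>,
        n: u32,
    }

    impl<'a> Iterator for Lines<'a> {
        type Item = Ref<'a, String>;

        fn next(&mut self) -> Option<Self::Item> {
            if self.n == 3 {
                return None;
            }
            self.n += 1;
            // Panics at runtime if the caller kept the previous Ref
            // alive across iterations.
            self.buf.borrow_mut().replace_range(.., &format!("line {}", self.n));
            Some(self.buf.borrow())
        }
    }

    fn main() {
        let cell = RefCell::new(String::new());
        for line in (Lines { buf: &cell, n: 0 }) {
            println!("{}", *line);
        }
    }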

the_gnarts • 20 points • 2022-02-12

Performance is discussed in detail in this podcast interview with the author of ripgrep.

TheRealMasonMac • 14 points • 2022-02-12

https://blog.burntsushi.net/ripgrep/

Earthling1980 • 17 points • 2022-02-12

For anybody not already aware of this tool (as I wasn’t):

https://github.com/BurntSushi/ripgrep

ripgrep is a line-oriented search tool that recursively searches the current directory for a regex pattern. By default, ripgrep will respect gitignore rules and automatically skip hidden files/directories and binary files. ripgrep has first class support on Windows, macOS and Linux, with binary downloads available for every release. ripgrep is similar to other popular search tools like The Silver Searcher, ack and grep.

Nickitolas • 7 points • 2022-02-13

I looked around the code a bit, hope these links to files/lines help

main.rs, specifically search: https://github.com/BurntSushi/ripgrep/blob/master/crates/core/main.rs#L76-L122

search function: https://github.com/BurntSushi/ripgrep/blob/df83b8b44426b3f2179abe632eb183e8c8270524/crates/core/search.rs#L327-L346

Following that I wind up in searcher mod.rs, which does the File::open: https://github.com/BurntSushi/ripgrep/blob/master/crates/searcher/src/searcher/mod.rs#L615-L629

If you look at `search_file_maybe_path` in that file, it either mmaps the file to a slice if configured, uses `fill_multi_line_buffer_from_file`, or falls back to a generic Read.

Then `fill_multi_line_buffer_from_file` does a `read_to_end` after reserving capacity: https://github.com/BurntSushi/ripgrep/blob/master/crates/searcher/src/searcher/mod.rs#L889-L921
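
I.e. roughly this pattern (a sketch, not the actual code):

    use std::fs::File;
    use std::io::{self, Read};

    fn read_whole_file(path: &str) -> io::Result<Vec<u8>> {
        let mut file = File::open(path)?;
        // Reserve based on file size up front, so read_to_end doesn't
        // have to regrow the Vec repeatedly.
        let len = file.metadata()?.len() as usize;
        let mut buf = Vec::with_capacity(len + 1);
        file.read_to_end(&mut buf)?;
        Ok(buf)
    }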

delta1-tari • 2 points • 2022-02-13

Great post, thanks!

superander • 0 points • 2022-02-13

RIP Grep? Meaning the death of the good ol’ grep tool?

Kbknapp • 19 points • 2022-02-13

No, the author has said publicly that “rip” is meant as in “rip through data quickly”: a grep that rips through data.

lovestruckluna • 7 points • 2022-02-13

Funny story, that: per its author, the name wasn’t actually referring to death, but to the tool ‘ripping’ through files.

Doesn’t stop the rest of us from making that connection 😂

burntsushi • 7 points • 2022-02-13

https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#intentcountsforsomething

U007D • 3 points • 2022-02-13

Thanks. This question also popped into my head for the first time today. :)