Improve sources memory consumption #279

vzamanillo · 2020-07-24T18:34:31Z

While profiling memory with pprof, I discovered that some sources, e.g. waybackarchive, increase subfinder's memory footprint excessively.

This is because the responses are very large and we read them with ioutil.ReadAll(pagesResp.Body), which loads the whole body into memory at once.

After some changes to read the response stream using bufio.NewReader(pagesResp.Body) the memory consumption is drastically reduced.
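A minimal sketch of the streaming approach, not the actual subfinder change: the body is consumed line by line through bufio.NewReader, so only the current line is held in memory. The function names, the regular expression and the sample data are assumptions made for illustration; the real subdomainExtractor may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// sample stands in for a large response body (hypothetical data).
const sample = "http://api.example.com/login\n" +
	"https://WWW.example.com/\n" +
	"http://api.example.com/v2\n" +
	"http://example.com/\n"

// extractSubdomains reads r line by line and returns the unique subdomains
// of domain in order of first appearance. Unlike ioutil.ReadAll, it never
// holds the whole body in memory.
func extractSubdomains(r io.Reader, domain string) ([]string, error) {
	// Hypothetical pattern, assumed for this sketch only.
	re, err := regexp.Compile(`(?i)[a-z0-9_.-]+\.` + regexp.QuoteMeta(domain))
	if err != nil {
		return nil, err
	}

	seen := make(map[string]struct{})
	var found []string

	reader := bufio.NewReader(r)
	for {
		line, readErr := reader.ReadString('\n')
		// Process whatever was read, even on EOF (last line without newline).
		for _, m := range re.FindAllString(line, -1) {
			m = strings.ToLower(m)
			if _, ok := seen[m]; !ok {
				seen[m] = struct{}{}
				found = append(found, m)
			}
		}
		if readErr == io.EOF {
			return found, nil
		}
		if readErr != nil {
			return found, readErr
		}
	}
}

// demoExtract runs the extractor against the in-memory sample.
func demoExtract() ([]string, error) {
	return extractSubdomains(strings.NewReader(sample), "example.com")
}

func main() {
	subs, err := demoExtract()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range subs {
		fmt.Println(s)
	}
}
```

In a real source, pagesResp.Body would be passed as the reader in place of the in-memory sample.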

It happens in other sources too, especially those that return JSON. Instead of using a decoder, they load the whole body into memory with ioutil.ReadAll(pagesResp.Body) and then run subdomainExtractor, a regexp, over it to match subdomains (e.g. threatminer, threatcrowd...).

It would be nice to avoid ioutil.ReadAll(pagesResp.Body) wherever possible and to review the rest of the sources so that they process the JSON responses properly.
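For JSON sources, one possible shape is sketched below under stated assumptions: the response is taken to be a JSON array of objects with a "hostname" field, which is a hypothetical layout and not the format of any particular source. json.NewDecoder reads from the stream and decodes one array element at a time, so neither the raw body nor the full decoded array has to sit in memory.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// record is a hypothetical response element; real sources have their own shapes.
type record struct {
	Hostname string `json:"hostname"`
}

// decodeHostnames streams a JSON array from r and collects the non-empty
// hostname of each element, decoding one element at a time.
func decodeHostnames(r io.Reader) ([]string, error) {
	dec := json.NewDecoder(r)

	// Consume the opening '[' of the array.
	tok, err := dec.Token()
	if err != nil {
		return nil, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		return nil, fmt.Errorf("expected JSON array, got %v", tok)
	}

	var hosts []string
	for dec.More() {
		var rec record
		if err := dec.Decode(&rec); err != nil {
			return hosts, err
		}
		if rec.Hostname != "" {
			hosts = append(hosts, rec.Hostname)
		}
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		return hosts, err
	}
	return hosts, nil
}

// demoDecode runs the decoder against a small in-memory sample.
func demoDecode() ([]string, error) {
	body := `[{"hostname":"a.example.com"},{"hostname":"b.example.com"},{"other":1}]`
	return decodeHostnames(strings.NewReader(body))
}

func main() {
	hosts, err := demoDecode()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, h := range hosts {
		fmt.Println(h)
	}
}
```

Decoding into a typed struct also removes the need to run a regexp over the raw text for these sources.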

We could do it after merging #278 or we could introduce them directly in that branch.

vzamanillo · 2020-07-26T09:56:01Z

First results after some rework. I have excluded github because it takes a long time to finish, but it increases consumption by only about 5MB, which then stays constant until the run finishes.

ehsandeep · 2020-07-26T10:14:24Z

Hey @vzamanillo, we didn't focus on memory profiling in the past because subfinder is not something we run all the time; mostly it's a one-time run before you start on your target. It is definitely one of the things to improve to make it more mature, though.

Apart from the memory consumption improvement, do you also notice an improvement in overall run time (as we can see in the above poc)? Is it a result of the linting work on your side, or a small gain from better memory management?

vzamanillo · 2020-07-26T16:32:55Z

Hi @bauthard, there is no significant improvement in overall run time. There is some in a few cases, but the difference is not important. In fact, for sources with large response data, such as commoncrawl or waybackarchive, it is a few milliseconds slower because the response content is iterated line by line instead of being loaded into memory and processed in one go.

These memory improvements are not in the branch of pull request #278. They are changes I have made on top of that branch, and I have them ready to merge once #278 lands. I don't think this is the time to add them to #278: it would increase the cost of the review, and the scope of these changes is different from what we are discussing there.

vzamanillo · 2020-07-27T11:56:52Z

Step-by-step guide to profiling CPU and memory in Go.

Add the profile package to the imports in main.go and start the profiler at the top of main:

import (
	"context"

	"github.com/pkg/profile"
	"github.com/projectdiscovery/gologger"
	"github.com/projectdiscovery/subfinder/pkg/runner"
)

func main() {
	// CPU profiling (the default mode).
	defer profile.Start().Stop()
	// For memory profiling, use this line instead of the one above:
	// defer profile.Start(profile.MemProfile).Stop()

	// ...
}

Run main.go:

# go run main.go -d uber.com -sources alienvault

After it finishes you will see a message like the following:

2020/07/27 13:46:24 profile: cpu profiling disabled, /tmp/profile978571390/cpu.pprof

Run pprof on the path printed in that message (the temporary directory name changes on every run) and inspect the results; it will open a new browser window:

go tool pprof -http=:8080 /tmp/profile093511175/cpu.pprof

freecodecamp pprof guide: https://www.freecodecamp.org/news/how-i-investigated-memory-leaks-in-go-using-pprof-on-a-large-codebase-4bec4325e192/
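If adding a third-party dependency is not desirable, a heap profile can also be written with the standard library. The sketch below uses runtime/pprof rather than github.com/pkg/profile, so it is an alternative to the steps above and not part of them; the function names and the mem.pprof file name are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile writes a heap profile of the current process to path.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Run a GC first so the profile reflects up-to-date allocation statistics.
	runtime.GC()
	return pprof.WriteHeapProfile(f)
}

// profileToTemp writes the profile into a new temporary directory and
// returns the file path and its size in bytes.
func profileToTemp() (string, int64, error) {
	dir, err := os.MkdirTemp("", "profile")
	if err != nil {
		return "", 0, err
	}
	path := filepath.Join(dir, "mem.pprof")
	if err := writeHeapProfile(path); err != nil {
		return "", 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return "", 0, err
	}
	return path, info.Size(), nil
}

func main() {
	path, size, err := profileToTemp()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("heap profile written to %s (%d bytes)\n", path, size)
}
```

The resulting file can be opened with go tool pprof in the same way as the cpu.pprof file above. In a real program the profile would be written after the workload has run, since that is when the allocations of interest exist.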

ehsandeep added the Priority: Medium, Type: Discussion, and Type: Enhancement labels Jul 27, 2020

vzamanillo mentioned this issue Jul 29, 2020

Sources memory improvement #281

Merged

ehsandeep closed this as completed in #281 Aug 10, 2020
