Possible race-condition when first using the WARC writers? #167

Closed
anjackson opened this issue Aug 12, 2016 · 3 comments


@anjackson
Collaborator

Most times I start a crawl using a recent Heritrix build, an exception is thrown when the system first tries to write to a WARC file, e.g.:

2016-08-12T09:06:35.096Z    -5         54 dns:madebybridge.com RLP http://madebybridge.com/ text/dns #030 20160812090634493+63 sha1:QT62S24FK6C32Z7PBBYEDY3OBTNBZFMI - err=java.lang.NullPointerException
 java.lang.NullPointerException
        at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:275)
        at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
 java.lang.NullPointerException
        at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:275)
        at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
2016-08-12T09:06:35.098Z    -5         52 dns:wiki.bl.uk P https://wiki.bl.uk:8443/ text/dns #002 20160812090633536+66 sha1:KHNVONKMAARJYO2OZHRC2TPPZZM6A7AG - err=java.lang.NullPointerException
 java.lang.NullPointerException
        at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:275)
        at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
 java.lang.NullPointerException
        at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:275)
        at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

Any ideas what's going awry here? It looks like a race condition when setting up the WARC writers.

@nlevitt
Contributor

nlevitt commented Aug 15, 2016

Thanks for reporting @anjackson. Here is a fix: #168
Please merge if it works for you!

The issue doesn't happen if you have poolMaxActive=1, which I think is why I hadn't seen it. I'm not sure why it didn't show up long before now, though (not that I really investigated).

I was seeing the stack trace below. The problem was multiple threads trying to create the "warcs" directory at the same time. The fix was to synchronize around the directory creation.

java.io.IOException: Failed to create directory: /1/ait-h3-jobs/gh-167/20160815201316/warcs
    at org.archive.util.FileUtils.ensureWriteableDirectory(FileUtils.java:677)
    at org.archive.modules.writer.WriterPoolProcessor.calcOutputDirs(WriterPoolProcessor.java:457)
    at org.archive.io.WriterPoolMember.createFile(WriterPoolMember.java:203)
    at org.archive.io.WriterPoolMember.checkSize(WriterPoolMember.java:180)
    at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:219)
    at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
2016-08-15 20:13:18.793 SEVERE thread-56 org.archive.crawler.framework.ToeThread.recoverableProblem() Problem java.lang.NullPointerException occured when trying to process 'dns:archive.org' at step ABOUT_TO_BEGIN_PROCESSOR in 

java.lang.NullPointerException
    at org.archive.io.warc.WARCWriter.getFilenameWithoutOccupiedSuffix(WARCWriter.java:274)
    at org.archive.modules.writer.WARCWriterProcessor.updateMetadataAfterWrite(WARCWriterProcessor.java:289)
    at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:259)
    at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:195)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
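
For anyone reading later, here is a rough sketch of the kind of fix described above. It is not the code from #168; it assumes the directory helper follows a check-then-create pattern, where two threads can both see the directory as missing and the slower one gets false back from mkdirs(). The class, method, and lock names are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names come from Heritrix.
class DirCreationSketch {
    // Single lock shared by every thread that may create output directories.
    private static final Object DIR_LOCK = new Object();

    // Assumed check-then-create pattern. Without a lock, two threads can both
    // pass the exists() check; mkdirs() then returns false for the loser.
    static void ensureDirUnsafe(File dir) throws IOException {
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IOException("Failed to create directory: " + dir);
        }
    }

    // Same logic, but only one thread at a time runs the check and the create.
    static void ensureDirSynchronized(File dir) throws IOException {
        synchronized (DIR_LOCK) {
            ensureDirUnsafe(dir);
        }
    }

    // Starts `threads` workers that all try to create `dir` at the same moment
    // and returns how many of them hit an IOException.
    static int countFailures(File dir, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        List<Future<Boolean>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            results.add(pool.submit(() -> {
                start.await(); // line every worker up behind the same gate
                try {
                    ensureDirSynchronized(dir);
                    return true;
                } catch (IOException e) {
                    return false;
                }
            }));
        }
        start.countDown();
        int failures = 0;
        for (Future<Boolean> result : results) {
            if (!result.get()) {
                failures++;
            }
        }
        pool.shutdown();
        return failures;
    }
}
```

With the lock in place, every worker after the first finds the directory already there and skips mkdirs(), so countFailures should return 0 however many threads are started.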

@anjackson
Collaborator Author

anjackson commented Aug 17, 2016

Thanks @nlevitt - I'll test it ASAP but it might be a little while yet.

EDIT: To anyone else hitting this, a simple workaround is to manually ensure the relevant directory already exists.
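
A minimal sketch of that workaround, in case you want to do it from Java before launching the job rather than by hand. The class name and the idea of running it as a pre-launch step are assumptions; the only real API used is java.nio.file.Files.createDirectories, which creates missing parent directories and does not fail if the directory already exists.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Heritrix.
class PrecreateWarcDir {
    // Creates the directory (and any missing parents) if needed.
    // Safe to call repeatedly: an existing directory is not an error.
    static Path ensureExists(Path dir) throws IOException {
        return Files.createDirectories(dir);
    }

    public static void main(String[] args) throws IOException {
        // Pass the job's WARC output directory as the first argument.
        Path dir = Path.of(args[0]);
        System.out.println("Ready: " + ensureExists(dir));
    }
}
```

Once the directory exists before the first write, the writer threads no longer need to create it, so they should not hit the contended creation path.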

nlevitt added a commit that referenced this issue Jun 8, 2017
fix for race-condition when first using the WARC writers https://gith…
@nlevitt
Contributor

nlevitt commented Jun 8, 2017

Merged anyhow

@nlevitt nlevitt closed this as completed Jun 8, 2017
nlevitt added a commit that referenced this issue Jun 16, 2017
* master:
  ughhh appease java 8 javadoc rules
  fix for race-condition when first using the WARC writers #167