Request to add license test files for a different use case #2

goneall · 2017-09-15T17:08:48Z

The SPDX tools issue spdx/tools#109 (comment) submitted by @wking suggests adding license test files to test the correctness of the license XML files and the tool that generates the website to a separate repository. This seems like a natural fit.

@jmanbeck If you agree, I can create a pull request to update the README file and the top level directory.

wking · 2017-09-15T17:26:08Z

> If you agree, I can create a pull request to update the README file and the top level directory.

Oops, I'd gone ahead and filed #3 without waiting for a green light. I'll stub out the tar unpacking too so you can see what I have in mind, but feel free to reject both proposals if you don't think they're a good fit.

jmanbeck · 2017-09-20T20:21:48Z

I'm open to whatever you guys want to do. Are you looking for just plain text files with nothing but the license text?

wking · 2017-09-20T20:55:18Z

On Wed, Sep 20, 2017 at 01:21:48PM -0700, jmanbeck wrote:
> Are you looking for just plain text files with nothing but the license text?

See #4 for what I want for the 0BSD. One of those files (0BSD/license/good/canonical.txt) is just the license text. The other files show the standard header in a variety of contexts. For licenses with alt/optional components, license-id/license/good would contain additional files exercising the mutable parts. The idea is to have a collection of files which cover the cases we expect to find in the wild, so tools which successfully identify the license associated with the various corpus files can trust their identification/matching implementation.

goneall · 2017-09-21T16:07:29Z

I'll be using the files to test the license generator output (the output used for the website). These can be different test files than those used to test license scanners. Since I'll be automating the tests, the test file names will have to follow a specific naming convention so that I can match the license test files with the license and determine whether the test is supposed to produce a positive or negative match result.

jmanbeck · 2017-09-21T16:25:49Z

So these are two different things as far as I can see. I looked at what Trevor did and can agree with most of it and will comment separately.

Gary, you need just the license text and nothing else, correct? No standard header, etc. Those can be easily generated. I know I gave you a set. Is that what you are looking for? The program can be easily modified to generate those as well using the naming convention you want. If you'd like, it could be a separate program as well given its specialized use.

goneall · 2017-09-21T16:45:01Z

@jmanbeck Just the license text (no comments - just the straight text).

We could put them in a separate directory "license-matcher-testfiles" since they are a different use case than the license scanner testing.

@wking suggested creating separate subdirectories for each license id. If it isn't too much work, can you create a subdirectory for each license ID, named by license ID, with a single subdirectory "license" which contains a single subdirectory "good"? The license text could be named "original.txt".

To summarize:

```
license-matcher-testfiles/
  {license-id}/ ...
    license/
      good/
        original.txt
```
This follows the pattern we will be using for the automated testing: {license-id}/(license|header)/(good|bad)/{test-id}.txt
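As a rough illustration of how an automated test could consume this pattern, here is a minimal Python sketch. The names `CorpusCase` and `parse_test_path` are hypothetical and not part of any SPDX tool; the sketch assumes only the `{license-id}/(license|header)/(good|bad)/{test-id}.txt` layout described above.

```python
import re
from typing import NamedTuple, Optional


class CorpusCase(NamedTuple):
    """One test file, decoded from its relative path (hypothetical type)."""
    license_id: str
    kind: str           # "license" or "header"
    expect_match: bool  # True for "good", False for "bad"
    test_id: str


# Mirrors {license-id}/(license|header)/(good|bad)/{test-id}.txt
_PATTERN = re.compile(
    r"^(?P<license_id>[^/]+)/(?P<kind>license|header)/"
    r"(?P<verdict>good|bad)/(?P<test_id>[^/]+)\.txt$"
)


def parse_test_path(path: str) -> Optional[CorpusCase]:
    """Return the decoded case, or None if the path does not fit the pattern."""
    m = _PATTERN.match(path)
    if m is None:
        return None
    return CorpusCase(
        license_id=m["license_id"],
        kind=m["kind"],
        expect_match=(m["verdict"] == "good"),
        test_id=m["test_id"],
    )


if __name__ == "__main__":
    print(parse_test_path("0BSD/license/good/canonical.txt"))
```

Paths that do not fit the pattern (for example files with another extension) return `None`, so a test runner can skip them rather than fail.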

If the file naming ends up being too much work, just drop me a tarball of the files and I'll structure them into the directory and commit them.

Thanks

wking · 2017-09-21T16:56:13Z

On Thu, Sep 21, 2017 at 09:45:01AM -0700, goneall wrote:
> We could put them in a separate directory "license-matcher-testfiles" since they are a different use case than the license scanner testing.

It's a different use case, but I think it's a subset of the match-testing use case. For example, the 0BSD file you ask for is 0BSD/license/good/canonical.txt [1].

[1]: https://github.com/spdx/license-test-files/pull/4/files#diff-87c959bfe7349374963aaa364f6a8470R1

> @wking suggested creating separate subdirectories for each license id.

And the docs for that are in flight with #3.

> If it isn't too much work, can you create a subdirectory…

I've already started in on this with #4. If you like the direction, I'll continue through the rest of the tarball.

> … "original.txt".

I recommend ‘canonical.txt’, because what SPDX chooses for the canonical wording may not be the chronologically-first wording.

jmanbeck · 2017-09-21T16:58:16Z

Okay. The naming isn't an issue. We just need to agree on the structure. Generating it won't be that difficult and I don't think adding them by hand makes sense.

I don't have a solid opinion on this, to be honest. I more or less just volunteered to create some files. I think the first thing to do is to agree on the structure. Once we have that, we can agree on the number and type of test files.

@goneall and @wking if you two can agree on a structure, that would be great. I don't have anything to do with the tooling and license list, so I'm happy to do what works best.

@wking I have looked over what you did for 0BSD and like it. I do have a few comments/questions.

1. You included .sh and .c. Why? I'm not sure I see the need for both unless it's a comments thing (#), and we could add all sorts of other form factors including HTML, JS, etc. Who knows, maybe we should.
2. What am I missing in the with prefix and no prefix cases? They look the same to me.
3. I like the idea of the license name being part of the file name for easy visual detection. If the reports out of scanners include the directory, it probably is not necessary. If they don't, then it's more work to hand verify results, I think.

wking · 2017-09-21T17:18:45Z

On Thu, Sep 21, 2017 at 04:58:17PM +0000, jmanbeck wrote:
> Generating it won't be that difficult and I don't think adding them by hand makes sense.

Generating automatically is fine, but things like 0BSD/header/good/canonical-with-prefix.c and 0BSD/header/good/alternate-copyright.sh (which I just pushed to #4) don't need to be generated for each license. And regardless of how they're generated, we want to manually review all of them to ensure we are not coupling in possible generator bugs. Once we catch up to the current list, it should be the legal team submitting new content, not automated tools.

> 1. You included .sh and .c. Why? I'm not sure I see the need for both unless it's a comments thing (#)…

Yes, it's to test that your matcher correctly ignores comments. I'd also like some bad tests where we inject some of those in the license body (vs. using them as prefixes) to make sure the match fails in those cases (unless we have some matching guideline that suggests comment-ish characters be ignored, in which case the internal comment character files would go under good/).

> … and we could add all sorts of other form factors including HTML, JS, etc. Who knows, maybe we should.

I agree, although JavaScript uses C-style comments [1], so I don't think we need *.js tests. We probably do want *.xml tests using `<!-- … -->`, although I'm mostly worried about prefixes, and I don't usually see prefixes in XML comments.

> 2. What am I missing in the with prefix and no prefix cases? They look the same to me.

The prefix case has a leading ‘** ’:

    $ diff -u 0BSD/header/good/canonical*prefix.c
    --- 0BSD/header/good/canonical-no-prefix.c 2017-09-15 11:01:34.430392702 -0700
    +++ 0BSD/header/good/canonical-with-prefix.c 2017-09-21 10:05:28.725042963 -0700
    @@ -12,19 +12,19 @@
     */
     /*
    -Copyright (C) 2006 by Rob Landley <[email protected]>
    …
    +** Copyright (C) 2006 by Rob Landley <[email protected]>
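One way a matcher might treat the two files as equivalent is to drop a leading comment prefix from each line before comparing. The sketch below is an assumption for illustration only: the prefix pattern (runs of `*` or `#` at the start of a line) and the helper name `strip_line_prefixes` are hypothetical, and it does not implement the SPDX matching guidelines. The sample text is a placeholder, not the 0BSD header.

```python
import re

# Hypothetical prefix: optional indentation, a run of '*' or '#',
# then at most one following whitespace character.
_LINE_PREFIX = re.compile(r"^\s*(?:\*+|#+)\s?")


def strip_line_prefixes(text: str) -> str:
    """Remove a leading comment prefix from every line of `text`."""
    return "\n".join(_LINE_PREFIX.sub("", line) for line in text.splitlines())


if __name__ == "__main__":
    # Placeholder header text, used only to show the comparison.
    no_prefix = "Copyright (C) 2006 by Example Author\nPermission is granted."
    with_prefix = "** Copyright (C) 2006 by Example Author\n** Permission is granted."
    print(strip_line_prefixes(with_prefix) == strip_line_prefixes(no_prefix))
```

Because the pattern is anchored at the start of each line, comment-like characters inside the license body are left alone, which keeps the "injected into the body" cases distinguishable from the prefix cases.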

> 3. I like the idea of the license name being part of the file name for easy visual detection. If the reports out of scanners include the directory, it probably is not necessary. If they don't, then it's more work to hand verify results, I think.

Are you saying that tools may only print the filename, so we should use a flat-file approach instead of nested directories? I'd rather stick with nested directories, because it makes grouping easier with ‘ls’, and writing tools that print enough of the path to include the license-id should be straightforward enough. But I don't feel super-strongly about it. If you want a flat structure, I can do that.

[1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#Comments

goneall · 2017-09-21T17:59:59Z

The only thing I need for testing the license generator in the short term is the clean text (no comments) for the positive test cases. I can filter by file extension to pull out only the .txt files if we want to mix the other language types in the same directory.
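A minimal sketch of that filter, assuming the nested `{license-id}/license/good/` layout discussed earlier in this thread. The function name `plain_text_positive_cases` is hypothetical, and the paths are treated as POSIX-style relative paths.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def plain_text_positive_cases(paths: Iterable[str]) -> List[str]:
    """Keep only {license-id}/license/good/*.txt entries."""
    selected = []
    for path in paths:
        parts = PurePosixPath(path).parts
        if (
            len(parts) == 4
            and parts[1] == "license"
            and parts[2] == "good"
            and parts[3].endswith(".txt")
        ):
            selected.append(path)
    return selected


if __name__ == "__main__":
    corpus = [
        "0BSD/license/good/canonical.txt",
        "0BSD/header/good/canonical-with-prefix.c",
        "0BSD/header/good/alternate-copyright.sh",
    ]
    print(plain_text_positive_cases(corpus))
```

Files for other languages can then live alongside the `.txt` files without affecting the generator tests.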

I'm OK with naming it canonical.txt, although I know that there has been debate in the past as to what the real canonical text is for a given license. As long as the naming choice doesn't require us resolving each of those debates, I'm OK with canonical.txt. If we feel that we need to confirm the text is the real canonical text as defined by the author of each individual license, I would suggest using a different term.

@wking If you still prefer canonical.txt given the above consideration, let's go with that.

wking · 2017-09-21T18:45:17Z

On Thu, Sep 21, 2017 at 06:00:00PM +0000, goneall wrote:
> I'm OK with naming it canonical.txt, although I know that there has been debate in the past as to what the real canonical text is for a given license. As long as the naming choice doesn't require us resolving each of those debates, I'm OK with canonical.txt. If we feel that we need to confirm the text is the real canonical text as defined by the author of each individual license, I would suggest using a different term.

I'm fine defining “canonical” as “the wording preferred by the SPDX legal team” to make it clear that it's independent of the author (where we even have clear author information). I've updated #3 with 644fc96 → 69adb5f to make that clear.

goneall · 2017-09-22T17:27:44Z

I would also like to test exception text in addition to the license and header text.

Suggest adding another directory, "exception", under the license id.

wking · 2017-09-22T18:18:50Z

On Fri, Sep 22, 2017 at 05:27:46PM +0000, goneall wrote:
> I would also like to test exception text in addition to the license and header text.

Good idea.

> Suggest adding another directory under the license id "exception"

Hmm. I don't see a reason to stop at WITH. Folks will probably want to test + detection as well, and possibly AND and OR detection. What we want is closer to (adjusting the #3 ABNF):

    file-name = license-expression "/" license-header "/" good-bad "/" test "." extention

(and that works for both the current spec's ABNF and the cleanup I'm proposing in spdx/spdx-spec#37). But license expressions have spaces and such, which are annoying for filenames. We can work around that by percent encoding [1] reserved characters [2], which most languages will be able to support with generic URI-processing libraries (e.g. [3,4]). That would give you the same filename for bare licenses:

    0BSD/…

because license-ids contain no reserved characters [5]. Exception filenames would look like:

    GPL-3.0%20WITH%20GCC-exception-3.1/…

GPL-2.0+ would be:

    GPL-2.0%2B/…

And GPL-2.0+ WITH GCC-exception-2.0 OR GPL-3.0 WITH GCC-exception-3.1 would be:

    GPL-2.0%2B%20WITH%20GCC-exception-2.0%20OR%20GPL-3.0%20WITH%20GCC-exception-3.1/…

(although we probably want to stick with examples we actually find in the wild ;).

Does that sound like a reasonable balance between flexibility and human-readability to you?

[1]: https://tools.ietf.org/html/rfc3986#section-2.1
[2]: https://tools.ietf.org/html/rfc3986#section-2.2
[3]: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote
[4]: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.unquote
[5]: https://github.com/spdx/spdx-spec/blame/cfa1b9d08903befdf03e669da6472707b7b60cb9/chapters/appendix-IV-SPDX-license-expressions.md#L11
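The percent-encoding scheme proposed above can be sketched with the `urllib.parse.quote` and `urllib.parse.unquote` functions linked in [3,4]. The helper names `expression_to_dirname` and `dirname_to_expression` are hypothetical. Passing `safe=""` is a choice made for this sketch so that `/` is also encoded and cannot be confused with a path separator; `quote` leaves letters, digits, and `_.-~` untouched.

```python
from urllib.parse import quote, unquote


def expression_to_dirname(expression: str) -> str:
    """Percent-encode a license expression for use as a directory name."""
    return quote(expression, safe="")


def dirname_to_expression(dirname: str) -> str:
    """Recover the license expression from a percent-encoded directory name."""
    return unquote(dirname)


if __name__ == "__main__":
    for expr in ("0BSD", "GPL-2.0+", "GPL-3.0 WITH GCC-exception-3.1"):
        name = expression_to_dirname(expr)
        # The round trip should give back the original expression.
        assert dirname_to_expression(name) == expr
        print(f"{expr!r} -> {name}/")
```

Bare license ids pass through unchanged, so existing directories such as `0BSD/` would not need renaming under this scheme.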

jmanbeck · 2017-09-22T19:43:25Z

@goneall and @wking I'm wondering if we can continue this discussion on the tech team mailing list. Kate had invited others to the Outreach call yesterday who had an interest in this. Thomas showed up and has done some similar work in this area and has some thoughts and maybe even something he could contribute.

In a way we rushed these files. Mostly we did it to gauge interest. I think before we do a massive restructure, let's nail a few things down on the direction and scope of this. We won't need to beat a dead horse, but how this will work is important. For example, as Trevor suggested, I don't know if Legal will sign up to maintain this. I have my doubts. They seem pretty busy as it is.

That said the original idea behind this was to create a set of test files for scanners to see how well they did in matching license text and standard headers to SPDX licenses. It was a start without boiling the ocean. We wanted to have both "good" examples and "bad" ones. The good ones should be detected as the spdx license and the "bad" ones should not. The bad ones would have been slightly edited to fail the matching guidelines. That said, the idea was not to get too crazy with the number of test files. We started out hand editing these files but there were so many we opted to auto generate them. The downside of this auto generation is that you have these silly files with the entire license in them because they have no standard header. Seemed okay to me as whether it was a plain text file or not should not matter to a scanner and it was easier to do. We put off doing bad files until we could figure out how to interject noise with the auto generation.

Trevor, both you and Thomas would like to see this expanded and Gary would as well. I think even Kate agrees. I agree as well. I think adding additional use cases is okay, but let's collect them all up.

Some thoughts:

- I like the layouts so far. I'm still on the fence on whether to use the license id in the file name or not for quick visual resolution.
- Who will support new licenses if we don't auto generate the files from the list? Who will correct current ones? I think there are some changes that need to be made to some of the licenses. I like the idea of auto generating (we never have enough help in SPDX and I worry this won't be properly maintained).
- Should we allow crowd sourcing of files (good and bad) from others? If so, should they be in a separate directory? How will people feel about being a "bad" example if we pull something real? Who will vet these?
- It's likely we need some sort of statement about licensing, and I need to speak with the Legal team.

We can always start small and build up.

wking · 2017-09-22T20:31:39Z

> I'm wondering if we can continue this discussion on the tech team mailing list…

Works for me.

> For example, as Trevor suggested I don't know if Legal will sign up to maintain this. I have my doubts. They seem pretty busy as it is.

Somebody is signing off on our alt/optional markup (spdx/license-list-XML#423, spdx/license-list-XML#425, spdx/license-list-XML#428, spdx/license-list-XML#429, spdx/license-list-XML#430, spdx/license-list-XML#431, …). I expected that was legal, and would mostly be grounded in decisions like this one. Things like spdx/license-list-XML#425 which are deemed too obvious can happen without involving legal. I don't know how these will break down in practice, maybe 50/50 for “serious enough to involve legal”?

And while it would be nice if legal decisions were backed by a legal-generated PR, I still don't think it will be a big deal if someone on the tech team has to proxy the legal decision into a backing PR.

And again, I'm ok with auto-generation if folks want to do that. I just think “I autogenerated these from our license source” is creating a chicken/egg issue, and I think it's easier to get legal to sign off on these examples as a source of truth than it would be to get them to sign off on the XML-marked licenses. With solid examples, the development of the XML-marked licenses would be the sole province of the tech team.

> Should we allow crowd sourcing of files (good and bad) from others?

Absolutely. But legal needs to be the gatekeeper, since the match/no-match decision is theirs.

> How will people feel about being a "bad" example if we pull something real…

I'll switch to use match / no-match to sound less judgmental.

> We can always start small and build up.

Yup. #4 is pretty small ;).

goneall · 2017-09-22T21:48:12Z

+1 on continuing on the SPDX mailing list. We can also add this as a topic for next Tuesday's call. @jmanbeck did you want to tee this up or would you like me to introduce it?

There are definitely multiple use cases.

The original purpose of this repo was a set of licenses to test and compare license scanners.

I would like to use the test files to test the generation of the actual SPDX listed licenses.

There's quite a bit of overlap for the two, but some differences as well (e.g. I do not plan to use commented code from different languages in testing out the license generation).

wking · 2017-09-22T21:57:27Z

On Fri, Sep 22, 2017 at 02:48:13PM -0700, goneall wrote:
> There's quite a bit of overlap for the two, but some differences as well (e.g. I do not plan to use commented code from different languages in testing out the license generation).

I think the generation-test data is a strict subset of the match-test data.

goneall · 2017-09-26T18:10:55Z

Discussed on the tech team call. Minutes are at https://wiki.spdx.org/view/Technical_Team/Minutes/2017-09-26

Next tech call (Tuesday Oct. 3) we plan to decide on the file naming convention and the format of any associated test metadata format (there is a proposal for YAML or SPDX files to describe expected test results). If interested, please join the call:
10AM Pacific time
Dial-in number: +1-857-216-2871
Conference code: 38633
http://uberconference.com/SPDXTeam pin # 38633

wking · 2017-09-26T18:34:41Z

On Tue, Sep 26, 2017 at 11:10:55AM -0700, goneall wrote:
> Minutes are at https://wiki.spdx.org/view/Technical_Team/Minutes/2017-09-26

Can you elaborate on the distinction between “Test license scanner” and “Test license matching” (maybe linking to projects implementing both)? At the moment, those sound like the same thing to me.

> Next tech call (Tuesday Oct. 3) we plan to decide on the file naming convention and the format of any associated test metadata format (there is a proposal for YAML or SPDX files to describe expected test results).

Can an advocate for the YAML/SPDX test-metadata provide an example of what they'd like to see? Are they looking for things like autodetected FileCopyrightText [1]?

[1]: https://spdx.org/spdx-specification-21-web-version#h.206ipza

goneall · 2017-09-27T03:52:43Z

> Can you elaborate on the distinction between “Test license scanner” and “Test license matching” (maybe linking to projects implementing both)? At the moment, those sound like the same thing to me.

License scanning is looking at a wide variety of text and identifying matches to license headers, comments, SPDX IDs, snippets of license text and actual license text. Testing for this would include all of these artifacts. Example scanners are ScanCode, FOSSology, Ninka and some commercial scanners.

License matching involves full license text and following the SPDX Matching Guidelines to determine if 2 blocks of text are equivalent per the guidelines. An example implementation is the SPDX License Checker.

There may be some overlap, as some license scanners provide a score and a score of 100% may be interpreted as a match per the matching guidelines (although I believe there are very few scanners that actually take this approach).

wking · 2017-09-27T05:06:57Z

> License matching involves full license text and following the SPDX Matching Guidelines to determine if 2 blocks of text are equivalent per the guidelines.

Those guidelines explicitly mention headers (1.1.1) and comment prefixes (6). So the distinction is that matching is "do you have a generic implementation of the template syntax and matching guidelines?" and scanning is actually applying that matching logic to compare a test file with each license (expression) in a known list?

goneall · 2017-09-27T17:30:48Z

> Those guidelines explicitly mention headers (1.1.1) and comment prefixes (6). So the distinction is that matching is "do you have a generic implementation of the template syntax and matching guidelines?" and scanning is actually applying that matching logic to compare a test file with each license (expression) in a known list?

In my opinion, there is a separate use case with a separate goal in matching the license text (even though the guidelines do mention headers). It is to answer the question if two texts can be considered the same license or if you need to add them to an SPDX document as a separate license. This is an important use case in my auditing work when presented with license text in package information. A human eye will sometimes miss important wording changes - having a tool to test license equivalence improves the quality of the audit results.

wking · 2017-09-27T19:23:19Z

On Wed, Sep 27, 2017 at 10:30:49AM -0700, goneall wrote:
> > Those guidelines explicitly mention headers (1.1.1) and comment prefixes (6). So the distinction is that matching is "do you have a generic implementation of the template syntax and matching guidelines?" and scanning is actually applying that matching logic to compare a test file with each license (expression) in a known list?
>
> In my opinion, there is a separate use case with a separate goal in matching the license text (even though the guidelines do mention headers).

That's fine, and I think we want to support tests like that. However, I think the matching guidelines pretty clearly identify both license matching and header matching, with or without comment prefixes and surrounding code as in-scope. So the test corpus we provide should cover all those cases. And if a particular tool (e.g. your “does this stand-alone license text match one of these other licenses” tool) only needs a subset of that test corpus, it can pick out that subset. Using something like the filename scheme in #3 should make picking out that subset pretty easy. Do you have concerns about finding the right files for testing your matcher with the filename scheme in #3?

wking mentioned this issue Sep 22, 2017

Add upstream header suggestions to exceptions? spdx/license-list-XML#435

Closed

wking mentioned this issue Sep 22, 2017

README: Document target filename scheme #3

Open

wking mentioned this issue Dec 4, 2017

Include testing samples in an easily unmarshallable format #7

Open

wking mentioned this issue Dec 15, 2017

Add a directory of files used for testing the license list website generating code #6

Open
