Request to add license test files for a different use case #2
Comments
Oops, I'd gone ahead and filed #3 without waiting for a green light. I'll stub out the tar unpacking too so you can see what I have in mind, but feel free to reject both proposals if you don't think they're a good fit. |
I'm open to whatever you guys want to do. Are you looking for just plain text files with nothing but the license text? |
On Wed, Sep 20, 2017 at 01:21:48PM -0700, jmanbeck wrote:
> Are you looking for just plain text files with nothing but the
> license text?
See #4 for what I want for the 0BSD. One of those files
(0BSD/license/good/canonical.txt) is just the license text. The other
files show the standard header in a variety of contexts. For licenses
with alt/optional components, license-id/license/good would contain
additional files exercising the mutable parts.
The idea is to have a collection of files which cover the cases we
expect to find in the wild, so tools which successfully identify the
license associated with the various corpus files can trust their
identification/matching implementation.
|
I'll be using the files to test the license generator output (the output used for the website). These can be different test files than those used to test license scanners. Since I'll be automating the tests, the test files names will have to follow a specific naming convention so that I can match the license test files with the license and determine if the test is supposed to result in a positive or negative match result. |
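As a rough illustration of how an automated test could read the expected result out of a file name, here is a minimal Python sketch. It assumes the four-part layout discussed in this thread (`0BSD/license/good/canonical.txt`); the `bad` directory name, the `CorpusEntry` type, and the `parse_corpus_path` helper are hypothetical and not part of any agreed convention.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class CorpusEntry(NamedTuple):
    """One test file, described by its position in the corpus tree."""
    license_id: str     # e.g. "0BSD"
    kind: str           # e.g. "license" or "header"
    expect_match: bool  # True under good/, False under bad/ (assumed name)
    name: str           # file name, e.g. "canonical.txt"


def parse_corpus_path(path: str) -> CorpusEntry:
    """Split '<license-id>/<kind>/<good|bad>/<file>' into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected 4 path components, got {len(parts)}: {path!r}")
    license_id, kind, verdict, name = parts
    if verdict not in ("good", "bad"):
        raise ValueError(f"expected 'good' or 'bad', got {verdict!r}")
    return CorpusEntry(license_id, kind, verdict == "good", name)
```

A test runner could then call `parse_corpus_path` on each file and compare `expect_match` against what the tool under test reports for `license_id`.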
So these are two different things as far as I can see. I looked at what Trevor did and can agree with most of it and will comment separately. Gary, you need just the license text and nothing else, correct? No standard header, etc. Those can be easily generated. I know I gave you a set. Is that what you are looking for? The program can be easily modified to generate those as well using the naming convention you want. If you'd like, it could be a separate program as well given its specialized use. |
@jmanbeck Just the license text (no comments - just the straight text). We could put them in a separate directory "license-matcher-testfiles" since they are a different use case than the license scanner testing. @wking suggested creating separate subdirectories for each license id. If it isn't too much work, can you create a subdirectory for each license ID named by licenseID with a single subdirectory "license" which contains a single subdirectory "good", where the license text could be named "original.txt"? To summarize: `license-matcher-testfiles/[licenseID]/license/good/original.txt`
If the file naming ends up being too much work, just drop me a tarball of the files and I'll structure them into the directory and commit. Thanks |
On Thu, Sep 21, 2017 at 09:45:01AM -0700, goneall wrote:
> We could put them in a separate directory
> "license-matcher-testfiles" since they are a different use case than
> the license scanner testing.
It's a different use case, but I think it's a subset of the
match-testing use case. For example, the 0BSD file you ask for is
0BSD/license/good/canonical.txt [1].
[1]: https://github.com/spdx/license-test-files/pull/4/files#diff-87c959bfe7349374963aaa364f6a8470R1
> @wking suggested creating separate subdirectories for each license id.
And the docs for that are in flight with #3.
> If it isn't too much work, can you create a subdirectory…
I've already started in on this with #4. If you like the direction,
I'll continue through the rest of the tarball.
> … "original.txt".
I recommend ‘canonical.txt’, because what SPDX chooses for the
canonical wording may not be the chronologically-first wording.
|
Okay. The naming isn't an issue. We just need to agree on the structure. Generating it won't be that difficult and I don't think adding them by hand makes sense. I don't have a solid opinion on this, to be honest. I more or less just volunteered to create some files. I think the first thing to do is to agree on the structure. Once we have that, we can agree on the number and type of test files. @goneall and @wking if you two can agree on a structure, that would be great. I don't have anything to do with the tooling and license list, so I'm happy to do what works best. @wking I have looked over what you did for 0BSD and like it. I do have a few comments/questions.
|
On Thu, Sep 21, 2017 at 04:58:17PM +0000, jmanbeck wrote:
> Generating it won't be that difficult and I don't think adding them by
> hand makes sense.
Generating automatically is fine, but things like
0BSD/header/good/canonical-with-prefix.c and
0BSD/header/good/alternate-copyright.sh (which I just pushed to #4)
don't need to be generated for each license. And regardless of how
they're generated, we want to manually review all of them to ensure we
are not coupling in possible generator bugs. Once we catch up to the
current list, it should be the legal team submitting new content, not
automated tools.
> 1. You included .sh and .c. Why? I'm not sure I see the need for both
> unless it's a comments thing (#)…
Yes, it's to test that your matcher correctly ignores comments. I'd
also like some bad tests where we inject some of those in the license
body (vs. using them as prefixes) to make sure the match fails in
those cases (unless we have some matching guideline that suggests
comment-ish characters be ignored, in which case the internal comment
character files would go under good/).
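To make the comment-prefix point concrete, here is a small Python sketch of one way a matcher might drop leading comment markers before comparing text. The marker set (`#`, `/*`, `*/`, and runs of `*`) and the helper names are assumptions for illustration, not something taken from the matching guidelines or from any existing scanner.

```python
import re

# Assumed leading comment markers: "#" (shell), "/*" and "*/" (C block
# comment delimiters), and runs of "*" such as the "** " prefix.
_PREFIX = re.compile(r"^\s*(?:#+|/\*+|\*+/|\*+)\s?")


def strip_comment_prefix(line: str) -> str:
    """Remove one leading comment marker; markers inside the line stay."""
    return _PREFIX.sub("", line, count=1).rstrip()


def normalize(text: str) -> str:
    """Strip per-line prefixes, then collapse all whitespace runs."""
    stripped = (strip_comment_prefix(line) for line in text.splitlines())
    return " ".join(" ".join(stripped).split())
```

Because only the start of each line is touched, a `#` injected into the middle of the license body survives normalization, so a comparison against the clean text would still fail, which is the behaviour the proposed bad/ tests are meant to check. One limitation of this sketch is that license text whose lines legitimately start with `*` or `#` would lose those characters.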
> … and we could add all sorts of other form factors including
> HTML/JS/etc. Who knows, maybe we should.
I agree, although JavaScript uses C-style comments [1], so I don't
think we need *.js tests. We probably do want *.xml tests using <!--
… -->, although I'm mostly worried about prefixes, and I don't usually
see prefixes in XML comments.
> 2. What am I missing in the with prefix and no prefix cases? They
> look the same to me.
The prefix case has a leading ‘** ’:
$ diff -u 0BSD/header/good/canonical*prefix.c
--- 0BSD/header/good/canonical-no-prefix.c 2017-09-15 11:01:34.430392702 -0700
+++ 0BSD/header/good/canonical-with-prefix.c 2017-09-21 10:05:28.725042963 -0700
@@ -12,19 +12,19 @@
*/
/*
-Copyright (C) 2006 by Rob Landley <[email protected]>
…
+** Copyright (C) 2006 by Rob Landley <[email protected]>
> 3. I like the idea of the license name being part of the file name
> for easy visual detection. If the reports out of scanners include
> the directory, it probably is not necessary then. If they don't, then
> it's more work to hand-verify results, I think.
Are you saying that tools may only print the filename, so we should
use a flat-file approach instead of nested directories? I'd rather
stick with nested directories, because it makes grouping easier with
‘ls’, and writing tools that print enough of the path to include the
license-id should be straightforward enough.
But I don't feel super-strongly about it. If you want a flat
structure, I can do that.
[1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#Comments
|
The only thing I need for testing the license generator in the short term is the clean text (no comments) for the positive test cases. I can filter by file extensions to only pull out the .txt files if we want to mix the other language types in the same directory. I'm OK with naming it canonical.txt, although I know that there has been debate in the past as to what the real canonical text is for a given license. As long as the naming choice doesn't require us resolving each of those debates, I'm OK with canonical.txt. If we feel that we need to confirm the text is the real canonical text as defined by the author of each individual license, I would suggest using a different term. @wking If you still prefer canonical.txt with the above consideration, let's go with that. |
On Thu, Sep 21, 2017 at 06:00:00PM +0000, goneall wrote:
> I'm OK with naming it canonical.txt, although I know that there has
> been debate in the past as to what the real canonical text is for a
> given license. As long as the naming choice doesn't require us
> resolving each of those debates, I'm OK with canonical.txt. If we
> feel that we need to confirm the text is the real canonical text as
> defined by the author of each individual license, I would suggest
> using a different term.
|
I would also like to test exception text in addition to the license and header text. Suggest adding another directory under the license id "exception" |
On Fri, Sep 22, 2017 at 05:27:46PM +0000, goneall wrote:
> I would also like to test exception text in addition to the license
> and header text.
Good idea.
> Suggest adding another directory under the license id "exception"
Hmm. I don't see a reason to stop at WITH. Folks will probably want
to test + detection as well, and possibly AND and OR detection. What
we want is closer to (adjusting the #3 ABNF):
file-name = license-expression "/" license-header "/" good-bad "/" test "." extention
(and that works for both the current spec's ABNF and the cleanup I'm
proposing in spdx/spdx-spec#37). But license expressions have spaces
and such, which are annoying for filenames. We can work around that
by percent encoding [1] reserved characters [2], which most languages
will be able to support with generic URI-processing libraries
(e.g. [3,4]). That would give you the same filename for bare licenses:
0BSD/…
because license-ids contain no reserved characters [5]. Exception
filenames would look like:
GPL-3.0%20WITH%20GCC-exception-3.1/…
GPL-2.0+ would be:
GPL-2.0%2B/…
And GPL-2.0+ WITH GCC-exception-2.0 OR GPL-3.0 WITH GCC-exception-3.1 would be:
GPL-2.0%2B%20WITH%20GCC-exception-2.0%20OR%20GPL-3.0%20WITH%20GCC-exception-3.1/…
(although we probably want to stick with examples we actually find in
the wild ;). Does that sound like a reasonable balance between
flexibility and human-readability to you?
[1]: https://tools.ietf.org/html/rfc3986#section-2.1
[2]: https://tools.ietf.org/html/rfc3986#section-2.2
[3]: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote
[4]: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.unquote
[5]: https://github.com/spdx/spdx-spec/blame/cfa1b9d08903befdf03e669da6472707b7b60cb9/chapters/appendix-IV-SPDX-license-expressions.md#L11
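The percent-encoding proposal above can be tried out with the standard-library functions cited in [3,4]. This is only a sketch: the two helper names are made up for illustration, and passing `safe=""` (so that `/` would be encoded too) is an assumption rather than something settled in this thread.

```python
from urllib.parse import quote, unquote


def expression_to_dirname(expression: str) -> str:
    """Percent-encode a license expression for use as a directory name."""
    # safe="" means nothing beyond letters, digits and "_.-~" is left
    # unencoded, so spaces become %20 and "+" becomes %2B.
    return quote(expression, safe="")


def dirname_to_expression(dirname: str) -> str:
    """Recover the license expression from an encoded directory name."""
    return unquote(dirname)


if __name__ == "__main__":
    for expr in ("0BSD", "GPL-2.0+", "GPL-3.0 WITH GCC-exception-3.1"):
        print(expr, "->", expression_to_dirname(expr))
```

Bare license-ids come out unchanged because they contain only characters that `quote` never encodes, which matches the `0BSD/…` case described above.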
|
@goneall and @wking I'm wondering if we can continue this discussion on the tech team mailing list. Kate had invited others to the Outreach call yesterday who had an interest in this. Thomas showed up and has done some similar work in this area and has some thoughts and maybe even something he could contribute. In a way we rushed these files. Mostly we did it to gauge interest. I think before we do a massive restructure, let's nail a few things down on the direction and scope of this. We won't need to beat a dead horse, but how this will work is important. For example, as Trevor suggested I don't know if Legal will sign up to maintain this. I have my doubts. They seem pretty busy as it is. That said, the original idea behind this was to create a set of test files for scanners to see how well they did in matching license text and standard headers to SPDX licenses. It was a start without boiling the ocean. We wanted to have both "good" examples and "bad" ones. The good ones should be detected as the SPDX license and the "bad" ones should not. The bad ones would have been slightly edited to fail the matching guidelines. That said, the idea was not to get too crazy with the number of test files. We started out hand editing these files, but there were so many we opted to auto-generate them. The downside of this auto-generation is that you have these silly files with the entire license in them because they have no standard header. That seemed okay to me, as whether it was a plain text file or not should not matter to a scanner and it was easier to do. We put off doing bad files until we could figure out how to interject noise with the auto-generation. Trevor, both you and Thomas would like to see this expanded and Gary would as well. I think even Kate agrees. I agree as well. I think adding additional use cases is okay, but let's collect them all up. Some thoughts:
We can always start small and build up. |
Works for me.
Somebody is signing off on our alt/optional markup (spdx/license-list-XML#423, spdx/license-list-XML#425, spdx/license-list-XML#428, spdx/license-list-XML#429, spdx/license-list-XML#430, spdx/license-list-XML#431, …). I expected that was legal, and would mostly be grounded in decisions like this one. Things like spdx/license-list-XML#425 which are deemed too obvious can happen without involving legal. I don't know how these will break down in practice, maybe 50/50 for “serious enough to involve legal”? And while it would be nice if legal decisions were backed by a legal-generated PR, I still don't think it will be a big deal if someone on the tech team has to proxy the legal decision into a backing PR. And again, I'm ok with auto-generation if folks want to do that. I just think “I autogenerated these from our license source” is creating a chicken/egg issue, and I think it's easier to get legal to sign off on these examples as a source of truth than it would be to get them to sign off on the XML-marked licenses. With solid examples, the development of the XML-marked licenses would be the sole province of the tech team.
Absolutely. But legal needs to be the gatekeeper, since the match/no-match decision is theirs.
I'll switch to use
Yup. #4 is pretty small ;). |
+1 on continuing on the SPDX mailing list. We can also add this as a topic for next Tuesday's call. @jmanbeck did you want to tee this up or would you like me to introduce it? There are definitely multiple use cases. The original purpose of this repo was a set of licenses to test and compare license scanners. I would like to use the test files to test the generation of the actual SPDX listed licenses. There's quite a bit of overlap for the two, but some differences as well (e.g. I do not plan to use commented code from different languages in testing out the license generation). |
On Fri, Sep 22, 2017 at 02:48:13PM -0700, goneall wrote:
> There's quite a bit of overlap for the two, but some differences as
> well (e.g. I do not plan to use commented code from different
> languages in testing out the license generation).
I think the generation-test data is a strict subset of the match-test
data.
|
Discussed on the tech team call. Minutes are at https://wiki.spdx.org/view/Technical_Team/Minutes/2017-09-26 Next tech call (Tuesday Oct. 3) we plan to decide on the file naming convention and the format of any associated test metadata format (there is a proposal for YAML or SPDX files to describe expected test results). If interested, please join the call: |
On Tue, Sep 26, 2017 at 11:10:55AM -0700, goneall wrote:
> Minutes are at https://wiki.spdx.org/view/Technical_Team/Minutes/2017-09-26
Can you elaborate on the distinction between “Test license scanner”
and “Test license matching” (maybe linking to projects implementing
both)? At the moment, those sound like the same thing to me.
> Next tech call (Tuesday Oct. 3) we plan to decide on the file naming
> convention and the format of any associated test metadata format
> (there is a proposal for YAML or SPDX files to describe expected
> test results).
Can an advocate for the YAML/SPDX test-metadata provide an example of
what they'd like to see? Are they looking for things like
autodetected FileCopyrightText [1]?
[1]: https://spdx.org/spdx-specification-21-web-version#h.206ipza
|
License scanning is looking at a wide variety of text and identifying matches to license headers, comments, SPDX IDs, snippets of license text and actual license text. Testing for this would include all of these artifacts. Example scanners are ScanCode, FOSSology, Ninka and some commercial scanners. |
Those guidelines explicitly mention headers (1.1.1) and comment prefixes (6). So the distinction is that matching is "do you have a generic implementation of the template syntax and matching guidelines?" and scanning is actually applying that matching logic to compare a test file with each license (expression) in a known list? |
In my opinion, there is a separate use case with a separate goal in matching the license text (even though the guidelines do mention headers). It is to answer the question if two texts can be considered the same license or if you need to add them to an SPDX document as a separate license. This is an important use case in my auditing work when presented with license text in package information. A human eye will sometimes miss important wording changes - having a tool to test license equivalence improves the quality of the audit results. |
On Wed, Sep 27, 2017 at 10:30:49AM -0700, goneall wrote:
> > Those guidelines explicitly mention headers (1.1.1) and comment
> > prefixes (6). So the distinction is that matching is "do you have
> > a generic implementation of the template syntax and matching
> > guidelines?" and scanning is actually applying that matching logic
> > to compare a test file with each license (expression) in a known
> > list?
> In my opinion, there is a separate use case with a separate goal in
> matching the license text (even though the guidelines do mention
> headers).
That's fine, and I think we want to support tests like that. However,
I think the matching guidelines pretty clearly identify both license
matching and header matching, with or without comment prefixes and
surrounding code as in-scope. So the test corpus we provide should
cover all those cases. And if a particular tool (e.g. your “does this
stand-alone license text match one of these other licenses” tool) only
needs a subset of that test corpus, it can pick out that subset.
Using something like the filename scheme in #3 should make picking out
that subset pretty easy. Do you have concerns about finding the right
files for testing your matcher with the filename scheme in #3?
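For what it's worth, picking out such a subset from path names alone could look roughly like the following Python sketch. It assumes the `<license-id>/<kind>/<good|bad>/<file>` layout used for the 0BSD files in this thread; the `select_subset` function and its parameter names are hypothetical.

```python
from typing import Iterable, List


def select_subset(
    paths: Iterable[str],
    kind: str = "license",
    verdict: str = "good",
    extension: str = ".txt",
) -> List[str]:
    """Keep only corpus paths with the requested kind, verdict and suffix."""
    selected = []
    for path in paths:
        parts = path.split("/")
        if len(parts) != 4:
            continue  # not in the assumed four-part layout
        _, path_kind, path_verdict, name = parts
        if (
            path_kind == kind
            and path_verdict == verdict
            and name.endswith(extension)
        ):
            selected.append(path)
    return selected
```

With the defaults, a tool that only compares stand-alone license text would see `0BSD/license/good/canonical.txt` and skip the `.c` and `.sh` header files.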
|
The SPDX tools issue spdx/tools#109 (comment) submitted by @wking suggests adding license test files to test the correctness of the license XML files and the tool that generates the website to a separate repository. This seems like a natural fit.
@jmanbeck If you agree, I can create a pull request to update the README file and the top level directory.