Request to add license test files for a different use case #2

Open
goneall opened this issue Sep 15, 2017 · 23 comments

@goneall
Member

goneall commented Sep 15, 2017

The SPDX tools issue spdx/tools#109 (comment) submitted by @wking suggests adding license test files to a separate repository, to test the correctness of both the license XML files and the tool that generates the website. This repository seems like a natural fit.

@jmanbeck If you agree, I can create a pull request to update the README file and the top level directory.

@wking

wking commented Sep 15, 2017

> If you agree, I can create a pull request to update the README file and the top level directory.

Oops, I'd gone ahead and filed #3 without waiting for a green light. I'll stub out the tar unpacking too so you can see what I have in mind, but feel free to reject both proposals if you don't think they're a good fit.

@jmanbeck
Collaborator

I'm open to whatever you guys want to do. Are you looking for just plain text files with nothing but the license text?

@wking

wking commented Sep 20, 2017 via email

@goneall
Member Author

goneall commented Sep 21, 2017

I'll be using the files to test the license generator output (the output used for the website). These can be different test files than those used to test license scanners. Since I'll be automating the tests, the test file names will have to follow a specific naming convention so that I can match each test file with its license and determine whether the test is supposed to produce a positive or negative match result.

@jmanbeck
Collaborator

So these are two different things as far as I can see. I looked at what Trevor did and can agree with most of it, and I will comment separately.

Gary, you need just the license text and nothing else, correct? No standard header, etc. Those can be easily generated. I know I gave you a set. Is that what you are looking for? The program can be easily modified to generate those as well, using the naming convention you want. If you'd like, it could be a separate program as well, given its specialized use.

@goneall
Member Author

goneall commented Sep 21, 2017

@jmanbeck Just the license text (no comments - just the straight text).

We could put them in a separate directory "license-matcher-testfiles" since they are a different use case than the license scanner testing.

@wking suggested creating separate subdirectories for each license ID. If it isn't too much work, can you create a subdirectory for each license, named by its license ID, with a single subdirectory "license" which contains a single subdirectory "good"? The license text could then be named "original.txt".

To summarize:

```
license-matcher-testfiles
  {license-id} ...
    license
      good
        original.txt
```

This follows the pattern we will be using for the automated testing: {license-id}/(license|header)/(good|bad)/{test-id}.txt
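
As a rough illustration of how an automated test could consume that pattern, here is a minimal sketch. The names `TEST_PATH`, `LicenseTestCase`, and `parse_test_path` are hypothetical and are not taken from the SPDX tools; the sketch only assumes the directory layout described above.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical regex mirroring {license-id}/(license|header)/(good|bad)/{test-id}.txt
TEST_PATH = re.compile(
    r"^(?P<license_id>[^/]+)/(?P<kind>license|header)"
    r"/(?P<expect>good|bad)/(?P<test_id>[^/]+)\.txt$"
)


class LicenseTestCase(NamedTuple):
    license_id: str     # e.g. "0BSD"
    kind: str           # "license" or "header"
    should_match: bool  # True for "good", False for "bad"
    test_id: str        # file name without the .txt extension


def parse_test_path(path: str) -> Optional[LicenseTestCase]:
    """Return the test case encoded in a relative path, or None if it does not fit."""
    m = TEST_PATH.match(path)
    if m is None:
        return None
    return LicenseTestCase(
        m["license_id"], m["kind"], m["expect"] == "good", m["test_id"]
    )
```

With this, `parse_test_path("0BSD/license/good/original.txt")` yields a positive test case for 0BSD, while a path outside the convention yields `None` and can be skipped.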

If the file naming ends up being too much work, just drop me a tarball of the files and I'll structure them into the directories and commit them.

Thanks

@wking

wking commented Sep 21, 2017 via email

@jmanbeck
Collaborator

Okay. The naming isn't an issue. We just need to agree on the structure. Generating it won't be that difficult, and I don't think adding them by hand makes sense.

I don't have a solid opinion on this, to be honest. I more or less just volunteered to create some files. I think the first thing to do is to agree on the structure. Once we have that, we can agree on the number and type of test files.

@goneall and @wking, if you two can agree on a structure, that would be great. I don't have anything to do with the tooling and license list, so I'm happy to do what works best.

@wking I have looked over what you did for 0BSD and like it. I do have a few comments/questions.

  1. You included .sh and .c. Why? I'm not sure I see the need for both, unless it's a comment-syntax thing (#), in which case we could add all sorts of other form factors, including HTML, JS, etc. Who knows, maybe we should.

  2. What am I missing in the "with prefix" and "no prefix" cases? They look the same to me.

  3. I like the idea of the license name being part of the file name for easy visual detection. If the reports out of scanners include the directory, it probably is not necessary. If they don't, then it's more work to hand-verify results, I think.

@wking

wking commented Sep 21, 2017 via email

@goneall
Member Author

goneall commented Sep 21, 2017

The only thing I need for testing the license generator in the short term is the clean text (no comments) for the positive test cases. I can filter by file extension to pull out only the .txt files if we want to mix the other language types in the same directory.
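
A minimal sketch of that extension filter, assuming the test files live under a single root directory (the helper name `plain_text_files` is hypothetical):

```python
from pathlib import Path
from typing import List


def plain_text_files(root: Path) -> List[Path]:
    """Return every .txt file under root, sorted for a stable test order."""
    # rglob walks all subdirectories; files such as .c or .sh are simply ignored.
    return sorted(p for p in root.rglob("*.txt") if p.is_file())
```

Files for other languages can then sit next to the plain text without affecting the license generator tests.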

I'm OK with naming it canonical.txt, although I know that there has been debate in the past as to what the real canonical text is for a given license. As long as the naming choice doesn't require us to resolve each of those debates, I'm OK with canonical.txt. If we feel that we need to confirm the text is the real canonical text as defined by the author of each individual license, I would suggest using a different term.

@wking If you still prefer canonical.txt given the above consideration, let's go with that.

@wking

wking commented Sep 21, 2017 via email

@goneall
Member Author

goneall commented Sep 22, 2017

I would also like to test exception text in addition to the license and header text.

I suggest adding another directory, "exception", under the license ID.

@wking

wking commented Sep 22, 2017 via email

@jmanbeck
Collaborator

@goneall and @wking I'm wondering if we can continue this discussion on the tech team mailing list. Kate had invited others who had an interest in this to the Outreach call yesterday. Thomas showed up; he has done some similar work in this area, has some thoughts, and maybe even has something he could contribute.

In a way we rushed these files. Mostly we did it to gauge interest. I think before we do a massive restructure, let's nail a few things down on the direction and scope of this. We won't need to beat a dead horse, but how this will work is important. For example, as Trevor suggested, I don't know if Legal will sign up to maintain this. I have my doubts. They seem pretty busy as it is.

That said, the original idea behind this was to create a set of test files for scanners to see how well they did in matching license text and standard headers to SPDX licenses. It was a start without boiling the ocean. We wanted to have both "good" examples and "bad" ones. The good ones should be detected as the SPDX license and the "bad" ones should not. The bad ones would have been slightly edited to fail the matching guidelines. The idea was not to get too crazy with the number of test files. We started out hand-editing these files, but there were so many that we opted to auto-generate them. The downside of this auto-generation is that you have these silly files with the entire license in them because they have no standard header. That seemed okay to me, since whether it was a plain text file or not should not matter to a scanner, and it was easier to do. We put off doing bad files until we could figure out how to inject noise with the auto-generation.
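
One possible way to add noise during auto-generation is sketched below. This is only an assumption about how it could be done, not the existing generator: the function name `inject_noise` and the `NOISE` marker are hypothetical, and whether a given edit really fails the matching guidelines would still need to be confirmed for each file.

```python
import random
import re


def inject_noise(text: str, seed: int = 0, marker: str = "NOISE") -> str:
    """Replace one pseudo-randomly chosen word with `marker`, keeping the layout."""
    words = list(re.finditer(r"[A-Za-z]+", text))
    if not words:
        return text  # nothing to alter
    # A fixed seed keeps the generated "bad" file reproducible between runs.
    target = random.Random(seed).choice(words)
    return text[:target.start()] + marker + text[target.end():]
```

Because only a single word is replaced, line breaks and punctuation stay as they were, which keeps the "bad" file easy to diff against the original.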

Trevor, both you and Thomas would like to see this expanded, and Gary would as well. I think even Kate agrees. I agree as well. I think adding additional use cases is okay, but let's collect them all up.

Some thoughts:

  1. I like the layouts so far. I'm still on the fence on whether to use the license ID in the file name for quick visual resolution.

  2. Who will support new licenses if we don't auto-generate the files from the list? Who will correct current ones? I think there are some changes that need to be made to some of the licenses. I like the idea of auto-generating (we never have enough help in SPDX, and I worry this won't be properly maintained).

  3. Should we allow crowdsourcing of files (good and bad) from others? If so, should they be in a separate directory? How will people feel about being a "bad" example if we pull something real? Who will vet these?

  4. It's likely we need some sort of statement about licensing, and I need to speak with the Legal team.

We can always start small and build up.

@wking

wking commented Sep 22, 2017

> I'm wondering if we can continue this discussion on the tech team mailing list…

Works for me.

> For example, as Trevor suggested, I don't know if Legal will sign up to maintain this. I have my doubts. They seem pretty busy as it is.

Somebody is signing off on our alt/optional markup (spdx/license-list-XML#423, spdx/license-list-XML#425, spdx/license-list-XML#428, spdx/license-list-XML#429, spdx/license-list-XML#430, spdx/license-list-XML#431, …). I expected that was legal, and would mostly be grounded in decisions like this one. Things like spdx/license-list-XML#425 which are deemed too obvious can happen without involving legal. I don't know how these will break down in practice, maybe 50/50 for “serious enough to involve legal”?

And while it would be nice if legal decisions were backed by a legal-generated PR, I still don't think it will be a big deal if someone on the tech team has to proxy the legal decision into a backing PR.

And again, I'm ok with auto-generation if folks want to do that. I just think “I autogenerated these from our license source” is creating a chicken/egg issue, and I think it's easier to get legal to sign off on these examples as a source of truth than it would be to get them to sign off on the XML-marked licenses. With solid examples, the development of the XML-marked licenses would be the sole province of the tech team.

> Should we allow crowd sourcing of files (good and bad) from others?

Absolutely. But legal needs to be the gatekeeper, since the match/no-match decision is theirs.

> How will people feel about being a "bad" example if we pull something real…

I'll switch to use match / no-match to sound less judgmental.

> We can always start small and build up.

Yup. #4 is pretty small ;).

@goneall
Member Author

goneall commented Sep 22, 2017

+1 on continuing on the SPDX mailing list. We can also add this as a topic for next Tuesday's call. @jmanbeck did you want to tee this up or would you like me to introduce it?

There are definitely multiple use cases.

The original purpose of this repo was a set of licenses to test and compare license scanners.

I would like to use the test files to test the generation of the actual SPDX listed licenses.

There's quite a bit of overlap for the two, but some differences as well (e.g. I do not plan to use commented code from different languages in testing out the license generation).

@wking

wking commented Sep 22, 2017 via email

@goneall
Member Author

goneall commented Sep 26, 2017

Discussed on the tech team call. Minutes are at https://wiki.spdx.org/view/Technical_Team/Minutes/2017-09-26

On the next tech call (Tuesday Oct. 3) we plan to decide on the file naming convention and the format of any associated test metadata (there is a proposal for YAML or SPDX files to describe expected test results). If interested, please join the call:
10AM Pacific time
Dial-in number: +1-857-216-2871
Conference code: 38633
http://uberconference.com/SPDXTeam pin # 38633

@wking

wking commented Sep 26, 2017 via email

@goneall
Member Author

goneall commented Sep 27, 2017

> Can you elaborate on the distinction between “Test license scanner”
> and “Test license matching” (maybe linking to projects implementing
> both)? At the moment, those sound like the same thing to me.

License scanning is looking at a wide variety of text and identifying matches to license headers, comments, SPDX IDs, snippets of license text, and actual license text. Testing for this would include all of these artifacts. Example scanners are ScanCode, FOSSology, Ninka, and some commercial scanners.
License matching involves full license text and following the SPDX Matching Guidelines to determine if 2 blocks of text are equivalent per the guidelines. An example implementation is the SPDX License Checker.
There may be some overlap, as some license scanners provide a score, and a score of 100% may be interpreted as a match per the matching guidelines (although I believe there are very few scanners that actually take this approach).
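
To make the matching case concrete, here is a deliberately simplified sketch. It assumes only two of the guideline rules, ignoring differences in whitespace and in capitalization, and leaves out the rest (punctuation equivalents, replaceable and omittable text, and so on). The function names are hypothetical and this is not how the SPDX License Checker is implemented.

```python
import re


def normalize(text: str) -> str:
    """Lower-case the text and collapse each run of whitespace to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def roughly_equivalent(a: str, b: str) -> bool:
    """Compare two blocks of text under the two simplified rules above."""
    return normalize(a) == normalize(b)
```

Two texts that differ only in line wrapping or letter case compare as equal here, while any change of wording makes the comparison fail.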

@wking

wking commented Sep 27, 2017

> License matching involves full license text and following the SPDX Matching Guidelines to determine if 2 blocks of text are equivalent per the guidelines.

Those guidelines explicitly mention headers (1.1.1) and comment prefixes (6). So the distinction is that matching is "do you have a generic implementation of the template syntax and matching guidelines?" and scanning is actually applying that matching logic to compare a test file with each license (expression) in a known list?

@goneall
Member Author

goneall commented Sep 27, 2017

> Those guidelines explicitly mention headers (1.1.1) and comment prefixes (6). So the distinction is that matching is "do you have a generic implementation of the template syntax and matching guidelines?" and scanning is actually applying that matching logic to compare a test file with each license (expression) in a known list?

In my opinion, there is a separate use case with a separate goal in matching the license text (even though the guidelines do mention headers). It is to answer the question of whether two texts can be considered the same license or whether you need to add them to an SPDX document as separate licenses. This is an important use case in my auditing work when presented with license text in package information. A human eye will sometimes miss important wording changes, so having a tool to test license equivalence improves the quality of the audit results.

@wking

wking commented Sep 27, 2017 via email
