Test suite #38

Open
VladimirAlexiev opened this issue Apr 3, 2019 · 41 comments

Comments

@VladimirAlexiev
Contributor

VladimirAlexiev commented Apr 3, 2019

The SPARQL 1.1 test suite is useful, but:

  • There are bugs that can't be fixed because the process is over.
  • There are some harnesses, but no harness as a service that any dev can easily use to test their implementation continuously.
  • The Implementation Reports are generated from EARL RDF test results, which is great. But afaik these are submitted by devs and taken at face value.

@BorderCloud (Karima Rafes) has been running http://sparqlscore.com/ valiantly for 4 years (see documentation), added some tests and fixed some; and given up on others because of ambiguities in the spec.

She proposed and I support that whatever 1.2 features are standardized by this group, should have tests. I also put forward that this group should try to fix 1.1 test suite problems, and help w3c host a continuous testing harness.

The biggest improvements needed on this testing site are

  • more flexible result comparison by the test runner. Eg using jsonld c14n to make comparison easier
  • logistical issues eg what do you use as counterparty server for Federated queries

Karima please add more from recent emails

@kasei
Collaborator

kasei commented Apr 3, 2019

Are there "test suite problems" that are not captured by the updates made as part of the rdf-tests CG?

@afs
Collaborator

afs commented Apr 3, 2019

See RDF tests issue 51 -- w3c/rdf-tests#51 -- for previous discussions.

The draft charter for the SPARQL 1.1 CG specifically recognizes liaison with the "RDF Test Curation Community Group".

@BorderCloud

I do not know if this is the right time to discuss the test suite, but the first thing to do is to define exactly the same minimal API for all SPARQL services (#27). Once that minimal API is accepted, we can imagine a new solution that fully tests each SPARQL implementation, in parallel with the work of a future WG.

In this new solution for testing SPARQL implementations, I would like to:

  • Simulate all communication between a SPARQL service and a SPARQL client or other services (federated queries)
  • Allow anyone to define a new test (public or private)
  • Allow anyone to reproduce the result of each test for free (e.g. on GitHub with Travis CI)
  • Generate customized test reports to help WG members test their software (in private, during development)
  • Allow the WG to select which tests become part of the recommendation, based on the test results
  • After SPARQL 1.2 becomes a recommendation, offer a service that lets users see which official SPARQL 1.2 features each product on the market supports.

I think it's time to industrialise SPARQL.

If the CG officially asks me to participate in the WG to consolidate the next version of SPARQL, I can propose a new research project to develop this new platform.

@VladimirAlexiev
Contributor Author

VladimirAlexiev commented Apr 4, 2019

@kasei Good question. Nearly all files at https://github.com/w3c/rdf-tests/tree/gh-pages/sparql11 are last updated 3-4 years ago. See TFT-tests/issues: I believe the following are bugs in the tests: BorderCloud/TFT-tests#18, BorderCloud/TFT-tests#15, BorderCloud/TFT-tests#20, BorderCloud/TFT-tests#2. We even had some absurd discussions like

  • Fix the name of tvs02 to tsv02
  • you can ask to fix it in the project of w3c/rdf-tests. After, I will pull their new "official name"

Sure enough, tvs02 is still not fixed.

BorderCloud/TFT-tests#4 is pervasive: many tests use relative URLs but fix no base. @jeenbroekstra said that's not a bug and Karima's runner adds some base, but I think it is a bug.

The other issues are more important: we need a flexible test result comparator (perhaps based on c14n), else there are many false negatives.

@afs the W3C Tests CG site doesn't have any posts since 3.5y ago (2015). I asked a couple days ago "Is the activity of this group closed? TFT-tests runs continuous tests over some RDF repos, and tries to fix some of the tests. The biggest improvements needed in this suite is more flexible result comparison by the test runner". The comment is still awaiting moderation: I think that group is closed and gone.


Karima replied: "I finished my thesis (in French): Karima Rafes. Le Linked Data à l'université : la plateforme LinkedWiki. Université Paris-Saclay, 2019. Français. Chapter 5 is the conclusion of this work.
I developed the simplest tests. There are still tests that are difficult or useless to code because several parts of the SPARQL 1.1 specifications are too fuzzy. I did my best. The next step for me is to consolidate/change the specifications, otherwise SPARQL will never be fully interoperable.

So the TFT project is on standby and will disappear when W3C offers all tests with a tool such as TFT to validate compliance with SPARQL. If the tests and the tools to run them become a prerequisite for validating the specifications, there will be fewer features, but SPARQL 1.2 will not have the interoperability problems of SPARQL 1.1. When the CG works on the tests needed for SPARQL 1.2, I will try to work with it (if I have the time)."


Maybe I should have pressed with w3c/rdf-tests. But I had these exchanges with Karima in 2018 (I was trying to get GraphDB to perfect score), while the last activity I see in the W3C Tests CG and their github is 2015.

Ergo the point of this issue: SPARQL test suite activity needs to be restarted, and kept continuous for 3-4 years. Every SPARQL 1.2 feature must come with tests, and there should be a continuous-testing framework in place. Else there is a risk that users won't know which repo implements what and how well, and the new features won't be used much.

@VladimirAlexiev
Contributor Author

@afs SPARQL 1.1 CG specifically recognizes liaison with the "RDF Test Curation Community Group"

If another group can take over testing that would be great. But it seems to me the W3C Tests CG is disbanded/passive. I think that together with forming this SPARQL 1.2 CG, the Tests CG must be restarted. @iherman and @gkellogg, please comment?

@gkellogg
Member

gkellogg commented Apr 4, 2019

CG is not disbanded, it has been quiescent for a long time. It makes sense to have this CG to drive SPARQL tests, but may want to work out of the RDF tests CG repo.

@afs
Collaborator

afs commented Apr 4, 2019

@VladimirAlexiev

Nearly all files at https://github.com/w3c/rdf-tests/tree/gh-pages/sparql11 are last updated 3-4 years ago.

because there have been no fixes needed. https://github.com/w3c/rdf-tests/commits/gh-pages and https://github.com/w3c/rdf-tests/pulls?q=is%3Apr+is%3Aclosed show recent activity.

Moving the work across CGs does not change the fact that someone has to do the work.
Change happens when pull requests are sent.

Is there a barrier to contributing to RDF test CG?

@VladimirAlexiev
Contributor Author

@afs then please move this task to rdf-tests (but change the title to something more descriptive).

@gkellogg and @kasei and whoever else was active in rdf-tests, you'll be the best people to continue leading this work! I've long marveled at EARL and how EARL reports are used to generate Implementation Report htmls, a work of beauty. But do you agree with the more ambitious goals that Karima and I have proposed above?

  • A continuous testing framework is better than taking those EARL reports at face value.
  • A more flexible comparator (perhaps based on c14n) will eliminate false negatives and let vendors focus on the true discrepancies. @gkellogg you and Manu would be the best people to pull this off.

Is there a barrier to contributing to RDF test CG?

Truth be told I never tried, I didn't know it was active. I (or a QA at ONTO) would love to work with rdf-test to eliminate false negatives.
I posted some issues to Karima but she basically threw her hands in the air for some of them, saying "it's out of my hands".
Or fixed stuff locally, eg look at BorderCloud/TFT-tests#4: she added some base to all queries, but maybe it's better to specify the base explicitly.

@afs
Collaborator

afs commented Apr 4, 2019

If that community wish to take the issue, then fine. I do not believe pushing it at them is productive. There is RDF tests issue 51 -- w3c/rdf-tests#51 -- for previous discussions.

Work on a test runner does not need any permission from anyone but the idea of changing SPARQL to fit one particular runner seems a bad idea.

Base URI handling is explained in the SPARQL test suite. RFC 3986 section 5.1 explains the general mechanism that applies to all URI resolution.

@VladimirAlexiev
Contributor Author

Work on a test runner does not need any permission from anyone

I'm not seeking permission, I seek willingness for collaboration on this important topic. Do you think it'd be important to run a centralized continuous test runner for everyone's benefit?

changing SPARQL to fit one particular runner seems a bad idea

Don't know what gave you that idea. I think that using relative URLs in tests without a base leaves them underspecified, and is one issue that needs fixing in the tests.

@kasei
Collaborator

kasei commented Apr 4, 2019

changing SPARQL to fit one particular runner seems a bad idea

Don't know what gave you that idea. I think that using relative URLs in tests without a base leaves them underspecified, and is one issue that needs fixing in the tests.

Base URL resolution is well defined.

Beyond this issue, there have been other suggestions (e.g. in #27) to make backwards incompatible changes for the benefit of testing. I strongly agree with @afs that this sort of thing would be a bad idea.

@gkellogg
Member

gkellogg commented Apr 4, 2019

@gkellogg and @kasei and whoever else was active in rdf-tests, you'll be the best people to continue leading this work! I've long marveled at EARL and how EARL reports are used to generate Implementation Report htmls, a work of beauty. But do you agree with the more ambitious goals that Karima and I have proposed above?

  • A continuous testing framework is better than taking those EARL reports at face value.

RDFa did something like that, which was a pain. Every implementation must maintain a service to respond to test queries. In reality, it was a lot of work. Today, you might use containerized apps, but it might be better to define a CI best practice for implementations to use to run the tests, and potentially send an updated report. Conceivably, the implementation report could be automatically updated, but it’s required a lot of hand holding in the past.

  • A more flexible comparator (perhaps based on c14n) will eliminate false negatives and let vendors focus on the true discrepancies. @gkellogg you and Manu would be the best people to pull this off.

I don’t see that it would eliminate false negatives, as C14N and Isomorphism effectively allow equivalent comparisons. C14N might generate more useful diffs when results don’t compare.

Is there a barrier to contributing to RDF test CG?

Truth be told I never tried, I didn't know it was active. I (or a QA at ONTO) would love to work with rdf-test to eliminate false negatives.

Consider joining the CG.

@BorderCloud

BorderCloud commented Apr 5, 2019

@gkellogg

A continuous testing framework is better than taking those EARL reports at face value.

RDFa did something like that, which was a pain. Every implementation must maintain a service to respond to test queries. In reality, it was a lot of work. Today, you might use containerized apps, but it might be better to define a CI best practice for implementations to use to run the tests, and potentially send an updated report. Conceivably, the implementation report could be automatically updated, but it’s required a lot of hand holding in the past.

My continuous testing framework works already via Travis CI and the results of tests are collected in a RDF database via a SPARQL service.
The CG can already use it to evaluate the compliance with SPARQL 1.1... (and enable my tests about the protocol)

But for the federated query protocol, my first implementation is insufficient. We have to imagine another method in the future.

@VladimirAlexiev
Contributor Author

@kasei

Base URL resolution is well defined.

But when a test doesn't define a base and the test SPARQL can be located at different URLs, what is the result of that resolution? Would you agree with me that a test that uses relative URLs and doesn't specify base is under-specified?

there have been other suggestions (e.g. in #27) to make backwards incompatible changes

I myself don't know what Karima means by #27. But don't throw away the baby with the bath water. Have you looked at http://sparqlscore.com and what do you think of it?

@gkellogg

Every implementation must maintain a service to respond to test queries

Most vendors (and I speak for one) have eval or free versions, that's what Karima used for her service. Vendors have an interest in perfecting their score. Karima's done a good job, but she needs the support of the RDF Test CG to keep it going and to improve it.

the implementation report could be automatically updated, but it’s required a lot of hand holding in the past.

Have you considered the reproducibility of the Implementation Report? If I want to check all claimed results, what am I to do?

Consider joining the CG.

I'll speak to colleagues at ONTO.

I don’t see that it would eliminate false negatives, as C14N and Isomorphism effectively allow equivalent comparisons.

It's easier to compare two c14n-ed result sets (the reference result and the response of the SUT, the system under test). The SUT response can often include extra triples, which the comparator must allow.

@BorderCloud

federated query protocol, my first implementation is insufficient.

Yes, what to use as the counterparty server for federated queries is a difficult question.

  • Eg if you use Virtuoso, presumably this gives an unfair advantage to Virtuoso as SUT (because presumably two Virtuoso instances will federate more smoothly than Virtuoso and another SUT).
  • The uptime of this counterparty system is important, else it'll fail the federated tests of other SUTs.

@afs
Collaborator

afs commented Apr 7, 2019

@VladimirAlexiev,

The tests run from manifest files, which are Turtle. Suppose the manifest file is http://example/manifest.ttl.

    mf:action
         [ qt:query  <agg01.rq> ;
           qt:data   <agg01.ttl> ] ;

When that is read by a Turtle parser, the RDF term for <agg01.rq> is http://example/agg01.rq. When reading the query, the base URI is therefore http://example/agg01.rq. A query can change this with BASE, but it starts out as http://example/agg01.rq. This is not a feature of SPARQL; it is part of RFC 3986.
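
As a rough illustration of the resolution described above, here is a Python sketch using the standard library's `urljoin`, which follows RFC 3986 reference resolution. The `http://localhost:8000/...` location and the relative name `data.ttl` are assumptions made up for the example; they do not come from the test suite.

```python
from urllib.parse import urljoin

# Location of the manifest as in the example above.
manifest_url = "http://example/manifest.ttl"

# Relative IRIs in the manifest resolve against the manifest's own URL.
query_url = urljoin(manifest_url, "agg01.rq")
data_url = urljoin(manifest_url, "agg01.ttl")

# The query's initial base is the URL it was read from, so a relative IRI
# inside the query ("data.ttl" is illustrative) resolves against query_url
# unless the query declares its own BASE.
resolved_in_query = urljoin(query_url, "data.ttl")

# The same suite served from a different (hypothetical) location.
local_manifest_url = "http://localhost:8000/sparql11/manifest.ttl"
local_query_url = urljoin(local_manifest_url, "agg01.rq")

print(query_url)
print(data_url)
print(resolved_in_query)
print(local_query_url)
```

Moving the suite changes the absolute IRIs but not how the files relate to each other, which is why the manifest can use relative references.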

@VladimirAlexiev
Contributor Author

@afs Exactly my point: what is the actual value of http://example? It is not defined by the test suite.

@kasei comment on Protocol validation: #1 (comment). Would be great to include protocol tests in the suite.

@afs
Collaborator

afs commented Apr 9, 2019

It is wherever the test suite resides. It is not fixed and does not need to be.

This allows people to download the suite and run it locally as they have done. (After all, it is mostly the test suite for query engines.)

This has been discussed at length before. What is the problem you are facing with relative URI resolution to make the test suite portable?

@BorderCloud

BorderCloud commented Apr 9, 2019

@afs

After all, it is mostly the test suite for query engines.

That's wrong.
It is mostly the test suite for SPARQL clients, because it is the SPARQL clients that are the victims of your different protocols.

A unique and reproducible test suite is not an optional tool if we want to build real interoperability for the Semantic Web.

I demonstrated that it is possible to use the same protocol test suite to evaluate our interoperability. It's free and reproducible online by anybody. That's excellent news for the next version of SPARQL, isn't it?

It's time to use the same test suite to build real interoperability for SPARQL 1.1 and 1.2 and 2.0...

@VladimirAlexiev
Contributor Author

Andy's comment "mostly the test suite for query engines" applies to the question of whether queries should specify their BASE.

On the other hand, I believe that protocol tests are definitely fair game for such a test suite.

@afs
Collaborator

afs commented Apr 9, 2019

Please update sparqlscore to work with RDF 1.1.

@BorderCloud

BorderCloud commented Apr 9, 2019

@afs

Please update sparqlscore to work with RDF 1.1.

I would like to... but the test suite is written in RDF 1.0 (Turtle 1.0).
https://github.com/w3c/rdf-tests/blob/gh-pages/sparql11/data-sparql11/manifest-all.ttl

I'm not sure I understood the meaning of the sentence. Sparqlscore loads the Turtle 1.0 of the official test suite (compliant in theory with 1.1).

@afs
Collaborator

afs commented Apr 12, 2019

The issue for sparqlscore seems to be in the comparison of results. In RDF 1.1, simple strings and xsd:string are the same thing, and there is a preference for omitting the datatype. When running tests, the comparison step can handle that, even if a mix of simple strings and xsd:string occurs up to that point.
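
One way to make the comparison tolerant in this respect, sketched in Python. The `Literal` tuple and the function names are hypothetical helpers for illustration; they are not part of sparqlscore or of any RDF library.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    """Hypothetical in-memory form of an RDF literal."""
    lexical: str
    datatype: Optional[str] = None
    lang: Optional[str] = None


def normalize(lit: Literal) -> Literal:
    """Give a simple literal (no datatype, no language tag) the xsd:string datatype."""
    if lit.datatype is None and lit.lang is None:
        return Literal(lit.lexical, XSD_STRING, None)
    return lit


def same_literal(a: Literal, b: Literal) -> bool:
    """Compare two literals after normalization, as RDF 1.1 treats them."""
    return normalize(a) == normalize(b)


print(same_literal(Literal("abc"), Literal("abc", XSD_STRING)))
print(same_literal(Literal("abc"), Literal("abc", lang="en")))
```

Normalizing only inside the comparison leaves the engine's own output untouched, so an RDF 1.0 style response and an RDF 1.1 style response can both be accepted.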

@BorderCloud

BorderCloud commented Apr 12, 2019

@afs
A fix for one query engine in sparqlscore may be a new issue for another query engine. For the moment, I am waiting for the next version of SPARQL before changing TFT and SparqlScore.

I dream... In a future version of the test suite, the SPARQL results should be strictly the same for the same query on the same data for any query engine (and of course with the same protocol).

@kasei
Collaborator

kasei commented Apr 12, 2019

@BorderCloud surely it's better to support the current standard than keep outdated implementations appearing to pass while ensuring new implementations appear to fail? sparqlscore.com says:

SPARQLScore is an attempt to evaluate the conformance of triplestores to the W3C standards.

(Emphasis added.) I read that as implying the current standards, so if that's not what you're choosing to do, you might want to explicitly state as much.

I dream... In a future version of the test suite, the SPARQL results should be strictly the same for the same query on the same data for any query engine (and of course with the same protocol).

The nature of scheduling different working groups and their related standards will make your dream very difficult to achieve in practice. In practice, however, I think there is already broad consensus around the test suite and what counts as a conforming implementation.

@afs
Collaborator

afs commented Apr 12, 2019

@BorderCloud It will not invalidate a result from an RDF 1.0 based engine.

@kasei
Collaborator

kasei commented Apr 12, 2019

@afs

@BorderCloud It will not invalidate a result from an RDF 1.0 based engine.

I think that's true for everything except two tests. This rdf-tests commit explains the reasoning, and removes the old tests from the manifest list.

@BorderCloud

@afs @kasei
I checked the specification of the SPARQL 1.1 results format in XML.
https://www.w3.org/2007/SPARQL/result.xsd

The "datatype" attribute seems to be required (for RDF 1.0 or 1.1). There is no default type when the "datatype" attribute is absent.

@kasei
Collaborator

kasei commented Apr 12, 2019

The "datatype" attribute seems to be required (for RDF 1.0 or 1.1). There is no default type when the "datatype" attribute is absent.

I'm not sure what the problem is. Could you provide some more context?

Possibly helpful to this discussion, I'll point out that the RDF 1.1 Concepts and Abstract Syntax has this to say about literals:

Please note that concrete syntaxes may support simple literals consisting of only a lexical form without any datatype IRI or language tag. Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string. Similarly, most concrete syntaxes represent language-tagged strings without the datatype IRI because it always equals http://www.w3.org/1999/02/22-rdf-syntax-ns#langString.

@BorderCloud

BorderCloud commented Apr 12, 2019

I'm not sure this is the best place for this discussion... This is only one of the problems that still need to be explicitly specified in the next version.

@afs
Collaborator

afs commented Apr 12, 2019

The datatype attribute was not required in SPARQL 1.0.
2.3.1. Variable Binding Results

RDF Literal S
<binding><literal>S</literal></binding>

You are right there is no default datatype because in RDF 1.0 plain strings didn't have a datatype.
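
A minimal sketch of how a test runner might read such bindings, assuming it wants RDF 1.1 behavior, where a literal with neither a datatype nor a language tag is treated as xsd:string. The function name `read_rows` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

SR = "{http://www.w3.org/2005/sparql-results#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def read_rows(xml_text):
    """Return one dict per <result>, mapping variable name to a term tuple."""
    rows = []
    for result in ET.fromstring(xml_text.strip()).iter(SR + "result"):
        row = {}
        for binding in result.findall(SR + "binding"):
            term = binding[0]  # the <uri>, <bnode> or <literal> child
            kind = term.tag[len(SR):]
            text = term.text or ""
            if kind == "literal":
                lang = term.get(XML_LANG)
                datatype = term.get("datatype")
                # Assumption of this sketch: apply the RDF 1.1 reading,
                # where a plain literal has datatype xsd:string.
                if datatype is None and lang is None:
                    datatype = XSD_STRING
                row[binding.get("name")] = (kind, text, datatype, lang)
            else:
                row[binding.get("name")] = (kind, text)
        rows.append(row)
    return rows


SAMPLE = """
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="s"/></head>
  <results>
    <result><binding name="s"><literal>S</literal></binding></result>
    <result><binding name="s"><literal datatype="http://www.w3.org/2001/XMLSchema#string">S</literal></binding></result>
  </results>
</sparql>
"""

rows = read_rows(SAMPLE)
print(rows[0] == rows[1])
```

With this reading, a response that omits the attribute and one that spells out xsd:string produce the same parsed row.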

@VladimirAlexiev
Contributor Author

@BorderCloud I dream... In a future version of the test suite, the SPARQL results should be strictly the same for the same query on the same data for any query engine

This dream is neither realistic nor necessary. Query engines are allowed some flexibility eg

  • return extra triples for wildcard queries (eg GDB returns system ontology axiomatic triples, depending on installed ruleset)
  • vary result order, if ordering is not specified, or for all CONSTRUCT queries
  • vary the names of blank nodes
  • return or omit xsd:string, because it's the default

We need a more flexible comparator
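
A minimal sketch of such a comparator for small RDF graphs, assuming triples are plain tuples of strings and blank nodes are written as `_:label` (both are assumptions of this example, not something the test suite prescribes). It tries every renaming of blank nodes by brute force, and the hypothetical `allow_extra` flag accepts additional triples in the response. This is only workable for tiny graphs; a real runner would more likely rely on canonicalization or a proper isomorphism algorithm.

```python
from itertools import permutations


def is_bnode(term):
    return term.startswith("_:")


def bnodes(graph):
    """Sorted list of the blank node labels used in a set of triples."""
    return sorted({t for triple in graph for t in triple if is_bnode(t)})


def rename(graph, mapping):
    """Apply a blank node renaming to every term of every triple."""
    return {tuple(mapping.get(t, t) for t in triple) for triple in graph}


def matches(expected, actual, allow_extra=False):
    """True if some injective blank node renaming maps expected onto actual.

    With allow_extra=True, actual may contain additional triples.
    """
    exp_b, act_b = bnodes(expected), bnodes(actual)
    if len(exp_b) > len(act_b):
        return False
    if not allow_extra and len(expected) != len(actual):
        return False
    for image in permutations(act_b, len(exp_b)):
        renamed = rename(expected, dict(zip(exp_b, image)))
        if allow_extra:
            if renamed <= actual:
                return True
        elif renamed == actual:
            return True
    return False


P, Q = "<http://example/p>", "<http://example/q>"
expected = {("_:a", P, '"1"'), ("_:a", Q, "_:b")}
actual = {("_:x", P, '"1"'), ("_:x", Q, "_:y")}
with_extra = actual | {("<http://example/s>", P, '"2"')}

print(matches(expected, actual))
print(matches(expected, with_extra))
print(matches(expected, with_extra, allow_extra=True))
```

Result order never matters here because the graphs are sets; the literal and blank node flexibility has to be added explicitly, as above.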

@jindrichmynarz

We need a more flexible comparator

Comparing serialized results via byte-by-byte equality is brittle. Using a canonical serialization or testing result graph isomorphism helps, but as you mention above, there are still cases, in which we want to give query engines more leeway. In such cases, we can define looser tests via invariants (e.g., ASK queries on results expected to be true/false) or metamorphic relations (some input data permutations produce the same results).

@afs
Collaborator

afs commented Apr 14, 2019

@jindrichmynarz

jindrichmynarz commented Apr 14, 2019

Unordered SELECT results can be parsed as sets of hash-maps (I've done this here). Such data structure provides more fitting equality semantics.
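
A rough Python equivalent of that idea for the SPARQL 1.1 JSON results format. As a variation on plain sets, this sketch uses a multiset (`Counter`) so that duplicate solutions from queries without DISTINCT still count; the function names are hypothetical, and blank node labels are compared literally, which a real comparator would need to relax.

```python
import json
from collections import Counter


def solution_bag(results_json):
    """Parse a SPARQL JSON results document into a multiset of solutions."""
    doc = json.loads(results_json)
    bag = Counter()
    for binding in doc["results"]["bindings"]:
        # Each solution becomes an unordered, hashable set of
        # (variable, type, value, datatype, language) tuples.
        solution = frozenset(
            (var, term["type"], term["value"],
             term.get("datatype"), term.get("xml:lang"))
            for var, term in binding.items()
        )
        bag[solution] += 1
    return bag


def same_solutions(expected_json, actual_json):
    """Order-insensitive comparison of two SELECT result documents."""
    return solution_bag(expected_json) == solution_bag(actual_json)


EXPECTED = """{"head": {"vars": ["s", "n"]}, "results": {"bindings": [
  {"s": {"type": "uri", "value": "http://example/a"},
   "n": {"type": "literal", "value": "1",
         "datatype": "http://www.w3.org/2001/XMLSchema#integer"}},
  {"s": {"type": "uri", "value": "http://example/b"}}
]}}"""

# Same solutions, with rows and keys in a different order.
ACTUAL = """{"head": {"vars": ["s", "n"]}, "results": {"bindings": [
  {"s": {"value": "http://example/b", "type": "uri"}},
  {"n": {"datatype": "http://www.w3.org/2001/XMLSchema#integer",
         "value": "1", "type": "literal"},
   "s": {"value": "http://example/a", "type": "uri"}}
]}}"""

print(same_solutions(EXPECTED, ACTUAL))
```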

@afs
Collaborator

afs commented Apr 14, 2019

Yes - trying to avoid parsing the results in some way becomes more trouble than it's worth, effectively becoming a parser eventually. After all, XML and JSON allow layout variations and engines need room to deliver implementation choices and optimizations.

Sounds to me like something to be written up as a "Practice and Experience" note.

@VladimirAlexiev
Contributor Author

I had a chat with Nikolay Kolev, one of our leading testers.

  • We adopted all of the SPARQL conformance tests in our regression testing
  • We had to "fix" a number of the expected results to fit legitimate GDB behavior. We could provide the changes, but they can't be adopted as standard because that will cause false negatives in other repos
  • We adopted several of the TFT additions (eg https://github.com/BorderCloud/TFT-tests/tree/b0fc7769c72905bd8954d116b113aa116914a5dd/GO3/ERT-ART) in our regression testing, but corrected or eliminated some. Eg q05 asks for "dateTime - dateTime = X" (which we return as duration) and then "integer - X" (which we return as null): we eliminated this one.

@BorderCloud

@VladimirAlexiev Don't forget to insert also the tests about the protocol.
https://github.com/BorderCloud/rdf-tests/tree/withJmeter/sparql11/data-sparql11/protocol

@namedgraph

We have a rather basic test suite based on bash and curl:
https://github.com/AtomGraph/Processor/tree/master/http-tests

@BorderCloud

@namedgraph Great !

@gkellogg
Member

gkellogg commented Dec 7, 2020

Note that the RDF Test Suite Curation CG has taken on curation of RDF and SPARQL test suites, and there have been a number of additions and corrections.

https://github.com/w3c/rdf-tests/tree/gh-pages/sparql11.

Issues and PRs are welcome there. Of course, this is not official, as there is no active WG, but it has proven to be a useful resource for the community.

@BorderCloud

Hello

I updated my work on the SPARQL test suite (maybe for the last time?).

You can find here a draft report for SPARQL 1.1.
All results were produced using only GitHub and Travis CI.
I used only 3 databases (I do not have any sponsors to pay for the time needed to fix the SPARQL protocol of other SPARQL services).
I now use docker-compose to simplify deploying multiple SPARQL services simultaneously for the tests with federated queries.
JMeter is stable, and for the moment it is the best solution for developing and debugging the tests that will be needed once SPARQL services (one day) have an error protocol to respect during a transaction.
With Varnish, I can ignore the different protocols of SPARQL services, so for the moment I have disabled all my protocol tests and check only the language.
Without sponsors, I cannot properly check the tests for "Entailment Regimes" (see details).

With sponsors, I can develop all the protocol tests (query, update and error messages) and generate all possible combinations of the SPARQL services that really want to share the same protocol.

With this approach, the SPARQL 1.2 working group can remove words like "should be", "want to be", etc. from the specification and simply specify the tests for each feature in the official W3C repository.
No "bullshit"... only tests for everything and a report generated automatically by an independent entity.
In my opinion, if we can't test something, that thing shouldn't be in the final SPARQL 1.2 specification.

I have proven that it is possible to automate tests of the SPARQL protocol. It's time to recommend only one protocol in SPARQL 1.2.

I hope my work helps you build a better SPARQL.
