Tuesday, June 25, 2013

A Simplest Triad Calculus - DRAFT

Series:  RDF Investigations

Overview

For the moment, please treat this as a kind of proof-of-concept.  It's intended to have enough detail to make the design concept clear, without bogging down too much in details at this point.  For example, I'm disregarding literals for now, because they are inessential to the basic deductive system.  I guess the basic idea is that we ought to be able to use some variation of Natural Deduction to express just about everything needed to define RDF and RDFS clearly and concisely.  Among other things this should allow us to eliminate stuff like "instance of a graph" and "lean graph" (described in RDF Semantics); the former is just not needed (being handled by derivation rules involving variables), while the latter seems to correspond to the concept of normalization, in the sense of a set (sequence) of independent triads, no one of which can be derived from the others.  To get started I'm just sketching out a basic calculus to suggest what shape the final thing would have.

Also, the notes amount to a sort of tutorial/cry-for-help in that I try to explain what I'm doing and why, rather than just stating definitions as one might do in a formal definition.

Meta-language

  • Constants:
    • All constants in the object language are also constants in the meta-language.
    • \(\exists\):  existential quantifier
    • \(\land\): logical and
    • \(\phi\): a formula
    • \(\Phi\): set of formulae
    • \(\vdash\):  derivation relation
    • \(\models\):  entailment (logical consequence) relation
    • \(\leadsto\);  informal justification
    • \(\Rightarrow\):  rewrite relation; like derivation, only from object language to meta-language.
  • Formulae
    • Quantification
      • \(\exists x.\!x\phi\) has its usual FOL meaning.
    • Inference.  We treat inference as a genus with three species:
      • \(\Phi \vdash \phi\):  the formulae \(\Phi\) derives the formula \(\phi\); \(\phi\) is derivable or deducible from \(\Phi\)
      • \(\Phi \models \phi\): the formulae \(\Phi\) entail the formula \(\phi\); \(\phi\) is a (logical) consequence of \(\Phi\).
      • \(\Phi \leadsto \phi\): the meanings of \(\Phi\) justify or lead to the meaning \(\phi\); \(\phi\) follows from of \(\Phi\) (semantically).  (NB: this is an informal notion we will use to explicate the consequence relation.)
Note that the meta-language contains a few constants from First-Order Logic (FOL): \(\exists, \land, \vdash, \models\).  They allow us to say useful things about object languages and how they work.  As it stands it is not complete; this is just enough to get the ideas across.

\(\displaystyle\mathbb{T}_\varnothing\)  (T-null)

  • Symbols
    • Constants:   \(\{A\ldots Z\} \cup \{a,\ldots z\}\cup \{., ``,"\}\)
  • Term calculus:  every symbol is a term.
  • Formula calculus:
    • If \(A\), \(B\), and \(C\) are terms, then \(ABC\) is a formula.
    • If \(A\), \(B\), and \(C\) are terms, then each of the following is a formula:
      • \(xBC\), \(AxC\), \(ABx\)
      • \(xyC\), \(xBy\), \(Axy\)
      • \(xyz\)
    • If \(ABC\) and \(DEF\) are formulae, then \(ABC, DEF.\) is a formula.
So far, \(\mathbb{T}_\varnothing\) - call it T-null - is a minimal language to say the least.  It has a very limited number of forms, and it has no semantics whatsoever.  In other words, it's not really a language at all; we could call it an uninterpreted calculus, except that you can't do any calculating in it.  However we care to classify it, we can attach any semantics we care to it by assigning meanings to its symbols, but since no such assignments are fixed the result has no stability - you can always change any assignment, so there isn't much point in making one in the first place.

We can express this in terms of expressive strength in a way that will turn out to be useful as we proceed.  \(\mathbb{T}_0\) is the weakest possible language, because it has no expressive stability.  You can't say anything in it that is guaranteed to mean the same thing under different interpretations.

\(\mathbb{T}_\varnothing\) corresponds to RDF without blank nodes and literals.
Formulae like \(ABC\) obviously correspond structurally to RDF triples; to maintain a clean separation between this language and RDF languages, we will call such forms triads.   We might occasionally be tempted to call the constants trions, but we hope not.


    \(\displaystyle\mathbb{T}_0\)  (T-basic)

    What can we do to remedy this sorry state of affairs?  Let's start by giving the speakers of this little language the ability to express things like "somebody killed Caesar" and "Cicero said something".  While we're at it, let's allow them to say "Somebody killed Caesar and Cicero said something".

    As to "somebody" ("something", etc.): the usual way this is done is by extending the language.  One adds some symbols and the axioms that fix their meanings.  In this case, we could add the existential quantifier \(\exists\) to the object language and, if we wanted to be fastidious, define rules in the meta-language to determine its meaning ("there exists").  This would allow users of the language to "say" things of the form \(\exists x.\!xBC\); if you set B = killed and C = Caesar, and this expresses "somebody killed Caesar".

    But since we want to track RDF, whose symbol set does not include anything corresponding to \(\exists\), we will take a different tack, one that corresponds to the way things are done in the RDF ecology: instead of extending the language, we strengthen it.

    Note: there's more to say about extending/strengthening; in particular such changes must be conservative, that is, they cannot break what is already there.  More generally: we have a language, a world, and a stable but partial interpretation.  Fix any two and vary the third.

    There are two ways to do this.  One is to pick some constants that are already in the language and give them a fixed meaning.  This is the technique we will use later on when we accomodate the various "entailment regimes" of RDF.  This kind of strengthening is equivalent to extending the set of constants in the language with fixed meanings (call them the fixed constants for short).  Having more fixed constants means more expressive power; it allows the speaker to say more (kinds of) things.

    The other way to strengthen a language, which we will now use, is to reclassify some subset of its constants.  Notice that \(\mathbb{T}_\varnothing\) contains only one kind of symbol.  We will change the language not by adding symbols, but by reclassifying the symbols \(x, y,\) and \(z\) into a new class of  Variables.  Unfortunately, that alone would have no effect; something additional needs to be introduced in order to make such reclassification meaningful.  In this case, that means adding rules to the meta-language that fix the meanings of the formulae in which they appear.  That last clause - "meanings of the formulae" - is critical, since with this technique we do not fix the meanings of the variable symbols themselves.  On the contrary, quantified variables of this sort will never be assigned a determinate meaning.  Their meaning is essentially tied up with the meaning of quantification, which in turn is essentially connected with the composite forms in which the variables appear.

    The new meta-language rules fixing the meanings of formulae containing variables will be expressed as rewrite rules.  A rewrite rule is essentially the same as a syntactic derivation or transformation rule; it says how one form can be mapped to another.  The difference is that derivation rules work within the object language (i.e. they're closed), whereas rewrite rules take forms from the object language to forms in the metalanguage.


    • Symbols 
      • Constants:   \(\{A\ldots Z\} \cup \{a,\ldots z\}\!\setminus\!\{x, y, z\} \cup \{., ``,"\}\)
      • Variables: \(\{x, y, z\}\)
    • Meta-language Rewrite Sublimation Rules
      1. \(ABC\Rightarrow ABC\)
      2. \(, \Rightarrow\land\)
      3. \(xBC \Rightarrow \exists x.\!xBC\) and similarly for AxC and ABx.  We treat any formula in the object language that contains variables as an existentially quantified formula.
      4. \(xyC \Rightarrow \exists x\exists y.\!xyC\). 
      5. \(xyz \Rightarrow \exists x\exists y\exists z.\!xyz\). 
      6. \(xBC, xEF \Rightarrow \exists x.\!(xBC\land xEF)\) and similar for two and three variables.
      7. \(xBC, yEF \Rightarrow \exists x\exists y.\!(xBC\land yEF)\) and similar for two and three variables.
      8. etc. (existential rewrites shown by example rather than general rule)
    CAVEAT: it may be tempting to view the rewrite rules as a kind of semantic interpretation.  That would be a tragic mistake.  They are purely syntactic rules.
    For example, if we have \(ABC, xEF\), rewrite rules 1-3 ``sublimate'' this to \(ABC\land\exists x.\!xEF\).
    Note: as it stands this calculus is not complete, because it does not handle the scope of quantification for all cases.  For example it does not tell us how to get from \(xBC, DEF, GHy\) to \(\exists x\exists y.\!(xBC\land~DEF\land~GHy)\).  But I hope what's there is sufficient to get the idea across.

    This is a stronger language, but it still lacks a highly desirable bit of logic: it does not yet support any form of inference.  We cannot go from a premise to a conclusion.  In particular, even though we know how to construct (legal) formulae containing ',' and \(\land\), nothing tells us when it is appropriate to do so.  For example, if you know that grass is green, and you also know (separately, as it were) that snow is white, then you ought to be able to package those two bits of knowledge into a single whole and say "grass is green and snow is white".  Intuitively that's obviously what we do; but for a calculus, we need an explicit rule.  Similarly for \(\exists\): if we know the specific person (thing) that exists, then we can say that some person or thing exists.  But we need a formal rule for this.

    We could strengthen \(\mathbb{T}_0\) and call it a new language, but in the interest of conciseness let's just define \(\mathbb{T}_0\) as including support for a basic form of inference.  It is intuitively obvious that if "Brutus killed Caesar" is true, then so is "somebody killed Caesar".  Going from the former to the latter is a basic species of inference.  Similarly, if you know "Somebody killed Caesar", and you also know (perhaps you were told) that Brutus killed Caesar, then you no longer need "Somebody killed Caesar".  Since Brutus is somebody, you lose nothing if you discard "Somebody killed Caesar" and retain "Brutus killed Caesar".  This too is a form of inference.

    To support this in \(\mathbb{T}_0\) we add inference rules governing the introduction and elimination of terms like "somebody".  To start, we have only two such terms: \(\exists\) ("there exists") corresponding to "somebody" (or "something", etc.) and \(\land\) ("and").  [In the case of \(\exists\) we're abusing the terminology a little bit, since \(\exists\) is not part of the object language.  Our introduction rule will actually license introduction of variables in triad constructions, which in turn are mapped to \(\exists\) expressions in the meta-language.
    • \(\land\)-Introduction: \(\displaystyle{\Phi\quad \phi\over \Phi,\ \phi}\)
    • \(\land\)-Elimination: 
      • \(\displaystyle{\Phi,\ \phi\over\Phi}\)
      • \(\displaystyle{\Phi,\ \phi\over\phi}\)
    • \(\exists\)-Introduction
      • \(ABC\vdash xBC\) ...similar for \(AxB, ABx\) etc.
      • \(ABC\vdash xyC\) ...
      • \(ABC\vdash xyz\)
    • \(\exists\)-Elimination
      • \({ABC\quad xBC\over ABC}\qquad or\qquad  {\Phi\quad \Phi x\over\Phi}  \)


    RDF:  \(\exists\)-Elimination is basically the counterpart to "instancing" in RDF-Semantics.

  • NOTE: I'm not sure how to write \(\exists\)-Elimination yet.  Maybe something like one of these?   TODO: explicate significance of elimination rules for readers unfamiliar with this style of logic.


    [NB: if we were being really fastidious we would give each of  these rules a distinct name, but that seems like too much bother for now.  And why not just call them \(\exists\)\!I rules, since that's what they mean?]
    CAVEAT: this works fine for one triad, but does not work for a sequent of triads.  The single triad rule would allow e.g. \(ABC,ADE\vdash xBC,yDE\).  We need a rule to express the requirement of uniform substitution, e.g. \(ABC,ADE\vdash xBC,xDE\).
    Alternative format:

    \(ABC\over xBC\)\(ABC\over AxC\)\(ABC\over ABx\)
    \(ABC\over xyC\)\(ABC\over Axy\)\(ABC\over xBy\)
    \(ABC\over xyz\)
    [NB: can we generalize? E.g.  \(\Phi a\over\Phi x\) , reading \(\Phi a\) to mean a collection of triads containing one or more constants a so the rule says replace 'em all with a variable.  For a single triad, \(\phi a\over\phi x\).]

    The intuitive way to interpret such rules is simple:  if you have \(ABC\), then you are licensed to introduce a var in place of any of the constants.  So "introduce a var" is the formal counterpart of "say 'somebody' (instead of 'Brutus')".  [NB:  "Var-introduction" actually doesn't quite work, since it involves substitution, but it's close enough.]
    [TODO:  explicate: are the inference rules part of the language definition, or are they in the meta-language?  Intuitively they seem to be like the rules of the formula calculus; but they use the symbol \(\vdash\), which we do not want to admit to the language.  So they must be part of the meta-language.  Remember, the meta-language "includes" the object language.]

    Historical note:  this rule is cribbed directly from the great Gerhard Gentzen, who first came up with the idea of introduction and elimination rules as determining the meanings of the logical constants.  But his \(\exists\)-Introduction rule is different than our Var-Introduction rule, since his object language included they symbol \(\exists\) and ours does not.  His looked like something this: \[{Pa\over\exists x.Px}\ {\scriptstyle\!\exists\!-\!I}\]

    where the \(\exists\!-\!I\) on the right names the rule.

    Normal Form
    Informally, a collection of triads is in normal form if the triads are independent; that is, if none of them can be derived from the others.
    RDF:  as far as I can tell, the concept of "leanness" introduced in RDF-Semantics boils down to normal form.  It is expressed in terms of graphs, instances, and proper subgraphs, but what a "lean" graph amounts to is one whose triples are independent - the semantic domain counterpart of normal form.


    Literals

    If we want our little Triad calculus to reflect RDF, we obviously need to support literals.  I'll get to it eventually.


    \(\displaystyle\mathbb{T}_1\)  (RDF-entailment)



    ...todo

    \(\displaystyle\mathbb{T}_2\)  (RDFS-entailment)

    ...todo

    Commentary:

    So far, this has nothing to do with RDF per se, except insofar as it reflects its structure.  But the symbol set has been chosen with an eye toward the RDF family of languages we will define using this calculus.  Informally:
    • Our set of constant symbols \(\{A, B,\ldots\}\cup \{a, b,\ldots\}\)  corresponds to the set of IRIs
    • To define a logical language corresponding to RDFS, OWL, etc. we will fix the meanings of subsets of the constants, e.g. \(t\) corresponds to \(\mathrm{rdf:type}\)
    • Variable symbols will be used to capture the RDF concept of ``blank node''.
    We intentionally limit the symbol set so that we can write triads concisely.  If we our term calculus allow e.g. "Foo" as a term, we would need to be able to write "Foo Bar Baz".  Which isn't all that bad, but ABC is a whole lot more concise, and since we're only concerned with inference we don't need to be able to construct human-readable terms like "Brutus".  This is supposed to be a meta-language, after all. 

    For example, to accomodate RDFS, we can define inference schemata such as:
    • \(AtB, BkC\vdash AtC\).   
    This is a deduction rule, not an entailment rule, although it corresponds to RDFS entailment rule rdfs9 in RDF Semantics.  The critical point here is thatthe complete set of such inference schemata will serve to fix the meanings of the constants involved in them - in this case, rdf:type (t) and rdfs:subClassOf (k).  For such constants, at least, no model theoretic semantics is needed.

    Sunday, June 23, 2013

    RDF Investigations

    First in a series of articles exploring the True Nature(s) of RDF.

    "Why, oh why?" you cry.  Mainly because it's fun; it inevitably leads to investigation of lots of very interesting topics not only in logic but in philosophy, computer science, and linguistics.

    But also because there is at least a slim chance that it may prove useful.  Actually I think there's a very good chance.  It may not prove useful for implementers who have already mastered the official definition(s); but for the newcomer, the official docs are dauntingly complex, obscure, quasi-incoherent, and generally hard to read.  On top of that, the reigning idiom in which the so-called "semantic web" is discussed is largely, by turns, vague, counter-intuitive, sloppy, propagandistic, ... [your adjective here].  A careful investigation of RDF might well turn up language that eliminates such problems for the general reader.

    The goal here is not to provide a yet another Guide to RDF.  On the contrary: the first question to ask is whether the official definition of RDF, not to mention prevalent informal ways of talking about it, are adequate to the task. At the center of RDF is a collection of concepts - triples, graphs, statements, inference, etc., which the official definition regiments into a quasi-formal definition.  One question is whether this official definition adequately captures the intuitive understandings that lead to RDF in the first place.  It would be a mistake to take the official definition of this collection of ideas as the only or even the best way of thinking about them, just as it would be a mistake to think that one particular logical formalism (e.g. classical, intuitionistic, game-theoretical, quantum, etc.) is the only or best way of capturing the essential properties of logical consequence or logical truth.  There are always multiple ways to think about things; and sometimes, adoption of one way over another has real consequences.

    In fact, careful investigation of RDF will show that the official definition is inadequate in two fundamental ways.  First, it restricts interpretation to a single semantic domain.  This is unnecessary; there are many different ways of thinking about RDF - concrete "theories" of RDF - that can serve as the basis of models of RDF, just as there are many different concrete groups that can serve as models of formalized group theory.  Furthermore, restricting interpretation to a single specific theory of RDF obscures the point of model theory, which is fundamentally concerned with how a fixed language can be used to generalize about aspects of structure common to a variety of concrete mathematical domains.

    The second point of inadequacy is the lack of a complete RDF calculus.  For model theory to work, you need three separate calculi, which together comprise the calculus of the language: rules for constructing formal proofs (deductions).

    • term calculus specifies how terms are to be constructed from elementary symbols.  For RDF, this is provided by the rules of IRI syntax plus RDF-specific rules for the syntax of literals.
    • formula calculus specifies how formulae are to be constructed from terms.  In RDF, a formula is called a statement, and a set of formulae is called a graph.  The so-called abstract syntax described  in RDF Concepts and Abstract Syntax serves as the formula calculus, but it is incomplete.  It specifies that a triple (statement) "contains" three terms (nodes), and that an RDF graph is "a set of triples".  But these are not rules of a calculus; they do not tell us how to construct statements in a formal language.
    • An inferential calculus specifies how proofs or deductions are to be constructed from formulae and symbols.  The official definition of RDF does not specify such a calculus.
    Lack of a complete calculus means that the model-theoretic interpretation of RDF in RDF Semantics is essentially incomplete.  It does provide an account of semantic entailment (not to be confused with logical entailment), but in the absence a calculus we can use to construct formal statements and inferences (deductions), we have no means of making use of such entailments.  The business of model theory is to build a bridge between formal calculi and (informal) semantic domains.  You don't need a formal representation of the semantic domain, but you do need a formal calculus.  Viewed from the computational rather than mathematical perspective, this is the point of model theory, since it makes automated proof a legitimate idea.  No model without a calculus.

    This analysis suggests a radically different strategy for defining RDF: start with a syntactic calculus, and demonstrate several different models for the sentences generated by the calculus.  The result is a reconceptualization of RDF as a structure common to several concrete mathematical theories.  The practical consequence is that implementers are freed from the need to select a particular model; they can concentrate solely on the syntax.  Users too can dispense with the need to understand a theory of RDF, and can focus on understanding the deductions licensed by the calculus.  In particular, it is not necessary to conceptualize RDF as a graph, nor is it necessary to think of the middle term of a triple as denoting a relation between the first and third terms.

    That's just the formal side of RDF; a distinct but related investigation will address the notion that RDF "statements" can be construed as assertions, and that IRIs denote real-world entities.

    Program

    1. Formalize the official definition of RDF.  This means both defining a calculus (syntax), and describing the semantic aspect of RDF as a mathematical domain.
    2. Investigate alternative concrete RDF theories.  Some possibilities:
      1. Set theoretic with binary relations.  This is the domain used in the official RDF semantics, which treats the first and third elements of an RDF statement as members of a set of Resources, the second term as a member of a set of Properties, together with a mapping from the Property to a relation containing the pair of Resources.  Entailment comes automatically, from the axioms of set theory.  No 3-tuples are defined as part of the semantic domain.
      2. Set theoretic with cartesian 3-tuples and entailment axioms.  This would dispense with binary relations and account for entailments axiomatically.
      3. Category theoretic, with triples represented as product constructions in the category RDF whose objects are Nodes (or Resources, or whatever).  Here a triple is a triple of arrows rather than objects.  Entailments are expressed as axiomatic commutative diagrams.
    3. Investigate alternative calculi.  Investigate ways to formalize the rdf: and rdfs: constants defined in RDF Schema; what axioms are needed, and what form should they take?
    4. Other?
    5. Investigate what it means to say that IRIs and RDF statements have any sort of meaning beyond the mathematical meanings investigated above.  What is it for an IRI to denote anything at all, let alone a real-world entity?  What is the relation between the idea that an RDF triple amounts to a statement, that it is asserted?  How are such meanings related to the pragmatics of how the web is actually used?  Etc.  This is a very big can of worms.