pyPEG – a PEG Parser-Interpreter in Python
Requires Python 3.x or 2.7
Older versions: pyPEG 1.x
Caveat: pyPEG 2.x is written for Python 3, so it accepts Unicode
strings only. You can use it with Python 2.7 by writing u'string'
instead of 'string', or by adding the following import (which is not
needed for Python 3):
from __future__ import unicode_literals
The samples in this documentation are written for Python 3, too. To execute them with Python 2.7, you'll need this import:
from __future__ import print_function
pyPEG 2.x supports new-style classes only.
A str instance as well as an instance of pypeg2.Literal is parsed
in the source text as a
Terminal Symbol.
It is removed and no result is put into the Abstract syntax tree.
If it does not exist at the correct position in the source text,
a SyntaxError is raised.
Example:
>>> from pypeg2 import *
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'
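The way a literal terminal is consumed can be pictured with a small standard-library sketch. This is not pyPEG's implementation: the helper name match_literal is hypothetical, and skipping leading whitespace is an assumption made for illustration.

```python
def match_literal(text, literal):
    """Consume `literal` at the start of `text` and return the rest.

    Hypothetical helper, not part of pyPEG: the literal itself adds
    nothing to the result, and a SyntaxError is raised if it is missing.
    """
    stripped = text.lstrip()  # assumption: leading whitespace is skipped
    if not stripped.startswith(literal):
        raise SyntaxError("expecting " + repr(literal))
    return stripped[len(literal):]


# The "=" is removed; only the text after it remains to be parsed.
rest = match_literal("=something", "=")
```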
str instances and pypeg2.Literal instances are output literally.
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'
pyPEG uses Python's re module. You can use plain Python regular
expression objects or the pypeg2.RegEx encapsulation. Regular
expressions are parsed as Terminal Symbols. The matching result is put
into the AST. If no match can be achieved, a SyntaxError is raised.
pyPEG predefines different RegEx objects:
| word | Regular expression for scanning a word. |
| restline | Regular expression for rest of line. |
| whitespace | Regular expression for scanning whitespace. |
| comment_sh | Shell script style comment. |
| comment_cpp | C++ style comment. |
| comment_c | C style comment without nesting. |
| comment_pas | Pascal style comment without nesting. |
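To give a feeling for what such objects look like, here is a sketch of comparable patterns written with the re module alone. The sketch_ names are hypothetical and the expressions are illustrative; the definitions pyPEG ships may differ in detail.

```python
import re

# Illustrative patterns only; pyPEG's own definitions may differ.
sketch_word = re.compile(r"\w+")
sketch_restline = re.compile(r".*")
sketch_comment_sh = re.compile(r"#.*")
sketch_comment_cpp = re.compile(r"//.*")
# DOTALL lets a C or Pascal comment span several lines. The lazy .*?
# stops at the first closing delimiter, so nesting is not supported.
sketch_comment_c = re.compile(r"/\*.*?\*/", re.DOTALL)
sketch_comment_pas = re.compile(r"\(\*.*?\*\)", re.DOTALL)

source = "/* outer /* inner */ rest */"
first_comment = sketch_comment_c.match(source).group(0)
```

Because the match ends at the first closing delimiter, first_comment covers only the text up to the inner "*/", which is what "without nesting" means in practice.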
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'
For RegEx objects, their corresponding value in the AST is output.
If this value does not match the RegEx, a ValueError is raised.
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'
A tuple or an instance of pypeg2.Concat specifies that different
things have to be parsed one after another. If not all of them parse
in sequence, a SyntaxError is raised.
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'
In a tuple, an integer may precede another thing. Such an integer
represents a cardinality. For example, to parse a word three times,
you can have as a grammar:
grammar = word, word, word
or:
grammar = 3, word
which is equivalent. There are special cardinality values:
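| 0, thing | thing is optional; generated by optional() |
| -1, thing | any number of things, including none; generated by maybe_some() |
| -2, thing | at least one thing; generated by some() |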
The special cardinality values can be generated with the Cardinality Functions. Other negative values are reserved and may not be used.
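The equivalence of grammar = 3, word and grammar = word, word, word can be sketched as a simple expansion step. The function expand_counts is hypothetical and only illustrates the idea; pyPEG itself interprets the integers while parsing rather than rewriting the grammar.

```python
def expand_counts(grammar):
    """Expand positive integer cardinalities in a tuple grammar.

    Hypothetical helper for illustration: (3, "word") becomes
    ("word", "word", "word"). Zero and negative values are special
    cardinalities and are passed through together with their element.
    """
    result = []
    items = iter(grammar)
    for item in items:
        if isinstance(item, int):
            thing = next(items)  # the element the integer applies to
            if item > 0:
                result.extend([thing] * item)
            else:
                result.extend([item, thing])
        else:
            result.append(item)
    return tuple(result)


expanded = expand_counts((3, "word"))
```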
For tuple instances and instances of pypeg2.Concat all attributes of
the corresponding thing (and elements of the corresponding collection
if that applies) in the AST will be composed and the result is
concatenated.
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'
A list instance which is not derived from pypeg2.Concat represents
different options. They're tested in their sequence. The first option
which parses is chosen, the others are not tested any more. If none
matches, a SyntaxError is raised.
Example:
>>> import re
>>> number = re.compile(r"\d+")
>>> parse("hello", [number, word])
'hello'
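The ordered choice described here can be sketched with plain regular expressions. The helper parse_choice is hypothetical and handles only compiled patterns as options, whereas a pyPEG list may contain any grammar element.

```python
import re


def parse_choice(text, options):
    """Return the match of the first pattern in `options` that matches.

    Hypothetical helper: the options are tried strictly in order, and
    once one succeeds the remaining ones are never looked at.
    """
    for pattern in options:
        match = pattern.match(text)
        if match:
            return match.group(0)
    raise SyntaxError("no option matches " + repr(text))


number = re.compile(r"\d+")
word = re.compile(r"\w+")
result = parse_choice("hello", [number, word])
```

The order of the options matters: for the input "42abc" the list [number, word] yields '42', while [word, number] yields '42abc', because \w also matches digits.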
The elements of the list are tried in sequence until one of them can
be composed. If none can, a ValueError is raised.
Example:
>>> letters = re.compile(r"[a-zA-Z]")
>>> number = re.compile(r"\d+")
>>> compose(23, [letters, number])
'23'
None parses to nothing. And it composes to nothing. It represents
the no-operation value.
Symbol(str)
Used to scan a Symbol.
If you're putting a Symbol somewhere in your grammar, then
Symbol.regex is used to scan while parsing. The result will be a
Symbol instance. Optionally it is possible to check that a Symbol
instance will not be identical to any Keyword instance. This can be
helpful if the source language forbids that.
A class which is derived from Symbol can have an Enum as its
grammar only. Other values for its grammar are forbidden and will
raise a TypeError. If such an Enum is specified, each parsed value
is checked for membership in this Enum in addition to the RegEx
matching.
| regular expression to scan, default |
| flag if a |
| name of the |
__init__(self, name, namespace=None)
Construct a Symbol with that name in namespace.
| if |
| if |
Parsing a Symbol is done by scanning with Symbol.regex. In our
example we're using the name() function, which is often used to parse
a Symbol. name() is equivalent to attr("name", Symbol).
Example:
>>> Symbol.regex = re.compile(r"[\w\s]+")
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this one=foo bar", Key)
>>> k.name
Symbol('this one')
>>> k
'foo bar'
Composing a Symbol is done by converting it to text.
Example:
>>> k.name = Symbol("that one")
>>> compose(k)
'that one=foo bar\n'
Keyword(Symbol)
Used to access the keyword table.
The Keyword class is meant to be instantiated for each Keyword of
the source language. The class holds the keyword table as a Namespace
instance. There is the abbreviation K for Keyword. The latter is
useful for instantiating keywords.
| regular expression to scan; default |
|
|
| name of the |
__init__(self, keyword)
Adds keyword to the keyword table.
When a Keyword instance is parsed, it is removed and nothing is put
into the resulting AST. When a Keyword class is parsed, an
instance is created and put into the AST.
Example:
>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
...
>>> k = parse("long", Type)
>>> k.name
'long'
When a Keyword instance is in a grammar, it is converted into a
str instance, and the resulting text is added to the result. When a
Keyword class is in the grammar, the corresponding instance in the
AST is converted into a str instance and added to the result.
Example:
>>> k = K("do")
>>> compose(k)
'do'
List(list)
A List of things.
A List is a collection for parsed things. It can be used as a base class
for collections in the grammar. If a List class has no class
variable grammar, grammar = csl(Symbol) is assumed.
__init__(self, L=[], **kwargs)
Construct a List, and construct its attributes from keyword arguments.
A List is parsed by following its grammar. If a List is parsed,
then all things which are parsed and which are not attributes are
appended to the List.
Example:
>>> class Instruction(str): pass
...
>>> class Block(List):
...     grammar = "{", maybe_some(Instruction), "}"
...
>>> b = parse("{ hello world }", Block)
>>> b[0]
'hello'
>>> b[1]
'world'
>>>
If a List is composed, then its grammar is followed and composed.
Example:
>>> class Instruction(str): pass
...
>>> class Block(List):
...     grammar = "{", blank, csl(Instruction), blank, "}"
...
>>> b = Block()
>>> b.append(Instruction("hello"))
>>> b.append(Instruction("world"))
>>> compose(b)
'{ hello, world }'
Namespace(_UserDict)
A dictionary of things, indexed by their name.
A Namespace holds an OrderedDict mapping the name attributes of the
collected things to their respective representation instance. Unnamed
things cannot be collected with a Namespace.
__init__(self, *args, **kwargs)
Initialize an OrderedDict containing the data of the Namespace. Arguments are put into the Namespace, keyword arguments give the attributes of the Namespace.
A Namespace is parsed by following its grammar. If a Namespace is
parsed, then all things which are parsed and which are not attributes
are appended to the Namespace and indexed by their name
attribute.
Example:
>>> Symbol.regex = re.compile(r"[\w\s]+")
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> class Section(Namespace):
...     grammar = "[", name(), "]", endl, maybe_some(Key)
...
>>> class IniFile(Namespace):
...     grammar = some(Section)
...
>>> ini_file_text = """[Number 1]
... this=something
... that=something else
... [Number 2]
... once=anything
... twice=goes
... """
>>> ini_file = parse(ini_file_text, IniFile)
>>> ini_file["Number 2"]["once"]
'anything'
If a Namespace is composed, then its grammar is followed and
composed.
Example:
>>> ini_file["Number 1"]["that"] = Key("new one")
>>> ini_file["Number 3"] = Section()
>>> print(compose(ini_file))
[Number 1]
this=something
that=new one
[Number 2]
once=anything
twice=goes
[Number 3]
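The indexing by name that a Namespace performs can be pictured with an OrderedDict alone. The class Named and the function collect are hypothetical stand-ins; the sketch shows only the bookkeeping, not parsing or composing.

```python
from collections import OrderedDict


class Named:
    """Hypothetical stand-in for a parsed thing with a name attribute."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


def collect(things):
    """Index things by their name attribute, keeping insertion order."""
    table = OrderedDict()
    for thing in things:
        if getattr(thing, "name", None) is None:
            raise ValueError("unnamed things cannot be collected")
        table[thing.name] = thing
    return table


section = collect([Named("once", "anything"), Named("twice", "goes")])
```

Keeping the insertion order is what allows the composed text to list the entries in the same sequence in which they were parsed or added.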
Enum(Namespace)
A Namespace which is treated as an Enum. Enums can only contain
Keyword or Symbol instances. An Enum cannot be modified after
creation. An Enum is allowed as the grammar of a Symbol only.
__init__(self, *things)
Construct an Enum using a tuple of things.
An Enum is parsed as a selection for possible values for a Symbol.
If a value is parsed which is not member of the Enum, a SyntaxError
is raised.
Example:
>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
...
>>> parse("int", Type)
Type('int')
>>> parse("string", Type)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pypeg2/__init__.py", line 382, in parse
t, r = parser.parse(text, thing)
File "pypeg2/__init__.py", line 469, in parse
raise r
File "<string>", line 1
string
^
SyntaxError: 'string' is not a member of Enum([Keyword('int'),
Keyword('long')])
>>>
When a Symbol is composed which has an Enum as its grammar, the
composed value is checked if it is a member of the Enum. If not, a
ValueError is raised.
>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
...
>>> t = Type("int")
>>> compose(t)
'int'
>>> t = Type("string")
>>> compose(t)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pypeg2/__init__.py", line 403, in compose
return parser.compose(thing, grammar)
File "pypeg2/__init__.py", line 819, in compose
raise ValueError(repr(thing) + " is not in " + repr(grammar))
ValueError: Type('string') is not in Enum([Keyword('int'),
Keyword('long')])
Grammar generator functions generate a piece of a grammar. They're
meant to be used in a grammar directly.
some(*thing)
At least one occurrence of thing, + operator. Inserts -2 as
cardinality before thing.
Parsing some() parses at least one occurrence of thing, or as many
as there are. If there are none, a SyntaxError is raised.
Example:
>>> w = parse("hello world", some(word))
>>> w
['hello', 'world']
>>> w = parse("", some(word))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pypeg2/__init__.py", line 390, in parse
t, r = parser.parse(text, thing)
File "pypeg2/__init__.py", line 477, in parse
raise r
File "<string>", line 1
^
SyntaxError: expecting match on \w+
Composing some() composes as many things as there are, but at least
one. If there is no matching thing, a ValueError is raised.
Example:
>>> class Words(List):
...     grammar = some(word, blank)
...
>>> compose(Words("hello", "world"))
'hello world '
>>> compose(Words())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pypeg2/__init__.py", line 414, in compose
return parser.compose(thing, grammar)
File "pypeg2/__init__.py", line 931, in compose
result = compose_tuple(thing, thing[:], grammar)
File "pypeg2/__init__.py", line 886, in compose_tuple
raise ValueError("not enough things to compose")
ValueError: not enough things to compose
>>>
maybe_some(*thing)
No thing or some of them, * operator. Inserts -1 as cardinality
before thing.
Parsing maybe_some() parses all occurrences of thing. If there
are none, the result is empty.
Example:
>>> parse("hello world", maybe_some(word))
['hello', 'world']
>>> parse("", maybe_some(word))
[]
Composing maybe_some() composes as many things as there are.
>>> class Words(List):
...     grammar = maybe_some(word, blank)
...
>>> compose(Words("hello", "world"))
'hello world '
>>> compose(Words())
''
optional(*thing)
Thing or no thing, ? operator. Inserts 0 as cardinality before thing.
Parsing optional() parses one occurrence of thing. If there
is none, the result is empty.
Example:
>>> parse("hello", optional(word))
['hello']
>>> parse("", optional(word))
[]
>>> number = re.compile(r"[-+]?\d+")
>>> parse("-23 world", (optional(word), number, word))
['-23', 'world']
Composing optional() composes one thing if there is any.
Example:
>>> class OptionalWord(str):
...     grammar = optional(word)
...
>>> compose(OptionalWord("hello"))
'hello'
>>> compose(OptionalWord())
''
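The three functions some(), maybe_some() and optional() differ only in how many occurrences they accept. A standard-library sketch of that idea follows; parse_repeated is a hypothetical helper, it works on a single regular expression only, and skipping whitespace between matches is an assumption.

```python
import re


def parse_repeated(text, pattern, minimum, maximum=None):
    """Match `pattern` repeatedly, skipping whitespace between matches.

    Hypothetical helper: minimum and maximum play the role of the
    cardinality, and maximum=None means unbounded.
    Returns (remaining_text, list_of_matches).
    """
    found = []
    while maximum is None or len(found) < maximum:
        stripped = text.lstrip()
        match = pattern.match(stripped)
        if not match or not match.group(0):
            break  # stop on failure or on an empty match
        found.append(match.group(0))
        text = stripped[match.end():]
    if len(found) < minimum:
        raise SyntaxError("expecting match on " + pattern.pattern)
    return text, found


word = re.compile(r"\w+")
some_words = parse_repeated("hello world", word, 1)[1]        # like some()
maybe_words = parse_repeated("", word, 0)[1]                  # like maybe_some()
optional_word = parse_repeated("hello world", word, 0, 1)[1]  # like optional()
```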
csl(*thing, separator=",")
csl(*thing)
Generate a grammar for a simple comma separated list.
csl(Something) generates
Something, maybe_some(",", blank, Something)
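What this grammar amounts to for parsing and composing can be sketched without pyPEG. The helpers parse_csl and compose_csl are hypothetical, they operate on one regular expression instead of arbitrary grammar elements, and unlike a real PEG parser the sketch does not backtrack over a trailing separator.

```python
import re


def parse_csl(text, pattern, separator=","):
    """Parse `item (separator item)*`, skipping whitespace around items.

    Hypothetical helper. Returns (list_of_items, remaining_text).
    """
    items = []
    text = text.lstrip()
    while True:
        match = pattern.match(text)
        if not match:
            raise SyntaxError("expecting match on " + pattern.pattern)
        items.append(match.group(0))
        text = text[match.end():].lstrip()
        if not text.startswith(separator):
            return items, text
        text = text[len(separator):].lstrip()


def compose_csl(items, separator=","):
    """Compose items with the separator followed by one blank."""
    return (separator + " ").join(items)


word = re.compile(r"\w+")
items, rest = parse_csl("hello, world", word)
text = compose_csl(items)
```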
attr(name, thing=word, subtype=None)
Generate an Attribute with that name, referencing the thing. An
Attribute is a namedtuple("Attribute", ("name", "thing")).
| reference to |
An Attribute is parsed following its grammar in thing. The result
is not put into another thing directly; instead the result is added as
an attribute to the containing thing.
Example:
>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
...
>>> class Parameter:
...     grammar = attr("typing", Type), blank, name()
...
>>> p = parse("int a", Parameter)
>>> p.typing
Type('int')
An Attribute is composed following its grammar in thing.
Example:
>>> p = Parameter()
>>> p.typing = K("int")
>>> p.name = "x"
>>> compose(p)
'int x'
flag(name, thing=None)
Generate an Attribute with that name which is valued True or
False. If no thing is given, Keyword(name) is assumed.
A flag is usually a Keyword which can be there or not. If it is
there, the resulting value is True. If it is not there, the resulting
value is False.
Example:
>>> class BoolLiteral(Symbol):
...     grammar = Enum( K("True"), K("False") )
...
>>> class Fact:
...     grammar = name(), K("is"), flag("negated", K("not")), \
...             attr("value", BoolLiteral)
...
>>> f1 = parse("a is not True", Fact)
>>> f2 = parse("b is False", Fact)
>>> f1.name
Symbol('a')
>>> f1.value
BoolLiteral('True')
>>> f1.negated
True
>>> f2.negated
False
If the flag is True, the grammar is composed. If the flag is False,
nothing is composed.
Example:
>>> class ValidSign:
...     grammar = flag("invalid", K("not")), blank, "valid"
...
>>> v = ValidSign()
>>> v.invalid = True
>>> compose(v)
'not valid'
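The behaviour of a flag in both directions can be reduced to a pair of small functions. Both helpers are hypothetical, and the input is assumed to be already split into words, which sidesteps the scanning that pyPEG does.

```python
def parse_flag(words, keyword):
    """Return (flag_value, remaining_words) for an optional keyword.

    Hypothetical helper: the flag is True if the keyword is the next
    word, in which case the keyword is consumed.
    """
    if words and words[0] == keyword:
        return True, words[1:]
    return False, words


def compose_flag(value, keyword):
    """Emit the keyword only when the flag is set."""
    return keyword if value else ""


negated, remaining = parse_flag(["not", "True"], "not")
plain, untouched = parse_flag(["False"], "not")
```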
name()
Generate a grammar for a Symbol with a name. This is a shortcut for
attr("name", Symbol).
ignore(*grammar)
Ignore what matches to the grammar.
Parse what's to be ignored. The result is added to an attribute
named "_ignore" + str(i) with i as a serial number.
Compose the result as with any attr().
indent(*thing)
Indent thing by one level.
The indent function has no meaning while parsing. The parameters are
parsed as if they were in a tuple.
While composing, the indent function increases the level of indentation.
Example:
>>> class Instruction(str):
...     grammar = word, ";", endl
...
>>> class Block(List):
...     grammar = "{", endl, maybe_some(indent(Instruction)), "}"
...
>>> print(compose(Block(Instruction("first"), \
... Instruction("second"))))
{
    first;
    second;
}
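How an indentation level turns into leading whitespace can be sketched as follows. The function compose_block is hypothetical, and the width of four spaces per level is an assumption chosen for the illustration.

```python
def compose_block(instructions, level=0, indent="    "):
    """Compose a brace block whose body is one level deeper.

    Hypothetical helper: every line is prefixed with `indent` repeated
    once per indentation level, and each line ends with a linefeed.
    """
    lines = [indent * level + "{"]
    for instruction in instructions:
        lines.append(indent * (level + 1) + instruction + ";")
    lines.append(indent * level + "}")
    return "\n".join(lines) + "\n"


text = compose_block(["first", "second"])
```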
contiguous(*thing)
Temporarily disable automated whitespace removal while parsing thing.
While parsing, whitespace removal is disabled. That means that if
whitespace is not part of the grammar, whitespace found between the
parsed objects leads to a SyntaxError.
Example:
class Path(List):
    grammar = flag("relative", "."), maybe_some(Symbol, ".")
class Reference(GrammarElement):
    grammar = contiguous(attr("path", Path), name())
While composing the contiguous function has no effect.
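The difference between parsing with and without automated whitespace removal can be shown with a tiny sequence matcher. The helper parse_sequence and its skip_whitespace switch are hypothetical; they only imitate the effect and are not how pyPEG is implemented.

```python
import re


def parse_sequence(text, patterns, skip_whitespace=True):
    """Match patterns one after another and return the matched texts.

    Hypothetical helper: skip_whitespace=False imitates contiguous
    parsing, where whitespace between elements is an error unless a
    pattern matches it explicitly.
    """
    found = []
    for pattern in patterns:
        if skip_whitespace:
            text = text.lstrip()
        match = pattern.match(text)
        if not match:
            raise SyntaxError("expecting match on " + pattern.pattern)
        found.append(match.group(0))
        text = text[match.end():]
    return found


word = re.compile(r"\w+")
dot = re.compile(r"\.")
separated_result = parse_sequence("a . b", [word, dot, word])
try:
    parse_sequence("a . b", [word, dot, word], skip_whitespace=False)
    contiguous_failed = False
except SyntaxError:
    contiguous_failed = True
```

With whitespace removal switched off, only the input without spaces ("a.b") is accepted by the same list of patterns.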
separated(*thing)
Temporarily enable automated whitespace removal while parsing thing.
Whitespace removal is enabled by default. This function re-enables it
temporarily after it was disabled with the contiguous function.
While parsing, whitespace removal is enabled again. That means that whitespace found between parsed objects is skipped if it is not part of the grammar.
While composing, the separated function has no effect.
omit(*thing)
Omit what matches the grammar. This function cuts out thing and
throws it away.
While parsing omit() cuts out what matches the grammar thing and
throws it away.
Example:
>>> p = parse("hello", omit(Symbol))
>>> print(p)
None
>>> _
While composing omit() does not compose text for what matches the
grammar thing.
Example:
>>> compose(Symbol('hello'), omit(Symbol))
''
>>> _
Callback functions are called while composing only. They're ignored while parsing.
blank(thing, parser)
Space marker for composing text.
blank outputs a space character (ASCII 32) when called.
endl(thing, parser)
End of line marker for composing text.
endl outputs a linefeed character (ASCII 10) when called. The
indentation system reacts when reading endl while composing.
callback_function(thing, parser)
Arbitrary callback functions can be defined and put into the grammar.
They will be called while composing.
Example:
>>> class Instruction(str):
...     def heading(self, parser):
...         return "/* on level " + str(parser.indention_level) \
...             + " */", endl
...     grammar = heading, word, ";", endl
...
>>> print(compose(Instruction("do_this")))
/* on level 0 */
do_this;
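The general idea of calling functions found in a grammar while composing can be sketched without pyPEG. Everything below is hypothetical: compose_items is a stand-in for the composer, the local blank and endl merely imitate the markers of the same name, and callbacks are assumed to return plain text, whereas the callback in the example above returns a tuple that contains endl.

```python
def compose_items(thing, grammar, parser=None):
    """Concatenate the text produced by each grammar item.

    Hypothetical helper: strings are emitted literally, and callables
    are invoked as callback(thing, parser) and must return text.
    """
    parts = []
    for item in grammar:
        if callable(item):
            parts.append(item(thing, parser))
        else:
            parts.append(str(item))
    return "".join(parts)


def blank(thing, parser):
    return " "  # space marker


def endl(thing, parser):
    return "\n"  # end of line marker


def shout(thing, parser):
    # an arbitrary user-defined callback
    return str(thing).upper()


text = compose_items("do_this", ("call", blank, shout, ";", endl))
```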
If one of the following methods is present in a grammar element, it overrides the standard behaviour.
parse(cls, parser, text, pos)
Overrides the parsing behaviour. If present, this class method is called instead of automatic parsing at each place where the grammar references the grammar element.
| cls | class object of the grammar element |
| parser | parser object which is calling |
| text | text to be parsed |
|
|
compose(cls, parser)
Overrides the composing behaviour. If present, this class method is called instead of automatic composing at each place where the grammar references the grammar element.
| cls | class object of the grammar element |
| parser | parser object which is calling |