Litexte is a Textile parser that I wrote to see how far I could go with regex patterns and to exercise my lazy right brain. In this blog post I'll illustrate how to create a parser for Textile in Ruby.
Textile is a lightweight markup language like Markdown. RedCloth is a well-known library for parsing Textile markup in Ruby. But what's the fun in using an existing library? Let's weave our very own Textile parser with lots of regex awesomeness, just for fun.
Check out this Textile quick reference written by _why. Click here for a sample Textile input, which will be used to build the parser.
Textile markup can be categorized and parsed in the following order:
1. Headers and Blockquotes
Headers and blockquotes are written as a begin marker (h1, h2, h3 or bq), followed by a period and the content, which runs to the end of the line.
input.gsub!(/h1\.(.*?)\n/,'<h1>\1</h1>')
input.gsub!(/h2\.(.*?)\n/,'<h2>\1</h2>')
input.gsub!(/h3\.(.*?)\n/,'<h3>\1</h3>')
input.gsub!(/bq\.(.*?)\n/,'<blockquote>\1</blockquote>')
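The four rules above differ only in the marker and the tag, so they can also be folded into one substitution. Here is a minimal sketch of that idea. The names BLOCK_TAGS and parse_blocks are made up for this example and are not part of Litexte, and unlike the rules above it anchors at the start of the line and leaves the newline in place.

```ruby
# Hypothetical lookup table (not from Litexte): Textile marker => HTML tag
BLOCK_TAGS = { 'h1' => 'h1', 'h2' => 'h2', 'h3' => 'h3', 'bq' => 'blockquote' }.freeze

# Hypothetical helper: handles all block markers in a single pass
def parse_blocks(input)
  # ^ anchors at the start of a line, [ \t]* drops the spaces after the period,
  # and $ stops before the newline so the line break survives
  input.gsub(/^(h[1-3]|bq)\.[ \t]*(.*?)$/) do
    tag = BLOCK_TAGS[$1]
    "<#{tag}>#{$2}</#{tag}>"
  end
end

puts parse_blocks("h1. Title\nbq. A quote\nplain text\n")
# <h1>Title</h1>
# <blockquote>A quote</blockquote>
# plain text
```

Keeping the newline is a design choice: later line-based rules can still rely on ^ to find the start of a line.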
2. Delimited tags
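Delimited markups have a begin and end marker with the content in the middle. Most Textile markups fall into this category. The two-character markers (__, ** and ??) are replaced before the single-character ones, so that ** is not mistaken for a pair of * markers.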
input.gsub!(/\_{2}(.*?)\_{2}/,'<i>\1</i>')
input.gsub!(/\*{2}(.*?)\*{2}/,'<b>\1</b>')
input.gsub!(/\?{2}(.*?)\?{2}/,'<cite>\1</cite>')
input.gsub!(/\_{1}(.*?)\_{1}/,'<em>\1</em>')
input.gsub!(/\*{1}(.+?)\*{1}/,'<strong>\1</strong>')
input.gsub!(/\-{1}(.*?)\-{1}/,'<del>\1</del>')
input.gsub!(/\+{1}(.*?)\+{1}/,'<ins>\1</ins>')
input.gsub!(/\^{1}(.*?)\^{1}/,'<sup>\1</sup>')
input.gsub!(/\~{1}(.*?)\~{1}/,'<sub>\1</sub>')
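Since every rule above has the same shape, the list can also be expressed as data plus one loop. This is only a sketch: DELIMITED_TAGS and parse_delimited are names invented for the example, and it uses .+? everywhere so that an empty pair of markers is left alone.

```ruby
# Hypothetical table of [marker, tag] pairs; two-character markers must come first
DELIMITED_TAGS = [
  ['__', 'i'], ['**', 'b'], ['??', 'cite'],
  ['_', 'em'], ['*', 'strong'], ['-', 'del'],
  ['+', 'ins'], ['^', 'sup'], ['~', 'sub']
].freeze

# Hypothetical helper: applies each rule in order to the result of the previous one
def parse_delimited(input)
  DELIMITED_TAGS.reduce(input) do |text, (marker, tag)|
    m = Regexp.escape(marker) # escape *, ?, +, ^ and friends for use in a regex
    text.gsub(/#{m}(.+?)#{m}/) { "<#{tag}>#{$1}</#{tag}>" }
  end
end

puts parse_delimited("**bold** and *strong* and _em_ and H~2~O")
# <b>bold</b> and <strong>strong</strong> and <em>em</em> and H<sub>2</sub>O
```

Regexp.escape takes care of the backslashes, so adding a new marker is a one-line change to the table.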
3. Links and Images
Links and images don't fit the wrap-the-content pattern above, but they can still be parsed easily with simple regexes. (Superscript and subscript are delimited tags and were already handled in the previous step.)
input.gsub!(/"([^"]+)":(\S+)/,'<a href="\2">\1</a>')
input.gsub!(/\!{1}(.*?)\!{1}/,'<img src="\1"/>')
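Two things tend to trip up these simple rules in real text: a URL at the end of a sentence swallows the trailing period, and an ordinary sentence with two exclamation marks turns into an image. The sketch below is one possible way to tighten both. LINK, IMAGE and parse_links_and_images are names invented for the example, and it assumes that link URLs and image sources never contain spaces.

```ruby
# Hypothetical stricter rules (not from Litexte).
# The lazy URL stops before trailing punctuation that is followed by whitespace or the end of input.
LINK  = /"([^"]+)":(\S+?)(?=[.,;:!?)]*(?:\s|\z))/
# Assumption: an image source has no spaces, so "Wow! Great!" is not an image
IMAGE = /!(\S+?)!/

def parse_links_and_images(input)
  input.gsub(LINK, '<a href="\2">\1</a>').gsub(IMAGE, '<img src="\1"/>')
end

puts parse_links_and_images(%(Visit "my site":http://example.com. Then see !logo.png!))
# Visit <a href="http://example.com">my site</a>. Then see <img src="logo.png"/>
```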
4. Span and P tags
Span and p tags are a little trickier because they come in several variants, with optional classes, ids, styles and so on. For example:
%Ruby is awesome%
renders as a plain span: Ruby is awesome
%{color:blue}Regex is awesome%
renders as a span styled with color:blue: Regex is awesome
Ruby's gsub! also accepts a block, and the block's return value replaces each match. This is a very powerful feature for this kind of conditional substitution, and it suits both the simple span tag and the p tag with its numerous variants. Check out the block-based substitutions for span and p below:
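input.gsub!(/%(\{(.*?)\})?(.*?)%/) do
  # $2 is the optional {style} content, $3 is the span text.
  # The style group is non-greedy so that two styled spans on one line stay separate.
  style = $1 ? " style=\"#{$2}\"" : ''
  "<span#{style}>#{$3}</span>"
end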
input.gsub!(/\bp([<>=]+)?(\((.*?)\))?(\{(.*?)\})?(\[(.*?)\])?\.(.*?)\n/) do
  # \b keeps the rule from firing on a p inside a word, such as "help."
  # $1 alignment, $3 class#id, $5 inline style, $7 language, $8 paragraph text
  aligns = {'<' => 'left', '>' => 'right', '=' => 'center', '<>' => 'justify'}
  align = $1 ? "text-align: #{aligns[$1]};" : ""
  styles = $5 || ""
  style = (align + styles).empty? ? "" : " style=\"#{align}#{styles}\""
  lang = $7 ? " lang=\"#{$7}\"" : ""
  text = $8
  # Read every $n before calling match, because match overwrites them
  mdata = $3 ? $3.match(/(\w+)?#?(\w+)?/) : []
  _class = mdata[1] ? " class=\"#{mdata[1]}\"" : ""
  _id = mdata[2] ? " id=\"#{mdata[2]}\"" : ""
  "<p#{_id}#{_class}#{style}#{lang}>#{text}</p>"
end
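Counting parentheses to work out which of $1 to $8 holds what gets error-prone quickly. One alternative is to use Ruby's named groups, so the block reads the pieces by name. The following is only a sketch of that variant: P_RULE, ALIGNS and parse_paragraphs are names invented here, and the attribute order simply mirrors the rule above.

```ruby
# Hypothetical named-group version of the p rule (not the code from Litexte)
P_RULE = /\bp(?<align>[<>=]+)?(?:\((?<attrs>.*?)\))?(?:\{(?<style>.*?)\})?(?:\[(?<lang>.*?)\])?\.[ \t]*(?<text>.*?)\n/
ALIGNS = { '<' => 'left', '>' => 'right', '=' => 'center', '<>' => 'justify' }.freeze

def parse_paragraphs(input)
  input.gsub(P_RULE) do
    m = Regexp.last_match
    # "class#id", "class" or "#id" => class and id parts
    klass, id = (m[:attrs] || '').split('#', 2)
    css = [m[:align] && "text-align:#{ALIGNS[m[:align]]}", m[:style]].compact.join(';')
    attrs = []
    attrs << %(id="#{id}") if id && !id.empty?
    attrs << %(class="#{klass}") if klass && !klass.empty?
    attrs << %(style="#{css}") unless css.empty?
    attrs << %(lang="#{m[:lang]}") if m[:lang]
    # Like the rule above, the trailing newline is consumed by the match
    "<p#{attrs.map { |a| " #{a}" }.join}>#{m[:text]}</p>"
  end
end

puts parse_paragraphs("p>(note#intro){color:red}[en]. Hello\n")
# <p id="intro" class="note" style="text-align:right;color:red" lang="en">Hello</p>
```

With named groups, adding another optional modifier does not shift the numbering of the existing ones.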
5. Tables
|_. Name |_. Age |
|John |20 |
|Bill |25 |
For tables you need a way to block-substitute each table in the markup, in case there are multiple tables. Otherwise it's very straightforward:
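def parse_table(table)
  header = /^\_\./
  out = "<table>"
  # table is the raw text of one table, one row per line
  # (String#each no longer exists in Ruby 1.9+, so iterate with each_line)
  table.each_line do |row|
    out += "<tr>"
    # Splitting on | leaves an empty string before the first pipe and a newline after the last
    row.split('|').reject {|t| t.chomp.empty?}.each do |cell|
      if cell =~ header
        out += "<th>#{cell.sub(header,'')}</th>"
      else
        out += "<td>#{cell}</td>"
      end
    end
    out += "</tr>"
  end
  out += "</table>"
end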
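The block substitution mentioned above can be done with gsub and a block once again. Below is a minimal sketch that assumes a table is simply a run of consecutive lines starting with a pipe. TABLE_BLOCK and parse_tables are names invented for the example, and parse_table is repeated in a compact form (with cell whitespace stripped) only so that the snippet runs on its own.

```ruby
# Compact restatement of parse_table so this sketch is self-contained
def parse_table(table)
  rows = table.each_line.map do |row|
    cells = row.split('|').reject { |c| c.chomp.empty? }.map do |cell|
      cell =~ /^_\./ ? "<th>#{cell.sub(/^_\./, '').strip}</th>" : "<td>#{cell.strip}</td>"
    end
    "<tr>#{cells.join}</tr>"
  end
  "<table>#{rows.join}</table>"
end

# Assumption: a table block is one or more consecutive lines that begin with |
TABLE_BLOCK = /(?:^\|.*\n?)+/

# Hypothetical helper: replaces every table block in the document
def parse_tables(input)
  input.gsub(TABLE_BLOCK) { |block| parse_table(block) + "\n" }
end

puts parse_tables("Intro\n|_. Name |_. Age |\n|John |20 |\nOutro\n")
# Intro
# <table><tr><th>Name</th><th>Age</th></tr><tr><td>John</td><td>20</td></tr></table>
# Outro
```

Because each block is matched separately, two tables separated by any other line end up as two separate table elements.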
6. Ordered and Unordered Lists
# Languages
## Ruby
## Python
# Frameworks
## Rails
Besides appearing multiple times, lists can also be nested. That calls for a recursive solution, which at each step either spits out a list item or parses a sublist. The solution doesn't follow Ruby idioms, because each doesn't combine well with slicing out sublists, so it walks the items with an index instead.
def parse_list(list)
  # Turn each "## Ruby" line into a [marker, text] pair
  items = list.scan(/^#+.*?\n/).map(&:chomp).collect {|item| item =~ /(#+)(.*)/; [$1,$2]}
  parse_list_items(items)
end
def parse_list_items(items, start = 0)
  list_out = "<ol>"
  i = 0
  while(i < items.length)
    level, item = items[i]
    if level.length-start == 1
      list_out += "<li>#{item}</li>"
      i += 1
    else
      # A deeper item opens a sublist that runs up to the next item at the current level
      count = items[i,items.size].find_index {|e| e[0].length == start+1} || (items.length - i)
      list_out += parse_list_items(items[i,count], start+1)
      i += count
    end
  end
  list_out += "</ol>"
end
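The code above only emits ordered lists. As a possible extension, here is a sketch that also handles unordered lists and nests each sublist inside its parent item. It swaps the recursion for a stack of open tags, and it assumes that * marks unordered items and that the last character of the marker decides the list type. LIST_TAGS and parse_nested_list are names invented for the example, and mixed list types at the same depth are not handled.

```ruby
# Hypothetical marker => tag table (assumption: * is unordered, # is ordered)
LIST_TAGS = { '#' => 'ol', '*' => 'ul' }.freeze

# Hypothetical stack-based alternative to the recursive parser
def parse_nested_list(list)
  items = list.scan(/^([#*]+)[ \t]+(.*)$/) # [[marker, text], ...]
  out = String.new
  open_tags = [] # stack of list tags that are currently open
  items.each do |marker, text|
    depth = marker.length
    tag = LIST_TAGS[marker[-1]]
    # Moving back up: close the deeper lists and their last items
    out << "</li></#{open_tags.pop}>" while open_tags.length > depth
    if open_tags.length == depth
      out << "</li>" # same level: close the previous sibling
    else
      # Moving down: open lists until the item's depth is reached
      while open_tags.length < depth
        open_tags.push(tag)
        out << "<#{tag}>"
        out << "<li>" if open_tags.length < depth # filler item for a skipped level
      end
    end
    out << "<li>#{text}"
  end
  out << "</li></#{open_tags.pop}>" until open_tags.empty?
  out
end

puts parse_nested_list("# Languages\n## Ruby\n## Python\n# Frameworks\n## Rails\n")
# <ol><li>Languages<ol><li>Ruby</li><li>Python</li></ol></li><li>Frameworks<ol><li>Rails</li></ol></li></ol>
```

Each li stays open until the next item at the same or a shallower level arrives, which is what lets a sublist land inside its parent item.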