    #!/usr/bin/skyle
    # getting started in less than 5 minutes
    title tutorial
    probe http://skyle.codeissues.net/
    output keywords.csv
    # extract all Skyle keywords
    follow
    save keyword
    # add keyword usage example
    follow
    save usage

Skyle

Learn Skyle by Example

Skyle is an ambitious open-source scraper with its own declarative language. Its purpose is to provide a portable and elegant way to scrape various file formats. This document is a brief introduction to how Skyle can be used. A good understanding of regular expressions and XPath expressions is important.

 

Skyle profile

Headers: user-defined constants at the beginning, static properties of the profile, global settings

Instructions: blocks of code, text modifiers, filtering expressions, saving query results

 

A file written in Skyle is called a profile. It is case-sensitive and whitespace-sensitive. A profile is divided into headers and instructions. Headers must precede instructions; otherwise Skyle throws an error and exits.
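The ordering rule can be illustrated with a small Python sketch. It is not Skyle's parser. It assumes that the only header keywords are title, probe, output and flags (the ones named in this tutorial), that lines starting with # are comments, and that a ValueError stands in for whatever error Skyle actually reports. The function name check_header_order is hypothetical.

```python
# Header keywords named in this tutorial (assumed to be the complete set).
HEADER_KEYWORDS = {"title", "probe", "output", "flags"}


def check_header_order(profile_text):
    """Split a profile into headers and instructions.

    Raises ValueError if a header appears after the first instruction.
    """
    headers, instructions = [], []
    for number, line in enumerate(profile_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments (including the shebang)
        keyword = stripped.split(maxsplit=1)[0]
        if keyword in HEADER_KEYWORDS:
            if instructions:
                raise ValueError(
                    f"line {number}: header '{keyword}' found after instructions"
                )
            headers.append(stripped)
        else:
            instructions.append(stripped)
    return headers, instructions
```

A profile that lists output after a save instruction would be rejected by this check, while one that keeps all headers at the top passes.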

The goal of a Skyle profile is to describe, through instructions, the steps needed to scrape a probe. The results are saved in an output. The probe and output headers are the two most common headers of a profile.

Think of headers as the bare-bones assumptions of a task, whereas the instructions are the blueprint of the solution. Another point to remember is that instructions depend on the file format, whereas headers do not.

Skyle uses an in-memory stack once the probe has been successfully loaded. This stack adapts after each instruction, meaning an instruction indirectly changes the stack's elements. The table below lists three instructions that interact with the stack directly.

Instruction  Usage
dump         Used to analyze the stack
flush        Revert the stack to its initial state
save         Label results for output

By design, Skyle handles saved entities as lists. With this in mind, a single entity is not a value; it is a list with one element. This applies to nil values as well: there are no nil values, just lists with zero elements.

Whenever Skyle encounters the save instruction, it classifies the current results as category, where category is the name after the save instruction. A good synonym for the word save would be label.
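The stack behaviour and the list semantics described above can be modelled in a few lines of Python. This is a toy model for illustration only, not Skyle's implementation: the class ToyStack, its apply method and the sample data are all hypothetical.

```python
class ToyStack:
    """Illustrative model of the stack; not Skyle's actual implementation."""

    def __init__(self, probe_context):
        self._initial = [probe_context]  # the stack starts with the probe's context
        self.items = list(self._initial)
        self.saved = {}                  # label -> list of results

    def apply(self, instruction):
        """Run an instruction: a function mapping current items to new items."""
        self.items = list(instruction(self.items))

    def dump(self):
        """Return a copy of the stack for inspection."""
        return list(self.items)

    def flush(self):
        """Revert the stack to its initial state."""
        self.items = list(self._initial)

    def save(self, label):
        """Label the current results; a saved entity is always a list."""
        self.saved[label] = list(self.items)


stack = ToyStack("alpha beta")
stack.apply(lambda items: items[0].split())
stack.save("words")      # two results: a list with two elements
stack.apply(lambda items: items[:1])
stack.save("first")      # one result: a list with one element
stack.apply(lambda items: [])
stack.save("nothing")    # no result: an empty list rather than nil
stack.flush()            # back to the probe's context
```

After the run, stack.saved maps every label to a list, including the empty one, and stack.dump() shows the initial context again.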

Skyle profile
    #!/usr/bin/skyle
    # provide path to probe
    probe /tmp/dummy.html
    # evaluate xpath expression
    follow
    # filter all results from previous
    # instruction and keep new results
    keep .+\.pdf$
    # label results from previous
    # instruction and write to output
    save files
Memory stack dump

Skyle initializes the memory stack with the probe's context
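The filtering step in the profile above can be approximated in Python as follows. This is a sketch under assumptions: the helper name keep mirrors the instruction but is not Skyle's code, the sample links are invented, and whether Skyle searches or anchors the expression at the start of each item is not stated in this tutorial (the sketch uses re.search).

```python
import re


def keep(items, expression):
    """Keep only the items that match the regular expression."""
    compiled = re.compile(expression)
    return [item for item in items if compiled.search(item)]


# Hypothetical results of a previous instruction.
links = ["report.pdf", "index.html", "notes.pdf", "archive.pdf.zip"]
files = keep(links, r".+\.pdf$")
```

Only the entries ending in .pdf survive the filter; the rest are dropped from the results that save would label.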

As a side note, lines starting with the pound sign (hashtag) are comments and Skyle ignores them. Use comments to leave user-friendly notes.

Make yourself familiar with some common keywords before moving on to the use-case scenarios and profile examples.

Keyword                Meaning
title, probe, output   Common headers
flags                  Advanced header used to customize a profile
follow, node, next     XML/HTML instructions and loop instruction
pattern, remove, glue  Text modifier instructions
keep                   Filter expression instruction
save                   Save instruction, used for output
exec                   Shell instruction, used to execute shell commands

A first use case for a simple profile is to scrape an interview paper with questions and answers and create a table-like output of the results. Skyle supports CSV with headers as the output of a profile.

    $ ls
    interview.txt

    $ cat interview.txt
    Q: What's your name?
    A: Skyle.

    Q: How are you?
    A: I'm fine, thanks.

    Q: What do you think of this interview?
    A: It's nice!

This is what interview.txt looks like. Each question and each answer is written on its own line. Questions start with the Q: prefix and answers with the A: prefix, and a blank line separates each pair. This pattern can be used to transform the probe into something else.

The expression for a question line is Q:\s(What's your name?)\n and the pattern can be extracted as Q:\s+(.*)\n. The same approach is available for answers. Finally, the profile looks like this.
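To see what these expressions capture, here is a minimal Python sketch that applies them to the interview text with the standard re module. The helper name pattern is borrowed from the Skyle instruction for readability; how Skyle evaluates the expression internally is an assumption.

```python
import re

# Contents of interview.txt from the listing above.
INTERVIEW = (
    "Q: What's your name?\n"
    "A: Skyle.\n"
    "\n"
    "Q: How are you?\n"
    "A: I'm fine, thanks.\n"
    "\n"
    "Q: What do you think of this interview?\n"
    "A: It's nice!\n"
)


def pattern(text, expression):
    """Return the first capture group of every match, as a list."""
    return re.findall(expression, text)


questions = pattern(INTERVIEW, r"Q:\s+(.*)\n")
answers = pattern(INTERVIEW, r"A:\s+(.*)\n")
```

Each call returns a list with three elements, one per question or answer, with the Q: or A: prefix and the trailing newline stripped.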

Skyle profile
    #!/usr/bin/skyle
    # a name for the profile to recognize it later
    title Interview Example
    # provide path to interview probe
    probe interview.txt
    # provide path to output
    output interview.csv
    # get all questions and save them
    pattern Q:\s+(.*)\n
    save question
    # reset the stack
    flush all
    # get all answers and save them
    pattern A:\s+(.*)\n
    save answer

The title is not mandatory, but it helps organize profiles by adding names or short descriptions per file. Running the profile will generate the desired output as follows.

    $ stat -c "%A %n" *
    -rw-rw-r-- interview.txt
    -rw-rw-r-- profile.sky

    $ chmod +x profile.sky && ./profile.sky

    $ stat -c "%A %n" *
    -rw-rw-r-- interview.txt
    -rw-rw-r-- interview.csv
    -rwxrwxr-x profile.sky

By default, profiles are not executable, hence the need to run chmod +x before running one. The resulting CSV file should contain two columns, one for each classified entity.

Skyle Output
     A                                     B
1    question                              answer
2    What's your name?                     Skyle.
3    How are you?                          I'm fine, thanks.
4    What do you think of this interview?  It's nice!
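The mapping from saved entities to CSV columns can be sketched with Python's csv module. This is an illustration, not Skyle's writer: the function write_columns is hypothetical, and padding shorter lists with empty cells is an assumption.

```python
import csv
import io
from itertools import zip_longest


def write_columns(stream, columns):
    """Write a dict of label -> list as CSV columns with a header row."""
    writer = csv.writer(stream)
    writer.writerow(columns.keys())
    # Shorter lists are padded with empty cells (an assumption of this sketch).
    writer.writerows(zip_longest(*columns.values(), fillvalue=""))


columns = {
    "question": ["What's your name?", "How are you?",
                 "What do you think of this interview?"],
    "answer": ["Skyle.", "I'm fine, thanks.", "It's nice!"],
}
buffer = io.StringIO()
write_columns(buffer, columns)

# Read the result back to inspect the rows.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Each label becomes a header cell and each list becomes a column. The csv module quotes the answer that contains a comma so it stays in one cell.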

 

Scraping XML/HTML documents is not very different from scraping text files. Just adapt the profile with the corresponding follow instruction and most of the work is done.

Behind the scenes, Skyle creates an in-memory copy of the root node(s) of an XML/HTML document. Instructions such as node and next are used to control this component. Skyle's internal stack is not affected, since this component only plays an auxiliary role in the scraping process.

    $ ls
    document.html

    $ cat document.html
    <html>
        <head>
            <title>HTML Example</title>
        </head>
        <body>
            <div>
                <h2>An article</h2>
                <span>633 views</span>
                <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
            </div>
            <div>
                <h2>Anoter article</h2>
                <span>542 views</span>
                <p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
            </div>
            <div>
                <h2>Yet again an article</h2>
                <span>62 views</span>
                <p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.</p>
            </div>
            <div>
                <h2>Still an article</h2>
                <a href="article.pdf">Download article</a>
            </div>
            <div>
                <h2>Last article</h2>
                <span>321 views</span>
            </div>
        </body>
    </html>

It's worth mentioning that the div tags do not all share the same structure, so a simple use of follow will generate unsynchronized results. The solution is to use the node instruction to change the root node from the default html tag to the div tag. With the help of the next instruction, Skyle can loop over all nodes, producing the synchronized output one would expect.
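The synchronization problem can be reproduced outside Skyle. The Python sketch below uses xml.etree.ElementTree on an abridged copy of document.html (paragraph text shortened) and compares flat queries from the document root with one query per div. It illustrates the idea only and does not claim to show how Skyle implements node and next.

```python
import xml.etree.ElementTree as ET

# Abridged copy of document.html; paragraph text is shortened.
DOCUMENT = """<html><body>
<div><h2>An article</h2><span>633 views</span><p>Lorem ipsum</p></div>
<div><h2>Anoter article</h2><span>542 views</span><p>Ut enim</p></div>
<div><h2>Yet again an article</h2><span>62 views</span><p>Duis aute</p></div>
<div><h2>Still an article</h2><a href="article.pdf">Download article</a></div>
<div><h2>Last article</h2><span>321 views</span></div>
</body></html>"""

root = ET.fromstring(DOCUMENT)

# Flat queries from the root: the lists have different lengths, so
# position i in one list no longer belongs to position i in another.
titles = [h2.text for h2 in root.findall(".//div/h2")]
views = [span.text for span in root.findall(".//div/span")]

# One query per div: a missing element yields an empty cell, not a shift.
rows = []
for div in root.findall(".//div"):
    link = div.find("a")
    rows.append({
        "title": div.findtext("h2", default=""),
        "views": div.findtext("span", default=""),
        "file": link.get("href", "") if link is not None else "",
    })
```

The flat lists hold five titles but only four view counts, so the fourth view count would be paired with the wrong article. The per-div rows keep every value next to its own title.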

Skyle profile
    #!/usr/bin/skyle
    # a name for the profile to recognize it later
    title HTML Example
    # provide path to HTML probe
    probe document.html
    # provide path to output
    output htmldoc.csv
    # change root node from "html" to "div"
    node //div
    # get relative "h2" text content, save and continue
    follow
    save title
    # get relative "span" text content, remove whatever
    # is not a digit, save and continue
    follow
    remove \D
    save views
    # get relative "p" text content, save and continue
    follow
    save content
    # get relative "a" href attribute, save and continue
    follow
    save file
    # continue loop until no next root node exists
    next node
    # download all files
    exec wget $file

The span tag has a pattern of DIGIT views. The use of remove instead of pattern is simply a matter of preference. Both would have worked.
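The equivalence of the two approaches is easy to check with Python's re module. The capture expression (\d+) used for the pattern variant is an assumption; the tutorial only gives the remove expression \D.

```python
import re

text = "633 views"

# "remove" approach: delete every character that is not a digit.
by_remove = re.sub(r"\D", "", text)

# "pattern" approach: capture the run of digits (hypothetical expression).
by_pattern = re.search(r"(\d+)", text).group(1)
```

Both variables end up holding the same digits for this input. They would differ on text with several separate digit groups, where remove concatenates all digits and the capture keeps only the first run.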

This is the first time a variable is used in a Skyle profile: $file. The dollar sign in front of an alphanumeric word invokes a classified entity, if it exists.

Notice that the exec instruction is outside the node loop. Variables are lists, so any instruction that calls upon a saved entity acts as a loop over each element of the list. Inside the node loop this would translate into a loop within a loop, which is why exec is called outside of it.
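The implicit loop over a saved entity can be sketched in Python. The function expand_command is hypothetical and only builds the command strings without running them. Treating an unknown label as an empty list is an assumption, as is the second file name in the sample data.

```python
import re


def expand_command(template, saved):
    """Return one command string per element of the referenced saved list."""
    match = re.search(r"\$(\w+)", template)
    if match is None:
        return [template]  # no variable: the command runs once
    # Assumption: an unknown label behaves like a list with zero elements.
    values = saved.get(match.group(1), [])
    return [template.replace(match.group(0), value) for value in values]


# Hypothetical saved entities; "slides.pdf" is not part of the tutorial's data.
saved = {"file": ["article.pdf", "slides.pdf"]}
commands = expand_command("wget $file", saved)
```

One instruction expands into one command per list element. Placing it inside another loop would therefore repeat the whole expansion on every iteration.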

Skyle Output
     A                     B      C               D
1    title                 views  content         file
2    An article            633    Lorem ipsum...
3    Anoter article        542    Ut enim...
4    Yet again an article  62     Duis aute...
5    Still an article                             article.pdf
6    Last article          321

A complete reference to Skyle's documentation can be found on GitHub.
