Skyle is an ambitious open-source scraper with its own declarative language. Its purpose is to provide a portable and elegant way to scrape various file formats. The following document is a brief introduction to how Skyle can be used. A good understanding of regular expressions and XPath expressions is important.
Headers
user-defined constants at the beginning, static properties of the profile, global settings
Instructions
blocks of code, text modifiers, filtering expressions, save query results
A file written in Skyle is called a profile. It is case-sensitive and whitespace-sensitive. A profile is divided into headers and instructions. Headers must precede instructions; otherwise Skyle throws an error and exits.
The goal of a Skyle profile is to describe, through instructions, the steps needed to scrape a probe. The results are saved in an output. The two most common headers of a profile are probe and output.
Think of headers as the bare-bones assumptions of a task, whereas the instructions are the blueprint of the solution. Also remember that instructions depend on the file format, whereas headers do not.
Skyle uses an in-memory stack once the probe has been successfully loaded. The stack adapts after each instruction, meaning that every instruction indirectly changes the stack's elements. The table below lists three instructions that interact with the stack directly.
Instruction | Usage |
---|---|
dump | Used to analyze the stack |
flush | Revert the stack to its initial state |
save | Label results for output |
By design, Skyle handles saved entities as lists. With this in mind, a single entity is not a value; it is a list with one element. This applies to nil values as well: there are no nil values, just lists with zero elements.
Whenever Skyle encounters the save instruction, it classifies the current results under category, where category is the name that follows the save instruction. A good synonym for the word save would be label.
Skyle initializes the memory stack with the probe's context
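The stack behaviour described above can be pictured with a small Python toy model. This is only a sketch under stated assumptions: the class `ToyStack` and its `apply` method are hypothetical names invented for illustration, the code is not Skyle syntax, and it makes no claim about how Skyle is implemented internally. It only mirrors the documented ideas: the stack starts from the probe's context, dump inspects it, flush reverts it, and save labels the current results as a list.

```python
import re


class ToyStack:
    """Toy model only; hypothetical, not Skyle's actual implementation."""

    def __init__(self, probe_context):
        # The stack starts out holding the probe's context.
        self._initial = [probe_context]
        self.items = list(self._initial)
        self.saved = {}  # category name -> list of results

    def apply(self, instruction):
        """Run an instruction on every element; each call may yield 0..n results."""
        self.items = [out for item in self.items for out in instruction(item)]

    def dump(self):
        """Return a copy of the current elements for inspection."""
        return list(self.items)

    def flush(self):
        """Revert the stack to its initial state."""
        self.items = list(self._initial)

    def save(self, category):
        """Label the current results; a saved entity is always a list."""
        self.saved[category] = list(self.items)


stack = ToyStack("name: Skyle\n")
stack.apply(lambda text: re.findall(r"name:\s+(.*)\n", text))
stack.save("name")  # one match: a list with one element
stack.flush()
stack.apply(lambda text: re.findall(r"age:\s+(.*)\n", text))
stack.save("age")   # no match: a list with zero elements, never a nil value
print(stack.saved)
```

In this model a missing result needs no special case: every later step can loop over the saved list, whether it holds zero, one, or many elements.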
As a side note, lines starting with the pound sign (hashtag) are comments, and Skyle ignores them. Use comments to leave user-friendly notes.
Familiarize yourself with some common keywords before moving on to the use-case scenarios and profile examples.
Keyword | Meaning |
---|---|
title probe output | Common headers |
flags | Advanced header used to customize a profile |
follow node next | XML/HTML instructions & loop instruction |
pattern remove glue | Text modifier instructions |
keep | Filter expression instruction |
save | Save instruction, used for output |
exec | Shell instruction, used to execute shell commands |
A first use case for a simple profile is to scrape an interview paper with questions and answers and create a table-like output of the results. Skyle supports CSV with headers as the output of a profile.
    $ ls
    interview.txt
    $ cat interview.txt
    Q: What's your name?
    A: Skyle.

    Q: How are you?
    A: I'm fine, thanks.

    Q: What do you think of this interview?
    A: It's nice!
This is what interview.txt looks like. Each question and each answer is written on its own line. Questions start with the Q: prefix and answers with the A: prefix, and a blank line separates each pair. This pattern can be used to transform the probe into something else.
The expression for the first question line is Q:\s(What's your name?)\n, and the general pattern can be extracted as Q:\s+(.*)\n. The same approach is available for answers, using the A: prefix. Finally, the profile looks like this.
The title is not mandatory, but it helps organize profiles by adding a name or short description to each file. Running the profile generates the desired output as follows.
    $ stat -c "%A %n" *
    -rw-rw-r-- interview.txt
    -rw-rw-r-- profile.sky
    $ chmod +x profile.sky && ./profile.sky
    $ stat -c "%A %n" *
    -rw-rw-r-- interview.txt
    -rw-rw-r-- interview.csv
    -rwxrwxr-x profile.sky
By default, profiles are not executable, hence the chmod +x before running one. The resulting CSV file should contain two columns, one for each classified entity.
Row | A | B |
---|---|---|
1 | question | answer |
2 | What's your name? | Skyle. |
3 | How are you? | I'm fine, thanks. |
4 | What do you think of this interview? | It's nice! |
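To check the Q:\s+(.*)\n pattern outside Skyle, the same extraction can be sketched in Python. This is not a Skyle profile and says nothing about how Skyle produces its CSV; the helper names `extract` and `to_csv` are hypothetical, and the interview text is inlined so the example is self-contained.

```python
import csv
import io
import re

INTERVIEW = (
    "Q: What's your name?\n"
    "A: Skyle.\n"
    "\n"
    "Q: How are you?\n"
    "A: I'm fine, thanks.\n"
    "\n"
    "Q: What do you think of this interview?\n"
    "A: It's nice!\n"
)


def extract(prefix, text):
    """Return every capture of '<prefix>:\\s+(.*)\\n' as a list (possibly empty)."""
    return re.findall(re.escape(prefix) + r":\s+(.*)\n", text)


def to_csv(columns):
    """Write named lists side by side as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())
    writer.writerows(zip(*columns.values()))
    return buffer.getvalue()


questions = extract("Q", INTERVIEW)
answers = extract("A", INTERVIEW)
print(to_csv({"question": questions, "answer": answers}))
```

Each prefix yields one list, and the two lists line up row by row only because every question in this probe has exactly one answer.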
Scraping XML/HTML documents is not very different from scraping text files. Just adapt the profile with the corresponding follow instruction and most of the work is done.
Behind the scenes, Skyle creates an in-memory copy of the root node(s) of an XML/HTML document. Instructions such as node and next are used to control this component. Skyle's internal stack is not affected, since this component plays an auxiliary role in the scraping process.
    $ ls
    document.html
    $ cat document.html
    <html>
      <head>
        <title>HTML Example</title>
      </head>
      <body>
        <div>
          <h2>An article</h2>
          <span>633 views</span>
          <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
        </div>
        <div>
          <h2>Anoter article</h2>
          <span>542 views</span>
          <p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
        </div>
        <div>
          <h2>Yet again an article</h2>
          <span>62 views</span>
          <p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.</p>
        </div>
        <div>
          <h2>Still an article</h2>
          <a href="article.pdf">Download article</a>
        </div>
        <div>
          <h2>Last article</h2>
          <span>321 views</span>
        </div>
      </body>
    </html>
Note that not all div tags share the same structure, so a simple use of follow generates unsynchronized results. The solution is to use the node instruction to change the root node from the default html tag to the div tag. With the help of the next instruction, Skyle can loop over all nodes, producing a synchronized output, as expected.
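The synchronization problem can be reproduced outside Skyle with Python's standard `xml.etree.ElementTree`. This is a sketch of the idea only, not of Skyle's node and next instructions; the document is abbreviated to the tags that matter here, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of document.html: paragraphs and links are left out.
DOCUMENT = """<html><body>
  <div><h2>An article</h2><span>633 views</span></div>
  <div><h2>Anoter article</h2><span>542 views</span></div>
  <div><h2>Yet again an article</h2><span>62 views</span></div>
  <div><h2>Still an article</h2></div>
  <div><h2>Last article</h2><span>321 views</span></div>
</body></html>"""

root = ET.fromstring(DOCUMENT)

# Flat selection from the document root: two lists of different lengths.
flat_titles = [el.text for el in root.iter("h2")]
flat_views = [el.text for el in root.iter("span")]
misaligned = list(zip(flat_titles, flat_views))

# Per-node selection: each div is queried on its own, so a missing span
# becomes an empty list for that row instead of shifting later rows.
aligned = [
    (div.findtext("h2"), [el.text for el in div.findall("span")])
    for div in root.iter("div")
]

print(misaligned[3])  # title paired with another article's view count
print(aligned[3])     # title paired with an empty list
```

The flat version has five titles but only four view counts, so the fourth title is paired with the view count that belongs to the fifth article. Querying one div at a time keeps every row consistent.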
The span tag follows the pattern DIGIT views. Using remove instead of pattern is simply a matter of preference; both would have worked.
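The equivalence of the two approaches can be illustrated with plain Python regular expressions. The snippet assumes that pattern behaves like capturing a group and remove like deleting the matched text; the exact semantics of the Skyle instructions are not shown in this document, so treat this as an analogy.

```python
import re

span_texts = ["633 views", "542 views", "62 views", "321 views"]

# "pattern" style: capture the digits and keep only the captured group.
by_pattern = [re.search(r"(\d+)\s+views", text).group(1) for text in span_texts]

# "remove" style: delete the unwanted suffix and keep whatever is left.
by_remove = [re.sub(r"\s+views$", "", text) for text in span_texts]

print(by_pattern == by_remove)
```

Capturing is stricter, since it fails on text that does not match, while removing passes unmatched text through unchanged.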
This is the first time a variable is used in a Skyle profile: $file. A dollar sign in front of an alphanumeric word invokes a classified entity, if it exists.
Notice that the exec instruction sits outside the node loop. Variables are lists, so any instruction that calls upon a saved entity acts as a loop over each element of the list. Inside the node loop this would become a loop within a loop, which is why exec is called outside it.
Row | A | B | C | D |
---|---|---|---|---|
1 | title | views | content | file |
2 | An article | 633 | Lorem ipsum... | |
3 | Anoter article | 542 | Ut enim... | |
4 | Yet again an article | 62 | Duis aute... | |
5 | Still an article | | | article.pdf |
6 | Last article | 321 | | |
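The rows above, and the reason for acting on a saved list only after the node loop, can be sketched in Python. This is an analogy under stated assumptions, not the Skyle profile: the helper `texts`, the `saved` dictionary, and the `handled` list are hypothetical, the paragraphs are abbreviated as in the table, and no shell command is actually executed.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of document.html with shortened paragraphs.
DOCUMENT = """<html><body>
  <div><h2>An article</h2><span>633 views</span><p>Lorem ipsum...</p></div>
  <div><h2>Anoter article</h2><span>542 views</span><p>Ut enim...</p></div>
  <div><h2>Yet again an article</h2><span>62 views</span><p>Duis aute...</p></div>
  <div><h2>Still an article</h2><a href="article.pdf">Download article</a></div>
  <div><h2>Last article</h2><span>321 views</span></div>
</body></html>"""


def texts(node, path):
    """Text of every match under node, as a list (empty when nothing matches)."""
    return [el.text for el in node.findall(path)]


saved = {"title": [], "views": [], "content": [], "file": []}
rows = []

# Node loop: every div is scraped on its own, one row per node.
for div in ET.fromstring(DOCUMENT).iter("div"):
    row = {
        "title": texts(div, "h2"),
        "views": [re.sub(r"\s+views$", "", t) for t in texts(div, "span")],
        "content": texts(div, "p"),
        "file": [a.get("href") for a in div.findall("a")],
    }
    rows.append(row)
    for category, values in row.items():
        saved[category].extend(values)

# Acting on a saved list is itself a loop over its elements, so it is done
# once here, after the node loop, instead of once per node.
handled = ["would handle " + name for name in saved["file"]]

for row in rows:
    print(" | ".join(", ".join(values) for values in row.values()))
print(handled)
```

Placing the final step inside the node loop would repeat the walk over the saved list for every div, which is the loop within a loop the text warns about.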