cppScrape is a basic, flexible web scraper written in C++ with curl, designed to be usable as a minimal web browser.
CPP LIBRARY FUNCTIONS
Below are the functions and fields provided by the Scrape object. In the entries below, Scrape-> stands for a pointer to a Scrape object.

- Scrape *scrape = new Scrape(): initializes a new Scrape object
- Scrape->setURL(std::string name): sets the target webpage to get html data from
- Scrape->setFileName(std::string filename): sets the name of the .tsv file output by parseByTags()
- Scrape->addTag(std::string tag): adds a row to the bottom of the output file, where the text of html tags matching tag will be stored
- Scrape->addTag(std::string tag, int index): inserts a row into the output file at index, where the text of the corresponding tag will be stored
- Scrape->URL: string containing the current target url; the default value is "http://example.com"
- Scrape->FILENAME: string containing the filename of the output file; the default value is "out.tsv"
- Scrape->TAGS: a string vector containing all html tags, with each item denoting a row in the out.tsv produced by parseByHTML()
- Scrape->makeEndTag(std::string tag): converts the html tag passed to it into a valid closing version of that tag
- Scrape->sendRequest(): sends a GET request to the webpage at URL, stores the response, and returns the raw response as a string
- Scrape->parseByTags(): sorts the internal response value and writes a .tsv file in which:
  - the filename is the one defined by the FILENAME value
  - each row stores the values of its corresponding tag in TAGS (i.e. an h1 tag at index 2 turns row 2 of the tsv file into an h1 row)
  - each column contains one instance of the tag (i.e. {text from header1 on page} \t {text from a different header on page})
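The row and column layout described above can be illustrated with a small standalone sketch. This is not cppScrape's implementation: the helper names closingTagFor, textForTag and toTsv are hypothetical, and the sketch assumes simple tags with no attributes and no nesting. It only shows one way a tag list could map to tab-separated rows.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: turn an opening tag such as "<p>" into "</p>".
// Assumes a simple tag with no attributes; not cppScrape's actual code.
std::string closingTagFor(const std::string &tag) {
    if (tag.size() < 2 || tag[0] != '<') return tag;  // not a tag, return unchanged
    if (tag[1] == '/') return tag;                    // already a closing tag
    return "</" + tag.substr(1);
}

// Hypothetical helper: collect the text between every occurrence of `tag`
// and its closing tag in `html` (no nesting or attribute handling).
std::vector<std::string> textForTag(const std::string &html, const std::string &tag) {
    std::vector<std::string> found;
    const std::string endTag = closingTagFor(tag);
    std::size_t pos = 0;
    while ((pos = html.find(tag, pos)) != std::string::npos) {
        const std::size_t start = pos + tag.size();
        const std::size_t end = html.find(endTag, start);
        if (end == std::string::npos) break;          // unterminated tag, stop
        found.push_back(html.substr(start, end - start));
        pos = end + endTag.size();
    }
    return found;
}

// Build the tsv text: one row per tag, one tab-separated column per match.
std::string toTsv(const std::string &html, const std::vector<std::string> &tags) {
    std::ostringstream out;
    for (const std::string &tag : tags) {
        const std::vector<std::string> cells = textForTag(html, tag);
        for (std::size_t i = 0; i < cells.size(); ++i) {
            if (i > 0) out << '\t';
            out << cells[i];
        }
        out << '\n';
    }
    return out.str();
}
```

With the tags {"<h1>", "<p>"} and the input "<h1>Title</h1><p>one</p><p>two</p>", toTsv returns two rows: "Title" on the first, and "one" and "two" separated by a tab on the second.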
CPP LIBRARY EXAMPLE USAGE
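    #include <iostream>
    #include <string>
    #include <vector>
    #include "cppScrape.hpp"
    int main() {
        Scrape *request = new Scrape();         // create the scraper
        request->setURL("http://example.com");  // page to fetch
        request->addTag("<p>");                 // collect the text of <p> tags
        request->sendRequest();                 // fetch the page and store the response
        std::vector<std::string> output = request->parseByTags();
        for (int i = 0; i < static_cast<int>(output.size()); i++) {
            std::cout << output.at(i) << ", ";
        }
        delete request;
    }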
WRAPPER EXECUTABLE USAGE
If you want to run this utility as a command, download and compile the included main.hpp. The executable is then used as follows:
- ./cppScrape <url>: uses the default TAGS and FILENAME to generate out.tsv from the given url
- ./cppScrape <url>, <html tags>: generates an out.tsv from the given url, formatted using the given html tags
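As a rough illustration of the two invocation forms above, the sketch below shows one way a wrapper could split its arguments into a url and an optional tag list. It is an assumption, not the code in main.hpp: the CliArgs struct and parseArgs function are hypothetical, and the handling of the comma after the url is a guess at the intended syntax.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the wrapper's inputs; not a cppScrape type.
struct CliArgs {
    std::string url;
    std::vector<std::string> tags;  // empty means "fall back to the default TAGS"
};

// Treat the first non-empty argument as the url and the rest as html tags.
// A trailing comma on an argument (as in "<url>, <html tags>") is dropped.
CliArgs parseArgs(int argc, const char *const argv[]) {
    CliArgs args;
    for (int i = 1; i < argc; ++i) {
        std::string value = argv[i];
        if (!value.empty() && value.back() == ',') value.pop_back();
        if (value.empty()) continue;        // skip a lone "," argument
        if (args.url.empty()) {
            args.url = value;               // first usable argument is the url
        } else {
            args.tags.push_back(value);     // everything after it is a tag
        }
    }
    return args;
}
```

Note that tags such as <p> contain characters most shells treat as redirection, so they would need quoting on the command line.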