
A simple C++ script that extracts content from webpages, then sorts and parses the content of specific tags into a .tsv file.


Jurorno9/Web-C-Scrape


Introduction

cppScrape is a basic, flexible web scraper written in C++ on top of curl. It is designed to be usable as the most minimal possible web browser.

CPP LIBRARY FUNCTIONS

Below are the functions and members provided by the Scrape object.

  • Scrape *request = new Scrape() : initializes a new Scrape object

  • Scrape->setURL(std::string name) : sets the target webpage to get HTML data from

  • Scrape->setFileName(std::string filename) : sets the name of the .tsv file output by parseByTags()

  • Scrape->addTag(std::string tag) : adds a line to the bottom of the output file where the content of HTML tags named tag will be stored

  • Scrape->addTag(std::string tag, int index) : inserts a line into the output file at index, where the content of the corresponding tag will be stored

  • Scrape->URL : string containing the current target url, default value is "http://example.com"

  • Scrape->FILENAME : string containing the filename of the output file, default value is "out.tsv"

  • Scrape->TAGS : a string vector containing all HTML tags, with each item denoting a row in the out.tsv file written by parseByTags()

  • Scrape->makeEndTag(std::string tag) : converts any html tag passed to it to a valid closed version of the tag

  • Scrape->sendRequest() : sends a GET request to the webpage at URL, stores the response, and returns the raw response as a string

  • Scrape->parseByTags() : sorts the stored response and writes a .tsv file in which

    1. the filename is defined by the FILENAME value
    2. each row stores the values of its corresponding tag in TAGS (i.e. an h1 tag at index 2 turns row 2 of the .tsv file into an h1 row)
    3. each column contains one instance of the tag (i.e. {text from header1 on page} \t {text from a different header on page})
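The closing-tag conversion and the row/column layout described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the names makeEndTagSketch, collectTagText and toTsvRow are hypothetical, and the matching is a plain substring search that assumes opening tags without attributes.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: turn an opening tag such as "<p>" into "</p>".
std::string makeEndTagSketch(const std::string& tag) {
    if (tag.size() < 2 || tag.front() != '<') return tag;  // not a tag, return unchanged
    if (tag[1] == '/') return tag;                          // already a closing tag
    // The tag name ends at the first whitespace, '/', or '>'.
    std::size_t end = tag.find_first_of(" \t/>", 1);
    std::string name =
        tag.substr(1, end == std::string::npos ? std::string::npos : end - 1);
    return "</" + name + ">";
}

// Hypothetical helper: collect the text between every openTag ... closeTag pair.
std::vector<std::string> collectTagText(const std::string& html,
                                        const std::string& openTag) {
    std::vector<std::string> found;
    const std::string closeTag = makeEndTagSketch(openTag);
    std::size_t pos = 0;
    while ((pos = html.find(openTag, pos)) != std::string::npos) {
        std::size_t start = pos + openTag.size();
        std::size_t stop = html.find(closeTag, start);
        if (stop == std::string::npos) break;  // unterminated tag, stop scanning
        found.push_back(html.substr(start, stop - start));
        pos = stop + closeTag.size();
    }
    return found;
}

// Hypothetical helper: join one tag's instances into a single tab-separated row.
std::string toTsvRow(const std::vector<std::string>& cells) {
    std::string row;
    for (std::size_t i = 0; i < cells.size(); ++i) {
        if (i > 0) row += '\t';
        row += cells[i];
    }
    return row;
}
```

Under these assumptions, each entry in TAGS would produce one row by calling collectTagText on the stored response and writing toTsvRow of the result as a line of the output file.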

CPP LIBRARY EXAMPLE USAGE

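    #include <iostream>
    #include <string>
    #include <vector>
    #include "cppScrape.hpp"

    int main() {
        // Create a scraper, point it at a page, and register the <p> tag.
        Scrape *request = new Scrape();
        request->setURL("http://example.com");
        request->addTag("<p>");
        request->sendRequest();

        // Parse the stored response and print each returned entry.
        std::vector<std::string> output = request->parseByTags();
        for (int i = 0; i < static_cast<int>(output.size()); i++) {
            std::cout << output.at(i) << ", ";
        }

        delete request;
        return 0;
    }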

WRAPPER EXECUTABLE USAGE

If you want to run this utility as a command, download and compile the included main.hpp. Usage of the resulting executable is as follows:

  • ./cppScrape <url> : uses the default TAGS and FILENAME to generate out.tsv from the given URL

  • ./cppScrape <url>, <html tags> : generates an out.tsv from the given URL and formats it using the given HTML tags
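One way the two invocation forms above could be mapped onto a URL and a tag list is sketched below. The CliOptions struct and parseCli function are hypothetical and are not taken from the included main.hpp. The sketch assumes the first argument is the URL (with an optional trailing comma) and every later argument is one tag. An empty tag list stands for "use the default TAGS".

```cpp
#include <string>
#include <vector>

// Hypothetical container for the command-line settings.
struct CliOptions {
    std::string url = "http://example.com";  // documented default URL
    std::vector<std::string> tags;           // empty means "use default TAGS"
};

// Hypothetical parser: argv[1] is the URL, argv[2..] are HTML tags.
CliOptions parseCli(int argc, const char* const argv[]) {
    CliOptions opts;
    if (argc > 1) {
        opts.url = argv[1];
        // Tolerate the "<url>," form by dropping a trailing comma.
        if (!opts.url.empty() && opts.url.back() == ',') opts.url.pop_back();
    }
    for (int i = 2; i < argc; ++i) {
        opts.tags.push_back(argv[i]);
    }
    return opts;
}
```

The result could then be fed to setURL() and addTag() before calling sendRequest() and parseByTags(). When passing tags on a real shell, quote them (for example '<p>') so that < and > are not treated as redirections.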
