GitHub - Jurorno9/Web-C-Scrape: a simple cpp script to extract content from webpages, sort and parse specific tag's content into a .tsv file

Introduction

cppScrape is a basic and flexible web scraper made with C++ and curl, designed to be usable as a minimal web browser.

CPP LIBRARY FUNCTIONS

Below are the functions and members provided by the Scrape object.

Scrape *name = new Scrape() : initializes a new Scrape object
Scrape->setURL(std::string name) : sets the target webpage to get HTML data from
Scrape->setFileName(std::string filename) : sets the name of the .tsv file output by parseByTags()
Scrape->addTag(std::string tag) : adds a line to the bottom of the output file, where the contents of HTML tags matching tag will be stored
Scrape->addTag(std::string tag, int index) : inserts a line into the output file at index, where the contents of the corresponding tag will be stored
Scrape->URL : string containing the current target URL; the default value is "http://example.com"
Scrape->FILENAME : string containing the filename of the output file; the default value is "out.tsv"
Scrape->TAGS : a string vector containing all HTML tags, with each item denoting a row in the out.tsv file written by parseByTags()
Scrape->makeEndTag(std::string tag) : converts any HTML tag passed to it into a valid closing version of the tag (for example, <p> to </p>)
Scrape->sendRequest() : sends a GET request to the webpage at URL and stores the response; returns the raw response as a string
Scrape->parseByTags() : sorts the stored response and makes a .tsv file in which:
1. the filename is the value of FILENAME
2. each row stores the values of its corresponding tag in TAGS (i.e. an h1 tag at index 2 turns row 2 of the .tsv file into an h1 row)
3. each column contains one instance of the tag (i.e. {text from header 1 on page} \t {text from a different header on page})
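
The row and column rules above can be sketched as follows. This is a minimal illustration and not the code in cppScrape.hpp: the helper names makeEndTagSketch, extractTagContents and buildTsvSketch are hypothetical, and the matching is a plain substring search that assumes each tag appears in the page exactly as it was passed in (no attributes, no nesting).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for makeEndTag(): "<p>" -> "</p>".
// Returns an empty string when the input does not look like an opening tag.
std::string makeEndTagSketch(const std::string& tag) {
    if (tag.size() < 3 || tag.front() != '<' || tag[1] == '/') return "";
    // The tag name ends at the first space (attributes follow) or at '>'.
    const std::size_t end = tag.find_first_of(" >", 1);
    if (end == std::string::npos || end == 1) return "";
    return "</" + tag.substr(1, end - 1) + ">";
}

// Collects the text between every "<tag>" ... "</tag>" pair, in page order.
std::vector<std::string> extractTagContents(const std::string& html,
                                            const std::string& tag) {
    std::vector<std::string> found;
    const std::string endTag = makeEndTagSketch(tag);
    if (endTag.empty()) return found;

    std::size_t pos = 0;
    while ((pos = html.find(tag, pos)) != std::string::npos) {
        const std::size_t start = pos + tag.size();
        const std::size_t stop = html.find(endTag, start);
        if (stop == std::string::npos) break;  // unterminated tag: stop scanning
        found.push_back(html.substr(start, stop - start));
        pos = stop + endTag.size();
    }
    return found;
}

// Builds the TSV text: one row per entry in tags (in vector order),
// one tab-separated column per instance of that tag.
std::string buildTsvSketch(const std::string& html,
                           const std::vector<std::string>& tags) {
    std::string tsv;
    for (const std::string& tag : tags) {
        const std::vector<std::string> cells = extractTagContents(html, tag);
        for (std::size_t i = 0; i < cells.size(); ++i) {
            if (i > 0) tsv += '\t';
            tsv += cells[i];
        }
        tsv += '\n';
    }
    return tsv;
}
```

For the input `<h1>Title</h1><p>one</p><p>two</p>` with the tags `<h1>` and `<p>`, buildTsvSketch returns two rows: `Title`, then `one` and `two` separated by a tab. Swapping the order of the tags swaps the rows, which is the effect the index argument of addTag is described as having.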

CPP LIBRARY EXAMPLE USAGE

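    #include "cppScrape.hpp"
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // Create the scraper and point it at the target page.
        Scrape *request = new Scrape();
        request->setURL("http://example.com");
        request->addTag("<p>");

        // Fetch the page, then parse it by the registered tags.
        request->sendRequest();
        std::vector<std::string> output = request->parseByTags();

        for (int i = 0; i < static_cast<int>(output.size()); i++) {
            std::cout << output.at(i) << ", ";
        }

        delete request;
        return 0;
    }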

WRAPPER EXECUTABLE USAGE

If you want to run this utility as a command, download and compile the included main.cpp. Usage of the executable is then as follows:

./cppScrape <url> : uses the default TAGS and FILENAME to generate out.tsv from the given url
./cppScrape <url>, <html tags> : generates an out.tsv from the given url and formats it using the given html tags
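
How the wrapper reads these arguments is not documented here, so the following is only a sketch of one way the two forms could be told apart. CliOptions and parseCliArgs are hypothetical names, not taken from main.cpp, and stripping a trailing comma from the url is an assumption based on the usage line above.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the wrapper's settings; not a type from main.cpp.
struct CliOptions {
    std::string url;
    std::vector<std::string> tags;  // empty: fall back to the default TAGS
};

// args holds the command-line arguments without the program name.
// Returns false when no url was supplied.
bool parseCliArgs(const std::vector<std::string>& args, CliOptions& out) {
    if (args.empty()) return false;

    out.url = args[0];
    // Assumption: tolerate the comma that follows <url> in the usage line.
    if (!out.url.empty() && out.url.back() == ',') out.url.pop_back();

    // Every remaining argument is treated as one html tag.
    out.tags.assign(args.begin() + 1, args.end());
    return !out.url.empty();
}
```

Inside main(int argc, char *argv[]), the argument vector could be built with `std::vector<std::string> args(argv + 1, argv + argc);`. In most shells, tags such as `<p>` need to be quoted on the command line, because unquoted `<` and `>` are interpreted as redirection.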
