
Basic tutorial notebook extended #1

Open · wants to merge 7 commits into main

Conversation

@leriomaggio commented Mar 10, 2021

Description

This PR extends and complements the basic tutorial on Apache Beam, adding thorough explanations and references for the preparatory execution framework used in the exercises, as well as for the Apache Beam programming model and its main design components.

The notebook has been optimised for Google Colab, making extensive use of forms to hide tedious and irrelevant details (mostly related to the setup) and to hide hints and solutions for the exercise parts.

Conversely, all the coding parts considered relevant to the learning journey into apache_beam features have been intentionally left visible.

The most relevant features included in this PR:

  • The Netflix-Prize dataset is now downloaded directly from Kaggle using the official kaggle Python API
  • The default dataset size is fixed at 10K lines, but it can easily be customised
  • Data objects and custom PTransforms have been slightly optimised (in terms of Python code), and made compliant with the Python multiprocessing execution environment
    • This particularly affects the custom dataclass objects (saved in a separate Python module) and the corresponding beam.coders.Coder implementation
  • Execution of the beam.Pipeline is configured (by default) to exploit the Python multiprocessing environment and leverage multiple cores - useful when running this notebook on a larger version of the dataset and/or in an environment with more than 2 CPUs (the Colab default)
  • A few formatting changes have been applied to the text of the exercises
  • The code in the provided solutions for the exercises has been adapted to the new execution framework set up in the first part of the notebook
  • Thorough documentation references and quite detailed explanations of the analysed dataset, and of all the apache_beam features and components used throughout the notebook
  • (Last but not least) the tutorial has been prepared to run smoothly both on a (local/standard) Jupyter notebook and on Google Colab
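As a sketch of the dataclass/coder point above: a custom `beam.coders.Coder` essentially wraps an encode/decode pair over the dataclass. The `MovieView` fields below are illustrative assumptions, not the notebook's actual definition:

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MovieView:
    # Hypothetical shape; the notebook's real dataclass may differ.
    movie_id: int
    user_id: int
    rating: int


def encode_movie_view(view: MovieView) -> bytes:
    """Serialise to bytes, as a Coder.encode() would."""
    return json.dumps(asdict(view)).encode("utf-8")


def decode_movie_view(payload: bytes) -> MovieView:
    """Deserialise from bytes, as a Coder.decode() would."""
    return MovieView(**json.loads(payload.decode("utf-8")))
```

In the notebook these two methods would live on a `beam.coders.Coder` subclass registered for the dataclass; keeping the dataclass in a separate module is what makes it importable (and hence picklable) by multiprocessing workers.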

Affected Dependencies

None - all the required packages are installed automatically in the notebook. This includes apache_beam and kaggle (the latter is already available on Colab, but needs configuring).

How has this been tested?

  • The notebook has been tested on Google Colab, as well as on a local Jupyter notebook server on a MacBook Pro laptop with 8 cores.
  • A few performance tests have also been carried out (on both the laptop and on Colab) to compare the time required by the supported execution modes on increasing dataset sizes (from 10K to 10M lines). Reproducing those tests is easy thanks to the dataset-size dropdown selection.

First draft of the extended version of the tutorial. This commit is
still WIP and is going to be released to be tested on Colab (also for
performance)

It includes restructured content, an extended description of the
dataset, and a short intro to the Apache Beam APIs.

Download of the full dataset from Kaggle

API support with Python Multiprocessing to circumvent the GIL
(To be tested for performance on Colab)
@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@dvadym (Collaborator) commented Mar 12, 2021

Wow, thanks for contributing Valerio! I've had a quick look, but I don't have much time right now for a thorough review. I'll get back to it early next week.

@leriomaggio (Author)

> Wow, thanks for contributing Valerio! I've had a quick look, but I don't have much time right now for a thorough review. I'll get back to it early next week.

Hi @dvadym, thanks a lot for the feedback: glad you appreciated it.

Following up on a quick chat I had on Slack with @chinmayshah99, I am also going to share here a few more details about some performance tests I ran on my laptop (MacBook Pro, 8 cores) with increasing dataset sizes (and different direct_running_mode settings).

On Colab, on the other hand, the different running modes do not make any difference, as the default VM offers only 2 cores (so no real gain from parallelisation of tasks).

However, I worked on the tutorial assuming that the notebook could be executed either on Colab or in a Jupyter session.
Besides, I personally think this is a good topic to include in a tutorial for a framework like Apache Beam (at least, it was fun for me to explore the internals and the execution framework a little while working on the PR) :)

Looking forward to receiving your feedback 🙌

@leriomaggio (Author) commented Mar 15, 2021

As promised, I am also sharing (attached to this PR) the results of a few experiments I ran on my laptop, executing the notebook with the three direct_running_mode configurations and multiple dataset sizes.

Premise

Those experiments were originally motivated by my intention to find the combination of those parameters that leads to a reasonable running time on Colab. However, as already pointed out in my previous comment, supporting different direct_running_mode values on Colab is quite pointless: the (very limited) 2 cores on the default VM instance yield no appreciable difference, nor any gain in performance.

However, the scenario is completely different when running the notebook in a (slightly) better-equipped computing environment. The gathered results reveal a quite interesting pattern IMHO, and they also motivated my re-implementation of the run_pipeline function to allow for customisations.
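For context, the three configurations compared below are selected via the DirectRunner pipeline options. A minimal sketch of assembling those flags - the helper function itself is hypothetical, while `--direct_running_mode`, `--direct_num_workers`, and `--job_server_timeout` are (to my knowledge) the actual option names:

```python
def direct_runner_flags(mode: str, num_workers: int = 0,
                        job_server_timeout: int = 60) -> list:
    """Build the DirectRunner flags for one of the three execution modes."""
    assert mode in ("in_memory", "multi_threading", "multi_processing")
    return [
        f"--direct_running_mode={mode}",
        f"--direct_num_workers={num_workers}",         # 0 = one worker per available core
        f"--job_server_timeout={job_server_timeout}",  # seconds (60 is the default)
    ]
```

These flags would then be passed to `apache_beam.options.pipeline_options.PipelineOptions(flags)` when constructing the `beam.Pipeline`.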

Experiments

Case 1: 1M lines Dataset (~1% of total size)

The following picture shows the running times for the first count_all_views function executed with the three direct_running_mode configurations:

[figure: running times, 1M-lines dataset]

This very first experiment shows a preliminary performance gain of the multiprocessing-based approach over the single-process (in-memory) approach.

Interestingly, the multi_threading mode takes even longer than in_memory - which makes total sense, as there is presumably extra overhead in communication between the multiple threads.
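This GIL effect can be reproduced with the standard library alone; a small illustrative sketch (not notebook code):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy(n: int) -> int:
    # Pure-Python CPU-bound work: threads running this serialise on the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_in_threads(n: int, workers: int = 4):
    """Run the CPU-bound task in a thread pool and time it."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(busy, [n] * workers))
    return results, time.perf_counter() - start
```

On a CPU-bound task like this, the threaded version takes roughly as long as running the tasks sequentially (plus coordination overhead), whereas the same workload on a `ProcessPoolExecutor` can actually run in parallel - the same effect observed above with multi_threading vs multi_processing.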

Case 2: 10M lines Dataset (~10% of total size)

With a dataset size of 10M lines the scenario changes a bit, showing another interesting effect of the multiple execution modes.

[figure: running times, 10M-lines dataset]

The leap in performance between the two execution modes becomes substantially more evident with a dataset that is 10x bigger.

However, in the multi-threaded case:
[figure: multi-threading error]

The execution did not complete successfully, failing with a "too many pings" error in the grpc backend.
I tried to dig a little into the issue, and the only thing I could find was a (closed) issue on grpc #2444 specifically mentioning the Python client and grpc 1.8. In my environment I am currently using grpc 1.36.1.

To my understanding, the issue is caused by some processing not yet completed, which makes the worker threads time out.

In more detail: from what I understood about the general execution framework in Apache Beam, the main worker process always spawns auxiliary threads (i.e. pthreads at the C level via grpc) to handle parallel (I/O?) operations on the parallel collection(s). This holds regardless of the selected direct_running_mode.

Digging a little into the PipelineOptions, I've discovered a job-server-timeout option which (supposedly) controls the timeout of those job worker threads; it is set to 60 secs by default.
Further corroborating this hypothesis, the same issue also happens with the multi_processing running mode. In fact, I had to raise this parameter's default to 65,536 to stop the issue from happening with multi_processing on the earlier 1M-lines dataset.

In this particular case (the 10M-lines dataset) that value was still enough for multi_processing and in_memory, but not for multi_threading, for which I had to increase it by a further 2-4x to complete the execution (see below).

[figure: running times, 10M-lines dataset, multi_threading]

AGAIN (as expected) multi_threading was slightly worse than in_memory, since Python doesn't handle CPU-bound threads well and extra time is wasted in communication.

Considerations and Take-Away Messages

After these two experiments, and looking at the performance gain across the multiple configurations, I think it is fair to conclude that execution time scales up linearly w.r.t. the dataset size. So, in multi_processing mode:

  1M ==> 11 s
 10M ==> 1  m
100M ==> 10 m

whereas with multi_threading the total completion time would presumably add up to ~1 h on the 100M-lines (almost full) dataset.

The `run_pipeline` function has been updated with a detailed and
previously missing docstring.

The function now also includes a `verbose` parameter which controls
whether the function should also print the execution running time.
By default, verbosity is set to False, so no information is printed
after each execution/invocation.
@dvadym (Collaborator) commented Mar 17, 2021

Sorry, I haven't had time yet

@leriomaggio (Author)

> Sorry, I haven't had time yet

No problem at all, I totally understand.

I have been meaning to open another PR (as a follow-up to this one) to align the second notebook of the tutorial, but I could not finish that either.

Maybe it would also be useful to get feedback on this one first, and then finish the other notebook anyway :)

@dvadym (Collaborator) commented Mar 17, 2021

Yeah, I think it's definitely worth reviewing the 1st Colab first and then making changes in the 2nd Colab.

Thanks again for contributing!

@dvadym (Collaborator) left a comment

Thanks a lot for the improvements! Sorry for the late reply, I've been finishing other projects. From now on I'll try to keep review latency to no more than 1-2 working days.

I've left comments, please check. I like this PR's improvements; most of my comments are about hiding some cells by default and merging some code cells. It might be too much information for the user if everything is open by default :)

"## The Execution Framework\n",
"\n",
"In this section we will define the core main components that will be used throughout the exercises. These components will be based on **Apache Beam**, which consitutes the reference computational framework, and provides the building blocks to define our data workflows. "
@dvadym (Collaborator):

typo

@leriomaggio (Author) commented May 26, 2021

Sorry @dvadym, which one? I could not see any typo. Reviewing notebooks via GitHub is indeed very painful! :)
Trying to gather all your useful comments.. thanks!

@dvadym (Collaborator):

Sorry, I probably located the position incorrectly in the .ipynb file when mapping from Colab. I can't find it now. Never mind this comment.

Yeah, reviewing notebooks in Github is painful :(

"source": [
"@beam.typehints.with_output_types(MovieView) # type-hint annotation for the output\n",
"class ParseMovieViews(beam.DoFn):\n",
@dvadym (Collaborator):

Could you please add `#@title ....` and hide the code by default? (It might be too much information for the user.)

"source": [
"@beam.typehints.with_output_types(MovieTitle) # type-hint annotation for the output\n",
"class ParseMovieTitles(beam.DoFn):\n",
@dvadym (Collaborator):

Could you please add `#@title ....` and hide the code by default? (It might be too much information for the user.)

"outputs": [],
"source": [
"def netflix_movie_views_collection(p: beam.Pipeline, data_file: str = DATA_FILE) -> beam.PCollection[MovieView]:\n",
@dvadym (Collaborator):

Please combine both functions in one cell (it's better for the user to run fewer cells), and could you please add `#@title ....` and hide the code by default? (It might be too much information for the user.)

"metadata": {},
"source": [
"##### Workaround `multiprocessing` issues with namespace"
@dvadym (Collaborator):

Please add "optional" (noting that it is not effective on Google Colab) to the name of this subsection, and hide the content of this subsection by default.

"from apache_beam.options.pipeline_options import PipelineOptions\n",
"\n",
"def run_pipeline(pipeline_fn: PipelineModule, \n",
@dvadym (Collaborator):

Could you please add `#@title ....` and hide the code by default? (It might be too much information for the user.)
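For reference, the `#@title` directive requested above is Colab's form syntax: a cell whose first line carries `#@title` collapses to its title, and `#@param` renders a dropdown. A minimal illustrative cell (the title, variable name, and option values are placeholders, not the notebook's actual ones):

```python
#@title Setup: pipeline helpers { display-mode: "form" }
# In Colab, the `{ display-mode: "form" }` suffix hides this code by default;
# the user sees only the title and a "Show code" toggle. Outside Colab these
# directives are plain Python comments, so the cell still runs unchanged.
dataset_size = "10K"  #@param ["10K", "1M", "10M"]
print(f"Selected dataset size: {dataset_size}")
```

This is also why the tutorial runs on both Colab and plain Jupyter: in Jupyter the form markers are simply inert comments.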

zachferr added a commit that referenced this pull request Jun 21, 2021