
Basic tutorial notebook extended #1

Open · wants to merge 7 commits into main

Conversation

@leriomaggio commented Mar 10, 2021

Description

This PR extends and complements the basic tutorial on Apache Beam, adding thorough explanations and references for the preparatory execution framework used in the exercises, as well as for the Apache Beam programming model and its main design components.

The notebook has been optimised for Google Colab, making extensive use of forms to hide tedious and irrelevant details (mostly related to the setup) and to hide hints and solutions for the exercise parts.

Conversely, all the coding parts considered relevant to the learning journey into apache_beam features have been intentionally left visible.

The most relevant features included in this PR:

  • The Netflix-Prize dataset is now downloaded directly from Kaggle using the official kaggle Python API
  • The default dataset size is fixed at 10K lines, but it can easily be customised
  • Data objects and custom PTransforms have been slightly optimised (in terms of Python code), and made compliant with the Python multiprocessing execution environment
    • This particularly affects the custom dataclass objects (saved in a separate Python module) and the corresponding beam.coders.Coder implementation
  • Execution of the beam.Pipeline is configured (by default) to exploit the Python multiprocessing environment and leverage multiple cores - useful when running this notebook on a larger version of the dataset and/or in an environment with more than 2 CPUs (the Colab default)
  • A few formatting changes have been applied to the text of the exercises
  • The code in the provided solutions for the exercises has been adapted to the new execution framework set up in the first part of the notebook
  • Thorough documentation references and quite detailed explanations of the analysed dataset, and of all the apache_beam features and components used throughout the notebook
  • (Last but not least) the tutorial has been prepared to run smoothly both on a (local/standard) Jupyter notebook and on Google Colab
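As a sketch of the dataclass/coder point above: a custom `beam.coders.Coder` essentially wraps an encode/decode pair over the dataclass. The `MovieView` fields below are illustrative assumptions, not the notebook's actual definition:

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MovieView:
    # Hypothetical shape; the notebook's real dataclass may differ.
    movie_id: int
    user_id: int
    rating: int


def encode_movie_view(view: MovieView) -> bytes:
    """Serialise to bytes, as a Coder.encode() would."""
    return json.dumps(asdict(view)).encode("utf-8")


def decode_movie_view(payload: bytes) -> MovieView:
    """Deserialise from bytes, as a Coder.decode() would."""
    return MovieView(**json.loads(payload.decode("utf-8")))
```

In the notebook these two methods would live on a `beam.coders.Coder` subclass registered for the dataclass; keeping the dataclass in a separate module is what makes it importable (and hence picklable) by multiprocessing workers.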

Affected Dependencies

None - all the required packages are installed automatically in the notebook. This includes apache_beam and kaggle (the latter is already available on Colab, but needs configuring).

How has this been tested?

  • The notebook has been tested on Google Colab, as well as on a local Jupyter notebook server on a MacBook Pro laptop with 8 cores.
  • A few performance tests have also been carried out (on both the laptop and on Colab) to compare the time required by the supported execution modes on increasing dataset sizes (from 10K to 10M lines). Reproducing those tests is easy thanks to the dataset-size dropdown selection.

First draft of the extended version of the tutorial. This commit is
still WIP and is going to be released to be tested on Colab (also for
performance)

It includes restructured content, an extended description of the
dataset, and a short intro to the Apache Beam APIs.

Download of the full dataset from Kaggle

API support with Python Multiprocessing to circumvent the GIL
(To be tested for performance on Colab)
@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@dvadym (Collaborator) commented Mar 12, 2021

Wow, thanks for contributing Valerio! I've had a quick look, but I don't have much time right now for a thorough review. I'll get back to it early next week.

@leriomaggio (Author)

> Wow, thanks for contributing Valerio! I've had a quick look, but I don't have much time right now for a thorough review. I'll get back to it early next week.

Hi @dvadym, thanks a lot for the feedback: glad you appreciated it.

Following up on a quick chat I had on Slack with @chinmayshah99, I am also going to share here a few more details about some performance tests I ran on my laptop (MacBook Pro, 8 cores) with increasing dataset sizes (and different direct_running_mode settings).

On Colab, on the other hand, the different running modes do not make any difference, as the default VM offers only 2 cores (so no real gain from parallelisation of tasks).

However, I worked on the tutorial assuming that the notebook could be executed either on Colab or in a Jupyter session.
Besides, I personally think this is a good topic to include in a tutorial for a framework like Apache Beam (at least, it was fun for me to explore the internals and the execution framework a little while working on the PR) :)

Looking forward to receiving your feedback 🙌

@leriomaggio (Author) commented Mar 15, 2021

As promised, I am also sharing (attached to this PR) the results of a few experiments I ran on my laptop, executing the notebook with the three direct_running_mode configurations and multiple dataset sizes.

Premise

Those experiments were originally motivated by my intention to find the combination of those parameters that leads to a reasonable running time on Colab. However, as already pointed out in my previous comment, supporting different direct_running_mode values on Colab is quite pointless: the (very limited) 2 cores on the default VM instance yield no appreciable difference, nor any gain in performance.

However, the scenario is completely different when running the notebook in a (slightly) better-equipped computing environment. The gathered results reveal a quite interesting pattern IMHO, and they also motivated my re-implementation of the run_pipeline function to allow for customisations.
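For context, the three configurations compared below are selected via the DirectRunner pipeline options. A minimal sketch of assembling those flags - the helper function itself is hypothetical, while `--direct_running_mode`, `--direct_num_workers`, and `--job_server_timeout` are (to my knowledge) the actual option names:

```python
def direct_runner_flags(mode: str, num_workers: int = 0,
                        job_server_timeout: int = 60) -> list:
    """Build the DirectRunner flags for one of the three execution modes."""
    assert mode in ("in_memory", "multi_threading", "multi_processing")
    return [
        f"--direct_running_mode={mode}",
        f"--direct_num_workers={num_workers}",         # 0 = one worker per available core
        f"--job_server_timeout={job_server_timeout}",  # seconds (60 is the default)
    ]
```

These flags would then be passed to `apache_beam.options.pipeline_options.PipelineOptions(flags)` when constructing the `beam.Pipeline`.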

Experiments

Case 1: 1M lines Dataset (~1% of total size)

The following picture shows the running times for the first count_all_views function executed with the three direct_running_mode configurations:

[figure: running times, 1M-lines dataset]

This very first experiment shows a preliminary performance gain of the multiprocessing-based approach over the single-process (in-memory) approach.

Interestingly, the multi_threading mode takes even longer than in_memory - which makes total sense, as there is presumably extra overhead in communication between the multiple threads.
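This GIL effect can be reproduced with the standard library alone; a small illustrative sketch (not notebook code):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy(n: int) -> int:
    # Pure-Python CPU-bound work: threads running this serialise on the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_in_threads(n: int, workers: int = 4):
    """Run the CPU-bound task in a thread pool and time it."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(busy, [n] * workers))
    return results, time.perf_counter() - start
```

On a CPU-bound task like this, the threaded version takes roughly as long as running the tasks sequentially (plus coordination overhead), whereas the same workload on a `ProcessPoolExecutor` can actually run in parallel - the same effect observed above with multi_threading vs multi_processing.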

Case 2: 10M lines Dataset (~10% of total size)

With a dataset size of 10M lines the scenario changes a bit, showing another interesting effect of the multiple execution modes.

[figure: running times, 10M-lines dataset]

The leap in performance between the two execution modes becomes substantially more evident with a dataset that is 10x bigger.

However, in the multi-threaded case:
[figure: multi-threading error]

The execution did not complete successfully, failing with a "too many pings" error in the grpc backend.
I tried to dig a little into the issue, and the only thing I could find was a (closed) issue on grpc #2444 specifically mentioning the Python client and grpc 1.8. In my environment I am currently using grpc 1.36.1.

To my understanding, the issue is caused by some processing not yet completed, which makes the worker threads time out.

In more detail: from what I understood about the general execution framework in Apache Beam, the main worker process always spawns auxiliary threads (i.e. pthreads at the C level via grpc) to handle parallel (I/O?) operations on the parallel collection(s). This holds regardless of the selected direct_running_mode.

Digging a little into the PipelineOptions, I've discovered a job-server-timeout option which (supposedly) controls the timeout of those job worker threads; it is set to 60 secs by default.
Further corroborating this hypothesis, the same issue also happens with the multi_processing running mode. In fact, I had to raise this parameter's default to 65,536 to stop the issue from happening with multi_processing on the earlier 1M-lines dataset.

In this particular case (the 10M-lines dataset) that value was still enough for multi_processing and in_memory, but not for multi_threading, for which I had to increase it by a further 2-4x to complete the execution (see below).

[figure: running times, 10M-lines dataset, multi_threading]

AGAIN (as expected) multi_threading was slightly worse than in_memory, since Python doesn't handle CPU-bound threads well and extra time is wasted in communication.

Considerations and Take-Away Messages

After these two experiments, and looking at the performance gain across the multiple configurations, I think it is fair to conclude that execution time scales up linearly w.r.t. the dataset size. So, in multi_processing mode:

  1M ==> 11 s
 10M ==> 1  m
100M ==> 10 m

whereas with multi_threading the total completion time would presumably add up to ~1 h on the 100M-lines (almost full) dataset.

The `run_pipeline` function has been updated with a detailed and
previously missing docstring.

The function now also includes a `verbose` parameter which controls
whether the function should also print the execution running time.
By default, verbosity is set to False, so no information is printed
after each execution/invocation.
@dvadym (Collaborator) commented Mar 17, 2021

Sorry, I haven't had time yet

@leriomaggio (Author)

> Sorry, I haven't had time yet

No problem at all, I totally understand.

I have been meaning to open another PR (as a follow-up to this one) to align the second notebook of the tutorial, but I could not finish that either.

Maybe it would also be useful to get feedback on this one first, and then finish the other notebook anyway :)

@dvadym (Collaborator) commented Mar 17, 2021

Yeah, I think it's definitely worth reviewing the 1st Colab first and then making changes in the 2nd Colab.

Thanks again for contributing!

@dvadym (Collaborator) left a comment

Thanks a lot for the improvements! Sorry for the late reply, I've been finishing other projects. From now on I'll try to keep review latency to no more than 1-2 working days.

I've left comments, please check. I like this PR's improvements; most of my comments are about hiding some cells by default and merging some code cells. It might be too much information for the user if everything is open by default :)

"## The Execution Framework\n",
"\n",
"In this section we will define the core main components that will be used throughout the exercises. These components will be based on **Apache Beam**, which consitutes the reference computational framework, and provides the building blocks to define our data workflows. "
@dvadym (Collaborator):

typo

@leriomaggio (Author) commented May 26, 2021

Sorry @dvadym, which one? I could not see any typo. Reviewing notebooks via GitHub is indeed very painful! :)
Trying to gather all your useful comments.. thanks!

@dvadym (Collaborator):

Sorry, I probably located the position incorrectly in the .ipynb file when mapping from Colab. I can't find it now. Never mind this comment.

Yeah, reviewing notebooks in Github is painful :(

"source": [
"@beam.typehints.with_output_types(MovieView) # type-hint annotation for the output\n",
"class ParseMovieViews(beam.DoFn):\n",
@dvadym (Collaborator):

Could you please add `#@title ....` and hide the code by default? (It might be too much information for the user.)

"source": [
"@beam.typehints.with_output_types(MovieTitle) # type-hint annotation for the output\n",
"class ParseMovieTitles(beam.DoFn):\n",
@dvadym (Collaborator):

Could you please add `#@title ....` and hide the code by default? (It might be too much information for the user.)

"outputs": [],
"source": [
"def netflix_movie_views_collection(p: beam.Pipeline, data_file: str = DATA_FILE) -> beam.PCollection[MovieView]:\n",
@dvadym (Collaborator):

Please combine both functions in one cell (it's better for the user to run fewer cells), and could you please add `#@title ....` and hide the code by default? (It might be too much information for the user.)

"metadata": {},
"source": [
"##### Workaround `multiprocessing` issues with namespace"
@dvadym (Collaborator):

Please add "optional" (noting that it is not effective on Google Colab) to the name of this subsection, and hide the content of this subsection by default.

"from apache_beam.options.pipeline_options import PipelineOptions\n",
"\n",
"def run_pipeline(pipeline_fn: PipelineModule, \n",
@dvadym (Collaborator):

Could you please add `#@title ....` and hide the code by default? (It might be too much information for the user.)
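For reference, the `#@title` directive requested above is Colab's form syntax: a cell whose first line carries `#@title` collapses to its title, and `#@param` renders a dropdown. A minimal illustrative cell (the title, variable name, and option values are placeholders, not the notebook's actual ones):

```python
#@title Setup: pipeline helpers { display-mode: "form" }
# In Colab, the `{ display-mode: "form" }` suffix hides this code by default;
# the user sees only the title and a "Show code" toggle. Outside Colab these
# directives are plain Python comments, so the cell still runs unchanged.
dataset_size = "10K"  #@param ["10K", "1M", "10M"]
print(f"Selected dataset size: {dataset_size}")
```

This is also why the tutorial runs on both Colab and plain Jupyter: in Jupyter the form markers are simply inert comments.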

zachferr added a commit that referenced this pull request Jun 21, 2021