Basic tutorial notebook extended #1
Conversation
First draft of the extended version of the tutorial. This commit is still WIP and it is going to be released to be tested in Colab (also for performance). It includes restructured content, an extended description of the dataset, and a short intro to the Apache Beam APIs, plus download of the full dataset from the Kaggle API with Python multiprocessing support to circumvent the GIL (to be tested for performance on Colab).
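For readers following along without the notebook, here is a minimal sketch (not the notebook's exact code) of the two pieces described above: downloading the Netflix-Prize data via the official `kaggle` Python API, then handing the CPU-bound parsing to a process pool so it is not serialized by the GIL. The dataset slug is the public Kaggle one; the file names and the parsing stand-in are assumptions.

```python
import multiprocessing as mp
from kaggle.api.kaggle_api_extended import KaggleApi

def download_dataset(path: str = "data") -> None:
    """Fetch the Netflix-Prize data with the official kaggle Python API."""
    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
    api.dataset_download_files("netflix-inc/netflix-prize-data",
                               path=path, unzip=True)

def count_lines(filename: str) -> int:
    # stand-in for the real CPU-bound parsing work
    with open(filename) as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    download_dataset()
    files = [f"data/combined_data_{i}.txt" for i in range(1, 5)]  # assumed names
    # A process pool sidesteps the GIL for CPU-bound parsing; on spawn-based
    # platforms (macOS/Windows) the mapped function must live in an importable
    # module, which is the namespace issue discussed later in this thread.
    with mp.Pool() as pool:
        print(pool.map(count_lines, files))
```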
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB
Wow, thanks for contributing Valerio! I've quickly checked it, but I don't have much time now to make a thorough review. I'll return to the review early next week.
Hi @dvadym, thanks a lot for the feedback: glad you appreciated it. Following up from a quick chat I had on Slack with @chinmayshah99, I am also going to share here a few more details about some performance tests I ran on my laptop (MacBook Pro). On Colab, on the other hand, the different running modes do not make any difference, as the default VM offers only 2 CPUs. However, I worked on the tutorial assuming that the notebook could be executed either on Colab or in a local Jupyter session. Looking forward to receiving your feedback 🙌
As promised, attached to this PR I am also sharing the results of a few experiments I tried on my laptop by running the notebook with the three […].

Premise

Those experiments were originally motivated by my intention to find the best combination of those parameters that would lead to a reasonable running time on Colab. However, as already pointed out in my previous comment, including the support for different […]. However, the scenario is completely different when running the notebook in a (slightly) better-equipped computing environment. The gathered results allow us to derive a quite interesting pattern IMHO, as they also motivate my re-implementation of the […].

Experiments

Case 1: In the following picture, the different running times for the first […]. This very first experiment shows some preliminary gain in performance of the […]. Interestingly, the […].

Case 2: With a dataset size of […], the leap in performance of the two execution modes becomes substantially more evident: with a dataset size that is […]. However, in the case of […], the execution did not complete successfully, with an error message in the […]. To my understanding, the issue is generated by some processing not yet completed, which makes the […]. In more detail: from what I understood about the general execution framework in Apache Beam, the main worker process always spawns auxiliary threads (i.e. […]). Digging a little bit into the […]. In this particular case ( […] ), AGAIN (as expected) […].

Considerations and Take-Away Messages

After these two experiments, and looking at the performance gain with the multiple configurations, I think it would be fair to conclude that execution time scales up linearly w.r.t. the dataset sizes. So in a […], whereas with […].
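For readers who want to reproduce these comparisons: the execution modes discussed above map onto Beam's DirectRunner options. A minimal sketch, with a placeholder pipeline rather than the notebook's actual workflow:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner flags controlling the execution mode compared above:
#   direct_running_mode: one of in_memory, multi_threading, multi_processing
#   direct_num_workers:  0 means "use all available cores"
options = PipelineOptions([
    "--direct_running_mode=multi_processing",
    "--direct_num_workers=0",
])

with beam.Pipeline(options=options) as p:
    _ = (p
         | beam.Create(range(10))        # placeholder input
         | beam.Map(lambda x: x * x))    # placeholder CPU-bound step
```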
The `run_pipeline` function has been updated with a detailed and previously missing docstring. The function now also includes a `verbose` parameter which controls whether the function should also print out the execution running time. By default, verbosity is set to False, so no information is printed after each execution/invocation.
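For readers without the notebook open, a hypothetical sketch of what the updated function could look like; the `PipelineModule` alias and the option handling are assumptions based on the description above, not the notebook's exact code:

```python
import time
from typing import Callable, Optional

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# assumed alias: a function that attaches the workflow's transforms to a pipeline
PipelineModule = Callable[[beam.Pipeline], None]

def run_pipeline(pipeline_fn: PipelineModule,
                 options: Optional[PipelineOptions] = None,
                 verbose: bool = False) -> None:
    """Build and run a Beam pipeline.

    Args:
        pipeline_fn: function that adds the transforms to the pipeline.
        options: optional PipelineOptions forwarded to the runner.
        verbose: if True, print the execution running time; defaults to
            False, so nothing is printed after each invocation.
    """
    start = time.perf_counter()
    with beam.Pipeline(options=options) as p:
        pipeline_fn(p)
    if verbose:
        print(f"Pipeline completed in {time.perf_counter() - start:.2f}s")
```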
Sorry, I haven't yet had time.
No problem at all, I totally understand. I have been meaning to issue another PR (as a follow-up on this one) to align the second notebook of the tutorial, but I could not finish that either. Maybe it would also be useful to get the feedback on this one first, and then finish the other notebook anyway :)
Yeah, I think it's definitely worth reviewing the 1st Colab first and then making changes in the 2nd Colab. Thanks again for contributing!
Thanks a lot for the improvements! Sorry for the late reply, I've been finishing other projects. Going forward, I'll try to keep review latency to no more than 1-2 working days.
I've left comments, please check. I like this PR's improvements; most of my comments are about hiding some cells by default and merging some code cells. It might be too much information for the user if everything is open by default :)
"## The Execution Framework\n", | ||
"\n", | ||
"In this section we will define the core main components that will be used throughout the exercises. These components will be based on **Apache Beam**, which consitutes the reference computational framework, and provides the building blocks to define our data workflows. " |
typo
Sorry @dvadym, which one? I could not see any typo. Reviewing notebooks via GitHub is indeed very painful! :)
Trying to recollect all your useful comments... thanks!
Sorry, I probably incorrectly located the position in the .ipynb file when mapping from Colab. I can't find it now. Never mind about this comment.
Yeah, reviewing notebooks in GitHub is painful :(
"source": [ | ||
"@beam.typehints.with_output_types(MovieView) # type-hint annotation for the output\n", | ||
"class ParseMovieViews(beam.DoFn):\n", |
Could you please add
#@title ....
and hide the code by default (it might be too much information for the user)
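For context, in Colab a leading `#@title` comment turns the cell into a titled form whose code can be collapsed by default, which is what the reviewer is asking for. Applied to the cell quoted above, it would look roughly like this; the `MovieView` stand-in and the CSV layout are assumptions, not the notebook's actual definitions:

```python
#@title Parse raw lines into `MovieView` records (hidden by default)
import typing
import apache_beam as beam

class MovieView(typing.NamedTuple):  # stand-in for the notebook's dataclass
    movie_id: int
    user_id: int
    rating: int

@beam.typehints.with_output_types(MovieView)  # type-hint annotation for the output
class ParseMovieViews(beam.DoFn):
    def process(self, line: str):
        movie_id, user_id, rating = line.split(",")  # assumed CSV layout
        yield MovieView(int(movie_id), int(user_id), int(rating))
```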
"source": [ | ||
"@beam.typehints.with_output_types(MovieTitle) # type-hint annotation for the output\n", | ||
"class ParseMovieTitles(beam.DoFn):\n", |
Could you please add
#@title ....
and hide the code by default (it might be too much information for the user)
"outputs": [], | ||
"source": [ | ||
"def netflix_movie_views_collection(p: beam.Pipeline, data_file: str = DATA_FILE) -> beam.PCollection[MovieView]:\n", |
Please combine both functions in one cell (it's better for the user to run fewer cells) and
could you please add
#@title ....
and hide the code by default (it might be too much information for the user)
"metadata": {}, | ||
"source": [ | ||
"##### Workaround `multiprocessing` issues with namespace" |
Please add "optional", and that it's not effective on Google Colab, to the name of this subsection, and hide the content of this subsection by default.
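For readers unfamiliar with the subsection under discussion: with the spawn start method, `multiprocessing` workers re-import code instead of inheriting the notebook's `__main__` namespace, so classes defined in cells cannot be unpickled there. The usual notebook workaround, presumably what this subsection covers, is persisting the shared definitions to a real module first. A sketch, with illustrative field names:

```python
%%writefile data_models.py
# Shared dataclasses live in an importable module so that multiprocessing
# workers (and Beam's multi_processing mode) can unpickle them; objects
# defined only in the notebook's __main__ namespace would fail to load.
# Field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class MovieView:
    movie_id: int
    user_id: int
    rating: int
```

A later cell then does `from data_models import MovieView` instead of defining the class inline.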
"from apache_beam.options.pipeline_options import PipelineOptions\n", | ||
"\n", | ||
"def run_pipeline(pipeline_fn: PipelineModule, \n", |
Could you please add
#@title ....
and hide the code by default (it might be too much information for the user)?
Description
This PR extends and complements the basic tutorial on Apache Beam by integrating thorough explanations and references to the preparatory execution framework for the exercises, as well as to the Apache Beam programming model and its main design components.
The notebook has been optimised for Google Colab, making extensive use of forms to hide tedious and irrelevant details (mostly related to the setup), and to hide hints and solutions for the exercise parts.
Conversely, all the coding bits which are considered relevant to the learning journey into `apache_beam` features have been intentionally left visible.
A list of the few most relevant features that have been included in the PR:

- The `Netflix-Prize` dataset is now directly downloaded from Kaggle using the `kaggle` official Python API.
- The dataset size is limited to `10K` lines by default, but it could be easily customised.
- `PTransformers` have been slightly optimised (in terms of Python code) as well as made compliant with the Python `multiprocessing` execution environment.
- `dataclass` objects (saved into a separate Python module) and a corresponding `beam.coders.Coder` implementation (see the sketch after this list).
- `beam.Pipeline` is configured to exploit (as default) the `multiprocessing` Python environment to leverage multiple cores, if running this notebook on a larger version of the dataset and/or in an environment with more than `2 CPUs` (as in default Colab).
- A short intro to the `apache_beam` APIs used throughout the notebook.
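On the `beam.coders.Coder` item above, a minimal sketch of what pairing a dataclass with a custom coder can look like; the field names and the CSV encoding are assumptions, not necessarily the notebook's implementation:

```python
from dataclasses import dataclass

import apache_beam as beam

@dataclass
class MovieView:  # assumed fields
    movie_id: int
    user_id: int
    rating: int

class MovieViewCoder(beam.coders.Coder):
    """Encodes a MovieView as a utf-8 CSV line."""

    def encode(self, value: MovieView) -> bytes:
        return f"{value.movie_id},{value.user_id},{value.rating}".encode("utf-8")

    def decode(self, encoded: bytes) -> MovieView:
        movie_id, user_id, rating = encoded.decode("utf-8").split(",")
        return MovieView(int(movie_id), int(user_id), int(rating))

    def is_deterministic(self) -> bool:
        return True

# use this coder whenever a PCollection carries MovieView elements
beam.coders.registry.register_coder(MovieView, MovieViewCoder)
```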
Affected Dependencies
None - all the required packages are installed automatically in the notebook. This includes `apache_beam` and `kaggle` (which is already available in Colab, but needs configuring).
How has this been tested?
Tested with multiple dataset sizes (from `10K` to `10M` lines). Reproducing those tests is quite easy given the fast dataset size selection as a dropdown list.
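The dropdown mentioned above is Colab's `#@param` form syntax; a hypothetical cell along these lines, where the exact size labels and variable names are assumptions:

```python
#@title Dataset size { display-mode: "form" }
DATASET_SIZE = "10K"  #@param ["10K", "100K", "1M", "10M"]

# map the human-friendly label to a number of lines to read from the raw files
SIZE_TO_LINES = {"10K": 10_000, "100K": 100_000, "1M": 1_000_000, "10M": 10_000_000}
NUM_LINES = SIZE_TO_LINES[DATASET_SIZE]
```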