ISSUE-15 Augment README with installation and examples documentation #16

Open
wants to merge 1 commit into
base: master
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ Hadoop.egg-info/
__pycache__
build/
*.pyc
html
44 changes: 34 additions & 10 deletions README.md
@@ -1,16 +1,40 @@
# Python-Hadoop

This library was cloned from [here](https://github.com/matteobertozzi/Hadoop/tree/master/python-hadoop). There have been modifications to make it work with Python3.
A pure Python [Hadoop SequenceFile](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/SequenceFile.html) Reader and Writer implementation with no Java dependency.

### Original README
Pure Python SequenceFile Reader and Writer implementation
that allows you to read and write your Hadoop sequence files
without using java.
# Installation

Author: Matteo Bertozzi <[email protected]>
From source:
```bash
pip install .
```

Contributors:
# API Documentation

* Brian Bloniarz <[email protected]>
* Alex Roper <[email protected]>
* Jeremy G. Kahn <[email protected]>
```bash
# install pdoc - https://pdoc3.github.io/pdoc/
pdoc --html hadoop --skip-errors # skip errors due to broken dependency on pydoop (see below)
open html/hadoop/index.html
```

# Reading and writing SequenceFiles

[hadoop/io/SequenceFile](https://github.com/opaque-systems/sequencefile/blob/master/hadoop/io/SequenceFile.py) provides Reader, Writer and Metadata interfaces. See the examples below for more details.
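
For orientation, the file header that a SequenceFile reader or writer has to handle can be sketched with the Python standard library alone. The sketch below does not use this repository's code or API. The function names (`build_header`, `parse_header`, and the string helpers) are hypothetical, and it assumes an uncompressed version 6 file with no metadata entries and class names of at most 127 bytes. The layout follows the Hadoop SequenceFile documentation linked above.

```python
import io
import struct

MAGIC = b"SEQ"   # every SequenceFile starts with these three bytes
VERSION = 6      # version byte assumed by this sketch


def write_short_string(out, text):
    """Write a length-prefixed UTF-8 string (single-byte length only)."""
    data = text.encode("utf-8")
    if len(data) > 127:
        raise ValueError("sketch only handles strings up to 127 bytes")
    out.write(bytes([len(data)]))
    out.write(data)


def read_short_string(stream):
    """Read a string written by write_short_string."""
    length = stream.read(1)[0]
    if length > 127:
        raise ValueError("multi-byte lengths are not handled in this sketch")
    return stream.read(length).decode("utf-8")


def build_header(key_class, value_class, sync):
    """Build an uncompressed header with zero metadata entries."""
    if len(sync) != 16:
        raise ValueError("sync marker must be 16 bytes")
    out = io.BytesIO()
    out.write(MAGIC + bytes([VERSION]))
    write_short_string(out, key_class)
    write_short_string(out, value_class)
    out.write(b"\x00")               # record compression flag: off
    out.write(b"\x00")               # block compression flag: off
    out.write(struct.pack(">i", 0))  # metadata entry count, big-endian int
    out.write(sync)                  # sync marker closes the header
    return out.getvalue()


def parse_header(data):
    """Parse a header produced by build_header."""
    stream = io.BytesIO(data)
    if stream.read(3) != MAGIC:
        raise ValueError("not a SequenceFile")
    version = stream.read(1)[0]
    key_class = read_short_string(stream)
    value_class = read_short_string(stream)
    compressed = stream.read(1) != b"\x00"
    block_compressed = stream.read(1) != b"\x00"
    if compressed or block_compressed:
        raise ValueError("compressed files are not handled in this sketch")
    (metadata_count,) = struct.unpack(">i", stream.read(4))
    if metadata_count != 0:
        raise ValueError("metadata entries are not handled in this sketch")
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "sync": stream.read(16),
    }
```

Round-tripping a header through `build_header` and `parse_header` is a quick way to check the field order. Real files may use compression, metadata, and longer names, none of which this sketch covers.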

# HDFS Integration

**HDFS integration is currently broken because the underlying pydoop library is unmaintained.**

[hadoop.pydoop.reader.SequenceFileReader](https://github.com/opaque-systems/sequencefile/blob/master/hadoop/pydoop/reader.py) is intended to read sequence files from HDFS. It relies on an extra dependency, [pydoop](http://crs4.github.io/pydoop/index.html).

# Examples

Several [examples](https://github.com/opaque-systems/sequencefile/tree/master/examples) are provided to demonstrate API usage. With the exception of the [SequenceFileReader example](https://github.com/opaque-systems/sequencefile/blob/master/examples/SequenceFileReader.py), all examples are self-contained.

# Credits

This source was originally cloned from the [matteobertozzi/Hadoop python-hadoop](https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/README) source tree and then modified for Python3 compatibility.

# License

[Apache License v2](https://github.com/opaque-systems/sequencefile/blob/master/LICENSE), a copy of which ships with this codebase.