This repository was archived by the owner on Aug 25, 2024. It is now read-only.

source: Labeled and Versioned datasets #9

Closed
johnandersen777 opened this issue Mar 9, 2019 · 2 comments
Labels
enhancement New feature or request gsoc Google Summer of Code related project Issues which will take a while to complete

Comments

@johnandersen777

johnandersen777 commented Mar 9, 2019

Assignee: @sudharsana-kjl

DFFML is hoping to participate in Google Summer of Code (GSoC) under the Python Software Foundation umbrella. You can read all about what this means at http://python-gsoc.org/. This issue, and any others tagged gsoc and project, are not generally available bugs; they relate to project ideas for GSoC.

Project Idea: Labeled and Versioned Datasets.

Project description:
DFFML's initial release includes sources which abstract the format in which the data is stored from the dataset generation and usage in models.

Add label and version information so that users can keep different datasets, and different versions of each dataset, in the same source.

Skills: Python, git
Difficulty level: Intermediate

Related Readings/Links:

class Source(abc.ABC, Entrypoint):
    '''
    Abstract base class for all sources. New sources must be derived from this
    class and implement the repos method.
    '''

    ENTRY_POINT = 'dffml.source'

    def __init__(self, src: str) -> None:
        self.src = src

    @abc.abstractmethod
    async def update(self, repo: Repo):
        '''
        Updates a repo for a source
        '''

    @abc.abstractmethod
    async def repos(self) -> AsyncIterator[Repo]:
        '''
        Returns a list of repos retrieved from self.src
        '''
        # mypy ignores AsyncIterator[Repo], therefore this is needed
        yield Repo('') # pragma: no cover

    @abc.abstractmethod
    async def repo(self, src_url: str):
        '''
        Get a repo from the source or add it if it doesn't exist
        '''

dffml/dffml/repo.py

Lines 90 to 116 in dd8007d

class Repo(object):
    '''
    Manages feature independent information and actions for a repo.
    '''

    REPO_DATA = RepoData

    def __init__(self, src_url: str, *,
                 data: Optional[Dict[str, Any]] = None,
                 extra: Optional[Dict[str, Any]] = None) -> None:
        if data is None:
            data = {}
        if extra is None:
            extra = {}
        data['src_url'] = src_url
        if 'extra' in data:
            # Prefer extra from init arguments to extra stored in data
            data['extra'].update(extra)
            extra = data['extra']
            del data['extra']
        self.data = self.REPO_DATA(**data)
        self.extra = extra

    def dict(self):
        data = self.data.dict()
        data['extra'] = self.extra
        return data

Potential mentors: @pdxjohnny

Getting Started: Source.__init__ probably needs two more arguments, label and version, which should probably have defaults (say, default and v0). Since the same backend (e.g., a CSV file or JSON file) would be used to store all the data, you'll have to change our existing sources so that they understand how to deal with this. For CSVSource that might mean adding another column to each repo; for JSONSource that might mean that, instead of one big array, the array of repos is stored like so:

{
    "default": {
        "v0": [
            "... all the repos ..."
        ]
    }
}
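
To make the idea above a bit more concrete, here is a rough sketch of how selecting repos by label and version might look for both backends. None of this is existing DFFML code: the helper names (json_select, json_store, csv_select) and the label and version column names are made up for illustration, and the defaults just mirror the default and v0 values suggested above.

```python
import csv
import io
import json
from typing import Any, Dict, List

# Assumed defaults, taken from the suggestion in this issue
DEFAULT_LABEL = "default"
DEFAULT_VERSION = "v0"


def json_select(document: Dict[str, Any],
                label: str = DEFAULT_LABEL,
                version: str = DEFAULT_VERSION) -> List[Any]:
    """Return the repos stored under document[label][version], or []."""
    return document.get(label, {}).get(version, [])


def json_store(document: Dict[str, Any],
               repos: List[Any],
               label: str = DEFAULT_LABEL,
               version: str = DEFAULT_VERSION) -> None:
    """Replace the repos stored under document[label][version]."""
    document.setdefault(label, {})[version] = list(repos)


def csv_select(text: str,
               label: str = DEFAULT_LABEL,
               version: str = DEFAULT_VERSION) -> List[Dict[str, str]]:
    """Return the CSV rows whose label and version columns match.

    Rows with a missing or empty label/version fall back to the defaults,
    so files written before the change would still load.
    """
    rows = csv.DictReader(io.StringIO(text))
    return [
        row for row in rows
        if (row.get("label") or DEFAULT_LABEL) == label
        and (row.get("version") or DEFAULT_VERSION) == version
    ]


if __name__ == "__main__":
    document: Dict[str, Any] = {}
    json_store(document, [{"src_url": "a"}])
    json_store(document, [{"src_url": "b"}], label="train", version="v1")
    print(json.dumps(document, indent=4))
```

Falling back to the defaults for rows without a label or version is one possible way to keep old files readable; a real implementation might prefer an explicit migration instead.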

What we want to see in your application: Describe how you intend to solve the problem, and give us some "stretch goals" (maybe you'll implement a source using sqlite too, or something similar). Don't forget to include some time for building appropriate tests.
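
As a rough idea of what the sqlite stretch goal could look like, here is a sketch using only Python's standard sqlite3 module. The table layout and the function names (open_db, update, repos) are assumptions made for this example, not part of DFFML, and a real source would need to implement the async Source interface rather than plain functions.

```python
import json
import sqlite3
from typing import Any, Dict, List


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) repos table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS repos ("
        " label TEXT NOT NULL,"
        " version TEXT NOT NULL,"
        " src_url TEXT NOT NULL,"
        " data TEXT NOT NULL,"
        " PRIMARY KEY (label, version, src_url))"
    )
    return db


def update(db: sqlite3.Connection, repo: Dict[str, Any],
           label: str = "default", version: str = "v0") -> None:
    """Insert the repo, or overwrite it within the same label and version."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO repos (label, version, src_url, data)"
            " VALUES (?, ?, ?, ?)",
            (label, version, repo["src_url"], json.dumps(repo)),
        )


def repos(db: sqlite3.Connection,
          label: str = "default", version: str = "v0") -> List[Dict[str, Any]]:
    """Return all repos stored under the given label and version."""
    rows = db.execute(
        "SELECT data FROM repos WHERE label = ? AND version = ?"
        " ORDER BY src_url",
        (label, version),
    )
    return [json.loads(data) for (data,) in rows]
```

Because the functions take a connection, a test can run against an in-memory database by calling open_db() with no arguments, which keeps the tests fast and free of leftover files.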

@johnandersen777 johnandersen777 added enhancement New feature or request gsoc Google Summer of Code related project Issues which will take a while to complete labels Mar 9, 2019
@johnandersen777 johnandersen777 changed the title source: Labeled and versioned datasets source: Labeled and Versioned datasets Mar 10, 2019
@CYBruce

CYBruce commented Mar 19, 2019

Hi, @pdxjohnny. This is Chenyu Tian from China. I am an undergrad at Sun Yat-Sen University. I use Python frequently and have learned machine learning algorithms. I would like to contribute to this issue to the best of my ability. Where should I start in order to make my first commit? I still feel a little confused about the 'Getting Started' part.
I also think there is a small mistake in the wiki: it should say 'with' instead of 'will'.
wiki

@johnandersen777
Author

Thanks for letting me know! I'll fix that! If you're more familiar with machine learning models, it might be good for you to implement a DFFML Model. To do that, follow the new model tutorial, then replace the functions in that class with code which calls into a machine learning library.
