Support multiple storage implementations #168

Closed · lemon24 opened this issue Jun 9, 2020 · 8 comments

lemon24 commented Jun 9, 2020

reader is intended to support multiple storage implementations.

Making make_reader() the main constructor for Reader instances should serve as the base for this. IIRC the main inspiration for it was sqlalchemy.create_engine; it should be possible to expand make_reader() to a similar connection string + additional parameters model.


I recently wrote a "SQLite server" proof of concept using the stdlib multiprocessing; using it in a subclass of the existing storage should provide an implementation different enough to drive the design, without actually requiring a whole new storage.
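
For context, the core of the trick, heavily simplified (the actual proof of concept, linked in a later comment, necessarily does more; e.g., cursors returned by proxied calls need proxies of their own):

import sqlite3
from multiprocessing.managers import BaseManager

class DBManager(BaseManager):
    """A manager whose server process owns the SQLite connections."""

DBManager.register('open_db', sqlite3.connect)

if __name__ == '__main__':
    manager = DBManager(address='/tmp/db.sock', authkey=b'secret')
    manager.start()  # connections will live in the manager process
    # returns a proxy to a sqlite3.Connection in the manager process;
    # method calls on the proxy are forwarded over the socket
    db = manager.open_db('/path/to/db')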

The client side of the prototype needs at least the following arguments:

  • address: (str, int) for TCP, str for UNIX sockets / Windows named pipes
  • authkey: bytes

These could be mapped to a connection string like scheme://:authkey@host:port or scheme://:authkey@/path, with the authkey encoded in some way (not sure how it would work for Windows pipes, though).
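
A sketch of mapping such a string back to those arguments, assuming the authkey is simply percent-encoded in the password component (the actual encoding is an open question):

from urllib.parse import urlsplit, unquote_to_bytes

def parse_address_url(url):
    parts = urlsplit(url)
    authkey = unquote_to_bytes(parts.password or '')
    if parts.hostname:
        return (parts.hostname, parts.port), authkey  # TCP
    return parts.path, authkey  # UNIX socket

parse_address_url('tcp://:s3cret@localhost:5678')
# (('localhost', 5678), b's3cret')
parse_address_url('unix://:s3cret@/var/run/db.sock')
# ('/var/run/db.sock', b's3cret')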

Additionally, in the future it may be possible to pass a SQLite database path, or maybe a full URI filename; I'm not sure how this would work with the connection string (e.g. the UNIX socket path may clash with the DB path).


lemon24 commented Jun 10, 2020

Case study: SQLAlchemy create_engine().

The main entity in SQLAlchemy is the engine. An engine maintains a pool of DBAPI connections, on which it executes statements of a specific dialect. Connections are created lazily.

create_engine() is responsible for wiring them all up; it takes a connection string and a wide variety of keyword arguments, which are routed to the appropriate components:

create_engine(
    "dialect[+driver]://user:password@host/dbname[?key=value...]",
    key=value,
)

It roughly works like this:

  • a dialect instance is created
    • a dialect class is loaded based on the URL; for each driver, there's a specific dialect subclass
    • its arguments are consumed from the keyword arguments
    • a DBAPI module is selected, either the module keyword argument, or a dialect-specific default; if the latter, more keyword arguments are consumed for dialect_cls.dbapi()
    • the dialect class is instantiated
  • connection arguments are collected from:
    • common arguments from the URL ("host", "database", "username", "password", "port")
    • the URL query string arguments known by the dialect
    • the connect_args keyword argument
  • a pool instance is created
    • if the pool keyword argument is given, it's either a pool ready to be used directly, or a proxy that takes the connection arguments and returns the pool
    • otherwise
      • a pool class is selected, either the poolclass keyword argument, or a dialect-specific one
      • arguments for the pool are collected
        • a connect() function, either the creator keyword argument, or one created based on the connection arguments
        • known pool_* keyword arguments, without the pool_ prefix
      • the pool class is instantiated
  • an engine instance is created
    • from the pool, dialect, url, and arguments consumed from the keyword arguments
    • it's possible to use a different engine class via _future_engine_class
  • if any keyword arguments are left unconsumed, an exception is raised

I omitted the following:

  • various arguments may be converted from strings to other types
  • there's a plugins argument with a number of import names; these are loaded, and specific hooks are called throughout the process above
  • there's an event system which allows registering more hooks; these are also called

Dialects/drivers are imported by a plugin loader instance as follows:

  • The default ones are sqlalchemy sub-packages/modules (and discovered as such), e.g. postgresql+pg8000://... corresponds to sqlalchemy.dialects.postgresql.pg8000:dialect.
  • Additional ones can be registered via setuptools entry points or in-process, by calling sqlalchemy.dialects.registry.register(...).
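
For reference, both registration mechanisms look like this (the mypackage names are placeholders):

# at runtime:
from sqlalchemy.dialects import registry
registry.register("mydialect.mydriver", "mypackage.dialect", "MyDialect")

# via entry points, in the dialect package's setup.py:
# entry_points={
#     "sqlalchemy.dialects": ["mydialect = mypackage.dialect:MyDialect"],
# }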


lemon24 commented Jun 10, 2020

Ok, here goes. At the moment, we don't actually need to do much (see the "Incremental implementation" at the end).

Requirements

It should be possible to use an already instantiated Storage.

It should be easy to:

  • Load a specific storage implementation based on configuration.
  • Pass additional arguments to the storage from said configuration.
    • It should also be easy to pass these from Python.
    • Known use cases:
      • address+port / path, authkey, for the "SQLite server" storage
      • allow_migrations, to prevent automatic migrations
      • (part of) a sqlalchemy.create_engine URL, for a SQLAlchemy-based storage
  • Expose new storage implementations and storage arguments without having to change the CLI.
    • Use new storage implementations without having to change reader, or use Setuptools entrypoints / publish a package on PyPI.
  • Include the storage configuration in other configuration, e.g. a config file for the reader CLI or web app.

Additionally, it should be possible to use the same solution for alternate search providers.

The current make_reader() signature must continue to work, and correspond to the default storage:

make_reader('path/to/database')
make_reader(url='path/to/database')

Basics

It seems natural to follow a model similar to that of sqlalchemy.create_engine: a connection string URL + additional parameters as keyword arguments.

As such, make_reader() will take a storage connection string, followed by additional keyword arguments:

make_reader("storage://...", storage_arg='value', ...)

Based on the URL scheme, a storage class should be loaded from a registry of well-known storage implementations. This class will then be instantiated with the URL and the storage_-prefixed keyword arguments, with the prefix stripped (the prefix is required to avoid clashes with other keyword arguments that are not storage-related).

Assuming the storage scheme corresponds to the Storage class, the invocation above should result in the following storage instance:

Storage("storage://...", arg='value')

Query string and keyword arguments

By default, the URL and keyword arguments will be passed to the Storage as-is; it is the storage's responsibility to make sense of them.

This is both to simplify initial implementation, and allow integration with other similar APIs; e.g., a SQLAlchemy-based storage could delegate most argument handling to sqlalchemy.create_engine:

make_reader('sqlalchemy+postgresql://user:pass@host/db', storage_option=True)
# would end up calling
sqlalchemy.create_engine('postgresql://user:pass@host/db', option=True)

If needed, in the future there may be an opt-in mode in which make_reader() merges the URL components, query parameters, and keyword arguments into a single keyword argument dict to be passed to the storage class, with the values converted from strings to specific types declared by the storage.

This would allow loading all the options from a configuration file; e.g. this config:

[reader]
url = storage://host:1000/path?timeout=10
storage_option = value

would result in a storage invocation like this:

Storage(host='host', port=1000, path='path', timeout=10, option='value')

The precedence would be: URL components > query parameters > keyword arguments.
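
A sketch of this merging, assuming storages declare the types via a (made-up) option_types mapping:

from urllib.parse import urlsplit, parse_qsl

def merge_storage_options(storage_cls, url, **kwargs):
    parts = urlsplit(url)
    options = dict(kwargs)                   # lowest precedence
    options.update(parse_qsl(parts.query))   # then query parameters
    if parts.hostname:                       # URL components win
        options['host'] = parts.hostname
    if parts.port:
        options['port'] = parts.port
    if parts.path:
        options['path'] = parts.path.lstrip('/')
    # convert strings to the types declared by the storage,
    # e.g. option_types = {'timeout': int}
    for name, type_ in storage_cls.option_types.items():
        if name in options and isinstance(options[name], str):
            options[name] = type_(options[name])
    return options

For the config above, this returns {'host': 'host', 'port': 1000, 'path': 'path', 'timeout': 10, 'option': 'value'}.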

Storage discovery

It should be possible to register storage implementations corresponding to a specific URL scheme into a global registry, both via a Setuptools entry point and at runtime, in a way similar to how SQLAlchemy does it for dialects.

The discovery should have the following precedence: runtime > entry point > shipped with reader.

In order to support integration with other URL-based APIs, the scheme should be split into +-delimited components, and only the first component should be used for discovery; as a consequence, registered schemes cannot contain +. Examples:

  • sqlite:... corresponds to the sqlite registry key
  • sqlalchemy+postgresql:... corresponds to the sqlalchemy registry key
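
A discovery sketch following this precedence (the reader.storages entry point group name is made up; entry_points(group=...) requires Python 3.10+):

from importlib.metadata import entry_points

_runtime_registry = {}  # populated via register_storage() at runtime
_builtin_registry = {}  # implementations shipped with reader

def register_storage(key, storage_cls):
    _runtime_registry[key] = storage_cls

def resolve_storage_cls(scheme):
    key = scheme.partition('+')[0]  # sqlalchemy+postgresql -> sqlalchemy
    if key in _runtime_registry:    # runtime beats ...
        return _runtime_registry[key]
    for entry_point in entry_points(group='reader.storages'):  # ... entry points ...
        if entry_point.name == key:
            return entry_point.load()
    return _builtin_registry[key]   # ... which beat built-ins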

Storage instances

Alternative storage instances can be passed explicitly using the storage keyword argument:

make_reader(storage=UnixSocketStorage(...))

This bypasses the URL and the storage_ keyword arguments entirely; it may be an error to pass them with storage.

Storage classes

It should also be possible to pass alternative storage classes using the storage_cls keyword argument:

make_reader('whatever://1.2.3.4:5678', storage_cls=TCPStorage)

This bypasses storage discovery; the URL and other storage keyword arguments will be passed to storage_cls (but the URL may be optional).

This would allow overriding the storage implementation for a specific scheme locally.

In order for this to work with the configuration file, make_reader() may try to import storage_cls if it is a string (e.g. 'mymodule:MyStorage').
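
A sketch of that, using the usual module:qualname convention:

from importlib import import_module

def resolve_cls(storage_cls):
    # turn 'mymodule:MyStorage' into the MyStorage class; pass classes through
    if isinstance(storage_cls, str):
        module_name, _, name = storage_cls.partition(':')
        return getattr(import_module(module_name), name)
    return storage_cls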

Search

Search instantiation works the same as the storage one, except the arguments are prefixed by search_ instead of storage_, and the search implementations should have a separate registry:

  • url: search_url
  • storage: search
  • storage_cls: search_cls
  • storage_*: search_*

make_reader(
    "storage://...", storage_arg='value', ...,
    search_url="search://...", search_arg='value', ...,
    # alternatively,
    search=Search(...),
    # alternatively,
    search_cls=Search, search_arg='value', ...,
    # for convenience, allow search to be interpreted as search_url
    search="search://...",
)

In order to support the current "parasitic" / tightly-coupled search implementation, search classes may have a storage argument; if present, it will be passed the storage instance (this implies the storage must be instantiated before the search).

Also, storage classes may have a default_search_cls attribute; if present, and make_reader() does not receive any of search_url, search, or search_cls, it will be used as the default search implementation.
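
A sketch of the resulting search wiring (make_search() and load_search_cls() are hypothetical; passing search_url through to the class is omitted for brevity):

import inspect

def make_search(storage, search_url=None, search=None, search_cls=None, **kwargs):
    if search is not None:  # an instance bypasses everything else
        return search
    if search_cls is None:
        if search_url is not None:
            search_cls = load_search_cls(search_url)  # separate search registry
        else:
            search_cls = getattr(type(storage), 'default_search_cls', None)
    if search_cls is None:
        return None  # no search
    # "parasitic" implementations take the already-instantiated storage
    if 'storage' in inspect.signature(search_cls).parameters:
        kwargs['storage'] = storage
    return search_cls(**kwargs)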

TBD: We should have a value/URL for "no search" and one for "default search" (None/'none:' and 'default:'?).

Backwards compatibility

The scheme-less path invocation will continue to be supported, and will be equivalent to the sqlite scheme.

The default search_url will correspond to the current search implementation, which will have a storage argument and will be the default_search_cls of the default storage, as described above.

Examples

Default storage and search, implicit:

make_reader('path/to/database')

This is equivalent to:

make_reader(
    'sqlite:path/to/database', 
    # we should probably find a better name for this
    search_url='sqlite-parasitic:',
)

The slashes should probably work in the way they work for SQLite URI filenames.
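
That is, presumably:

  • sqlite:db.sqlite (relative path, like file:db.sqlite)
  • sqlite:/db.sqlite (absolute path, like file:/db.sqlite)
  • sqlite:///db.sqlite (absolute path, like file:///db.sqlite)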

Assuming unix and tcp are schemes for the multiprocessing "SQLite server" storages:

make_reader('unix://:authkey@/path/to/unix/socket?path=/path/to/db')
make_reader('tcp://:authkey@1.2.3.4:5678?path=/path/to/db')

Assuming sqlalchemy is the scheme for a SQLAlchemy-based storage:

make_reader('sqlalchemy+postgresql://user:pass@host/db')

Incremental implementation

Most likely, YAGNI (most of this). (But it was a "learning experience" to write.)

For private use, moving the Storage and Search instantiation from the Reader constructor to make_reader() should be enough to allow using different implementations for them.

If anyone else wants to use a different storage/search, we could probably get away with exposing storage_cls/search_cls (and possibly automagically importing them if they're strings). Then, if public first-party or third-party implementations appear and are widely used, we should probably implement the whole scheme / registry thing.

We can probably live without allowing arbitrary keyword arguments until they're actually needed. If they see heavy use and have types other than strings, it makes sense to implement the type conversion thing.

lemon24 added commits that referenced this issue on Jun 14 and Jun 19, 2020

lemon24 commented Jun 19, 2020

I managed to use the diff below with bench.py to see the difference between the vanilla storage and the "SQLite server" storage; results are below the diff (it ranges from roughly 1× to 5×).

diff --git a/scripts/bench.py b/scripts/bench.py
index 2c574ce..fbb97ea 100644
--- a/scripts/bench.py
+++ b/scripts/bench.py
@@ -29,6 +29,35 @@ from reader import make_reader
 from reader._app import create_app, get_reader
 
 
+# ---
+
+# sqliteserver is https://gist.github.com/lemon24/1f35deed9da79dc20d9c538aa0f5e0cc#file-08-manager-db-everything-py
+import sqliteserver
+import reader._storage
+
+class RemoteStorage(reader._storage.Storage):
+
+    def __init__(self, address, authkey, path, timeout=None):
+        self.address = address
+        self.authkey = authkey
+        super().__init__(path, timeout)
+
+    def connect(self, *args, **kwargs):
+        return sqliteserver.connect(self.address, self.authkey, *args, **kwargs)
+
+
+ADDRESS = '/tmp/db.sqlite.sock'
+
+def make_reader(url):
+    from reader import make_reader
+    return make_reader(None, _storage=RemoteStorage(ADDRESS, b'', url))
+
+manager = sqliteserver.make_manager_cls()(ADDRESS, b'')
+manager.start()
+
+# ---
+
+
 def get_params(fn):
     rv = []
     for param in inspect.signature(fn).parameters.values():
$ python scripts/bench.py time get_entries_all get_entries_read get_entries_feed update_feeds '*search*' > local.txt
$ : ...
$ python scripts/bench.py time get_entries_all get_entries_read get_entries_feed update_feeds '*search*' > remote.txt
$ python scripts/bench.py diff local.txt remote.txt --format times
number num_entries get_entries_all get_entries_feed get_entries_read search_entries_all search_entries_read update_feeds update_search
     4          32            2.09             4.00             4.50               2.11                3.00         5.14          5.00
     4          64            1.42             3.67             4.50               1.71                3.00         4.76          1.49
     4         128            1.19             3.00             4.50               1.38                3.00         4.63          1.64
     4         256            1.47             2.82             3.33               1.30                3.20         8.68          0.98
     4         512            0.99             1.81             2.50               1.17                2.00         3.02          1.33
     4        1024            2.22             2.42             4.67               1.15                1.50         2.24          1.08
     4        2048            1.32             1.55             3.33               1.05                2.45         2.09          1.36


lemon24 commented Jun 19, 2020

I guess reader now supports multiple storage implementations (at least privately), and what it doesn't support we know how to implement. Resolving.


egslava commented Jan 11, 2023

reader = make_reader(
    r'sqlalchemy+postgresql://'
    r'postgres:password@'
    r'flscrapper-egslava.c9xhhmfmayou.us-east-1.rds.'
    r'amazonaws.com/flscrapper_egslava_db'
)

reader.exceptions.StorageError: error while opening database: sqlite3.OperationalError: unable to open database file

I can connect to this DB using pgAdmin/Postico; meanwhile, reader tells me that it can't open an SQLite DB, though it isn't one. I haven't found any documentation section about PostgreSQL, except the posts in this thread. What am I doing wrong?


lemon24 commented Jan 15, 2023

Hi @egslava,

As mentioned in passing here, there's currently only one storage implementation, SQLite.

That said, it is possible to write an alternative one (this issue is about what make_reader() should look like when that happens).

If you are willing/planning to implement a Postgres storage (not necessarily as part of reader itself), please let me know, so I can make the minimum necessary changes to unblock you.

I would update make_reader() to allow at least passing in a different storage instance, make_reader(storage=OtherStorage(...)) (currently, the wiring is a bit hardcoded). Discovery by connection string etc. can then be added later on, separately from this.

You would need to implement at least the following Storage and Search methods:
Storage.__enter__()
Storage.__exit__()
Storage.close()
Storage.add_feed()
Storage.delete_feed()
Storage.change_feed_url()
Storage.get_feed_last()
Storage.get_feeds_page()
Storage.get_feed_counts()
Storage.get_feeds_for_update()
Storage.get_entries_for_update()
Storage.set_feed_user_title()
Storage.set_feed_updates_enabled()
Storage.mark_as_stale()
Storage.mark_as_read()
Storage.mark_as_important()
Storage.get_entry_recent_sort()
Storage.set_entry_recent_sort()
Storage.update_feed()
Storage.add_or_update_entries()
Storage.add_entry()
Storage.delete_entries()
Storage.get_entry_last()
Storage.get_entries_page()
Storage.get_entry_counts()
Storage.get_tags_page()
Storage.set_tag()
Storage.delete_tag()
Storage.default_search_cls (class attribute pointing to Search class)
Search.enable()
Search.disable()
Search.is_enabled()
Search.update()
Search.search_entry_last()
Search.search_entries_page()
Search.search_entry_counts()

I can also extract these into a protocol / base class, so you can:

  1. have your own Storage implementation pass type checking
  2. get to reuse some convenience Storage methods like get_feeds() (which is a thin wrapper over get_feeds_page())
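
Such a protocol might look roughly like this (signatures elided, since they're not pinned down here):

from typing import Protocol

class StorageType(Protocol):
    def close(self) -> None: ...
    def add_feed(self, *args, **kwargs): ...
    def delete_feed(self, *args, **kwargs): ...
    # ... and so on, one method per entry in the list above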

A note regarding search: The SQLite search implementation is tightly-bound to the SQLite storage (that is, search keeps a reference to storage and uses its internals, e.g. database connections). Since Postgres also has support for full-text search AFAIK, it makes sense for it to follow the same pattern. (Having a search implementation that uses a different underlying storage than the storage is possible, but additional hooks would need to be exposed, so that search can be notified of changes at the Python level.)

Do let me know.


egslava commented Jan 17, 2023

Thanks for the info. It took me a while to find this in the docs, but, yeah, later I found it and just hosted my 10-line script on EC2, so it's fine for now, since I'm still on the free tier. To avoid ambiguity, I feel I have to say that I'm not planning to implement PostgreSQL support, since I just don't need it that much right now and the current version of reader already fits my needs. Thanks for the great library!


lemon24 commented Jan 17, 2023

since I just don't need it that much right now and current version of the 'reader' already fits my needs

No worries, it looks like we're in a similar situation here.

Glad to hear reader is still useful as-is, thank you for reaching out. :D
