Add SwitchPage #103
Open
Gallaecio wants to merge 15 commits into scrapinghub:master from Gallaecio:switch
15 commits
f012a0b Add SwitchPage (unfinished) (Gallaecio)
62cd1ed Add missing reference (Gallaecio)
51114e4 Merge remote-tracking branch 'scrapinghub/master' into switch (Gallaecio)
31fc060 Get a clean tox pass (Gallaecio)
7817813 Refactor into the proposed API improvement (Gallaecio)
36ad056 Remove unneeded NotImplementedError (Gallaecio)
b5d61ea Add MultiLayoutPage to the API reference (Gallaecio)
6778aa3 Mention priority resolution on the multi-layout page object class doc… (Gallaecio)
073c4ab Clarify that the layouts under a multi-layout should return the same … (Gallaecio)
b51f056 Remove unnecessary @attrs.define (Gallaecio)
308c4bf Apply feedback (Gallaecio)
51bf31f Add a test case for multilayout exposing layouts that inherit from a … (Gallaecio)
828a84b MultiLayoutPage: layout → get_layout (Gallaecio)
fc7867f Provide __get_layout example as a test (Gallaecio)
b32f92f Merge remote-tracking branch 'scrapinghub/master' into switch (Gallaecio)
@@ -0,0 +1,233 @@
.. _layouts:

===============
Webpage layouts
===============

Different webpages may show the same *type* of page, but different *data*. For
example, in an e-commerce website there are usually many product detail pages,
each showing data from a different product.

The code that those webpages have in common is their **webpage layout**.

Coding for webpage layouts
==========================

Webpage layouts should inform how you organize your data extraction code.

A good practice to keep your code maintainable is to have a separate :ref:`page
object class <page-objects>` per webpage layout.

Trying to support multiple webpage layouts with the same page object class can
make your class hard to maintain.


Identifying webpage layouts
===========================

There is no precise way to determine whether 2 webpages have the same or a
different webpage layout. You must decide based on what you know, and be ready
to adapt if things change.

It is also often difficult to identify webpage layouts before you start writing
extraction code. Completely different webpage layouts can have the same look,
and very similar webpage layouts can look completely different.

It can be a good starting point to assume that, for a given combination of
data type and website, there is going to be a single webpage layout. For
example, assume that all product pages of a given e-commerce website will have
the same webpage layout.

Then, as you write a :ref:`page object class <page-objects>` for that webpage
layout, you may find out more, and adapt.

When the same piece of information must be extracted from a different place for
different webpages, that is a sign that you may be dealing with more than 1
webpage layout. For example, if on some webpages the product name is in an
``h1`` element, but on other webpages it is in an ``h2`` element, chances are
there are at least 2 different webpage layouts.

However, whether you continue to work as if everything uses the same webpage
layout, or you split your page object class into 2 page object classes, each
targeting one of the webpage layouts you have found, is entirely up to you.

Ask yourself: Is supporting all webpage layout differences making your page
object class implementation only a few lines of code longer, or is it turning
it into unmaintainable spaghetti code?
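
One way to look for the ``h1``/``h2`` sign described above is to tally, over a
handful of saved pages, which element actually holds the value. The following
is a minimal sketch that uses only the Python standard library;
``FirstTextFinder``, ``name_source`` and the sample HTML are hypothetical names
and data made up for illustration, and none of them are part of web-poet.

```python
from collections import Counter
from html.parser import HTMLParser


class FirstTextFinder(HTMLParser):
    """Record the first non-empty text found inside each watched tag."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.found = {}
        self._open = None  # watched tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.tags and tag not in self.found:
            self._open = tag

    def handle_data(self, data):
        if self._open is not None and data.strip():
            self.found[self._open] = data.strip()
            self._open = None

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None


def name_source(html, candidates=("h1", "h2")):
    """Return the first candidate tag that contains text, or None."""
    finder = FirstTextFinder(candidates)
    finder.feed(html)
    for tag in candidates:
        if tag in finder.found:
            return tag
    return None


# Hypothetical sample pages standing in for saved responses.
samples = [
    "<html><body><h1>Blue shirt</h1></body></html>",
    "<html><body><h2>Red shoes</h2></body></html>",
    "<html><body><h1>Green hat</h1></body></html>",
]
tally = Counter(name_source(page) for page in samples)
```

If the tally shows more than one source element, as it does for these samples,
that hints at more than one webpage layout. It does not decide for you whether
splitting the page object class is worth it.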


Mapping webpage layouts
=======================

Once you have written a :ref:`page object class <page-objects>` for a webpage
layout, you need to make it so that your page object class is used for webpages
that use that webpage layout.

URL patterns
------------

Webpage layouts are often associated with specific URL patterns. For example,
all the product detail pages of an e-commerce website usually have similar
URLs, such as ``https://example.com/product/<product ID>``.

When that is the case, you can :ref:`associate your page object class with the
corresponding URL pattern <rules-intro>`.
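
To illustrate the idea of mapping URL patterns to page object classes, here is
a minimal standard-library sketch. It is not web-poet's rule mechanism: the
``handles`` decorator, the ``RULES`` list, ``page_class_for`` and the
``/category/`` URL shape are all hypothetical and exist only for this example.

```python
import re
from urllib.parse import urlparse

# Hypothetical registry of (domain, compiled path pattern, class) entries.
RULES = []


def handles(domain, path_pattern):
    """Register the decorated class for URLs matching domain and path."""

    def decorator(cls):
        RULES.append((domain, re.compile(path_pattern), cls))
        return cls

    return decorator


@handles("example.com", r"^/product/[^/]+$")
class ProductPage:
    """Stand-in for a product detail page object class."""


@handles("example.com", r"^/category/[^/]+$")
class CategoryPage:
    """Stand-in for a category page object class."""


def page_class_for(url):
    """Return the first registered class whose rule matches the URL."""
    parts = urlparse(url)
    for domain, pattern, cls in RULES:
        if parts.netloc == domain and pattern.match(parts.path):
            return cls
    return None


product_cls = page_class_for("https://example.com/product/123")
category_cls = page_class_for("https://example.com/category/shoes")
unknown_cls = page_class_for("https://example.com/about")
```

With this kind of mapping, the URL alone selects the class, which is why it
stops being enough when the same URL can return different webpage layouts.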


.. _multi-layout:

Multi-layout page object classes
--------------------------------

Sometimes it is impossible to know, based on the target URL, which webpage
layout you are getting. For example, during `A/B testing`_, you could get a
random webpage layout on every request.

.. _A/B testing: https://en.wikipedia.org/wiki/A/B_testing

For these scenarios, we recommend that you create different page object classes
for the different layouts that you may get, and then write a special
“multi-layout” page object class, and use it to select the right page object
class at run time based on the input you receive.

Your multi-layout page object class should:

#. Declare attributes for the input that you will need to determine which page
   object class to use.

   For example, declare an :class:`HttpResponse` attribute to select a page
   object class based on the response content:

   .. code-block:: python

       @attrs.define
       class MyMultiLayoutPage(ItemPage):
           response: HttpResponse
           ...

#. Declare an attribute for every page object class that you may use depending
   on which webpage layout you get from the target website.

   They all should return the same type of :ref:`item <item-classes>` as your
   multi-layout page object class.

   For example:

   .. code-block:: python

       class MyItem:
           ...

       @attrs.define
       class MyPage1(ItemPage[MyItem]):
           ...

       @attrs.define
       class MyPage2(ItemPage[MyItem]):
           ...

       @attrs.define
       class MyMultiLayoutPage(ItemPage[MyItem]):
           ...
           page1: MyPage1
           page2: MyPage2

   Note that all inputs of all those page object classes will be resolved and
   requested along with the input of your multi-layout page object class.

   For example, given:

   .. code-block:: python

       @attrs.define
       class MyPage1(ItemPage):
           response: HttpResponse

       @attrs.define
       class MyPage2(ItemPage):
           response: BrowserHtml

       @attrs.define
       class MyMultiLayoutPage(ItemPage):
           response: HttpResponse
           page1: MyPage1
           page2: MyPage2

   Using ``MyMultiLayoutPage`` causes the use of both ``HttpResponse`` and
   ``BrowserHtml``, because ``MyMultiLayoutPage`` requires ``MyPage2``, and
   ``MyPage2`` requires ``BrowserHtml``.

   If combining different inputs is a problem, consider refactoring your page
   object classes to require similar inputs.

#. On its :meth:`~web_poet.pages.ItemPage.to_item` method:

   #. Determine, based on inputs, which page object to use.

   #. Return the output of the :meth:`~web_poet.pages.ItemPage.to_item`
      method of that page object.

   For example:

   .. code-block:: python

       @attrs.define
       class MyMultiLayoutPage(ItemPage[MyItem]):
           response: HttpResponse
           page1: MyPage1
           page2: MyPage2

           async def to_item(self) -> MyItem:
               if self.response.css(".foo"):
                   page_object = self.page1
               else:
                   page_object = self.page2
               return await page_object.to_item()
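
The note in step 2 about inputs being resolved together can be pictured as a
walk over type annotations. The sketch below is only an illustration of that
idea under simplifying assumptions, not how web-poet or any framework built on
it resolves dependencies: the classes are plain stand-ins that reuse the names
from the examples above, and ``LEAF_INPUTS`` and ``required_inputs`` are
hypothetical.

```python
from typing import get_type_hints


class HttpResponse:
    """Stand-in for an input fetched with a plain HTTP request."""


class BrowserHtml:
    """Stand-in for an input that requires browser rendering."""


# Hypothetical set of types treated as inputs rather than page objects.
LEAF_INPUTS = {HttpResponse, BrowserHtml}


class MyPage1:
    response: HttpResponse


class MyPage2:
    response: BrowserHtml


class MyMultiLayoutPage:
    response: HttpResponse
    page1: MyPage1
    page2: MyPage2


def required_inputs(cls):
    """Collect every leaf input needed by cls and its dependencies."""
    needed = set()
    for dependency in get_type_hints(cls).values():
        if dependency in LEAF_INPUTS:
            needed.add(dependency)
        else:
            # A nested page object: its own inputs are needed as well.
            needed |= required_inputs(dependency)
    return needed


multi_layout_inputs = required_inputs(MyMultiLayoutPage)
```

Here ``multi_layout_inputs`` contains both stand-in input types, even though
only one of the two layouts ends up being used for any given response.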

You may use :class:`~web_poet.pages.MultiLayoutPage` as a base class for your
multi-layout page object class, so you only need to implement the
:meth:`~web_poet.pages.MultiLayoutPage.get_layout` method that determines
which page object to use. For example:

.. code-block:: python

    from typing import Optional

    import attrs
    from web_poet import (
        handle_urls,
        field,
        HttpResponse,
        ItemPage,
        MultiLayoutPage,
        WebPage,
    )


    @attrs.define
    class Header:
        text: str


    class H1Page(WebPage[Header]):

        @field
        def text(self) -> Optional[str]:
            return self.css("h1::text").get()


    class H2Page(WebPage[Header]):

        @field
        def text(self) -> Optional[str]:
            return self.css("h2::text").get()


    @handle_urls("example.com")
    @attrs.define
    class HeaderMultiLayoutPage(MultiLayoutPage[Header]):
        response: HttpResponse
        h1: H1Page
        h2: H2Page

        async def get_layout(self) -> ItemPage[Header]:
            if self.response.css("h1::text"):
                return self.h1
            return self.h2

.. note:: If you use :func:`~web_poet.handle_urls` both for your multi-layout
          page object class and for any of the page object classes that it
          uses, you may need to :ref:`grant your multi-layout page object class
          a higher priority <rules-priority-resolution>`.
@@ -0,0 +1,116 @@
"""Proof of concept of an approach to multi-layout support that involves
documenting best practices on how to handle it with the existing API, rather
than providing a new API for it."""

import attrs
import pytest

from web_poet import HttpResponse, ItemPage, field


@attrs.define
class Item:
    title: str
    text: str


@attrs.define
class Title:
    title: str


@attrs.define
class Text:
    text: str


@pytest.mark.asyncio
async def test_multiple_inheritance():

    html = b"""
    <!doctype html>
    <html>
    <head>
    <title>foo</title>
    </head>
    <text id="a">bar</text>
    </html>
    """

    @attrs.define
    class TitleAPage(ItemPage[Title]):
        response: HttpResponse

        @field
        def title(self):
            return self.response.css("title::text").get()

    @attrs.define
    class TitleBPage(ItemPage[Title]):
        response: HttpResponse

        @field
        def title(self):
            return self.response.css("h1::text").get()

    @attrs.define
    class TitleMultiLayout(ItemPage[Item]):
        response: HttpResponse
        title_a: TitleAPage
        title_b: TitleBPage

        # The double underscore triggers name mangling, so this method and the
        # __get_layout() of the subclass below do not override each other.
        # TODO: cache the result
        def __get_layout(self):
            if self.response.css("#a"):
                return self.title_a
            return self.title_b

        @field
        def title(self):
            return self.__get_layout().title

    @attrs.define
    class TextAPage(ItemPage[Text]):
        response: HttpResponse

        @field
        def text(self):
            return self.response.css("#a::text").get()

    @attrs.define
    class TextBPage(ItemPage[Text]):
        response: HttpResponse

        @field
        def text(self):
            return self.response.css("#b::text").get()

    @attrs.define
    class TitleAndTextMultiLayout(TitleMultiLayout):
        text_a: TextAPage
        text_b: TextBPage

        # TODO: cache the result
        def __get_layout(self):
            if self.response.css("#a"):
                return self.text_a
            return self.text_b

        @field
        def text(self):
            return self.__get_layout().text

    response = HttpResponse("https://example.com", body=html, encoding="utf8")
    title_a = TitleAPage(response=response)
    title_b = TitleBPage(response=response)
    text_a = TextAPage(response=response)
    text_b = TextBPage(response=response)
    layout = TitleAndTextMultiLayout(
        response=response,
        title_a=title_a,
        title_b=title_b,
        text_a=text_a,
        text_b=text_b,
    )

    assert await layout.to_item() == Item(title="foo", text="bar")
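
The ``# TODO: cache the result`` comments above could be addressed in several
ways. One possible approach, sketched here with plain stand-in classes rather
than web-poet page objects, is ``functools.cached_property``. ``Layout``,
``CachedMultiLayout`` and the ``selections`` counter are hypothetical and exist
only to show that the selection logic runs once.

```python
from functools import cached_property


class Layout:
    """Stand-in for a layout-specific page object (hypothetical)."""

    def __init__(self, title):
        self.title = title


class CachedMultiLayout:
    """Stand-in for a multi-layout page object that caches its choice."""

    def __init__(self, body, layout_a, layout_b):
        self.body = body
        self.layout_a = layout_a
        self.layout_b = layout_b
        self.selections = 0  # counts how often the selection logic runs

    @cached_property
    def layout(self):
        # Runs once per instance; later accesses reuse the stored result.
        self.selections += 1
        if 'id="a"' in self.body:
            return self.layout_a
        return self.layout_b

    @property
    def title(self):
        return self.layout.title


page = CachedMultiLayout('<text id="a">bar</text>', Layout("foo"), Layout("baz"))
first, second = page.title, page.title
```

Note that ``cached_property`` stores its result in the instance ``__dict__``,
and ``@attrs.define`` creates slotted classes by default, so this sketch may
not carry over unchanged to the classes in the test.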