Spark Data Frame support in Ignite

11 messages
Spark Data Frame support in Ignite

dsetrakyan
Igniters,

We have had the integration with Spark Data Frames on our roadmap for a
while:
https://issues.apache.org/jira/browse/IGNITE-3084

However, while browsing the Spark documentation, I came across the generic
JDBC data frame support in Spark:
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases

Given that Ignite has a JDBC driver, does it mean that it transitively also
supports Spark data frames? If yes, we should document it.

D.
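If the transitive route works, wiring it up should amount to pointing Spark's generic JDBC reader at the Ignite driver. A minimal sketch of assembling those reader options (the thin-driver URL, port, and table name are assumptions about a particular setup, not documented facts):

```python
def ignite_jdbc_options(host, port, table):
    """Option map for Spark's generic JDBC data source pointed at Ignite.

    The URL scheme and driver class below are assumptions based on the
    Ignite JDBC thin driver; adjust them for the actual deployment.
    """
    return {
        "url": f"jdbc:ignite:thin://{host}:{port}",
        "dbtable": table,
        "driver": "org.apache.ignite.IgniteJdbcThinDriver",
    }

# In a Spark application the dict would be handed to the generic reader:
#   df = (spark.read.format("jdbc")
#         .options(**ignite_jdbc_options("127.0.0.1", 10800, "PERSON"))
#         .load())
```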

Re: Spark Data Frame support in Ignite

Jörn Franke
These are two different things. Spark applications themselves do not use JDBC; it is more for non-Spark applications to access Spark DataFrames.

Direct support by Ignite would make more sense. In theory you have IGFS if the user is using HDFS, but that might not be the case; it is now also very common to use object stores such as S3.
Direct support could be leveraged for interactive analysis or for different Spark applications sharing data.


Re: Spark Data Frame support in Ignite

dsetrakyan
Jorn, thanks for your feedback!

Can you explain how the direct support would be different from the JDBC
support?

Thanks,
D.


Re: Spark Data Frame support in Ignite

Jörn Franke
I think the JDBC approach is less efficient, slower, and requires too much development effort. You can also check the integration of Alluxio with Spark.
In general, I think JDBC was never designed for large data volumes. It is for executing queries and getting a small or aggregated result set back, or alternatively for inserting/updating single rows.


Re: Spark Data Frame support in Ignite

dsetrakyan
On Thu, Aug 3, 2017 at 8:45 AM, Jörn Franke <[hidden email]> wrote:

> I think the JDBC approach is less efficient, slower, and requires too much
> development effort. You can also check the integration of Alluxio with
> Spark.
>

As far as I know, Alluxio is a file system, so it cannot use JDBC. Ignite,
on the other hand, is an SQL system and works well with JDBC. As for the
development effort, we are dealing with SQL, so I am not sure why JDBC
would be harder.

Generally speaking, until Ignite provides native data frame integration,
having JDBC-based integration out of the box is an acceptable minimum.


> In general, I think JDBC was never designed for large data volumes. It is
> for executing queries and getting a small or aggregated result set back,
> or alternatively for inserting/updating single rows.
>

Agree in general. However, Ignite JDBC is designed to work with larger data
volumes and supports data pagination automatically.
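On the large-volume point, it is also worth noting that Spark's generic JDBC source can parallelize a read by range-partitioning on a numeric column, rather than paging through one connection. A hedged sketch of how those standard reader options (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) would be assembled; the values are purely illustrative:

```python
def partitioned_jdbc_options(base_options, column, lower, upper, num_partitions):
    """Extend a JDBC reader option map so Spark issues parallel range
    scans over `column` instead of a single full-table fetch."""
    opts = dict(base_options)
    opts.update({
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": "10000",  # rows per round trip; tune per driver
    })
    return opts

# Usage in a Spark application (illustrative):
#   df = spark.read.format("jdbc").options(
#       **partitioned_jdbc_options(base, "id", 0, 1_000_000, 8)).load()
```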



Re: Spark Data Frame support in Ignite

Jörn Franke
I think the development effort would still be higher. Everything would have to be put into Ignite via JDBC, checkpointing would have to be done via JDBC (again, additional development effort), and there would be a lot of conversion from Spark's internal format to JDBC and back to Ignite's internal format. I do not see pagination as a useful feature for managing large data volumes from databases; on the contrary, it is very inefficient (and one would have to implement logic to fetch all the pages). Pagination was never meant for fetching large data volumes, but for web pages showing a small result set over several pages, where the user can click manually for the next page (which most of the time they do not do anyway).

While it might be a quick solution, I think a deeper integration than JDBC would be more beneficial.


Re: Spark Data Frame support in Ignite

dsetrakyan

Jorn, I completely agree. However, we have not been able to find a
contributor for this feature. You sound like you have sufficient domain
expertise in Spark and Ignite. Would you be willing to help out?



Re: Spark Data Frame support in Ignite

Valentin Kulichenko
This JDBC integration is just a Spark data source, which means that Spark
will fetch the data into its local memory first, and only then apply filters,
aggregations, etc. This is obviously slow and doesn't use all the advantages
Ignite provides.

To create useful and valuable integration, we should create a custom
Strategy that will convert Spark's logical plan into a SQL query and
execute it directly on Ignite.

-Val
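To make the idea concrete: such a strategy essentially renders the logical plan's projections and predicates into one SQL statement that Ignite executes, instead of fetching rows and filtering inside Spark. A toy sketch of that translation step (plain Python for illustration, not the actual Catalyst Strategy API; all names are mine):

```python
def plan_to_sql(table, projections, filters):
    """Render a very simplified logical plan (a table scan plus
    projections and filter predicates) into a single SQL statement,
    so the database rather than Spark does the work."""
    cols = ", ".join(projections) if projections else "*"
    sql = f"SELECT {cols} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    return sql

# The whole filter/projection then runs inside Ignite:
# plan_to_sql("person", ["name"], ["age > 30"])
#   -> "SELECT name FROM person WHERE age > 30"
```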


Re: Spark Data Frame support in Ignite

Dmitriy Setrakyan-2

I get it, but we have been talking about Data Frame support for longer than
a year. I think we should advise our users to switch to JDBC until the
community gets someone to implement it.



Re: Spark Data Frame support in Ignite

Denis Magda-2
>> This JDBC integration is just a Spark data source, which means that Spark
>> will fetch data in its local memory first, and only then apply filters,
>> aggregations, etc.

It seems there is a backdoor exposed via standard SQL syntax. You can execute so-called “pushdown” queries [1] that Spark sends to the JDBC database as-is, and the result is wrapped into a DataFrame.

I could do this trick using Ignite as a JDBC-compliant data source, executing the query below over the data stored in the cluster:

SELECT p.name AS person, c.name AS city
FROM person p, city c WHERE p.city_id = c.id

There are some limitations, though, because the actual query issued by Spark will be:

SELECT * FROM (SELECT p.name AS person, c.name AS city
FROM person p, city c WHERE p.city_id = c.id) AS res

Here [2] is a complete example.


[1] https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#pushdown-query-to-database-engine
[2] https://github.com/dmagda/ignite-dataframes
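The trick boils down to passing the whole query, parenthesized and aliased, as the `dbtable` option, so that Ignite executes the join itself and Spark only wraps the result. A small helper sketch (the helper name and default alias are mine, not part of any API):

```python
def pushdown_dbtable(query, alias="res"):
    """Wrap a full SQL query so it can be passed as the JDBC "dbtable"
    option: Spark treats the parenthesized, aliased query as a table,
    and the database executes the inner query itself."""
    return f"({query}) AS {alias}"

join = ("SELECT p.name AS person, c.name AS city "
        "FROM person p, city c WHERE p.city_id = c.id")

# Usage in a Spark application (illustrative):
#   df = (spark.read.format("jdbc")
#         .option("url", "jdbc:ignite:thin://127.0.0.1:10800")
#         .option("dbtable", pushdown_dbtable(join))
#         .load())
```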


Denis


Re: Spark Data Frame support in Ignite

Valentin Kulichenko
Denis,

This only allows you to limit the dataset that is fetched from the database into Spark. That is useful, but it does not replace the custom Strategy integration: after you create the DataFrame, you will use its API to do additional filtering, mapping, aggregation, etc., and all of that will happen within Spark. With a custom Strategy, the whole processing will be done on the Ignite side.

-Val

On Thu, Aug 10, 2017 at 3:07 PM, Denis Magda <[hidden email]> wrote:

> >> This JDBC integration is just a Spark data source, which means that
> Spark
> >> will fetch data in its local memory first, and only then apply filters,
> >> aggregations, etc.
>
> Seems that there is a backdoor exposed via the standard SQL syntax. You
> can execute so-called “pushdown” queries [1] that are sent by Spark to the
> JDBC database right away, and the result is wrapped in a DataFrame.
>
> I could do this trick using Ignite as a JDBC-compliant data source,
> executing the query below over the data stored in the cluster:
>
> SELECT p.name AS person, c.name AS city
> FROM person p, city c WHERE p.city_id = c.id
>
> There are some limitations, though, because the actual query issued by
> Spark will be:
>
> SELECT * FROM (SELECT p.name AS person, c.name AS city
>                FROM person p, city c WHERE p.city_id = c.id) AS res
>
> Here [2] is a complete example.
>
>
> [1] https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#pushdown-query-to-database-engine
> [2] https://github.com/dmagda/ignite-dataframes
>
> —
> Denis
>
> > On Aug 4, 2017, at 3:41 PM, Dmitriy Setrakyan <[hidden email]> wrote:
> >
> > On Thu, Aug 3, 2017 at 9:04 PM, Valentin Kulichenko <
> > [hidden email]> wrote:
> >
> >> This JDBC integration is just a Spark data source, which means that
> Spark
> >> will fetch data in its local memory first, and only then apply filters,
> >> aggregations, etc. This is obviously slow and doesn't take advantage of
> >> all the benefits Ignite provides.
> >>
> >> To create useful and valuable integration, we should create a custom
> >> Strategy that will convert Spark's logical plan into a SQL query and
> >> execute it directly on Ignite.
> >>
> >
> > I get it, but we have been talking about Data Frame support for longer
> than
> > a year. I think we should advise our users to switch to JDBC until the
> > community gets someone to implement it.
> >
> >
> >>
> >> -Val
> >>
> >> On Thu, Aug 3, 2017 at 12:12 AM, Dmitriy Setrakyan <
> [hidden email]>
> >> wrote:
> >>
> >>> On Thu, Aug 3, 2017 at 9:04 AM, Jörn Franke <[hidden email]>
> >> wrote:
> >>>
> >>>> I think the development effort would still be higher. Everything would
> >>>> have to be put into Ignite via JDBC, then checkpointing would have to
> >>>> be done via JDBC (again, additional development effort), plus a lot of
> >>>> conversion from Spark's internal format to JDBC and back to Ignite's
> >>>> internal format. Pagination I do not see as a useful feature for
> >>>> managing large data volumes from databases - on the contrary, it is
> >>>> very inefficient (and one would have to implement logic to fetch all
> >>>> pages). Pagination was never intended for fetching large data volumes,
> >>>> but for web pages showing a small result set over several pages, where
> >>>> the user can click manually for the next page (which they mostly do
> >>>> not do anyway).
> >>>>
> >>>> While it might be a quick solution, I think a deeper integration than
> >>>> JDBC would be more beneficial.
> >>>>
> >>>
> >>> Jorn, I completely agree. However, we have not been able to find a
> >>> contributor for this feature. You sound like you have sufficient domain
> >>> expertise in Spark and Ignite. Would you be willing to help out?
> >>>
> >>>
> >>>>> On 3. Aug 2017, at 08:57, Dmitriy Setrakyan <[hidden email]>
> >>>> wrote:
> >>>>>
> >>>>>> On Thu, Aug 3, 2017 at 8:45 AM, Jörn Franke <[hidden email]>
> >>>> wrote:
> >>>>>>
> >>>>>> I think the JDBC one is more inefficient and slower, and requires
> >>>>>> too much development effort. You can also check the integration of
> >>>>>> Alluxio
> >> with
> >>>>>> Spark.
> >>>>>>
> >>>>>
> >>>>> As far as I know, Alluxio is a file system, so it cannot use JDBC.
> >>>> Ignite,
> >>>>> on the other hand, is an SQL system and works well with JDBC. As far
> >> as
> >>>> the
> >>>>> development effort, we are dealing with SQL, so I am not sure why
> >> JDBC
> >>>>> would be harder.
> >>>>>
> >>>>> Generally speaking, until Ignite provides native data frame
> >>> integration,
> >>>>> having JDBC-based integration out of the box is minimally acceptable.
> >>>>>
> >>>>>
> >>>>>> Then, in general I think JDBC was never designed for large data
> >>> volumes.
> >>>>>> It is for executing queries and getting a small or aggregated result
> >>> set
> >>>>>> back. Alternatively for inserting / updating single rows.
> >>>>>>
> >>>>>
> >>>>> Agree in general. However, Ignite JDBC is designed to work with
> >> larger
> >>>> data
> >>>>> volumes and supports data pagination automatically.
> >>>>>
> >>>>>
> >>>>>>> On 3. Aug 2017, at 08:17, Dmitriy Setrakyan <[hidden email]
> >>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Jorn, thanks for your feedback!
> >>>>>>>
> >>>>>>> Can you explain how the direct support would be different from the
> >>> JDBC
> >>>>>>> support?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> D.
> >>>>>>>
>
>
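Denis's pushdown observation earlier in the thread can be made concrete with a small helper that mimics how Spark's JDBC data source wraps a query passed via the `dbtable` option, following the shape his message describes (the `res` alias matches his example; the helper name is invented for illustration):

```python
def spark_jdbc_wrapped_query(pushdown_query: str, alias: str = "res") -> str:
    """Build the statement Spark's JDBC data source actually issues when a
    query (rather than a table name) is passed as the 'dbtable' option:
    the query becomes a parenthesised subselect with an alias, which is
    the source of the limitations mentioned in the thread.
    """
    return f"SELECT * FROM ({pushdown_query}) AS {alias}"

pushdown = ("SELECT p.name AS person, c.name AS city "
            "FROM person p, city c WHERE p.city_id = c.id")
print(spark_jdbc_wrapped_query(pushdown))
# -> SELECT * FROM (SELECT p.name AS person, c.name AS city FROM person p, city c WHERE p.city_id = c.id) AS res
```

Because the original query runs inside Ignite before the wrapper is applied, the join itself is executed on the cluster; only the outer `SELECT *` is planned by Spark.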