
how to read from HDFS multiple parquet files with spark.index.create .mode("overwrite").indexBy($"cellid").parquet #95

@silviuchiric

Description


(attached screenshot: parquet_issue)
I built the jar from master and successfully imported it in a Jupyter Notebook:

```
%AddJar file:/srv/home/srv-taurus-stage/sbd2-notebook/jupyter/parquet-index_2.11-0.4.1-SNAPSHOT.jar
```

and added the implicits import:

```scala
import com.github.lightcopy.implicits._
```

However, when I try to create the index by pointing at parquet files on HDFS (a path I verified exists in the cell above):

```scala
spark.index.create
  .mode("overwrite")
  .indexBy($"cellid")
  .parquet("hdfs://///data/taurus/stage/taurus.stage.counter-lte-eri-cell-raw-parquet/time=ingestion/bucket=hourly/date=2019-11-2*/*")
```

it fails with:

```
Message: File does not exist:
```

How does the `parquet` method know to read from HDFS? It looks like it does not like the `hdfs:////` prefix in the path.
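One possible cause is the `hdfs://///` prefix itself: with that many slashes the URI has an empty authority (no namenode host) and surplus leading slashes left in its path component. Below is a minimal sketch using plain `java.net.URI` (outside Spark) to show the difference; trying `hdfs:///data/...` or an explicit `hdfs://<namenode-host>:<port>/data/...` are untested suggestions about this cluster, not confirmed fixes.

```scala
import java.net.URI

// Five slashes: empty authority, and the extra slashes stay in the path.
val malformed = new URI("hdfs://///data/taurus/stage")
// Three slashes: empty authority, but a clean absolute path.
val canonical = new URI("hdfs:///data/taurus/stage")

// Neither form names a host, so Hadoop would fall back to fs.defaultFS;
// the malformed one additionally carries "///data/..." as its path,
// which can confuse FileSystem path resolution.
println(malformed.getHost)  // null
println(malformed.getPath)  // ///data/taurus/stage
println(canonical.getPath)  // /data/taurus/stage
```

If `fs.defaultFS` on the notebook host is not set to this cluster, passing the fully qualified form with the namenode host and port in the authority avoids relying on the client configuration at all.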
