ParamServerDriver job failure when task number set large #10

@tanglizhe1105

Description

Vocabulary: 25970
Docs: 1000
Tokens: 106776
Topics: 1000

The cluster has 20 servers, each with 8 CPU cores and 48 GB of memory.
With --psCount 20, LDA works well.
With --psCount 40, LDA also works well.
With --psCount 60, some tasks of the parameter server job fail.

The log is as follows:

java.lang.NegativeArraySizeException

Job aborted due to stage failure: Task 119 in stage 4.0 failed 4 times, most recent failure: Lost task 119.3 in stage 4.0 (TID 147, node-26): java.lang.NegativeArraySizeException
    at com.intel.distml.util.store.IntArrayStore.init(IntArrayStore.java:31)
    at com.intel.distml.util.DataStore.createStore(DataStore.java:56)
    at com.intel.distml.util.DataStore.createStores(DataStore.java:44)
    at com.intel.distml.platform.ParamServerDriver.paramServerTask(ParamServerDriver.scala:44)
    at com.intel.distml.platform.ParamServerDriver$$anonfun$run$3.apply(ParamServerDriver.scala:75)
    at com.intel.distml.platform.ParamServerDriver$$anonfun$run$3.apply(ParamServerDriver.scala:75)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
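
One possible explanation, offered only as a hypothesis since the DistML source is not shown in this report: if each parameter server sizes its array from a rounded-up rows-per-server step and clamps only the end of its range, the last server can receive a negative size. With a dimension of 1000 (both Docs and Topics are 1000 here), 20 and 40 divide it evenly, while 60 gives a step of 17 and a last start index of 1003. The class `PartitionSketch` and its methods `step`, `naiveSize`, and `safeSize` are hypothetical names for illustration, not DistML APIs.

```java
public class PartitionSketch {
    // Hypothetical: rows handled per server, rounded up.
    static int step(int rows, int servers) {
        return (rows + servers - 1) / servers;
    }

    // Assumed buggy variant: the end of the range is clamped to `rows`,
    // but the start is not, so the size can go negative.
    static int naiveSize(int rows, int servers, int index) {
        int step = step(rows, servers);
        int start = index * step;
        int end = Math.min(start + step, rows);
        return end - start;
    }

    // Guarded variant: clamp the start as well, so a server past the
    // end of the data gets an empty range instead of a negative one.
    static int safeSize(int rows, int servers, int index) {
        int step = step(rows, servers);
        int start = Math.min(index * step, rows);
        int end = Math.min(start + step, rows);
        return end - start;
    }

    public static void main(String[] args) {
        int rows = 1000; // matches Docs and Topics in this report
        for (int servers : new int[] {20, 40, 60}) {
            int last = servers - 1;
            System.out.println("psCount=" + servers
                + " naive=" + naiveSize(rows, servers, last)
                + " safe=" + safeSize(rows, servers, last));
        }
        try {
            // Allocating with a negative length raises the same
            // exception type seen in the stack trace.
            int[] store = new int[naiveSize(rows, 60, 59)];
            System.out.println("allocated " + store.length);
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException for psCount=60");
        }
    }
}
```

If this is what happens inside `IntArrayStore.init`, clamping the start index (or capping the server count at the number of rows) would avoid the failure. Checking the size passed at `IntArrayStore.java:31` would confirm or rule this out.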

Driver stacktrace:
