FlowFPX: Nimble tools for debugging floating-point exceptions

07/26/2023, 6:00 PM — 6:30 PM UTC
32-141

Abstract:

Reliable numerical computations are central to HPC and ML. We present FlowFPX, a Julia-based tool for tracking the onset and flow of IEEE Floating-Point exceptions that signal numerical defects. FlowFPX’s design exploits Julia’s operator overloading to trace exception flows and even inject exceptions to accelerate testing. We present intuitive visualizations of summarized exception flows including how they are generated, propagated and killed, thus helping with debugging and repair.

Description:

Imagine a busy scientist developing a numerical program P in Julia using a mix of code running on CPUs and offloaded GPU codes. Suppose the program runs to completion, but it has many NaNs ("not a number" in IEEE floating-point arithmetic) in the result.

Unable to proceed meaningfully, the scientist resorts to printing out values and decoding them to find the NaNs (a hugely time-consuming, hit-or-miss proposition). After weeks of work, they discover two sources of NaNs: one particular division operation within an inner Julia library function J, and a sqrt() inside a GPU library function G.

Probing further, the scientist discovers that function J is called along two call paths P1 and P2. Path P1 conducts the NaN generated by function J to the output result. Path P2, on the other hand, goes through a Julia less-than (<) function that "kills" the NaN by silently consuming it in the computation.
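The two behaviors described above follow directly from IEEE arithmetic and can be reproduced in a few lines of plain Julia. This is a minimal sketch: the function name `ratio` and the variables `p1` and `p2` are hypothetical stand-ins for function J and paths P1 and P2, not code from the talk.

```julia
# Hypothetical stand-in for library function J: a bare division.
ratio(a, b) = a / b

x = ratio(0.0, 0.0)        # 0.0 / 0.0 generates a NaN

# Path P1: ordinary arithmetic propagates the NaN to the result.
p1 = 2.0 * x + 1.0

# Path P2: a comparison consumes the NaN. Any `<` involving NaN is false,
# so the else-branch runs and the result is an ordinary number.
p2 = x < 1.0 ? x : 1.0

println(isnan(x), " ", isnan(p1), " ", isnan(p2))   # prints "true true false"
```

Nothing in `p2` reveals that a NaN was ever involved, which is why a killed NaN is so hard to find by inspecting outputs alone.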

The scientist decides to rewrite this function to conduct the NaN to the output (call it "failure manifestation" for debugging), but they are unsure if there is another call path P3 that also can be activated under a different input, and whether P3 might also kill the NaN sometimes.

Curious about why function G generates NaN, the scientist seeks the CUDA sources for it; unfortunately, they discover that G is supplied by NVIDIA in binary form with no documentation.

The scientist now hears about our new tool, FlowFPX, which has many attractive features:

  • (a) FlowFPX can run program P unmodified, and shows all the call paths through the code that reach functions J and G. Across the thousands of numerical iterations of program P, FlowFPX summarizes the paths that cause J and G to GENERATE NaNs (gen), the paths that PROPAGATE NaNs (prop), and the paths that KILL NaNs (kill). The result is a far more comprehensive report than manual printing, and it is generated automatically.

  • (b) FlowFPX even produces a nice graphical visualization of gen, prop, and kill.

  • (c) To find additional lurking paths such as P3, FlowFPX can induce stress by making any floating-point operator foo() artificially generate NaNs. This simulates an (as yet unseen) input that might have caused foo() to produce a NaN. Using this facility, the scientist discovers ways to failure-manifest further paths in program P.

  • (d) The scientist also discovers that FlowFPX comes with a companion tool called BinFPE that can examine binary-only GPU codes using NVIDIA-provided binary instrumentation. This way, if G generates a NaN internally but silently kills it inside the code, the scientist can do one of two things: (i) see if NVIDIA provides an alternative implementation of G that helps failure-manifest, or (ii) artificially generate a NaN at the return site of the G call even if G silently kills the NaN inside. This keeps program P more transparent and reliable, in that internal NaNs are not lost during test runs.
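The idea behind features (a) and (c) can be illustrated with a small operator-overloading sketch. Everything here is an assumption made for illustration: the wrapper type `Traced`, the `EVENTS` log, the `INJECT` flag, and the helper `record` are hypothetical names and do not reflect the actual FlowFPX or Sherlogs.jl interfaces, which also record call paths rather than only operator names.

```julia
# Hypothetical wrapper type; this is NOT the FlowFPX or Sherlogs.jl API.
struct Traced
    val::Float64
end

const EVENTS = Tuple{Symbol,Symbol}[]   # (event kind, operator name)
const INJECT = Ref(false)               # when true, the next division returns NaN

# Log :gen when a NaN appears from non-NaN inputs, :prop when a NaN input
# yields a NaN output; non-NaN results are not logged.
function record(op::Symbol, out::Float64, ins::Float64...)
    if isnan(out)
        kind = any(isnan, ins) ? :prop : :gen
        push!(EVENTS, (kind, op))
    end
    return Traced(out)
end

Base.:+(a::Traced, b::Traced) = record(:+, a.val + b.val, a.val, b.val)
Base.:*(a::Traced, b::Traced) = record(:*, a.val * b.val, a.val, b.val)

function Base.:/(a::Traced, b::Traced)
    if INJECT[]
        INJECT[] = false
        return record(:/, NaN, a.val, b.val)   # artificially injected NaN
    end
    return record(:/, a.val / b.val, a.val, b.val)
end

# A comparison returns a Bool, so a NaN operand disappears here: a kill.
function Base.:<(a::Traced, b::Traced)
    if isnan(a.val) || isnan(b.val)
        push!(EVENTS, (:kill, :<))
    end
    return a.val < b.val
end

# Generic user code runs unchanged on Traced values.
f(a, b) = (a / b) * Traced(2.0) + Traced(1.0)
pick(x) = x < Traced(0.0) ? Traced(-1.0) : Traced(1.0)

# Natural NaN: 0/0 generates, * and + propagate, < kills.
y = pick(f(Traced(0.0), Traced(0.0)))
natural = copy(EVENTS)

# Injected NaN: the inputs are harmless, but the division is forced to NaN.
empty!(EVENTS)
INJECT[] = true
z = f(Traced(1.0), Traced(2.0))
injected = copy(EVENTS)
```

In this sketch `natural` ends up holding one gen event for `/`, prop events for `*` and `+`, and a kill event for `<`, while `y` is an ordinary value. The injected run records the same gen and prop sequence even though the inputs 1.0 and 2.0 would never produce a NaN on their own, which is the kind of stress test that can expose a path such as P3.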

The scientist then reads the FlowFPX documentation and finds these important facts, which further make them a fan of the tool:

  • (a) NaN bugs are not rare. A recent bug report: https://forums.developer.nvidia.com/t/sqrt-of-positive-number-is-nan-with-newer-drivers/219078/15.

  • (b) BinFPE is backed by a publication: https://dl.acm.org/doi/abs/10.1145/3520313.3534655.

  • (c) FlowFPX is built around Sherlogs.jl (https://github.com/milankl/Sherlogs.jl), which has proven useful in examining Julia codes run on the Fugaku supercomputer.

  • (d) The study of floating-point exceptions is hugely important, and even for widely used libraries there is disagreement on how to handle them, as discussed by Demmel et al. in https://arxiv.org/pdf/2207.09281.pdf.
