Earlier this month, OpenAI introduced a new health-focused space within ChatGPT, pitching it as a safer way for users to ask questions about sensitive topics like medical data, illnesses, and fitness. One of the headline features highlighted at launch was ChatGPT Health's ability to analyze data from apps like Apple Health, MyFitnessPal, and Peloton to surface long-term trends and deliver personalized results. However, a new report suggests OpenAI may have overstated how effective the feature is at drawing reliable insights from that data.
According to early tests conducted by The Washington Post's Geoffrey A. Fowler, when ChatGPT Health was given access to a decade's worth of Apple Health data, the chatbot graded the reporter's cardiac health an F. However, after reviewing the assessment, a cardiologist called it "baseless" and said the reporter's actual risk of heart disease was extremely low.
Dr. Eric Topol from the Scripps Research Institute offered a blunt assessment of ChatGPT Health's capabilities, saying the tool is not ready to offer medical advice and relies too heavily on unreliable smartwatch metrics. ChatGPT's grade leaned heavily on Apple Watch estimates of VO2 max and heart rate variability, both of which have known limitations and can vary significantly between devices and software builds. Independent research has found that Apple Watch VO2 max estimates often run low, yet ChatGPT still treated them as clear indicators of poor health.
ChatGPT Health gave different grades for the same data
The problems did not stop there. When the reporter asked ChatGPT Health to repeat the same grading exercise, the score fluctuated between an F and a B across conversations, with the chatbot sometimes ignoring recent blood test reports it had access to and occasionally forgetting basic details like the reporter's age and gender. Anthropic's Claude for Healthcare, which also debuted earlier this month, showed similar inconsistencies, assigning grades that shifted between a C and a B minus.
Both OpenAI and Anthropic have stressed that their tools are not meant to replace doctors and only provide general context. Still, both chatbots delivered confident, highly personalized evaluations of cardiovascular health. This combination of authority and inconsistency could scare healthy users or falsely reassure unhealthy ones. While AI may eventually unlock valuable insights from long-term health data, early testing suggests that feeding years of fitness tracking data into these tools currently creates more confusion than clarity.